Crafting Data Solutions: Shrinking Pie and Leveraging Insights for Optimal Data Learning - ML 176

In today’s episode, Michael and Ben are joined by industry expert Barzan Mozafari, the CEO and co-founder at Keebo. He delves deep into the evolving landscape of data learning and cloud optimization. They explore how understanding data distribution can lead to early detection of anomalies and how optimizing data workflows can result in significant cost savings and unintended business growth. Barzan sheds light on leveraging existing cloud technologies and the role of automated tools in enhancing system interactions, while Ben talks about the intricacies of platform migration and tech debt.

Special Guest: Barzan Mozafari

Show Notes

They dig into the challenges and strategies for optimizing complex data pipelines, the economic pressures faced by data teams, and insights into innovation stemming from academic research. The conversation also covers the importance of maintaining customer trust without compromising data security and the iterative nature of both academic and industrial approaches to problem-solving. Join them as they navigate the intersection of technical debt, AI-driven optimization, and the dynamic collaboration between researchers and engineers, all aimed at driving continuous improvement and innovation in the world of data.
So, gear up for an episode packed with insights on shrinking pie data learning, cloud costs, automated optimization tools, and much more. Let’s dive right in!


Transcript

Michael Berk [00:00:05]:
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I do data engineering and machine learning and other stuff at Databricks. And I'm joined by my wonderful cohost, Ben Wilson.

Ben Wilson [00:00:17]:
I investigate server column length issues at Databricks.

Michael Berk [00:00:23]:
Today, we are speaking with Barzan. He studied computer science at both UCLA and MIT and then moved to the University of Michigan as a professor of computer science. He still teaches to this day, but recently founded a startup called Keebo, which is a fully automated cloud optimizer, and they're most famous for their Snowflake integrations. So Barzan, as Databricks employees, Ben and I understand how powerful it is to automate back end infrastructure, cluster provisioning, that type of thing. But I'm curious: as an academic, how did you enter this world? Were you using Snowflake and had pain points? Or what was the origin story?

 Barzan Mozafari [00:01:00]:
That's a good question. So I think it's probably easier to just start from, like, the word Keebo. Our first ideas were all about how we're gonna speed up queries. Right? "Kibo" in Japanese means hope. So the idea was, like, when you've tried everything else and all other hope is lost, what else can you do? There's actually an interesting intersection with the Databricks founders. We were actually working with some of the Databricks founders on an approximate query engine. The idea was because of Moore's law. So for those of you who are not familiar, Moore's law predicts how fast hardware's price is dropping or hardware speed is improving. And then we were seeing that data volumes were growing at a faster rate than Moore's law.

 Barzan Mozafari [00:01:47]:
Right? So it was pretty scary because, like, if you are a computer scientist or you know math, you know that when you have 2 exponential curves, once you fall behind, you're never gonna catch up. So if the rate of data growth has already surpassed Moore's law, it means if you're happy with your database performance this year, you're gonna be sad next year. And if you're sad this year, next year you're gonna be depressed. So the idea was like, okay, you know what? What is it that we're gonna do to close that gap? There's a lot that's been done in the computer science community and in the database industry, whether it's indexing, data compression, you know, parallelism, all of that stuff, and that's all great and you have to do all those things. But the thing is that all those optimizations are actually linear speedups. So if you compress your data by 10x, you're only getting a 10x speedup.
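The math Barzan is describing can be sketched in a few lines. This is a hypothetical illustration with made-up growth rates, not real benchmarks: if data grows exponentially faster than hardware improves, a one-time linear speedup only shifts the curve down by a constant factor and never closes the gap.

```python
# Illustrative sketch (hypothetical numbers): data doubling yearly vs.
# hardware improving ~1.4x yearly. A linear speedup (e.g. 10x from
# compression) divides the curve by a constant but cannot flatten it.

def effective_query_time(years, data_growth_rate=2.0, hw_growth_rate=1.4,
                         linear_speedup=1.0):
    """Relative query time after `years`: data volume grows exponentially,
    hardware catches part of that, and any linear speedup divides the rest."""
    return (data_growth_rate ** years) / (hw_growth_rate ** years) / linear_speedup

# Without optimization, relative query times balloon year over year:
baseline = [round(effective_query_time(y), 2) for y in range(6)]

# A 10x linear speedup helps immediately...
optimized = [round(effective_query_time(y, linear_speedup=10.0), 2)
             for y in range(6)]

# ...but the gap between the two curves is a constant 10x forever,
# while each curve itself still grows exponentially.
print(baseline)
print(optimized)
```

The point of the sketch is that the ratio between the optimized and unoptimized curves never changes, which is why he argues linear techniques alone can't win against an exponential.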

 Barzan Mozafari [00:02:39]:
None of these linear speedups is gonna basically help you eventually catch up or get ahead of that exponential curve. So that's where we started looking at statistical solutions to this problem. And then very quickly, we realized actually the problem is not just about Moore's law or speed. You know, the likes of Databricks or Snowflake and other players in the space have done an amazing job of lowering the adoption barrier to analyzing your data and getting insights. But what's happened is that because it's so much easier to, you know, get up and running and start analyzing data, now there are a lot more users and applications tapping into data, tapping into a lot more data, and basically combining a lot more data sources. So now the cost of this infrastructure is going through the roof. And it just wasn't humanly possible for anyone, and even to this day it's not possible, to squint your eyes and stare at, like, a million queries a day and say, you know what? I think here's how I'm gonna reduce the overall cost.

 Barzan Mozafari [00:03:42]:
Right? So that's where the story originated. We saw a real problem, and we're academics, and we thought about, like, how we're gonna create a solution. And I was always an outsider within academia, to be honest with you, because a lot of academics are just excited about coming up with theoretical solutions that are complicated and they can publish. But for me, it was less satisfying. It was more about how can we create a solution that also gets widespread adoption. So that's how Keebo started. We started actually creating this data learning platform where we train AI models on how users and applications interact with the data in the cloud. And then our agents, for those of you who know reinforcement learning, which is essentially a major step in LLMs, work on a very similar concept.

 Barzan Mozafari [00:04:28]:
We learn from how those interactions happen, and we start actually pulling different levers in real time and start optimizing. And, you know, we came up with this pricing model, which I'll talk about later if you guys are interested. We said, you know what? Whatever money we save the customer, we take a small percentage of that, so that the incentives are aligned. So that's how the whole story started.

Michael Berk [00:04:52]:
That's a super interesting sort of paradigm shift from founders we usually talk to, because it's usually born out of a frustration in a professional space. They're like, ah, this is too slow. This is too expensive. And instead, you guys came from a philosophical and, like, academically based, law-driven approach. I was just wondering if you have seen other startups be founded from those sets of principles, or if it's usually more of, I hate this thing, I'm gonna go fix it myself.

 Barzan Mozafari [00:05:21]:
It's a little bit of both, to be honest with you. I think it happens both ways. Sometimes, like, to your point, someone's been working in the travel industry for 20 years. Like, this thing is way too complicated. I'm gonna just solve this industry. Right? So I call them, like, founders on a mission. And then there are others who are like, hey, you know, I just wanna start a company.

 Barzan Mozafari [00:05:41]:
Here's a space I understand. Let me work on some cool ideas there. Right? In our case, our story was, like, we were talking to a lot of customers. Like, I wish I could tell you a fancy story of one morning I woke up and had this epiphany. But the truth is actually a lot of interesting solutions come the other way around. Like, you're basically looking at a really important problem, and you try to figure out what it's gonna take to bring a solution to the market. Like, you know, people tell you, oh, I have these 10 problems. And you can't just go and solve them and then hope that when they come back, they're gonna pay for it.

 Barzan Mozafari [00:06:14]:
Right? So you've gotta figure out what it is that drives them, like, what are the characteristics of that problem, or of a solution that will be acceptable to them? So the short answer is no, actually. But Databricks has a very similar story. Right? Your founders, who I personally know, were seeing that MapReduce was really slow. It didn't make any sense. They came up with this really cool idea of, hey, what if we kept the data that we're running iterative computation on in memory? And, you know, there you go. That's how Spark was born and got rapid adoption, and people went from there.

Michael Berk [00:06:49]:
Alright. Okay. Cool. That's a very interesting origin story. Now how much of the internals can you disclose?

 Barzan Mozafari [00:06:58]:
You know, enough. No, I mean, we have patents in this space. We're actually publishing peer-reviewed publications in this space. I won't be able to get into any gritty details of how we, you know, train those models and whatnot, but I can tell you, like, the high level workflow of how the whole system works end to end, what the design principles are, and whatnot. And we have, you know, a dozen different algorithms underneath. So even if I wanted to, I don't have enough time to get into the details of every single one of them.

Ben Wilson [00:07:28]:
Yeah. Having personally seen the source code for AQE in Spark, it would take several weeks, I think,

 Barzan Mozafari [00:07:36]:
to discuss some of that. Exactly.

Ben Wilson [00:07:41]:
So do you find that it's a generalizable solution to get, say, 80% of the way there based on the types of operations that different customers or users are doing? Like, do you see, okay, 80% of people are adhering to, you know, utilizing CTEs when querying data, structuring the data and the lazily evaluated instruction set that's submitted, and then you can say, okay, we can optimize that really well, and it works pretty darn well. But what do you do with the long tail of, like, somebody writing something where you almost look at it and you're like, are you intentionally trying to break this? What do the optimizers do with that?

 Barzan Mozafari [00:08:27]:
I think that's a really good question, actually. You know, all I've done in my entire career, pretty much, is teaching databases, building databases, and selling databases. Right? So, like, I know databases. But when you're saying someone's writing a really bad CTE, there are queries we see where I'm looking at that query and I'm like, I've spent all my career writing SQL queries, and I can't optimize this myself. Right? So our inside joke is, like, you know, we wanna be that infinitely competent, infinitely patient DBA. Right? But the short answer to your question is yes, actually.

 Barzan Mozafari [00:09:05]:
And the interesting part is we don't even see the customer's queries. And that was a very intentional decision we made from early on. A lot of people think that the hardest part about AI is the technology, which used to be true, like, a decade ago. But now I think we are, as a field, at the place where the technology is not the barrier in many cases. Sometimes it still is, but in many cases, it's not. It's the other barriers that are basically stopping us. Right? People worried about pain of implementation, the ROI, privacy slash security, maintenance, tuning, you know, hallucination, all of that stuff. So one of those decisions that we made intentionally, Ben, was driven by this: I was involved in another startup before Keebo, and I was seeing how difficult it is to convince people. I mean, think about it. Like, a cloud data warehouse is where you're keeping the most precious digital assets of an enterprise.

 Barzan Mozafari [00:10:01]:
Now you're a startup. You're going in and saying, hey, I have this really cool solution. I'm gonna slash your bill by 30%, 40%, 50%, which is a lot of money, as I'm sure you guys are aware. Cloud data warehousing is a very expensive solution. But, you know, they're not gonna trust you with the data. So one of the decisions we made was that it has to be a no brainer from a security perspective. And what it meant was that our models can only learn and train on performance telemetry metadata. So not only do we not store any customer data, we don't even see it, including the query text.

 Barzan Mozafari [00:10:32]:
We hash the query text. And the beauty of machine learning is that we don't have to make assumptions about what that workload is. Like, is this ETL? Is it BI? Is it reporting? Is it ad hoc? Is it data science? Is it machine learning? It's just a bunch of numbers. Machine learning looks at this and says, hey, whenever I see this kind of pattern, I see this kind of behavior, I see this kind of cost. And like any clever human, the agents learn. They pull a lever.
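The telemetry-only featurization Barzan describes might look something like the sketch below. The field names are hypothetical, not Keebo's actual schema; the point is that the raw SQL is reduced to an opaque hash for grouping recurring queries, and everything else is numeric performance metadata:

```python
import hashlib

def telemetry_features(query_text, runtime_ms, bytes_scanned, warehouse_size):
    """Build a feature record without retaining the query text itself.
    The SQL is reduced to a stable hash (so recurring queries can be
    grouped), and the remaining fields are performance metadata only."""
    query_id = hashlib.sha256(query_text.encode("utf-8")).hexdigest()[:16]
    return {
        "query_id": query_id,          # opaque: the original text is not recoverable
        "runtime_ms": runtime_ms,
        "bytes_scanned": bytes_scanned,
        "warehouse_size": warehouse_size,
    }

rec = telemetry_features("SELECT * FROM sales WHERE region = 'EU'",
                         runtime_ms=412, bytes_scanned=7_340_032,
                         warehouse_size="M")
# No query text appears anywhere in the record:
assert "SELECT" not in str(rec)
```

The same query always maps to the same `query_id`, so a model can recognize "this workload again" without ever seeing what the workload actually queries.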

 Barzan Mozafari [00:11:04]:
Right? And if they basically manage to save the customer money without causing a slowdown, the agent gets rewarded and learns from that. And whenever it does something that doesn't lead to cost saving, it gets penalized and learns from that. So the answer to your question is, surprisingly, yes. We have not seen to this day a single customer for which we've not been able to save some money. But what's that percentage? It depends on a bunch of factors: how under provisioned they are, how optimized their workload is in the first place, how open they are to, you know, letting the models get more aggressive. We have a slider that tells the agent whether it needs to be conservative or aggressive.

 Barzan Mozafari [00:11:48]:
Whether they save 80% or 20% just varies from one customer to another, but it does actually generalize pretty well.

Michael Berk [00:11:57]:
Cool. I have a really saucy question.

 Barzan Mozafari [00:12:00]:
Go for it.

Michael Berk [00:12:00]:
Bouncing around to a slightly different topic. So Databricks has been working on this thing called Predictive I/O, and it seems similar to what you guys do. And all these cloud vendors

 Barzan Mozafari [00:12:14]:
Mhmm.

Michael Berk [00:12:14]:
Have a bunch of data and a bunch of resources to build something similar.

 Barzan Mozafari [00:12:19]:
Mhmm.

Michael Berk [00:12:19]:
How do you guys differentiate, and how do you guys avoid being surpassed by a cloud specific solution like Predictive I/O?

 Barzan Mozafari [00:12:28]:
Mhmm. No, that's a very good question. So, look, we're, like, super laser focused on just being a data learning platform. We're not trying to replace Snowflake. We don't go to a Snowflake customer and say, you know what, you should go to Databricks, and we don't go to your customers and say you need to migrate to Snowflake if you want x y z.

 Barzan Mozafari [00:12:49]:
Exactly. We basically tell people: whatever data stack you have already invested in, that's great. Keep that. With Keebo, it's orders of magnitude faster and significantly cheaper than that same thing you're already using without Keebo. So one of our other design principles has been that we should not require any data migration, any infrastructure migration. So whenever the cloud provider or the cloud data warehouse has certain functionality, we actually leverage that. Snowflake, for example, also has a bunch of really clever internal mechanisms.

 Barzan Mozafari [00:13:26]:
What we do is we never reinvent the wheel. So with Predictive I/O, we'll actually try to leverage that to some extent. And it's not just the optimization. We actually provide FinOps. We have a new technology on the same platform called smart query routing. Right? So you could potentially use your own predictive AI to figure out where to route those queries. And we use that to decouple the customer's application logic from the application performance. So the user, the customer, can just focus on their use case without worrying about cost, without worrying about performance, and just decouple those decisions.

 Barzan Mozafari [00:14:04]:
So you just send those queries to the smart query router, and it will decide, hey, maybe this one needs a small, this one needs a large, this one needs a medium, and so on. So the short answer to your question is we don't reinvent the wheel. We're not trying to replace the technology underneath. We take advantage of whatever primitives and functionality are in there, whether it's for better insights, better recommendations, or better actions.
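A smart query router of the kind described above could be sketched as follows. This is a hypothetical stand-in, with made-up thresholds and a trivial "model", not Keebo's implementation: the application just submits queries, and the router maps a prediction of resource needs onto a warehouse size:

```python
# Hypothetical sketch: the router owns the cost/performance decision,
# so application code never has to name a warehouse size.

def predict_bytes_scanned(scan_history):
    """Stand-in for a learned model: here, just the recent average
    of bytes scanned by this query's past runs."""
    return sum(scan_history) / len(scan_history)

def route(scan_history):
    """Map predicted scan volume to a warehouse size (illustrative cutoffs)."""
    gb = predict_bytes_scanned(scan_history) / 1e9
    if gb < 1:
        return "S"
    if gb < 50:
        return "M"
    return "L"

# Light scans go to a small warehouse, heavy scans to a large one:
assert route([2e8, 4e8]) == "S"
assert route([2e10, 3e10]) == "M"
assert route([9e10, 2e11]) == "L"
```

The design point is the decoupling: swapping the routing policy (or the prediction model) changes cost and performance without touching any application logic.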

Michael Berk [00:14:30]:
But it seems like a sort of shrinking pie. For instance, Databricks has invested heavily in serverless, and they don't really expose knobs, so there's less that can be tuned. So what are the sort of sticking points where you anticipate there will be optimizations for the next 5 to 10 years?

 Barzan Mozafari [00:14:47]:
Well, that's a very good point. Like, look. If you just go back and look at, for example, BigQuery, another player in this space, right? You can use spot instances, or you can go completely, like, here's a flat rate, or you can say, I'm just gonna send you the query, you figure out where you're gonna run it and whatnot. There's always a cost performance tradeoff. Right? When you're going serverless, someone else is making that decision. You're hiding the knobs and you're automating those knobs.

 Barzan Mozafari [00:15:17]:
Right? So I think the idea of a shrinking pie for knobs is a valid question. But data learning is not just about knobs. Because at the end of the day, I'm sitting next to the customer's most valuable digital asset. I am understanding the data distribution. And that's just the first app, the one that doesn't see the data. We have additional apps that, once the customer is on the platform, actually do see the data. They see the query text. They see all of that stuff. If I'm sitting next to your cloud data warehouse, I actually understand your data distribution more intimately than any single individual in your organization.

 Barzan Mozafari [00:15:57]:
So when something's out of the ordinary when it comes to data quality, I'm actually the one, me meaning the agent, right, that finds out first. Hey, listen, this column never had null values. Suddenly, you have a lot of null values. We were working with a pretty large customer, and it turned out that one of the really important columns had been null for several months and no one really noticed. Right? So we understand those drastic changes in the data distribution. We can actually see certain KPIs. So, like, you know, warehouse optimization, which is what you're referring to, is just one use case.
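The null-value anomaly Barzan describes boils down to a drift check on a column statistic. A minimal sketch, with a hypothetical threshold and function name, not Keebo's actual detector:

```python
def null_rate_alert(historical_null_rate, current_null_rate, threshold=0.05):
    """Flag a column whose null rate jumps well above its historical norm,
    e.g. a column that 'never had null values' suddenly full of them."""
    return current_null_rate - historical_null_rate > threshold

# A column that was never null is suddenly 40% null: the alert fires.
assert null_rate_alert(0.0, 0.40) is True

# Ordinary noise stays quiet.
assert null_rate_alert(0.0, 0.01) is False
```

A production version would track many statistics per column (null rate, cardinality, value ranges) over rolling windows, but the principle is the same: whoever sits closest to the warehouse sees the distribution shift first.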

 Barzan Mozafari [00:16:29]:
But even that use case, I don't think is gonna go away, because when you create something serverless, people create more. That just moves the bar a little bit. Like, people are not worried about what server size they're gonna be using, but they're gonna stitch it with 5 other data tools and build a more complex pipeline. And now the question is, how do you optimize that pipeline? One of the major drivers of cost in the cloud these days is dbt models. Right? Like, you know, you've created, like, a hundred, five hundred dbt models. Good luck optimizing that. You could be serverless.

 Barzan Mozafari [00:17:05]:
Like, you can remove all the knobs you want. At the end of the day, the customer has to pay x dollars. What can you do to reduce that cost? Sometimes the solution is changing a knob. Sometimes the solution is changing the query. Sometimes the solution is changing the way that you're actually querying your data. But there's a lot more: warehouse optimization, workflow intelligence, smart query routing, data quality alerts. There are a lot of different ways you can leverage the understanding that you have of the customer's usage behavior and expose it to them at different parts of that data stack.

Ben Wilson [00:17:43]:
Alright. Yeah. I couldn't agree more with your perception of that. Many, many years ago, back when I was doing data science and data engineering work, several times at companies I worked for, or at customers when I was in the field at Databricks, you always hit that point where you've migrated to a new platform, people start using it, and the bad processes have propagated to the new platform. And you open it up, you're like, alright, it's GA. Everybody can use it. And they issue their first query.

Ben Wilson [00:18:22]:
You're like, man, this is slow. It's faster than it was on our old platform, but it's still slow.

 Barzan Mozafari [00:18:28]:
Yeah.

Ben Wilson [00:18:28]:
And then you're like, alright, we need to take an entire quarter or 2 and redo the data model properly and get rid of all that old tech debt. And every time that I've been a part of a team that's done that, you open up a whole different problem right after you get all that fixed. It's like, okay, the query that used to take 4 hours to run now executes in 10 seconds, because we actually put the data where it should be and optimized it, put indexes on stuff, everything.

 Barzan Mozafari [00:18:55]:
Exactly.

Ben Wilson [00:18:57]:
But you still hit that finite resource limit that's placed on any business, which is the CTO gives you a budget. There's only so much money you can spend on this stuff. And when you fix all those problems, people just start issuing more queries.

 Barzan Mozafari [00:19:12]:
Exactly. They do more.

Ben Wilson [00:19:13]:
More and more.

 Barzan Mozafari [00:19:14]:
Exactly true.

Ben Wilson [00:19:15]:
So then you have to, like, okay, the queries aren't optimized, how do we... And what you're tackling is the thing that, every time I've tried to do it or been part of an organization that's tried to do it, is the hardest thing to fix, which is how do you teach people to interact with an optimized system properly. And no matter how much effort you put into it, you're never gonna be as good as an automated service that can do that.

 Barzan Mozafari [00:19:40]:
That's 100% true. And that's actually one of the common questions people sometimes ask us: aren't you afraid that, like, the Snowflakes of the world feel like you're reducing their revenue? And I'm like, no. Actually, we're just their unpaid customer success department, because we're just letting those customers get more work done with less money. So at the end of the day, to your point, when you optimize their workload, they end up actually doing more. It's not that they go back to the CTO. Sometimes they do.

 Barzan Mozafari [00:20:10]:
But more often than not, you know, the bar just shifts somewhere else. Like, now they're gonna send more queries. They're gonna stitch it with 5 other data sources. And that story seems to be, you know, repeating everywhere.

Ben Wilson [00:20:23]:
Yeah. That's what we see with Databricks customers all the time. It's like they start off with ETL. They get all their data in their warehouse, and then they move on to BI. And all the queries suck because of tech debt, and they fix all that. And then it unlocks the ML side. They'll hire a data science team.

 Barzan Mozafari [00:20:42]:
Hopefully,

Ben Wilson [00:20:43]:
somebody knows what they're doing on that team. And then they'll get some stuff into, you know, maybe staging, and validate it, and eventually to production. And you look at the account usage over time, and you're like, hang on a second. Yeah, they've increased 10x as our customers, but they weren't doing any of this stuff before. And if they're a publicly traded company, you can kinda look at them and be like, jeez, over the last 4 years they've doubled in revenue. Is that because of us? You know, you kinda wanna take credit a little bit for that.

Ben Wilson [00:21:13]:
Like, maybe that was 2% us. And sometimes they'll say that. Like, yeah, this unlocked our business insights since we can now compete against our competitors.

 Barzan Mozafari [00:21:22]:
No, that's so true, actually. You know, one of the, I wouldn't say saddest, but at least most interesting things that we're seeing is that there are still a lot of data teams that are constantly spinning their wheels trying to reinvent the wheel. Because, like, they have a finite budget, right? And I understand that. They're trying to manually patch things up. Like, hey, you know, I'm gonna do x y z. I'm gonna reduce the cost.

 Barzan Mozafari [00:21:52]:
Like, just imagine how much value would be unlocked if you actually shifted those resources into growing your business, to your point. Right? All those smart engineers. Like, for example, you're a major game development company, and your data team is just spinning their wheels trying to figure out how to leverage Snowflake more efficiently or how to optimize the Databricks workload. Whereas, as a game development company, they should be focused on bringing out the next and better version of that game and increasing the top line, instead of being so cost focused, which is very common these days because of the economy, obviously. But you're right. Like, I think when you free up the resources from being consumed by all this cost saving, and things that are not the core business of that customer, you go back and look at it and say, I'm glad that those engineers are actually focused on growing their business, rather than on how do we pay a little less for this particular tool that was supposed to free up our time, instead of turning us into, you know, optimizers for this other tool.

Michael Berk [00:23:11]:
So another question, back to sort of your origins from academia. What are some of the skills and concepts that have been essential in founding Keebo? Like, what are the things that you learned in academia that translate really well to being a startup founder, specifically in such a technical space?

 Barzan Mozafari [00:23:31]:
That's a good question. I don't know how much of this would generalize for every startup, but I can talk about, like, the kind of startups that look like Keebo. You know, I sometimes jokingly say that, listen, the reason why Keebo has been successful, like, we've been growing pretty quickly, is not because we have really charming sales reps. The reality is, like, our product team has built a product that's outsmarting the other solutions out there. Right? I usually give the example of key value stores. There was an era where every other week there'd be a new key value store out there. Right? At some point, they ran out of names for these companies.

 Barzan Mozafari [00:24:17]:
Right? Because it was very easy to build a new key value store. And the nice thing about it is, like, there are 100 plus different key value stores, so you never get stuck on anything. You don't know how to implement x y z? There are 99 other products you can look at. But whenever you're trying to create something for the first time, and it is truly innovative, like, not just a better user interface, not just a slightly more optimized version of what everyone else has been doing for the past 20 years, that requires research skills. Right? So one of the nice things about academia is... you know, in the industry, right, if you wanna pitch an idea to your boss or to a company, they think about risk.

 Barzan Mozafari [00:24:55]:
So oftentimes, they try to not put all those apples in one basket. They say, you know what? That's high reward, high risk, which is usually shorthand for, we're not gonna do it. Right? And you hear this a lot in the industry. But in academia, I think it's the opposite. You get rewarded for taking on hairy, big problems and considering solutions that no one else is dealing with, because even when you fail, you learn from it and you go do something else. That's what academia is made for. Right? Like, for people to go and freely innovate and push the boundaries and things of that sort.

 Barzan Mozafari [00:25:31]:
So I think research skills, which doesn't mean you need to have a PhD, but, like, the ability to take on an open ended problem, think outside the box, come up with a solution that maybe no one else has thought about, and kind of execute on it. I think that's definitely one area. And the other thing is this whole idea about failing fast. Right? We keep talking about failing fast, but that's pretty much what happens in academia. Right? Like, the sales cycle in the industry, right, if you're thinking about, for example, B2B software: usually an MVP takes at least 2 quarters. After that, you work with a beta customer; that's another quarter. And that's when we're talking about a really fast, like, product-to-market kind of cycle.

 Barzan Mozafari [00:26:15]:
And then, you know, you have to train the sales team, and you start selling to some early customers, getting traction and whatnot. In academia, you live life writing one paper at a time. So if you have an idea, you submit it to a conference as soon as you have some compelling results. You write up a paper. Your code could be complete crap, but you have a proof of concept. You write up a paper. You run a bunch of experiments to see if it works or not.

 Barzan Mozafari [00:26:40]:
You don't have to go hire salespeople. You don't have to go spend millions of dollars on marketing. You just basically go out there, and that paper and that idea gets peer reviewed. And if it's a bad idea, you'll find out quickly; in most conferences in computer science, you hear back within two or three months. So you have to fail fast. This idea of being scrappy, and making sure that you see some results before you invest too much into it. I think those two things became part of our company DNA and really did help us out a lot at Keebo.

Ben Wilson [00:27:17]:
Must have been something in that lab that you were in, because that exact approach has actually carried over into Databricks R&D.

 Barzan Mozafari [00:27:25]:
It is pretty common. This is just

Ben Wilson [00:27:29]:
But I've heard from other people that have come from FAANG companies into Databricks, and their remarks are like, I can't believe we're allowed to do a spike. And, like, yeah, we have to do design docs and stuff, but we get time to do a prototype. And sometimes somebody will give us, like, hey, go see if you can figure this out. Take you five people from all these different teams, just take six weeks and play jazz. Figure out what you can come up with.

Ben Wilson [00:27:57]:
And sometimes it's a failure, like an abject failure. We'll even release it to private preview, get, like, 20 customers trying it out, and the response is like, we don't know about this. And then four months later, we have version 2.0 that's in public preview, and people are like, this is amazing. Where was this all my life? But, yeah, that iterative process of just failing, like, failing really hard sometimes, is critical to how we release products the way that we do. But a lot of companies in the tech space just don't do it.

 Barzan Mozafari [00:28:31]:
They don't do it. No, you're spot on. And I think that's also a little bit about getting people with a research mindset. Because, like, I've been writing code from an early age. Right? I was a programmer before I was a researcher. But if I had to confess, researchers usually don't write the best quality code. Right? Sometimes we write crappy code because we're just trying to prove the concept and move on.

 Barzan Mozafari [00:28:54]:
And that drives solid, experienced engineers crazy sometimes. Right? But I think if you can create an environment where researchers can bring their research skills, solid architects can bring their expertise, and can help transition those ideas, once they're vetted or tried out, into a product that scales, something that's robust and production quality, I think a lot of amazing things happen. Researchers on their own can rarely create something that actually works at scale. But if you can pair them with engineering teams that are solid and can take those ideas and transition them... obviously, that means both camps have to get out of their comfort zone a little more. Right? So I think when you have an environment that's conducive to that kind of collaboration, just amazing things happen, to your point.

 Barzan Mozafari [00:29:48]:
But

Ben Wilson [00:29:51]:
There have been some comfort zone transitions, but it's exciting. Everybody gets so enthused about it on both sides. The researchers we've hired, a lot of them have, like, 10-plus years post-grad. They've been doing research at Berkeley or Stanford or MIT or something. And they come in and they're like, whoa, this code is complex. And the engineers are like, oh, what are you working on? I wanna see it.

Ben Wilson [00:30:23]:
And I think there's a brief moment of panic on both sides, but then everybody's like, hey, let's work together, let's team up and make this awesome. And you just see everybody grow together, because you're expanding the minds of engineers to see what is theoretically possible, and it unlocks a lot more creativity on their side. And then the researchers, eventually they're writing production-grade code within a year or so. So it's a win-win all around.

 Barzan Mozafari [00:30:54]:
No, spot on. Exactly. It's exactly like how you describe it, man.

Michael Berk [00:31:01]:
Which process do you both like more? Do you like research spikes or more engineering focused work?

 Barzan Mozafari [00:31:06]:
I think we do both, but, you know, I think it depends on what you're trying to do. Right? I think...

Michael Berk [00:31:12]:
But personally, like, which do you enjoy more?

 Barzan Mozafari [00:31:14]:
Oh, I definitely enjoy research spikes. Like I said, you could never pay me enough to go and create another key-value store. I'm just the kind of person who's like, life is too short. If I wanna do something, I wanna be the first person doing it. Right? So research spikes usually have that kind of flavor where, like, hey, this is an idea. I might come back and say, guys, that's not promising. You know? That's not gonna work.

 Barzan Mozafari [00:31:40]:
But when you do come back and you come up with a new solution that no one else has thought about, and it actually works, you get a big spike of dopamine. And

Ben Wilson [00:31:53]:
and

 Barzan Mozafari [00:31:53]:
that makes it all worth it, at least personally for me.

Ben Wilson [00:31:59]:
I enjoy three distinct points in that development process. The first one is, I love seeing all of my dumb ideas fail in the beginning, because it shortens the path to getting something that might work. Mhmm. And it's also kinda fun. I think I was telling you, Michael, the other day, I was doing something late at night, getting some CI set up in a package that I'm working on, and I wrote some really terrible code, because it was, like, 12:30 in the morning. Pushed it to GitHub Actions, and then I crashed the runners, like, killed them all. Basically, effectively, like a stack overflow.

Ben Wilson [00:32:46]:
And I just looked at it. I was like, I'm going to bed. But I kinda chuckled to myself. Like, it's been a while since I've broken something like that. Then the next morning, I look at what I actually submitted. I'm like, yeah.

 Barzan Mozafari [00:32:59]:
Don't code when you're that tired, dude.

Ben Wilson [00:33:02]:
I fixed it, and then it passed. I'm like, alright, sweet. But I also love the transition from when the proof of concept works and buy-in has been signed off, like it's been effectively peer reviewed amongst peers at the company, to banging out that first production-grade version of it. I love that experience of, like, okay, I know how bad my code was. How do I make this actually usable and extensible and maintainable? And how do I kill all of the complexity that I had to build into the script that I wrote? That's very enjoyable.

Ben Wilson [00:33:38]:
And then finally, the release. Not the response, I don't really care about that. I actually look for the people who use it and then tell me why it's broken, because I love fixing the bugs in the first few iterations. I love that experience. It's like, I know this code, because I wrote this crap. And, like, yeah, I'm gonna totally fix that.

Ben Wilson [00:34:01]:
That's that's my dopamine hit.

 Barzan Mozafari [00:34:03]:
No, I love it. I also like how you laid out the three stages, like you're seeing the true value of each of those three stages and liking each for what it is. Like, hey, I would not get from A to C if I didn't have point B in the middle. That makes a lot of sense.

Michael Berk [00:34:22]:
Yeah. I think my response is, I really like the research aspect, but it's sort of a product of my job, because I don't have the opportunity to build really complex, extensible frameworks that have, like, cool designs. I'm writing a thousand lines of code, maybe two thousand, for a typical project. And the really fun thing is trying the art of the possible and seeing, like, can we make this work? What creative ideas for attacking a problem from a different direction can I employ to make it successful? So, yeah, they both have their pros and cons. And it's interesting, Barzan, that your angle is research, because I feel like computer science is very fundamentally implementation- and optimization-focused. Would you agree, or do you think there's a lot

 Barzan Mozafari [00:35:09]:
of innovation? There's a lot of what?

Michael Berk [00:35:12]:
Or would or do you think there's a lot of sort of innovation and, like, groundbreaking, like, far out there ideas?

 Barzan Mozafari [00:35:19]:
It actually depends on what discipline you're looking at. Right? So I might get into trouble for saying this, but, for example, if you just look at databases as a field, which is my own field, so I feel like I'm allowed to say things like this, I think the field has kind of plateaued. You go to AWS re:Invent, you go to these places where they talk about innovation, and you're looking at this and saying, like, that's really cool that this product now has X. Actually, Oracle had that, like, 30 years ago. Right? Like, hey, I'm so glad that you guys do auto-indexing now, but that happened years ago.

 Barzan Mozafari [00:35:54]:
Or, like, you have this storage-optimized thing here that uses compression. Well, you know what? Vertica had that, like, 20 years ago. So the field has plateaued. It doesn't mean there's no innovation. But if you're just trying to build another database, a lot of it has been tried. And I'm not saying there will never be any innovation. I'm just saying the number of new ideas that are radically new and actually effective is shrinking; we're running out of those ideas. The field has matured, which is a good thing.

 Barzan Mozafari [00:36:24]:
Right? That means we can go and build the next set of AI-enabled, AI-enabling applications on top of what we've learned. We wrote a paper a few years ago, with one of my former PhD students, about database learning. The idea was, okay, let's say you do have a database that's optimized. The example I give is about cars. Right? Let's say that you're like me and you don't know anything about cars. If I ask you a question about this particular model of Ferrari, you're gonna go online and look it up and give me the answer. If I keep asking you questions about cars, you're gonna keep googling it. But after two, three days, you're gonna pick up a few things.

 Barzan Mozafari [00:37:06]:
You're gonna learn. It's gonna take you less and less time to come up with an answer to car-related questions. Right? Because we're humans. We learn. But databases don't learn. You know? Aside from very basic things like, hey, I cached this data, I cached that result before the data changed.

 Barzan Mozafari [00:37:21]:
Every time you submit a query, it does a bunch of work and sends you the results back. For the most part, that work is lost afterwards. You go back and it starts over; the databases don't learn. So the vision that we presented, and then actually built a proof of concept on, was: how can we build a database that actually learns over time and becomes smarter every time you query it? Because think about it. If I ask you, hey, what's the average sales for this particular region per department? And then tomorrow there's another question that kind of overlaps. Maybe I say, hey, what's the total number of transactions per region for the entire country? The fact that I know something about that region should help me come up with an answer to the second question a little bit faster.
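The overlapping-query idea described here can be sketched in a few lines: remember the per-region work done for one query and reuse it for a later, overlapping one. This is a toy illustration only, not code from the actual database learning paper; all names and data are made up.

```python
# Toy sketch of "database learning": keep the partial aggregates computed
# for past queries and reuse them when an overlapping query arrives.
rows = [
    {"region": "west", "dept": "toys", "sales": 100},
    {"region": "west", "dept": "food", "sales": 200},
    {"region": "east", "dept": "toys", "sales": 150},
]

learned = {}  # per-region aggregates the "database" has picked up so far

def sales_by_region(region):
    """Answer a per-region sales total, learning the result for later reuse."""
    if region in learned:              # seen this region before: no table scan
        return learned[region]
    total = sum(r["sales"] for r in rows if r["region"] == region)
    learned[region] = total            # remember the work instead of discarding it
    return total

def total_sales():
    """A later, overlapping query: regions answered before cost nothing."""
    regions = {r["region"] for r in rows}
    return sum(sales_by_region(reg) for reg in regions)
```

A real system would of course invalidate or refresh these learned aggregates as the underlying data changes; the sketch only shows the reuse idea.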

 Barzan Mozafari [00:38:09]:
Right? So I think there is still innovation, but it's very field specific. Certain sub-disciplines within computer science are well researched. People either have moved on or they need to move on. There are people who still haven't moved on, and they keep iterating over similar ideas: hey, actually, I found this corner case where I can make the indexing, like, 5% more efficient. But there's a lot of interesting things, especially in the time we're living in, with LLMs, with machine learning, with hardware acceleration, where we can still come up with pretty cool ideas. The human mind doesn't run out of cool ideas.

 Barzan Mozafari [00:38:46]:
That's a nice thing. It's just like you know, maybe you fix something, you go and create a new discipline.

Michael Berk [00:38:53]:
Curious for both of your opinions. What are the frontiers that you're excited about, or the new pieces of technology in the database and data-querying space, that you think are gonna be game changing?

 Barzan Mozafari [00:39:06]:
Ben, why don't you start?

Ben Wilson [00:39:09]:
I think for data querying, the ability to map to an entire data warehouse, or the entire set of RDBMS implementations that exist in an organization, and for you to be able to talk to an agent, ask a very complex question, and get the accurate response from all of that without you having to build all of the interfaces to it. Because today, you can theoretically do that. Right? You can create a bunch of tools that issue all of these different queries to all of these different platforms, or you can basically fine-tune the model on the metadata of your tables and your databases. And I don't think that anybody's gotten that to, you know, a high-90s-percent accuracy response rate. Like, we offer something called Genie at Databricks, and that's a language model interface for querying Unity Catalog tables. And in demos, it's incredible, like, amazing. I played around with it.

Ben Wilson [00:40:29]:
I'm doing integrations with it. I'm like, man, this is so cool. The fact that I can point it at a 100-column table with a million rows, and I can ask it plain-language questions, and it figures it out. And I can do this with five different tables, and it'll generate those queries for me, and it's pretty performant because of that optimized engine in the background. But then I'd point it at our internal tables, data that I had written two years ago in Unity Catalog, during the demo days of that. And I'd use the same query, and it loses its mind. And then I'm looking at it, like, why does it work so well on these tables that I created last month, while on my old data it's just not good?

Ben Wilson [00:41:18]:
And then I just go into the UI and I'm like, oh, yeah, there's no metadata here. There are no comments anywhere explaining what this table is, what's in it, or the conditions for the ETL that is actually putting the data in. And the column names are almost intentionally obfuscated, because I was just doing shorthand nonsense, and I have no column comments anywhere about what each column contains. So it's making guesses. It's inferring from what metadata it actually has. And I'm just like, okay, there's gotta be a better way to do this.
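Ben's point can be made concrete with a small sketch: a text-to-SQL agent only sees what the catalog describes, so an uncommented, shorthand schema leaves it guessing. The table and column names below are invented, and `schema_context` is a hypothetical helper, not a Databricks API.

```python
# Hypothetical sketch: render catalog metadata into the plain-text context
# an agent would see before generating SQL.
def schema_context(table, columns):
    """Build a schema description from a {column: comment} mapping."""
    lines = [f"Table {table}:"]
    for name, comment in columns.items():
        # Missing comments surface as gaps the model has to guess around.
        lines.append(f"  {name} -- {comment or 'NO DESCRIPTION'}")
    return "\n".join(lines)

# Obfuscated shorthand columns with no comments: the model is left guessing.
bad = schema_context("tx_agg_v2", {"c1": None, "amt_x": None})

# The same table with real comments: a question like "sales per region"
# now has something to ground on.
good = schema_context("tx_agg_v2", {
    "c1": "two-letter sales region code",
    "amt_x": "net sale amount in USD after refunds",
})
```

The contrast between `bad` and `good` is the whole point: the query the agent writes can only be as grounded as the metadata it is given.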

Ben Wilson [00:41:55]:
So I think that the golden goose out there is for Barzan and his team to figure out: how do I do that with a table? How do I generate metadata that is highly accurate and contextually relevant to this business, in a way that interfacing with an agent will work properly? Yeah. But

Michael Berk [00:42:21]:
oh, just real quick, before you jump in. I have so much beef with Genie right now. The account teams at Databricks have sold the proof of concept to, like, five different customers, and then the customers are like, oh, great, so now you're gonna build me this agent. I'm on three of those projects right now, and we just have to lower the expectations by, like, three orders of magnitude, because it's just not there yet. It's a really cool technology, and it will be there soon. But the demo is not what it is in reality. Right.

Michael Berk [00:42:49]:
Yeah. Over to Barzan.

 Barzan Mozafari [00:42:50]:
I think data explanation is one of those areas I'm also really excited about. But if I'm kind of zooming out: it's very easy, if you ask me what I'd solve if I had a magic wand, to say world hunger and cancer. Right? But I'm also very realistic about my own skill set. So the way I'm looking at it, as someone who's excited about innovation but also wants to make sure it's practical and gets adoption: adoption, to me, has a few legs, and one of them is that it has to work. Right? And it has to work beyond the demo, to Ben's point. It has to work.

 Barzan Mozafari [00:43:26]:
Otherwise, you get the people who are most excited becoming really frustrated, which I think is one of the barriers with AI to some extent: if the results don't match the level of excitement, people get burned out. And I don't know when's the next time that CIO is gonna sign off on something like that again. Right? So, if I'm looking at it from that perspective, I think the key to success is to focus on the intersection of what can be automated and what should be automated. Sometimes people try to automate things that shouldn't be automated, or things that should be automated but cannot be automated with today's technology, because they're inaccurate, inefficient, unreliable, all those reasons. Right? So looking at that intersection of what can and should be automated, one thing we're actually working on that's very exciting is, with LLMs, we've seen massive success with query rewriting. As someone who's been dealing with queries for the past 20 years, an LLM can rewrite queries in ways we never thought possible. But it's not just, hey, ChatGPT, can you please rewrite this query for me into a more efficient form? Because actually, four out of five times, or I should say eight out of ten times, the rewritten query either doesn't even compile... Mhmm.

 Barzan Mozafari [00:44:51]:
Or it compiles but gives an incorrect answer, or it compiles and gives the correct answer but is actually slower than the one I started with. So we've created this framework around it. The paper is out there for those in the audience who are interested; it's called GenRewrite. We've created this really cool cycle where we interact with the LLM to come up with what we call human-readable rewrite rules. So when we rewrite a query, once we validate the rewrite, we turn it into a rule. And then when a new query comes in, we use those rules as hints to the LLM. So now we're pretty accurate, like 90-plus percent, in the sense that whenever we rewrite a query, we have pretty high confidence that it's actually correct and more efficient than the original query.

 Barzan Mozafari [00:45:49]:
And the nice thing is that this database of... I forgot what we called them in the paper, I think it's human-understandable or human-readable rules, something like that. That database actually keeps growing. So it's more along this vision of creating a database that keeps getting smarter over time. Like ChatGPT: the more people interact with it, the smarter it gets. So creating a system that gets smarter over time as we use it, I think, is also super exciting. But I think we will get to a place where we will be able to explain a lot of interesting things. Like, hey.
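The validate-then-learn-a-rule cycle described here might look roughly like the sketch below. The LLM call is a stub, the rule text is invented, and none of this is the GenRewrite paper's actual code; it only illustrates the checks Barzan lists (does it compile, does it return the same answer) against a real engine, here SQLite.

```python
# Minimal sketch: accept an LLM-proposed rewrite only if it compiles and
# returns the same result as the original query; on success, record a
# human-readable rule to feed back as a hint next time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (region TEXT, sales INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("west", 100), ("west", 200), ("east", 150)])

rules = []  # human-readable rewrite rules, grown over time

def fake_llm_rewrite(query, hints):
    """Stand-in for the real model call; pretends to flatten one subquery."""
    if query == "SELECT SUM(s) FROM (SELECT sales AS s FROM t)":
        return "SELECT SUM(sales) FROM t"
    return query

def try_rewrite(query):
    candidate = fake_llm_rewrite(query, rules)
    if candidate == query:
        return query                            # model found no rewrite
    try:
        rewritten = conn.execute(candidate).fetchall()   # must compile
    except sqlite3.Error:
        return query                            # reject: doesn't compile
    if rewritten != conn.execute(query).fetchall():
        return query                            # reject: wrong answer
    rules.append("Unnest single-table subqueries before aggregating.")
    return candidate                            # accept and learn a rule
```

A production version would also have to check that the accepted rewrite is actually faster, the third failure mode mentioned above; that step is omitted here for brevity.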

 Barzan Mozafari [00:46:25]:
Why did sales in this department at this particular Walmart store go down last month compared to other comparable stores? Right? We will never be able to fully automate experimentation and causality, but at least we will be able to show the most likely causes to the domain expert, who has the domain expertise, which should not be automated, or cannot be, at least today, and who can tell us, hey, you know what? These are the top three reasons. I think it's because we had too many people out of office, or there was this local event at this other place that wasn't here. So I think that's the lane I'm really excited about: just working on things that can and should be automated.

Ben Wilson [00:47:11]:
Yeah. That example brought to mind an old example that I used to use when teaching new data scientists on teams at past companies, about the difference between correlation and causality in intelligent systems. I'm like, here's this model. And I had this dataset that I would always use; it was basically year-round temperature at a park in New York City,

 Barzan Mozafari [00:47:37]:
and

Ben Wilson [00:47:37]:
then the other column was, like, the amount of ice cream sold. And you build a very simple regression model, and then use explainability tools and causality tools on that data. And, of course, it's like, hey, I wanna optimize sales. And what does it come up with? It's like, the thing that you need to change is just increase the temperature, and that's gonna definitely

 Barzan Mozafari [00:48:00]:
I was like, I'm trying

Ben Wilson [00:48:01]:
to teach people, like, hey, be careful how you interpret things that come out of an algorithm.
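Ben's teaching example is easy to reproduce with synthetic data: temperature and ice-cream sales correlate almost perfectly, so a correlation-only "optimizer" would recommend raising the temperature. The numbers below are made up for illustration.

```python
# Synthetic version of the temperature / ice-cream example: a near-perfect
# correlation that says nothing about what you can actually intervene on.
temps = [20, 35, 50, 65, 80, 95]    # park temperature (degrees F)
sales = [12, 30, 55, 80, 110, 140]  # ice cream units sold

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(temps, sales)
# r comes out near 1.0, so a tool that only sees correlation "recommends"
# heating the planet; the real drivers (season, foot traffic) are
# confounded with both variables.
```

This is exactly the trap described above: the feature most predictive of the outcome is often the one you cannot, or should not, change.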

 Barzan Mozafari [00:48:09]:
Exactly.

Ben Wilson [00:48:10]:
I think, with the explosion of gen AI and its popularity and democratization, the only thing that I see as potentially disillusioning, as these capabilities become greater and greater over time, is this: hey, I can query all my data, I can ask whatever question I want, and I can bolt on this tool that's gonna do this causality analysis for me. Inevitably, a system is gonna be built that has those features, that can do these sorts of things and query the right data. And then somebody's gonna say, how do I make my sales go up? And they're gonna ask that of the system, and it's not gonna say increase the temperature of planet Earth in January, but it could do something similar to that in their business, and they might not know. Or, like, oh, maybe if I focus my efforts here... turns out you're cannibalizing another part of your business and creating chaos or whatever. I think with incredibly intelligent and reliable systems, it could create trust issues between people and those systems.

Ben Wilson [00:49:28]:
Is that something that in academia people are thinking about?

 Barzan Mozafari [00:49:32]:
I think so. Not as much and not as many as I'd like, but there are some, actually. We published a paper called DBSherlock a few years ago. It wasn't using LLMs, but the idea was how we can incorporate causal models into a system that can show the most likely causes, and then use that to actually help users, so that we don't tell people the cause of the rain was that your wife took the umbrella. Instead we say, hey, these are most correlated with each other, and then we can use causality models so that the system learns over time. But the people looking into this are not as many, to be honest with you. Not as many as I wish, these days.

Ben Wilson [00:50:14]:
So I've got a silly question for you. You've been in the space for a while and been doing research for a very long time, and were likely exposed to the things that everybody thinks are pure magic nowadays. Like, oh my god, ChatGPT is the best thing ever, it's so smart. Anybody who's been dealing with advanced computer science for decades is gonna look at that like, yeah, we had these a while ago. They've been around a while. Maybe not transformer models.

Ben Wilson [00:50:45]:
They're slightly more advanced, but they're standing on the shoulders of the giants that came before. Were you doing, like, a table slap or knee slap with a bunch of other professors saying, I called it, I knew this was gonna be the year where my grandma knows the name of something that is involved with artificial intelligence?

 Barzan Mozafari [00:51:07]:
That's a very good question. I think when you spend a lot of time in the space, certain things become trivial to you, become clear to you. Right? But to outsiders... okay, that's all you know. Right? If all you've done all your career is this very narrow area, which is, sadly, the situation with a lot of us in academia, where we know everything about a very narrow little topic, then it becomes pretty clear to you. But to the outside, it looks like magic. So, yeah, I mean, I had students who worked on transformer models and whatnot.

 Barzan Mozafari [00:51:43]:
So we were seeing the advances that were coming. But I think what surprised all of us is how quickly the public was impressed by it. Right? We go to a conference, we say, hey, we improved this accuracy by, like, half a percent, and we clap for each other and get excited. Right? But eventually, when it becomes good enough, everyone else suddenly also gets excited about it, because they weren't there on the journey, where we're just going little by little, little by little, where it's harder to see it. They saw, like, hey.

 Barzan Mozafari [00:52:13]:
There were, like, sci-fi movies, and now this is actually here. So

Ben Wilson [00:52:17]:
Right. Yeah. We even got to see that over the last, you know, eight years or so at Databricks, with even traditional ML, where you look at, like, the first couple of months, or probably the first year that MLflow was out, and you're looking at the statistics of how many people are saving what types of models. You're like, oh, yeah, we've got, like, a hundred users that saved scikit-learn models and deployed them, and a bunch of people doing XGBoost. Like, this is exciting. And then you look now and you're like, how many millions of these were saved in the last week alone? It's become so commonplace. Like, every business has these things.

Ben Wilson [00:52:59]:
And not just one; you look at an account and they might be hitting that logging API 500,000 times a week. It's like, wow, that's crazy. It's so commonplace. But ten years ago, that would have been like, whoa, this is state of the art. And people that have been doing that stuff for a long time... like, when I came in and would talk to new accounts that we got on, they're like, we wanna learn more about this new thing called data science, and we wanna understand that new thing.

Ben Wilson [00:53:33]:
And this has been around for a long time, like, way before I was born. They're like, what? No, no, we just heard about this thing that you can do. I'm like, yeah, the paper for that was written, like, way before modern computing. So that speed, like what you talked about, it surprised me a little bit. I didn't think it would hit zeitgeist level of, like, hey, everybody knows this thing, and everybody's got an account on this thing.

Ben Wilson [00:54:03]:
It's exciting, but it's also very surprising.

 Barzan Mozafari [00:54:06]:
No, spot on. Yeah. Exactly.

Michael Berk [00:54:10]:
Cool. So I know we're coming up on time, so I'll quickly summarize. Really interesting conversation. Some things that stood out to me: research skills are very valuable for innovation, and in academia you can typically learn the fundamentals of research, at least one would hope. And fast failure is essential. At a macro level, a lot of organizations are turning off knobs, so there's less configuration and customization you can do.

Michael Berk [00:54:38]:
But despite that, there will always be additional layers of infrastructure to optimize, and people will start using those as discrete blocks in more complex systems. And then some future areas of innovation that we're excited about are agentic querying and query rewriting. If you guys are curious about the paper, it's called Query Rewriting via Large Language Models. So, Barzan, if people wanna learn more about you or your work, where should they go?

 Barzan Mozafari [00:55:01]:
If you Google my name or go to keebo.ai, that's our company, you can get live demos of what we do. If you're using Snowflake or any cloud data warehouse in any capacity, and you're interested in auto-optimizing it and diverting some of that manual effort or infrastructure spend to other areas of your business, it's all at keebo.ai. Or Google my name and look at my academic home page and my papers, or reach out to me via LinkedIn.

Michael Berk [00:55:30]:
Cool. Thanks so much. Alright. Well, until next time. It's been Michael Berk and my cohost,

Ben Wilson [00:55:35]:
Ben Wilson.

Michael Berk [00:55:36]:
And have a good day, everybody.

 Barzan Mozafari [00:55:37]:
Thank you so much. Bye.