AI Deployment Simplified: KitOps' Role in Streamlining MLOps Practices - ML 159
In today’s episode, they dive into the intricate world of MLOps with Brad Micklea, a seasoned expert with extensive experience in software infrastructure and leadership roles at Eclipse Che, Red Hat, AWS, and Jozu. Brad shares his journey of founding Jozu, an MLOps company that stands out with its commitment to open standards such as the OCI standard for packaging AI projects. Alongside Jozu, they explore KitOps, an innovative open-source project that simplifies version control and collaboration for AI teams.
Special Guests:
Brad Micklea
Show Notes
Join them as they discuss the challenges in integrating AI models into production, the importance of monitoring API usage, and the critical role of automated rollback systems in maintaining operational excellence. They also touch on the cultural differences in operational approaches between giants like AWS and Red Hat and hear first-hand experiences on the significance of transparency, trust, and efficient risk management in both startups and established companies.
Whether you're a DevOps professional, MLOps practitioner, or data scientist transitioning to production, this episode is packed with valuable insights and practical advice to help you navigate the complexities of AI project management. Tune in to discover how Brad and his team are tackling these challenges head-on and learn how to set up your projects for success from the ground up!
Transcript
Michael Berk [00:00:05]:
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I do machine learning and data engineering at Databricks. And today, I'm, of course, joined by my lovely and wonderful cohost, Ben Wilson. I do slide deck presentations at Databricks. Today, we are speaking with Brad. Throughout his career, he's focused on software infrastructure, and more specifically, he's held leadership roles at Eclipse Che, where he was the project lead. Eclipse Che is Eclipse on Kubernetes. He's also been the VP and GM at Red Hat, which is an open source organization focusing on developing open source for enterprises.
Michael Berk [00:00:42]:
And then finally, he's been the GM of API Gateway at AWS, which is this like, I think it's like a cloud company or something. I think I've heard of them. And then most recently, he's worked at Jozu, which is an MLOps company, and he is the founder and CEO. So, Brad, what differentiates Jozu from all the other MLOps tools out there?
Brad Micklea [00:01:04]:
Oh, wow. Yeah. It's a good question. It's a fair question because there's, what, 1,000, 10,000 MLOps tools out there right now. Yeah. It's like there's about a 1,000 new every day. So I'll actually start by talking about KitOps, which is the open source project because that's kind of where everything started. My last startup, we did a bunch of things with analytics and machine learning.
Brad Micklea [00:01:29]:
This is not, like, the Gen AI type, you know, fancy stuff. These were more standard ML models, I guess you'd call them. But even within that narrowed scope, or the smaller scope of those models, having come from 20 years of software engineering, myself and my coworkers were kinda shocked at how kludgy things felt when you're trying to coordinate versions of models with versions of datasets, with versions of parameters, with deployment configs, all these things you needed to do in order to get a model from a Jupyter Notebook into production running, helping customers. And we just thought this seems harder than it needs to be. And we looked at the end to end tools, both open source and proprietary, and they definitely helped. But the thing that we didn't love was that all of them seemed to have a very proprietary format, and coming from so much of my career being open source, one of the things I feel super strongly about is open standards. Even more than open source, it's open standards. I hate it, for example, that there's a different plug and voltage electricity standard in North America than in Europe than in Asia, and it seemed meaningless, pointlessly different.
Brad Micklea [00:02:42]:
And so I see so much of that in software that I'm like, let's not do that again. So what we did with KitOps is we said, well, if we were going to take a group of artifacts for something as important as an AI project, we don't want to change the tools people are using in development because those are necessary, they make sense. But when it comes to packaging all that together and then moving that through the development lifecycle and into production, why wouldn't you use something like the OCI standard, which is already used for containers? Pretty much every enterprise in the world uses it as a core part of their pipeline and infrastructure. Now we didn't want to containerize datasets because that, for example, doesn't really make a lot of sense, you don't gain much by containerizing datasets, and you don't want to containerize your parameters, that also doesn't make sense. But interestingly, the OCI spec is actually much broader than just containers. And so we realized you could create an OCI compliant package. It was not a container, but I think of it as being kind of like a chest of drawers. There's a drawer for your model, a drawer for your datasets, a drawer for your parameters, a drawer for your config.
Brad Micklea [00:03:52]:
You can package all that together and you can say, okay, it's version 6 of my model, version 2 of my dataset, version 5 of my parameters. But the whole thing is version 1.2 of my project. So now as a leader, I can say, what version are we at? What's in production? 1.2. What's in development? 1.4. What's in staging? 1.3. Oh, that's interesting. Now I know where kind of everything is, and I don't need to worry about what the subversions of all those individual things are in there. But for a data engineer who's just working on the datasets, they can easily see that.
Brad Micklea [00:04:22]:
And if they only want to pull down the dataset because they don't need the model, they don't need the config, they don't need the params, they can do that. So they don't need to haul these giant, you know, 100 megabyte packages back and forth in order to get the convenience of having a unified package. So that was kind of the root that kicked it off for us, and we've been fortunate we kind of started on this in February of this year, launched 5 weeks later, thanks to just an awesome team of folks I've worked with before who work really fast. And we've already got thousands of downloads, lots of people, some interest from some very large enterprises who are struggling with this problem. So it's been good.
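To make the "chest of drawers" idea concrete, here is a minimal Python sketch. The field names, version numbers, and pull_artifact helper are invented for illustration and are not the actual KitOps Kitfile or OCI manifest schema; they only show per-artifact versions rolling up into a single project version, with one drawer pullable on its own.

```python
# Hypothetical illustration of a versioned AI project package ("chest of drawers").
# Field names and structure are invented for this example; this is NOT the real
# KitOps Kitfile or OCI manifest schema.

project_package = {
    "project": "churn-predictor",
    "version": "1.2",  # one version number for the whole package
    "artifacts": {
        "model":      {"version": "6", "path": "model/churn.onnx"},
        "datasets":   {"version": "2", "path": "data/train.parquet"},
        "parameters": {"version": "5", "path": "params/hyperparams.json"},
        "config":     {"version": "3", "path": "config/serving.yaml"},
    },
}

def pull_artifact(package: dict, name: str) -> dict:
    """Fetch only one 'drawer' (e.g. the dataset) without touching the rest."""
    # In a real registry this would download just the referenced layer/blob.
    return package["artifacts"][name]

# A data engineer who only needs the dataset pulls just that drawer.
print(pull_artifact(project_package, "datasets"))
```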
Michael Berk [00:05:07]:
Got it. So so just checking if I understand it, you're looking to develop sort of a more modular set of components
Brad Micklea [00:05:15]:
How I would pitch it? I would think of it as not so much a modular set of components. So part of why we called it KitOps is because, you know, the role that Git plays for traditional software is that it is where you put all the code. It also ends up being the changes to that Git repository are the events that trigger an indication of we're finished development, now we're into, you know, building and testing, and then now we're into deploying, and now we're into retiring the old thing which is no longer valuable to us. Git kind of covers that for software, but it doesn't do that very well, at least in our opinion, for serialized models. It doesn't do that very well for huge datasets when we're talking about training, like Gen AI especially. It's kludgy. Yeah, you can use Git LFS, but honestly it's not fun. Nobody enjoys that, and after a couple years of making changes to huge datasets like that in the Git repo, the Git repo itself becomes painful. So I think of KitOps as being kind of a I hesitate to say replacement because you would still use Git for your code of course, but rather than drop your giant datasets and your serialized models in Git as a way to version control it and workflow it, you put that in KitOps.
Brad Micklea [00:06:36]:
And so you still have that connection between KitOps and Git, for example, between KitOps and MLflow or any other tool, even with KitOps in a Jupyter Notebook, for example. Does that make it clear, Michael?
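One hedged sketch of the KitOps-and-Git connection described here: the Git repo keeps only a small, text-friendly pointer to the packaged artifacts, while the heavy serialized model and datasets live in an OCI registry. The file name, fields, and registry URL below are assumptions for illustration, not a KitOps convention.

```python
import json
from pathlib import Path

# Hypothetical: instead of committing multi-gigabyte models and datasets to Git
# (or Git LFS), commit only a tiny pointer to the packaged version stored in an
# OCI registry. The file name, fields, and registry URL are invented for this example.
package_ref = {
    "registry": "registry.example.com/acme/churn-predictor",
    "tag": "1.2",
    "digest": "sha256:<content-digest-goes-here>",
}

# This small JSON file is what gets committed to Git; CI can resolve it and pull
# the matching package when building, testing, or deploying.
Path("ai_package_ref.json").write_text(json.dumps(package_ref, indent=2))
```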
Michael Berk [00:06:48]:
Yes. Crystal clear. What are the types of lineage and versioning features that, customers really like and that you guys are focusing on?
Brad Micklea [00:06:58]:
So I think, you know, it's nothing particularly innovative. I think, again, the kind of core of what we're trying to do is stick as close as possible to what has been proven to work in enterprises already, because AI is a huge change for enterprises and we didn't feel like changing every part of a toolchain was a good idea when you've got these big changes happening anyway. Some of them you need to, but some of them you don't. So typically you would still version your code in Git. You could use DVC to version your datasets if you want. But rather than have somebody say, oh well, which version of the dataset goes with which version of the model? Or let's take a really scary scenario. Let's imagine you're an enterprise and you realize that accidentally, shouldn't have happened, somebody managed to put sensitive or PCI data into a dataset to train a model at one point. Now the question comes, okay, which models were trained with that dataset? That's actually a really hard question to answer if you've got your models in NVIDIA, your code in Git, your data in DVC.
Brad Micklea [00:08:01]:
I don't know. I think maybe these, based on time frame. But "I think, based on time frame" is not the kind of response you wanna have to a compliance question.
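To make the compliance scenario concrete: if every project version records the dataset version it was trained on, "which models saw that data?" becomes a lookup instead of a guess by time frame. A minimal, hypothetical sketch:

```python
# Hypothetical project history: each entry bundles model and dataset versions together.
history = [
    {"project_version": "1.0", "model": "4", "dataset": "1"},
    {"project_version": "1.1", "model": "5", "dataset": "2"},  # dataset 2 later found to contain PII
    {"project_version": "1.2", "model": "6", "dataset": "2"},
    {"project_version": "1.3", "model": "7", "dataset": "3"},
]

tainted_dataset = "2"
affected = [h["project_version"] for h in history if h["dataset"] == tainted_dataset]
print(f"Project versions trained on dataset {tainted_dataset}: {affected}")  # ['1.1', '1.2']
```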
Michael Berk [00:08:11]:
Got it. That's super cool. And, what type of customer adoption have you seen? Has it been sort of slow and steady? Has it been a lot of connections maybe so you have an initial sort of uptick and then it's flattened out a bit? What does it look like?
Brad Micklea [00:08:27]:
It's been very steady. Yeah. And, admittedly, we're on, I think, week 8. So, I've gotta I've gotta kind of temper that with, we'll we'll see how things go in another 8 weeks. But in fact, things have been accelerating. The last 2 weeks straight, we've been setting records for the number of downloads, and each time more than the last. So that's been really exciting. In all honesty the biggest response we've gotten has tended to be from either DevOps practitioners or MLOps practitioners.
Brad Micklea [00:08:58]:
Data scientists themselves I think get it, and certainly the ones that have worked more with models that have transitioned to production, they seem to really get it. But I think, you know, there's a large portion of the data science community that, for a very good reason, they became data scientists. They didn't become, you know, production, you know, dev SREs for a reason. They live in a Jupyter Notebook, and a Jupyter Notebook is their world, you know, or something similar. And so for them, you know, how do you version this in production and how do you have confidence that, you know, you can roll back to the correct version if there's a problem? That's somebody else's issue, not theirs. So I think it's more those folks who have responsibility for production that are starting to see these things happen or see them coming down the road, and they're going, oh, I don't wanna be the guy with my hands on the keyboard in production trying to figure out how to roll back to a model when I don't know which versions correspond to what. Like, that is never a good place to be at a bad time.
Michael Berk [00:09:58]:
Yeah. So, Ben, I'm curious. What is your take on how this philosophy differs from that of MLflow?
Ben Wilson [00:10:08]:
It's it's interesting. I looked at the demos of it, and I was like, man, this is cool. But when I looked at it and said, man, this is cool and this looks familiar, it wasn't because of work on the MLflow side. MLflow takes a different approach that is geared more towards, like, the data science crew and saying, we give you a registry. We give you the ability to tag and version control your models and artifacts, and it's more of a data scientist tool. But for machine learning engineers, exactly as you said, Brad, it's those people that are kinda left in the lurch in industry right now, because there's a lot more data scientists out there or a lot more people that are creating models, producing them, tracking them properly than there are people deploying. And a lot of places that I've talked to, like, a lot of companies, even internally at Databricks, the people doing that aren't the people that a lot of people think are doing that. It's not like, oh, this is a data scientist that has learned how to code who's now, like, doing deployments.
Ben Wilson [00:11:23]:
Like, no. It's a software engineer who doesn't know anything about the ML side of it because that's not what they did. They're more concerned with how do I package and deploy software and version it and make sure that I know what I'm pushing to staging. I can't interpret the results from staging. I have to get a data scientist to look at that. I'm like, is this predicting what you thought it was? So it's it's, like, inefficient, and those people generally get frustrated. And they they run off, and they built their own solution for it. And what what I looked at when I was seeing the demo of of what you built, I was like, man, this this kinda sorta has this feel of, like, our build tool.
Ben Wilson [00:12:07]:
So it kinda feels like Bazel, and you're like, okay. I'm building a bunch of, you know, container images, and I get all these MD5 checksum hashes from, like, here's the conditions of everything that went into this build container that's gonna deploy. There's validation steps. I can see the lineage and the history of what went into building this. But it's specifically targeted towards the machine learning engineers out there, which I think is cool. And I kinda knew, like, we were talking about this 2 years ago internally in the team, but, like, man, somebody's gonna, like, go after this and do this right. Like, somebody's gonna take software development best practices and apply this to the MLE side. This is cool.
Ben Wilson [00:12:53]:
Like, great job on your company working on this.
Brad Micklea [00:12:56]:
Thank you, Ben. That's that's great to hear. And I think it's something that we've heard from others. I was speaking to a data engineer at a big, recreation focused software company that everyone would know. It starts with a z. And they were basically saying that the development process for them around the AI specific side of what they do was a bit of a struggle, not on purpose, but it was just that there was the AI team that really, really understood the model but didn't really understand production. And there was a production software team that really, really understood production but did not really understand models, which are not deterministic. And you think about, like, the core of everything you learn about data science is determinism.
Brad Micklea [00:13:40]:
There is no, like, let's deal with this totally nondeterministic system now. That's not, like, something that people spend a lot of time on in most computer science stuff. So it's a weird world. And, you know, he said, like, this is really you're creating a middle ground now, a place where the AI team can work and the software teams can work and they can actually understand each other. They can use a single relatively simple tool and kinda get something out of it. And I think that was really nice to hear because that was exactly what we had wanted to do. So it's nice when you hear people kind of seeing things that way.
Michael Berk [00:14:17]:
And why do you think you guys are some of the first people to tackle this in a seemingly correct way? Is it hard? Is it is it just the stars aligned?
Brad Micklea [00:14:27]:
I think some of it is hard, but it's not, you know, this is not like building the first gen AI. The hardness is not the big thing here. I think that the big thing here is that myself and the cofounders, you know, we've each got 15, 20 years of experience building dev tools and DevOps tools. We've spent time in some of those big organizations like AWS, like Red Hat, like Docker. And so we're really, really familiar with the way that enterprises process products. And I think it's funny, my father was the head of research and development for Nabisco Brands, and so growing up, he would bring home these cardboard boxes, unmarked cardboard boxes, of cookies or cereals that were tests that they would develop in the test kitchen. And we would do A/B tests when I was a little kid, and he'd say, like, which one of these, you know, jam filled shreddies or shredded wheat do you like better than the other one? And for him, because he was kind of on that sciency side, that research side, once they had kind of chosen the product, he's like, it's done. But the reality is that that was just the start.
Brad Micklea [00:15:47]:
Like, that's still just a brown box with no name and some hopefully delicious thing inside. It had to go through massive amounts of, like, market testing and branding and positioning and where the shelf space is gonna go. Whose shelf space do we steal to put this new cereal on in the grocery store? And I think it's the same way that you have a lot of strong, very smart researchers and technologists in the AIML world right now. Super smart people. Some of the smartest people around all working together on these problems, which is why you see AI growing so fast. But most of those folks are not familiar with that whole 70% that comes after you found the great model. We lived in that world of the 70%, but had spent so much of our time focused on the new and the innovative.
Brad Micklea [00:16:37]:
I mean, my CTO worked with Kubernetes very, very early. 2 startups ago, a great friend of mine, Tyler Jewell, who now runs Lightbend as CEO, he had a web IDE company that he'd started and saw containers in 2014, I think, maybe 2015, and said, we should make these the basis for all of our web IDEs. And so I joined him and we tried to do it. And it was a nightmare. Containers were not super stable. I tried to run thousands of them at the same time for something like a web IDE where responsiveness has to be instantaneous or people are just like, this is a waste of time. Man, that was hard. So hard.
Brad Micklea [00:17:21]:
And so but we got there and we sold the company to Red Hat and it was amazing. And you realize that, like, sometimes you just need to understand that enterprise world and then look for the new things that are not really mature enough and figure out how to make them mature enough to fit into the enterprise world. And that's really what we're trying to do with KitOps and Jozu.
Michael Berk [00:17:44]:
That's fascinating. Can you dig a little bit deeper into what that, for lack of a better word, productionizing of a feature is? So it's built. You've A/B tested it. You know that the shredded wheat tastes amazing. How do you go about adding the label onto the box and then doing all those additional steps that you mentioned to ensure that it's a scalable product that many people will use? Is there a formula? Are there steps? How do you think about it?
Brad Micklea [00:18:10]:
It depends a little bit on what the industry is and what the product is, of course. To bring it back to this particular podcast, though, rather than food, which actually, to be honest, I don't know as much about, other than deliciousness. You know, you look at a model, and some of the things that we struggled with certainly were how are you going to test it? How are you gonna test it and validate it before it goes out to market? And the good news now is that there's more and more tools that allow you to do this, you know, including MLflow. But then you've got to figure out, okay, how are we going to log it? What's the right data to log? How are we going to identify where a problem is coming from? What are the signals that would tell us that we need to roll back to an older version of the model versus those signals that tell us that, no, we've just got to figure out a way to shape traffic or do other things to kind of give the model a break or better balance things with the model? How do we handle drift, and at what point does drift become bad enough, again, to say we need the new model tomorrow to correct this, or no, we'll do something else? You know, at AWS, it was amazing for me working with API Gateway because API Gateway's success is not about features. And being someone who spent a lot of my time in product management, you get so used to this idea that your product grows because you add features.
Brad Micklea [00:19:45]:
And the more features you add, the more growth you have. And it's the biggest fallacy in all of PM, and I fight it in every one of my organizations whenever I have a chance. And I saw no better proof of that than in API Gateway. It had fewer features than probably most of the other gateways out there, and I don't think they would argue with me. I'm sure most people wouldn't. But it was wildly successful, and it's a 9 figure business on its own, and that's because it does the hard stuff of that 70% really, really well. You can scale it massively. You can shrink it massively.
Brad Micklea [00:20:22]:
You can do those things repeatedly and in incredibly small amounts of time. It is super secure. Like, you can imagine the constant stream of attacks and other black hat initiatives that were being put against API Gateway every single day. It's the face of AWS. And the product was bulletproof from that perspective. And so you think about those things in terms of models, it's not different. Like, you put a model out there, you've got to be able to protect it. You've got to be able to scale it.
Brad Micklea [00:20:53]:
And scaling doesn't just mean going up, scaling means kind of coming down. How are you going to can you horizontally scale it? Or does it have to be vertically scaled? These are all questions that need to be answered and dictate a different way that you have to test it, that you have to deploy it, that you have to manage it and operate it, that you have to maintain it, etcetera.
Michael Berk [00:21:14]:
What's your market research strategy like? Do you go and interview people? Do you go and, like, just put it out there, see what happens?
Brad Micklea [00:21:24]:
Yeah. Great question. So I guess it's a bit of both. I'm a big believer in nothing tells you more than getting the thing out there. You can talk to people until you and they are both blue in the face, and you might get to the truth, you might not. Not because people are lying to you, but just because there's no truth like usage. That is the ultimate truth. If I use something, I need it.
Brad Micklea [00:21:55]:
If I think I need it, but I never actually use it, and we all have these. You're just looking around your own house. If you were to move tomorrow, there's probably boxes of things that you would be able to leave behind and you might not even remember. So to me, in some ways, it's not even about purchase because people purchase things they don't need. But they but no one uses something they don't need. That just doesn't happen. We don't have time in our days for that. So usage to me is the ultimate.
Brad Micklea [00:22:24]:
I always start by, yes, asking lots of people and specifically asking people who are at the edges of what I think is my user base. So, obviously, you've got to talk to the core of the user base, like, for example, those DevOps folks that have been a bit alienated. That's key, of course. But once you kinda get enough head nodding from those folks, now you gotta talk to the folks a little bit on the outside. Like, is the product manager gonna care about this? Is the tester gonna care about this? Is the ML engineer gonna care about this? And is the data scientist gonna care about this? If you get enough kind of good signals from that, then you create the thing, you push it out, and ultimately that's where you really, really learn. Codenvy was fascinating with the web IDE because we built it. In addition to being the first one that used containers under the hood, it was also, if not the first, certainly one of the first that could actually do compiled languages. Back in 2015, web IDEs pretty much only did interpreted languages.
Brad Micklea [00:23:26]:
And so we built it very purposely to do compiled languages because we said, well, Java is the biggest language at the time in enterprises, and it's compiled. You know, all the languages, frankly, that were used in enterprises were compiled at the time. And so this will be an enterprise tool. We use it, sell it to enterprise teams building software. That's how we're gonna be successful. Absolutely. And then it did very well, but not amazingly. And we noticed a couple IoT engineers, embedded systems guys, using it.
Brad Micklea [00:23:59]:
We're like, that's weird. And so called them up and talked to them about it and they're like, oh, yeah. It's a nightmare trying to configure 5, 6, 7 different embedded system platforms, VMs or containers, doesn't matter, on my laptop. That's annoying. It's hard. It's frustrating. The fact that I can do this in the cloud now with your IDE is amazing. I'm so much more productive.
Brad Micklea [00:24:26]:
Tyler, myself, no one who founded the company ever had any idea about these use cases because none of us had ever worked in embedded systems or IoT. But as soon as that person said that, I was like, oh, this makes a lot of sense. Shortly thereafter, Samsung was our biggest customer, and it was for all their embedded system stuff. We would never have thought to go after Samsung, or at least that division, with something like this had we not had that outcome. And that was about usage. We just followed the usage.
Michael Berk [00:24:59]:
Interesting. That makes sense. Ben, I'm curious how this correlates or doesn't correlate to the MLflow philosophy because I know you guys like shipping things and seeing if they stick. What are your thoughts?
Ben Wilson [00:25:13]:
We're no different. I mean, anybody who has read all the books and has learned over time about product management, and takes it kinda seriously, they say the same thing that Brad just said. Like, don't fool yourself into thinking that, hey, I talked to 50 people. They all say this is a really good idea. Let's sink 6 months of work and 80 people's time into this. It's not a good idea, or using your gut instinct. It's all about ship something. If you have a greenfield idea that you wanna test out, take 2 people and, in 5, 6 weeks, build a prototype that you can ship out there, market as experimental, whatever.
Ben Wilson [00:25:58]:
And, yeah, talk to people, get their feedback. But while you're talking to them, your system had better be monitoring how many times they're calling your APIs. We don't collect PII data about any of that stuff. I have no idea who it is that's using it. I'm just getting an integer increment every time somebody hits the API. So I can track over time and say, woah. Like, we thought this was kinda okay and useful. Why are there 13,000 people using this a day? Are we onto something here? And then you double down and continue to collect data.
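A rough sketch of the anonymous counting described here: increment a per-endpoint counter on every call and store nothing about who called or what they sent. The decorator and counter below are assumptions for illustration, not MLflow's or Databricks' actual instrumentation.

```python
from collections import Counter
from functools import wraps

# Hypothetical anonymous usage counter: no PII, no payloads, just an integer per endpoint.
api_call_counts: Counter = Counter()

def count_usage(endpoint_name: str):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Who called, and with what arguments, is never recorded.
            api_call_counts[endpoint_name] += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator

@count_usage("search_experiments")
def search_experiments(query: str):
    return []  # placeholder for the real handler

search_experiments("demo")
print(api_call_counts)  # Counter({'search_experiments': 1})
```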
Ben Wilson [00:26:33]:
Well, verify that you're collecting the correct data and that it's not a false signal. And then that's how you kinda get directional information about where to go. But, yeah, I've seen it many times in previous companies where idea people are just they're so certain that what they're building, whether it be in the ML space or the pure software space or just the product space in general at a company, they really convince themselves that their idea is great, and then they launch it. And then you look and collect data after the fact. They're like, oh, you know, sales didn't go up the way we thought. Let's check on engagement now. Pull the data. Like, we don't have instrumentation set up.
Ben Wilson [00:27:20]:
Like, why wasn't this thought of earlier? And it's a scramble to say, like then you get the c suite involved who are saying, did we just waste a ton of money and time on this? Who's to blame? Like, think about instrumentation and data collection before you build anything, and then it sets you up for success, I think.
Brad Micklea [00:27:40]:
I couldn't agree more, Ben. And it's interesting because it's one of the things that is hardest. I love open source software, but I think the instrumentation is one of the hardest parts about it in open source software. I think it's getting better now, you know, but it's really only been in the last few years that I think open source communities have become a little bit more comfortable with the idea that anonymous tracking, that kind of instrumentation, is okay inside of an open source project. I remember yeah, I'm getting old, it doesn't feel long ago. It was maybe, you know, 8, 10 years ago, where we had to fight to add that kind of data to the early versions of Eclipse Che because people didn't like it.
Brad Micklea [00:28:26]:
They just were like, it's open source. It shouldn't have that stuff in it. Even though it was visible to everybody, didn't matter. But but that stuff is critical.
Ben Wilson [00:28:35]:
Yeah. For for open source and MLflow, we don't do that. Mhmm. We we look at download statistics to just kinda get a a general idea, and we talk to open source, you know, specific users and open source that happen to be Databricks customers, but, they don't wanna use the Databricks managed offering. They built their own. Like, hey. We're running MLflow on Kubernetes with, you know, 600 pods because we have, you know, 8,000 models being built today. Like, woah.
Ben Wilson [00:29:11]:
That's crazy. Yeah. You can do that on Databricks. They're like, yeah. But we have to, like, integrate with all these other systems, so we need to modify MLflow. Cool. Yeah. That's what it's there for.
Ben Wilson [00:29:21]:
It's open source. Do you need any tips on what to do? And we love working with them because we get information, and we wanna know what features are you using. But on the Databricks side, that's where we collect telemetry. So we know every single API. We don't collect any data about what they're doing with it or who they are. But every 10 minutes, we can see a snapshot window of, like, what APIs are being hit on anything involving MLflow, and you can track over time by region globally. It's interesting sometimes. Like, you look and you're like, alright.
Ben Wilson [00:30:01]:
We're down to the last 5% of countries on the planet where people haven't used this particular API. You know? And you just see what resonates with people. And, you know, looking at your product, it it really brings to mind a lot of issues that I've seen in, you know, us internally doing, like, building demos and something or or doing stuff within internal teams where everything's when you look at MLOps through rose colored glasses, you're like, okay. I'm doing a happy path, which a lot of people like to focus on, which is, k. I get my dataset. I did all my statistical validation. I've done my feature engineering. I've got this dataset.
Ben Wilson [00:30:45]:
It's versioned. I've got everything set. I've cleaned out PII data, whatever I needed to do. And now I wanna go into model tuning and hyperparameter tuning, and I generate this artifact that I can use to run predictions. I do all that. When you're doing the first deployment, it's relatively easy. When you're doing the 7th deployment, it's relatively easy provided you never had to roll back. And what you said earlier in the talk really resonated with me because I've had to do that back when I was a data scientist slash MLE in a previous life. It was, okay.
Ben Wilson [00:31:26]:
We just deployed something that's totally broken because we didn't even know that this was a problem. We didn't have tests for this because everything was compartmentalized. We had
Brad Micklea [00:31:36]:
Mhmm.
Ben Wilson [00:31:36]:
Dataset, data engineering style, quality checks, and then we had ML, like, verify the predictions and, like you said, nondeterministic results. So doing statistical validations of, are we, like, classifying stuff to the right proportions? Are our regression results within these bounds of SPC? And then you go and do quality checks, you know, unit tests and integration tests with the deployment side of the house. But there's nothing that had everything built into one artifact that you deploy to a system to say, here's all of the stuff that I'm doing to each of these different systems, and then test it end to end. That was always manual. In your first five or so deployments, you're really scrutinizing everything of that step of, like, hey. I'm manually checking these 10,000 results that I ran through for validation and just making sure because I'm scared. You know? Like, is this junk or not? And then running through statistical validation of the outputs and saying, okay. We're good. We can flip the switch or we can transfer traffic over to that.
Ben Wilson [00:32:48]:
It's good. And then slowly, you know, monitor it, and then people are looking in real time at data coming back and making sure, like, yeah, sales are up or, like, engagement's up. We're good. Let's flip the switch. But when you get to, like, the 37th release, 10 months after you started that project and everything has kinda gone well up until that point, you find a huge issue in your deployment. You realize, I think this has been around for, like, at least 7 weeks now. What state do we go back to to restore and avoid this problem? And now you waste 2 weeks, or sometimes, like, personally, I've had to shut off a system that is making the company money in order to scramble to rebuild a previous known last good state, and that's all manual. And, you know, like, on hour 53 nonstop without sleep and your 17th cup of coffee, you're wishing that you had something like, why are we not building containers here? Or something where I can have a build manifest that I can just say, yes.
Ben Wilson [00:33:57]:
This is a known last good state. So it's cool that you guys are solving this. It is a fundamentally massive problem for everybody involved and for a company.
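As a sketch of the statistical gate described above (checking that predicted class proportions stay within expected bounds before shifting traffic), here is a hypothetical example; the thresholds and helper are illustrative, not anyone's production check.

```python
from collections import Counter

# Hypothetical pre-release gate: compare predicted class proportions against
# expected bounds before shifting traffic to a new model version.
def class_proportions_ok(predictions, expected_bounds):
    counts = Counter(predictions)
    total = len(predictions)
    for cls, (low, high) in expected_bounds.items():
        share = counts.get(cls, 0) / total
        if not (low <= share <= high):
            return False, cls, share
    return True, None, None

preds = ["churn"] * 180 + ["retain"] * 820  # the 10,000-row manual check, shrunk for the example
bounds = {"churn": (0.10, 0.30), "retain": (0.70, 0.90)}
ok, cls, share = class_proportions_ok(preds, bounds)
print("OK to shift traffic" if ok else f"Hold the rollout: {cls} at {share:.2%}")
```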
Brad Micklea [00:34:11]:
Yeah. I think that's the thing, and I couldn't agree with you more about it. I think, you know, you haven't actually operated a service until you've had to roll back the broken service. Like, that's the point at which you finally operated the service. Up until then, it's like riding a bike downhill. Like, sure. How hard is that? And then you come to go up the hill and you're like, oh, wait. This is a lot harder than I thought.
Brad Micklea [00:34:37]:
I'm actually not as fit as I thought. So yeah. No. That's exactly it. And and I think that the company risks are often, in some of it because they're unknown, but but kinda underappreciated by the folks working in those roles. I was speaking to a a major global retailer, And this is, quite a few years ago now because they've been using AI for a long time. But they had a group of data scientists who built a model to predict the different types of clothing that should be distributed to warehouses in different regions and locations based on, you know, what clothing was popular in each of those areas. And it was very, very good.
Brad Micklea [00:35:30]:
It worked extremely well. And as you can imagine, it's really important for the thin profit margins of these kinds of companies to get that right, because the more inventory they hold that isn't sold, that's just lost money. So the first year or so it was out, fantastic. Now part of the problem was that the data science team that had built it didn't have any responsibility for it in production. The engineering team had taken what the data science team had built and kind of built it into a service that was actually usable by a customer. Because generally it's not like a customer can just send an API request to the model. Like, you have to have a set of services that kind of broker between what the customer is trying to accomplish and the model. And so the software engineers built that and built the plugs into the model, and so they owned that front end system, but they didn't know anything about how the model worked, and so they didn't feel ownership about the model either.
Brad Micklea [00:36:25]:
Well, the model started to drift and its recommendations got worse and worse and worse and it snowballed. And this happened unfortunately kind of during the pandemic when maybe people's eyes were not quite as on the ball as they would otherwise have been. But it ended up being actively bad for their business after a couple years. And everyone is looking everywhere else and the fingers are pointing in all the different directions. And that's the stuff that's scary and you realize that when sometimes when I'm talking to people and I say like this is going to give you a central way to track and control your AI projects, from an operational standpoint you're like okay that that sounds nice. But I'm like, I don't think you understand how critical this is. Yeah. You will.
Brad Micklea [00:37:12]:
The day you recognize how critical this is, you will really, really wish you could come back in time to this conversation and go, oh, yeah. Well, no. We need to do this now, because it will be too late then, and it will hurt. And that was just exactly where this particular company got to. They were like, wow. We've lost untold millions of dollars because we didn't know where the thing was and how to roll back and where the good version was and where the bad version was. All those things. They're not the things that most people think about first.
Ben Wilson [00:37:42]:
And that's the sort of wisdom. Like, Michael and I have talked about it on the podcast a couple of times and in private as well, because he's very curious about software development as he should be. But some of the the bits of, you know, wisdom that I've given him, it's never the rose like, the rosy fun stuff. Like, oh, you get to build a super cool feature, and and it's really exciting. We never talk about that, do we, Michael? We always talk about You're always complaining. Yeah. I mean, that's the that's the that's the role, really. It's like, hey.
Ben Wilson [00:38:17]:
Don't think that this is I mean, it is exciting, and you are building cool stuff, but you're also knowing that it's what happens after it's built that matters way more. And there's nothing like that sinking feeling of you realize that you didn't write enough tests or you didn't validate something or some sev zero regression got shipped, and now you're in recovery mode. And it's a learning experience, and it teaches the team and the organization new skills on how to recover from calamity, and it's going to happen. The only people that don't experience it are people that are, you know, in the basement building their models and hoping that someday it'll be used by the business, which I think a lot of that is still present in the ML community. People are, oh, this is a really cool project that the data science team is gonna work on, and it does this really neat thing with AI. I was on teams that did stuff like that. I built things like that. Like, it'd be really cool if we did this, and you build it, show it off, and people are like, that's amazing.
Ben Wilson [00:39:30]:
And then you got one person, I guess it's a principal software engineer, that comes up and is, like, how are you gonna ship that to prod? Like, I don't know. Let's not talk about that right now. That's next quarter. And they're like, come talk to me. Let's plan this out. And you learn from people, and then you really learn by messing it up. Yeah. And that whole, like, ship something, have it break, panic, fix it, it teaches you how important all of this stuff is and that ability to recover.
Ben Wilson [00:40:06]:
I haven't really dug into your tool yet, so I apologize for that. But automatic rollback of something like, hey, I can flip a switch in either a UI or through the command line to say restore last known good version. I think that was built 3, three and a half years ago at Databricks, where it's an automated system now, where, like, full integration tests are done, you know, through container validation and integration tests, and if that prebuilt staging environment fails, or if a region ship like, we deploy code to regions like AWS regions or Azure regions. If something fails there before that gets switched on for customer usage, so, like, an image of our platform, it'll just automatically go to the last known good. You get a lot of pages when that happens to let you know, like, time to update your tests or fix your build process. But it becomes so critical because if you don't do that, with the manual recovery process you could wipe out hundreds, if not thousands, of human hours with just one regression if you don't have anything set up.
Ben Wilson [00:41:25]:
So I see tools like this becoming more and more the mainstream for people that are serious about actually shipping their basement projects.
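A hedged sketch of the "restore last known good" behavior described above; the deploy, test, and paging functions are invented for illustration and are not the Databricks or Jozu implementation.

```python
# Hypothetical automated rollback: if validation of the new release fails,
# redeploy the last version that passed, then page the team.
def deploy(version: str) -> None:
    print(f"deploying {version}")

def passes_integration_tests(version: str) -> bool:
    return False  # pretend the new release failed validation in staging

def page_oncall(message: str) -> None:
    print(f"PAGE: {message}")

def release(candidate: str, last_known_good: str) -> str:
    deploy(candidate)
    if passes_integration_tests(candidate):
        return candidate
    deploy(last_known_good)  # automatic rollback, no human in the loop
    page_oncall(f"{candidate} failed validation; rolled back to {last_known_good}")
    return last_known_good

current = release("model-project:1.4", last_known_good="model-project:1.3")
```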
Brad Micklea [00:41:36]:
Yeah. And I think, interestingly, Ben, you've given me kind of the perfect segue from KitOps to Jozu. So, I mean, KitOps is very much focused on that packaging and the versioning and that single source of truth, and being able to get people to, yes, use whatever tool you want because your tools are important to you. But this should be the canonical location that's where the project is, and all the tools can pull from it. All the tools can push to it, but we know. We know. And you have your tags there for, you know, champion, challenger, latest, you know, stable, whatever makes sense for you and your process. Jozu is going to be much more of a kind of control plane for AI projects is how I think of it, which is a nerdy way of saying it perhaps.
Brad Micklea [00:42:23]:
But you think about operationally there are so many of those steps that you outlined, Ben, and that I know you've been involved in personally. And the way that those get done needs to be standardized. One of the things that I really, really took away from AWS was there was this strong focus on mechanisms, which was really just a fancy word for process, but it had to be automated. And so people would relentlessly go and kind of pick away at anything where humans had to make a decision and say, do we need the human to make a decision, or can we do something automated so that in the pressure of a moment when everything is going wrong, where there's fires all around you, you're not waiting for somebody to make a decision at a time when their decision making capacity is going to be compromised? And so you think about a control plane as having a way to encode all those good decisions. And like you said, you're gonna get pages when the tests fail and it rolls back automatically to the last good version. But at least it rolled back automatically to the last good version. Didn't push forward
Ben Wilson [00:43:35]:
with the bad one.
Brad Micklea [00:43:36]:
Like, that's a way worse scenario to have. Oh, yeah. I once had a junior engineer who made one mistake in coding some deployment code and ended up taking out an entire region for only about 5 minutes, but it was scary. And it was a good life lesson, and this is why we encode things, not, you know, do it manually. And they've since gone on to be an awesome engineer with AWS, but it was a fun start.
Ben Wilson [00:44:08]:
Yeah. I'm curious. Do you get more of an understanding of how serious things can be because of working at the companies that you have worked at? Because I can't imagine table stakes being higher for anybody more than those 2 companies that you mentioned, Red Hat and AWS. I mean, more than half the Internet runs on both of those companies. I mean, you're talking trillions of dollars in the global economy. So you can't really ship a, yeah, it looks good enough, you know, version of Linux or something that enterprises use. Or, you know, if some sort of CVE gets issued and you just don't respond to it, don't fix it, don't ship out a build, it's more than just money for that company. It's reputation.
Ben Wilson [00:45:10]:
It's like, hey. This is the safe and secure place to run your code. Did you absorb all of that rigor around how to properly deploy and manage versions of code, like, based on just being there?
Brad Micklea [00:45:24]:
I think that that was a hugely important learning experience, and it's part of why I really have always wanted to work at AWS and why it was important to me that in my career I spent time there, because I've always respected them culturally. I think we all kind of are taught through media and everything else that it's the idea person that is the thing. Like, Steve Jobs was amazing because he came up with the idea of the iPhone. Well, the iPhone wasn't a unique idea really, there had been other attempts at it. It was actually just operationally far better than anything else before, and that's how I look at it anyway. And so when I look at AWS, they didn't I mean, you can argue about whether they invented cloud computing or not, but certainly they have proven year in, year out to be the best at operating cloud environments. And getting a chance to run API Gateway, it was a privilege because you got to see that firsthand. Like, it really melted my mind probably, like, the first day when I took over and we had a SEV 2 issue at about 2 o'clock in the morning. And so I'm paged.
Brad Micklea [00:46:27]:
I jump on. I don't know much about the system yet because it is day 1 for me. One of my engineering managers, she's amazing, thankfully she was on, and we talked through this issue, and we were in a rough spot. To fix our problem would have meant breaking a downstream service. And so we had a choice, are we gonna break the downstream service, or are we gonna continue to degrade API Gateway for everybody else? And as we're trying to diagnose this, one of the things I asked was, well, have we done any recent deployments? And she kinda paused and she's like, well, we're always deploying. And I was like, what do you mean you're always deploying? And she's like, there is always something deploying in API Gateway somewhere. Like, it's literally nonstop.
Brad Micklea [00:47:18]:
And you don't think about that. You don't think about the idea that this could require literally 24 by 7 deployments to be happening in order for you to get all these things out. And so I think, yeah, there's nothing that can prepare you to, you know, for that kind of scale, and it does force it deep into your bones, because of how how deep you have to think about it. Red Hat was very different. Red Hat, of course, wasn't didn't have that immediacy because you ship enterprise Linux and people download it and then they install it and then they run it. It had a different challenge, which was that where AWS, like any cloud company, I mean, mistakes will get made and then you fix them in the moment and you measure your outages in terms of, you know, minutes or or hours if you're unlucky. Red Hat was different. I mean, it takes time to build a RHEL image.
Brad Micklea [00:48:12]:
Even a RHEL patch takes time. And one of the things that I loved about Red Hat, which is they would really only productize a product, at least certainly when I was there, I imagine it's still true, if they had committers in that open source project. Because if there was a CVE that came in, they needed to know that they had full autonomy to fix that CVE and deploy that patch as quickly as they wanted to do it. They never wanted to be in a position where they had to wait on some other committer with an unknown schedule and an unknown set of priorities to merge a fix for a problem that was critical to their customer base. Mhmm. You've gotta own that ability to help your customer. If you can't own that, then you should not be in the business. And that was really what hit me because that's not a given in open source.
Brad Micklea [00:49:05]:
There's actually and probably, you know, this doesn't happen so much now, but certainly back then there were tons of open source based companies that didn't have committers and couldn't directly control the pace at which a security issue went out. And, you know, Red Hat in doing that really made themselves unique, and that's why they're so trusted even today.
Michael Berk [00:49:29]:
Were there any practices or cultural elements that you have tried to avoid that you saw at AWS or Red Hat?
Brad Micklea [00:49:40]:
Yeah. I mean yeah. There's always there's always things that that you kind of, you know, don't agree with. I don't think that there were any systemic things, though. Some of it just comes down to to kind of personal approach. So and a lot of it has to do with the scale. In my companies, I've always believed in being extremely transparent with the rest of the team. And not just the exact team, I mean, everybody.
Brad Micklea [00:50:15]:
I talked to my team about the highs and lows of fundraising and where our finances are and what's going on legally or anything else. That is risky. It can actually pose a risk to the business to do that, because people can share that information even though it is confidential, and people can panic about the information and not share their panic with you. So you may not realize that the way you've said something came across much scarier to somebody else than it is to you, because you have a lot of context and you know it's not that scary. But maybe for them, they hear it and they're thinking, oh my god. We're gonna be bankrupt tomorrow. And that's not what that meant, but you don't know that, and if they don't tell you, how can you know? And so it's a double edged sword.
Brad Micklea [00:51:03]:
I have lost good people because they have gotten the wrong impression about things and thought that they were in a risky position when they weren't and left to go somewhere else, and that's unfortunate. It hasn't happened very often. But I believe it creates a level of trust, and I think that one of the things that I saw within my team in API Gateway and I can't speak to others within AWS, it could be very different. Within my team at API Gateway, we developed a great deal of trust in each other, and it meant that we were able to work in those sev 2 production outage type scenarios much more efficiently because everyone understood that if they say, you know, Bob, can you do this? That Bob was gonna get it done. Janet, can you do that? Janet's gonna get it done. And if they need help, they're gonna ask for help, which is one of the most important things because a lot of times people think they gotta do it on their own. And if they just asked for help, they could've gotten it done so much faster and easier. But, again, to create that trust, sometimes you gotta share a lot, and it's not possible.
Brad Micklea [00:52:09]:
It's not responsible to share to that same degree in AWS, which ultimately was part of why I left not because they silenced me or anything like that, it's not it wasn't like that, they're they're awesome. It's really just that I realized that I couldn't be the kind of leader that I like being and be legally and fiscally responsible within AWS. And you should never be in a role if you can't do both of those things at the same time. So that was kind of part of what made me realize now I wanna go back to the startup world where I can kind of put my arms around the whole team, if you will, and, kinda make things happen that way.
Michael Berk [00:52:46]:
And do you hire for people with a potentially higher risk tolerance?
Brad Micklea [00:52:52]:
So it's interesting. There's a perception that I disagree with slightly that startups are riskier than non startups. I think they are very different. It's just different types of risk. And it comes down to what your personal kind of comfort level is with different types of risks. Within the startup and again, startups are different, you know, from place to place. But certainly in my startups, when we hire somebody new, almost well, everybody is aware that we're doing it. Everybody has access to look at the LinkedIn profiles of the people who are applying, and everybody is welcome and encouraged to provide any feedback.
Brad Micklea [00:53:35]:
Either this looks awesome, this person looks amazing, or why? What? Those kinds of questions too. We try and have as many people as possible from across different areas participate in the interview process. What that means is that I tell people the level of risk that you're going to get a crappy teammate or an a hole boss is really, really low because you are an active participant in hiring them. In a big company that risk is actually really high, in my experience, because in big companies, hiring tends to be optimized for efficiency. And that means small loops with a small number of people given a lot of power and typically not a lot of cross team, you know, input, anything like that. So you get that risk. But you trade, yes. Startups have a certain amount of money, and if they can't generate more of it, then they go bankrupt.
Brad Micklea [00:54:38]:
And there are more startups that go bankrupt than big companies, but big companies go bankrupt too. So, you know, it's it it just depends on on what kind of risks you are willing to take. You know?
Michael Berk [00:54:53]:
That that's a a good way to put it. Yeah. It's been interesting. I've been at Databricks for about 2 years, and even I don't know the growth and number of employees, but it's been big, especially in the field engineering space. And no comment directly on quality, but I've seen hiring practices have shifted a bit. And it's the the loop has been a bit more concise. Mhmm. And there's pros and cons to that, of course.
Brad Micklea [00:55:23]:
No. That's exactly it. And so I think I think that's you know, it just comes down to what people want, in, you know, in their career and their life, and and that's that's reasonable.
Michael Berk [00:55:34]:
Yeah. Cool. Well, that was a nice little closing topic. So with that, I will wrap and summarize. Lots of really cool topics today. A few things stuck out to me. Adding features does not always lead to growth; if you wanna do something really well, that can often be the most effective path to getting users. Another great way to develop features is just put it out there.
Michael Berk [00:56:04]:
See if it sticks. See if there's actual usage. Purchasing is not always a great indicator of whether there's actually value within that feature. And then when you're doing market research, ensure you have a representative sample of your user base, and also think about the tails of the distribution, not just your target user. When things are going down, you probably wanna have as many decisions automated as possible, so try to create infrastructure where decisions are automated. And, yeah, lots of other things. So if you wanna relisten, definitely do that. There are a few other things that I I wanted to discuss, but, unfortunately, we didn't have time.
Michael Berk [00:56:40]:
So maybe next time. So, Brad, if people wanna learn more about you, Jozu, or KitOps, where should they go?

Brad Micklea [00:56:49]:
For KitOps, you can go to kitops.ml or look for KitOps on GitHub and just visit the repo. Jozu is at jozu.com; it's actually Japanese for something that is skillful or very well done. So that's where you can find those, and you can find me on LinkedIn. I think I'm the only Brad Micklea, because that last name was probably made up when my ancestors came to this country.
Michael Berk [00:57:21]:
Cool. Alright. Well, until next time, it's been Michael Berk and my cohost, Ben Wilson. Have a good day, everyone.
Brad Micklea [00:57:28]:
We'll catch you next time.