Michael_Berk:
Hello everyone, welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I'm joined by my co-host...
Ben_Wilson:
Ben Wilson.
Michael_Berk:
And today we're gonna be talking about a really interesting topic, and that's more about sort of how to develop ML models and the code behind the ML models. So today our guest is Fabian Jakobs, and he currently works at Databricks as a super senior software engineer. He came from Amazon AWS, where he was the TL for CloudShell and also worked on Cloud9, and I've played around with both of those tools. If you're not familiar, CloudShell is sort of a terminal in your browser, and Cloud9 is a full-featured IDE that also runs in your browser. So think IDEs or development environments in your browser. So Fabian, do you mind elaborating a bit on what you do at Databricks?
Fabian_Jakobs:
Sure, I'm very excited to be here. Yeah, I joined Databricks to work on developer tooling, and specifically external-facing developer tooling. We want to enable developers to use third-party tools with Databricks: things like IDEs, CI/CD, CLIs, SDKs, everything that's outside of the web application and interacts with Databricks. The most visible one is the IDE, and I think that's what we want to talk about today.
Ben_Wilson:
Yeah, a little bit of backstory for listeners that aren't familiar, from somebody who's been at this company for a shockingly long amount of time compared to some of my peers. Five, six years ago, the Databricks environment was mostly notebook-focused, and you had two options for development. Most data science and ML people were exclusively doing stuff in notebooks. ML engineers or software engineers at companies were doing stuff in IDEs, but you would have to specifically write your unit tests in a way where you're creating an open source Spark environment and testing things against that. A lot of stuff that exists within the Databricks Runtime, the environment you can execute your code on, just isn't available for doing integration testing or unit testing with that tooling. The dev loop journey for building things was exceptionally long. Because of teams like the one Fabian is on and the tooling they're building, that process is becoming so much simpler. It's night and day compared to what it was years and years ago.
Fabian_Jakobs:
Yeah, that's right. That's also what we keep...
Michael_Berk:
Yeah, and... sorry, go ahead, Fabian, my bad.
Fabian_Jakobs:
That's what we keep hearing from customers. A lot of customers use IDEs, and they figured it out. But the real thing is they had to figure it out. There was not one way to do it, or one supported way to do it. And that's what we want to address. We want to have one supported way to use IDEs with Databricks, if you choose to do so.
Michael_Berk:
And so what is that way?
Fabian_Jakobs:
Yeah, that's a good question. The approach that we're taking is that we actually run the code on the Databricks cluster. But in order to do this, we have to get the code to the Databricks cluster. So you have your project locally in your IDE, you have your Python files, and you can use IDE features just the way you usually would. Then we have a command that synchronizes your code to a Databricks cluster. You configure a Databricks workspace, a Databricks cluster, and a location to sync the code to. And once you have the code on the cluster, we can use the command execution APIs or the jobs execution APIs to execute the code on the cluster. The benefit of this approach is that the code actually runs on the cluster, and like Ben said, suddenly you have access to all the features of Databricks, right? You have dbutils, you have Feature Store, you have all the ML libraries that are in the Databricks Runtime for ML. All of this just works, but you can do it from the comfort of your IDE.
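As a concrete sketch of that loop (not the extension's actual internals): sync the code with the Databricks CLI, then submit a one-time run against the same cluster through the Jobs runs-submit REST API. The host, token, cluster ID, and paths below are placeholders, and running a workspace file this way assumes a recent runtime.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                       # placeholder
CLUSTER_ID = "<interactive-cluster-id>"                 # placeholder

# Step 1 (outside this script): push local code to the workspace, e.g.
#   databricks sync --watch . /Workspace/Users/me@example.com/my-project

# Step 2: submit a one-time run of the synced entry point on the
# existing interactive cluster via the Jobs runs-submit API.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "run_name": "ide-dev-loop",
        "tasks": [{
            "task_key": "main",
            "existing_cluster_id": CLUSTER_ID,
            "spark_python_task": {
                # hypothetical synced entry point
                "python_file": "/Workspace/Users/me@example.com/my-project/main.py"
            },
        }],
    },
    timeout=30,
)
resp.raise_for_status()
print("run_id:", resp.json()["run_id"])  # poll /runs/get with this ID for status
```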
Michael_Berk:
Got it. And the code syncing part, does it actually sync live?
Ben_Wilson:
Which is a very... that's a very simple statement that you made there, that last sentence, but it's so big when you're talking about developing an ML project that's not the cookie-cutter hello world for an ML package. Where you're like, hey, I have actual feature engineering work to do here. I have a feature engineering module, and there are 20 Python files in it that all contain classes and functions. And if you need to test all of that end to end, to make sure the entire pipeline runs up to the point where the model is ready to be trained, with all the features that an IDE has: in the past, in order to test that, it was copy-pasting to test. Hey, I need this module? Well, that goes into a cell in the notebook. And if it's a class, it gets instantiated. When you're doing that debugging process, it's torture, and it makes dev work take so much longer. So having that pipe connect...
Fabian_Jakobs:
It also makes natural certain things that are hard to do today. You have this code in a cell, but nothing stops you from moving it into a Python file and wrapping it in a function. And if you design the interface of the function nicely, you can write unit tests for it, iterate on the unit tests really, really quickly, then plug it into the system again and run it. These are basically standard features if you develop in an IDE, but they're not all that common if you're coming from notebooks, where the notebook is the main way to interact. This enables a lot of new and good ways to develop. And as you mentioned, I could...
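A minimal sketch of that extraction pattern, with names and columns invented for illustration: a notebook cell becomes a small module function with a testable interface, plus a pytest test that iterates locally in seconds.

```python
# features.py -- logic that used to live in a notebook cell
import pandas as pd

def add_price_per_unit(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a price-per-unit column; fail loudly on missing inputs."""
    if not {"price", "quantity"}.issubset(df.columns):
        raise ValueError("expected 'price' and 'quantity' columns")
    out = df.copy()
    out["price_per_unit"] = out["price"] / out["quantity"]
    return out

# test_features.py -- run with `pytest` and iterate quickly
def test_add_price_per_unit():
    df = pd.DataFrame({"price": [10.0, 6.0], "quantity": [2, 3]})
    result = add_price_per_unit(df)
    assert result["price_per_unit"].tolist() == [5.0, 2.0]
```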
Michael_Berk:
I had one question. Just, yeah, sorry, there might be a bit of a lag, but we're working on it.
Fabian_Jakobs:
I think it's because I'm sitting on the other side of the ocean.
Michael_Berk:
True. Yeah. So Fabian is based out of Amsterdam, correct?
Fabian_Jakobs:
That's right, yeah.
Michael_Berk:
Yeah, okay, that might do it. So the speed of light isn't fast enough for us. So I had just one question that specifically relates to ML. A lot of people tend to use notebook environments to iteratively test and explore data, and that's really good because it's super responsive and you have all your variables accessible. With an IDE, it's more like traditional software development, where you have a single run, and if one thing breaks, everything breaks. So I was wondering how you guys are thinking about ensuring that the Databricks backend cluster is very responsive and super fast, because I know kicking off a job requires starting a cluster. Are you using interactive clusters, or how are you handling performance?
Fabian_Jakobs:
Yeah, for the interactive development we use interactive clusters, because I don't think any developer would accept a round-trip time of five minutes just because the cluster needs to be started up and torn down. I tried that for about five minutes and then decided this is not good. So
Ben_Wilson:
haha
Fabian_Jakobs:
in order to do this, during development you need an interactive cluster, which keeps running, and then you can iterate on that cluster to get this fast round-trip time. I think we can go as low as a five-second round trip for a single function, which is actually pretty good. Of course, if you do larger processing, this can take however long, but the important thing is the minimum lag you can get down to, and that's something we really optimized for in our extension. The other thing you mentioned that I think is interesting to look into: notebooks are really, really good for this interactive use case, right? You want to explore something, you want to tweak a little bit, you run it; you have this very interactive experience. We're not going to take that away. I think ML engineers will continue using notebooks like that. But what is possible with the extension is that you can put this notebook next to your code. You iterate on the code in Python modules, then you import that from your notebook, and the notebook essentially becomes your runner. You can have the best of both worlds. That's something we want to explore a little bit more. In the first version we mainly focused on pure Python, but I think the interesting stuff happens when we start to combine those. Can we leverage notebooks for interactive exploratory work and IDEs for engineering?
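A sketch of that "notebook as runner" idea, reusing the hypothetical module from the earlier example and assuming it sits next to the notebook in the same repo; autoreload is standard IPython, so edits to the modules show up without restarting the kernel.

```python
# Notebook cell: the notebook is a thin runner over versioned modules.
%load_ext autoreload
%autoreload 2  # re-import edited modules automatically

from my_project.features import add_price_per_unit  # hypothetical package

df = spark.table("my_catalog.my_schema.sales").toPandas()  # placeholder table
enriched = add_price_per_unit(df)
display(enriched)  # keep the interactive, visual part in the notebook
```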
Ben_Wilson:
Yeah, that's fascinating. And that brings me to a question that I think people would be interested in hearing answered by a software engineer who is in this space and is evaluating how people write code in notebooks. What advice and tips would you have for a team? Let's say I have a team that has a couple of projects in production, 100% notebook code,
Fabian_Jakobs:
Mm-hmm
Ben_Wilson:
and it's kind of monolithic. One big notebook that maybe has 200 cells, and each cell is anywhere from five to 500 lines of effectively script. How do you see that conversion process, from "okay, let's take your script" to "let's write testable code"? What tips do you have for converting that to testable code, you know, applying DRY principles to everything?
Fabian_Jakobs:
Yeah, so I think it's important to highlight that you can actually get pretty far with notebooks on Databricks these days. Earlier last year we published an article called Notebook Best Practices, where we talked about that a little bit. I would say the first thing you can do is get the notebook into the Repos feature, so you get Git support and can version control your notebooks; you don't have to change any of the code. Put it in a repo and you have version control. The next thing is that the Repos feature has file support. So if you have a cell that just defines a function, you can take that out and put it into a Python file next to your notebook. And then the next thing is, once it's in a Python file next to your notebook, you can actually write a unit test for it and run the tests from a notebook. If you iteratively go through the code like that, looking for code that can be shared or extracted, maybe a 200-cell notebook becomes, I don't know, a 20-cell notebook. At some point the notebook is basically the entry point where you do all of the experimentation, and you also have a folder of Python files with the business logic, the functionality that you can iterate on. And if you then want to take the next step, you can take that repository, that code, put it in the IDE, and continue the work from the IDE. It's not like you have to stop what you're doing and do something completely different. There are many, many small steps to improve what you can do.
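For the "run the unit tests from a notebook" step, a common sketch (assuming pytest is installed on the cluster and a tests/ folder sits in the repo next to the notebook) looks like this:

```python
# Notebook cell in a Databricks repo: run the test suite in-process.
import sys
import pytest

sys.dont_write_bytecode = True  # repo files are read-mostly; skip .pyc files
retcode = pytest.main(["-v", "tests/"])
assert retcode == 0, "test suite failed"
```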
Ben_Wilson:
Yeah, I think that's an excellent bit of advice. When I used to be in the field interacting with customers, that was kind of where we would get called in. They'd be like, hey, we need an expert to come in and help us, and here's our code. You look at it and it's like, all right, there are 117,000 lines of script in here. And they're like, well, we can just make a production version of this. Like, yeah, we could. It's going to take six weeks to refactor all of this into what you want. You're asking for a fully object-oriented design with this. Let's not do that. Let's start small and simple.
Fabian_Jakobs:
Exactly, exactly, yeah.
Ben_Wilson:
And we'll iterate. It's sort of applying agile principles to a refactoring on a massive scale. So it's good to hear that the team is thinking about how to make that even simpler. That side-by-side view that you're talking about, it's like, hey, can I have something where I can execute a function and get that output, visualize my chart, see the print statement, effectively without having to write print statements in?
Fabian_Jakobs:
Soon.
Ben_Wilson:
So does the Connect tool, right now, support IDE debuggers, where we can actually stop at a particular position and the object values are returned?
Fabian_Jakobs:
That is the one thing we don't do yet. It's also the one thing we know is the most requested. If there's one thing customers might be a little bit disappointed about, it's that there's no step debugging in the IDE right now. That's a little bit of a function of how we approached the problem: since we basically synchronize the code to Databricks and then run it on the cluster, we would have to attach to the Python interpreter
Ben_Wilson:
Mm-hmm.
Fabian_Jakobs:
on the cluster, which is technically probably possible, but it's not something we've actually done, because then you get into all kinds of issues. But it is still something we have in mind, and it's something we're looking at with the Spark Connect project that was announced at the Data and AI Summit last year. Spark Connect is essentially a way to run Python code against a Databricks cluster from anywhere. It's basically a library that provides a DataFrame API locally and then connects to Databricks. With that, we will actually be able to do step debugging, and that's something we want to look into later this year as Spark Connect becomes a little bit more available. So it's definitely something we know a lot of people are excited about and waiting for, and it's something we want to do, but we need to take one step at a time.
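The shape of that idea, as a sketch: with a Connect-style session, the DataFrame API lives in your local process (where a local debugger could step) while execution happens on the remote cluster. The connection string format and placeholders below are illustrative, assuming a PySpark build with Spark Connect support (3.4+), not something stated in the episode.

```python
from pyspark.sql import SparkSession

# Placeholder connection string; host, token, and cluster ID are assumptions.
spark = (
    SparkSession.builder
    .remote("sc://<workspace-host>:443/;token=<pat>;x-databricks-cluster-id=<cluster-id>")
    .getOrCreate()
)

# The query plan is built locally, then shipped to the cluster to execute.
df = spark.range(100).selectExpr("id", "id * 2 AS doubled")
print(df.limit(5).toPandas())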
Ben_Wilson:
We're just actually starting the proof of concept phase for integrating with that system on the ML runtime
Fabian_Jakobs:
Mm-hmm.
Ben_Wilson:
side with Spark MLlib. I've seen the proof of concept from your team's side. It's awesome. I think customers are really going to like what it's going to enable them to do. But from a sort of backend, nerdy engineering side, it's just fascinating: leveraging protobuf to pass arguments across the wire, and you're like, hey, you could connect any API to this. It doesn't have to be specifically Python; you could write it in PowerShell if you wanted to. As long as you can submit that query, it can instantiate something. I usually hate using the term game changer, but that is going to be a paradigm shift, I think, for people using our platform.
Fabian_Jakobs:
I totally agree. I think Spark Connect is really a paradigm shift of how you can actually interact with Databricks and leverage Databricks in a lot of ways that are not possible today. Also credit where credit is due, this is not our team that is working on Spark Connect, but
Ben_Wilson:
No.
Fabian_Jakobs:
we are working very closely with them and they are amazing. They are really good.
Ben_Wilson:
It's your office, though. A lot of that stuff is all out of the Amsterdam office.
Fabian_Jakobs:
They're actually in Berlin, but I think it's close enough.
Ben_Wilson:
And then San Francisco. Yeah, the EMEA engineering teams.
Fabian_Jakobs:
Yeah, that's all... it's close enough. Well, I think one thing... yeah, go ahead.
Michael_Berk:
So I had one quick question, and just sort of to highlight the benefits of connecting an IDE to a Databricks cluster: for data workloads, specifically data engineering, it's often the case that your local machine is not powerful enough to handle production-scale data. So typically you'd work in a notebook environment, and if you don't, you often have to have a custom setup that connects to an EC2 instance or whatever it may be, which makes the dev loop really challenging. So specifically for data engineering, and even for ML, which in my opinion is usually more software-driven and Python-driven, it seems like a very robust and logical next step. But I also have a question, which is: why aren't we just building an IDE into the browser?
Fabian_Jakobs:
I love that question. I've actually done that in my previous life. It's not a small thing to do. I think today the option would be to take VS Code and run it in the browser, like with GitHub Codespaces. I would say the main reason was, one, we wanted to get it out as quickly as possible. We knew that our approach would work, and we knew this would help a lot of customers. Building a managed service, like a managed IDE in the browser, is at least one order of magnitude more work to build, takes a lot longer to build, and is very, very expensive to maintain. I'm not ruling out that we will do this at some point, but we wanted to be agile. We wanted to get this out, get it in the hands of customers, get their feedback. Maybe we do that, maybe we don't, but now we basically have a massive head start compared to doing it in the browser. And in fact, the extension does work in Codespaces.
Fabian_Jakobs:
So if you actually do want to use Codespaces with Databricks, that does work. You can create a Codespace, and you can even pre-configure it. For example, if you use Azure Databricks, you can have the Azure CLI installed in your Codespace. You can authenticate, and then you can develop in the IDE from your browser. In this case, the nice thing is that it's not us who have to host the IDE; it's GitHub and Microsoft.
Ben_Wilson:
I can confirm, if anybody's curious, how hard this is to set up. The last time I did it was about a week ago, and using that integration point, it took five or six minutes to set up.
Fabian_Jakobs:
I'd love to hear that. It's great to hear.
Ben_Wilson:
Super simple.
Fabian_Jakobs:
Yeah.
Michael_Berk:
Yeah. And can you talk a little bit more about your experience developing Cloud9? Because I'm not sure that I buy that it's complex. Can you sell it to me?
Fabian_Jakobs:
Yeah, Cloud9. I worked on Cloud9 for like 10 years. We started it as a startup in Amsterdam, and the idea was that we wanted to have an IDE developer experience in the browser. That was before VS Code was there, before Codespaces was there, so basically we had to do everything on our own. So building an IDE, I can assure you, is not a small thing. The problem is that the minimum bar of features you need to provide, in order to even be considered an alternative, is very, very high. The amount of features you need to provide is insane. So that is the front end. And if you want to do a hosted IDE, then you need a backend; you need a place where you run your code, and these days that's containers. That essentially means you're building a compute service running hundreds of thousands or even more containers. And we wanted to offer a free service, which means we also needed to do this economically and efficiently, right? You have to figure out how you can provide compute for a lot of users for very little money. So scheduling is important. Fraud is going to be a nightmare, because people figure out, oh, there's free compute, I can run a Bitcoin miner, and they will do that even though the compute is so small that they never get anywhere. So I would say on the front end, the complexity is just meeting the minimum bar that customers expect, and on the backend, it's a big distributed compute service that you need to manage. In the ideal world, you leverage as much as possible what others have built. For us right now, with VS Code and the VS Code extension for Databricks, we decided this is not a problem that we need to solve. It's not a problem that Databricks needs to solve. Databricks needs to solve how you can develop ML and data applications. This is not our core business at this point.
Ben_Wilson:
Yeah, that extends across, you know, most SaaS companies that are in a space, the ones that are successful. It's that, Hey, if there's something to use, use it. If there isn't something to use that you can directly use for free, make friends with who has that thing
Fabian_Jakobs:
Yes.
Ben_Wilson:
and integrate and help each other out. We do that on the ML side with tooling as well. It's like, hey, let's just make a partnership, because you're really good at this thing and we're really good at this thing; let's make both of our things better together. It's really that economy of scale. And we get requests like that on the ML side, saying, well, why doesn't Databricks do this thing? Like, well,
Fabian_Jakobs:
Yeah.
Ben_Wilson:
why would we build that when you could just use this thing that's available? We'll make an integration point and make that simpler, but we're not gonna rebuild XGBoost. We're not gonna rebuild core parts of SKLearn. It's already out there and it's great. Both those products are fantastic. So why would we waste engineering resources on that? And that's common across everything, really.
Fabian_Jakobs:
Absolutely. I also remember, from my startup times: if something's core to the business, you want to build it yourself. If it isn't, it's often a better idea to just buy it, or partner, or leverage something else, right? For Databricks, if it's core to the way we process data, like Delta, which is very core, we invented it and we iterate there, or Spark Connect, another example that is very core, we would never hand that away or consider leveraging something else. But everything around it, everything that's not the core business, we partner, we work in open source, we leverage existing tools. There are so many better ways to do things in those areas.
Ben_Wilson:
And that's sort of a theme and a message for the people listening right now: if you're working on data science problems or ML engineering problems at your company, think about what Fabian just said. Is it your core mission to build this thing? A lot of times your core mission is to solve a problem and make the company money or save the company money. So even though it may sound super cool to be like, hey, I need a tracking service, so I'm going to build my custom MLOps tooling stack: yeah, you can do that. It's probably going to take five or six years to get something that is useful, that can compete with some of the open source tooling that's out there for free. Or you could just download MLflow and get that set up.
Fabian_Jakobs:
Yeah.
Ben_Wilson:
If you're trying to do distributed computing and you're like, hey, we need a way to run on a fleet of VMs that we're going to spin up, and we're going to use really low-level coding and network protocols, and we're going to create our own security layer. It's like, okay, your team of 12 ML engineers or data scientists and a couple of software engineers is going to build Databricks from the ground up. The build versus buy becomes a pretty big ask when you're talking about what your project needs. So researching what's out there, trying stuff out, testing it, and making sure you find what works for you is definitely recommended.
Michael_Berk:
Yeah. And so, for full disclosure, everyone on this podcast right now works with Databricks or is generally pro-Databricks, but
Fabian_Jakobs:
Hahaha
Michael_Berk:
I also wanted to sort of open the floor to both Fabian and Ben. What are some alternatives, whether they're a sort of end-to-end ML development stack or even just for software engineering? If you were going to build what Fabian is building with this IDE connect, what would that look like?
Ben_Wilson:
If I was going to build that from scratch, I would talk to management and say, should we be doing this? I don't think this is a wise use of our time. If you're asking that hypothetically, like, imagine I'm on a data science team or an ML engineering team at a company, I would be raising that question. And if I was told adamantly, no, we're not going to use anything open source, we're not buying anything, you guys figure it out: I'd be getting my resume up to date and looking for another job. Like, that day, that afternoon. That's how scary that is to me. Because think about how big the team needs to be for all of the implementation details, to make something that will actually function correctly. You're not just talking about: I need the features of an IDE, I need code completion, I need to detect all of these linting errors, I need code formatting, I need to be able to execute tests, I need to be able to run a debugger that can step through code and tell me, in a visually appealing way, what the values of different variables are, and object states. There's a lot more that goes into an IDE, like package management. How do I do updates? How do I integrate with Git revision control? That's all the user experience, the user-facing stuff. For anything you see out there that a company is building and providing as a service, imagine that 10% of the code associated with it is what a user sees. The other 90% is backend: network security, authentication, data storage layers, compute execution. All that stuff takes a lot of time, a lot of energy, and a lot of planning to get right. So I would be scared and running for the hills if I was told to do that.
Fabian_Jakobs:
Yeah, yeah.
Michael_Berk:
Okay, so let's say you're not going to be building it. Sorry. Yeah, Fabian, go ahead.
Fabian_Jakobs:
No, go ahead, go ahead. I wouldn't want to build it either.
Michael_Berk:
Nah, I'm curious about your opinion. Well, I mean, you're doing it right now, so what's going on? You're not running for the hills.
Fabian_Jakobs:
No, we're doing something very different. What we are doing is enabling people who use IDEs today to use those existing IDEs with Databricks. We're providing the glue; we're making the experience smooth and easy. And I think that's going to be our mission. We're only starting now. Right now you can sync the code and run it on the cluster, but there are so many more things we can do in the IDE to work with Databricks and with ML projects. I think what Ben meant is that we should not be building IDEs. Building IDEs is very expensive and a very low return on investment, unless you build something radically new. And if it's radically new, it's probably also extremely risky, because it might not actually be worth it. I wouldn't want to build an IDE. I do think hosted IDEs, or web-based IDEs, are very appealing, and as I mentioned before, we might actually go there at some point, but I think we would leverage partners again. Maybe we actually do partner with Microsoft and Codespaces and have an integration that offers Codespaces with Databricks, or maybe we do something similar. But for now we're focusing on how we make existing IDEs work better with Databricks. There's a lot we can still do. And we mentioned Spark Connect, but then we can also consider...
Ben_Wilson:
Yeah, and that really gets into...
Fabian_Jakobs:
Latency sucks, right?
Ben_Wilson:
Sorry, the delay. No, you were saying.
Fabian_Jakobs:
Yeah, so there's a lot of things we can do. The next thing we're looking into: now we provide the mechanics of how you can run code on Databricks, but we're also looking into the policies. How do we structure a project? What is a best practice for structuring a project? Where do you put notebooks? Where do you put code? Where do you put tests? What does this project model look like? And if we have a project model, can we provide an even better experience, because we have more information to work with? That's something we're looking into. And then also, can we bring back some of the interactivity that the notebooks and the web app have, in the IDE as well, so you don't have to do this context switching between IDE and web app in order to do exploratory work? So there's a lot of stuff we can still do and still want to do, and I think that's what I would aim for. And stuff that nobody else is doing. I think that's important stuff.
Ben_Wilson:
Yeah, you really hit it on the head with the sort of theoretical approach that you're using, and that we all use when working on projects at a SaaS company. When you're thinking about, oh, what if we built an IDE in the cloud that was proprietary to us, it's not strictly just an engineering question of, you know, do we have enough people, do we have enough smart people? The answer to number one at our company is maybe; the answer to number two is definitely. But is it worthwhile to take on an established company? Would we win that battle for the mind share of people? Like, hey, we're going to compete with Visual Studio and PyCharm with a Databricks IDE that's proprietary? It's so risky, and not just from an implementation standpoint of would we be able to do it. Of course we could build it. Would people use it? Do people want to use it? All these IDEs that are highly used: think about how different it feels to develop in Visual Studio versus PyCharm versus IntelliJ versus Eclipse. They all feel different. They all behave differently. And they're all doing the same thing when you think about it: they're changing text representations of executable code that are saved to disk. But the user experience is so different, and fighting that battle is very much an uphill one. So the approach that you're talking about, hey, let's make sure we can connect with all these things and make the experience that people are already using as good as possible: I don't think you're ever going to be out of feature requests with the work you're doing. It's just going to be more and more awesome, cool stuff to do.
Fabian_Jakobs:
I don't think so either. And I think there's another question. Do we have the people? Do we have the quality of people? But also, is it the best use of these people? We are a data company. Wouldn't it be much more useful for us to invest deeply in making Databricks more efficient, making Databricks easier to use? You don't want a distraction that needs a lot of people to work on it. But actually, one thing I wanted to circle back on: you talked about this notebook with 200 cells, and that resonated, in the sense that we always talk about scale as how much data we can process, right? We have petabytes of data and we analyze and process them. That's how we think in terms of scale. As a software engineer, there are other dimensions I like to think of. You have this 200-cell notebook with a couple of thousand lines of code. How do you scale that? How do you get projects with a lot of code, and how can you develop those efficiently? How can you manage big projects efficiently? You need tools for that, and I see our job as enabling scaling in that dimension. Or you start a project, you start a notebook, you do some experiments, and then you think, oh, I'm onto something, let's bring in the team. How can you scale that? How can you scale having multiple people working on the same code? In a way, that's a solved problem in software engineering, but not necessarily in data science, right? Or in machine learning. People are rediscovering that there need to be mechanisms and tools to scale the code base and to scale the team. In a way, that's what we enable on our team, but I also see the notebooks team thinking about it, with features like Repos and files in repos. And that, I feel, is super important: to think about scale in very different dimensions than we typically do.
Ben_Wilson:
Yeah, definitely. And the stuff your team is focusing on, these features you're building: in my opinion, which is biased, for any serious ML project that should be your eventual end goal from the code development and architecture perspective. If you're wedded to notebooks and you want to see, every time I run this thing, these charts that get generated, different sorts of plots: that's great, that's what notebooks are for. But when you're talking about exactly what you just said, Fabian, how do you get multiple people to collaborate, how do you maintain this thing when it's a big project? Any of us who have ever worked on data science projects that started off as sort of a hackathon idea. Like, hey, it'd be really cool if we did this thing, I think it's going to make us potentially millions of dollars. Everybody's like, whatever, Ben, shut up, it's never going to work. Then you do the proof of concept, build it on a weekend, and demo it, and somebody sees it and they're like, we're doing this. Hang on a second, this was just a prototype. This code sucks, it's so fragile, I need to rewrite it all and make it proper. They're like, yeah, sure, take five people, build it out. Making that transition and doing a project properly from the start, it's natural to go into the software engineering dev loop of saying, right, we're all using IDEs, we're using Git, we're not messing around with everybody making a copy of a notebook and then trying to glue it all together at the end. That just doesn't work. But say you're in that position where you have a notebook in prod, because that was just what you knew, and there's just that one copy. That's it. It might not even be checked into version control; it's just saved to object store, so it's there. If somebody comes in and says, we need to do this project properly, or this is failing every three times we run this job and we need to figure out why it's broken, we need robust testing, we're going to send a software engineer to work with your team to teach you how to do this properly: moving to that IDE methodology with this Connect tool is a much easier path for a team to go down. Rather than, hey, you have to learn about mock in Python. Like, you can use
Fabian_Jakobs:
Right.
Ben_Wilson:
pytest and unittest, and mock all these services because you can't run them locally. That's a lot for a data scientist to pick up and really understand. So these Connect tools really are the future, I think.
Fabian_Jakobs:
And I also want to emphasize, there's nothing wrong with...
Michael_Berk:
Yeah, and speaking of tools...
Fabian_Jakobs:
I just want to emphasize there's nothing wrong with starting with a notebook, doing that experimentation, and getting the first version out. That is actually very, very effective, and it's a great way to work. I think what you're pointing out is that there might be a point where something that was an experiment turns into something you would consider production. And at that point, you might want to think about applying some of the software engineering best practices. As we mentioned earlier, this can be gradual. It doesn't have to be, oh, now the software engineers have to take over. No, I don't want the software engineers to take over, but maybe you can learn a little bit from the software engineers.
Michael_Berk:
Yeah, so that's a great segue actually, because I would love to learn from some software engineers. Could you guys describe or list three software engineering tools or practices that you employ daily: the name of the practice and what it is?
Fabian_Jakobs:
Oh, okay, I'll take that. First one, I would say version control. Everything you do, you put in version control, so you have a full history of the changes you made, and it also allows you to work as a team on a code base. Today, this is Git. There have been contenders, but Git is the one that has won. You can use it in the IDE, and you can also use it in Databricks with the Repos feature. So that would be number one. Number two, I would say, is modularization. You break up your code into modular, self-contained functions or pieces of code. And then I would say unit testing, or testing in general. Especially once you modularize your code and you have small chunks of code, you can often execute those in isolation, and then you can write a test that executes the code and checks that it's doing what it's supposed to be doing. And then ideally, the next step is that you run this regularly. For example, if you're using GitHub, you can run all the tests every time you make a change and see if they're passing. This goes into continuous integration, which basically means that whenever you make a change, you run a suite of tests and see if the changes do what they're supposed to do. So we go from version control, to modularization, to testing, to continuous integration. And then the next step could be that if you have a branch in your Git repository that you want to push to production, you can automate that too: if you know the tests are passing, you can deploy automatically each time you push a change to that branch. That's called continuous deployment, so continuous integration, continuous deployment, CI/CD. And that gets you from "I have ad hoc code" to "I have an organized process to write the code, an organized process to test the code, and an automated way to get the tested code safely into production." I would say these are the core things. And for all of them, I'd say it's in the charter of our team to make them easier on Databricks. We started with the IDE, but there's a lot of stuff to do in the other areas too.
Michael_Berk:
Nice. And Ben, before we get into yours, I just wanted to ask you a question about testing specifically for ML. Because as I've gotten more experience in the field, I think that testing is one of the things that data scientists struggle with the most. It's a very software driven approach and it's sort of hard to test data because the data can change and there's a flow of data. It's not sort of necessarily an isolated unit test. So how do you think about testing machine learning code?
Ben_Wilson:
I don't know if I 100% agree with your statement; I'm going to be contentious. You can test anything. If you're doing something that manipulates the state of the execution of a program, you can test it. If it's a no-op and nothing is effectively changing in that data, and you're not making any decisions, then maybe you can't test it, but maybe you shouldn't have written it; it probably shouldn't exist in your code base. And sometimes there is superfluous stuff like that in data science pipelines and code bases. But if you're doing something like adding a column, which is a common task in ML, or making a calculation, you're saying: I want the values from column A and column D to generate a decision that's going to manipulate the data in column F to create column G. That stage of processing, the decision logic based on the data itself, is still logic you're writing. You're instructing the computer: hey, if you see this in this column and this in this other column, then do this thing based on the value of this other column. And that could be a very complex chain that, written as a single function, is very difficult to test, and you probably shouldn't do that, because it makes the tests that validate all of it working correctly together much more complex. So you can break it apart and say: well, I have a decision to make on my first column, and I have a decision to make on my second column, and there's logic involving both. I can make those two independent functions, and then I have another function that calls those two to make the final decision. That's now three functions I've created to do this one operation, but I can mock up data that meets those requirements. Can I test what happens if the expected values of this decision are present in a data set? What happens if they're not present? What happens if one is present and the other is not? And then you can also test things like: what happens if the data is missing? What happens if the data is malformed? Is there an exception being thrown that I can understand, so that if this blows up in production, I know exactly which part of the code messed up? You want to write the code in a way that makes it easier to maintain over time in prod. That's really how I see the goal of testing in ML: you're putting guardrails on the quality of what's coming out, hopefully. Some of that is, frankly, non-deterministic, based on the models that are trained. But you should have a general estimate of any manipulations you're doing to that data, to know you're not creating a bug or a flaw in that model such that it impacts the business and your project. That's the unit testing side. The integration testing side is more on the data side of things: my model and my pipeline of feature engineering, everything soup to nuts, up to the point where this model is producing predictions. That's an integration test, and it should be done as a model evaluation stage: hey, I need to know what's going to happen if I pass this vector of data in. Is it going to give me a prediction or some transformation that I'm expecting? And I should have some range for that,
saying, hey, we've used the hot dog sales in New York City example on our podcast many times before, or hot dog sales in different cities. You should have an integration test guard on the hot dog sales model, so that if it predicts that next week we basically need to process every pig alive on the planet into hot dogs, because we're going to have demand for 34 billion hot dogs in sales, a test catches that on model retraining and that output never ships. That's where I do see a lot of people build testing for ML. But I don't see too many teams, without outside guidance from software engineers, thinking: hey, you should look at how your code is written and all of those decisions you're making within feature engineering, and make sure all the logic you're putting in there produces what you expect. Because as the complexity grows, it's a snowball effect. One little bug in that code, if you're not testing each of those logical points, can spiral out of control to the point where it's so challenging to debug where it came from that you could waste an entire sprint's worth of work just tracking down what went wrong. Sorry, that was long-winded, but those are my thoughts on ML testing.
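A sketch of the column-decision testing Ben walks through: three small functions instead of one, so each decision can be tested in isolation, including the missing-data case. The column names and thresholds are made up for illustration.

```python
import pandas as pd

def high_value(col_a: pd.Series) -> pd.Series:
    return col_a > 100  # NaN compares False, so missing rows drop out

def recent(col_d: pd.Series) -> pd.Series:
    return col_d >= "2023-01-01"

def derive_col_g(df: pd.DataFrame) -> pd.Series:
    """The combined decision: keep col_f where both checks pass, else 0."""
    keep = high_value(df["col_a"]) & recent(df["col_d"])
    return df["col_f"].where(keep, other=0)

def test_derive_col_g_handles_missing_values():
    df = pd.DataFrame({
        "col_a": [150, None],
        "col_d": ["2023-06-01", "2022-01-01"],
        "col_f": [7, 9],
    })
    assert derive_col_g(df).tolist() == [7, 0]
```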
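And a sketch of the integration-side guard from the hot dog example: a range check that fails a retraining pipeline before an absurd forecast reaches production. The bounds are obviously invented.

```python
def validate_forecast(predictions, lower=0.0, upper=500_000.0):
    """Reject a retrained model whose weekly forecasts leave a sane range."""
    bad = [p for p in predictions if not lower <= p <= upper]
    if bad:
        raise ValueError(f"{len(bad)} predictions outside [{lower}, {upper}]: {bad[:5]}")

validate_forecast([12_000.0, 15_500.0, 9_800.0])   # passes silently
# validate_forecast([34_000_000_000.0])            # would raise, blocking deploy
```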
Michael_Berk:
Cool. And then: do you have one, two, or three software things you can't live without, whether they're tools, strategies, you name it?
Ben_Wilson:
I mean, Fabian nailed the three that I was actually going to say; we're a hundred percent aligned on that. I just have one additional one, which is so obvious to a software engineer but not obvious to people who aren't in pure software development: it's peer review. It's something that not a lot of data science teams do. And that's peer review on two fronts. Actually, three fronts. The first is when you're developing a feature branch: other people should be looking at it, for the implementation on the ML side of things, but also for code quality. The second is planning. When you're getting ready to improve a model, or you're just starting a project to begin with, that should not be a one-person show. We don't do that when we're building features; we create product review documents and submit those, and everybody gives feedback that is always constructive, and it helps make things better, because more intelligent people are looking at it and commenting or just asking questions. And the third part is, it's not really peer review, it's more like customer review: listening to and asking the people who are using what you're producing, and getting their opinion, their candid opinion. Tell them, hey, you're not going to hurt our feelings. The only way you can hurt our feelings is if you lie to us. So if you hate it, tell us you hate it, and tell us why you hate it. If you love it, tell us why you love it and why you want more of it, so that we can build something together that's better for you. So that's the only thing I'd add to what Fabian said. I think he would have said it if he'd had four as an option.
Fabian_Jakobs:
Didn't come to my mind, but it's great. This is exactly... I have nothing to add to that.
Michael_Berk:
Amazing. Well, I will do a quick summary and then we can wrap. So today we talked about ML code development, and typically there are two environments you can work in: notebooks and IDEs. Notebooks are typically better for exploratory data analysis, feature engineering, and sort of real-time results, where you're not quite sure what you're building. IDEs, for when you're ready to productionize, often have richer feature sets, and they support unit testing and other things that make productionizing code a lot easier. So the workflow that at least I've seen is that ML engineers start off in the notebook phase for EDA and then move to IDEs for productionizing. But Fabian here has been working on a tool that allows IDEs to connect to a Databricks cluster. So if you have a really large data set, or you want to leverage Databricks features, that's a great option if you don't want to be working in notebooks. And then finally, as someone who's seen a lot of ML code: ML engineers typically don't have the best software engineering practices, especially if you come from a stats or modeling background. Some things that can make your code more robust are version control, for example through Git; modularization, or breaking up your code into core components; testing that code, where testing typically takes two forms, unit tests for testing each component and integration tests for making sure your predictions are valid; and finally, peer review. So Fabian, where can people learn more about you and your work?
Fabian_Jakobs:
We will have an announcement very soon; maybe by the time this podcast is out, we'll have already announced it. If you go to the VS Code Marketplace and search for Databricks, there will be an extension, and from that extension we will guide you to the documentation page and show you how to connect it. So I would say: go to the VS Code Marketplace, search for Databricks, and by the time this is out, we should actually have something in the marketplace.
Michael_Berk:
Amazing. Well, thank you so much for joining, and until next time, it's been Michael Berk and my co-host...
Ben_Wilson:
Ben Wilson.
Michael_Berk:
Have a good day, everyone.
Ben_Wilson:
Take it easy.
Fabian_Jakobs:
Thank you very much.