Ben_Wilson:
Hey, everybody. Welcome back to Adventures in Machine Learning. I'm your host, Ben Wilson. And today we have a very special guest, not somebody that we typically have on the show: an actual practitioner. His name is Mike. He's a principal ML engineer at PostClick, which is basically an ad hosting company that handles bidding, I believe, in the management of ad placements. He wrote a blog post recently that we found very interesting. It presents the idea that there's a struggle between the methodology data scientists use to create their work, namely notebooks, and what people consider to be production-grade ML applications that can be deployed and tested, and how to bridge that gap. We're going to talk about a lot of that stuff today, and we're going to talk about an open source project that's out there called LineaPy, which aims to bridge this gap. And we'll just have a fun discussion about all this stuff. It's been near and dear to my heart for years, this exact struggle and how we can collectively, as a community, work towards building tools that make it a little bit easier. So Mike, please introduce yourself and tell everybody what you do.
Mike_Arov:
Thank you, Ben. It's a pleasure being here. My name is Mike Arov. I've worked as a machine learning engineer for a good part of the last decade, in companies of all sizes, from large enterprises like Intuit and Verizon to three-person startups. Throughout my career, my particular emphasis and passion has been avoiding repeating yourself: bringing repeated patterns and a disciplined engineering approach to data science and ML engineering. That's what's currently popularized under the name MLOps, but I was introducing pieces of it to different companies well before the MLOps buzzword was coined. And over the years, I found there is a big gap in MLOps methodology which does not exist in the more general DevOps world. There has been a lot of development, standardization, and tooling on the development side, in the notebook, like Ben mentioned. And a lot of work has been done to automate CI/CD deployment, manage versions of code and models, and test and monitor models in production. So on the deployment and production engineering side, a lot of operational and infrastructure-as-code work has been done. But not much has been done to bridge the two. That's why I wrote my blog post, and this is what I'm
Ben_Wilson:
Mm-hmm.
Mike_Arov:
chiefly going to be talking about today: how we need to bridge this gap, and what processes and ideas have been developed and introduced recently to help with that. I want to emphasize that, in general, software engineering DevOps methodology pays a lot of attention to moving development code to production: the development environment and the production environment, the process of writing code, committing it, and having infrastructure code automatically run tests and deploy it. This is a big part of the GitOps and DevOps methodology, and it's currently missing from MLOps.
Ben_Wilson:
Yeah, I couldn't agree more. It's something that any of us who have held the job title you currently hold understand. If you start off in the data science world, as we were talking about before we started recording, notebooks are awesome when you're doing data science work, when you're doing analytics work. You mentioned, hey, you can visualize data frames. You can run an operation or write a simple function that manipulates a data frame and see how it's changed it, how it's mutated that data. Like, hey, I need to pivot this table structure in order to get additional data that I can then join to this other data set. I need to see it. I need to do that join and then see the table with my own eyes, not write some test suite code that validates that the structure is what I want it to be, because the notebook gives you the ability to just use your eyes and see, right in front of you, okay, that's what I want. Now I'm going to build a visualization, and I want to see that visualization of the data in a chart or a graph. You can't really do that in an IDE, and IDEs aren't really designed for that anyway. As you said, IDEs are designed for traditional software engineering: deploying code and modularizing things. So in your experience, what other things do you see people using notebooks for, and why do you think they're so useful?
Mike_Arov:
I want to say that notebooks actually appeared way before Jupyter. They've been used in research, in mathematics and science, ever since computers started being used in mathematics and science. In industry, they're a relatively new development, and by new I mean about a decade old. They came into use as the profession of data scientist, as the term 'data scientist,' became widespread. And this is important, because in ML development it's critical to look at data and find different ways to slice it to determine the features which are the input to a model. It's important to try different models and see which accuracy metrics you get. And finally, once you select your model, you need to run optimization, commonly called hyperparameter optimization: you try different parameters for the model and see what gives you the best performance. So there is a lot of the science element there. As data science has become popular, there's a lot of experimentation that you need to do before you actually come up with something that is good enough; then you put a stop to it, and then you're ready to put it into production, bring it to customers, and make it useful.
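For readers who want to see what that experimentation loop looks like in code, here is a minimal sketch using scikit-learn's GridSearchCV; the dataset, model, and parameter grid are illustrative choices, not ones discussed in the episode.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Illustrative dataset; in practice this would be the team's own feature set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try a small grid of hyperparameters and keep the best-scoring candidate.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [4, 8, None]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```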
Ben_Wilson:
So if we're talking about bringing something to production, as an ML engineer, say you're looking at a prototype notebook that a group of data scientists has worked on, and you see, all right, there are 30 visualizations in here, there are hundreds of print statements to standard out. What do you look for in something like that when it's handed to you, to figure out what needs to make it into the production code base and what can be left aside?
Mike_Arov:
So a lot of it is usually very straightforward and repetitive, something that you do every time. You mentioned visualizations. They're critical for determining which data make good features for the model and which models give good output. But they are almost never useful in the production code, because when it's running in the back end and creating
Ben_Wilson:
Mm-hmm.
Mike_Arov:
those plots, they're useless, right? They're not going to be visible to anybody; there's no reason to run them. Print statements need to be converted into whatever cloud logging the company uses as its integrated system. While you're looking at the notebook, it's probably okay to just use print statements; when you're running it in production, you probably want to send that to Splunk or Datadog or whatever other logging system the data stack uses. So you need to replace your print statements with logging statements, and you need to remove a lot of visualizations. In addition, there are a lot of cells in the notebook which turned out to be failed experiments, or just
something that people ran to get to the final answer. They don't affect the production code at all, so you need to identify that this is a piece of code that doesn't produce anything useful, doesn't produce any useful artifacts. You ignore those and only use the ones that produce the artifacts, the pieces of code and data, needed to get to the final answer. And then the next step is a bit more of an art, a computer science art rather; so far it's all very mechanical and very standard, a monkey could do it. The next step is, once you get to the actual code pieces and artifacts, you need to organize them into good modular code: classes, functions, services. Finally, you need to add a lot of boilerplate system code to integrate it into wherever you're hosting it, which is typically either a service, something like Flask or an OpenAPI app or some other kind of model-serving code, or a model server like MLflow or some other ML platform like SageMaker where you serve your code, so you need to add the additional APIs and code for that particular model-serving framework. Or, if it runs as a batch job, you need to add code for the batch processor. It could be Apache Airflow, or something more specialized for ML, like, again, MLflow, or it could be AWS Batch: something that runs your code on a schedule or runs it as a service. Those systems require additional setup, settings, and basically boilerplate code. Once you add that, you commit it into your code repository, CI/CD picks it up and deploys it, and then you're done, in principle. And then you go back, take the next notebook, and do it again.
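To make the mechanical part of that concrete, here is a small, hypothetical example of a notebook cell reworked for production: the exploratory plot is dropped and the print statement becomes a log line that a Splunk or Datadog agent could pick up. The pipeline, function, and column names are invented for the example.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("churn_pipeline")


def join_features(orders: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    # In the notebook, this cell ended with print(len(df)) and an exploratory
    # df.plot(); in production the plot is dropped and the print becomes a log line.
    df = orders.merge(customers, on="customer_id", how="inner")
    logger.info("rows after join: %d", len(df))
    return df


orders = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 20.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["NA", "EU"]})
features = join_features(orders, customers)
```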
Ben_Wilson:
Yeah, I mean, speaking from personal experience on this front, when handed a notebook in the past, I've seen some that are relatively trivial to convert. The person who wrote the notebook almost approached it in an object-oriented design paradigm. They know that eventually the code is going to be refactored, so this will probably be a method in a class, or there's going to be some object encapsulating this functionality. So they write a function that is fairly well designed and has a single purpose; it's not one of the meta-functions that you sometimes see. And when you get one of those, you're like, all right, the refactoring isn't going to be that bad. They're already reusing this function 17 times throughout the code base and making my life easier. But then sometimes you get those notebooks where you look at a single cell and it's a script, you know, like, hang on a second, there are 137 operations happening within this cell, and this one six-line block of code is copied 50 times throughout the notebook, just slightly modified each time. How do you go about the process, you said it was an art, which I 100% agree with, the art of refactoring something like that? How painful is that for you when you get something like that?
Mike_Arov:
Well, the pain was historically less of a concern for me than the routine. When you get something that requires a lot of work, well, that's a fact of life, right? Part of it is a learning experience for me and my data science partners: write modular code, don't write a hundred-cell notebook that looks like messy spaghetti code. Part of it is education, and there are a lot of people in the ML engineering world who believe the answer is simply to write better notebook code. But I want to point to the repetitive and routine parts, which are not solved by education: there are pieces that you need to do every time, and you cannot avoid them by writing a better notebook. Sure, you can write well-formatted code, you can add a lot of documentation, and notebooks make good self-documenting code; that's one of their advantages. But you can't avoid the fact that you have N disjointed pieces of code which need to be taken and put into a structure of classes and modules. No matter how well your original code is designed, you still need to copy and paste it into place. You're still going to have blocks of code and cells which were just part of exploration and experimentation and don't need to go to production. All of these routine, repetitive things you need to do over and over and over again. And after I'd done it for a hundred notebooks, I naturally started thinking that I had to find a way to automate it, so I started building automation for part of it. Not to mention the boilerplate part. Because at the end you have Python code that you can just execute, my converted_notebook.py, right? That's not the end of the story. You need to take those Python modules and insert them into something that will run them on a schedule in a production-grade system. Airflow is a good example of that for batch systems; it's popular and used in many smaller companies. And then there are more specialized ML platforms, Kubeflow comes to mind, and others, but they all require additional libraries and code and systems that you need to put in place. It's all pretty standard stuff. To create Airflow DAGs, you need to import the Airflow DAG class and write a bunch of standard settings for the DAG, and that has to be done every single time you productionize an analytics or ML project from a notebook. So the repetitive and routine parts were getting to me far more than the complexity. And that's, incidentally, what is easier to automate and hand over to computer systems. As for the art, well, recently there's been headway there as well, but generally we can't replace a good software architect or computer scientist with a machine yet. But standard, repetitive manipulations, copy this code, add this standard piece of boilerplate, package it together, and run the command to deploy, all of that is something we could let computer scripts and automation do. This is something I was increasingly aware of, and I was looking for a way to solve it. And without finding anything on the market to help me with that, I started building something of my own. I called it Notebook Airflow DAG Factory. A mouthful, and very, very
Ben_Wilson:
Pfft.
Mike_Arov:
much how an engineer would name a product.
Ben_Wilson:
Yes.
Mike_Arov:
But
Ben_Wilson:
Can confirm.
Mike_Arov:
it was, in my mind, a good description of what I was doing technically: taking a notebook, converting it to Python code, and then using a DAG factory to create an Airflow DAG out of it. So you take the notebook, delete a lot of stuff that's useless from a production point of view, like the generic plots, convert the print statements into logging statements, all the standard stuff, add the Airflow DAG imports and so on, and produce something that Airflow can run on a schedule, on a daily, weekly, or hourly basis, and you have a proper production system. So about two years ago I had this idea and started hacking away at building this system. This is what eventually turned into the LineaPy open source project, which we're going to talk about later.
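For context, this is roughly the shape of the standard Airflow boilerplate being described: a minimal DAG file wrapping functions sliced out of a notebook. It is a sketch, not LineaPy's actual output, and the DAG id and task functions are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_features():
    # Placeholder for the feature-engineering code sliced out of the notebook.
    print("building features")


def train_model():
    # Placeholder for the model-training code sliced out of the notebook.
    print("training model")


with DAG(
    dag_id="notebook_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    features_task = PythonOperator(task_id="build_features", python_callable=build_features)
    train_task = PythonOperator(task_id="train_model", python_callable=train_model)
    features_task >> train_task
```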
Ben_Wilson:
So you're taking a notebook, looking at cell-based boundaries when you import the raw text, encapsulating each of those cells as a function, and then decorating that function so that it can be accessed by Airflow?
Mike_Arov:
So, it's close. More precisely, I was basically converting the notebook entirely into Python. There's a facility in Jupyter called nbconvert that can do that, but it produces what I would best describe as a one-to-one translation. It preserves all those unnecessary print statements and the going-nowhere plot code, and it even converts the cell numbers, cell one, two, three, four, into comments saying 'this is cell three,' which obviously makes no sense in Python code. So it does a very, very literal translation, if you will, and it produces one flat Python file. There are a lot of limitations, and you cannot really run it in production right away. But that was the starting point. I added additional overloaded preprocessors inside, and we split it into pieces with defined functions and dropped the ones that didn't have defined functions; those would be the pieces where we just plotted something that didn't go anywhere, just an intermediate step for a person to look at. And then, finally, it added the Airflow DAG components automatically from a template, a Jinja template. That's why it was a kind of converter to an Airflow DAG. I used Airflow because at the time that's what the company used for all of its scheduling, but I could have used any sort of production deployment system. So that's what I built, working on the side for a few months, mostly as a way to automate my own drudgery, to avoid doing the same thing for every new notebook that came my way. And as it was a project done on the side by one person out of necessity, it was very hacky, very specialized code, not ready to be shared and distributed to other people. Around that time I met a couple of people from Berkeley who had the same idea and had built similar tooling, and we came up with what would become the LineaPy project, an attempt to build this for general use.
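A rough sketch of those mechanics, exporting the notebook to a flat script with nbconvert and rendering scheduler boilerplate from a Jinja template; the file names and the template body are simplified placeholders rather than the real DAG factory.

```python
import nbformat
from jinja2 import Template
from nbconvert import PythonExporter

# Step 1: literal notebook-to-script translation (what nbconvert gives you out of the box).
nb = nbformat.read("exploration.ipynb", as_version=4)
script, _ = PythonExporter().from_notebook_node(nb)

# Step 2: wrap the cleaned-up module in scheduler boilerplate rendered from a template.
DAG_TEMPLATE = Template(
    """\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="{{ dag_id }}", start_date=datetime(2022, 1, 1),
         schedule_interval="{{ schedule }}", catchup=False) as dag:
    run = BashOperator(task_id="run_pipeline", bash_command="python {{ script_path }}")
"""
)
dag_source = DAG_TEMPLATE.render(
    dag_id="exploration_pipeline", schedule="@daily", script_path="exploration.py"
)
print(dag_source)
```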
Ben_Wilson:
Nice. So with this approach, you could technically just write a template factory for any execution engine. You could say, hey, I want to run this with Bazel as my orchestrator, if you have infrastructure like that. And it would just say, all right, I need to assemble these code artifacts into a logical instruction set: I'm going to first load my data, and then after feature engineering I'm going to save my data to location A on an object store somewhere in the cloud. Then my next operation is going to be, you know, maybe cells seven through 37, and cells 42, 43, and 44 are the next things I need to do, so that's the next DAG task that executes, and then I save the results of that off. That also gets data scientists thinking about reproducibility in a production code base, where if you're storing artifact stages all along the way in production, you could go back and say, hey, we have a scheduled job that's retraining once every two weeks and we think there was an issue six weeks ago. If you're storing the artifact stages, the outputs of each of these DAG steps, you could go directly back and investigate, right?
Mike_Arov:
Yes. Both of us have mentioned the word 'artifact' many times, and this is really the core of the approach we're taking here. And it's really not unique thinking from me or Doris, the founder of LineaPy. This is something practiced by all ML development teams in order to bridge this gap in practice without having proper tooling. What people have been doing, organizationally, is asking data scientists to provide artifacts for the software engineering teams to include in their production systems, whether they use some DAG scheduler like Airflow, or some cloud application platform, maybe Databricks, anything that runs analytics or their software stack. So it was usually, 'give us the artifact and we'll insert it into our production system.' That's the usual mantra. This is the logical continuation of that: okay, so how do we get artifacts out of the notebook? How do we define artifacts in the notebook? And how do we extract them and package them to be used in production? This is where a lot of repetitive work is done by humans, typically ML engineers, but also data scientists themselves once they have enough engineering knowledge. Regardless of who does it, it's a lot of human work that can be automated by basically identifying artifacts in the notebook and saving them. Now, what is an artifact? Artifacts usually come in three categories. A serialized model, that's an artifact: a pickle file, a PyTorch file, a TensorFlow file, basically the trained model in some binary serialized form. Then there's the data, the features that are the input to the model. Those are also artifacts, whether they're actual database objects or the SQL code to produce them. Finally, there is the code needed to glue all of this together. That's an artifact too. And when you get all three categories, model, features, and code, together, you effectively get productionalized code
Ben_Wilson:
with lineage. Now you know exactly what the state of everything was to produce that exact instance of execution.
Mike_Arov:
Yes, you can do versioning of these artifacts, all of them: code versioning, data versioning, and then you know exactly what produced what. You can swap them, you can experiment, you can do A/B testing on them; there are a lot of things you can do with the artifacts you've generated. But essentially, producing these three classes of artifacts is what we want to automate, and that is what this new breed of tooling helps with.
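As a concrete, hypothetical illustration of those three artifact categories, a small helper that persists the model, the features, and the run configuration side by side:

```python
import json
from pathlib import Path

import joblib


def save_artifacts(model, features_df, params: dict, out_dir: str = "artifacts") -> None:
    """Persist the three artifact categories: serialized model, input features,
    and the configuration/metadata needed to glue them back together."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, out / "model.joblib")                        # trained model
    features_df.to_parquet(out / "features.parquet")                # input features
    (out / "params.json").write_text(json.dumps(params, indent=2))  # run configuration
```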
Ben_Wilson:
Yeah. And one thing you mentioned really resonated with me with regard to this concept of artifacts: you said data scientists who gain enough engineering skills might be able to do some of this stuff. And I thought back to something I was working on a couple of months ago, on the pure engineering side: serialization format changes when Python is upgraded, if you're using pickle. You save something in Python 3.7, the production system is later upgraded to Python 3.9, the pickle protocols have changed, and you can no longer load that serialized artifact through pickle or even cloudpickle. It just blows up; it's like, I don't understand what this serialization format is. So there are nuances that a software engineer understands: okay, I know this is going to have to migrate to a different ecosystem, there are going to be non-backwards-compatible breaking changes sometimes, and we need to keep this around for potentially years. So I now have to write my own serialization and deserialization protocol, which means taking a Python object or a Java object and writing that serializer: doing a dir() or vars() on the object, getting all of the attributes associated with it, and then base64-encoding all of that so it can be safely stored as a byte array. Do you think that sort of activity is just a little too far to expect from most data scientists, who specialize in other areas? Should ML engineers and software engineers ever expect a data scientist to know how to do that?
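A minimal sketch of the hand-rolled serializer Ben describes: dump the object's attributes to JSON and base64-encode the payload so the artifact does not depend on a particular pickle protocol. It only works for objects whose attributes are themselves JSON-serializable, and the class here is a made-up stand-in.

```python
import base64
import json


class ChurnModelConfig:
    """Toy stand-in for an object we want to keep readable across Python upgrades."""

    def __init__(self, threshold: float, features: list):
        self.threshold = threshold
        self.features = features


def to_portable_bytes(obj) -> bytes:
    # vars() gives the attribute dict Ben mentions; base64 keeps the payload safe
    # to store as an opaque byte array.
    payload = json.dumps({"class": type(obj).__name__, "attrs": vars(obj)})
    return base64.b64encode(payload.encode("utf-8"))


def from_portable_bytes(raw: bytes) -> ChurnModelConfig:
    data = json.loads(base64.b64decode(raw).decode("utf-8"))
    obj = ChurnModelConfig.__new__(ChurnModelConfig)
    obj.__dict__.update(data["attrs"])
    return obj


blob = to_portable_bytes(ChurnModelConfig(0.7, ["tenure", "monthly_spend"]))
restored = from_portable_bytes(blob)
print(restored.threshold, restored.features)
```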
Mike_Arov:
What you're describing is probably not part of the skill set expected from data scientists, and even ML engineers should not normally have to write a serializer; it's usually about picking the correct versions. That solves the particular problem you're describing, but the automation we're talking about, notebook to code, also helps keep track of versions, because configuration, the versioning of packages, and in particular of the serializer, is also part of the artifacts. Your requirements.txt is definitely an artifact that's needed as part of the whole; even the Dockerfile, generating a Dockerfile with the proper versions of everything you install. Those are definitely things you need in order to build and deploy in whatever your production system is. And you need to use the correct version of Python, the correct version of pickle or joblib, whatever you're using, so tracking that matters. It becomes an even less tractable problem when you talk about deep learning, TensorFlow, PyTorch; there's a whole lot more there, and it's far more complex and far more important to keep the correct versions. I've seen problems where you install a nightly version of PyTorch and it stops reading your computer vision models that trained for weeks. That
Ben_Wilson:
Mm-hmm.
Mike_Arov:
sort of thing happens, and keeping track of the versions of your software is a big part of DevOps methodology in general and MLOps methodology in particular.
Ben_Wilson:
Yeah, there's no greater pain I have faced when maintaining production ML software stacks than what you just described: hey, we didn't pin the versions of what we used when we trained this thing, and now a breaking change was just released, a minor or major version of some major package, and everything's broken because the pip install is pulling the latest version of that stack. Then you try to walk it back and say, okay, what version were we using when this was trained? Okay, we'll pin that. And then you realize that all the other dependencies have also updated, with breaking changes relative to that older version you were using. For the vast majority of use cases out there, and the way PyPI is structured in general, you can't patch an existing release on PyPI. When you release a version, it's there forever, for all time; there are no take-backsies. All you can do is release a new micro-version with the fix. So I think nobody understands how important version locking is until it blows up in their face in the real world, and then they realize, oh geez, we really should have been looking at this or handling this. So in addition to the version requirements you mentioned, where if you want to be brute force about it you could just do a pip freeze and save that off, and now you have the exact development environment that was used, what other metadata is really important to capture when we're talking about moving into a production deployment?
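On the pip freeze point, a small sketch of capturing the environment as an artifact programmatically; it assumes Python 3.8+ for importlib.metadata, and the output path is arbitrary.

```python
import json
import sys
from importlib import metadata
from pathlib import Path


def snapshot_environment(path: str = "artifacts/environment.json") -> None:
    """Record the interpreter version and exact installed package versions,
    roughly the programmatic equivalent of `pip freeze`."""
    packages = {dist.metadata["Name"]: dist.version for dist in metadata.distributions()}
    snapshot = {"python": sys.version, "packages": packages}
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(snapshot, indent=2, sort_keys=True))


snapshot_environment()
```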
Mike_Arov:
Well, there are the versions, the hyperparameter values you arrived at when your experimentation produced the best model, that's also important: which versions of the model file we tried, how many, which data we ran it on. But those actually fall into the same three categories: there's the data artifact, the input features, and the code to produce them; there's the model artifact; and there's the code that glues it all together, including the versions of the dependent packages. And then there is something that tools like LineaPy do for you, which otherwise you have to do manually: tracing dependencies throughout your notebook. Your artifact may depend on other artifacts that by themselves don't produce anything useful for your model in production. They don't produce your result, the output, your score or classification or whatever the output is. But they're important as intermediaries, because other artifacts that do produce important value depend on them. So dependencies between artifacts need to be traced, and you can see it by looking at which variables are used: say your cell has a dependency on a DataFrame that was generated five cells earlier; the code from five cells back also needs to be saved and included, otherwise your code will break. So linkage between the data, between the artifacts, is important. And in order to do it correctly, you need to know how the kernel, the Python interpreter inside Jupyter, handles data in memory and which data objects are connected, and hence are part of the execution dependency graph.
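To make the dependency-tracing idea concrete, here is a very rough static approximation using Python's ast module: it reports which names a cell reads versus assigns. LineaPy does this properly by instrumenting execution in the live kernel, which also handles mutation and ordering that this sketch ignores.

```python
import ast


def cell_dependencies(cell_source: str):
    """Return (external_inputs, outputs) for a notebook cell, based on a naive
    static scan of name loads and stores."""
    tree = ast.parse(cell_source)
    reads, writes = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                reads.add(node.id)
    return reads - writes, writes


inputs, outputs = cell_dependencies("features = raw_df.dropna()\nmodel.fit(features)")
print(inputs)   # {'raw_df', 'model'}  -> this cell depends on earlier cells
print(outputs)  # {'features'}         -> later cells may depend on this one
```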
Ben_Wilson:
So you can extract from the IPython kernel all of that information.
Mike_Arov:
Correct. That was something that was missing in my hacky version, and it's something the current, open-sourced, generally available LineaPy actually has. It tracks the dependencies of all the artifacts in memory inside the kernel, and it dumps them into artifacts that it then packages for Airflow or Prefect or Databricks or a plain Python Flask service; the list of ways you can run it in production goes on.
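For a sense of what that looks like from the user's side, here is roughly the workflow as the LineaPy documentation describes it, reproduced from memory; the exact function names and arguments may differ in the current release, so treat this as a sketch and check the project README.

```python
import lineapy
from sklearn.linear_model import LogisticRegression

# Toy training code standing in for the real notebook cells.
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

# Declare the trained model as an artifact; LineaPy records the code and data
# dependencies needed to reproduce it.
artifact = lineapy.save(model, "churn_model")
print(artifact.get_code())  # the minimal code slice behind the artifact

# Package the artifact(s) into a runnable pipeline for an orchestrator.
lineapy.to_pipeline(
    artifacts=["churn_model"],
    framework="AIRFLOW",
    pipeline_name="churn_model_pipeline",
    output_dir="airflow_dags/",
)
```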
Ben_Wilson:
Yeah, and that's really the miraculous aspect of the open source LineaPy package, the fact that it's delving into the kernel. I mean, it's great because it's solving a struggle a lot of people have and saving a lot of time for people who would otherwise have to do this manually. It's not fun work when you're just building boilerplate and orchestration; as you said, it's incredibly repetitive, not creative, just structured work that you need to get done to get the thing into production. But that internal functionality is non-trivial to implement on your own. If you're evaluating a notebook and you think, hey, I can just figure out the lineage of object references by looking at the code, it's not always that intuitive. There are notebooks I'm sure both of us have seen, I know I have, that are hundreds and hundreds of cells, tens of thousands of lines of script in a notebook. Working through that and figuring out, okay, is there variable reuse here? Are they using the same name for something while it changes which object it points to in memory? That's really tricky; you can't just search a notebook and conclude, oh, well, all of these are defined correctly and they all reference the same object. There are no guarantees. So going into the kernel, getting those relations, and actually accessing the interpreter's memory table of which object references are still in use, that's pretty powerful. I think that's what sets it apart from other implementations that are out there. So you did mention it's open source. How many outside contributors do you have on the project, and how many would you like to have?
Mike_Arov:
So it was released as open source just a couple of months ago, so it's very new. At the moment we only have a handful of outside contributors, and we don't even have any formal collaboration structure or steering committee or anything. It's a small, ten-person company that writes most of the code. I'm a startup advisor and contributor; you could count me as an outside contributor, because I'm not officially on the LineaPy staff, but I do believe in it and help and collaborate. We definitely want a lot of people to use it and try it on their own problems. I have to be honest: at this moment it's the effort of a very small number of people, and it's a very difficult problem to generalize. It's a problem that everybody solves on their own turf, in their own company-owned projects, including me at PostClick. But it's a very difficult problem to generalize into a tool applicable to everybody, and anybody who has worked on infrastructure and tooling will attest that it's a big, big jump from
Ben_Wilson:
Yes.
Mike_Arov:
building something that is tuned for you to open-sourcing it and producing it for everybody. So I encourage everybody to use it, to try it, to see, to communicate the gaps, and to actively contribute and add features. And if I make any point, any message, during this conversation, it's that this is a very important problem that's been largely overlooked by the MLOps development community and big
Ben_Wilson:
Mm-hmm.
Mike_Arov:
companies. There has been development in leaps and bounds on the notebook side, starting with Jupyter itself becoming much more powerful and feature-rich every day, to new kinds of notebooks like Deepnote or Hex, or notebooks inside cloud environments, and IDEs have better and better notebook support. On the other side, production, running ML in production, the tooling has likewise become much more advanced and effective. But for this kind of translation between the two, there is not a lot of work being done, and LineaPy is just a small team building a project from scratch. So it takes a lot of effort, there's a lot of room for improvement, and it just needs more attention, more work, and more creative minds from different engineers to build LineaPy and similar tools that
Ben_Wilson:
Heh heh.
Mike_Arov:
will have an amplified effect on ML engineers, MLOps practitioners, and data scientists.
Ben_Wilson:
Yeah, I mean, I couldn't agree more. And that bridge, that translation layer, what I see as a Rosetta Stone between pure data scientists, statisticians, or physicists, the way they work and how they need a notebook to do what they do in order to be efficient, and the engineering side, where you definitely don't ever want to schedule that notebook in production. There are going to be a lot of modifications needed to get it into the structure, format, testability, and deployment state that your production code requires at your company. And that can create an additional problem that I think tools like LineaPy, and however else people are thinking about working on similar things, can help with. Without that bridge, the gap is actually a lot deeper than most people realize, that chasm between those two working methodologies. Because data science code is not like standard software; it's not static. I mean, standard software isn't static either, you make updates to it all the time, except, as you said, there's the DevOps tooling stack that makes that process so much easier and so much safer. You can create a branch, fire off a PR, have the CI system run unit and integration tests, make sure you're not breaking anything with that code change, and that what you're pushing is the feature you want, or that the bug you're trying to fix is actually fixed. You can safely know that, okay, the system is operating as intended and you don't have to worry about it; it's not going to automatically degrade in two weeks. Well, data science code does. Models fall apart. They need to be retrained. Features need to be added, features need to be removed, modifications need to happen. It's not just, hey, I'm going to retrain this thing statically on a schedule. There's a serious art to data science that needs a bit of creativity and modification. So when you have to make that modification in today's paradigm, the data scientists are used to working in notebooks, and the ML engineers and software engineers are used to working in IDEs, and that code is probably not that similar anymore to what the data scientists originally created, compared with what the software engineers built. So a tool like LineaPy, by bridging the gap between those two, means that the data scientists can modify their code without having to learn object-oriented programming and systems design and truly advanced software engineering techniques, where your model pipeline might have been abstracted into, you know, six different modules within a code base, and you're using abstract base classes. How many times have data scientists built and used those? It's like ancient Greek to most data scientists. They don't need to know how the sausage is made in the ML toolkit they're using from open source; they just need to know how to use its public-facing API. So the idea of a tool like this excites me, because it allows for that Rosetta Stone: hey, we can actually both work on this.
Mike_Arov:
Ben, here I would actually like to emphasize something: automation in general does not replace education. You touched on the fact that many data scientists don't have experience with abstract classes or the nuances of polymorphism, and generally may not have the training and experience of a computer scientist or engineer. The answer to that from many, many people in the C-suite, everywhere, is, well, teach them, right? Or hire the right data scientists, find the ones who have genuine engineering experience. But I have to say that even if people actually can do the job and have the knowledge, that does not mean they don't make mistakes when they do it manually every day, day in and day out. When we talk about this hand-off of artifacts from notebooks to production, it could be one person doing it, like I've done myself when I built an ML system on my own. That was probably at a smaller scale, like zero-to-one projects, right? There, the person doing the notebooks and the modeling and the person productionizing it is the same person. He knows what he's doing, but he's still a human, and he or she can still make errors and mistakes, and automation saves them. As an analogy,
Ben_Wilson:
Yes.
Mike_Arov:
I would point to another popular breed of software productivity enhancers: code generators. GitHub Copilot is making a splash; it's very popular. I actually love it, I use it every day. That doesn't mean I'm a bad coder or that I don't remember the pieces it suggests. No, it
helps me avoid the small, minor bugs and mistakes that break production every day and cost money, time, and reputational damage. It's very unlikely that GitHub Copilot will have, you know, a misplaced comma in there or some typo, right? And this is the value proposition of these automation tools. They're not there to help people who don't know how to write recursive functions; they help people who make silly mistakes while writing recursive functions day in and day out. And that's the value proposition of LineaPy as well. The way I see it in the long, long term, and we are nowhere near there, is a convergence of code generators and a morphing of tools like LineaPy into an architecture generator. If you think about what I was describing, extracting artifacts, refactoring, if you think about what that process is doing, it's architecting software, right? It's the
Ben_Wilson:
Mm-hmm.
Mike_Arov:
job of a software architect. We take pieces of code, we put them into modules, we add boilerplate for the execution systems, the execution engines, which is again part of the architectural considerations: whether we use Airflow, whether we use Databricks, whether we use something else, how it's going to be deployed. All of these are architectural considerations. Ultimately, I can envision something like an AI-assisted architect, an AI architect, that would take your code and suggest a more optimal way to group it, to architect it.
Ben_Wilson:
Interesting. That's a fascinating idea, actually, where you could get almost like a Terraform template that's generated based on code inspection and data volumes: hey, for this first part you just need a really small VM; then this next part is hyperparameter tuning, and you have 10,000 iterations here because this is production grade and it's a complex algorithm you're trying to tune, so let's parallelize that for you, run it on 60 different VMs with a command-and-control primary node that's basically a Ray head node, run it in Ray, and get all the results back, barrier-execution style, to populate the next phase of iterations. And then for deployment, you could maybe pass in, hey, this is how many users I have, this is how often they're going to hit this API, and it spins up or configures a Kubernetes deployment with traffic allocation to handle the REST requests. That's pretty cool. Right now, I know for a fact, having built many of those, that it's a very manual process and very error-prone from an efficiency perspective: making sure that we're not allocating too many machines to a task, that we've optimized to the point where we're sitting at 80% CPU utilization and 70% memory utilization, instead of OOMing containers or spilling to disk in Spark. It's tricky; it requires a lot of experience to figure all that out. So I, for one, am excited if somebody out there is going to build a tool that looks at cloud deployment services and infrastructure and also inspects the code and the data to understand what might be involved. Maybe that's your next open source idea, man. Let me know when you file the PR for it.
Mike_Arov:
Yeah, but I was trying to paint a picture of long term vision and
Ben_Wilson:
Yes.
Mike_Arov:
show that there's a lot of very ambitious work to be done there. It's not just the infrastructure optimization you described; there's also code optimization. There's caching, right? It's a very common thing: you take a function that's called all over the place and you put a least-recently-used cache on top of it. That's a common optimization that a good engineer will do almost without thinking. It's not automatic today, but it can be: since we have the execution graph of the code, and we have this information inside the kernel, we can figure out which function gets called a million times and add an LRU cache decorator on top of it. That's very, very obvious, and there are tons of such things we could do automatically. That's a next step. Once again, LineaPy is nowhere near there, and it's probably going to take more than one small team; as you said, maybe different open source projects or different commercial tools. There's a lot of room to get to the vision I described, this AI architect.
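In Python, the caching optimization Mike mentions is usually just functools.lru_cache; a small illustration with an invented function:

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def normalize_region(raw_region: str) -> str:
    # Stand-in for an expensive lookup or transformation that the execution graph
    # shows being called many times with a small set of distinct arguments.
    return raw_region.strip().upper()


regions = ["us-east ", "us-east ", "eu-west"] * 1000
cleaned = [normalize_region(r) for r in regions]
print(normalize_region.cache_info())  # hits/misses show repeated calls were served from cache
```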
Ben_Wilson:
Hehe.
Mike_Arov:
Almost like an AI code architect, yes. It's exciting, and it will take a lot of people and effort. On the more immediate roadmap for LineaPy is something much more mundane and less ambitious: testing. That's another thing we haven't touched on, and it's also important. It's something that every software engineer in general, and every ML engineer in particular, needs to do, and it also feels like a kind of necessary evil, a routine exercise of discipline: I have the functions, so I need to write the unit tests. So the most immediate and obvious thing we're adding, and this is actually the feature LineaPy is building right now, there are still a lot of gaps, it's in the works, is generating unit and system tests for the notebook as a
Ben_Wilson:
Nice.
Mike_Arov:
new, fourth kind of artifact. To repeat: there is the code, its dependencies, and its installation requirements; there are model artifacts; and there are data artifacts. Now a fourth category is being added, and it's an obvious one: tests. They're a separate series of artifacts, and they're generated the same way. When you know what code goes into production, you can generate boilerplate, skeleton, tests. Currently the engineering team at LineaPy hasn't figured out how to turn them into fully working tests, so what's generated right now is a skeleton into which a human still needs to put the right kind of code, to supply the mock data snapshots on which to
Ben_Wilson:
Mm-hmm.
Mike_Arov:
run the tests. But it's being worked on, and again, anybody who is interested can contribute with code or ideas, or just by trying out the testing mode and giving feedback.
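To picture the kind of skeleton being described, here is a hand-written example of what such a generated pytest file might look like once a human fills in the fixture data; the module and function under test (churn_pipeline.join_features) are the hypothetical ones from the earlier logging sketch.

```python
import pandas as pd
import pytest

# Hypothetical module produced from the notebook (see the earlier sketch).
from churn_pipeline import join_features


@pytest.fixture
def sample_inputs():
    # TODO (human): replace with a representative snapshot of production data.
    orders = pd.DataFrame({"customer_id": [1], "amount": [10.0]})
    customers = pd.DataFrame({"customer_id": [1], "region": ["NA"]})
    return orders, customers


def test_join_features_keeps_matching_rows(sample_inputs):
    orders, customers = sample_inputs
    result = join_features(orders, customers)
    assert len(result) == 1
    assert {"amount", "region"}.issubset(result.columns)
```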
Ben_Wilson:
And like you said, it's an important thing that people who handle production software definitely focus on. Sometimes, as you said, it can seem mundane, but anybody who has to be on pager-duty rotation appreciates tests quite a bit. In fact, that's how I'm spending the rest of my day today: writing tests for a new feature. And yeah, it takes some time, and a lot of the time it is a lot of boilerplate, where you're basically doing the same thing 17 times and just tweaking a couple of the logical things you need to test. Having that generated for you can save you some time and also serve as a reminder: hey, don't forget to do this. Don't ignore your code coverage, because that feature you're implementing right now that you didn't bother to test could be the thing that blows up in production and makes everybody wonder what's wrong with the data science team or the ML engineering team: why did this broken code get shipped? So it's very important. But hey, Mike, it was an absolute pleasure to talk through LineaPy, how it was designed and what it does, and it was a great talk on notebooks and production code. So before we leave, how can people get in contact with you? How do they find out more about LineaPy? And is there anything else you want to leave us with?
Mike_Arov:
So you can contact me on LinkedIn as Mike Arov, or reach me by email at mike.arov@gmail.com. I assume this podcast will have my contact info in the show notes, so I'm happy to get any requests. We'll also be adding a link to the GitHub repo for LineaPy and to its webpage, so that you can scout it, contribute, learn, and even build a competing project. As I said, the fact that nobody else, to my knowledge, is building something like this is not a competitive advantage for LineaPy, but rather a miss for the entire industry, right?
Ben_Wilson:
Hehehe
Mike_Arov:
It's an important piece, and it's definitely something that a small team and one repo cannot fully satisfy and solve, but I believe it's a great start and an important thing. So go download it, use it, send feedback, join the Slack channel. And, you know, let's make the lives of MLOps folks, ML engineers, and data science practitioners everywhere easier and more productive.
Ben_Wilson:
Awesome. Great recap. And once again, it was a pleasure. And thank you, everybody, for tuning in.
Mike_Arov:
Absolutely. It was an honor being here.
Ben_Wilson:
All right. Thanks everybody for tuning in, and we'll catch you next week with another guest. I've been your host, Ben Wilson. See you later.