Challenges and Solutions in Managing Code Security for ML Developers - ML 175

Today, join Michael and Ben as they delve into crucial topics surrounding code security and the safe execution of machine learning models. This episode focuses on preventing accidental key leaks in notebooks, creating secure environments for code execution, and the pros and cons of various isolation methods like VMs, containers, and micro VMs.

Hosted by:

Ben Wilson •

Michael Berk

RSS Spotify Apple Podcasts YouTube Amazon Music

Show Notes

They explore the challenges of evaluating and executing generated code, highlighting the risks of running arbitrary Python code and the importance of secure evaluation processes. Ben shares his experiences and best practices, emphasizing human evaluation and secure virtual environments to mitigate risks.

The episode also includes an in-depth discussion on developing new projects with a focus on proper engineering procedures, and the sophisticated efforts behind Databricks' Genie service and MLflow's RunLLM. Finally, Ben and Michael explore the potential of fine-tuning machine learning models, creating high-quality datasets, and the complexities of managing code execution with AI.

Tune in for all this and more as we navigate the secure pathways to responsible and effective machine learning development.

Socials

Transcript

Michael Berk [00:00:05]:
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Burke, and I do data engineering and machine learning at Databricks. And I'm joined by my extremely well dressed cohost,

Ben Wilson [00:00:16]:
Ben Wilson. I work on open source stuff at Databricks.

Michael Berk [00:00:22]:
Today, we're gonna be talking about a real world case study, and it's something that I'm currently working on for an unnamed customer. And we're basically just gonna riff about this use case. It's a very popular one. It's a very powerful one. And Ben is also doing some feature developments that are adjacent to this as well. Yes. So the the use case today is building an internal code oriented chat assistant. So what that looks like is I'm a data scientist.

Michael Berk [00:00:52]:
I type in my question, how do I pull this data? How do I build a model? How do I do this type of analysis? How do I access this 1 s three bucket that's all the way over there? And the Chadda system will have context not just about code, but about your organization's code. And then it will give you concise answers that will hopefully make you more productive. So that's the setup. What we're gonna be specifically focusing on today is not RAG

Ben Wilson [00:01:14]:
or retrieval augmented generation,

Michael Berk [00:01:14]:
but fine tuning. So all these big LLMs, they've been trained with 1,000,000 of dollars worth of compute. And what we're looking to do is modify those weights a little bit so that they know a bit more about your data while still having understanding of the English language, Python best practices, etcetera. Sound good to you, Ben? Mhmm. Cool. So let's start it off with what I think is potentially the most important piece, which is evaluation. Evaluation is the thing that will allow you to iterate in a stable manner and know whether the things you're trying are good, bad, okay. So, Ben, how would you think about evaluation?

Ben Wilson [00:02:01]:
Safety first. So, like, when you're talking about code, it's amazing how much damage that stuff can do if you run something that you don't intend. And it's all about the that level of security, and there's multiple layers of security associated here. But stepping back from that, the first thing, when we're talking about eval of anything that's coming out of Gen AI, you need to be able to have a metric that you that is actually valid, that is custom designed for the problem that you're trying to get this thing to do. So you start asking yourself, like, well, if this thing is supposed to generate code, what do I want to evaluate its quality based on? Is it important that it's, you know, creating code that is legible? Like, something that is something that a human can read and grok pretty quickly. I would say that most commercial grade LLMs out there are exceptional at that. Like, whether you're talking about, right before we're recording, we went over and played around with 401 preview a little bit about doing exactly this. And when it's evaluating code and potentially giving an example of something that you're asking for, like, hey.

Ben Wilson [00:03:28]:
Rerete my code for me. It's really good at it. Like, it's not perfect, but it's pretty good. Cloud 3.5, also fantastic at cogeneration. But when you were talking about evaluating something for fine tuning, we'd wanna be able to rate this as how good is it syntactically, how good is it for a human to read. Maybe there's performance considerations. Like, hey. Do do a big o analysis of this and optimize for, you know, minimization of memory footprint or CPU complexity.

Ben Wilson [00:04:07]:
It could be some some things that you need to evaluate there. But a lot of that stuff can be determined by just evaluating the the text of the code itself without having to run it, and you can get metrics associated with that. Use an as a judge and say rate my code for these things. And then there's another level of this where you can use linters and use something like rough, for instance, written in Rust. It's super fast. It can parse ludicrous amounts of Python code and adhere to a set of predefined stochastic rules. Like, hey. I have this rule set to make sure that my code is formatted in a certain way that I'm not doing things that are against Python development practices.

Ben Wilson [00:04:59]:
You can enable, disable different rules, create your own if you want. And you can execute that in a very safe manner because it's just parsing text and giving you an evaluation of what it finds. You can even have it auto fix it. You wouldn't want that for this case. You just say, check this code and report out all of the violations that it has or whether it has violations, yes or no, and that can be a metric. But when we move away from that into, does this code actually run? Does it do what I intended to do? That's when we're in the security world.

Michael Berk [00:05:36]:
Yeah. So giving a little bit more context, and I would love to deep dive into that, in just a sec, but we're pretty comfortable with basically leveraging either LLM as a judge or deterministic metrics with or without Python execution to see if the code is, like, stylistically reasonable, if it's concise. Most LLMs, especially open source ones, are are fine at doing this and analyzing code. So what we built is we built an MLflow evaluate framework where we basically create a bunch of metrics. Some of them are custom. Some of them are prebuilt by mlflow. We run via the mlflow dot evaluate command, and then we get a bunch of summary statistics of each metric, so percentiles, etcetera. And this works great because then we can create custom functions.

Michael Berk [00:06:25]:
The question to you first is let's say I'm a data scientist. Let's take a very specific use case. Let me think of 1. Let's say we're looking to do a cohort analysis on a, I don't know, freaking shopping cart or something like that. Like, we have a bunch of products. We wanna see how users are buying them. We're gonna do a cohort analysis on the users that buy x number of products. So it's effectively like clustering or something like that.

Michael Berk [00:06:56]:
Let's say there are a few APIs that are specific to the organization, and those APIs need to be called correctly to get the data. And then we also have a prebuilt sort of notebook example of how we would expect this should be done. So the organization has all of this source code. How should we go about executing all of this very complex read logic, and evaluating that the output is actually correct in a safe way?

Ben Wilson [00:07:30]:
So if I were the one designing that system, I wouldn't use, like, a base LOM to do that because there's so many things that can go wrong, and it's so hard to teach that to use those properly and to have enough examples for it to understand. So I would instead not look at the LLM fine tuning as the means of doing that, but I'd rather go into a like an Agintiq framework, where if I have an API that I know that I call this service and I'm expecting to get a response, I can make that deterministic. And the way that I make that deterministic is write the interface and register it as a tool. So I know I can test that. I can say, here's the example input for this. Here's my, like, my arguments that I would pass to this, and I would get back a deterministic response based on what the conditions are of what I'm submitting to that. And then the agent would just have that tool available. So I would write a very thorough description of what the function is for, what the tool is for, and what the arguments are, and what the arguments can and can't be for its interface.

Michael Berk [00:08:47]:
So, effectively, you'd mark it.

Ben Wilson [00:08:51]:
I would just would push that to the location that it needs to be in, which is a tool. Like, it's a function call that you expect consistent behavior of retrieving this data or contacting the service and not leave it up to an LLM to have to learn that.

Michael Berk [00:09:12]:
Sorry. I'm not following. So LLM generates a piece of code that says, read data x, read data y, do cohort analysis. How do we execute that?

Ben Wilson [00:09:27]:
I mean, if I'm building a system that is a human interface to ask a question about this analysis, that agent would do that one thing. Like, hey. Here's your tool that you can use to get the data for cohort a. Here's the tool you can use to get the data from cohort b. It's probably the same tool with just different arguments fast end. And then the analysis function could be another tool that takes in the those datasets and does whatever analysis we wanna do and returns a result. So now I have this agent that can answer a question. Hey.

Ben Wilson [00:10:04]:
I'm interested in, like I wanna know the difference between people that are buying milk and people that are buying chocolate milk. How many how many people and what's our sales for them, and forecasted sales for the next 6 months? Like, should we buy more white milk or chocolate milk?

Michael Berk [00:10:21]:
Yeah. The use case is slightly different. We're not actually gonna answer the question. We it's not like a function calling paradigm. It's an assistant paradigm where you say, how do I do this? And it generates text, and then the user will go execute that text or Python code.

Ben Wilson [00:10:41]:
That's complicated, man. Like, really complicated.

Michael Berk [00:10:45]:
It's it's, like, effectively chat gpt without running the execution where you say, how do I do a group by in pandas? It'll send you the text, and then you now go use that in your environment. How do we validate that that generated Python code will produce the correct result without all that agentic tool function stuff?

Ben Wilson [00:11:02]:
Gotcha. Okay. Yeah. This is, like, science fiction level stuff. Don't think it's possible with today's technology, to get something like that that would be fine tuned to be really good. You would need to I mean, maybe maybe OpenAI can do it with a fine tuning dataset that is fairly large that has all of these different things that you wanna do. It'll be expensive, to do that, and I would not be able to tell you, like, what that quality would be, because it this would be very complicated.

Michael Berk [00:11:43]:
Well, even just on the evaluation side, let's say I have three lines of Python code. I wanna execute that as if I was a user and evaluate that the output of that code is roughly correct.

Ben Wilson [00:11:55]:
Right. Right. Yeah. So if if we're setting aside the sci fi aspect of this and how possibly impossible this is to get something acceptably good, Just executing Python code in a safe manner, it's super dangerous. And there's a a number of reasons why. You'd unless you know exactly the data that it's been trained on and also have some sort of guardrails put into it to say, here are the libraries that I do not want you to ever use, and here's some more instruction sets of operations that I am not permitting you to generate code that uses these things. So I would have a, basically, a a disallow list of, like, libraries within Python that I know can cause some very serious problems. Like, maybe I don't want it phoning home.

Ben Wilson [00:12:52]:
Like, I don't want it connecting to the the Internet for when it executes code. So I'm not gonna use any, like, HTTPX. I'm not gonna use your lib. I'm not gonna use, I don't even want it to parse, you know, URI. So I don't want it to use URL encode. I don't want it to use, like, any of the base networking libraries that are part of core Python, and I would instruct it that as part of, like, the system prompt during training. And then I would probably also say don't use shellutil. That's operating level system stuff.

Ben Wilson [00:13:32]:
If you've never, like, really messed around in a virtual environment with what you can do with Python with pseudo root access, I highly recommend everybody try that out. Take a take a Friday afternoon sometime and see if you can completely break a computer with Python. Promise you you can. You can do crazy stuff like start deleting user directories. Like, and, yes, that is recoverable in most operating systems. There are ways to, you know, get that data back, because it's it's usually a soft delete. But there are creative ways that you can turn your computer into either the world's most expensive space heater or the world's most expensive paperweight. That's why you should always do this in a virtual environment, because VM, who cares what happens to it? Just start another one.

Ben Wilson [00:14:33]:
But I recommend people try that out. Like, see, can I obliterate this compute environment and make it just unusable? And the answer is, yes. You can. Don't do it to your own file system, though, because if you do some of these techniques, like, I'm gonna, you know, r m minus r f on this folder that contain, like, your docs folder or something. And then you can write code that will, like, just fill that entire space up that you just deleted with just random bytecode, and then delete all of that again, and then fill it up again with a bunch of random byte code. You do that enough cycles. The operating system can't recover your underlying data because it's been overwritten, and the blocks are all now corrupted. Right.

Ben Wilson [00:15:21]:
So, yeah, you can do stuff like that, or you can go into your disk recovery, directory on your your operating system, Mac or Windows or whatever. Just wipe all that stuff so you can't, like, safely restore your your computer, from a backup.

Michael Berk [00:15:39]:
Sounds like you've done some of this before.

Ben Wilson [00:15:42]:
Maybe. You can do crazy stuff like, generate, like, generate so many files on an operating system that you actually crash the operating system. Like, it can no longer index files. So, like, yeah, you're still looking at a screen that's on, but your CPU is pegged to a 100%. And the if you open up any file browser in your operating system, it doesn't do anything because it's sitting there trying to index the, you know, 37,000,000,000,000, like, files that contain hello world, that you've just written to your hard drive. So now your your actual, you know, actual index tree, for your computer is just filled. It like, it's a bad day bad day for Michael. If, you do something like that, I don't recommend you do that.

Ben Wilson [00:16:39]:
No. I was

Michael Berk [00:16:40]:
going to, but now I won't.

Ben Wilson [00:16:43]:
Modern computers handle that a little bit better. I think the last time I did that was on, like, I think it was on, like, a Windows 7 computer, that we weren't using anymore at a job I was working at many years ago. And I was like, hey. I bet I can just break this computer. And guys I was working with are like, well, yeah, you can, like, you know, delete the root directory. I'm like, no. No. No.

Ben Wilson [00:17:08]:
You wanna see something cool? So I just put, like, a recursion algorithm. It's just, like, using, like, file writing of a very small file just billions of times. And we're just sitting there looking at the actual task monitor of it, and everybody's like, dude, what is going on? I'm like, file indexing. They're like, this is insane. Like, we can't even do anything. I'm like, yeah. We gotta fdisk this thing, in order to recover it and reinstall Windows. They're like, woah.

Ben Wilson [00:17:40]:
Like, yeah. Don't do that at home. But, yeah, you can code is is potentially dangerous. In in most base libraries, you have the ability to do crazy stuff. If usually, people who are on that computer, they're working in a job. They know, like, I'm gonna get fired if I start doing stupid stuff like this. Or it's your own computer that you paid money for, and you don't wanna, you know, wreck that. So you're not gonna do stuff like that.

Ben Wilson [00:18:15]:
But we're not talking about code that you're writing. We're talking about code. Some Gen AI is generating, and you're just blindly executing it. You have no idea what that thing is gonna come up with.

Michael Berk [00:18:26]:
So how do you do this safely?

Ben Wilson [00:18:28]:
You don't I mean, the safest thing is don't run it, and have a human in the loop. That doesn't scale, and that's not a good answer. There's safe ways to run arbitrary code. And if you ever check out, like, any sort of, like, DevCon, like, the black hat hackers, you know, symposiums that they do, they have one, like, they have a bunch of them, but you can check out YouTube videos if you want. Or any of the white hat hackers stuff, like presentations that people like, oh, this is how we do penetration testing, and this is how we evaluate these things. And some of the stuff that they're running is, like, dangerous stuff. And if if you're in, like, cyber forensics, you're going through and and trying to debug what somebody, like, wrote in order to exploit the system, You don't ever wanna run that on your computer. Like, oh, this looks like maybe this is a virus.

Ben Wilson [00:19:29]:
I should install this to my computer and see how it works. You you create a sandbox environment, and that sandbox environment is just like we're saying with these file system access stuff. If it's in a virtual machine that's running on your computer, who cares what it does? Create that VM. Do not allow it to have outside access to any network. Disable all ports on the thing other than the fact that it's reading from a file system where you're submitting that code, or you have this one port that's input only, like, during the instruction set, and you're saying, execute this. But you wanna make sure that you're executing it only within that environment, and it has no access to your file system. It has its own file system. And if it blows up or does something super dangerous, kill it.

Ben Wilson [00:20:20]:
Delete it. Kill it with fire. And it it's fine. Like, just delete everything associated with it, and you're cool.

Michael Berk [00:20:30]:
Going one level deeper, because I will implement this in the next, like, 4 days, what exactly what you do would you do? Let's say we're using MLflow evaluate. We get a string of code. Let's say it's compilable, AST parsed, everything, and it will run. But it has a drop table command, and then it also has a vacuum command to remove all of the underlying data for that table. And let's say I have admin access to the workspace.

Ben Wilson [00:20:59]:
Don't do any of that stuff. So if you're if you're getting into the world of SQL execution, there are libraries out there that can clean and parse dangerous commands. You could write your own parser. I don't recommend that. It it's really complicated. But at a bare minimum, there should be an exclusion list that you're evaluating whether this is safe to run or not. And you can just put in, for, like, dummy parsing. Like, is there a is the word drop in here in either uppercase or lowercase? Is there any command that's in the SQL execution that will execute something other than a query?

Michael Berk [00:21:45]:
I don't think whitelisting or even blacklisting would work because we would be leveraging APIs that are custom built by this organization. So we don't always know what is happening within those APIs.

Ben Wilson [00:22:00]:
So the APIs have delete statements and truncation statements?

Michael Berk [00:22:05]:
We don't know. I would not be comfortable, like, creating build this. Okay.

Ben Wilson [00:22:10]:
Like, just don't build it.

Michael Berk [00:22:11]:
But is there a way to spin up a Docker container or something where if we execute the most horrible evil Python code, it'll just kill the container?

Ben Wilson [00:22:21]:
Python, yes. SQL, no. You don't if you're calling some code that then has access to a table somewhere, you have no way of knowing what that is in an external system. Yeah. But what you would have to do is replicate that system internally within the container. So you would basically take a snapshot of that data in whatever mechanism that it's stored in and have your local code connect to that local instance. That's a lot of work, though. Like, a shocking amount of work for something, like, so silly.

Michael Berk [00:22:55]:
Could you give it read only permissions?

Ben Wilson [00:22:59]:
I mean, you could set access control for this execution that this is a, like, a read only user. Sure. But you'd still have to be kinda careful about exposing a, like, a production data source to something that you don't trust its code execution. Like, do you have robust injection attack protection where somebody writes some crazy SQL, they close the statement on the select, and then they write, like, some sort of transact statement.

Michael Berk [00:23:31]:
But the whole point of this read only permissions, how could that so to be crystal clear, what I was like, my my cursory proposal as of, like, yesterday, so I only spent, like, a tiny bit of time thinking about it, would be spin up a execution environment via a service principle that has read only access to dev, obviously, not prod.

Ben Wilson [00:23:57]:
Theoretically, I don't know. I'm not a security professional. Okay. I wouldn't do it, just because there's a redesign that would make more sense. So if you're is not abstracting the actual command that you're gonna be running. So if you're calling some library that has some code in it that could potentially do dangerous things, then you would have to evaluate what the the impact of that is. I'm like, okay. This could call this this admin API.

Ben Wilson [00:24:35]:
And even though I I think I have everything configured, and this is safe and secure, and then you run it, and then you find out later on, like, 10 minutes later, like, we just dropped the entire catalog because we forgot to do this one. Like, make sure that it didn't access this one thing. You have no idea.

Michael Berk [00:24:56]:
Sorry to harp on this, but isn't the solution environment separation? Like, if I go to my personal laptop and run drop catalog from, like, on any workspace, nothing will happen because it's my personal laptop. I don't have the workspace URL. I don't have the token, and there's no way that the LM will guess both of those things.

Ben Wilson [00:25:14]:
So that it's guessing that. It's that you writing connected like, let's say we you set up this this unit catalog, and there's data in it in this particular, you know, table within a schema. You connect to it. You set it up to be I only have read access to this, and I send a drop command. Right. And then, you know, truncate my catalog or something. That system processing your instruction will reject that and say you don't have this permission to do this. Correct.

Ben Wilson [00:25:45]:
You're not an admin. However, we're not talking about that. We're talking about you're calling a Python API that you don't control. You don't know, like, what that thing has. The first instruction in that Python command would be use serve service principle with admin access. Sure. Somebody could have written that. You have no idea.

Ben Wilson [00:26:04]:
That's why I would never do that.

Michael Berk [00:26:06]:
So could I spin up oh, clicked. Alright. Cool. So in summary, if basically, you you they're you're using an a, organization built API. You don't know what it does. Let's say you have do cohort analysis, and it needs to create a new location. And by creating that new location, it wipes everything at the specified location. The default is the root, and it automatically authenticates via an admin service principle.

Michael Berk [00:26:39]:
I see how now that is potentially a problem.

Ben Wilson [00:26:42]:
Yeah. Like, their their statement for creating a table could be create or update table at this location. Mhmm. That's gonna wipe the data. Yeah. Oops. So if you don't know what the API is, it's super unsafe to do that. Cool.

Ben Wilson [00:26:58]:
Now is the other thing is, like, okay. We're talking about just Python code execution of core Python libraries, and that function, we're just arbitrary executing Python code. Maybe it it's a function that adds two numbers together is the intention. But then we're like, well, we don't wanna create, like, 10,000 different functions to do all this stupid stuff. That's basically reimplementing, like, base Python operations. Right. Instead, we're gonna do the the cool thing, which all the cool kids wanna do, which is, hey, Jenny, I, like, just generate my code for me, and I'll run it for you. That's where we're talking about.

Ben Wilson [00:27:37]:
We need safety of an execution environment because we're not talking about interfacing with external systems where we could potentially create a lot of havoc. We're talking about the easy thing to get going right away is just call exact on that. Like, hey. I have a Python code in a string that's properly formatted, AST node parsed. It's all structured correctly, and I just call eval in my main process. That's where we're, like we start talking about shockingly unpredictable things could potentially happen. Right. The best case for garbage code is it just throws an exception.

Ben Wilson [00:28:18]:
You get a stack trace, and you're like, yeah, it sucks. But nothing really happened. The worst case scenario is, oh, it ran. It ran really well. And now the entire core contents of every user's directory, because you're running it as, like, an admin in the main process or running it as your user, which you may be an admin while doing, like, exploratory dev work here. And now you can see every bit bit of data that everybody has in all of their notebooks in the execution environment. Just printed the standard out and logged. And you're like, I hope nobody had, like,

Michael Berk [00:28:57]:
keys in their notebooks. Of course not. No one ever has keys in their notebooks.

Ben Wilson [00:29:01]:
Yeah. And now you have to start a security incident and involve a bunch of people at your company saying, hey. We now need to do an audit because I did something dumb, and we may have leaked every key that people have accidentally put into their notebooks and rotate them all before we have a, like, a data breach or something. Yeah. Like, who knows where that data might have gone and, you know, scary. Yeah. Bad times will happen. Okay.

Ben Wilson [00:29:30]:
Safe environment that has no no ability to talk to the outside world, and you just get an evaluation of what the, like, return result is. And that you can compare, like, hey. It the instruction for this thing to generate code was to do this task, and it returns something that proves that it did that task. But if that task is, hey. Go pull some data and, you know, tell me Manipulate it. Yeah. Yeah. That's when you're in scary territory.

Ben Wilson [00:30:05]:
You I would change it more to, hey, generate the SQL query that I need to do this and ensure that you're running only a select statement. And then it goes and fetches the data within a secure sandbox environment.

Michael Berk [00:30:18]:
Mhmm.

Ben Wilson [00:30:18]:
And make sure that, yeah, this data is the right data that was requested. Crystal clear. Alright.

Michael Berk [00:30:26]:
I think this sort of wraps the evaluation topic. Anything else before we move on to the actual fine tuning?

Ben Wilson [00:30:33]:
I mean, it's a big topic. We could talk for days. But, yeah, this is the stuff that I'm I'm currently working through, for a project. Like, not just the concept. Like, we just explained why this is important to have a safe execution environment, and that informs why I'm working on what I'm working on. But what I'm concerned with now is how to do that in a very safe way. And there's a lot of options that you have, for safe execution. And it's really it's a difference between, like, ease of development and, for the person building that solution versus performance and security.

Ben Wilson [00:31:23]:
And then, like, just how complex is this? And you can go on one extreme end is the most secure and potentially least performant, which is creating a brand new VM for each execution. It's gonna be terribly slow, like, really slow, but it's super secure. Right. Provided you you define the characteristics of that virtual machine and set it up in in a way that your users can't mutate. So it's like a protected configuration. And when they wanna run it, yeah, they get the safe and secure way to do it. It's gonna take, like, seconds to do a single function execution. Yeah.

Ben Wilson [00:32:11]:
And then the fast way is and still pretty secure is doing container services. You're like, hey. I have a base container, and I need to execute a number of these function calls. Well, that container provided that you set it up correctly doesn't have access to the file system or networking on that computer. And it's like an isolated sandbox, and it's quick and easy kinda to build. And then there's a bridge between the 2 of those that gets the very fast performance, but also the security of virtual machines where it's like micro VM processes. And that's how cloud providers do it. So when you interface with, like, AWS Lambda, they're running micro VMs on Kubernetes, and they can spin up

Michael Berk [00:32:58]:
What is a micro VM? What's a micro VM?

Ben Wilson [00:33:02]:
It's like, imagine snapshotting the state of a base image for a virtual machine, and you can replicate it, but sanitize the state of it. So it takes a little while to spin up the first instance of it. That's why there's a cold start issue with Lambda. But once that VM is ready and active and held in, it's like something that you can just submit code to. Each individual request that comes in gets a replication of that container to execute, and you can rec replicate, like, a 100,000 if you want. And it'll it'll scale to however big you need it to go. They have a queuing system that's, you know, handling request volumes that's coming in and making sure that you're reusing. Like, when a VM is done and it's ready for destruction, you can potentially reuse that by wiping state, and then it's ready for another request immediately.

Ben Wilson [00:33:54]:
That's how, like, lamb AWS Lambda, if you put a lot of volume at that thing, it is shocking how fast it is. Like Got it. Mind blowing. How many requests that they that service can actually handle. Like, AWS did a fantastic job building that. And all the other cloud providers have something similar. You can Azure, GCP, they all have services like that. Got it.

Ben Wilson [00:34:20]:
That's the firecracker API. Yeah. It's an open source package. But if you look through the setup for it and the configuration, you're like, yes. A little bit more complicated than, Docker. Just a little. Heard. But it's cool.

Michael Berk [00:34:37]:
Okay. Cool. So in summary, we're evaluating our stuff. The way that we're gonna be doing this is maybe not with code execution, TBD, but just doing sort of heuristics on, is the code looking clean? Does it compile? Is the does it pass the linter checks? That type of stuff. Let's say we have a suite of, let's say, 10 metrics, that properly evaluate our code. The next problem that I wanted to run through is, well, what we have here and, again, we're not gonna be using retriever augmented generation. We're just gonna be fine tuning. We have a bunch of internal code bases, So repo 1, repo 2, repo 3.

Michael Berk [00:35:18]:
They all have custom logic, and let's say it's relatively good Python code. So there's docstrings, there's a few markdown examples, there's read mes, and the APIs are fairly, like, well written. How would you go about fine tuning? Would you go create a dataset manually? Would you leverage a synthetic data generation mechanism? How would you actually create the data to fine tune?

Ben Wilson [00:35:45]:
I would I would do the same thing that opening I did, which is use GitHub for public repos. So it's a volume problem. Right? You wanna have enough flexibility and referential, like, intelligence of how to solve problems in an abstract way, and you need a bunch of examples of ways that people have done that. If you if you were to look at the entire, like, extract of public GitHub repos, you're probably gonna find so much duplicated code when you're talking like, going down into the base level of what this function or method or class does. There's certain things that when you're building frameworks or you're building applications, everybody has to go and do it. You need some way of handling, you know, conversion of JSON to dictionaries in Python, or you need some way of of showing proper ways to use regular expressions in Python. You need loads and loads of examples of these common things. Like, how do I make it a rest request? And there's probably millions upon millions of examples in every major language there on GitHub.

Ben Wilson [00:37:08]:
In those examples, there's probably going to be, let's say, 60% of it is hot garbage. Like, somebody doesn't know what they're doing. They create their own repo. They write some code, and nobody no other human has ever looked at it, probably for the best. And if you were to look at it, you'd be like, what are they doing here? Like, that's so unsafe. Or why would they think this is performing? Like, there's no way this would like, this function is meant to iterate over a collection of data. And the way they wrote it, it probably works great if I just have 10 elements in that. But what if I pass a million elements into that? It's gonna it's gonna throw a recursion error, or it's gonna it's just gonna blow up the stack because of all of these objects that they're creating unnecessarily.

Ben Wilson [00:38:05]:
So there's lots of ways to write really, really bad code that's gonna cause lots and lots of problems. So the way to get around that issue is either manually curate everything and have experts go and just review it all and fix everything and make sure that your your training dataset for fine tuning is the most immaculate examples and enough examples of this concept so that it under it can kind of grok what the heck is going on. The other way is throw everything at it, and it'll theoretically figure it out. And that's that base model tuning, like foundation model tuning. And as you said at the top of the episode, that's large amounts of human capital, large amounts of, you know, just pure capital, that needs to go into that. It's super expensive to train these things. And if you don't have the resources that, Microsoft and by proxy, OpenAI or Google or Meta, if you don't have data centers like that, you're not getting in this game. It's so ludicrously expensive.

Ben Wilson [00:39:16]:
We're talking tens and tens and tens of 1,000,000 of dollars to train these things. So fine tuning is an option, but you gotta be careful about what you're teaching it on fine tuning. Are they actually the best development practices? Do you have some, you know, distinguished engineer at your company who has gone through every single example that's gonna be done and making sure that they adhere to the standards of of best practices? Do you have an entire team of people at that level who are going through and nitpicking every little thing and be like, that's not efficient, or that's not the way to do this. Here's a better approach, and these are the standards that we wanna set for our code base. If you work at a software development company that's all about building high quality code, you probably already have that dataset because everything's peer reviewed. Everything's gone through, like, optimizations, maybe snapshot the state of a repo right after a major refactoring has happened by multiple teams. And people have signed off saying, like, this is a fantastic state of our repo right now. Or you just go and custom select certain things where people have agreed on.

Ben Wilson [00:40:27]:
Like, this is an awesome implementation. Like, we can read it. It's performance. It's maintainable. It's testable. There's separation of concerns here. We don't have this, like, bloated method that, like, does way too many things. It's just like clean code.

Ben Wilson [00:40:44]:
Right? And that's what you would wanna use for fine tuning.

Michael Berk [00:40:49]:
K. So we can't do the manual effort because we're lazy. So

Ben Wilson [00:41:00]:
hope for the best.

Michael Berk [00:41:03]:
So the the subsequent question is we were looking to generate a synthetic dataset. We're not gonna use the GPT models because that's against our license, but we would use an open source model to essentially take the context of the API specs and build solutions based on examples, docstrings, etcetera. What are your thoughts on that approach?

Ben Wilson [00:41:27]:
Which open source model?

Michael Berk [00:41:30]:
Llama, probably.

Ben Wilson [00:41:31]:
Like, the largest Llama model? It's probably your best bet. It's like, somewhat good at code. It's probably gonna struggle. There's a lot of any of these commercial grade really great cogenerating LLMs that are out there, it's not you're not just talking to, like, a single model. You're talking to an agentic framework that has loads of sophisticated logic that is around everything that's operating in the back end. You know, you talked to to 401, you start seeing its chain of thought in real time as it's Yeah. Yeah. That was super cool.

Ben Wilson [00:42:11]:
That's an agent. It's a very, very, very complex agent that's generating new tasks that it needs to do based on how it's analyzing a problem, and then it's telling you, like, what it's doing right now. So 4 o doesn't do that. 4 o is just I mean, you're talking to multiple models that have all been fine tuned to do different tasks. It's like, oh, this is a question about code. I'm gonna go talk to my Python expert here, or this is a code about Java code. I'm gonna go talk to my Java expert here. And then you'll get an answer that's probably optimized for that particular use case.

Ben Wilson [00:42:53]:
So that's the complexity you're going up against when you're talking about, well, we'll just use an open source model. That's just one model. It's not, you know, this this fleet of sophisticated models that experts have sat there and, like, fine tuned the hell out of to do certain tasks really, really well. So it's a big mountain to climb with a small team, and I do not imagine you're gonna be successful at it.

Michael Berk [00:43:23]:
Challenge accepted. Yeah. I

Ben Wilson [00:43:28]:
I it seemed it's like a cool research project, I think, to, like, see it's like a hackathon project. Like, how good could we make this? But then the question is, is that good enough for people to use this in production?

Michael Berk [00:43:41]:
That is the question.

Ben Wilson [00:43:43]:
Or is this gonna be so buggy that people are gonna be like, yeah. It's cool, but it sucks. And can you make it better? And you come to a point where it's like a law of diminishing returns of how much effort you need to put in in order to get the architecture that you're dealing with to compete against what people expect. And the expectation is a bar set by OpenAI and Anthropic right now. And that that's some of the finest minds in this industry that are tackling these problems. They've been working on it for years, and that's the only thing they work on. And they have very large budgets to work with.

Michael Berk [00:44:26]:
That makes a lot of sense. I think that we 100% can get some sort of customization down via fine tuning. The question is, will it be good enough? And I think that'll just come down to the the synthetic dataset or the manually curated dataset that we're fine tuning on. So

Ben Wilson [00:44:45]:
Yeah. So how I would tackle this project and by the way, we've done stuff like this as hackathons internally in engineering to try to do this exact thing. And you handle it like a hackathon. You know, like, here are my my principles of behavior that I'm expecting from this this block of time that I'm gonna be given to this. So, like, hey. I've got 1 week to see what I can come up with. And you're setting yourself hurdles that you need to jump across at certain days. Like, okay.

Ben Wilson [00:45:18]:
After day 2, I should have my first iteration ready to go. The code does not look nice. It well, it's like hacktastic, man. Just, like, make it work. And you get it so that it can execute, and you're evaluating the responses manually. You're not going to, like, evaluating with an LLM as a judge because you don't trust any of it. You wanna see, like, what code is it generating and then look like, read through the code and say, like, did it actually learn anything? Like, is this good? Does it understand the context of this environment that it's providing suggestions for? And you'll be able to give thumbs up, thumbs down pretty quickly. So that after that 2 day mark, you can either pop smoke and get the hell out or be like, I think I'm on to something here.

Ben Wilson [00:46:05]:
Now I need to go to the next phase over the next 48 hours of giving it, Like, identifying what's wrong in the responses that it gave the first time and making sure that you have examples or enough examples of the right way to do that, and then kick off another fine tuning. And then maybe you have some metrics that you've written that can do some sort of automated human augmentation to the evaluation. Then you still have to read through every response and and get an expert who knows what the hell they're doing to look through this code and be like, yeah. That's good. Or, this is broken, man. Like, nah. We can't run this. After that point, when you have a go, no go of saying, like, hey.

Ben Wilson [00:46:49]:
It's, like, 85% pretty good. Then you go into the the whole, like, maybe we should automate running this code and see if it works. Because now we need to do we need to evaluate, because those first evaluation stages, you're testing, like, 50 examples or something, something that a human can read through and not get exhausted. And then when you're at the point where you're like, I think we're onto something. We have some ideas of how to make this better. Now you have economy of scale or it's like, okay, we need to actually automate evaluation of this and get some sort of metric because it's just too much work for people to do. They're like, hey. We're gonna evaluate a 100000 questions.

Ben Wilson [00:47:31]:
They're like, no human's gonna be able to do that. So now we've talked about, you know, MLflow evaluate with as a judge and then safe execution of Python code and yeah. That's not something that I would do in a week, though. That that's like I did my hackathon, and I've learned all these things throughout that process. That now informs a product design that I can do. If I'm gonna make this into a product and propose it and say, like, here's all the pros and cons of this approach, and here's the ideas, the things that we learned while doing the prototyping of this. Business leaders, should we invest our time and money into this? And if they say, yes, this is awesome. Let's do it.

Ben Wilson [00:48:12]:
Then it now becomes a full blown project where you're like, okay. We're doing design docs. We're going through proper engineering procedures of building an actual new product, and there's stages of that as well. Prepare for private preview before we're gonna get some test customers who are okay with some, like, garbage, every so often. We learn from that, improve, fix all the problems until we're ready to, like, generally make this available to the public or to our customers. That process for something that Databricks has now, the Genie service, ask a question, it'll answer from your data. It's, like, over a year of a a fairly large engineering team working on that, and that thing is awesome. Like, that team did a fantastic job.

Ben Wilson [00:48:58]:
But that's a fine tuned service that is very sophisticated, and it used a lot of training data to get it as good as it is.

Michael Berk [00:49:07]:
Yeah. So that's part of my question is, like, for the MLflow docs, you guys use a service called RunLLM, that is a third party that goes and actually scrapes your APIs and then builds a chatbot on top of that. And I was just testing it out well, we were just testing it out prior to the call to see how it basically does really good API lookups. It's a better search, but it doesn't critique code that well or It doesn't go that much, more in-depth. So this is a solved problem. They basically want that functionality.

Ben Wilson [00:49:36]:
Do you know how that works, though?

Michael Berk [00:49:37]:
I have no idea. That was how Run

Ben Wilson [00:49:39]:
LOM calls GPT 4 o. That's the LOM.

Michael Berk [00:49:45]:
But does it how does it got it.

Ben Wilson [00:49:48]:
It's all That's

Michael Berk [00:49:49]:
okay. This is a RAG problem. That's what I've been saying.

Ben Wilson [00:49:52]:
This is a RAG problem. So it's using what a really powerful open like, not open source, but really powerful proprietary model, commercial grade, Candu, which it knows how to write code, it knows how to read code, it knows how to understand best practices. And then it just gets contextual information about our product, and it can generate code that you can actually run, and it works. It's pretty slick. Mhmm. Like, they're the suggestions that it comes up with generally are pretty top notch.

Michael Berk [00:50:21]:
Yeah. I use it a lot.

Ben Wilson [00:50:22]:
And they tie that into 401, whenever they're doing that, I think, next year. It'll be just shockingly good at what it does. But, yeah, this is a rag problem in my opinion. Or if they wanna make it sophisticated and extensible for users to automate super annoying stuff that they gotta do, this is an agentic problem where you use that concept of deterministic behavior and have the LLM safely interface with something that is safe. Because you don't have to it's not generating code per se. You're controlling what it can run because it's calling a tool that that tool is defined. Underlying that is a function that's executing in a sandbox to safely do something that could potentially be very dangerous.

Michael Berk [00:51:10]:
Crystal clear. So, in summary, when you're looking to build these fine tuning use cases, focus on evaluation, use LMS to judge, linters, deterministic rules. If you're doing code execution, be very, very careful slash don't do it. For data creation, lots of the input data will be bad, so either manually curate, put everything into a model, and rely on a full training run, or try to generate synthetic data. Good? Perfect. Cool. Until next time. This is Michael Burke and

Ben Wilson [00:51:38]:
my cohost. Ben Wilson.

Michael Berk [00:51:40]:
Have a good day, everyone. We'll catch you next time.

Challenges and Solutions in Managing Code Security for ML Developers - ML 175

0:00

51:45

Playback Speed: