Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Burke, and I do machine learning and data engineering at Databricks, and I'm joined by my cohost.
Ben Wilson, I do all sorts of code related stuff at Databricks on the ML side.
Cool. Well, today we're going to be walking through a case study, something that I'm currently working on and Ben and I are going to workshop this. And hopefully you'll have some tangible sort of takeaways of how you can apply lessons that I'm currently learning to your own projects, but also it's a project that you might be interested in building yourself or even using some of this code, it'll be open source. So the problem statement, we are looking to standardize documentation. No one likes docstrings.
No one even wants to write examples or blog posts. It's just the worst thing in the world. I think we can all agree. And one thing that the MLflow repository is currently facing is that there are several different docstring styles. And essentially we're looking to standardize functions, methods, classes, modules, all of those docstrings so that they're in the same format. So a while back we were doing a standup and basically discussing this issue.
Ben, do you mind walking us through some of the solutions that we were discussing?
Yeah, I mean, we didn't go through a formal design process because I guess we were just riffing during standup or towards the end of standup and saying like, you know, it'd be great if we had standard docstrings. Because any project of the magnitude and scope of MLflow, I mean, it's been worked on now going on five years, I think. It's almost five years. And
There's been a lot of people that have touched it. I think we have over 500 contributors now. So over that long period of time, with that number of people implementing, modifying, and adding to, particularly, the public APIs. So anything that, in Python's parlance, is
a function or a method that doesn't start with an underscore. So it's something that a user is intended to interact with, and it's not within an internal module.
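As a quick illustration of the convention Ben is describing: in Python, a leading underscore signals that a function is internal. The function names here are made up for illustration, loosely in the spirit of MLflow-style APIs, not actual MLflow signatures:

```python
# Python's naming convention for public vs. internal APIs:
# no leading underscore means "part of the public surface area."

def log_metric(key, value):
    """Public API: users are intended to call this directly."""
    _validate_key(key)
    return {key: value}

def _validate_key(key):
    """Internal helper: the leading underscore says 'do not rely on this.'"""
    if not isinstance(key, str):
        raise TypeError("metric keys must be strings")
```

Tooling and documentation generators generally follow the same rule: underscore-prefixed names are skipped when rendering the public API docs.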
Things have not been standardized. To be fair, people had the best intentions, and things have been standardized after a certain point. But in the early days of building something like that, you're trying to get it out there pretty quickly, and standards aren't usually something that most people think about at the start of a project. You know, sometimes the first committer is going to have standards, like, this is how I want this to look, and I'm going to have that. But
when you start getting the 10th, 11th, 12th person coming in and applying patches and fixes or implementing new features, if the whole team isn't thinking, we're setting this standard for how this is going to look, we need to have this explanation style, we need to define the arguments in this way, whether we add typing or not. There are lots of things that go into that. And over time, if you don't set those standards, you're going to have a mixed bag of all sorts of stuff. Some people have a personal preference to be very verbose; color me guilty on that one. I tend to write too much in the docstring. Other people are very succinct and very brief. So we were saying, what could we do here? Everybody had a couple of different ideas, the first one being the worst, which is: let's ask
open source contributors to rewrite our docstrings for us and use them as good first issues. And yeah, some people will pick that up and want to do that so that they can see that they have committed code to a very popular repository that has millions of downloads a month. But most people are not going to pick up stuff like that, and it creates a lot of work on the part of maintainers to go through and review all of those. If you're just saying, okay, I'm rewriting all the docstrings in a single module, that could be a PR that has 1,500 or 2,500 lines changed. It's docs, but it's still important that it's formatted correctly: no grammatical errors, correct punctuation. And then also think about the fact that those docstrings are actually our API docs on the website. So
they need to render correctly. So the process of reviewing that is not just reading through text in a PR on GitHub. It's, I have to pull this person's branch locally, like fetch it, and then I have to build the docs locally. And then I have to render the website locally. And I have to go through and check formatting on all of these things to make sure that the build process for the static HTML files is correct. So,
So that was kind of out of the window, if you remember. So what did we come up with next?
We were looking for programmatic solutions to automate this process. Reviewing is still going to be kind of a pain, and we have some ideas about how to make that a bit more concise and simple. But we were talking about this magical new creation called an LLM that could theoretically do the conversion. So I was tasked with building out a prototype and essentially seeing if this could work. And the first thing that I did is I actually just went into the GPT-4 UI, sort of the chat interface, and I tried to copy and paste some of the pmdarima docs from MLflow. pmdarima is the classic autoregressive time series library. I was trying to convert some of those docstrings from the existing format into Google style or NumPy style, and just seeing what happened. Would it completely rewrite stuff? Would it hallucinate? Would it do a good job?
After a bit of context engineering and prompt engineering, I got a solution that works pretty well. And so the next step essentially was to try to put this into code. Ben, if you were going about seeing if this was feasible, is that the first step that you would take, or would you do something else?
Oh, for sure. I mean, you went off and did what any of us on the team would do, which is: what is the simplest thing that I can do to determine if this is feasible or not? We call that a spike, right? Build a rough prototype by whatever means possible to determine: is this possible? Some of the things that you'd do a spike on in the ML world
are slightly different than what you would do a spike on in the engineering world. So if you're building, you know, framework code, we would usually classify that as something you would work on for a hackathon. And that means get dirty with your code for your first iteration. Don't worry about putting try/except blocks in. Don't worry about exception handling. Don't worry about writing tests in that first proof of concept.
Sometimes you're just writing a script, like a .py script, or you're opening an interactive environment with a REPL that you can just issue commands to, or you're writing a bash script. Something that's just really quick and dirty that doesn't have to be production ready at all. It's just: can I get this to physically work? And sometimes it's a big no. And that's why you time box it.
You say, hey, I'm only going to spend two hours on this, or four hours for larger scale things. That's why hackathons are time boxed, too. You say, hey, we have 48 hours. What can we do in 48 hours? Can we get something that actually works? Sometimes that first prototype happens in the first two hours of the hackathon, or if you have a team of four or five people, all of those people are trying their own
attempts at getting this thing working in different ways. So you can fail fast, fail early, and then pivot to something else. And that's really the key with that, like quick and dirty, get something out there. And you did exactly what any of us on the team would have done. Like use the tool that's most available, see what it does.
Yeah. So the way that I typically approach these problems, if I need to do it in a systematic fashion and there's a bunch of new things, is I create a list of blockers that I need to test: essentially, components that I need to prototype to validate that they will not prevent the solution from working. And you can tackle them in a variety of orders. One way is to sort by the difficulty of testing each hypothesis. So if it takes you 10 minutes to say,
all right, this won't be an issue, or this will be an issue, you should probably start with those really fast and easy things to test. But oftentimes with larger projects, there are really big things that are just time consuming to test that could potentially block a project. So how you tackle that list of legitimate blockers is up to you. But in this project, I identified two things that I was not sure would work.
And then beyond that, I was quite comfortable that I could write any code around this that would make the project successful. So the first was: would the LLM support docstring conversion? The way that I tested that was via GPT-4's UI, playing around with some prompt engineering. And then, after that was successful, I actually went and signed up for OpenAI, got an API token, and started doing some context engineering, basically trying to figure out the best prompt that I could pass it, up to a certain point. After some iteration on that, it wasn't perfect at all. The formatting is still a bit off, and I'm going to add a second implementation, which we're going to talk about. But that was step one, component one: it'll work. Generally the user might have to do some manual review or manual revisions, but we're looking good. The second component was actually parsing the docstrings from the module itself. I didn't know if that was possible, and it turns out it's super well supported via the AST module
in Python, but I had never done that before. So essentially what I did is I Googled it and asked ChatGPT, and found that this module, in about 10 or 15 lines, will provide a string that contains every single docstring, whether it be module, class, or function. And so with those two components, I knew that I could create a mapping of each docstring. It can be in dictionary format; how I insert it into the file is sort of an implementation detail.
But with those two components, I was confident, if not certain, that the solution would work. So when you're doing this, that's a really effective way to essentially fail fast, because if I built out an entire structure of, let's say, mapping docstrings to the LLM conversion, but I couldn't extract docstrings from the module, it would be pointless.
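The AST extraction Michael describes really does fit in roughly the 10 to 15 lines he mentions. This is an illustrative sketch, not the project's actual code; the function name and the dictionary shape are my assumptions:

```python
import ast

def extract_docstrings(source: str) -> dict:
    """Collect every module-, class-, and function-level docstring."""
    tree = ast.parse(source)
    docs = {}
    # The module docstring itself, if present.
    module_doc = ast.get_docstring(tree)
    if module_doc:
        docs["<module>"] = module_doc
    # Walk the whole tree and grab docstrings from defs and classes.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc:
                docs[node.name] = doc
    return docs

example = '"""Module docs."""\n\ndef add(a, b):\n    """Add two numbers."""\n    return a + b\n'
print(sorted(extract_docstrings(example)))  # -> ['<module>', 'add']
```

The resulting name-to-docstring dictionary is exactly the kind of mapping Michael describes feeding to the LLM and then substituting back into the file.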
So I have a question for you. What if you went through that second phase of trying to extract docstrings and you couldn't figure it out and you spent like six hours working on it. Would you mark that as failed or what would you do?
Good question. It's very dependent on the environment that you're working in, at least from my experience. I'm sort of doing some volunteer MLflow work to learn, and I probably would have just asked someone. That might be okay, it might not be okay, depending upon your environment. But if you're the most senior person in your vertical or industry and you can't figure out a solution, maybe that is a definite "it's not possible," or at least it's not possible with my current knowledge. But typically, asking around is the highest ROI solution. But I feel like you were getting at something with that question.
Yeah, totally. Ask other people that you're working with. It's a huge difference that I've seen in the different roles that I've been in throughout my career.
Before I got into data science work, when I was doing traditional physical engineering work, everything was decided as a group. You present your data, your hypothesis, your conjectures about whatever you think is going on. And then everybody on a team that's working on a particular project or technology has their voice, and they can comment and say, actually, I think it's this. There's this massive level of experience all working together. Then, moving into the early days of data science work for me, it was surreal. Everybody was working on their own. One person would work on a project in isolation. You could maybe ask somebody who's competent in that area or ask somebody's opinion, but it was never particularly collaborative.
At least it wasn't for me in the places that I worked. So it's a lot of a figure-it-out-at-all-costs type thing: make it work. Sometimes that produces some really bad and unmaintainable code, or solutions that are way over engineered or way under engineered. I've been on both sides of that. But then, moving into pure software engineering now, it is working on your own
to solve a problem. But if you hit a blocker, that's one of the reasons why we have standups every day: if you're blocked on something and you've been spending too much time trying to figure it out, you just ask. Ask the team. And if you feel uncomfortable doing that in a public setting, find another job; your work environment sucks. You should feel free to speak up
without judgment or without feeling ashamed that you got stuck on something, and your team should have your back. And when you voiced that, you would have gotten an answer within seconds from any of us who have used AST before or had to do raw Python code parsing. That's what the ast module is effectively for. You usually use it for stuff like literal_eval, where you're converting a string-wrapped dictionary back into a dictionary, or a string-wrapped collection back into a collection. Or, if you're a hacker, you're using eval for execution of remote, nefarious code. So be careful with that; use ast.literal_eval.
But getting that answer really quickly and unblocking yourself is super valuable, because you just saved yourself potentially hours of wasted time, and you might not even have found the correct answer online or in a book.
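The distinction Ben is drawing can be shown in a couple of lines: `ast.literal_eval` only accepts Python literals, while `eval` will happily execute arbitrary code, which is what makes it a target for the nefarious use he mentions.

```python
import ast

# literal_eval parses string-wrapped literals (dicts, lists, numbers, ...)
raw = "{'style': 'google', 'indent': 4}"
config = ast.literal_eval(raw)
assert config == {"style": "google", "indent": 4}

# ...but refuses anything that isn't a plain literal. The same string
# passed to eval() would actually execute the os.system call.
try:
    ast.literal_eval("__import__('os').system('echo pwned')")
except ValueError:
    print("rejected: not a literal")
```

That refusal is the whole point: for untrusted strings, reach for `ast.literal_eval`, never `eval`.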
Right. So I had prototypes built out for the two components that I was cloudy or uncertain on. And the next thing that I did is I built a framework, a sort of structure, that would let me modularize a lot of this code and do some of the things that I was anticipating I would want to do. One of those things is I needed a place to store context. As we all know, LLMs take in some sort of context as a string, and then they return whatever that context and prompt tell the LLM to produce. And I knew that context engineering would be a big part of this project, because maybe we want to convert args and returns to Google style, but maybe we want to actually have the LLM write the entire docstring if the docstring doesn't exist. There were just a lot of different use cases for this, so I wanted to keep it as open as possible. So that was one area: basically a bunch of context files. You pass in a YAML file to my code, it gets parsed, and then it's put in as context. And that would allow versioning and a bunch of other things. Another is the classic main versus utils split. I have just a main that is very simple: it's a click CLI, and it calls run. Run then basically handles all of the logic behind the scenes. So the CLI file is very simple, and run has all of the branching logic for kicking off the different types of runs I'm interested in.
And then the utils are the things that are actually running all of the code. One other point on the utils: I also wanted the LLM component to be very model agnostic. So right now I'm just using OpenAI, but the functions are very generic. It's basically build context and predict, and you can sub out any model relatively easily. So that was my initial structure. No tests, no anything like that.
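A rough sketch of the model-agnostic layer Michael describes. The names build context and predict come from his description, but the signatures and the context format here are guesses, not the project's actual code; in the real project the context dict would come from a versioned YAML file (e.g. via yaml.safe_load):

```python
# Hypothetical sketch: build_context assembles the prompt from a context
# definition, and predict is the only function that knows how to call a
# concrete provider (OpenAI or anything else).

# Stand-in for what would be loaded from a versioned YAML context file.
CONTEXT = {
    "style": "google",
    "instructions": "Convert the Args and Returns sections to Google style.",
}

def build_context(docstring: str, context: dict = CONTEXT) -> str:
    # Prompt = instructions + target style + the docstring to convert.
    return f"{context['instructions']}\nTarget style: {context['style']}\n\n{docstring}"

def predict(prompt: str, model_fn) -> str:
    # model_fn is any callable str -> str, so swapping providers
    # (OpenAI, a local model, a gateway route) is a one-line change.
    return model_fn(prompt)

# With a stub model, the plumbing runs end to end without any API key:
stub_model = lambda p: "Args:\n    a: The first operand."
print(predict(build_context('"""Adds a and b."""'), stub_model))
```

Keeping the provider behind a single callable is what makes the "sub out any model relatively easily" claim work.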
Ben, what are your thoughts so far?
So you talked a lot about your code structure and code architecture, about a very Python-focused directory structure of the actual repository. And my question back to you, how do you think I would have started with that?
Let's see how well you know me from seeing some of the stuff that I've built.
Um, I think you would still be in the get shit done scripting phase.
Maybe not pure scripting, but I would have my prototype open on one side, which would be a script, to just get one prompt working end to end. I don't care how good the results are from GPT-4. It's more like, I need to connect to it. I would probably use the MLflow AI Gateway, by the way, to make that request, so that I don't have to worry about
that key going anywhere; it's safely stored on my system. And I would just have the requests module connecting to that. I build that code, and I'm planning on response code 200 for everything that I'm doing, so I'm not writing error handling or retry logic or anything like that. I'm just saying requests.post: here's my payload, here's the URL that I'm going to be hitting. And
get the response, parse the response so that I'm just getting the content that I want. I don't want the rest of the metadata; I don't care about that right now. I might later on, but not at this moment. And that would be that script. So, end to end: can I get this so that I have a return value that I can put somewhere for eval? The first pass that I would do is probably just overwrite the file that I'm converting, on a branch. Make sure I'm not on the master branch.
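Ben's happy-path script might look something like this. The gateway URL, route name, and payload shape are placeholders (the real MLflow AI Gateway schema may differ), and, as he says, there is deliberately no retry logic or error handling. He mentions the requests library; the stdlib urllib is used here just to keep the sketch dependency-free.

```python
import json
import urllib.request

# Placeholder route; a real MLflow AI Gateway endpoint would differ.
GATEWAY_URL = "http://localhost:5000/gateway/docstring-converter/invocations"

def build_payload(docstring: str) -> dict:
    return {"prompt": f"Convert this docstring to Google style:\n\n{docstring}"}

def extract_content(body: dict) -> str:
    # Keep only the generated text; ignore the rest of the metadata.
    return body["candidates"][0]["text"]

def convert(docstring: str) -> str:
    # Happy path only: we assume a 200 and a well-formed JSON body.
    req = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(build_payload(docstring)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # no retries, no error handling
        return extract_content(json.load(resp))
```

The point of a spike script like this is exactly what Ben says next: run it on a branch, write the result in place, and eyeball the diff.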
When I run this, replace in place, and then I'd probably write a very simple command line instruction through subprocess that actually files a PR on my behalf, so that I can use GitHub's diff functionality to just visualize the differences. And maybe I hard code something into the title, like [DO NOT MERGE] or [WIP], so that other maintainers would know, hey,
Ben's testing something out right now. And I wouldn't file that against the MLflow repo; I'd file it against my own fork, just so that I can look at stuff. I'd probably mess that up the first time and close that branch, and iterate three or four times until I had something that was working. At that point, it's demo time to show everybody else on the team: hey, this is the state of it right now.
What do you love? What do you hate? What do we think about it? I'd probably file like 10 PRs as demonstrations of different modules that have different kinds of docstrings. For anybody familiar with MLflow, we'd probably do something like the pmdarima flavor or some other flavor that has arguably too-verbose docs in it. And then I'd find another example where the docs are really lean,
and there's not a lot of content, and it's probably missing some args definitions. See what it does with that. And then try out something that is sort of a framework API, like the client API or the fluent API. Those are massive blocks of text that have examples embedded in them, and there's a lot of stuff in those. So I would get a good sampling of all the different docstrings that
this thing would potentially be touching. And then I'd probably also try a little bonus round where I fire up a very simple attempt, a v0 of a context prompt, that would say: read the code from this function or this method and generate a docstring. And I want to see how well it does that.
Like, does it get enough context of the code? And is it even remotely correct style-wise and content-wise? And then I'd have all of those as examples. Show the demo. And if everybody's like, yeah, this is cool. Let's do this. That's when I go back and start thinking about, OK, what do I want the code to actually look like? And if I'm doing just a basic CLI, it would be very similar to what you did.
I would define the entry point in an __init__, and that's the entry point to this application, so I could run it as a Python executable. Or I'd also create the click interface, where I have a CLI definition that says, okay, here are my options; here's the file path that I want to use and the name of the method. And then start building all of that out. And those would be two separate modules: one's the __init__ and the other one's the main process, or CLI.
And then from there, when I'm looking at the script and what exists, I look at functionality. I'd say, okay, we have this thing that's calling out to the service's REST API. That's a module. It's going to be, you know, the requests module for you. Not necessarily a utils thing, but just the thing that calls out. So that's going to go into its own file. And then we have something that is dealing with parsing: going through a Python text file and extracting the actual docstring itself. That would be in the parsing or the parser directory, and there would be a module in there. I would always just start with a single module. I'm not the biggest fan of writing a crapload of code in an __init__. I know that's up to other people's interpretation. Some people love doing that; some people think that's clean Python. Don't debate me.
I don't care. I just have my own preference: I like when I see an __init__ that basically just has imports in it, referencing what the public APIs are going to be that are exposed when you import that namespace, with no other real code in there. But it's totally valid to put classes and functions and tons of stuff in it; nothing wrong with it. But I would create things that I can just read when I'm going through the directory structure, like, oh, that's the request thing, or that's the parser, so I'm going to just open that up. And I'd start breaking things out into common utils if I have shared functions, and I usually try to stick with pure functions in utils, so that each thing stands on its own. It has no other references to anything else in that module. It's just a function.
You never want side effects in utils; that is a recipe for disaster. So if I'm using that util from multiple different submodules, it goes into a shared utils module or a series of utils modules. I mean, look at MLflow: how many utils modules do we have? One for each component of the code base, sometimes multiple ones for different components.
But I don't start there, because I don't know if I need a generic utils directory yet. You can always go back and move things, right? If you're in an IDE, it's pretty simple: select the method, right click, refactor, move. It'll find all your dependencies, and then hopefully you go and clean up anything that it missed. That's kind of the process: going through and
doing that rough blocking of where I want things to live. And it's not for efficiency of execution; the computer does not care where the files are. It's purely so that when I know I need to add something, I don't have to spend 30 seconds searching through the code base to find where I need to go. I can intuitively look at the directory structure and know,
oh, anything related to this code is going to live here. It's purely for humans, to make it clear. And then once all of that is roughly blocked out, that's when I start writing tests. Before I start fleshing out all the functionality I need for the requirements, I take that happy path that I did for the prototype and write a test that recreates it, and that end-to-end test would be an integration test. And sometimes I do start with that: hey, what are all the things that this thing needs to do? Then, as I'm blocking that code out into different functions or into classes with methods, each time I write one of those, I write either a single test or a series of unit tests that validate the behavior. Particularly if I have error handling within the function.
I don't know, I just have a habit now: if I'm eating an exception and handling it in some way, I write two tests. One is, does this work the way I want? Validate that, and validate it with a bunch of data, all the different possible inputs you can think of that you would want to function correctly. And then another test would be parameterized with a whole bunch of conditions that you're expecting to fail and your exception handling to handle properly. If you approach testing functions in that way, with a test that it works and a test that it fails, you're never surprised when you have to modify the function later: hang on a second, I was expecting these 10 things to fail before, I just made this code change, and now two of them are passing. That's how you get insidious bugs that are really hard to think through. But having that set up on PR number one for a new feature covers you from unexpected things happening later on. And it's not just you; it's also whoever would be modifying something that they didn't originally build. You want to give them a nice
safety blanket underneath.
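Ben's paired pass/fail habit might look like this in pytest. The function under test here is a made-up example, not from the project:

```python
import pytest

def normalize_style(style: str) -> str:
    """Map a user-supplied style name to a canonical value, or raise."""
    styles = {"google", "numpy", "rest"}
    cleaned = style.strip().lower()
    if cleaned not in styles:
        raise ValueError(f"Unsupported docstring style: {style!r}")
    return cleaned

# Test 1: all the inputs you can think of that should work.
@pytest.mark.parametrize("raw, expected", [
    ("Google", "google"),
    ("  numpy ", "numpy"),
    ("REST", "rest"),
])
def test_normalize_style_valid(raw, expected):
    assert normalize_style(raw) == expected

# Test 2: a parameterized batch of inputs you expect to fail, so a future
# change that suddenly makes one of them pass is caught immediately.
@pytest.mark.parametrize("bad", ["epytext", "", "markdown"])
def test_normalize_style_invalid(bad):
    with pytest.raises(ValueError):
        normalize_style(bad)
```

The second test is the safety blanket: if someone later loosens the validation, the "expected to fail" cases start passing and the suite flags it.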
Yeah, so a lot of stuff just discussed. Let me explain how I actually went about writing this code, because what Ben just gave you is the right way, and I did not do that. So steps zero and one were: define the problem, and fail fast. I feel like I'm pretty good at that. Coming from an applied data science background, it's really important to define the problem well, and then go through and prototype your components, so that if it's going to fail, you know that as early as possible.
The next thing that I did is I over-engineered the living crap out of the solution. Ben said: build an MVP, make sure it works end to end. And I immediately jumped to building an entire directory structure, thought about modularization, thought about the use cases, thought about how, if we open source it, other people could use it and contribute their own context. And I think the reason that I did that, and Ben, if you have psychological insights, hit me up, is that I was just feeling like I didn't quite know what the end product would look like. So I was taking a stab in the dark, basically, and building something where I was like, yeah, let's see if this works. Because, frankly, I didn't have a clear picture. I don't know the difference between __init__ entry points and click entry points, so if I were to go build those different entry points, I wouldn't know what they look like without actually going and prototyping.
So because I don't have this world of software knowledge around me, I just built something. Was it the cleanest and simplest thing? Clearly not. And I think simplicity of prototypes actually shows an extremely high level of understanding of the problem, because you can pinpoint exactly what you're uncertain of. For example, with the dictionary substitution design, getting the docstrings and seeing if the LLM would convert them, I have a very clear understanding of how everything around that works. But context handling? Not so much. Tests? Not so much. Entry points for users? Not so much. So if you see yourself over-engineering, it might not be a bad thing. I mean, you shouldn't be doing it, but it's a learning experience, and I'm now crystal clear on that design. I know when I will use it and when I won't. So Ben, in your career, as you were learning things, did you do something similar?
Oh, dozens of times, maybe hundreds. Anything that I ever built as a data scientist early on was always like that. It was a product of the time, a period when companies were hiring people into this role and just saying: we have this product problem that we need solved. Can you do something to solve this
by predicting this thing? You had that product guidance, but no experienced software engineers around, certainly not on the team, and the ones at the company are usually very, very busy.
You don't always have somebody to ask, hey, am I going about this process correctly? And even for teams that are full of data scientists, the process of work that you do is very much like waterfall development, where your output is either good for the business or it's not good for the business, or it's somewhere in between and it needs some tweaking and fixing. But in general, you're not
shipping the bare minimum for your end users to use. Because the bare minimum would be: does the code execute? Can I build a model and have it served? Sure, that's all the engineering piping that you have to do for an ML project to get it to actually do something. But when you hit that endpoint to get inference, if it's a binary classifier and it's just returning ones regardless of the data you're passing it, that's not a very good solution. The model sucks. So the process of making the model actually work well lends itself to the waterfall approach, which is: build everything that I need. Even if you're doing it in what, in your head, is an agile approach, like, okay, I'm going to start simple and iterate and get it to a point where it works pretty well,
That's still weeks of work usually, of iterating and improving. And you could be using, you know, agile software development where you have a product design document and you're going through, you know, sprint planning and you're chunking up the work over story points over a series of sprints. By the time anybody sees it, it's not a prototype. It's a product that you've built.
So in the process of that, you're thinking about all the things that aren't on that requirements doc that the business gave you. You're thinking of, well, we have these other datasets and we can use these. In software engineering, you would never do something like that because you don't know if that's needed yet. What you do know is here's the minimum product that I need to build because here's the list of requirements that I went through. It needs to do A, B, C, D, E.
it would be cool if it did F, G, H, and I, and it really shouldn't do J, K, all the way through Z. So that's why we go through that process: to know what are we building, why are we building it, what aren't we building, and what would be cool to build because people asked for it, if we have time. That's the MoSCoW method that's used on all design docs, at least internally at Databricks. And
we build that minimum product and then we release it. And then we ask people: hey, does this work for you? Do we always get the answer of, yes, amazing? No. That's why we do private previews. We get people's feedback, and sometimes people are like, this sucks, man. This does not solve our problem at all. In fact, this is just going to create a problem for us. Okay, we now have all of the data that we needed
that we couldn't get to begin with, because even if you ask somebody up front, they're just guessing. They don't know until they see it and start using it. And then you go and fix that. That's one of the reasons why keeping code design very simple is so important, code design and project design, because if you're not doing an MVP, if you're not building the minimum thing that you need, then when you get that feedback, all the extra crap you built is just
so much stuff that you have to change now. Now, if you have that initial prototype that has 10 modules and 6,500 lines of code in it, and that's the MVP, that's just the bare minimum and you get feedback that says, actually the APIs need to change or, hey, this actually doesn't work at scale. We need to implement.
asynchronous processing or something, or, hey, the service that we're using just doesn't work, then changing those 6,500 lines of code is probably a week's worth of work, plus rewriting tests and getting everything working again. But suppose we had built everything we thought people would want: all of the "could be cool if" items, and even the things we said we will not touch because they're out of scope, everything listed under that MoSCoW method. Those were good ideas, but we explicitly said we're not touching them because of the risk associated with them. That now becomes 50,000 lines of code that takes an additional six weeks to implement. You ship it out there, and you find that, oh no, I've got 66,000 lines of code that I need to update because the core principle of this needed to change.
That's why that's so dangerous.
Yeah, I've built one too many products like that. Um, but I want to bring us, we've talked a lot about the past and the best practice, into the present of this project. Currently the project does docstring substitution essentially perfectly. It'll take a docstring, it'll feed that to an LLM with whatever context you pass, and it'll handle relatively complex context. And then it will return that
as a docstring substituted in, and you can either overwrite the file, write it to a new file, or run it five times and compare with prior runs to see if the LLM is producing consistent results. Now, with all that, I'm still seeing some stochastic behavior in the output of the LLM. Sometimes it will perfectly convert arguments and returns, which is the only thing the prompt asks it to do.
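The core loop of a tool like this might look roughly as follows. The function names, prompt wording, and the `call_llm` parameter are all my assumptions for illustration, not the actual project code:

```python
def convert_docstring(docstring: str, context: str, call_llm) -> str:
    """Ask an LLM to rewrite one docstring. call_llm is any text-in, text-out function."""
    prompt = (
        "Convert the following docstring to Google style.\n\n"
        f"Context:\n{context}\n\n"
        f"Docstring:\n{docstring}"
    )
    return call_llm(prompt)


def is_consistent(docstring: str, context: str, call_llm, runs: int = 5) -> bool:
    """Run the conversion several times and report whether every run agreed."""
    outputs = {convert_docstring(docstring, context, call_llm) for _ in range(runs)}
    return len(outputs) == 1
```

Collapsing the outputs into a set is a cheap way to surface exactly the run-to-run inconsistency Michael describes next.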
Other times it'll add a method description. Other times it will add a little code example. Other times it will produce just random crap. So Ben, especially with LLMs, since everybody's using them these days: in your experience, how would you go about reining in this stochastic nature so you get a more reliable result, and humans need to do less review at the end of all this?
So one of the things that you see when you're interacting with ChatGPT on the UI, and other bots of that nature, is something happening behind the scenes that users aren't really aware of, which is that continuation, where you submit an instruction set to it, like, hey, I want you to do this thing. And you've seen some of the results of my
playing around while testing, just asking it to do crazy things. And when you send your next response to correct it, saying, hey, that's not really what I was looking for, I think as a user you assume that it's storing memory somewhere. Like the model itself understands who you are and something about what it was just asked in the past. And that is not how it works.
It's the actual website that you're interacting with that is storing context, and you can see it on the left-hand side of the page: the chat history for each of those sessions. So it's pulling up what you asked, what it generated, and then what the next response was. And it'll go back so many iterations of that to maintain a session. But when it submits the request to the actual model, it's giving it those system
and assistant, then user, then assistant, then user messages, the whole back and forth. It's got that chat history with each request that it's processing. So it knows, oh, this human asked me to do this thing and I generated X, and then the human said that X sucks, so I generated X sub one. And then the human said, well, yeah, that's better, but X sub two should be
slightly different, like it should change in a little way. And then it generates Y, and the human either stopped responding or asked it to do another thing in that style. So that context is what the model is using to understand which statistically probable completions it should be selecting in its responses.
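In OpenAI-style chat APIs, that back and forth is literally just a list of role-tagged messages resent with every request. A sketch with made-up message content:

```python
# The "memory" is nothing more than this list, replayed on every call.
chat_history = [
    {"role": "system", "content": "You convert docstrings to Google style."},
    {"role": "user", "content": "Convert this docstring."},     # the ask
    {"role": "assistant", "content": "Here is attempt X ..."},  # generated X
    {"role": "user", "content": "X is wrong; try again."},      # the correction
    {"role": "assistant", "content": "Here is attempt X1 ..."}, # generated X sub one
]


def next_request(history, new_user_message):
    """Each new turn appends one message; the WHOLE history is submitted again."""
    return history + [{"role": "user", "content": new_user_message}]
```

Nothing persists inside the model between calls; drop the list and the "memory" is gone.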
If you're just firing off to the API directly with your own context and you need a correction to be made, the key is to provide those corrections in the prompt engineering: give it the user example, then an assistant response that gets it a little bit wrong, then the next user message, and then an assistant response that gets it correct. That's your entire prompt context, to say, hey,
I told you to do this, you did it slightly off, and then I told you, actually, I need it like this. That's one technique. Another technique is to explicitly show it what it is that you're trying to do. So you would go in and manually do it correctly one time, then show the assistant response being slightly messed up, and then do another correction on top of that, so that it understands, okay, it needs to
select the next-word probabilities that adhere to that restriction, to that correctness constraint. Because without that, it doesn't have a brain. It doesn't have any sort of short-term or long-term memory other than the text history of prompts feeding into its input.
You know, without all that, it's not going to remember what you're trying to tell it to do.
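Done statelessly, that technique is a few-shot prompt: the request itself carries a worked example of the mistake and its fix before the real task. Everything below is fabricated illustration, not the actual MLflow tooling:

```python
# Few-shot correction prompt: demonstrate one wrong attempt and its fix,
# so a single stateless API call carries the "lesson" with it.
few_shot = [
    {"role": "system", "content": "Convert docstrings to Google style. Output only the docstring."},
    # Worked example: the instruction ...
    {"role": "user", "content": "Convert: ':param x: the input'"},
    # ... a deliberately flawed response (it invented an Example section) ...
    {"role": "assistant", "content": "Args:\n    x: the input\n\nExample:\n    >>> f(1)"},
    # ... the explicit correction ...
    {"role": "user", "content": "Do not add an Example section. Output only the Args block."},
    {"role": "assistant", "content": "Args:\n    x: the input"},
    # The real task comes last and inherits the demonstrated constraint.
    {"role": "user", "content": "Convert: ':param y: the output'"},
]
```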
Right. Yeah. So you sort of read my mind. The steps that I took were: start off with just saying, hey, convert this to Google style. The next thing that I did was add some direct and concise commands, such as line length should not be greater than a hundred characters, and other things like that. The next thing I did was go to the Google style documentation and actually just copy and paste their examples and their instructions into the prompt.
And then the final thing that I'm going to do, which I'm going to try to build maybe over the weekend or early next week, is self-validation. So essentially say, hey, we've already gone through one interaction. Can you validate that your output is the correct output for the given prompt? And if not, please correct it. And all of that is handled completely in an automated fashion, so this will scale really nicely; users don't actually have to go interact. And I think that'll work pretty well. But again,
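That self-validation pass could be sketched as a small retry loop. The `call_llm` parameter and the prompt wording are assumptions, and a real version would need a more robust way to detect a "valid" verdict than exact string matching:

```python
def self_validate(converted: str, call_llm, max_rounds: int = 2) -> str:
    """Ask the model to check its own output, accepting a correction if offered."""
    for _ in range(max_rounds):
        verdict = call_llm(
            "Is the following docstring valid Google style, converting only "
            "Args and Returns? Reply VALID, or reply with a corrected version.\n\n"
            + converted
        )
        if verdict.strip() == "VALID":
            return converted   # model signed off; keep the current text
        converted = verdict    # treat any other reply as a correction
    return converted
```

Capping the rounds matters: without `max_rounds`, a model that never says VALID would loop forever.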
the stochastic nature of the output makes it kind of hard when you have such a defined structure in something like a docstring, or even something like code. So that's what's gonna happen next. I'm sure we'll be talking about it in future episodes at some point, but yeah, I think we're right about at time. Ben, anything else that you can think of before I wrap?
Uh, no, I think that was a pretty good explanation of a real-world use case applying data science technology to software engineering, and how these two worlds can coexist in a nice way.
Yeah, precisely. And the purpose of this case study is also that if you want to go build your own docstring converter, feel free. But there are a lot of lessons throughout this process, both with LLMs and for projects in general, that are hopefully beneficial, so I will recap some of them. The general process that both Ben and I subscribe to, or at least try to subscribe to: step zero is define the problem, scope the issues, scope the requirements. Step one is fail fast.
The way that I typically do this is I get a list of blockers and iterate through them, either doing the things that are the least amount of work to validate, or tackling the biggest blockers, the ones with the highest probability of completely destroying the solution. Once you have prototypes of all of the blockers, then you can go and build an end-to-end MVP. In this example, I did not do that. I over-engineered, I jumped the gun, I started writing tests. I should have been more concise in my implementation, but
On the flip side of that, I'm learning a lot. Step three, you can start building out your code structure after that MVP. And then step four, write tests and make sure that it's extensible and valid for other users. And one really cool point: for error-handling test cases, test that it works, and then also have an inverse test that verifies it properly raises exceptions. So yeah, I think that's everything. Ben, anything else on your mind?
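That inverse-test idea, shown with plain assertions (in a real suite you'd reach for `pytest.raises`); `convert` here is a toy stand-in, not the project's actual converter:

```python
def convert(docstring: str) -> str:
    """Toy stand-in for the real converter: rejects empty input."""
    if not docstring:
        raise ValueError("empty docstring")
    return docstring.strip()

# Positive test: the function works on valid input.
assert convert("  :param x: value  ") == ":param x: value"

# Inverse test: the failure mode is exactly the one we expect.
try:
    convert("")
except ValueError:
    handled = True
else:
    handled = False
assert handled
```

Pairing every happy-path test with a failure-path test catches the common bug where code silently swallows or mislabels errors.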
No, that was a good recap.
Cool. Sweet. All right, well until next time, it's been Michael Burke and my co-host. And have a good day, everyone.
Ben Wilson. See you next time.