MICHAEL_BERK:
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I do data engineering and machine learning at Databricks, and I'm joined by my co-host.
BEN_WILSON:
Wilson. These days I do a lot of stuff with LLMs for Databricks.
MICHAEL_BERK:
Yeah, we love hype trains, so that's what we work on. So today is going to be a really cool episode. Hopefully it will be very actionable for all the listeners out there, and our goal is to start with, essentially, a case study on how you can learn new things. The initial place that we're all coming from is: let's say we have cursory knowledge about some topic. So you've heard of linear regression or you've heard of machine learning, and you would like to actually build a production-grade implementation. So we're going to go through a case study that Ben just completed, and we're going to hopefully learn some actionable tips along the way. So Ben, what is this case study that I have been alluding to?
BEN_WILSON:
So I think it was about five weeks ago, maybe six weeks ago, we got a directive from some very smart people at Databricks in engineering who said, hey, we really need to focus on large language models and supporting them in MLflow. And there were three main tracks that we were going to go down. So three different people, actually four different people, got assigned different components of things that we were going to focus on in preparation for our next big release. And I got tasked with: hey, there's this library that a lot of people are using, and a lot of people have asked for us to build an integration with. We didn't have time in the past, but now there are a lot of people talking about it, and we think it'll be super important and useful to people who use MLflow to have support for the Hugging Face Transformers library. And I, as I often do when somebody asks if I'm up for doing pretty much anything for development work, was like, sign me up, this will be exciting. And in the back of my mind, I'm like, yeah, I know nothing about this library. I know what it is: sort of an abstraction layer around PyTorch and TensorFlow models, focused primarily on NLP and large language models. And I knew what transformers were from a basic architecture standpoint. I've never hand-built one before, but I have read papers about them, and we talked about it on the podcast like a year ago, I think. So I had some amount of information, but no firsthand, down-and-dirty experience, the kind where I've used these APIs, I'm familiar with these APIs, I know what they can and can't do. No idea. So that was the setup for it. And they said, hey, we need the implementation done in a month. Get it
MICHAEL_BERK:
Yeah.
BEN_WILSON:
done.
MICHAEL_BERK:
Yeah. And hopefully this should be relatable. I've been given similar objectives from management or whoever, where they say, you know, that thing that you've heard of, that you roughly might know something about? Well, we need that in production in two weeks, get to it. Now, just rolling it back a little bit and sort of understanding the use case: we all should probably know what transformers are, and LLMs. If not, feel free to Google or listen to a prior episode. But why is it valuable to have the Transformers flavor in MLflow? And also, what even is a flavor?
BEN_WILSON:
Excellent question. So a flavor in MLflow is a named wrapper around a particular framework. So we have stuff like TensorFlow, PyTorch, Prophet, scikit-learn, XGBoost. When you want to interface with one of those libraries and be able to basically save or log a serialized format of that model, it gives you the ability to do that and register it to the tracking server in a way that you have that linkage: okay, I went through a training evolution on, you know, my XGBoost model, and I want to log the parameters that were used during training and the metrics, and then some additional metadata in the form of tags to explain what the heck this thing is for. You can log that actual XGBoost artifact along with those metrics and parameters into MLflow, and you don't have to do something like craft a custom implementation for something that's not supported. MLflow handles that for you. It might have a custom serializer, or it might just call the serializer and deserializer from that library if they work really well. If it's fully featured, it might just be something like XGBoostModel.save that MLflow is calling to save it. Or it might be significantly more complex; maybe that API doesn't really work, or it doesn't fully support all of the information that you need to be able to serve that saved and trained model. So we'll do stuff on the MLflow side to make your life easier, so that it doesn't require a lot of boilerplate code for a data scientist to write, and it doesn't require an ML engineer to take a data scientist's artifact and write a whole bunch of serving logic around it just so that they can push that to real-time serving or use it for batch inference. So that's kind of what flavors are. The reason for the Transformers flavor, and this is the Transformers library from Hugging Face: that library does something kind of similar to what MLflow does with models, but it's doing it with extremely, extremely powerful pre-trained large language models. There are other types of models that it handles as well, but it gives this abstraction layer that simplifies the process of using and fine-tuning a model that might be prohibitively expensive for you to train from scratch, or just prohibitively complex to build from scratch. A lot of these models come from very large research institutions or companies that have research arms within them. They can take a lot of time, effort, knowledge, and money to train for the first time. So Hugging Face, at its core, is this hub that allows these institutions to share those models with the world and say, hey, you can use this for whatever you want, and you can retrain it a little bit and make it custom-suited for your use case and your project. And it's this great abstraction layer around doing stuff like inference. Instead of having to write raw PyTorch code or raw TensorFlow or Keras code in order to interface with that neural network that's been saved, it really is like one or two lines of code. It's really great what they've built. I'm a huge fan of the library and of the engineers that have built it. So why it's important for us to interface with it: it's really popular and a lot of people have asked for it. And we firmly believe that, with this recent resurgence of interest around large language models in the last six months or so that was started by ChatGPT
and its release, people are like, wow, these things are way more advanced than we thought. And the people that had been using them were kind of like, yeah, we know. That's why we asked for this feature a year and a half or two years ago. But now there's enough momentum behind all of this, and a lot of sort of new blood coming in saying, these are really important, we believe in these now, we want to actually use these to build things, we think it's going to be worth our time and money to do this. We wanted to make it as easy as possible for the world in general to utilize all the tracking capabilities and deployment capabilities of MLflow to support this very popular, and increasingly popular, library.
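A minimal sketch of the tracking workflow Ben describes, using the standard MLflow APIs and the built-in XGBoost flavor; the dataset, parameters, and tag values are just placeholders:

```python
import mlflow
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"max_depth": 4, "n_estimators": 200, "learning_rate": 0.1}
model = xgb.XGBRegressor(**params).fit(X_train, y_train)

with mlflow.start_run():
    # The parameters and metrics used during training...
    mlflow.log_params(params)
    mlflow.log_metric("test_mse", mean_squared_error(y_test, model.predict(X_test)))
    # ...tags explaining what the heck this thing is for...
    mlflow.set_tags({"team": "forecasting", "purpose": "demo"})
    # ...and the actual XGBoost artifact, logged via the flavor so MLflow
    # handles the serialization and deserialization for you.
    mlflow.xgboost.log_model(model, artifact_path="model")
```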
MICHAEL_BERK:
Cool, so the MLflow flavor will sort of reduce the amount of code that users need to write each time they run a transformer model. And I'm going to actually be testing this out, because right now I'm working on a time series forecasting model that's transformer-based. So it'll be fun to explore this flavor. I think it came out like three weeks ago, right?
BEN_WILSON:
our first release and then we had a patch release because I broke something. And there are some
MICHAEL_BERK:
Yes.
BEN_WILSON:
other things that were broken in the release. And then we're also releasing sometime before Monday the next version, which is an additional suite of features predominantly for this flavor to add additional functionality.
MICHAEL_BERK:
Exciting. So yeah, listeners stay tuned, but, um, great. So we sort of created this setup. We know that, uh, the transformer flavor will be valuable. You know what it is. We know why we're doing it. And so Mr. Manager comes over to Ben and says, Hey, we need this. What is your next step?
BEN_WILSON:
Uh, so usually that conversation of, hey, we need to do this, is handled maybe a day or two before sprint planning. And, um, it's, hey, this is a big important thing, we need to get this done. But before we start working on anything, we start with the normal process. And that normal process is: your sprint starts and you have your assigned work. The way that we do it, and the way that I personally do it, is we set aside time for developing basically a product document, you know, a PRD. And part of that first phase is to understand the scope of what it is that we're trying to do. For something like a new flavor in MLflow, we're familiar with what that needs. It needs to be able to save an artifact, load an artifact, and it needs to be able to be wrapped up in a Python function. We call them pyfunc flavors in MLflow, and that permits you to basically use it in a way that's deployable anywhere. So you don't need to have some sort of bespoke custom implementation or infrastructure in place. Everything should be encapsulated in such a way that you can containerize this artifact, and it exposes a single endpoint, a predict function or predict method. That predict method accepts some certain form of input data, and it will return output data after it passes through that model's architecture, or in the case of transformers, the pipeline architecture. So we know what those core things are, and there's a lot more to a flavor than what I just said, but it's something that we already understand. We've built all of the things that need to be incorporated into that, so we're not focused on those details. We're focused more on: what about this particular flavor is going to be different than what we've done in the past? What do we have to think about? And what decisions need to be made about what we should, shouldn't, or could potentially implement?
MICHAEL_BERK:
And just for con...
BEN_WILSON:
That's really what the PRD is for.
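As a rough sketch of the surface area Ben is describing, a named flavor module generally exposes save/log/load functions plus a pyfunc wrapper with a single predict method. The names and signatures below follow common MLflow flavor conventions but are illustrative, not the actual mlflow.transformers source:

```python
# Illustrative skeleton of an MLflow flavor module (hypothetical names, not real code).

FLAVOR_NAME = "my_flavor"  # hypothetical flavor name

def save_model(model, path, **kwargs):
    """Serialize the model plus enough metadata to reload it from `path`."""
    ...

def log_model(model, artifact_path, **kwargs):
    """save_model() plus registering the artifact against the active MLflow run."""
    ...

def load_model(model_uri):
    """Restore the native model object from a saved artifact."""
    ...

class _ModelWrapper:
    """Pyfunc wrapper: deployment funnels everything through one predict() method."""

    def __init__(self, model):
        self.model = model

    def predict(self, model_input):
        # Accepts the supported input formats and runs them through the model/pipeline.
        return self.model(model_input)

def _load_pyfunc(path):
    """Hook the pyfunc machinery uses to build the wrapper when serving the artifact."""
    return _ModelWrapper(load_model(path))
```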
MICHAEL_BERK:
Yeah, and for context, about how many flavors exist in MLflow right now?
BEN_WILSON:
Of named flavors, a lot. I think it's like 15 or so, maybe 16.
MICHAEL_BERK:
Cool. So this is a very scoped process and we sort of know what features are gonna be implemented. And so the intricacy comes from the transformer aspect itself. So what were some of the challenges that you were anticipating specifically for transformers that would be different from all other flavors?
BEN_WILSON:
Yeah. So the thing that we knew going into this, and how we scoped even the research phase, was that MLflow doesn't have any other flavors that support input of strings. For every other ML model that's out there, it's, you know, either a data frame that you're passing in, like you were talking about with scikit-learn, XGBoost, CatBoost, any of that stuff; that's a pandas DataFrame that goes in and it'll do inference. Or if we're talking about TensorFlow or PyTorch, it's a tensor that's coming into it, something that's gone through an encoding phase before it gets to the model. So those are sort of known things. For LLMs and transformers, we knew that we needed to think about: how do we pass in a string, or a list of strings? How do we pass a list of dictionaries of strings? What are all the different input types that we might have to think about in order to support this abstract wrapper around the transformers flavor? So, knowing that we have an idea of what we don't know the answers to, in that first four hours of the design review that we're getting into, I just looked at their APIs and did a really quick test. I went directly to their documentation, went into the pipelines APIs, and said, how many different pipelines do we have here? And I started reading through just their docs, not writing any code or anything, just seeing what's documented. And then every time there was a link to their GitHub repository, I'd click on that real quick, do a quick search to find the output and input types that we're expecting, and just write that down in a note file. Like, hey, this pipeline, here's the input and output; this one, here's the input and output. Just to have a feel for what are all the options that we have to think about here. And then the next phase, immediately after that, the second half of that first day, was: all right, now I need to verify things myself. Not that their docs aren't correct or anything, their docs are really good, but not all libraries are like that. So to prevent me from having to redo something later on because I didn't do due diligence, I'll take all of their examples for our main point of interface with their library and run them. So set up a test environment and just start copy-pasting code, making sure, like, does the example run? Okay, I'm going to play around with it. Here are some inputs, here are some outputs. And then I'm just jotting down notes for myself while I'm doing that, saying, like, wow, I have no idea how this part works, but that's cool. Or, hey, why does this do this? But when I do it this other way, it does something else. So any of those gotchas that sort of break my mental model of what it is that I'm interfacing with, I make sure to write those down as: hey, this is not intuitive and I need to think about this. And at the end of that first day, I'm done. I've done my first part of the research. I probably have some idea of what I need to do the next day. And then the next day I read through all of my notes again to say, okay, here's what I was thinking yesterday. Do all of these points still make sense to me? If they don't, I make notes on my notes and say, I don't know what you were thinking yesterday, Ben, but this makes no sense. Like, what the hell is this? And if my notes on something that I didn't understand aren't clear enough, that means this is potentially very confusing and not something that
is intuitively explained just by a note on a complex topic, which means I have to pay special attention to it when thinking through the implementation. And I'll focus on those parts in my prototype code. Like, hey, I don't know why this particular pipeline takes in either a dictionary or a pandas DataFrame as a supplementary dataset for doing table question answering. And why is it formatted this way? Why does it not take a dictionary for the inputs, but takes a string where we're encoding the keys right before a colon? This is super weird. Am I missing something here? And I start realizing, oh, we need to make some decisions here, some design decisions. Seeing how all of these nuances play out, we start crafting our first set of requirements for what data we're going to allow users to use. This is for the pyfunc stuff, like, you know, real-time serving. And we start realizing that we need to be able to accept strings, lists of strings, dictionaries, and lists of dictionaries that contain strings. And that's what made this flavor a little bit more complex: we're not just creating a flavor, we have to change the implementation of the serving layer and how pyfunc does schema inference and schema validation. So we knew that part of this implementation would require changes to core MLflow, like our backend, you know, version of stuff, which increases the scope of what we need to do. But this is critical stuff. You can't actually release a flavor that doesn't allow people to send data to it, or that just throws exceptions. So that process for the design aspect means we're creating sort of a field of options for other reviewers to look at. And we say, hey, here are three different ways that we can handle this. We could force everybody to use this, and this will be super simple for us to implement, but users are probably going to hate this because
it's such a departure from the library. Or we could do option B, which is sort of the middle ground of saying, okay, our implementation
complexity will increase because we're supporting more of these types, but it's also more similar. And there's just these couple of edge cases that maybe we can't implement right away. We'll have to think about it later on. And then option C is, no, we're going to match exactly what the library's APIs are, but going along with that is, hey, this is going to take us like 10 weeks to implement because we have to do all of this work to make sure that we can do data conversions to conform to how ML server works. So
MICHAEL_BERK:
Got it.
BEN_WILSON:
yeah, we do that for all of the different items: okay, here are the data input and output types that we have to validate, here are the options. The next thing is serialization. It turns out you can't save a pipeline; the API just doesn't exist. You can save models, you can save configurations, you can save all these other things, but the actual pipeline object itself, you can't save it. There's no API. So we know option A would be: people can only save the discrete components, and we're not going to use pipelines at all. The downside to that is we're probably going to have to write a lot more backend code for it, and it's going to make one part of the implementation much more complicated while making another part much simpler. So there's a trade-off there. Option B is: just make everybody use pipelines. That makes our implementation very easy, because we just have to write a serializer and a deserializer. But the downside of that is that not every use case that uses these models can use pipelines. They can be put into pipelines, but take something like encoders, where you want to get the encoding vector from input text, for instance. With most of those, if you just put that into a pipeline, you're going to get an arbitrary number of vectors or tensors that come out, based on the length or number of strings that are input. You basically put in two, you get back seven. With another model you might put in three sentences and get back 17. And it doesn't really make sense; you can change the length of the sentence and it's like, okay, it's always giving me back 17. And then this other one's always giving me back four vectors, but all of a fixed length. What's going on here? Once you look into the internals of encoders, into that implementation, you see why it does that, and that I need to be able to pool these together and average them, basically clamp down on a summation and an average over where the attention mask is, versus the raw output of the model itself in tensor format. So for use cases like that, okay, it's probably not going to work, and we're going to have to do option C, which is: we'd like people to use pipelines, and when they can use them, we're going to heavily promote that, because it's simpler for them and for us, and it just makes for a more robust implementation. But we also need to support component-level logging. So you pass in a dictionary of model, tokenizer, feature extractor, image processor, whatever it may be. That dictionary allows advanced users who want to use a specific thing to say: I don't care about PyTorch, I'm doing batch encoding transformations on bulk data, I just need the tokenizer and the model, and I need to create my own pooling function. So that's why we have to implement that one.
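To make the encoder point concrete, here is a small sketch using example checkpoints from the Hugging Face Hub (the model names are just illustrations): a pipeline takes plain strings and returns lists of dictionaries, while a raw encoder returns one vector per token, so you pool over the attention mask to get a single fixed-length embedding per input string.

```python
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

# A pipeline accepts plain strings and returns a list of dictionaries,
# exactly the kind of input/output the pyfunc schema work had to handle.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier(["MLflow now has a transformers flavor.", "Serialization was tricky."]))

# Raw encoder: the model emits one vector per token, so different inputs come
# back as different numbers of vectors. Mean-pool over the attention mask to
# clamp that down to one fixed-length embedding per sentence.
name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

encoded = tokenizer(["short sentence", "a much, much longer input sentence"],
                    padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state        # (batch, tokens, 384)

mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 384)
print(embeddings.shape)
```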
MICHAEL_BERK:
Cool, all right, I have like 700 questions. Let me try driving for a bit. All right, so going back to understanding scope, you talked about listing features and then you would go out and prototype sub-components of those features. Now, Ben is pretty well-versed in the MLflow world, but for someone who is really working on not just a new topic, but let's say a new structure. So by that I mean the topic here is transformers, the structure is MLflow flavors. What if both of those are new? What if you have to learn what you need to build in terms of features and what tooling you will be leveraging to build those features? How do you handle that sort of duality of unknown?
BEN_WILSON:
I mean, I think you would break it apart in the same way that I'm talking about with the transformers stuff. If you had asked me eight weeks ago, or last year when we were talking about this transformers library, if you were like, Ben, how do I do an encoder? I'd have said, I don't know, I've never tried it. I'm sure the APIs are wonderful, I hear that company is great. That would have been my response: I don't know. And if we had done that live and done a screen share, and we both sat down and were like, hey, let's figure out how to do this, it probably would have been really funny for a lot of people to watch that video. It would have made me laugh, seeing me screw up live on a live code share. But it would have been me making probably 30 mistakes, trying to read their docs and trying to grok what the hell is going on. Like, hey, I passed in this sentence, shouldn't it just give me back one vector? What the hell is going on? You know, click on 17 different links and then, oh, now I get it. Okay, so we need to average these. What's the syntax for that? Like, how do I do that in torch?
MICHAEL_BERK:
But making mistakes isn't the best advice. There's gotta be some underlying structure for like hypothesis testing and then answering these hypotheses that you use.
BEN_WILSON:
When I'm learning something new, I like to approach it, and this is from a tool-builder perspective. So if I'm learning a new library, and I've built a number of flavors for MLflow over the last two years, if I'm interfacing with it for the first time, I approach it as an end user would: someone who is not familiar with that library and certainly is not familiar with the MLflow way of integrating with it, because that hasn't been built yet; I'm working on that. By starting at it from a user perspective, it lets me feel the pain points of the API. Like, hey, what's it like to train a model with this? How does that work? What's it like if I need to use it for inference? How many inference APIs are there? Is it just one? Is it just predict? Or is it, with some of these time series libraries, like seven different APIs you can hit for generating a forecast? So I'll try all seven of those and see how they work. Do I need to generate data? Do they have an API for that? Or is this me brute-forcing, you know, continuation logic on a NumPy iterator? How do I do that efficiently? How would a user do that? What's the dumbest way I could do it? What's the smartest way I can figure out? And I'll try those things out and see if they work with the API, so I have a good mental model and example code to look at, even though it's just scratch code. And, I probably shouldn't say this out loud, but when I'm doing that stuff, I actually use Jupyter. And the reason
MICHAEL_BERK:
Oof.
BEN_WILSON:
I use Jupyter is because a lot of our users are using Jupyter. So I'll have a Jupyter notebook for each of the evaluations that I'm doing, and I'll interactively go through and write applied code for that library, just so I have a reference. I'm like, this is probably how a user is doing this. It's not demo code. It's not like a hello world for the library that already exists online. This is more of a: I'm going to take an actual dataset, and then I'm going to take this model, and I'm going to try to build it by hand. Is it going to be great? No. Is it going to be useful? No. I just need it to execute, and I need it to create a model at the end that I could then think about saving. But that's a reference for me to understand and sort of feel the pain of what it's like to work with that library, or feel the pleasure. It could be an amazing library with super sexy APIs where it's just like butter. Um, and
MICHAEL_BERK:
Got it.
BEN_WILSON:
there are some out there.
MICHAEL_BERK:
So you sort of play around with this and like put on your customer hat, try to be empathetic and try to do what they would do. Is that about right?
BEN_WILSON:
so that I can understand where things that you're not expecting might happen. And it also helps to expose what it's going to be like. So if I manually save something, what happens? What is the structure of that data? I'll go into where it was saved on disk and see: is this just a single monolithic file? Is this a pickle file that gets saved? Well, I hate pickle; it's not transferable across Python versions, so I think it sucks. Is it a cloudpickle file? Somewhat better, I guess, but still kind of sucks. Or is this custom? Like, did whoever wrote this library do their homework, put on their real software engineer hat, and write a true serializer? You know, look at how the weights and configurations are saved with a torch model or TensorFlow. That's serious work that somebody's going
MICHAEL_BERK:
Got it.
BEN_WILSON:
in there and doing that. So when I see stuff like that, I'm like, I don't have to worry about serialization at all; this is going to work. But if there's no save method, what are we going to do here? Okay, we need to look at it: I'm going to take that object, I'm going to do a vars() on it, I'm going to do a dir() on it, I'm going to see what the actual instance attributes are, what's important, what's irrelevant. And I'll make notes on all of that, saying, hey, we need to build a custom serializer; that's going to increase the scope by N number of days; we need to add extensive test validation for the serializer to make sure we're not dropping data or that we don't miss something. That's the reason I do all that stuff.
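A quick version of that introspection pass might look like the following, assuming a small sentiment pipeline as the object under inspection (the checkpoint name is just an example):

```python
from transformers import pipeline

pipe = pipeline("text-classification",
                model="distilbert-base-uncased-finetuned-sst-2-english")

# What would a custom serializer have to capture? Instance attributes first...
print(sorted(vars(pipe).keys()))   # e.g. model, tokenizer, framework, device, task, ...

# ...then the public API surface, to see which save/load hooks already exist
# on the object versus on its components.
print([name for name in dir(pipe) if not name.startswith("_")])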
MICHAEL_BERK:
Cool. So there's this exploratory phase where you try to find potential issues that you wouldn't know about unless you had actually built them in production, let's say, or at least built out some demos.
BEN_WILSON:
Yes.
MICHAEL_BERK:
Now, great. So we sort of have an understanding of the Transformers library at this point, and you know a lot about MLflow, and so you know what features are required in this step. But sort of going back to that original question: what if you don't know MLflow and you don't know transformers? How do you know what transformer components should fit into what things in MLflow?
BEN_WILSON:
I mean, if I personally didn't know either of those, I would be in trouble being
a member of the team that maintains MLflow. But if we're talking in a general sense where there are multiple libraries that have dependencies on one another, and I'm not sure how all that stuff works, transformers would be an example of that. There are libraries that wrap around transformers or that use transformers, like SetFit, for instance, which is an abstraction layer around the Transformers Trainer, and it's meant for a specific domain of large language models. And it's a super cool API. I had never seen SetFit. In fact, another person in engineering had done a prototype customer user journey of trying to train a model, and he was like, yeah, dude, this is totally broken. This doesn't work with our design. This was after we released the transformers flavor, actually. He's like, I can't save this, this doesn't work at all. And we're like, yeah, we don't have that explicitly stated as being supported, but we're going to look into it. So I had to look at both of those, like look at SetFit and say, how does this work? Yeah, I can see the demo code, and I'll run that and see how it works. But then I try to reproduce exactly the issue that he found, which is: what happens if I just try to save this with MLflow? What exception do I get? And I look at it and I'm like, wow, that's the worst exception I could have gotten.
'Cause it's like the catch-all of, I don't understand what you're trying to tell me to do with the implementation. And that meant that I created a feature branch and started working on it, and put basically binary-search elements within the implementation to say, what is the state? So basically entering debugger mode, setting stop points, and asking, what is the state of these objects right here? What does the transformers MLflow implementation think SetFit is doing right here? And I found out within like the first five minutes, I'm like, oh, that's why. So if we just do this and support this thing, then yeah, all of a sudden it started to make a little bit more sense.
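The "binary search with a debugger" pass is roughly this, a sketch with a hypothetical save function rather than MLflow's actual internals:

```python
# Sketch of dropping a breakpoint inside a suspect code path to inspect state.
# The function and variable names here are hypothetical, not MLflow's internals.

def save_model(transformers_model, path):
    breakpoint()  # pauses execution here and opens a pdb prompt
    # At the prompt: p type(transformers_model), p vars(transformers_model), etc.
    # Move the breakpoint earlier or later to bisect where the object stops
    # looking like what the implementation expects (e.g. a SetFit model vs. a pipeline).
    ...
```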
MICHAEL_BERK:
Yeah, you hit on a really valuable thing that I am currently learning the value of and starting to really, really appreciate, which is this magical tool called debug mode. It's so, so valuable for essentially understanding how existing stuff works. And so like going back to that question, if you don't know transformers and you don't know MLflow, well, to learn those things and specifically on the structure side, the MLflow side, best option is usually to ask. If someone's kind enough to volunteer 30 minutes of their time, that's really, really valuable to you. The second is look at other PRs that are similar. So MLflow has 15 flavors. You can go in and see what are people doing when they add new flavors. Maybe there's a contributing guide, that type of thing. So just leveraging examples. And then within that, if you actually go and run another flavor in debug mode, you can actually see what's being called and how. And so that's incredibly valuable for learning. how to contribute to a given structure. If you're working with a completely new structure, you're probably a little bit out of your element and should ask someone more senior to advise you on this. Does that sound about right?
BEN_WILSON:
Yeah, totally. And even if you're using these tools and not talking about contributing to them... it's not like there's this never-ending list of, hey, we have these 100 flavors that we want to implement and we just want people to contribute. It's not that. But if you're using this stuff and you don't grok it, which, newsflash, the people that build the stuff, we don't grok it either when we're talking about that day one or day zero of a new project, hey, we need to implement this thing. Like I said at the start of this, I'd never really used transformers. I didn't know the nuances of it. But I know what I'm expecting from it. I know, based on the mission statement of the package, which is, hey, we're trying to democratize large language models and deep learning, and we want to make it simpler for people to use these, so we make this Trainer abstraction to make it simpler to retrain these, and we do all these other things, like packaging code together and deploying it to a common repository that the world can use. I was like, man, what a great mission statement. And I have this mental model from them, from the fingers of the developers themselves, that says, this is what we built this thing for. So I have an expectation. And then, as an ex-user of tools like that and a current builder of frameworks, I know what I would expect, or what I would want to build if I were the one building this, and how I would want a user to interact with it. So I'll look at the APIs and see what their design is and see, like, do they have methods that allow me to do lifecycle management of something? And
MICHAEL_BERK:
Right.
BEN_WILSON:
then go learn those, just through the examples. Like, how do I save a model? Okay, I'll go to the API docs, look at it, go to the source code, look at it, and then write my own real quick and say, yep, I can save and load that; that works beautifully. What are all the other arguments here? Huh, there's a ton of them. Which ones are important? Oh, that one. Oh yeah, some of that data is in the model card. How do I get a model card? Okay, so that's how I fetch a model card. So it's going through and looking at what the public API interface for something is. And that gives you a really deep understanding of the design, the thought that went into how it was built, and what it can and can't do. And once you have that sort of mental model, with some example code along with it, it's a very fast and efficient way for you to understand what your tool can and can't do, so that you can say: hey, I can use this for my project; or I can use this for my project, but I also need to do some work myself to make it fit; or, option three, this is awesome, but it sucks for my project, so I'm going to use this other thing that I've done the same evaluation on. And as for evaluating an API: I understood the ins and outs of transformers in one day because I approached it in that way of saying, okay, for a pipeline that I'm going to build with these three components, I need to be able to see how it retrains, how I save and load this stuff, what my options are, how I change arguments, how I do hyperparameter tuning on one of these things. Just jamming out all of those different examples, I had that mental model of what these things are in that first day. Whereas
years ago, this would have taken me like two weeks of meticulously going through it. I probably would have said, well, I need to get really good data, and I would be thinking, how would I do a business presentation of a demo to my company if somebody was like, we need this thing to do this thing? So I'd do the data science thing where I'm like, hey, I can't let somebody see a crappy model that's just totally broken. But that's not learning the API. That's learning how to use the API to solve a problem. Just learning the API is: I don't care what garbage model I create. For a large language model, it could just, on every output, regardless of whatever I put into it, return 17 exclamation marks. Don't care. It's totally broken, totally unusable, but that model saved and it trained and it learned how to create 17 exclamation marks. So that lets me know what I can change and how it behaves, so I can prevent it from throwing an exception.
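The "exercise the public API" pass he describes might look like this; the checkpoint name is just an example, and the ModelCard helper comes from the huggingface_hub package:

```python
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import ModelCard

name = "distilbert-base-uncased"  # example checkpoint

# How do I save a model? Round-trip it through disk and confirm it reloads.
model = AutoModel.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
model.save_pretrained("./scratch-distilbert")      # weights + config.json
tokenizer.save_pretrained("./scratch-distilbert")  # vocab + tokenizer config
reloaded = AutoModel.from_pretrained("./scratch-distilbert")

# How do I fetch a model card? The hub client exposes it directly.
card = ModelCard.load(name)
print(card.data.to_dict())  # license, tags, datasets, etc. from the card metadata
```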
MICHAEL_BERK:
Right. Yeah, there's an art to asking the right questions. And if you lack artistic ability, maybe you can start off with a set of features that you know need to be implemented, and then ask questions for each, let's say, method in a class, or each component of said method, so that you know that whatever functionality you're building out is supported. Or, as Ben said, it's supported but needs some work, or it's not supported. Cool, so we've explored, and we have essentially a scope of what is supported for features and what is not. And then the next step you mentioned is sort of providing those options of design. So let's say we have 15 features, and 10 of them are perfectly supported, we got really, really lucky, but five of them need some additional design and decisions. How do you think about laying out design options, and what is essentially your loss metric or formula for saying, should we do this or should we not do this?
BEN_WILSON:
Uh, the general rule is: you don't make that decision. That's a peer review thing. If you're the person doing the implementation, you're going to be heavily biased by the knowledge that you gained while iterating, and that knowledge could be biased in a flawed way as well. You might not be thinking of alternatives in the way that you potentially could be. So to remove that bias from your decision making, you mentally remove yourself from the problem and you say, okay, what are my real options here? If it's super obvious, and it's not really something that you need to debate about or even open up for debate, don't give the options. Just say, this is what we're recommending to do. And if somebody has a problem, they can just say, hey, did you think of this, or is this going to be a problem? But for those five things that you were just mentioning, where it's, hey, we need to do something here because this doesn't work, what should we do? Generally, three options that are reasonable are good. It's also good to add in one or maybe two additional options that are kind of ridiculous, right? For example, when we're talking about the serializer, one of those options could be: we need to build an entire serialization framework for this library and basically do that implementation for all permutations and make it abstract. There's a reason the people maintaining transformers didn't do that to begin with: it's an extremely, extremely challenging problem to solve because of the nature of how that library is constructed. Knowing that, you can still put it down as an option so that people can look at it and look at the scope of effort involved. Is it a solvable problem? Sure, of course. But do we want to spend 10 weeks working on it when the other options can be done in one, two, or three days? So you can put stuff like that on your design criteria judgment list for other people to look at. Sometimes they look at it and kind of roll their eyes, like, yeah, no kidding, we're not going to be doing that. But a lot of people also see that and, kind of internally, say, thanks for pointing out that you thought of that and that this would be the associated amount of work. And sometimes you're wrong on something like that. I can confirm from firsthand experience; I've done a sandbag: hey, I think this is going to be really complex to do for this one thing, and then somebody who's got a completely different context on that problem comes back and says, nah, man, there's actually a clever little thing we can do here that only takes an afternoon. And that opens up a conversation with that person. You get to learn something new, and you might get to co-develop something with somebody. So that's also really important, to put those sorts of extremes in there. And the other side of the extreme for framework building could be, you know, option zero: we're not building that, or it's out of scope. Yeah, this functionality doesn't exist, but we're punting on it until we hear from people saying that they need it, until there are feature requests coming in and people saying, we will use this, if you build it, please build it. And
Sometimes that can be deferred because it's not something that's critical for the functionality immediately, but it's something that, Hey, it might take too much time to build upfront, but we'll get to it after we respond to how people are using it.
MICHAEL_BERK:
So create a list of features, see what's supported, and then for those five features that aren't supported, provide three to five options and get peer reviews. Those peer reviews will typically be more unbiased, and then you can bring in creativity and other perspectives, and that will allow you to solve the problem more efficiently, at least typically.
BEN_WILSON:
Yes, definitely can confirm
MICHAEL_BERK:
Got it.
BEN_WILSON:
it. It dramatically reduces confusion and churn on development if you have 20 people in agreement on the direction. And once they give that feedback, you update that design and say, hey, we discussed this offline, or we discussed this in this document; I'm updating and revising this because this person brought up an excellent point that I didn't think of, and this is how we're going to do it. And that's your guide for phase two, which is: go hit the keyboard, homie, start writing code. You work on that first PR and start building it. And the way that you do that is up to you; everybody's got a different style. For this particular implementation, I tried something slightly different, which is test-informed development. I already had all of those examples written up about how I wanted each of those pipelines to function, or just how the examples work. I actually took a lot of that and put them in as pytest fixtures in the test suite and said, okay, I now have these 14 or so pytest fixtures that generate a Transformers pipeline of all those different varieties, and now I'm going to write integrations with each of those that say: what I want you to do is take that built pipeline and try to save it. And before I write anything, I just have a skeleton of the flavor that has save (pass), load (pass), log (pass). Of course, nothing's going to happen when you do that, but you start blocking out and writing that code where you're like, hey, I need to create the save API so I can save a pipeline object or save components. So do the first-round implementation of that. And while I was writing that, I would have another window open, and periodically I was just kicking off the unit tests, watching them blow up one after another, and asking, what exceptions are being thrown here? Oh, okay, that's a JSON serialization error. Or, hey, that's an object that can't be serialized; I need to convert that to some representation, or extract attributes of it. So it was an interesting experience to do that for the first time ever. Usually, historically, I've written the code and then written the tests, which I know a lot of software engineers do. And there are some people that are, you know, almost pure TDD, which is not for me. But test-informed development is interesting because it gives me sort of a set of guardrails; as I'm building stuff,
MICHAEL_BERK:
Hmm.
BEN_WILSON:
I'm able to test how a single change is going to affect 14 different things that all share a common need or a common functionality. So I was able to see like, Oh, okay. I fixed this thing in this one pipeline and seven other ones started to pass, but then there's seven that are still failing. Like what's going on there? Okay, I can see the results real quick. Fix that. Oh, now those seven that were failing are passing, but then one of the ones before that was passing is now failing. What's going on here? So it was just like a faster development pace for me for this one instance. It was just fun.
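A stripped-down version of that setup might look like the following, assuming the mlflow.transformers APIs as they shipped and a small example checkpoint; the real test suite covers many more pipeline types than this single fixture:

```python
import mlflow
import pytest
from transformers import pipeline

@pytest.fixture
def text_classification_pipeline():
    # One of the ~14 fixtures Ben describes: a small, ready-to-use pipeline.
    return pipeline("text-classification",
                    model="distilbert-base-uncased-finetuned-sst-2-english")

def test_save_and_load_roundtrip(text_classification_pipeline, tmp_path):
    # Written before the flavor works: it fails loudly until save/load do their job.
    model_path = str(tmp_path / "model")
    mlflow.transformers.save_model(text_classification_pipeline, path=model_path)
    loaded = mlflow.transformers.load_model(model_path)
    prediction = loaded("MLflow makes this part easy")[0]
    assert prediction["label"] in {"POSITIVE", "NEGATIVE"}
```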
MICHAEL_BERK:
Interesting, I'll have to try that out. That's super cool. All right, I mean, I think we did it. We made it to the PR stage. Do you have any other final thoughts or comments before I wrap?
BEN_WILSON:
Something that I don't know how many people think about, but that I know from personal experience in data science work, when I was doing stuff to solve problems earlier in my career: even if the company is touting, hey, we're an agile company and we do agile methodology, a lot of times the data science team is excluded from that, because you have this single monolithic artifact. It's a model. It's got to be good. It can't be total garbage or it just doesn't work. So a lot of ML development I've seen, and what I used to do, was focused purely on that: I need the best model possible. So you're doing waterfall development, not agile development, when you think of projects like that. And usually that's because your experimentation phase is trying to get the best feature set and the cleanest feature data that you possibly can, and iterating on accuracy until you get there. But that's the antithesis of high-velocity development, because you're spending so much time trying to get that first release to be really great, something people can be really proud of, where you can say, yeah, solved this problem, until you've got to retrain it two weeks later. That aspect of retraining over time is sort of more agile; it's more like just maintenance. But a true agile development methodology, as it applies to data science, is setting some sort of competency threshold on what you're putting out there. So like, hey, our metric is accuracy, and we need to make sure that our accuracy is above 80% for, you know, our first-round pass. And to get 80%, that might be: you have 10 features in your input feature vector and a hundred thousand rows in your training dataset, for traditional ML or deep learning or something. And the agile way of thinking through that is: yeah, we know we have 150 additional features that we could add to this, but we're not going to take two and a half months to test all that out and delay the release if we can hit 80%, which is our cutoff. We just trained it last night on the first 10 features and we hit 82%? Ship it. Get it out there for internal human review and QA, and for people to see, is this thing valuable? And software engineering does the same thing: when we get that first PR and we do our first release, we'll mark it as experimental, or we'll say, hey, this is under active development, new things will come, maybe, or we could just delete all of this. That's what experimental tagging means. But we're doing that so that we can see where the noise is coming from. We want to see which parts of this people are, you know, begging for new features around, or what things people are telling us, hey, this is broken, you guys screwed this up, please fix this. And you do that, you respond to it, you do another patch release, but you're focusing your further efforts not on your assumptions of what people would value; you're focusing them on what people are responding to. So if everybody really likes this one thing that you thought was completely pointless, like you had some discussions internally saying, nobody's going to care about audio models, who does that? By the way, we didn't have that discussion; we knew those were really important. But just as an example, for some esoteric part of a library or an implementation, you might say, nobody's going to care about that, it's not that important. You release it.
And within the first three weeks, 90% of the feedback is about, hey, this doesn't work the way that I want, or, hey, this is awesome, but we really need it to do these other things as well. You don't sit there and think, wow, we were really wrong, uh-oh, I guess we shouldn't have spent so much time. We didn't spend a lot of time. We were fast. We were agile. We were moving at a high rate of velocity to get that first thing out there, so that when we get that feedback about the thing we didn't think was going to be super important, we pivot directly to it and say, that's what we're focusing on, because that's what people care about. And you can do that with data science projects as well. If you release that thing at 82% accuracy and it's crickets for a month, nobody cares, and you ask for feedback and people are like, yeah, we didn't really ask for this, or, yeah, that's cool and all, but we can't use this, then go build another project. Scrap that. It's garbage. I mean, it might be useful someday, so don't throw the code away, but don't spend six months making it the best model ever only to find out that nobody cares. Go find out what people care about, go build that, but build it fast. Fail fast, fail early. Fix it based on what people are telling you about it, like, hey, this sucks for this thing, or, we really like this aspect of it, but it's terrible for all these others. That's what informs your next design and your next quarter of work. So that's my point, my perspective, my soapbox moment on, you know, product development for ML.
MICHAEL_BERK:
I'm truly inspired and will now go be as agile as I possibly can. But before we wrap, I also wanted to echo that people are really bad at guessing what will be good and bad.
BEN_WILSON:
Yes.
MICHAEL_BERK:
And so just testing it out, whether it's an A/B test or whether you release it to people and they give you feedback, you'd be really surprised at what people actually cling to. And there's the Steve Jobs-ism of: people don't know what they want until you show them. That might be true for some people, but often it's really, really hard to guess what people want. So just iterating is super effective.
BEN_WILSON:
Yes.
MICHAEL_BERK:
But yeah, in summary, hopefully you got some actionable ideas about how you can implement a new technology, especially if you're starting from zero or square one. The general formula that you can follow is: first, someone comes to you and says, we need to do this. Let's assume they're right. Next, we look to understand scope, and typically playing around is really effective. You go and see what people would be doing with this tool, how they would use it, and you try to solve real problems that those people are actually trying to solve. Once you sort of have an intuition, you can go back to the framework that you're trying to build, develop a list of features, and see where this tool fits into that framework. From that list, you can develop some sort of criteria about whether it's a one-to-one match and we're good to go; it's not a one-to-one match, but it's pretty good and we'll just need to adapt the framework or the tooling; or the third option, which is the scariest, where you actually have to build something from scratch. Then, once you have a scope with all of those possible implementations, you can go and get feedback from other people, emphasis on other, because you'll be biased. Those other people can provide external opinions that are hopefully unbiased, and they can also provide different perspectives and different contexts. And then finally, once you've updated this PRD or whatever document you're using, you can go build it and have fun. Is that about right?
BEN_WILSON:
And learn, and learn during that.
MICHAEL_BERK:
True. Cool,
BEN_WILSON:
Yeah.
MICHAEL_BERK:
cool. Nice.
BEN_WILSON:
It's a good summary.
MICHAEL_BERK:
Well, until next time, it's been Michael Berk and my co-host.
BEN_WILSON:
Wilson.
MICHAEL_BERK:
Have a good day everyone.
BEN_WILSON:
Have fun coding.