How to Test ML Code - ML 091
Show Notes
In this show, we cover some practical tips for writing reliable ML code. Here are some of the questions we look to answer...
- What are tests and why should you use them?
- What's the difference between unit tests and integration tests?
- What should you test?
- How should you write tests in Python? (the answer is to use pytest)
Sponsors
- Top End Devs
- Coaching | Top End Devs
- Enov8, who provides test data management
Transcript
Michael_Berk:
Hello everyone, welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I'm joined by...
Ben_Wilson:
Ben Wilson.
Michael_Berk:
And today we're going to be having a panelist discussion, and we're going to be talking about how to test ML code. And if you have no idea what a test is, well, you're in luck, because we'll be starting from the basics, explaining why we should be running these tests, and then talking about some high-level philosophies and principles about how you should approach testing. So we're gonna start with a scenario. We like using scenarios to anchor our knowledge, and we're big fans of hot dogs, so we'll just continue. The scenario is we're building a random forest model to predict whether someone wants ketchup or mustard on their hot dogs. So binary classification. Maybe we're looking to scale this model out into production to multiple food trucks, maybe not. But that's the initial premise. The feature set: we have three features. They're the highest quality of all features in the world. I engineered them myself. The first is person height. So when someone walks up to our cart, we know how tall they are. And let's say it's in meters. The second is the time of day. And then the third is whether the person is wearing a hat. And there's one requirement for this: we need a one-minute turnaround from first viewing the person all the way to having a prediction and then informing our ketchup or mustard dispensing policy. Let's say. So Ben, does that make sense to you? Did I miss anything in the scenario? Cool. So let's start off with the basics. We are ML engineers. We're trying to predict ketchup versus mustard. What are tests?
Ben_Wilson:
So the features that you described, the first thing that comes to mind for me is height. Now, the vast majority of testing that I've ever done... I don't write tests for ML stuff like that anymore, but when I was doing that, I would be looking at what my data is that's coming in. And if we're talking about something like height, we probably don't have all of the data for every person for this use case. We might have it for, optimistically, let's say 60% of people. And we might have other data about that customer that we could use for inference, for saying that, hey, we actually want a value here and we want to estimate what that's going to be. So we're going to have some sort of feature engineering logic that's going to infer missing data. And there's a bunch of different ways that we can do that, but that's not important for this discussion. What we're talking about is we're having to write some code that takes data in and fills in an estimated value for us. That could be super complex logic, that could be super simple logic. It could be something that operates on a data frame where you're using some sort of statistical inference to populate the missing values through some sort of a group-by operation. We could be using clustering. We could be doing something that is row-wise, so it's just a function. We're saying, hey, we have this data coming in, we're going to create decision logic based on the data that is available that we know we have, and if it meets these criteria, here's the estimated height of this person. So all of that logic, wherever it may go, we would craft our code for that module, like, you know, our height inference module. This will be an actual Python file that exists within our feature engineering directory within our repository. And we don't necessarily have to craft this as a class. We could, maybe, if there's some sort of level of abstraction where we're saying, hey, we're not just doing height, we're doing weight and we're doing other attributes associated with a person. Maybe we have an abstract, you know, human attribute inference class that is a true abstract class that defines a bunch of methods that we want to have as part of the signature of that class. And then each of the actual subclasses that are implemented from that base class would have those same methods. We could do that, but we don't have to. Let's say we're just looking at the simplest approach possible, which is what I would use if we were just looking at these three features and saying, hey, we need to write something where we can test that this logic is correct. We'd create this module, and we would write a bunch of functions. And each of those functions would delegate operations of logic that we would be doing in a sequential order. The first one is probably check if missing. Search through that data frame and say, hey, are there any values here that are missing? Boolean, true or false? That would be a logical construct for controlling execution when our code is actually running. If we don't need to replace stuff, why bother wasting computational cycles on doing that? That's something that we don't have to test. So that's the first point for our testing discussion. Checking if a column has nulls in it, we don't need to test that. The people that maintain pandas already do that for us. Or the people that maintain Spark already do that for us.
They have tests that are running every single day checking that their dev branch and their released version branch are doing that properly. So no need to test that. But after we see if there are nulls in there, we want to inspect the data itself that actually is present. And you mentioned, hey, it's people's height in meters. If we operate under the assumption that whoever's feeding data into that raw data table has that consistent at all times, we're deluding ourselves. I guarantee that that source table is gonna be a mix of feet and inches, centimeters, meters; there's gonna be a lot of different data types in there. And we can probably do some thresholding and say, based on whether this float value is between these ranges, that's probably the height of this person in this measurement scale. And then we can do casting. Like, hey, if it's in meters, cast it to centimeters. If it's in feet and inches, cast it to centimeters, or however we want to do that, whatever our final logic is. So what I just described there is two separate operations, two separate functions that we would have to create. And it could actually be more than that. We could extract this out to say, hey, we have a function that's inches to centimeters. We have another function that's feet and inches to inches, or feet and inches to centimeters. And then another one, meters to centimeters. That one's really simple. But then we have our detection logic. And that's going to be custom. There's no package in existence in open source, and certainly not part of a core library, that's going to say, hey, is this number representing human height, and in what format? Nobody's going to build that because, frankly, it's stupid. But it's fun to talk about for this podcast. So we're going to have to write that logic ourselves. And of all the things that we just described, the unit conversion and the detection of whether we have to kick this logic off, none of those are as important as that detection logic, because that's where the likelihood that we're going to create garbage is highest, and where we'll probably have issues with how we implement it, because it's going to be tricky to figure out and get correct in all cases. So of everything that we've discussed, that's the most important thing to test. And we would want to start by writing a test that goes against that function, and we would supply some synthetic data to that function. And that doesn't mean we're creating a pandas data frame or a Spark data frame that represents our actual training data and then saying, hey, based on this snapshot of our data, does it do what we expect it to do? That's fraught with issues because, A, that data set's probably massive; B, our test environment probably doesn't want to load that much data in order to do a unit test; and C, we don't need to test that much information, because within that data frame there's probably a lot of duplicated data that is redundant to test. And in order to execute that test, we would have to write code within the test that's doing a row-wise operation on a pandas data frame. So we'd cast to, you know, a NumPy array and then take that data and pass it through the function. Well, instead of doing all that complexity in the test, we can just say, what are some situations that we wanna verify that this works correctly?
Maybe we generate a really big number and say, hey, how does it handle this? Does it know that, you know, maybe that's in millimeters or something? So we have logic that's like, oh, we're detecting this is in millimeters, we're going to convert that to centimeters properly. And, you know, some weird mixed types where it's like, okay, this is feet and inches, we need to do this cast. But then we would want to write tests that would validate boundary conditions, where our logic might be at a point where it's not obvious which unit it would choose. So whoever's writing that test would do that exploration and say, okay, you know, if it's in meters, or in decimal feet, you know, number of feet point inches out of 12, where does the normal distribution of human height sit? So let's test those numbers and make sure that those thresholds work as we expect them to. And that's really all we have to test. So maybe there's 20 cases that we're testing and making sure that they cast properly.
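A minimal sketch of the kind of detection-and-conversion logic being described, assuming a hypothetical module name (height_inference.py), a hypothetical function name, and made-up unit thresholds; the real boundaries would come from the kind of exploration Ben mentions:

```python
# height_inference.py -- hypothetical feature-engineering module (illustrative only)

def detect_and_convert_to_cm(value: float) -> float:
    """Guess the unit of a raw height value and convert it to centimeters.

    The thresholds below are made-up boundary choices for this example:
    roughly 1.2-2.5 is treated as meters, 4-8 as decimal feet,
    120-250 as centimeters, 1200-2500 as millimeters. Anything else
    raises so that bad data gets caught upstream.
    """
    if 1.2 <= value <= 2.5:        # plausible adult height in meters
        return value * 100.0
    if 4.0 <= value <= 8.0:        # plausible height in decimal feet
        return value * 30.48
    if 120.0 <= value <= 250.0:    # already centimeters
        return value
    if 1200.0 <= value <= 2500.0:  # millimeters
        return value / 10.0
    raise ValueError(f"Cannot infer a height unit for value: {value}")
```

Keeping the thresholds in one small, pure function like this is what makes the boundary cases straightforward to enumerate in a unit test.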
Michael_Berk:
Yeah, so that was a ton of really, really good examples. And just so we're all starting from the same place, a test is just a Boolean condition where we look to see if inputs produce the correct outputs. There's a bunch of different ways to do that. We have test runners, and we can do unit tests and integration tests. But when we're doing feature engineering, it's really important that our feature set, or our incoming data set, is quote-unquote correct. And that's what Ben was referring to. Basically, it's very naive to assume that all of our incoming data will be in the correct units, that there won't be a trailing zero that makes a height 110 feet or a thousand feet. So we need to make sure that the incoming data does not automatically skew the model, because for certain models, when data are extreme or there are outliers, that can be really, really problematic for fitting. There are different types of tests, though, Ben, so what
Ben_Wilson:
Mm-hmm.
Michael_Berk:
is a unit test, what is an integration test, and what is the difference?
Ben_Wilson:
So what I just described in that previous five-minute rambling about the height conversion stuff, that's a unit test. So unit tests are discrete tests that are testing the functionality of something like a function or a method, where we're doing an operation based on constructs within a language or a library, but we're not just using the API directly. We're creating our own custom logic that needs to be validated. And the unit test is going to test certain conditions of that function and the arguments that could potentially be supplied to it. It's only going to test that function. It's not going to test the integration of that function with other functions or other objects within your code base. That other test, which would be testing not just whether we can cast human height properly in order for it to go into our model, but also that the model actually runs and executes correctly, and that the metrics calculated from that model are within a range that makes sense... that whole end-to-end test of the application that we're writing, that's an integration test. It's integrating all of the disparate parts of our code base into a single test saying, does this thing run, yes or no? And not only does it run, does it produce what we're expecting? From a data science perspective, a lot of people think of that sort of test as running cross validation and saying, hey, can I get metrics coming out of this? But from a software engineering perspective, that's just one part of the larger picture. You have other things that are happening. Like when you create that model, what do you do with it? Do you save it somewhere? You should. Do you log it somewhere? Do you take those metrics and put them somewhere? Are they accessible? Can you reload the model and do inference or predictions off of it? What do those look like? Can we make sure that everything is working end to end in the environment that we're deploying to? So we have tests that are, you know... we have an environment variable in our execution environment that says, what environment is this? And there are connections that we're making, you know, database connections for acquiring data, or file store locations that we're reading and writing data to, passwords and usernames that we need to integrate with other systems that are doing logging for us. So an integration test is making sure that all of that stuff works. So when we take our code base and run it in dev, where we're doing all of our changes and breaking stuff and fixing it, it works there, but we can take all of that as one unit and put it into a staging environment, which is a completely different isolated environment that has different connections to different services, and, you know, the data exists in a different place, and we can promote that to that staging environment and just hit go, and everything should run. And the validation of all of that running and producing the data we want, that's the integration test. And we can automate that through continuous integration tools, and we can promote that to different environments through continuous deployment tools. That's the whole CI/CD thing.
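A rough sketch of what a small end-to-end check of that train, save, reload, and predict flow could look like, assuming scikit-learn, joblib, and PyTest's built-in tmp_path fixture; a real integration test would also exercise the environment-specific connections Ben describes, which are left out here:

```python
# test_integration_pipeline.py -- simplified end-to-end check (illustrative only).
# It trains a tiny model, saves it, reloads it, and scores it: roughly the
# "does the whole thing run and produce something sane" flow described above.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def test_train_save_reload_predict(tmp_path):
    rng = np.random.default_rng(42)
    # Hypothetical features: height_cm, hour_of_day, wearing_hat
    X = np.column_stack([
        rng.normal(170, 10, 200),
        rng.integers(0, 24, 200),
        rng.integers(0, 2, 200),
    ])
    y = rng.integers(0, 2, 200)  # 0 = ketchup, 1 = mustard

    model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

    # Persist and reload, the way a deployment step would
    model_path = tmp_path / "model.joblib"
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

    preds = reloaded.predict(X[:5])
    assert preds.shape == (5,)
    assert set(preds).issubset({0, 1})
```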
Michael_Berk:
Got it. So a unit test is sort of a modularized and self-contained piece of functionality, whereas an integration test checks how that modularized piece of functionality interacts with the entire system. More or less.
Ben_Wilson:
Pretty much. And I typically write those tests in two different locations. So unit tests usually live with your code. In fact, they always should. In standard repository constructs, pretty much regardless of which language you're using, there's gonna be a concept of your source code or main code, or there's tons of different names for it; you can name it whatever you want in Python. And then there'll be a test folder that exists at the repository root. It could exist elsewhere in the repository, but generally the root is where you do it for Java, Scala, Python. And that test directory is what contains all of your tests, your unit tests, and they're mapped to the module names of your source code. They don't have to be, but it's highly recommended, particularly when you're building something super complex and you might have hundreds of tests that are running. Or if you're building orchestration framework code for an ML application, and not just a model or project, you could be looking at thousands of tests that are running. Well, if one of those fails in the continuous integration suite that you're running, you need to be able to drive to where that test is in your code. And it's all name-based, usually. So it's just easier to be like, oh, this is, you know, height utilities. I'm going to go to, you know, michael_and_bens_project_utilities/height_utilities, this is our PyTest suite that we have, and go to that function that we have defined that starts with the name test. You know, test_height_reconciliation_works_as_expected_for_meter_submission. That might be one test. Another one is test_height_operation_performs_as_expected_for_centimeters. And we would have all of these data payloads in there that would say, hey, generate these data points, run it through this function, and assert that this conversion happens properly, that the return value for the list we submit to the function matches this hard-coded, expected list. And then we just say assert that these equal one another.
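To make the naming convention and the list-in, list-out assertion concrete, here's a rough sketch of what such a test file might contain, reusing the hypothetical detect_and_convert_to_cm function from the earlier sketch:

```python
# tests/test_height_utilities.py -- illustrative sketch only; the module path
# and function names are hypothetical, following the earlier height example.
import pytest

from height_inference import detect_and_convert_to_cm


def test_height_reconciliation_works_as_expected_for_meter_submission():
    submitted = [1.6, 1.75, 1.9]
    expected = [160.0, 175.0, 190.0]  # hard-coded expected centimeter values
    converted = [detect_and_convert_to_cm(v) for v in submitted]
    assert converted == pytest.approx(expected)


def test_height_operation_performs_as_expected_for_centimeters():
    submitted = [160.0, 175.0, 190.0]
    expected = [160.0, 175.0, 190.0]  # centimeters should pass through unchanged
    converted = [detect_and_convert_to_cm(v) for v in submitted]
    assert converted == pytest.approx(expected)
```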
Michael_Berk:
Right. And you mentioned a very fun word, which is PyTest. So there are a few different ways to actually run tests. I remember I was working on a sort of statistics-based ETL pipeline at a prior role, and I was pretty early in my career and didn't really know about unit tests or even PyTest or any test runners. And so I ended up just recreating PyTest from scratch. And when I put up the pipeline for review, one of my coworkers was like, why the hell did you spend like a day on this? This exists.
Ben_Wilson:
Mm-hmm.
Michael_Berk:
And it turns out unittest is... it's nice, it works, but it lacks a bunch of the capabilities that PyTest includes. PyTest allows for a lot more DRY tests, and it allows for a lot more advanced testing principles. So Ben, what are your thoughts on the different test runners, specifically in Python? And we're just going to stick to Python. There are other coding languages, don't worry, but Python just for this channel.
Ben_Wilson:
I mean, what's funny is that the other languages that I'm familiar with and have done extensive development in, they have their own versions of what we're about to talk about as well. There's the super fancy one that, you know, people that are hardcore framework developers use, and then there's another one for applied use of frameworks that people tend to use because it's simpler and it doesn't have all of that overhead of complexity. And then everybody's got the third one, which is what you described, which is roll your own. I highly recommend never, ever doing that. I mean, you can get away with that by effectively taking native language assertions, where you're like, hey, I'm doing this operation, assert this equals this. The problem is, in order to get something that's useful when a test fails, you're going to have to handle that. So your assertion is going to have to have a message, and you have to pass the stack trace along with it so that you can understand where that path of failures occurred, instead of just saying, hey, this function failed on this line. Great, but where's the rest of the stack trace? Like, why did this fail at this time? If you use a test framework, they do that. They bubble up that full stack trace to you. There are other massive benefits to using something like unittest or PyTest. I haven't used unittest that much, so I can't claim to be an expert on that. I've just always found it kind of clunky, and if I'm the one who's in control of the code and I see unittest being used, I just convert it to PyTest, because PyTest has all of these really beneficial quality-of-life aspects to it. So you have this concept of function or method inspection that happens. If your function or method starts with the word test followed by an underscore, it's gonna execute that as a testable entity. It'll run it and check the result. And just as you said, the result of any of those is gonna be Boolean. There's going to be some assertion that happens within it, and you can create your own custom assertion types or you can use the built-in PyTest assertion types. I haven't really found too many cases where a PyTest assertion type doesn't work, so I usually use those because they're awesome. You can do stuff like assert contains, assert equals, assert greater than, assert less than, assert called once. You know, you can do other things like very clever exception eating and passing on. So say you're writing a test for a function that's doing what we were describing before, the height check. Let's say we wanted, if any of the values coming in was greater than some float value, to actually abort the job, because we have to fix the data engineering pipeline if we see something like that, because we can't cast it to anything that we know. So if the value is greater than 10,000, raise our project's exception. We create our own custom exception and have some message that we print out, like, hey, this value is more than 10,000, we're aborting this because we don't know what to do here. Which is a good practice, by the way. You should always be doing stuff like that. Even in ML code, you should be having some sort of exception handling. And don't eat all your exceptions.
So when we do that, if we write one of those, something that's going to throw, or we call it raise in Python, we want to test that functionality. So we would have a unit test that's validating that. And we would take data and maybe we'd say, okay, pass into this function the value 10,001. Let's make sure that it throws, because we wrote this logic to make sure that it throws. Let's validate that. Well, if you're going to handle that in a roll-your-own methodology, you're going to have to write a try statement, and then you're going to have to write an except statement, and that except statement is going to have to do a regex match on that message that you have. And then in the except block, you're going to want to make sure that you have an assertion that, hey, this message that's coming in needs to match this message that I'm typing out right here. And it's just a lot of code that you have to write. And then you're going to have to have management logic around that: if that didn't throw, then throw another exception for the test itself. I don't know, I'm lazy. I don't want to write all that. And that's just more code to maintain, and it's annoying. So in PyTest, it's a one-liner. It's actually a context wrapper around an execution. So you would just use the keyword with pytest.raises, open parentheses, name of the exception class. So you'd say, like, Michael and Ben's hot dog application exception. That would be the name of the exception, comma, match equals, open quotes, the message of what is actually raised there, that info statement. And then after the closing parenthesis, a colon, new line, a couple of spaces, and then we'd hit that function with the value 10,001. And if it doesn't raise, we're going to get a PyTest exception saying, hey, this didn't actually raise. And you're going to get this really great standard error message that's going to report it out when you run this test. If that doesn't work, it's going to say, here's the expected value that I had, and it's going to give you all the text of that, and then you're going to get, here's actually what I got. And it would be like, hey, no exception was thrown, so our logic is broken and sucks. Or we threw a completely unrelated exception, like something raised, but it's not the right thing that raised. That lets us know that our logic is wrong, and we need to go back in and fix it to get that test passing.
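A minimal sketch of the pytest.raises pattern just described; the exception class, the validate_height function, and the 10,000 threshold are hypothetical, taken from the example in the conversation:

```python
# Illustrative sketch of the pytest.raises pattern described above.
# HotDogAppError and validate_height are hypothetical names for this example.
import pytest


class HotDogAppError(Exception):
    """Custom project exception for unrecoverable data problems."""


def validate_height(value: float) -> float:
    if value > 10_000:
        raise HotDogAppError(
            f"Value {value} is more than 10,000; aborting because we don't know how to cast it."
        )
    return value


def test_validate_height_raises_on_absurd_values():
    # The context manager asserts both that the exception type is raised
    # and that its message matches the given pattern.
    with pytest.raises(HotDogAppError, match="more than 10,000"):
        validate_height(10_001)
```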
Michael_Berk:
Yeah, you hit on some points that triggered a little bit of PTSD. I remember when I was writing that whole test suite, I was like, oh, just print it to the console. Without a stack trace... useless is a very, very strong word, but the stack trace is incredibly, incredibly helpful. But also, what happens if there's an error in your test? Let's say you miss a... I mean, that's a bad example. I was going to say, if you have a syntax error, it won't compile. But there's a lot of features built into these test runners that just make your life easier. So there's no real reason not to use them, other than it takes a tiny bit of time to learn upfront. But Ben, do you mind sharing your three favorite PyTest features? Things that you think have very high ROI, or you use daily, or that you think are just cool.
Ben_Wilson:
Uh... Um... I'm a big fan of warnings, particularly in ML code. There's a lot of times where you might be coercing a value that you wanna be notified of, but you don't want it to raise an exception. So you might wanna raise a warning. So that would be like importing the warnings module and calling warnings.warn with a level and a message, like, hey, I had to coerce this value to this value, or I had to change this environment variable based on the data that I got in, and this is how execution is gonna go. So I wanna be able to be notified of stuff. I don't want that printing to standard out, because that's super annoying to me when I'm running something and just seeing this flood of text. But I want it to go to a logger, and I want that logger to be searchable. So I'll make sure that each of those info statements is unique and provides the information that I need if I'm trying to debug something, or if I'm just curious about how things are going. So PyTest gives you the ability to make sure that you're raising those. You can do that manually. I mean, obviously anything that's in PyTest you can do manually, because PyTest is using Python constructs in the language, but it's a lot of lines of code to make sure that you're handling warnings properly. So PyTest gives you that one-liner. It's like with pytest.warns, the warning level, and then the expected message, then close quotes, colon, new line, run the code where you're gonna intentionally cause a warning to be thrown, and make sure that it produces a warning. And you can do all sorts of levels there. You can catch a debug warning, which wouldn't be printed to info, or you could do error warnings as well and capture those. If you're in the process of maintaining something that's gonna be versioned, like let's say this hot dog job is awesome and it's working so well for us that we're not doing a one-and-done sort of thing. It's making money for the company. So over the last two years, we've been making changes to it and upgrading it and adding new things to it constantly. So we're now on version 1.7.6, and we want to be able to make sure that, throughout that entire history of us maintaining this, we're getting appropriate warning messages based on, say, versioning or something. And we have all that baked into that. So, like, hey, we have this expectation that this is going to be working the way that we want. Another PyTest feature that I think is pretty good...
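A small sketch of that warning pattern with pytest.warns; note that warnings.warn takes a warning category such as UserWarning rather than a log level, and coerce_height is a hypothetical function invented for this example:

```python
# Illustrative sketch of warning behavior and pytest.warns (illustrative only).
import warnings

import pytest


def coerce_height(value: float) -> float:
    if value < 0:
        warnings.warn(
            f"Coerced negative height {value} to its absolute value.", UserWarning
        )
        return abs(value)
    return value


def test_coerce_height_warns_on_negative_input():
    # Asserts that a UserWarning matching the message pattern is emitted.
    with pytest.warns(UserWarning, match="Coerced negative height"):
        assert coerce_height(-1.8) == 1.8
```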
Michael_Berk:
And you mentioned assertions, and the PyTest assertions themselves. So Python itself, like native Python, has the ability to create assert statements, but they can be a little bit clunky. And I also have not experienced an instance where PyTest does not have the coverage or the functionality that's required, where you really have to drop down to native Python. I'm sure there are examples, but for most use cases, I have not experienced it.
Ben_Wilson:
Yeah, I'd say my second favorite aspect of it is its extensibility. So you can write your own test class on top of a base PyTest class and insert your own test functionality as methods within that. So you can create a test harness that's based off of it, and you get all the functionality of PyTest. But let's say we had 14 or 15 special tests that we're doing that aren't included in PyTest because they're so niche that it's not applicable. And this is pretty normal. Check most widely used open source Python-based packages out there and you'll see a custom test harness that's usually based on PyTest. So it might be, we're validating the connection to one of our backend stores for our environment, and we want to make sure that we can acquire data for certain types of tests. So that would be in that test harness. So using it, it'll just be available, instead of us having to write the connection code every time we wanna use it, which sucks. So I really like that feature. I think PyTest was very elegantly and professionally designed. It's an amazing software suite, and it's contributed to by a lot of people that are really, really good at what they do. And the final thing is mocks. Mocks are not super useful for ML stuff, but if you're doing pure software engineering in Python, mocks are a lifesaver. They allow you to set conditions on sort of an ephemeral fake object of something, an instantiation of a class that might require dependencies that you don't want to bring into memory. And, you know, if you have tightly coupled code, a class in a module that actually depends on 30 other classes throughout your code base, well, you don't want to have to do all of the things required to make it so that you can build that object correctly, because that's expensive, and that would be more like an integration test. So a mock allows you to take that class signature of what you're trying to test and synthetically simulate its state by hard-coding values and just say, like, hey, don't pay attention to all these other dependencies, because we don't really care about them right now. We're not testing those. They're tested in their own suites, and maybe they are mocked up as well, or maybe they're base tests because they are lower-level classes and they're sort of like developer API stuff. But at those higher-level, usually user-facing APIs that depend on so many other things, we can mock that up and then test methods within that class, or within that function, within that module, whatever, and get a result really quickly without having the burden of calculating everything else. So, mocks are awesome.
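A quick sketch of the mocking idea using unittest.mock, the standard-library mock module commonly used alongside PyTest; the fetch-based client and load_training_features are hypothetical names for this example:

```python
# Illustrative sketch of mocking with unittest.mock (illustrative only).
from unittest import mock


def load_training_features(client):
    # Imagine this normally needs a real database connection.
    rows = client.fetch("SELECT height_cm, hour_of_day, wearing_hat FROM features")
    return [r for r in rows if r["height_cm"] is not None]


def test_load_training_features_filters_missing_heights():
    fake_client = mock.Mock()
    fake_client.fetch.return_value = [
        {"height_cm": 180.0, "hour_of_day": 12, "wearing_hat": 1},
        {"height_cm": None, "hour_of_day": 9, "wearing_hat": 0},
    ]

    result = load_training_features(fake_client)

    assert len(result) == 1
    fake_client.fetch.assert_called_once()  # no real connection was ever made
```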
Michael_Berk:
Super cool. I've never used mocks, I'll check it out in 20 minutes. Great, so we understand unit tests, we understand integration tests, and we recommend using a test runner. PyTest is a phenomenal library that allows you to manage running these tests. And we sort of talked about why you should be writing tests: it makes your code more reliable, it makes maintenance easier, and it handles a lot of the edge cases that come up in real-world scenarios and makes sure that there's graceful handling of those anomalies. But Ben, you sort of started mentioning some things that you should and should not test, like you probably shouldn't be testing pandas. What are some things, slash guiding principles, around what should be tested in an ML flow or an ML pipeline versus what should not be tested?
Ben_Wilson:
So it doesn't make much sense to do stuff like... if you're testing ML code, and I have seen countless people do this, and I did it when I was early into this profession as well, I was like, well, I want to make sure that the model fits locally before I try to run this in staging. All that was really doing was taking a sample data set that was a snapshot of my training data, fitting a model on it, and then trying to make sure that the output results were within constraints. And all that's doing is testing SKLearn or XGBoost or Spark ML. It wasn't testing any logic that I had. It was just testing somebody else's code, but I had to pay for that. It doesn't do any benefit to my project, because I don't have any control over that. The only control that I have is what version of a released package I'm using. And in order to get away from potential problems, I can just version lock my code. Say, like, hey, I want to use this version, so I'm going to explicitly say pip install pyspark==3.0.2. Although we wouldn't use that, we'd probably use 3.2. But anyway, you don't need to test somebody else's code if it's been released and it's something that's widely used. You know, look at how many people download SKLearn every day. It's probably close to half a million, maybe a million, I don't know. There's a lot of downloads of that package every day. The people maintaining that sincerely know what they're doing, I can promise you, and they have very comprehensive testing that happens before any package is released. Now, we are human, people that maintain open source packages. We do make mistakes, regressions happen. It's very hard to write unit tests for every possible potential problem that could arise. And because of that fallibility, bugs can be introduced, but they'll usually be patched with a micro version upgrade, you know, some sort of patch release. So,
Michael_Berk:
Wait, wait, wait, you write bugs?
Ben_Wilson:
oh, hell yeah, everybody does.
Michael_Berk:
I thought you were perfect. Never mind.
Ben_Wilson:
Hell no. After this recording, I'll be going to fix one of my bugs, actually. Not in released code, but in a PR. So, the important thing when you're writing ML code is: don't test somebody else's libraries. They're already doing it. What you wanna test is your code. If you're doing something that's introducing some custom complexity to your pipeline, you know, we're doing feature engineering work, or we're writing our own scoring algorithm because we need something that's not in the open source tooling. If that's custom logic that we're writing, yeah, we definitely need to test that, because if you don't, you don't know if it's correct or not. And there's nothing worse than having it execute just fine and then have the results be something ridiculous that doesn't make any sense. Good luck explaining that to your business, like, oh yeah, our accuracy was 99.99998%, this is an amazing model, and yeah, we verified that we're not leaking anything in the label. Then it turns out your arithmetic was wrong in your scoring algorithm, and the accuracy is actually 57.2% when you fix the bug, and you realize your model's garbage. So it's really important to test anything that's custom.
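A sketch of the kind of unit test that catches exactly that sort of arithmetic slip in a custom scoring function; weighted_accuracy is a made-up metric invented for illustration:

```python
# Illustrative sketch: a custom scoring function and the unit test that would
# catch an arithmetic mistake in it (illustrative only).
import pytest


def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each sample contributes its weight instead of 1."""
    total = sum(weights)
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / total


def test_weighted_accuracy_hand_computed_case():
    y_true = [1, 0, 1, 1]
    y_pred = [1, 0, 0, 1]
    weights = [1.0, 1.0, 2.0, 1.0]
    # Correct predictions carry weight 1 + 1 + 1 = 3 out of a total of 5.
    assert weighted_accuracy(y_true, y_pred, weights) == pytest.approx(0.6)
```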
Michael_Berk:
Yeah, no, that's a phenomenal point. Basically, you can think of how you invest your time into writing tests as an ROI calculation. You could test literally everything under the sun and have a 100% airtight application, and that would include testing whether your print statements actually print to the console, whether pandas concat actually does concat, whether this, whether that. But is that really worth your time? Is it worth the ROI of building out all of those test suites? So one thing that you can almost immediately check off is all open source libraries. You're using them, and you can assume that they are robust and effective, that the underlying tech that you're using is probably going to be pretty good. If you're building custom code, on the other hand, you might have missed some logic, missed some edge cases, and it's important to make sure that, if this is gonna be put into production and, for example, served in real time, there is graceful error handling. And another thing that we sort of started the conversation with, which I think is super, super important, is you need to make sure that your incoming feature set is roughly what you would expect. You often get distributions during EDA, and if, let's say, something is outside of the anticipated region, you can drop it, you can rescale it, you can do whatever you think is necessary to make it work with your model. So it's just an ROI calculation. Like, again, you can test everything under the sun, spend two years building one model and not do anything else for the rest of your life. That's totally fine. But if you are in the interest of outputting lots of good stuff, you have to think about what is likely to go wrong, and whether the issues, when that stuff goes wrong, are severe enough to warrant spending five minutes writing a test.
Ben_Wilson:
Mm-hmm. And there's a whole other aspect to this conversation, which is, where do we run this stuff? And that's for another episode. We'll do that the next time we have a panelist where we're like, hey, what does CI actually look like? And what does it look like when we're writing a commit branch that we want to merge to our main branch? And how do we run that locally? What do we do locally for that? How do we run integration tests locally or via containers? How do we get pristine execution environments? And then what does our actual deployment look like with respect to running CI checks, not on our laptops because you should not do that for validating that your code is good. But that's
Michael_Berk:
Yeah.
Ben_Wilson:
a discussion for another time.
Michael_Berk:
Yeah. Now we have at least 15 more hours of material, but unfortunately this is not a 15-hour show, so I will quickly recap. A unit test is a discrete test that checks if an input creates the expected output, roughly. It's sort of a modularized piece of code where you look to see if a single operation is working as expected. Integration tests check how these modularized pieces of code interact with the entire code suite. And then some pro tips from Ben, who is pretty proficient in this stuff, I would have to say. One is use PyTest over unittest. PyTest has really great assertions. It's also very extensible. And then it also has some other features that, for instance, allow you to share resources between different tests so you don't have to recalculate them each time. Warnings are also a great feature. And then finally, my pro tip is make sure that you test the values of the incoming feature set. When you get a height of 10 million, that's going to really throw off your linear regression or your random forest, random forest less so, but it can lead to lots of weird behavior. And if you just make sure that your incoming data are correct, well, the model is probably robust because it's open source, so that really reduces the issues that you will be facing when it's live and in production. Did I miss anything?
Ben_Wilson:
No, it's good.
Michael_Berk:
Beautiful. All right, well, this has been a great episode on testing. Until next time, it's been Michael Berk.
Ben_Wilson:
and Ben Wilson.
Michael_Berk:
And have a good day, everyone.
Ben_Wilson:
Take it easy.