Michael:
Welcome back to another episode of Adventures in Machine Learning. I am one of your hosts, Michael Burke, and I do data science, machine learning, and data engineering at Databricks. And I'm joined by my beautiful cohost.
Ben:
Ben Wilson, I help contribute to open source software at Databricks.
Michael:
Cool. Thank you for that, Ben.
Ben:
Hehehe
Michael:
So today we have a panelist episode and this is something that has been inspired out of a true need that I feel, and this need is to learn about writing reusable code, because I'm lazy. I don't want to rewrite the same thing over and over, and if I can just copy and paste or do a pip install instead of actually typing, my life is a lot easier. So this was prompted, uh, I would maybe say yesterday. Ben and I were having our bi-weekly one-on-one and I had written this reusable framework for writing, or for running, API calls to Databricks. So you have to authenticate via header, then you have to use the requests library in Python and do some stuff. So I wrote like a 50 line wrapper that made working with this API a lot more concise and just a lot simpler. And Ben tore it apart, per usual, so I need to refactor. But it is something that I have used on virtually all of my accounts so far. And that prompted the discussion of what's the difference between a utils library and a framework? Because all of us have our favorite functions that we reuse, maybe read data from x, y, z. Maybe you want to wrap the authentication for that, or do a rolling origin back test for time series. Those are some classic utils, but what happens when someone else on your team wants to use it? How do they know that it'll work? How can they trust it? So that's where we start building quote unquote frameworks. But Ben, I don't even know what a framework is. Can you, can you define this term?
Ben:
I don't know if there's one tried and true definition. Speaking from an applied engineering perspective, I don't think anybody agrees. And I've heard so many people misuse that term. Like people sending me something or taking me aside in a discussion in front of their laptop and they're like, hey, can you check out my framework? Can you look at it? And it's like, yeah, the code is cool, I can see how it's useful, but this is just a bunch of utilities. And it's not really a framework. It's just a collection of code snippets that simplify your work. So I would just call that a utilities library. When you get into the world of frameworks, you're talking about abstraction with purpose. You have created a library that is meant to solve a real problem that people are trying to solve, that potentially isn't solved very well with other tools or just doesn't exist, but it has a purpose. It's like, hey, I'm trying to do this thing, and in order to do that in any other way, I would have to merge together these 50 different tools and write all this different code in order to have them talk to one another and process my data in a certain way in order for an action to be taken that solves that problem or solves that series of problems. So in order to think about a framework from that perspective, it's not just the collection of backend implementation complexity and those utilities as a singular unit. What it is, is all of that buried under abstraction. Like you don't expose that to the person using this framework. What you expose to that end user is a very simple high level API interface that makes it compelling to use this framework to solve that problem. So that's what, in my mind, that's what differentiates a framework from a collection of just random helper utilities. And
Michael:
Got it.
Ben:
there's nothing wrong with a collection of helper utilities, by the way, nothing
Michael:
Right.
Ben:
at all. I've never met a serious professional engineer who doesn't have stuff like that they use when they're doing applied work. There are plenty of software engineers who are building things like frameworks who don't keep collections like that, because there's no need; you're focusing your energy on that abstraction layer, so there is no code reuse in that way. But from the applied side, yeah, it's definitely recommended.
Michael:
Got it. So I think I'm starting to understand the difference, but let's walk through an example. So I am selling hot dogs, per usual, and I have a very advanced machine learning stack to forecast the ingredients that I need based on anticipated sales. So I know that it's going to rain tomorrow. Probably won't be much foot traffic. I probably won't need many buns. But July 4th? Oh, man, we are going to have a lot of people. So that's my use case. And I have, let's say, a set of functions that pull data from my database. Maybe it'll create a model somehow, or like log it to this cool thing called MLflow. But there's a bunch of sort of disparate functions that do a variety of things. They're untested, they're not very DRY, I don't have docstrings or anything. So that's what I'm starting with. What should you do next?
Ben:
Um, that's a lot to unpack, but, uh, first thing to do is write tests. Um, so
Michael:
But
Ben:
before
Michael:
why?
Ben:
even thinking about abstraction of something like this, this is what I would call project code. Like, you're trying to solve a business problem. We're trying to forecast how much mustard we need to put on the truck tomorrow and how many hot dogs. So that in and of itself, the mission of this project, is to keep us from taking too much stuff or not having enough stuff, and preventing perishables from having to be thrown away because they're sitting at room temperature out in Central Park for hours of that day. So it's a valid use case and we should build it. What you explained at first was moving from a rough prototype to an MVP. And for a rough prototype, a hundred percent, you don't need tests. You don't need to write a bunch of stuff. You might have two or three integration style tests that you're building while you're developing this. You might not think of them as integration tests, but as you're working through a notebook and writing all of that script down, which you're then going to convert into functions, or potentially create these classes and methods and stuff, that process, you're executing it as you're going. You're saying, am I writing stuff that even runs, or does it just throw an exception instantly because I have a syntax error or I'm using this library wrong? So we're doing that integration test and we're validating that this thing is executing, beginning to end, as we're going. And then when we get to the end, before we're ready to show it to people, we run it all, right? We say, hey, does this produce these forecasts? Are they good? Are they garbage? So you kind of have your code, your script, your execution, your prototype. That is your integration test saying this actually runs. When you move to that MVP and you're in the process of taking that script and compartmentalizing actions that are happening, the first one you mentioned was, hey, I'm connecting to my database. Well, that's some code that can probably be encapsulated. Like, hey, I have auth to pass, I have some sort of token. Where do I get that from? Environment variables, a config file, a token key that's in my key chain, wherever it may be. You have to get that secret and then apply it to the connector. Typically when you're connecting to that database, you're using a framework to connect to it, using an SDK, right? Some high level API. It's like, you know, name your database or infrastructure of choice, the name of that main entry point class, open close parentheses, dot query or something. And that framework has all of that implementation complexity buried from you. You don't see it, unless you go into the source code and look at it, and then you're like, oh yeah, that's way bigger than I thought. That's pretty crazy that this has to deal with all this stuff. I'm glad I don't have to write all this stuff. And when you go through that, the act of you establishing a connection and pulling data, that's not a testable action. The framework already tests that and validates that people can generally access data from their technology. So you wouldn't write a unit test for that. What you would write a unit test for is perhaps your next phase of operations, which is, it's exceptionally rare that our use case here is going to have data ready for modeling stored in a table, because
Michael:
Hmm.
Ben:
it's expensive to do that. It's a lot of work. Most data engineering teams are going to be like, no, we're not doing that. We're going to capture the raw data and maybe we'll clean it up for you, and here it is in some format that analysts can use and you guys can use and, you know, engineering can use for monitoring, uh, to make sure data capture is correct. So what we're going to do is manipulate that data, maybe join a couple of other data sets to it in order to get the data set that we really need. Maybe we'll be filling in null values or, you know, averaging something over time, whatever it may be. That code is stuff that you and I are going to write. And when we write that logic, we better be sure that we're doing it correctly. And we don't want to wait for an integration test at the end to say, man,
Michael:
Right.
Ben:
our forecasts are all messed up. Why is it saying we need like 50 times more relish than
Michael:
Mm-hmm
Ben:
mustard? Like, there's no way. We find there's a data quality issue that we caused. The investigation of that is going to take a really long time to figure out what went wrong. Why is relish so messed up? So we can short circuit that investigation, and make it so that we don't even have to worry about it, if we write a test that isn't getting the data from the source system, because we don't need to test that. You either get the data or you don't. The only time that you wouldn't get it is probably if your token's invalid, and you'll get a different set of problems there. Right?
Michael:
Right.
Ben:
And that's not something we would test. What we would test is that logic, that manipulation logic. So what we'll do is we'll mock the connection to that database. We'll create a synthetic data set in tests and we'll say, hey, these five rows of data represent the, you know, 5 million rows of data that we'd actually be pulling in. So our test uses this mocked response from that client, and then it executes through our manipulation code. And we would intentionally put in conditions within the data that our code is supposed to fix. And then we'd validate that our code is fixing that correctly.
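A minimal sketch of that pattern, assuming pandas and a hand-rolled fake client standing in for the real database connection; `load_sales` and `clean_sales` are hypothetical placeholders for the project's own loader and manipulation logic:

```python
import pandas as pd

# Hypothetical project code: the loader hits the warehouse through a client,
# and clean_sales is the manipulation logic we actually want to test.
def load_sales(client):
    return client.query("SELECT * FROM raw.sales")

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    # Fill missing unit counts with zero and drop rows with negative prices.
    df = df.copy()
    df["units"] = df["units"].fillna(0)
    return df[df["price"] >= 0]

# Test: fake the client so no real connection (or token) is needed, and seed
# the synthetic frame with exactly the conditions the cleanup must fix.
class FakeClient:
    def query(self, _sql):
        return pd.DataFrame(
            {"units": [3, None, 5], "price": [2.5, 1.0, -4.0]}
        )

def test_clean_sales_fixes_nulls_and_negatives():
    raw = load_sales(FakeClient())
    cleaned = clean_sales(raw)
    assert cleaned["units"].isna().sum() == 0   # nulls filled
    assert (cleaned["price"] >= 0).all()        # bad rows dropped
    assert len(cleaned) == 2
```

Because the connection is faked, the test runs in milliseconds and only exercises the logic the team actually wrote.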
Michael:
Right.
Ben:
And
Michael:
So
Ben:
then.
Michael:
I have, I have one question before we move on. Uh,
Ben:
Mm-hmm.
Michael:
why can't I just write an integration test? Why do I need sort of unit tests along the way?
Ben:
I mean, nobody's going to hold a gun to somebody's head and be like, you don't have unit tests in your ML, like applied ML code, so we're removing your machine learning engineer
Michael:
I hope
Ben:
card.
Michael:
not.
Ben:
You can't be called a data scientist anymore. Nobody's going to do that. Nobody's going to force you to do this stuff. Maybe at your individual company, some engineering manager or your tech lead is going to be like, no, we're not checking code in without tests, what are you doing? But if you don't have somebody like that there forcing you to do that, it's not like the computer that's executing your code in production is going to know that it's not unit tested. Nobody
Michael:
Right.
Ben:
cares. The one person that should care is us, right? The people that are writing it. These tests and this extra work exist for a very good reason. And that reason is we can have confidence that as I'm changing code in this, and as you are changing code in this over time, and as seven other people on our team, who are either there right now or are going to be hired in the next two years, are adding and changing and modifying the logic that's in this, that every time they make a change, they can be confident, we can be confident, and the people that rely on our solution can be confident that we can identify potential problems before they get to production. And that's the whole continuous integration thing, which you don't have to do, by the way. I mean, everybody uses that buzzword nowadays in MLE, like we
Michael:
Oh
Ben:
have
Michael:
my
Ben:
to have
Michael:
god.
Ben:
CI. You don't have to have CI. You can run unit tests on your laptop. You can run unit tests in staging. You can use GitHub actions in manual execution mode. You can...
Michael:
Less cool. All of these are less cool, but it's an option.
Ben:
Yeah, it can just be a step in your release process to say, did somebody manually go through, execute all of these tests on this branch, and make sure that we're not about to cause an incident in production. And if that's something that needs to be signed off, and somebody just needs to run those and make sure, fine. What matters is that they're executed and that you have confidence that, hey, we didn't break something. I think this is something that software engineers learned many, many decades ago, this sort of innate humility. You write enough code and you break enough things, you're going to get that humility whether you're prepared to get it or not. And that overriding humility is, hey, I'm not as good as I think I am and I make a lot of mistakes. You can find the most senior person out there who's contributing super complex code or building these amazing products. Look at their GitHub commit history for a particular branch. On something sufficiently complicated, there's going to be 12, 15, 20 commits to that. In a CI system, you're looking at it and it'll be failure, pass, failure, pass, pass.
Michael:
Right.
Ben:
And that's because everybody knows that they're not perfect. So we write these tests to catch all the stuff that we're going to miss. And we are, you know, you're going to miss it. Not a branch goes by where I'm not changing things or fixing something that I screwed up, because this stuff is not simple. When you get to a certain level of implementation complexity, you kind of just break stuff. So they're important. The reason, to answer the other part of your question, the reason we don't just say, hey, let's just use an integration test, that should be fine, is that they're super expensive. Um, so what happens if we're doing this project with the hot dogs and the condiments, and we're forecasting not just for our truck that's going to be in Central Park and for the, you know, 25 things that we're going to need to be loading on that truck. What if you and I are the lead ML people at this company, and we're writing this for 800 of our trucks that are going all around Manhattan every day? So now we have 800 times 25 models that need to be built. And maybe the feature engineering is slightly different for different regions of the city, because different aggregations need to be done or different features need to be brought in. So there's divergences in the code. And when you get to something that's in production, that has been in production for a while and has been tweaked and fine tuned, you're going to find scope creep like that. Scope creep from a perspective of pure abstraction. Pure abstraction would be like, hey, we're writing a universal project framework here that will take any truck, any condiment, and forecast perfectly, or as close to perfect as we can come, every day that we run this thing. And that might be our P zero minimum viable product that we release. I guarantee within three months there's going to be exceptions to that, where we're going to go back in and be like, what's going on with the Bronx? Why is it so different from the Manhattan model? Is there some additional data we need there to make this better? So we're going to add that. And then a month later, we'll be like, dude, what's going on with Central Park during these months of the year? It's so different. We have to put some extra logic in there. So we're going to be bolting on all of this conditional stuff to this solution. It's no longer going to be perfectly abstract. There's going to be a lot of case statements in there and logical trails. So when it comes to doing the full integration test in order to validate that all of that stuff still works for a single code change, maybe we're just going in and changing 10 lines of code to add in new features, well, you gotta run every single one of those models and then check them all in an integration test. Maybe it takes six hours to run it. So what happens when you and I are working on a new feature together? We both have two branches. We're working on two separate parts of this problem that we need to solve. I'm testing some stuff out. You're testing some stuff out. And we have CI set up, but our CI is kicking off integration testing that runs everything. Every commit that you or I make to our branches is going to trigger CI. So now every time, even if we're fixing a linting error, because we have pylint turned on, right, we should,
Michael:
Of course.
Ben:
or, you know, any of the other alternatives out there; there's one that was written in Rust recently that's like 10,000 times faster than pylint.
Michael:
Oh, it didn't.
Ben:
Um, so we have some sort of linter running and we get a failure on that. Like, oh, I've got to change this. Well, now we have to write code that terminates the running integration test, cleans up wherever we're running our tests, and restarts from that new commit push, and then wait six hours to know if what we changed was correct. And in order for us to write a validation of something like all of those models with all of that divergent logic, I would be absolutely terrified to have to write that integration test. Like, what do you use to say that this isn't broken? Do we have to parse and crawl through the forecast for every single one? Do we have boundaries? Are we using statistical process control to say we're within bounds? What do we do with data that goes out of those bounds? Does that fail the entire test? Do we now have to stop what we're doing and fix that? So it becomes really, really complicated, which is why software engineers don't do it. We do end-to-end testing,
Michael:
it.
Ben:
but we rely more on unit tests because our brains can handle it.
Michael:
And it's cheaper, yeah, as you said. Yeah,
Ben:
Yeah,
Michael:
it's
Ben:
it takes
Michael:
really,
Ben:
milliseconds
Michael:
it's...
Ben:
to run a unit test or it should.
Michael:
Right.
Ben:
You should be writing tests where, worst case scenario, you're running a unit test that's validating something that is computationally expensive. It could take on the order of 30 seconds to a minute, but anything more than that, you should be thinking, should I mock this and test it in a different way, or how can I make this go faster?
Michael:
Right. Yeah, so tests essentially provide that security, both for yourself and other users, that the code will do what is expected. And also, one thing that Ben pointed out is that as that code changes over time, we can still have confidence that it generally works as expected. So that's super useful. But all right, so we have, again, we started off with this set of functions. Maybe we added some tests, and now we're more confident for ourselves. In your experience, is there a big jump between sort of this ad hoc script that is an end to end product with tests, versus something that is used by an entire team? And let's just keep it to a small applied data science team for now.
Ben:
So if we wanted to take this exact use case and start talking about abstraction. So we've gone from our script, which is a prototype, we've grouped together, you know, elements within our script that perform similar tasks, and we've gone through and identified repeated code. Like, hey, I need to query this table, or these 17 different tables, for these different, you know, conditions that we need to meet this requirement. Well, maybe in our script it's copy-pasted each time. It's just like, hey, I'm writing my full connection string to S3. Like I have my boto3 connection and I'm providing my bucket, my target location, and providing my keys. And we're just taking that and changing one particular string in it, but copying the block, that like eight lines of code, 17 times. Perfect opportunity to say, let's make a function for that. We just parameterize that one string, simplify our lives, and now it's a one-liner every time we want to use it. And then if we start looking at, well, the cleanup functions that we're doing, you know, they're all kind of like the null filling. We have logic that we agree is really good. Maybe we wrote it ourselves. Like maybe you wrote it. And then I come in and I look at it and I'm like, is there an open source framework that does that for us? So we spend 15 minutes looking online. Maybe we ask ChatGPT and say, like, hey, homie, can you tell me if this exists in some popular Python library? And if it comes back saying, no, it doesn't, or I don't have any information on that, and we did a little bit of Google searching and we don't find anything, then it's like, hey, our custom logic, let's put that into a function, cause all of our data sets are using that. When we put that into a function, we write a test for it and it's good to go. Where it starts to get tricky, where we dig ourselves into a hole, is when we think that we need to abstract more than we actually need to. We're like, hey, we have this process for applying a smoothing filter based on weather data, like you mentioned.
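A rough sketch of the S3 consolidation Ben describes, assuming boto3 and pandas; the bucket name, key layout, and environment variable names are made up for illustration:

```python
import io
import os

import boto3
import pandas as pd

# The eight-line connection-and-read block, written once and parameterized on
# the one string that actually changes between the 17 pasted copies.
def read_s3_parquet(key: str, bucket: str = "hotdog-analytics") -> pd.DataFrame:
    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_parquet(io.BytesIO(body))

# Every call site becomes a one-liner that only supplies what varies.
sales = read_s3_parquet("raw/sales/2024-07-04.parquet")
weather = read_s3_parquet("raw/weather/manhattan.parquet")
```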
Michael:
Right.
Ben:
Hey, the forecast said that it was going to rain tomorrow. And we happen to have weather data for Manhattan, or the boroughs of New York City, for the last 10 years. So we can pull that data, we can apply this function to it, and then join that data to our source data set, and that creates this feature for us. Or maybe we'll do seven features, one for each of the days in the future. And we look at that and we're like, yeah, that's cool, but what if we also want to abstract these other things too? Instead of weather data, we want to do, you know, social data that we have as well, cause we were recording tweets about our company in these geo locations. And where we can really screw up is if you and I spend two or three weeks writing an abstracted interface to support both of these use cases, and then N number of potential future use cases. So
Michael:
Great.
Ben:
that's premature optimization, or premature abstraction. And I promise you that we will be throwing that code away with glee a year
Michael:
Oh yeah.
Ben:
from now because it's going to be way too complicated. It's going to try to solve all the problems. But we don't know if we're ever going to use all of those other things. So we just wasted weeks of effort on something that, even if you and I thought it was cool, we're not going to think it's cool a year from now when we delete it. We're going to be like, what the hell were we thinking? This was stupid to spend time on, cause we never needed it. We didn't use it. Or it's going to be an exception factory, because we're not going to implement it perfectly. I promise you, we will screw it up. And no matter how many unit tests we create, there will be some sort of bug that maybe is only uncovered when data does something funky that we weren't predicting. Something will happen with this. And our process of fixing that level of abstraction, in order to make that not happen, is going to take 100 times more work than it would if we separated the area of responsibility of those two operations into two
Michael:
Right.
Ben:
separate functions.
Michael:
Yeah. So just to put a little life into these examples. So I'm a few years into my data science career, and I've been learning a lot more about software, and I've been doing a lot of sort of ad hoc and customer related solutions. And Ben brought up two sort of overarching themes. One is, does it exist? And two, are you over-engineering? And of course you can under-engineer, I guess that's a third topic, but hopefully you're not doing that. But I've done it as well. So on the first piece, does it exist: I was building a testing framework for a customer that would essentially allow them to parse a set of files, run an ETL process, time it, and then change either ETL configurations or the underlying compute, or other parameters about the job, and they could then sort of systematically optimize. And I was trying to get all parquet files in a single directory and also sort of limit which directories we were searching. And there's some nuance to why we were approaching it this way, but basically I created a list of files by walking through the directory, and I thought I was so smart. I went to os.walk, copied and pasted the source code, and then converted it so that it worked for the Databricks environment, the dbutils. Irrelevant if you don't know what that means, but basically I thought I was doing some great work, wrote 30 lines of pretty snazzy code, and this senior data scientist that I'm collaborating with took me aside after the presentation with the customer. He's like, let me see your code. And there's something called a glob filter, which essentially does sort of regex-esque matching of directory structure.
Ben:
Mm-hmm.
Michael:
And my prior code ran in like 90 seconds. The glob filter ran in like three seconds. Um, and it's a lot easier to work with. So there's an example of, does it exist? I did sort of Google around, and I did use an open source thing; I used os.walk's code instead of actually writing a recursive directory walk from scratch. So, partial credit. But there was a much more efficient solution, and sometimes you just have to have someone look at your code who knows more than you and say, ah, you're an idiot. So that's example number one.
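For the plain-Python version of the same idea, a minimal sketch: a glob pattern replaces the hand-written recursive walk. The Spark read option in the final comment is the closest equivalent I'm aware of to the "glob filter" being described, not a quote from the episode:

```python
from pathlib import Path

# Hand-rolled approach: recursively walk the whole tree and filter as you go.
def walk_for_parquet(root: str) -> list[str]:
    found = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix == ".parquet":
            found.append(str(path))
    return found

# Glob approach: one pattern expresses the same intent in a single call.
def glob_for_parquet(root: str) -> list[str]:
    return sorted(str(p) for p in Path(root).glob("**/*.parquet"))

# On the Spark side, the equivalent is a read option rather than your own
# traversal, e.g. spark.read.option("pathGlobFilter", "*.parquet").parquet(root)
```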
Ben:
Not so much an idiot, it's well-intentioned and you learn something from it. And that's the important thing. Where it would have gone wrong and I would have said, yes, you're an idiot is if you're like, no, my code is awesome. I'm sticking with this. I'm sending this to the customer. I'm not going to change it. That's where
Michael:
Yeah.
Ben:
somebody goes from being well-intentioned, coming from the same place of ignorance that we all come from no matter what we're working on, to becoming an actual idiot, which is thinking what I wrote is amazing and there's no better solution out there. So you didn't do that. You punted over to that new thing and then learned it, and now that's something you have in your own memory bank of code knowledge. Like, hey, if I use Unix commands, those are really fast, and that's what it's doing under the hood. It's actually using file system commands from the operating system.
Michael:
Mmm.
Ben:
And the OS module in Python does that as well. It's just that there's a little bit more overhead on os.walk
Michael:
Yeah.
Ben:
because it has to support things that are apart from what you were trying to solve.
Michael:
Yeah, and I assume that the glob filter parallelizes under the hood, you know.
Ben:
I mean, it's multi-threaded. The
Michael:
Yeah.
Ben:
operating system file searching is multi-threaded.
Michael:
Yeah. And the OS walk code is not. So I think that's a big, a big impact as well. At least the code that I implemented.
Ben:
Yeah, and there's trade-offs there, of why you would not want that to be multithreaded within Python, because now you're looking at potential side effects. How do you mutate the tree and make sure that you're not destroying components of it? Or order of insertion. If you're recursively searching through a directory in parallel, what's the source of truth? You would have to lock writes to that in-memory location for that tree in order to do an update or a modification to it. So how would you know what the actual order of execution is? So that's why it's synchronous in Python.
Michael:
Got it. Yeah, that makes sense.
Ben:
But.
Michael:
So this is an example of all the backend stuff that I still don't know about. And it's super cool, but having someone who knows a little bit more, or even just using open source frameworks, it'll probably handle 99% of the issues that you would come across.
Ben:
Yeah, provided that you're looking at it in such a way that I always encourage people to do what you did because it's a learning process. I don't think it's very constructive for a mentor or somebody who's just on their own trying to learn something to always just be given the answer. Like, hey, what's the most efficient way to do this thing? use this library or like, Hey, just do it this way. I'll give you the code and you run it. It'll work. You don't learn anything doing that. You don't get this deeper understanding of the why
Michael:
Right.
Ben:
things are the way they are or how something works, but you learn a lot by trying to build it yourself. You know, you took a recursive directory walk algorithm and made it so that it works on basically the Databricks file system, which is an object store. That gives you an understanding of how that code was written and how you would go through a tree traversal and build up a graph of those relationships. And it's invaluable to learn that. And I think some people who try to go out and build things spend a lot of their time just trying to find the quick win by going to Stack Overflow or Googling it, looking for somebody's blog post so they can look at somebody's implementation. And I'm not saying that people just blindly copy code from that. Sometimes
Michael:
Right.
Ben:
that happens, but they're looking for the bits of information that solve their problem without forcing themselves to go through the process of really learning it. And even though it might take you a little bit more time to just put forth an effort, even if you can't figure it out, it's totally fine. Try to work through it, like think through it, write some code, even if it's totally broken, put the ego aside, ask somebody more senior to take a look at what you're doing, but explain beforehand, like I'm trying to understand this a little bit more. I hit a roadblock. I put an hour of my time into this. Are you familiar with this and could you point me in the right direction? Most senior people will be like, oh yeah, I remember doing that. Or probably something very similar. Here's how you want to think about it and that will unlock stuff in your mind. You're like, oh, now I get it. You'll remember it, not just next week, you're going to remember it next year. You'll be able to answer that question for the next person that comes along.
Michael:
Right. Yeah, a hundred percent. And so the second story literally happened yesterday, and it's centered around over-engineering. So I have been doing sort of these performance benchmarks for a customer. I'm working with two different teams. One team is a lot more stringent, we'll keep it very politically kind, but they're a lot more stringent and less trusting and want to actually have reproducible runs that are for sure correct. And they want to know that we have optimized as completely as possible, and that it's not just good enough from our perspectives. And then another team is just like, look, we are a low level ETL service. We need to get data from point A to point B. We don't really care how much it costs. It should just be fast, and then let's move on. And so I came in with the team one perspective when I was working with the second team, and so I built a fat framework that would do API calls, have like parallelizable runs. It was pretty cool. I'm not going to lie. And I demoed it yesterday and they're like, this is awesome. Thank you so much. Can we just copy all of the data right now? And I was like, sure, but what about all the benchmarking? What about the runtimes? How do you know it's going to be faster? And they're like, if it runs in under 24 hours, we're happy. And so there's an example of sort of over-engineering. Theoretically I could have scoped a bit better and worked more incrementally with that team, gotten feedback. And my big framework didn't take that much time to write, but, uh, it's sort of a tricky balancing act: managing expectations, but also building the minimum viable product that meets a stakeholder's needs. And the stakeholder can be yourself, it could be your boss, it could be an external customer, whoever. So Ben, I was just wondering if you had thoughts about, so we've talked about sort of making code good enough, how do you know when to make code less good?
Ben:
That's an amazing question, actually. It's one of those things that I don't think a ton of people can answer, because it requires you to have experienced and lived through both of the two polar opposite paradigms of building solutions with software. So from the applied side, particularly in consulting, you want to make your customer happy. You want to build what they're asking for. And sometimes it's almost like you want to prove to yourself and your customer that you're good. So you typically, at least I did, would err on the side of over-engineering. I'm like, hey, they might need these features, or hey, they asked for this thing, or they said, hey, it'd be cool if it could do this thing. So I would just build it if I had the time. And a lot of times, particularly towards the end of my time working with clients, I had plenty of time, because I was getting very fast at doing applied engineering work with software and I'd be able to write stuff really quickly. So there would be this prescribed budgeted time of, like, hey, you got three weeks to deliver this. If I had delivered everything that was the actual minimum viable product, did all the testing, you know, sent it to them and got their feedback and maybe did another round of changes based on what they asked for, in other words, if I had used a software engineering mentality and gone through that, I would have been done in probably one week, maybe a week and a half, by the time they accepted the code. And then I could spend the last week and a half writing up documentation, making sure that I had full test coverage, that I set up, you know, CI/CD for them for this, and, you know, made the repo look really pretty and refactored some code, whatever. But a lot of times in consulting, you're not doing that. You're delivering a code base that sometimes you're not even checking in while you're doing it. So you're trying to impress people, trying to be like, I got this figured out, and they rattled off all this stuff in the discussion about things that would be cool, so I'm going to build all those and I'm going to build some more stuff too, so they're going to be like, wow, this is amazing. So moving from that world to the other world, which is, hey, we're only going to build what we need to build and nothing more, is hard, because in meetings, in planning, in discussions with other engineers, people are like, yeah, I think that's out of scope. People are usually nice about it, but they're like, yeah, I looked at your design doc, let's not do these seven things. These aren't really must-haves. Let's just focus on the core thing that we're trying to do here. And now I've been doing it long enough that I'm that person who's doing that, who's like, do we really need to do that? Let's punt that, or let's wait for feedback. Let's only build that if somebody asks for it. And that's totally fine. But that's the mentality difference between software engineers and applied engineers. With software engineers, it's not that they're lazy. It's the exact opposite. But a lot of people think, like, why don't they build more features in this product? Why don't they just make it do what I need it to do? Or why is it so broken in this part? That's because you're working within a limited amount of time that you have on this one thing, because after you do that, you're going to work on something else. There are no breaks in between.
Knowing that you have an allotted and approved amount of time budget, and part of that time budget is designed to make sure you're building what is actually needed, and getting peer review for that, and then building what you need to build, and getting peer review for that and changes, and writing all the tests and making sure that things work. Within that limited scope of time, you have to make sure that you're really only building what is absolutely needed for the product requirement or product request.
Michael:
Right.
Ben:
Anything apart from that, it's not that, hey, it's a nice to have, it'd be cool to build this if we have time. It's: let's not build something that is either never going to be used or that nobody's asking for. So even if you have it as an idea, or somebody spouts off in a meeting saying it'd be cool if it does this thing. Sure, it would be cool. It would be cool if my car could go vertical and take me to the moon on an afternoon sometime.
Michael:
That would be cool. You should do
Ben:
It's
Michael:
that.
Ben:
But I'm definitely not going to be looking to have somebody build that for me, because it's expensive, right? And the same thing goes with software. The more code and the more features that you apply to something, the more of a maintenance burden it becomes. And it's not just the maintenance burden that applied people think about, which is, hey, if this thing breaks with this feature that I created, I now have to go and fix it. The maintenance burden in production software is: what happens when one of the libraries that we're using upgrades and deprecates something, like removes it from the next version of the library? Or what happens when there's a security vulnerability in this package and we have to pull this package out? How much work is that, to be refactoring your code to avoid that? Or what happens when we want to change execution frameworks and our code doesn't work in our prod environment anymore, and we have to take this multi-year effort to refactor everything into a different language? The more junk that's in your code, the more bullshit that you've created over the years, that's just more work that you're doing in order to fix that stuff, which translates into fewer productive new things that you're creating that make your company money. The one finite resource that we have is time. And the more bullshit we build, the more time it takes to migrate, fix tests, validate, or improve. And if we have some tightly coupled framework where we built features X, Y, and Z, and we only needed feature X, well, when we want to build, you know, feature alpha on top of that, we now have to update X, Y, and Z, instead of only having had to update X. So it's
Michael:
Right.
Ben:
just, everything takes more time, which sucks. So that's
Michael:
Right.
Ben:
the software engineering mentality for it.
Michael:
Yeah, that makes sense. Yeah, an example is, instead of writing the outputs of my performance runs to a table, I just printed them and copied and pasted them into a Google spreadsheet. It works. And theoretically writing to a table would be super fancy and super cool, and it would be an extra seven lines of code, but I didn't really wanna write it, so I didn't. It's working great. So yeah, there's an example. There's all sorts of ways you can cut corners. And then, before the recording, Ben and I were talking about the sections of this podcast, and I wanted to conclude with this hopefully very useful section. I'm hoping Ben will give me some tips and tricks. So both of us have worked sort of as applied data scientists and data engineers, and we have to write a lot of code, and some of this code you see come up again and again. So Ben, can we just sort of take turns, maybe come up with like three code functions, frameworks, functionalities, something that you have reused a lot over your history? And I'll start it off with this API wrapper. Basically, I don't want to worry about authentication. I want to create a class that holds my host, that holds my token, that holds everything else. And then I can just have a function that wraps the requests library, handles authentication, and returns the JSON object. So that's one thing that I found very valuable and have reused in almost every project. Your turn.
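A minimal sketch of that kind of wrapper, assuming bearer-token auth and the Python requests library; the host and endpoint in the usage comment are illustrative, not a specific API:

```python
import requests

class ApiClient:
    """Tiny wrapper that keeps host and token in one place."""

    def __init__(self, host: str, token: str):
        self.host = host.rstrip("/")
        self.session = requests.Session()
        # Auth is attached once here, so callers never think about headers.
        self.session.headers.update({"Authorization": f"Bearer {token}"})

    def get(self, endpoint: str, **params) -> dict:
        resp = self.session.get(f"{self.host}/{endpoint.lstrip('/')}", params=params)
        resp.raise_for_status()
        return resp.json()

    def post(self, endpoint: str, payload: dict) -> dict:
        resp = self.session.post(f"{self.host}/{endpoint.lstrip('/')}", json=payload)
        resp.raise_for_status()
        return resp.json()

# Usage with a hypothetical workspace host and endpoint:
# client = ApiClient("https://my-workspace.example.com", token="dapi...")
# clusters = client.get("api/2.0/clusters/list")
```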
Ben:
I think a lot of mine are utilities that are, like, applied usages of framework code that are onerous to rewrite from scratch, but are so commonplace within specific niches. One example would be applied window functions, where certain industries all do kind of the same thing to very similar data. Like, hey, I want to get a variable number of windows over my same temporal series of a whole bunch of different products that we're trying to sell, and I want basically the ability to do smooth plotting of my historical trends, because the raw data is super noisy. Well, I just want a weighted moving average over time, but I wanna be able to cut it in order to reduce noise on, like, these five different time horizons. So when you think about that, you're like, that's super simple to do. It's not that complicated. But when you write that in, like, Apache Spark, and you use the DataFrame APIs and write the window and define the partition, then you realize, okay, I got to do this like five times for this one customer. This other customer wants like 20 different ones in here. So you start looking at, okay, the difference between five and 20 in doing it the scripting way is 15 more instances of pasting within, you know, a script. So it's like, all right, I'll create a function for this. And, like, where do I need to abstract this so that it works for all of the customers that I interact with in this industry? Because they're all doing the same thing. Well, they have different column names, they have different definitions of what their timestamp is, and, like, how do I need to collapse this in such a way that the window will actually apply correctly? Just write an abstraction for that, and that becomes a utility. And when another customer needs it, I can tell them, like, hey, I already wrote this, and I've given it to these seven other customers, do you want that as well? And if they're like, no, we don't want code that you've given to other people, we want it written for us, I'm like, all right, I'll do that. I'll write a bespoke implementation for you. But 99% of the time they're like, yeah, if you can give us that, that'd be great. It'll save us a couple hours of work. Here you go.
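A sketch of what a parameterized window utility like that could look like in PySpark; for brevity this computes a simple (unweighted) moving average over day-based ranges, and the column names and horizons are stand-ins for whatever a given customer's schema uses:

```python
from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F

# One function instead of N pasted window blocks: pass in the partition key,
# timestamp column, value column, and however many horizons you need.
def add_moving_averages(
    df: DataFrame,
    partition_col: str,
    ts_col: str,
    value_col: str,
    window_days: list[int],
) -> DataFrame:
    ordered = F.col(ts_col).cast("timestamp").cast("long")
    for days in window_days:
        w = (
            Window.partitionBy(partition_col)
            .orderBy(ordered)
            .rangeBetween(-days * 86400, 0)
        )
        df = df.withColumn(f"{value_col}_ma_{days}d", F.avg(value_col).over(w))
    return df

# smoothed = add_moving_averages(sales, "product_id", "sale_ts", "units_sold",
#                                window_days=[7, 14, 28, 56, 90])
```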
Michael:
Sold. All right, I'm going through my GitHub actually, and going down memory lane. One thing that I actually like is sort of a tabular-to-Plotly Sankey converter. So a Sankey diagram is sort of sources and targets, with potentially, like, nodes in the middle. And I've always found that library, or that function, sort of a bit wonky to work with. So having a wrapper that takes tabular data with source, target, and maybe some colors, and makes a Sankey diagram, that has been really useful.
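A small sketch of that converter, assuming a pandas frame with source, target, and value columns; the column names and sample flows are made up:

```python
import pandas as pd
import plotly.graph_objects as go

# Convert a (source, target, value) table into the index-based node/link
# structure that go.Sankey expects.
def sankey_from_table(df: pd.DataFrame) -> go.Figure:
    labels = pd.unique(df[["source", "target"]].values.ravel()).tolist()
    index = {name: i for i, name in enumerate(labels)}
    link = dict(
        source=[index[s] for s in df["source"]],
        target=[index[t] for t in df["target"]],
        value=df["value"].tolist(),
    )
    return go.Figure(go.Sankey(node=dict(label=labels), link=link))

# flows = pd.DataFrame({"source": ["ads", "ads", "organic"],
#                       "target": ["signup", "bounce", "signup"],
#                       "value": [120, 80, 200]})
# sankey_from_table(flows).show()
```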
Ben:
Yeah, I've done some stuff with that. Many, many years ago, when I was doing just pure data science work and not doing, like, production ML, you know, more like advanced analytics, I worked for a couple of companies where people were really particular about the way that data was visualized for reports that had to do with our outputs. And I remember, like, a couple of the companies I worked for used software stacks that were proprietary, and I'm not going to name names, but I wasn't the biggest fan of them. And I didn't feel it was worth my time to learn the nuances of those proprietary software stacks or the tooling that they had. So I was like, yeah, I'm going to see if I can build something with, you know, Python visualizations, you know, matplotlib with some skins over the top of it, or I'm going to use Vega-Lite, because people really love interactive graphs and they like the ability to hover over plot points and see, you know, pop-up tooltips. So I learned how to do a lot of that stuff, and it was easy to mock up graphics with those, provided you could write the appropriate configuration files and write a bit of CSS to go on top of some of these, to make them look like something you would see on a professional website. People got really excited about that, but it's a ton of configuration that you have to write, and it's almost impossible to fully abstract that. I would have different versions of charts that would just be these massive functions, or I would create a class that had all these different methods that allowed me to generate all these different visualization styles based on data inputs. But even then, I would still classify stuff like that as a utility. One of the visualization packages that I wrote years ago for use at a company, there were about 50 modules within that Python library.
Michael:
Cheers.
Ben:
And there were a couple hundred thousand lines of code that went into building all of that stuff. And there was some transformation logic, and I learned a little bit of JavaScript in order to do some cool stuff with animations. I still wouldn't call it a framework, though. Even though a lot of people would look at that and be like, oh, it's a cool visualization framework. Like, no, Vega-Lite is a framework that's got the rendering engine built into it, and that is as abstract as they come. It's just, give me your data, I produce a chart for you. That's a framework. Mine was a utility that allowed me to customize how a framework worked in order to meet the needs of my company at that time, and prevent me from having to copy-paste way too much configuration everywhere.
Michael:
Yeah, I think pretty visualizations are frankly undervalued. I mean, I remember when I first joined Databricks, the DB SQL automatic color palette was just disgusting. It was, like, so ugly in so many different ways. And I remember I gave feedback, and I think they changed it since then. But yeah, a pretty picture is really nice to look at. So why not
Ben:
Mm-hmm.
Michael:
make it pretty? All right, I'll come up with one more. My GitHub is a mess, I just realized. But, um, one other useful thing that I have seen, uh, especially for teams, is having read functions sort of globally available. So handling authentication for going into a database and pulling data, if you're not using sort of a managed metastore or something like that, having modules, or at least just some functions, to make data access as easy as possible, that's super, super valuable. And abstracting it away as just, like, table name, give me a select star, or run this query against table name, and handling, for instance, the JDBC connections, that is super, super valuable. So Ben, what's your last one?
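A sketch of that kind of shared reader for a Spark shop, assuming a JDBC source; the URL and environment variable names are invented, and in a real deployment the credentials would come from a secrets manager rather than the environment:

```python
import os

from pyspark.sql import DataFrame, SparkSession

JDBC_URL = "jdbc:postgresql://warehouse.internal:5432/analytics"  # made-up host

# Team-wide reader: callers pass a table name (or a query) and never touch
# the JDBC URL, driver options, or credentials directly.
def read_table(spark: SparkSession, table: str, query: str | None = None) -> DataFrame:
    # Push an explicit query down as a derived table instead of pulling everything.
    dbtable = f"({query}) AS subq" if query else table
    return (
        spark.read.format("jdbc")
        .option("url", JDBC_URL)
        .option("dbtable", dbtable)
        .option("user", os.environ["WAREHOUSE_USER"])
        .option("password", os.environ["WAREHOUSE_PASSWORD"])
        .load()
    )

# orders = read_table(spark, "public.orders")
# big_orders = read_table(spark, "public.orders",
#                         query="SELECT * FROM public.orders WHERE total > 100")
```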
Ben:
I don't really have another example that comes to mind, but based on what you just said, that's starting to approach what you would call an internal framework, where you're trying to abstract away complexity from an end user and simplify that access paradigm. Like, hey, our data exists in eight different locations, but those eight different locations have different storage paradigms. Like, hey, we write some files directly to ADLS Gen 2, some to Azure Blob storage, some to Delta, some to, you know, a MySQL database. And if your team needs to access all four of those locations, you really don't want to be handing out usernames and passwords or, you know, authentication tokens to individual users and saying, hey, don't lose these, or don't leak them in code, or, hey, make sure that all of this is set up. You're pushing too much complexity onto the applied users of something, where they now have to think about how do I get my data, how do I authenticate to it? And it's not something that's in the purview of a team that's trying to interface with that data. So it's great to create an abstract framework that would be like, hey, based on our naming conventions, we have a code word for each of the four data sources, and then we have table names, or database dot table name, whatever you want to use, for accessing what that team needs. Have the IT department handle, you know, access control for that, so your ACLs are set up, and your team's group key could be individualized, but have some way to make sure that auth is being handled in a secure way for your users. And that's an important tip, and it's why I wanted to talk about this as a use case; it's very important for ML engineers, I think, who are at a company or working on projects and thinking about building abstraction like this. It's very important when you're building a framework to focus on what the point of your framework is. So in that instance, this condition that we're talking about, we're connecting to these different data sources and we want to handle auth for them. Handling auth should be calling out to a service that actually handles auth. So if we wanted to design this framework and say, we're going to be handling your permissions to get to this data, and then we're going to make it easy for you to get to this data via the simple API interface that's source system and database and table name, if we design that framework to actually implement auth, we now have to handle auth. That's not, hey, we're going to talk to our ACL system, or we're going to talk to our OAuth 2 provider, or, you know, whatever auth system is in place, Active Directory or something. If you say that you're actually going to handle that, that means you're implementing Active Directory handshakes, or you're implementing token validation, hashing, you know, secure key transport. You're making sure that whatever token is being passed in your framework grants authorization as a principal to the source system, basically root access to it, because that's what an auth platform is. It has the keys to the kingdom. It can access anything, but it's the gatekeeper for each individual user, to say, do I have permission to talk to this? And that's a situation where you look in open source. I mean, actually, the first solution should be, do we have a paid service for this? Because we should be paying somebody to do this for us, because it's really complicated and really boring work.
And if no, we're a small startup and we don't have any contracted security companies that we're working with, can we get away with OAuth or something? Or can we do open API authentication with OAuth 2?
Michael:
it.
Ben:
Or even just basic auth, where we'd use usernames and passwords. What's the simplest thing that we can build that meets our security posture needs? And do we need that in our framework? The vast majority of the time, the answer is going to be no, unless you actually are building a real framework, the real deal, where you actually need to be the gatekeeper. If your framework is the one that's controlling access to the data, or that is the access to the data, then yeah, you probably have to go down that route of implementing auth. But if you're a middleware framework between a user and some other layer, some other service, and that other service has auth, or has a layer that you can put in between you and that service, always do the software engineering route, which is only build exactly what you need.
Michael:
Yes,
Ben:
Even
Michael:
sir.
Ben:
if it is a requirement for your project, that doesn't mean it's a requirement for your framework. It could be a product requirement, but delegation is key. Delegation of authority, delegation of responsibility, delegation of functionality.
Michael:
Yeah, and if it already exists, just use it.
Ben:
use it. If you want to learn more about it, then do what we talked about earlier in the episode. Try to implement something yourself on a toy project so you can get better understanding of it. Or read through the examples on that company's website or that open source package's website and try it out. See how easy it is. And then look at an advanced example and you'll probably be like, yeah, I'm not building that.
Michael:
Yes, I can confirm that happens very often. Cool, well, I will wrap. So today we talked about frameworks versus utilities, and sort of differing levels of advanced versus non-advanced code. At a very simple overview, utils, or utilities, is just a list of functions; it does stuff. A framework abstracts a lot of the underlying functionality so that users are exposed to a high level interface. And typically a core difference is that with utils you have a lot more power and a lot more freedom, and a framework sort of gives you a simple user journey: you can do this or do that, and there's a lot more backend functionality hidden from the user. And then I tried to see if we could come up with some key categories of levels of how complex code should be. And this is what I heard. So start off with ad hoc: that's sort of an end to end test of basically whether the script runs, and it's basically a list of commands. But if you're going to be reusing this, you should probably incorporate tests. If external people, so external to you, are going to be using it, you should probably have a lot of tests. And then if it's going to be used throughout an entire organization or even go into open source, you really want to add a lot of abstraction and sort of bury the complexity. And then a couple of quick pointers. You should always have tests. This is sort of a minimum benchmark of ensuring there's quality in your software, and it also helps for a bunch of other reasons that we talked about. Um, always check if something exists; if it exists and you can do a pip install, it's a lot faster than typing out everything. And then make things as simple as possible, but not simpler, to quote Albert Einstein. So Ben, anything I missed?
Ben:
Now that hits it.
Michael:
Well until next time, it's been Michael Burke and my co-host
Ben:
Wilson.
Michael:
and have a good day everyone.
Ben:
See you next time.