Maintaining Backward Compatibility in Software Projects: Strategies from Industry Experts - ML 164


Special Guests: Sandy Ryza

Show Notes

Today, host Michael Berk and Ben Wilson dive deep into the multifaceted world of software engineering and data science with their insightful guest, Sandy Ryza, a lead engineer from Dagster Labs. In this episode, they explore a range of intriguing topics, from the impact of the broken windows theory on code quality to the delicate balance of maintaining backward compatibility in evolving software projects.
 Sandy talks about the challenges and learnings in transitioning from data science back to software engineering, including dependency management and designing for diverse use cases. They touch on the importance of clear naming conventions, tooling, and infrastructure enforcement to maintain high code quality. Plus, they discuss the intricate process of selecting and managing Python libraries, the satisfaction of refactoring old code, and the necessity of balancing new feature development with stability.
Michael and Ben will guide us through these essential discussions, emphasizing the significance of user-centric API design and the benefits of open source software. They also get practical advice on navigating API changes and managing dependencies effectively, with real-world examples from Dagster, Spark Time Series, and the libraries Numba and Pydantic.
Join them for an episode packed with valuable insights and strategies for becoming a top-end developer! Don’t forget to follow Sandy on Twitter and check out Dagster.io for more information on his work.

Socials

Transcript

Michael Berk [00:00:05]:
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I do data engineering and machine learning at Databricks. I'm joined by my cohost,

Ben Wilson [00:00:14]:
Ben Wilson. I do quarterly planning at Databricks for my team.

Michael Berk [00:00:19]:
Just your team? No other teams?

Ben Wilson [00:00:21]:
Just my team. That's it. And ask other people for assistance, and then promise assistance to other teams. But, yeah, that's what I'm doing this week, I should say.

Michael Berk [00:00:32]:
Congratulations. We appreciate everything that you do for us. So today, we're speaking with Sandy. After graduating from Brown where he studied computer science, Sandy started his career at Cloudera as a data scientist. He then took a variety of roles, including product manager, ML engineer, manager, and then finally, he settled at Dagster Labs as a lead engineer. So, Sandy, you lead the open source side of Dagster and have a long history of open source contributions. Why open source?

 Sandy Ryza [00:01:08]:
Man, tough question. You know, I think I naturally gravitate towards open source, but I I think I've also rarely tried to, put down in words what the appeal is. With open source, you feel like you're kind of contributing to some sort of broadly useful edifice that can be reused as as as part of the world. You know, software and products go away, but a lot of open source technology from a very long time ago is still out there in the world. There are a lot of people outside of enterprises that still need to process data, still need to work with technology, and open source kind of, allows them to leverage technology without needing to pay or go through some sort of, like, difficult on ramp or process with a with a business. So, you know, I think, you know, open source is more easily used in science. There's just all of these benefits that come out of a a a product being open source that you don't get with fully proprietary software. It's also good for business in other ways, but that maybe that's less of the deep, philosophical appeal for me.

Michael Berk [00:02:26]:
Got it. So you feel that it's just more meaningful work because anybody can use it. And if it's a good product, it it helps the world generally.

 Sandy Ryza [00:02:34]:
That's a much more succinct way of discussing. Yes.

Michael Berk [00:02:37]:
Cool. Alright. So you mentioned also that it has this interesting permanence, and, it doesn't really go away unless someone deletes a repo. And if it's getting some usage, it probably won't get deleted. So speaking of that, I was going through your GitHub and found an old repo. I think the last commit was, like, 7 or 8 years ago, and it was called spark time series. So I used to do a lot of time series forecasting, then built a bunch of time series forecasting models. Why use Spark for time series specifically? And, also, I'm curious, why didn't that model get any adoption?

 Sandy Ryza [00:03:13]:
Yeah. So I think that's an interesting counterpoint. Spark Time Series is a project that has lived on in perpetuity, to my dismay, perhaps. I often get emails asking for support with Spark Time Series, and it's like, I, you know, I wrote on the GitHub that I would not be supporting this anymore. Wish I had the time to, but just wasn't able to. So, sometimes open source software can outlive its utility. Let's see. So I developed Spark Time Series when I was working at Cloudera, what feels like ages ago.

 Sandy Ryza [00:03:53]:
This was kind of in the early days of Spark. And I had been working in Cloudera's data science organization with a lot of customers in the, like like, financial services, sort of dealing with financial data, which often takes time series formats. And I wanted to help them understand that the tools that we were building could help them work with the kind of data that they were working with. At the time, Spark was very focused on big data processing. And sort of the idea was like, okay. If you're doing small data, maybe you're fine with pandas. But if you're working with large data, maybe you should work with Spark. But that kind of bifurcation, I think, is a little bit difficult.

 Sandy Ryza [00:04:37]:
Like, okay. I just switched to an entirely different set of software and APIs once my data reaches a certain threshold. I think that's not a very good development experience. So so even though time series often is a bit more of a small data problem than big data problem, one, people often work with, like, thousands of or, you know, millions of time series in parallel. So even if each individual time series is very short, you can have, tons and tons of them. And second of all, you know, if you can use the same tool for small data that you use for big data, then you have to spend less time refactoring and rewriting your code.
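
A minimal sketch of that idea, with made-up column and function names, using today's PySpark DataFrame API rather than the RDD-era API spark-timeseries was built on: group many short series by ID and let ordinary pandas code fit each one, with Spark handling the parallelism.

    from pyspark.sql import SparkSession
    import pandas as pd

    spark = SparkSession.builder.getOrCreate()
    # Hypothetical input: one row per observation across millions of short series
    observations = spark.read.parquet("observations.parquet")  # columns: series_id, ts, value

    def fit_one_series(pdf: pd.DataFrame) -> pd.DataFrame:
        # Plain pandas code, applied independently to each series by Spark
        pdf = pdf.sort_values("ts")
        forecast = pdf["value"].tail(7).mean()  # stand-in for a real forecasting model
        return pd.DataFrame({"series_id": [pdf["series_id"].iloc[0]], "forecast": [forecast]})

    forecasts = observations.groupBy("series_id").applyInPandas(
        fit_one_series, schema="series_id string, forecast double"
    )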

Ben Wilson [00:05:13]:
I've got a question for you about the difference between your experience and because I have my own experience, which is very similar, doing a couple of open source projects that rely on another framework and then also working on a framework itself. How do you view the difference between writing code that is highly dependent on a very popular framework that you have zero control over versus you writing and maintaining something that you have? You're obviously using other open source packages ever like, all of us do. You know? Oh, I need to work with data frames. I'm using pandas or I'm using NumPy. Like so we all have dependencies, but those are very tightly controlled and not not a lot of breaking API changes. But when you're dealing with a hot open source framework, what's the different like, how is your experience different between that?

 Sandy Ryza [00:06:06]:
Yeah. It definitely adds some challenges. So at the time that I was working on Spark Time Series, I was actually a core Spark committer as well, which meant I had some ability to change the underlying project. But also, it had so much momentum that, you know, whatever I could do individually was outweighed by everything else that was going on on the project. So I think your question still applies. Yeah. I mean, it really depends on how stable the project is. I think Spark has changed quite a bit since the time that I wrote the Spark Time Series library.

 Sandy Ryza [00:06:41]:
RDDs were the central abstraction back then. That's basically been entirely replaced with data frames and datasets. And so that means you end up having to kind of tread water just to stay afloat of what's going on. This is even worse than the world of Hadoop and MapReduce, which I had worked on before Spark, where they had 2 different versions of the core MapReduce APIs that just lived together. And anyone writing on top of MapReduce just had to, you know, write 2 versions of their own API in order to work with those different MapReduce APIs. So I at least appreciate that Spark kind of made a more concerted effort to move people into the future. Although RDDs, I believe, are still around, so people still use them. So, yeah, it makes it tough not just for yourself as the developer, but for the users building on top of your library that kinda have to deal with everything that bubbles up from the core framework.
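
A tiny, generic illustration of that API shift (not code from either project): the same aggregation written in the RDD style that spark-timeseries was built on, and in the DataFrame style that replaced it.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    pairs = [("a", 1), ("b", 2), ("a", 3)]

    # RDD era: explicit functional transformations over Python objects
    rdd_totals = spark.sparkContext.parallelize(pairs).reduceByKey(lambda x, y: x + y).collect()

    # DataFrame era: declarative operations the engine can optimize
    df_totals = (
        spark.createDataFrame(pairs, ["key", "value"])
        .groupBy("key")
        .agg(F.sum("value").alias("total"))
        .collect()
    )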

Ben Wilson [00:07:44]:
Yeah. We felt that pain. We maintain a bunch of stuff on the team, including SparkML. Like, that's our responsibility, which I'm sure you're familiar with, is largely RDD based for a lot of those algorithms.

 Sandy Ryza [00:07:58]:
Yeah. I I think I actually wrote, like, one hot encoder for that a long time ago.

Ben Wilson [00:08:05]:
Yeah. We actually just did PR for a string indexer, yesterday.

 Sandy Ryza [00:08:09]:
Nice.

Ben Wilson [00:08:10]:
To, like, update that to support a new functionality. But when you're talking about one of those core, like, those core algorithms, when you start looking at, like, oh, how does linear regression work? And you look at the code, and you're like, oh, this is this is all RDD. And then in a future version of Spark that's gonna be released, you know, sometime later this year, RDDs are banned from being used in Spark. Like, you can't use them. The APIs are sort of, you know, they're deemed unsafe for

 Sandy Ryza [00:08:42]:
Sufficiently buried. Yeah.

Ben Wilson [00:08:45]:
But from from the actual new Python focused API, you can't call algorithms in a safe way that utilize RDDs. So it has to be done in, like, the secure data frame API, and you look at how that's done. You're like, I don't know how long it's gonna take to actually rewrite this core algorithm because I now have to rewrite an optimizer that I understand in RDD land, but now abstracting that to data frame operations. It's like, can we do this performantly? I mean, you know, do a prototype real quick and, like, wow. This is 10,000 times slower. We need to we need to think about this a little bit. And, like, those sorts of changes when, you know, like, hey. My whole library that integrates with this product is written in something that they're taking away from the community.

Ben Wilson [00:09:38]:
At what point do you start to say I mean, you already made the the comment on the the repo saying, like, I don't have time to support this. I I like it, and I think it's cool, but I I don't have the the capacity. When do you actually tell users, like, hey. As of this date, this is an archive project?

 Sandy Ryza [00:09:56]:
Yeah. I mean, first first of all, that sounds like a challenging, perhaps fun job you have ahead of you. Second of all, I'm kinda surprised it's taken this long, given how long ago Spark moved away from RDDs. 3rd of all, yeah, I mean, you know, it really depends on what's driving your commitment behind the original project. Right? In my case, it was this kind of proof of concept for Spark and time series data. I think it outlived, outlived its utility, but a lot of people depend on Spark ML. Right?

Ben Wilson [00:10:38]:
There's a few. You know, 1 or 2 companies out there.

 Sandy Ryza [00:10:41]:
That's right. That's right. That's sounds like, sounds like archiving it is not always in the cards.

Ben Wilson [00:10:48]:
Oh, yeah. We're we're definitely not arch archiving it.

 Sandy Ryza [00:10:53]:
But, but, yeah, I I do think that, in in many cases, a a change like an underlying framework is is big enough to kind of kill projects that live on top of it. Or they, you know, they can, you know, be the final straw that, like, leads the contributor to stop working on it, and the project kinda languishes into oblivion.

Michael Berk [00:11:18]:
I have a question for both of you. So when you're building a new project, you have to choose your stack. Let's just stick with Python for now. How do you go about selecting libraries? Obviously, they have to be stable. Obviously, they have to do the thing you want in a relatively performant way. But what's the process of saying, hey. We'll use this versus not this. We'll build it in base Python, or we'll add a PIP dependency.

Michael Berk [00:11:44]:
How do you think about that?

 Sandy Ryza [00:11:47]:
Yeah. I, you know, I wish that we had a more rigorous way of thinking about it. We definitely made some choices in Dagster that we've come to regret. An example of one of those choices is Pendulum. So I don't know if you're familiar with Pendulum, but it's this Python datetime processing library. Like, the kind of analogy in my mind is, in the old Java days, there was, like, the original Java datetime APIs, which were terrible. Then this library called Joda-Time came out, which is, like, the third party library everyone used instead. And so Pendulum is, like, kind of the Joda-Time of Python.

 Sandy Ryza [00:12:26]:
It has, like, much better time zone support than the standard Python libraries. Like, it kinda makes the right assumptions and has the right defaults compared with original Python. So we chose to depend on it, until later just discovering it was very slow. And now Python has come out and actually produced better time handling libraries. So we're kind of in the process of trying to undo that Pendulum decision, which has been very painful because some of our APIs actually return Pendulum objects. So there's this, like, whole backwards compatibility question. You know, I think it's especially dangerous to depend on a library when you're actually surfacing types from that library in your own public APIs because it makes it a much more difficult project to rip it out.
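
As a rough illustration of the time zone subtlety Sandy is describing, here is what the standard library alone looks like today (Python 3.9+ zoneinfo, with made-up values); libraries like Pendulum exist largely to make this kind of DST-aware arithmetic harder to get wrong.

    from datetime import datetime, timedelta
    from zoneinfo import ZoneInfo

    ny = ZoneInfo("America/New_York")
    meeting = datetime(2024, 3, 9, 12, 0, tzinfo=ny)  # the day before a DST transition

    # Wall-clock arithmetic: same local time next day, but only 23 elapsed hours
    wall = meeting + timedelta(days=1)

    # Elapsed-time arithmetic: convert to UTC first; the local wall clock shifts by an hour
    elapsed = (meeting.astimezone(ZoneInfo("UTC")) + timedelta(days=1)).astimezone(ny)

    print(wall)     # 2024-03-10 12:00:00-04:00
    print(elapsed)  # 2024-03-10 13:00:00-04:00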

Michael Berk [00:13:20]:
But that's super interesting. So, Ben, question over to you. If you were looking to incorporate a new Python library into MLflow, what's the vetting process, and how do you ensure that something like this wouldn't happen?

Ben Wilson [00:13:32]:
There's no way to make sure it doesn't happen. I wish I had a good answer for that. My analogy for that probably resonates with you, Sandy: typing in Python. Now, of course, there's no compile time typing validation. That's not possible in the language. But you go back to Python 3.2, and the concept of applying types or type hints to things was kind of very vague. There was an external library that somebody had created. They kinda supported most primitives, but the actual kernel didn't really do anything with it. And linters would like, depending on the linter you used while you were developing, you could kinda get, like, oh, there might be an issue here.

Ben Wilson [00:14:20]:
You're saying this is an int, but you're actually passing in a string. We're gonna like, there's some weird stuff that could happen here when you run this. You fast forward to Python 3.11, and all of a sudden, there's an entire type system that's part of Python that any IDE you're using to write in, it'll just automatically detect, like, hey. I'm doing an operation here with this function call that requires an integer input, and you're now passing in a potentially dangerous condition because you have an Optional that's coming in. It's an Optional int. So I don't know how to do this operation on a None. So your program could blow up at runtime. You're not gonna get what you get when you're compiling Java or C or anything, which is like, you cannot do this because this is unsafe.

Ben Wilson [00:15:08]:
You're just gonna get a warning, but that that's at least helpful. So if you use those 3rd party libraries and these plug ins to that core language and make that a dependency on your on your project, you're not just doing that for you. You're also potentially doing that for anybody who's gonna be integrating with your tool. They now have to use this plugin. And then as time moves on and people are upgrading, like, Python, and Python's core lang is starting to build a lot of this functionality in, there's sometimes a divergence. And then you're like, alright. What do I wanna push to a user to have them have to think about? And am I gonna be breaking the ability for a user to upgrade a runtime environment? Like, hey, I can't upgrade to Python 3.12 because it's incompatible with this library that you're now making me use. So that's really what we think about is is not so much, like, oh, is this gonna be a maintenance burden? Like, everything's a maintenance burden unless it's CoreLang, really, in my opinion.

Ben Wilson [00:16:14]:
And even sometimes CoreLang could be a maintenance burden. But what are you putting on like, what are you pushing to a user? Like, are you creating a headache potentially in the future for them? And that's what we try to avoid.
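
A minimal sketch (invented function names) of the Optional-int situation Ben describes above; the commented-out call is what a checker like mypy or a modern IDE flags, even though nothing stops you at runtime until it blows up.

    from typing import Optional

    def lookup_count(key: str, table: dict[str, int]) -> Optional[int]:
        return table.get(key)  # None when the key is missing

    def double(n: int) -> int:
        return n * 2

    count = lookup_count("page_views", {"clicks": 3})

    # result = double(count)  # mypy: Argument 1 to "double" has incompatible type "Optional[int]"; expected "int"
    result = double(count) if count is not None else 0  # narrow the Optional before using it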

Michael Berk [00:16:29]:
Got it. That makes a lot of sense. Okay. Because it just seems like there's the full gamut of open source tools that could do a lot of really cool stuff, but, adding in those dependencies. I remember one of my earlier commits to MLflow was trying to add in, like, a library that did some of

 Sandy Ryza [00:16:47]:
the functionality, and it was just like, reject, close. And, it makes sense now, but, yeah, it's just with the

Michael Berk [00:16:56]:
whole world of open source tools, you'd think that if they were stable, it'd be really cool to to just be able to use them at

 Sandy Ryza [00:17:03]:
will. So sort of thinking out loud. Yeah. There's there's a really a

Ben Wilson [00:17:08]:
the principle of minimalism, in my opinion. So if there's no compelling need to add in a new dependency and the potential compatibility issues you would have, I personally try to avoid that. Wasn't the library, like, Numba or something you were trying to add in?

Michael Berk [00:17:28]:
Numba. It was something like that. Yeah.

Ben Wilson [00:17:31]:
Because it's like you were like, hey. This makes, you know, NumPy processing so much faster. We're like, that's awesome. This is cool, but users can use that. And while they're doing their data science work, we don't need to use that because when you look at the processing that we're doing over these array structures that are coming in, we're not doing stuff in, like, transposing arrays or messing around with tensors. We're just taking a tensor and then serializing it or deserializing it for a user, and that we don't get the benefit of that additional library.

 Sandy Ryza [00:18:06]:
Numba is the one that does, like, jitted,

Ben Wilson [00:18:10]:
Yeah. Yeah.

Michael Berk [00:18:11]:
Jitted. C compilation.

Ben Wilson [00:18:12]:
It's amazingly fast.

 Sandy Ryza [00:18:14]:
Yeah. Numba's so cool.

Michael Berk [00:18:15]:
Yeah. For any listeners who are unaware, basically, it can convert simple Python code and some basic NumPy functions into compiled C, just with some decorators. So if you're doing stuff like Monte Carlo simulations or just loop based operations, it's really effective. Yeah.
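
A small, hypothetical example of what Michael is describing, assuming Numba is installed (strictly speaking, Numba JIT-compiles to machine code via LLVM rather than literal C).

    import numpy as np
    from numba import njit

    @njit
    def estimate_pi(n_samples: int) -> float:
        # A plain Python loop: slow in CPython, compiled to fast machine code by Numba
        inside = 0
        for _ in range(n_samples):
            x, y = np.random.random(), np.random.random()
            if x * x + y * y <= 1.0:
                inside += 1
        return 4.0 * inside / n_samples

    print(estimate_pi(1_000_000))  # first call pays the compile cost; later calls are fast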

 Sandy Ryza [00:18:35]:
It's it's especially hard as platform software. Like, if you're building a project or an application inside one company, there's a lot of things you can control that you cannot control if you're building software that everyone is gonna use. So, like, an example of this that we've faced is with Pydantic. I don't know if you're familiar, but Pydantic, it's an amazing library. It's like data classes plus validation and, you know, converting back and forth between JSON schemas and stuff. But there's just 2 versions of it. There's Pydantic 1. There's Pydantic 2.

 Sandy Ryza [00:19:07]:
They have, like, subtly different APIs. And if we were just, you know, building some project inside of Dagster Labs, we could make a choice that we're gonna use Pydantic 2. But we cannot make that choice for every one of our users. We can't say, like, everyone who wants to do data engineering has to use Pydantic 2. And that just means that we have this entire Pydantic I think there's literally a large file, a Pydantic compat layer, inside of Dagster. It just, you know, creates this layer on top of the different versions of Pydantic so that Dagster can work with both of them.
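
A hedged sketch of what that kind of compat layer can look like; this is not Dagster's actual file, just the general pattern of detecting the installed major version and hiding the call sites that differ between Pydantic 1 and 2.

    import pydantic

    IS_PYDANTIC_2 = int(pydantic.VERSION.split(".")[0]) >= 2

    def model_to_dict(model: pydantic.BaseModel) -> dict:
        # .dict() in Pydantic 1.x was renamed to .model_dump() in 2.x
        return model.model_dump() if IS_PYDANTIC_2 else model.dict()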

Ben Wilson [00:19:44]:
We had to build that last summer, for MLflow, actually, because our first use of Pydantic was prior to the 2.0 release, for the AI gateway, because we're like, hey. We want strong typing here, and we want all the benefits of type validation when a user, you know, passes a request into the server. And Pydantic is amazing for that. It's, like, it's pretty well adopted. It's it's solid. So we built it all out, and then they released 2.0. We're doing dev testing. So we pull, like, dev branches of important packages to do in CI, and we we found out about that, like, 2 weeks before the release.

Ben Wilson [00:20:20]:
We're like, oh, no. This is gonna break everything. So we, like, scrambled real quick. We're like, we need to create different versions and then eventually abstract it away into a separate file that's very similar to that. But then you have to own that for 2 years or so before everybody finally migrates, to, like, hey. Pydantic 1.x is end of life. It's out of support, so we're out of support for it, and then you can finally delete all that code.

 Sandy Ryza [00:20:48]:
Yes. Yeah. Now that you're bringing up CI, it's a whole other challenge. Right? It's like, in addition to having the Pydantic compat layer, you need an entire, like, test matrix that tries different versions with each other.

Ben Wilson [00:21:01]:
Yeah. We have a CI suite that doesn't even run on PRs anymore because it takes so many hours to run. So we just run it nightly on the main branch, and it uses, like, basically, every ML library that MLflow supports; it now has to do a version ranging. So we test the latest micro version of every minor version of the supported list, and it takes 10 hours or something to run. I think it's using, like, the max GitHub Actions enterprise account VM limit. So it's something like 256 VMs that are running for 10 hours just to do all those tests every day.
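
One way to express that kind of version-matrix testing in Python is with nox; this is only an illustrative sketch, not how MLflow's CI is actually written (Ben mentions GitHub Actions), and the versions below are placeholders.

    import nox

    @nox.session(python=["3.9", "3.10", "3.11"])
    @nox.parametrize("xgboost", ["1.7.6", "2.0.3", "2.1.1"])  # placeholder versions
    def xgboost_tests(session, xgboost):
        session.install("-e", ".", "pytest")
        session.install(f"xgboost=={xgboost}")
        session.run("pytest", "tests/xgboost")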

 Sandy Ryza [00:21:44]:
Yeah.

Ben Wilson [00:21:45]:
Because, otherwise, when you you because you're you have to do testing on behalf of your users because you have no idea what versions they're using. You make a change on your main branch, and it breaks this older version of this package. You're like, well, I I just shipped a regression for my users.

 Sandy Ryza [00:22:03]:
And so these are packages that depend on MLflow or vice versa?

Ben Wilson [00:22:07]:
ML they're optional packages that users would be using to serialize and deserialize models. So, like, XGBoost, we're we're testing, like, 14 versions of XGBoost, the 14 most recent ones, as well as their their main branch from their Github repo.

 Sandy Ryza [00:22:24]:
So we'll

Ben Wilson [00:22:24]:
compile and then and run. So they

 Sandy Ryza [00:22:27]:
serialize data using these libraries, then you have to deserialize it at some point to represent it inside MLflow. Yep. Yeah. It's gnarly.

Michael Berk [00:22:39]:
Yeah. I I have another sort of well, potentially saucy question, that's very related to this. LangChain. It has been one of the fastest adopted Gen AI libraries out there and is arguably the de facto for building agents, RAG, etcetera. But it also has a reputation of being highly unstable. So in your guys' opinion, if you were building LangChain from the beginning, do you think it's hard to make all these decisions upfront and make them be good and lasting? Or, I mean, not to point fingers, or did they, like, make mistakes and not know how to build a library? And, again, in theory, let's replace LangChain with some other project. Yeah.

 Sandy Ryza [00:23:25]:
I think it's such a challenging question. Like, looking at the Dagster experience, we have made you know, our initial APIs were around solids and pipelines, which have been entirely ripped out and removed. And then we've also made several kind of, like, fairly fundamental changes to our APIs since then. I think it's really hard to just say get it right the first time. Like, I think any kind of useful platform software is in some way a research project. Like, to build something that is new and useful, you have to be able to try a few different things. And if you could get it right the first time, it's probably because either, like, maybe you've already built it internally inside some company, and that's great. And I think, you know, some parts of Dagster that have been stable are kind of based on things that we've built internally in our, like, previous careers.

 Sandy Ryza [00:24:24]:
But if you're actually building something new and useful, there's gonna be some uncertainty that you have to manage. That said, I think different people in different organizations take very different approaches to how they kind of manage that and how they deal with it for their users. You know, I kinda always think of, like, the Mac approach versus the Windows approach, where, like, Windows, a lot of people, you know, have complained about Windows over the years, but there's, like, Windows binaries from, like, decades ago that will still run on modern operating systems. They just made that commitment to backwards compatibility, where Mac is just breaking their stuff all the time.

Michael Berk [00:25:07]:
Yeah. Think about all the different charging ports on iPhones. It's been, like, 17 in the past year.

 Sandy Ryza [00:25:12]:
Exactly. It's literally, like, physical backwards compatibility. So, you know, in Dagster, what we tried to do is you know, to some degree, you have SemVer for managing this. So when we were pre-1.0, it's like, anything can happen. It's scary land. You're choosing to adopt Dagster. You're sort of choosing to be on this journey with us. But then we made this kind of weighty decision to move Dagster to 1.0.

 Sandy Ryza [00:25:43]:
And, you know, that's just this commitment to, like, okay. If we change stuff around, we still gotta support the old stuff. And, you know, maybe it sucks for us, adds a bunch of extra difficulty to our lives, but there's this huge advantage that comes out of it: people can actually build on top of the software and know that they're gonna be able to, you know, keep running their code in future versions. There's always the question of, like, when's 2.0? Should there be a 2.0? I think, ideally, there should not be a 2.0, and, you know, the hope is that our current APIs can stand the test of time.
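
A generic sketch of what that post-1.0 commitment tends to look like in code (function names invented, not actual Dagster APIs): the old spelling keeps working, warns, and forwards to the new one.

    import warnings

    def load_table(name, partition=None):
        """The current, supported API."""
        ...

    def fetch_table(name, partition=None):
        """Old pre-1.0 spelling, kept so existing user code keeps running."""
        warnings.warn(
            "fetch_table is deprecated; use load_table instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return load_table(name, partition=partition)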

Michael Berk [00:26:26]:
Interesting.

Ben Wilson [00:26:27]:
Yeah. It's a really nice take. I couldn't agree more with everything you said, particularly the one point you alluded to: when you're first at something, you have no idea where you're going, particularly if you're not just the first package that is trying to tackle this thing, but the underlying ecosystem that you're building against is actually evolving as you're building this package. That's when you're gonna break stuff all the time. But there's ways to manage that that, like, a very seasoned software engineering team that has a lot of experience in, you know, managing software either in private repos or in public in an open source repo. They're gonna just handle it differently. Like, oh, we can't actually make this change right now without alerting users months in advance, or we need to think about preserving backwards compatibility so we don't break workflows.

Ben Wilson [00:27:32]:
That stifles velocity, particularly for, like, a small team. Like, when you're trying to build backwards compat in an open source library, you build your new implementation. You maintain the same interface, but your back end now bifurcates or is now potentially going to 4 different paths that you have to maintain throughout the entire time. So the code becomes way more complex, and maintenance becomes an incredible headache when you're doing that. So I I get why LangChain does what it does. Like, hey. Although they are changing in recent times, you can see, like, hey. They're marking APIs as beta.

Ben Wilson [00:28:11]:
Like, hey. This is a breaking change. Adopt it if you want. We're gonna version this now. So we're on version 2. We could go to version 7 before we make this GA effectively, and it becomes the de facto API. And they're marking those in preparation for 1.0. But if you go back a year ago with their library, they were doing exactly what Sandy just said.

Ben Wilson [00:28:33]:
Just like, hey. This is a research project. People are using it. It's awesome. Everybody's excited, but we're just trying to figure this out. And, yeah, you break stuff a lot when you're in that mode. Yeah.

 Sandy Ryza [00:28:45]:
And, you know, it can be challenging to balance the "I'm making this piece of software, it's a research project" goal with the goals of trying to run a company, where, like, ultimately, you're trying to get a bank to use your software. You don't necessarily wanna be pitching it as a research project. So I think that's where a lot of software kinda ends up getting a little twisted up, because you have these different pressures that are coming to present the software in different ways. In a certain way, we've been kind of lucky to have had a, like, kind of smooth and consistent rather than, like, discontinuous growth trajectory with Dagster, where it wasn't like, okay. We released it, and now everyone is using it. Like, I assume from the perspective of the LangChain folks that there were a couple months of just insane ramp up Yeah. Which, you know, is exciting and awesome, and, you know, great for the project in a bunch of ways.

 Sandy Ryza [00:29:48]:
But, I think it encourages these kinda, like, growing pains that you don't necessarily hit if you have a little bit more chance to, like, iterate with a core group of users at the beginning, and make sure that what you're doing is really good. I think, you know, that strategy you talked about, like, calling some APIs beta has worked really well for us. So we have our sort of, like, bed of stable APIs, and whenever we add a new feature like, recently, we added asset checks, which is the ability to define data quality checks for the assets that you define in Dagster we mark those as experimental. So we give ourselves the license to iterate on those for some period of time before kind of adding them to the larger corpus of stable APIs.
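
A minimal sketch of that "mark it experimental" pattern; the decorator below is made up for illustration, not Dagster's or MLflow's actual helper.

    import functools
    import warnings

    def experimental(func):
        """Flag an API the team still wants license to change without a deprecation cycle."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is experimental and may change or be removed in a future release.",
                FutureWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper

    @experimental
    def define_asset_check(name):
        ...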

Ben Wilson [00:30:34]:
Yeah. We did the same thing in MLflow. We're not we don't have rules around when we remove it out of experimental. It's more like, okay. When was the last time we touched this? And how many people are using this? Have we gotten positive feedback about it? But if it's like, hey. We released a feature. We ask a bunch of people, like, what do you think about this? And they're like, what is that? We've never used that. And sometimes that's that's more like, okay.

Ben Wilson [00:31:02]:
We have a docs problem. Nobody can even find this you this this use case. Let's create a tutorial, if they get excited. And sometimes, it's just kinda crickets. They're like, yeah. We know about that API. Nah. It's not for us.

Ben Wilson [00:31:16]:
And if everybody's saying that, we're like, alright. This is staying in experimental marked with deprecation in the future, and we'll just delete it.

 Sandy Ryza [00:31:24]:
Yeah. I mean, one of the things that I actually enjoy the most is ripping out features and code. You know, if it's like, this isn't being used, we can get rid of this. You know? Often, by removing one thing, you create the ability to add a bunch of other stuff. You know? Oh, wait. You know? The existence of this feature forced us to make these assumptions elsewhere in the product, and then we couldn't do X and Y. Now that we are deprecating this and ultimately taking it out, that means we can do X and Y.

Michael Berk [00:31:54]:
Yeah. I haven't done it for large features, but for small ones. It it's the same sensation as cleaning your apartment or, like, just cleaning in general because you're like, ah, it's so fresh, so clean. Now I can put something on there without it feeling gross. So I can definitely relate to that.

 Sandy Ryza [00:32:10]:
It's very Marie Kondo.

Ben Wilson [00:32:13]:
My favorite experience, though, is going through and looking at something and then basically saying to myself, like, who the hell built this piece of junk? Like, what is this? And anytime that I actually say that to myself, I know who wrote it. It's me. I go back to the commit history, like, yeah. This is my feature. Time to delete this piece of junk. Oh, okay. And that's that's the best feeling. Just

 Sandy Ryza [00:32:40]:
Doing, like, the git blame process where you, like, are tracing who created a bit of code back through many commits, and you finally get to the bottom, and then it's you. It's like

Ben Wilson [00:32:53]:
Or why is this so flaky or buggy? Like, you know, who who thought this through? And then you're like, oh, I did 3 years ago. Yep. Delete or refactor the entire thing, and it's that's a great feeling.

 Sandy Ryza [00:33:07]:
Yeah. I think I think, every time you go through one of those processes, it brings you a step closer to enlightenment. Yes.

Michael Berk [00:33:16]:
Yeah. So we've been talking about how the speed of development can impact quality sort of as an underlying theme. Sandy, are you familiar with the broken window theory?

 Sandy Ryza [00:33:28]:
Like, like, the Giuliani kind of policing?

Michael Berk [00:33:32]:
Yes.

 Sandy Ryza [00:33:33]:
Certainly.

Michael Berk [00:33:34]:
Yeah. So the concept is basically if there's a broken window in a neighborhood, there's going to slowly be more crime. I actually like equating it more to litter. Like, if you see one piece of trash on the ground, people are more likely to litter. So this principle holds over to software engineering. I was wondering how you balance pushing out perfect immaculate type hints, everything looks great code versus just getting it over the line, making it work, and then we'll fix it later.

 Sandy Ryza [00:34:05]:
Yeah. You know, I think the kind of cop out answer is you wanna make your tooling and, like, internal dev infrastructure, make it so that it's easy to do the right thing. So, I'm thinking about our journey with, like, typing, for example. So, like, Dagster has been around for a while. When Dagster was originally released, typing had not permeated the Python ecosystem, so we did not have types. And it's been this kind of, like, long journey to bring types in. You know, originally, it was like, oh, you could, like, run mypy on your local dev box. And over that time, we've kind of released more and more infrastructure that makes typing stuff easier and easier. I think now we're at the point where it's like, I feel like I have a worse, you know, development experience if I'm not typing stuff because, like, I'm less likely to catch errors.

 Sandy Ryza [00:35:04]:
Like, even in my, like, very, you know, granular just trying to ship one quick change, like, and get it out today, I feel like I'll be faster if I'm adding types. And, you know, what gets us there, like, you know, having all sorts of, you know, pre commit hooks that do typing stuff, having integrations with IDEs, definitely was not there at the beginning. Okay. So that's kind of like one half of the cop out answer. The other half of the cop out answer is just enforcing stuff. So sometimes, it's a little bit more annoying to, you know, do things the hard way. But if you have a step in CI that forces you to do things the hard way, then it becomes natural, like, your change is not gonna get in if you don't lint it and, you know, run code coverage on it and all that kind of stuff. And even though maybe it's, like, requiring a little extra work of you, because you get in the habit of doing those steps, it, like, feels easy.

Michael Berk [00:36:08]:
What's the line between being an annoying nitpicker and actually having real qualms with the code and bringing it up in a valid way?

 Sandy Ryza [00:36:17]:
Yeah. That's a great question. You know, trying to think through my own experience, like, as a tech lead, a lot of my job is reviewing other people's code. And there's always this question of, like, how much do you nitpick? I think the things that I personally tend to nitpick the most on are names, which probably anyone who works with me would attest to. I think, like, ultimately, the clarity of a code base or, like, any piece of code comes down to, like, can I look at this name and understand what it is? There's a lot that you can kind of get away with if you give things good names, because they're kinda like these anchor points. So I don't know if there's any, like, sort of rigorously correct answer or, like, full answer to your question, but you kinda choose the things that your experience has taught you are really important about writing good code and latch onto those, and then try to be permissive about some of the other ones. Sometimes I'll comment. I'll be like, if I were to write this code, I would write it like this.

 Sandy Ryza [00:37:27]:
You make the decision of whether you care about that or not.

Ben Wilson [00:37:34]:
Yeah. I find that sometimes like, for somebody who's new to committing to an open source project, that response works really well when I've done it, where it's like, okay. I know where this person is in their career journey, and they could probably use just seeing an example of, like, hey. Maybe this is it's the same exact logic just written in a way that validates my one big nit that I have with any code that I look at. And I have a 10 second rule on functions and methods. And that is, without context, just looking at this function implementation, like, I'll zoom in the screen too, so that's the only thing on screen, like in GitHub. And if it takes me more than 10 seconds to grok what the heck is going on, the function's too complicated.

Ben Wilson [00:38:29]:
There's too much stuff going on in there, or the names are bad, which is another one. Like, if you can't read it kind of like you would read English and be able to understand, like, okay. There's this this function call here that's that's somewhat complex. There's 7 arguments in here. It's gonna take a a couple of minutes to, like, figure out what the heck is going on and why these arguments are passed. But if the name of what it's calling explains exactly the action that's being taken, it's like, yeah. Got it. Okay.

Ben Wilson [00:38:58]:
Next line. Makes sense. Makes sense. Makes sense. But if it's like, okay. There's this nested for loop in here that's has all these, you know, conditionals within it at different levels, and then it's referring to, you know, the outer loop reference for each iterations. Like, this sucks. Like, create a new function for this and simplify this.

Ben Wilson [00:39:20]:
It should be, like, a one liner for this larger control flow. That's the stuff that I tend to to nitpick on is, like, readability.
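
A made-up before-and-after of the kind of nitpick Ben is describing: pull the nested-loop body into helpers whose names say what they do.

    # Before: the reader has to simulate the loops to figure out the intent
    def summarize(orders):
        totals = {}
        for order in orders:
            for item in order["items"]:
                if item["status"] == "shipped" and item["qty"] > 0:
                    totals[item["sku"]] = totals.get(item["sku"], 0) + item["qty"] * item["price"]
        return totals

    # After: the top-level control flow reads like English
    def shipped_revenue_by_sku(orders):
        totals = {}
        for order in orders:
            for item in shipped_items(order):
                totals[item["sku"]] = totals.get(item["sku"], 0) + line_revenue(item)
        return totals

    def shipped_items(order):
        return [item for item in order["items"] if item["status"] == "shipped" and item["qty"] > 0]

    def line_revenue(item):
        return item["qty"] * item["price"]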

 Sandy Ryza [00:39:30]:
Yeah. Yeah. I mean, like, I would totally agree, first of all. I think it's interesting for me to think about this, like, trade off of, like, okay. Would I take a poorly named function with no unit tests but, like, a really clear implementation, or a well named function with good unit tests but, like, a totally gnarly implementation? I think I would take the latter all the time. Like, I'm willing to put up with, like, a gnarly implementation if it's extremely clear what it's supposed to be doing, and that's actually enforced and verified. You can always come back and refactor that in the future to make it more clear. But if it's, like, not actually clear what the expected behavior is, that's gonna bleed out into other parts of your code base and not just confuse someone who's looking at that function, but confuse someone who's looking anywhere that function is invoked.

Ben Wilson [00:40:20]:
Yeah. I I've seen commits of people that are like, well, it's a private method. It's fine if I name it _a123. And it's like, what does _a123 do? And you look at it, and you're like, all the variables are just named letters of the alphabet, and there's anonymous returns everywhere.

 Sandy Ryza [00:40:44]:
Like, that's that's the kind of code that's like, yeah, it's like obfuscated code. Like, they ran it through an obfuscator or something. Yeah.

Ben Wilson [00:40:51]:
Just like, hey. We're we're not writing Fortran, like, algorithms here. Like, we don't need to to run this on a mainframe. Like, we can use long variable names. It's fine. Verb noun, please, for methods and functions.

Michael Berk [00:41:06]:
Yeah. My favorite one is I was supervising an intern, and he wrote the names of 2 CTEs that were about to be joined as Shaq and Kobe. And I was like, I appreciate the reference. I appreciate the irony of joining those 2 CTEs, but let's rename them. But, yeah, naming is so important because you can just quickly glance at something, skim through it. Also, with the table of contents type of functionality in GitHub or even some IDEs, it's really easy to see what's in your code, what's potentially missing. So, yeah, names are the best. And, also, I think there's one more point that there's sort of selection bias in what we're looking for when we're reviewing.

Michael Berk [00:41:50]:
When we review something, we assume that it does what it's supposed to do. If it doesn't, obviously, that's a massive issue. So the things that we're talking about are sort of superfluous and more, for longevity of the code base instead of ensuring the thing does what it does. So, Sandy, I had a question for you. Did you start as a software engineering pro? And if not, how was the transition from data science into software?

 Sandy Ryza [00:42:20]:
Yeah. I mean, so first of all, I definitely did not start as a pro at anything. However, my first job actually was a software engineering job. So I joined Cloudera as a software engineer. I was initially working on MapReduce and then Spark, kind of these fat JVM projects for processing lots of data, but then transitioned from there within Cloudera to doing data science. So, basically, I was like, okay. It's cool that you can process tons and tons of data, but, like, what are people actually using that for? And, also, machine learning is cool. I wanna do machine learning.

 Sandy Ryza [00:42:57]:
And this was, like, 2014 or something like that. Very sexy time for data science. So my transition was kind of the reverse direction of, like, taking my software skills and trying to use them in this more wild west setting. It was kind of interesting coming back. So I was a data scientist for a while. I actually spent a year as a product manager writing 0 code. And it was interesting after that kind of like, can I still write code? Like, you know, is it like riding a bicycle, or is it I don't know. It's not like riding a bicycle, but did I kinda, like, lose my ability to think in this way?

 Sandy Ryza [00:43:44]:
And so it was interesting and very fun kind of coming coming back to these pure software roles much much later, and finding that it was still something I, enjoyed.

Michael Berk [00:43:56]:
Yeah. What got stale out of curiosity?

 Sandy Ryza [00:43:59]:
Oh, what got stale coming back to it? Yeah. I was honestly surprised at how quickly it came back. I don't know if I can point to anything in particular that was a really tough adjustment. I guess, you know, the horrible dependency management is one of those things that you don't really deal with when you're writing a machine learning model that you do when you're writing platform software. You know, I guess the other thing is just, like, designing software for really diverse sets of use cases. You know, this is maybe less about the data science to software engineering transition and more about the, like, writing application software versus platform software transition. When you're writing application software, you're thinking about, like, what, you know, works for me in the context of this particular domain. There's all these assumptions you can make about who's gonna use your software because it's just the people inside your company.

 Sandy Ryza [00:45:13]:
Whereas if you're writing, like, open source platform software, you kinda have to believe that anyone anywhere in the world with any sort of crazy environment could be using your software, and you have to kind of make fewer assumptions and impose fewer constraints.

Michael Berk [00:45:30]:
Got it. Flip side of that, did the hiatus help in any way? Did it give you a fresh perspective or allow you to write better code when you returned?

 Sandy Ryza [00:45:39]:
I definitely think that definitely yes. A couple of things there. I think through data science, you have to really learn how to interact with the domain. So I would be, you know, trying to answer questions and build machine learning models. Like, I worked at a health insurance company and a trucking company, and it's all about, like, I need to understand how truckers think, or I need to understand all the complex coding, of how, you know, medical diagnoses are are coded. So it gives you this, like, sort of as a as a software engineer, it can sometimes feel like, oh, like, this is the user stuff that I kind of have to understand before I get back to writing code. But, with data science, that that so much of it is actually building this understanding that it kind of gives you this almost, like, infinite, tolerance for for engaging with, the world outside of, like, the internals of what you're building. And I think that is really useful when building and designing software.

 Sandy Ryza [00:46:46]:
From being a product manager, I think it really helped me kind of refine this, like, more outside in kind of thinking. Like, nowadays, the kind of software engineer that I am is, like, I'll write the function signature before starting on the implementation. I'll write the tests before writing the function. And I think that came out of, as a product manager, you're always thinking not about, like, the thing that you're implementing, but about, like, how it's gonna be used, what, essentially, like, the interface is. And I think you can actually take that way of thinking and, like, apply it more directly to the software engineering process.
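
A rough sketch of that workflow (names invented): pin down the signature and the expected behavior in a test first, then fill in the implementation.

    # 1) Write the signature and a test that pins down the expected behavior
    def test_parse_partition_key():
        assert parse_partition_key("2024-05-01|emea") == ("2024-05-01", "emea")

    # 2) Only then fill in the implementation until the test passes
    def parse_partition_key(key: str) -> tuple[str, str]:
        """Split 'YYYY-MM-DD|region' into (date, region)."""
        date, region = key.split("|", 1)
        return date, region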

Ben Wilson [00:47:31]:
Yeah. I couldn't agree more. Everybody that I've worked with because I used to do data science. I'm the inverse of you, actually: I started as a data scientist and learned how to code as I went. And people that I work with now who have that prior experience of solving problems in that application programming state, and people that have done product management work in the past, it's amazing how differently they approach, not just implementation because, like, you see the first few commits from what they're doing when they file a PR, but we do pretty extensive design docs at Databricks. And you can see the difference in how they present the problem and what they focus on. Like, we have standard templates for designs, but a person who has always just been an IC software engineer is focusing more on design decisions that are related to functionality internal to the code, like performance or, like, what is the API compatibility layer between these, you know, these systems, and how do we wanna focus on what that, you know, handshake signature is between these systems. Whereas the product people, the section of the doc that talks about what does the API need to look like, what is an example use case, and they'll do up a prototype with some super gnarly code. And I I do this frequently.

Ben Wilson [00:49:02]:
Sometimes I don't share that code because I'm like, I'm just making this work just so I have an API interface that I'm proposing, and then showing an actual example of, like, using that API and seeing the the output of it, just so that I can get other people to look at that and be like, yeah. This makes sense, or, no. This sucks. Like, here's a different way that we should do this. And you're

Michael Berk [00:49:23]:
like, alright.

Ben Wilson [00:49:25]:
But, like, the the docs actually are different in in their composition based on on that history.

 Sandy Ryza [00:49:31]:
Totally.

Ben Wilson [00:49:32]:
Totally. Over time, the people who do the product focused ones, those APIs get out of experimental faster.

 Sandy Ryza [00:49:41]:
Understand.

Ben Wilson [00:49:42]:
Because they don't need modification because so much thought and energy was put into what's the user journey here.

 Sandy Ryza [00:49:50]:
Yeah. A tool that I have found very useful for designing APIs within Dagster is actually to start writing the docs before implementing the API. You know, the challenge with, like, adding some new concept is, like, can you explain this to someone who does not have it as part of their mental machinery? And, you know, by the time that I'm trying to add something, I've normally thought about it so much that it's, like, very deeply in there, which I think can be good, but can also lead to very unintuitive APIs because you're, you know, relying on these, like, cognitive assumptions that aren't gonna be there for someone who's approaching them from the beginning. So forcing myself before I actually, like, go in and do the implementation to actually, like, write out how I would explain how this works to a user often leads me to, like, kind of radically shift my approach.

Ben Wilson [00:50:42]:
If it's okay, I'm gonna steal that technique.

 Sandy Ryza [00:50:44]:
Yeah.

Ben Wilson [00:50:45]:
That's a fantastic idea. I've never thought of that. Write the docs first. Like, we we kinda do a a hybrid thing like that where it's, like, explained to an executive, like, in the TLDR, but that's not the same as how do I teach a user my idea. And if I can't

 Sandy Ryza [00:51:02]:
do that in a succinct way, then that's kind of a a

Ben Wilson [00:51:05]:
self, you know, self-righting mechanism to be like, okay. Maybe this is more complicated than I thought.

 Sandy Ryza [00:51:12]:
Yeah. And I'm I'm always surprised by how often I kind of, sort of, prove my previous idea wrong through this process. Interesting. I

Ben Wilson [00:51:25]:
think that's a massive time saving exercise. I I'm actually going to use that. That's a great idea.

Michael Berk [00:51:33]:
Yeah. Do you have any other genius tips for us, Sandy, on how to build code?

 Sandy Ryza [00:51:37]:
Oh, man. Yeah. First of all, not really. Second of all, the one piece that it kinda reminds me of is the Amazon style, like, press release driven development. You kinda, like, write the marketing before you write what you're working on. I don't think that's quite as useful for API design, but I still also find that sometimes useful for kind of making myself believe that I'm working on something useful, or changing what I'm working on to be something useful. Got

Michael Berk [00:52:13]:
it. That makes sense. Okay. Cool. I have one final question for both Ben and Sandy, both of you. Why'd you write books?

 Sandy Ryza [00:52:25]:
Who goes first? You. Yeah. So the background, for those who aren't familiar: I wrote, or co-wrote, Advanced Analytics with Spark, which is a book on how to use Spark to do data science and machine learning and all that fun stuff. I actually recently got an email from O'Reilly saying that they just translated it into Polish, and they're gonna send me a copy of that, which is kinda cool. Let's see. I think, you know, maybe a couple things. I enjoy writing. I enjoy trying to express ideas clearly.

 Sandy Ryza [00:53:12]:
It was something that I had not done before. So it just felt like a kinda new exciting thing. I think it was, like, a useful career thing at the time just to kinda get my name out there. Not a particularly great financial decision, in terms of, like, the amount of work that you put in versus, the the compensation that you get out of it, but, I I still think it was totally totally worth it. Not really fun.

Ben Wilson [00:53:44]:
Yeah. For me, it was it was sort of a response to the patterns that I that I noticed. And my book is very different from yours with respect to the target audience and what I wrote it for. It was because I was doing what Michael, currently does at Databricks in the field, which was work with customers who are trying to get ML in production. And I just noticed patterns of alright. I'm getting the same exact maybe not same articulated questions over and over, but it's more of I'm noticing the same patterns at different places that have nothing to do or very little to do with implementation details. A lot of it, which is why the whole first half of my book is, like, answer the why you're doing this first, and then go talk to people. Don't don't build this this thing because you get this one line item, you know, quarterly goal from executive management saying data science team is gonna be solving this problem for us.

Ben Wilson [00:54:49]:
And then they just build an implementation that nobody ever uses. It never makes it to production. Because even if the code is perfect and the implementation works great, nobody cares. Like, they or it didn't solve the problem that the business actually wants. And I saw that pattern and actually lived that myself at several companies of what happens if I don't talk to the marketing department or the sales department before writing my first line of code. You're gonna build something that technically works, but nobody's gonna use. And that's the worst feeling. You're like, I just wasted 6 months, and everybody's pissed at me.

Ben Wilson [00:55:28]:
Now I gotta redo this. So I I have seen

 Sandy Ryza [00:55:31]:
in the world of data science, truly a tale as old as time.

Ben Wilson [00:55:37]:
Yeah. And that was, like, the the big response that I got from the book was that mission that I was on. There's a lot of technical stuff and and examples that some people have reached out, and they're like, oh, it's so cool that you did this thing on on, like, all these different ways of doing time series. And then I hear from 10 other people, they're like,

 Sandy Ryza [00:55:55]:
yeah.

Ben Wilson [00:55:56]:
I know why you did the technical explanations to cover certain topics, but the overarching theme, particularly the first half of the book, like, that's what this book's about. Like, yeah. You got it. Like, it's more communication and talking to humans, not about, well, I need to decide between these 37 algorithms, which one's most efficient and which has the highest, you know, accuracy. It's like, nobody cares. Like, build something that works, that you can maintain. And, yeah, I couldn't agree more with the point about compensation, for the hundreds and hundreds of hours that you put into developing a book, particularly really big technical books. Yeah.

Ben Wilson [00:56:40]:
Right before final review is when Manning will do the cuts, where they're like, hey. We're targeting a book of this length, and your book is either too short or too long, so either add content or remove stuff. When they did mine, they were like, yours is 60% too long. We just did the typesetting estimation, and it's 1,287 pages. Wow. So remove more than half of this.

 Sandy Ryza [00:57:10]:
Damn. I

Ben Wilson [00:57:11]:
was like, but remove what? And they're like, you figure it out. We don't do we don't do feedback to authors. You're the author. You figure it out.

Michael Berk [00:57:20]:
Wow. So

 Sandy Ryza [00:57:21]:
just going through and being like, right on the leg.

Ben Wilson [00:57:24]:
These 12 chapters, gone.

 Sandy Ryza [00:57:27]:
Yeah. It's definitely a don't quit your day job kind of endeavor.

Ben Wilson [00:57:31]:
Yeah. Yeah.

Michael Berk [00:57:33]:
Got it. And was it a good career move? Sandy, you said at the time it was, but I'm curious for both of you. Are you happy you wrote a book? Would you have done it again if you were in that same place?

 Sandy Ryza [00:57:44]:
I think it was a great career move for me, and I don't regret it.

Ben Wilson [00:57:49]:
Yeah. Same here. It was rough. I mean, it's working 2 full time jobs for, like, a year and a half, but, yeah, I would do it again. Would I do a second right now? No.

Michael Berk [00:58:04]:
Sandy? No.

 Sandy Ryza [00:58:07]:
Not right now.

Ben Wilson [00:58:09]:
Heard.

 Sandy Ryza [00:58:10]:
Maybe a novel.

Ben Wilson [00:58:12]:
There's an idea.

Michael Berk [00:58:14]:
Yeah. There is an idea. Okay. Awesome. Well, this was a very fun conversation. We're about out of time, so I'll wrap. Some things that stood out to me were: first, it's hard to make the correct decisions when you're first building a project. When you're in the pre-1.0 phase, it can be the wild west, and 1.0 refers to your production release where you're gonna support backwards compatibility.

Michael Berk [00:58:36]:
Post 1.0, now you have to maintain and ensure that stuff doesn't break when you make changes. When you're looking to build stable code bases, when you're reviewing a PR, obviously, you want the thing to work. But on the style side, just use pre commit hooks, linters, IDEs. Make the tool annoy the person instead of the reviewer. And then if you're gonna do one thing in the review, make sure that naming is good. A cool strategy for ensuring you're building the right features well: start writing the docs and/or the marketing pieces before you write the code. Then finally, if you wanna get rich, don't write books. So, Sandy, if people wanna learn more about you, your work, or Dagster, where should they go?

 Sandy Ryza [00:59:20]:
Yeah. I'm on Twitter as s_ryz. That's Riz. Dagster, check out dagster.io. I'm also on GitHub as, github.com/sryza.

Michael Berk [00:59:34]:
Amazing. And check out spark-timeseries on GitHub. Scintillating. Cool. Well, until next time. It's been Michael Berk and my cohost, Ben Wilson, and have a good day, everyone.

 Sandy Ryza [00:59:46]:
Well, thanks for having me on.