The Impact of Process on Successful Tech Companies - ML 145

Michael and Ben dive into the critical role of design in software development processes. They emphasize the value of clear and understandable code, the importance of thorough design for complex projects, and the need for comprehensive documentation and peer reviews. The conversation also delves into the challenges of handling complex code, the significance of prototype research, and the distinction between design decisions and implementation details. Through real-world examples, they illustrate the impact of rushed processes on project outcomes and the responsibility of tech leads in analyzing and deleting unused code. Join them as they explore how process and organizational culture contribute to successful outcomes in tech companies and why companies invest in skilled individuals who can work efficiently within established processes.

Transcript

Michael Berk [00:00:10]:
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I do data science and data engineering at Databricks. And I'm joined by my co

Ben Wilson [00:00:19]:
host. Ben Wilson. I review PRs at Databricks.

Michael Berk [00:00:26]:
Today, we are going to be talking about how to design a product. And this is something that Ben thinks about many times a day and sometimes even at night. And I have also been thinking about it recently because for a lot of customer implementations, they say, here's a problem. Fix it with a solution. And sometimes that's sort of challenging. You don't wanna just start typing without thinking about how things should be designed. So, Ben, you said before we started recording, what separates a $1,000,000,000 company from something else? Do you mind restating that?

Ben Wilson [00:01:00]:
Not in the same words, I don't think. Yeah. You know, the acronyms that everybody uses. Right? FAANG, which isn't even technically applicable anymore. But those types of high-tech companies that dominate in their space, they hire the best engineers. They have the best product teams. They have this amazing vision and, you know, arguably a monopoly in their space that the SEC hasn't investigated yet.

Ben Wilson [00:01:34]:
But they're so successful and capable of producing things that it seems like nobody else can do, which isn't true. But the thing that sets those companies apart from people that are, you know, hiring software engineers and hiring data scientists and ML people that are putting stuff into production, in my opinion, and this is just my opinion, is process. I think the fundamental thing that sets those groups of people apart from one another is process. And I'm pausing there for a moment just to let that sink in. Wait. So not what?

Michael Berk [00:02:16]:
So you're telling me that it's not the person, it's not the intelligence, it's not their fancy Harvard degree. It's just the culture and the process and the organization? Yeah.

Ben Wilson [00:02:28]:
So from my experience and talking to a number of different people who have worked at those companies, those big high-tech FAANG companies. Like, most of our engineering department comes from that background, or they come from, you know, a very fancy Ivy League background. They're very smart people. They're very capable. But everybody, regardless of which of those big tech companies they come from, they all fit seamlessly into how we do business because we behave in the same way: there's a process, and there's a set of things that are done that has been learned over many decades throughout the industry of effective and efficient ways of building things in complex systems. And then I also come from a background before where I worked at companies that didn't have that. We still had brilliant people working there. A lot of them, I noticed, were really frustrated with how things happened or how things were done at a small startup or in an industry that is not software first.

Ben Wilson [00:03:45]:
Right? So their business is about something else, but they just happen to need a software team. And really good people are just like, man, why do we keep on shipping broken stuff? Or why did we build this thing that nobody uses? Or why was this allowed to actually merge to production? It's garbage, and it's broken, and I'm getting paged all hours of the night because of this garbage code, or this doesn't scale. You know, there's all these issues that come up. And even at the big tech companies, you have that. Right? We're human. We're fallible. Like, nobody's infallible. We create broken stuff all the time.

Ben Wilson [00:04:28]:
The difference is it doesn't happen as frequently and as insidiously at the big tech companies because you can't have that. And it's not because you hire, you know, savants and fill all of your teams with, you know, the most brilliant geniuses on planet Earth, software gods and goddesses. That's not how it works. It's about process. You get smart people, and there's smart people everywhere. They're working at those tiny, you know, tiny startups. They're working at companies that are not software first, and they're also working at the tech giants. But if you were to take a room full of geniuses and give them carte blanche to do whatever they thought was right individually and provided no guardrails whatsoever.

Ben Wilson [00:05:23]:
See what that team produces in a year versus a team of smart but not genius-level people and give them process: this is how we think through design. This is how we build software. This is how we build product. See which one is gonna produce more successful outcomes for their user base. It's gonna be the 2nd team. Hands down. Promise you.

Michael Berk [00:05:53]:
Got it. So that's a pretty freaking bold statement: that process is the key, and the intellectual property of these organizations arguably should be the process. Yes, the code that was generated from the process is valuable and actually is what makes the money. But the process is the leg up, or the advantage, that these companies have over other companies. Is that about right?

Ben Wilson [00:06:16]:
You're not gonna have that IP if you don't have some sort of rule set about how you build that. Or it's gonna fall apart, and you're gonna have a broken product or a broken platform. You know, what would it be like if Netflix, you know, created a streaming back end service that went down every 2 hours? It's just completely, you know, zero response, buffering video for an hour because the thing just went down in the back end. Everybody would cancel their subscription. Everybody would bail. They'd be like, this sucks. This thing doesn't even work. What am I paying for? You can't have that.

Ben Wilson [00:07:02]:
So it's not that Netflix has, you know, the world's smartest human beings all working and everybody building whatever they think is right. That's not how that works. You have a bunch of super smart people working together around the process of how to build and how to introduce features. And not just the engineers, you have the product team that's working with those engineers. You have the entire support staff that's making sure that there's this vision of how things are built and how things are tested and how they're deployed.

Michael Berk [00:07:40]:
So okay. I'm sort of on board. And before we get into the the actual product design piece and then some tips and tricks, Why then are all these FAANG companies paying so much money for software engineers, ML engineers, that type of thing, if it's not about the people?

Ben Wilson [00:07:57]:
I mean, it is about the people. So if you were to take a mediocre, you know, person, somebody who, if given a new feature to develop, might take 3 weeks to build it, versus somebody that you give the same guidelines and same process, and they get that code shipped in 2 days. Who are you gonna hire? Right? It's that capability that some people have where they can do things really well, efficiently. People that can work with others and get stuff done. So they pay more because they want to keep those good people. And those people who are receiving those paychecks, they're not idiots. You know? They're not gonna be like, oh, it's fine. I'll take half that pay. No big deal.

Michael Berk [00:09:00]:
You know? But the basis of that question was you said that it's about the process and not the people, and that might be oversimplifying. But can you (Sure.) clarify why you should be paying so much if you can just train someone up in 2 years to have the exact same skill set?

Ben Wilson [00:09:18]:
You can train anybody up, regardless of time. If you take time to market out of the equation, anybody who is sufficiently skilled can produce that final result. Right? They know the fundamentals of computer science. They know how to solve a problem with code. You can train anybody up with guardrails and a process on how to actually get that built in a sustainable way. The reason that some people get paid way more than others is really that time factor. How long is it gonna take for that person to get through the process of figuring out how to get this done in an efficient way and solve all of those problems and have context and have the ability to do independent research to learn as they're going, to get better as they're going. So, yeah, people are different, of course, and have different capacities for performance. And that's really what the big tech companies are looking for: people that stand out from the crowd.

Ben Wilson [00:10:35]:
Hey, they're an exceptional person at what they do.

Michael Berk [00:10:40]:
Got it. So it sounds like you're saying that you can train up someone to be sort of in the ninetieth percentile, just by correct process. But those that top 10% of people, there is some natural talent, and there's just time factors of having done the right process for a long period of time that makes them more valuable than a recently trained person. Is that correct?

Ben Wilson [00:11:03]:
If we're talking happy path, yeah. However, these aren't mutually exclusive. It's not like, oh, all geniuses are going to figure out the best way to produce software. So it's not just saying that, hey, we'll take that mediocre person and compare them against a genius, and the genius without any guidance is gonna be able to, you know, do this amazing thing in a fraction of the time of the mediocre person. That's not how it works. That genius without that process is gonna make garbage.

Ben Wilson [00:11:37]:
I promise you. They might get lucky, you know, a couple of times and make something that's super awesome, and then you merge it into production. You deploy it. And then you find out that they made decisions that are incompatible with the rest of the software suite or services that you have because they didn't have context. They didn't know that, hey, I needed to get approval from these other 10 teams that have to interface with this thing. I just broke the ability for us to release this product because I put in all this effort and work to build this new feature that can't talk to anything else. So, like, geniuses can actually cause more harm if you don't have process because they build stuff faster and get stuff out there quicker.

Ben Wilson [00:12:28]:
And Right. Yeah. You can create even more havoc with those people.

Michael Berk [00:12:35]:
Okay. Cool. I think I understand. So now let's get into some tips and tricks, and I'll start off the tips and tricks with a real world use case. Prior to the new year, I was working with a large retail organization, and they have been using Databricks for a couple years. And they said, hey, Michael. Do you mind designing a data ingestion framework for us? I was like, sure. That is squarely my job, so I will do that.

Michael Berk [00:13:01]:
And I built some prototypes and said, hey, what do you guys think? I didn't even go through the process of building a design doc. I just built prototypes and was like, hey, do you think you could maintain this? Would this meet your needs? Give me feedback. And they said, yeah, looks good to me. And 3 weeks later, they started onboarding a bunch of teams to use this in a production setting. I hadn't written tests. I frankly put that together in, like, a week max.

Michael Berk [00:13:31]:
There was no design doc. And so I immediately said, let's pump the brakes a bit, and let's figure out if this is actually meeting your use cases, and let's do sort of a beta test. And so that's where we're at now, and it's at a good place. But this process seemed a little bit rushed, and it was not what I intended to have happen. So, Ben, if you were put in this position and you were the customer that I was working with, how would you go about approaching this ingestion framework that is theoretically gonna handle thousands of data pipelines with terabytes of data being ingested over, let's say, a week period? How would you go about ensuring this would meet the quality and standards of your organization for the next 5 to 10 years?

Ben Wilson [00:14:16]:
Interesting question. So you use the word framework, which is interesting to me. In the realm of software engineering, you know, a framework would be something like Apache Spark. That's a framework for doing ETL. So you're able to deserialize data, put it into memory. You have data transformation tools. You can, you know, run libraries on top of the system, and it gives you a mechanism for, in the sense of ETL, extracting the data, you know, doing your transformations on it, and then putting it somewhere. So if I'm a customer asking for a framework, I would expect the consultant I'm talking to to explain to me that we're not looking for a framework.

Ben Wilson [00:15:17]:
We're looking for an abstraction over an existing framework that is customized to how the business does what it does. That being said, if I wanted a unified framework that is customized to how, you know, we have all of our data in this one storage format, and I just need it to be transformable into this particular architecture of fanning out, you know, medallion data. So I need to put my raw data in bronze, and then I have this augmented data in silver. It's cleaned. And then I have gold tables that are used for analytics, like precomputed aggregations and stuff. That's just, like, an ETL pipeline. And if that's what I wanted a consultant to do, I would list out here's all the jobs that we need done. And all of the datasets at the end, here's their actual, you know, entity diagrams.

Ben Wilson [00:16:28]:
Here's the data model of what I want you to create. Go build that. So at that point, I would be the one designing it and then getting approval internally at my company saying, is this what we want? Is this what we want this consultant to help us with? And the design would be provided for the consultant to just go and, hey, just write the code for this and make sure it works and prove to me at the end of your contract that you built what I asked you to build. Else, you're gonna go fix it.
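The bronze, silver, gold layering Ben walks through can be sketched framework-agnostically in plain Python. The record fields (order_id, region, amount) and the cleaning rules below are hypothetical stand-ins; a real Databricks implementation would use Spark DataFrames and Delta tables rather than lists of dicts:

```python
from collections import defaultdict

def to_bronze(raw_records):
    # Bronze: land the raw data as-is, only tagging provenance.
    return [dict(r, _source="ingest") for r in raw_records]

def to_silver(bronze):
    # Silver: clean and deduplicate. Here: drop rows missing the key
    # field and keep only the first occurrence of each order_id.
    seen, silver = set(), []
    for r in bronze:
        if r.get("order_id") is None or r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        silver.append(r)
    return silver

def to_gold(silver):
    # Gold: a precomputed aggregation ready for analytics,
    # e.g. total amount per region.
    totals = defaultdict(float)
    for r in silver:
        totals[r["region"]] += r["amount"]
    return dict(totals)
```

Chaining the three stages reproduces the medallion flow: `to_gold(to_silver(to_bronze(raw_records)))`.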

Michael Berk [00:17:02]:
Yeah. Let me clarify what the use case was. They did not know what a single pipeline should look like. They wanted internal processes so that they could scalably build and deploy data pipelines.

Ben Wilson [00:17:19]:
So build and deploy. I mean, that's Spark. Like, Spark is the framework that enables that. Alright. So wanted a

Michael Berk [00:17:26]:
couple of the data engineers that are actually building the pipeline are not super technical, and they want, ideally, to have some templates that the data engineers could use and ensure there's data quality. So, make sure that there's data validation both in flight and post table creation. And, yeah, they just basically want internal processes. I think that's a better way to put it. And, ideally, there's, like, a utils package that you can just import and say, alright, I change these 6 parameters, it will do this task, and it will work every single time.
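The "validation both in flight and post table creation" could be sketched as two small utilities. The function names and checks below are hypothetical illustrations, not the customer's actual package:

```python
def validate_in_flight(rows, required_columns):
    """Hypothetical in-flight check: split rows into good and bad
    depending on whether every required column is present."""
    good, bad = [], []
    for r in rows:
        (good if all(c in r for c in required_columns) else bad).append(r)
    return good, bad

def validate_post_write(row_count_written, row_count_expected):
    """Hypothetical post-table-creation check: did every row land?"""
    if row_count_written != row_count_expected:
        raise ValueError(
            f"expected {row_count_expected} rows, wrote {row_count_written}")
    return True
```

A templated pipeline would call the first check before writing and the second against the freshly created table.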

Ben Wilson [00:18:02]:
Okay. So they want a custom DSL on top of a framework. Sure. They want config-based pipeline orchestration. Like, you can get that in pure open source by using something like Airflow. It's config based: here's my YAML config or here's my JSON that defines what the input and output is, and here's some transformation logic that I can write in SQL or something. And Spark supports that.

Ben Wilson [00:18:32]:
Like, Databricks supports that. It's one of the benefits of the platform. And I'm all for doing stuff like that. My question to that customer would be, okay, your data engineers, by virtue of that job title, they should be able to write these. Like, that's a core requirement in industry for a data engineer. Like, you should be able to write Python or Java or SQL code, whatever it is. If it's manipulating data, you should be able to write that and write a pipeline.

Ben Wilson [00:19:10]:
So if somebody comes back and says, like, whoa, my people don't really wanna do that. They just wanna write JSON or YAML configs. It's like, okay, we'll build, you know, something where you can specify a configuration, which will then use some sort of builder pattern to construct the pipeline that does these actions. At that point, if that's what somebody told me, I would say, now we really need to design this. And we need to know what this thing can and can't do, what it should and should not do, and what it will not do, so that all of that is documented very well for the people who are gonna be using this and also for the people who are gonna be maintaining this and building additional features in the future. Right? And something like that would scare me if I was assigned a task like that. But it wouldn't scare me if the person I'm talking to is like, hey.

Ben Wilson [00:20:13]:
I know exactly what we need to do. Here's our product requirements for this. Let's, you know, work on a design together so that we can meet these needs and then, you know, clearly articulate to the end users and management: hey, this is not gonna solve these problems, but it'll solve this big list that we want.
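A minimal sketch of the config-driven builder pattern Ben describes, assuming a made-up JSON schema with filter and select operations. A real system would target Spark or Airflow rather than in-memory lists of dicts:

```python
import json

def build_pipeline(config_text):
    """Hypothetical builder: turn a JSON pipeline config into a callable."""
    config = json.loads(config_text)
    steps = []
    for step in config["transforms"]:
        if step["op"] == "filter":
            # Keep only rows where the column equals the configured value.
            col, val = step["column"], step["equals"]
            steps.append(lambda rows, c=col, v=val: [r for r in rows if r[c] == v])
        elif step["op"] == "select":
            # Project each row down to the configured columns.
            cols = tuple(step["columns"])
            steps.append(lambda rows, cs=cols: [{c: r[c] for c in cs} for r in rows])
        else:
            raise ValueError(f"unsupported op: {step['op']}")

    def pipeline(rows):
        for fn in steps:
            rows = fn(rows)
        return rows

    return pipeline
```

Usage: `build_pipeline(cfg)` returns a callable that runs each configured step in order, so a non-developer only edits the config, never the pipeline code.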

Michael Berk [00:20:37]:
Got it. So you were alluding to something that I think is a valuable technique of defining what it must have and what it should, could, and won't have. Can you clarify sort of the history of this and the value of defining these in a simple bullet list?

Ben Wilson [00:20:57]:
So the simple bullet list is the TLDR of the work that you did while doing your research. So if you have engineer in your job title somewhere, whether it's software or traditional engineering dealing with mechanical or electrical or nuclear or whatever, you should be thinking about these topics while you're determining whether something is feasible or not. And that prototype research phase is you figuring out what is actually possible. And it's not a let's-boil-the-ocean exercise of figuring out all the things that this thing could do. That's a massive waste of time and completely irrelevant. You should always have that product requirement of here's the 1 or 2 sentences explaining what this project is supposed to do. I need a simple interface for non-developers, like DBAs, basically, who can move data from one place to another and apply simple transformations. So if that's our mission statement, that gives me guardrails immediately of, like, okay.

Ben Wilson [00:22:14]:
What manipulations could we do to the data within the confines of how this business deals with data? What source systems do you have? Like, where is this data? Are we pulling data in raw JSON format from web traffic, or is this, you know, tabular data that's stored in some data lake somewhere? I would need to know all that stuff to know, like, what do we have to think about in the

Michael Berk [00:22:41]:
in the grand design of what we could potentially be building? Right. Yeah. That's an interesting point. So starting off with a clear and concise problem statement, and then following that up with research. And also, I like that you mentioned that you just immediately jump in and try things. I've been doing that more and more recently, and I found that really valuable. Switching to another thing that we can actually discuss fully openly, we just finished converting a lot of the docstrings in MLflow to Google format. And I was working on that using OpenAI and just using an LLM to convert.

Michael Berk [00:23:18]:
And we were looking to see if we can enforce Google style. So instead of having the colon-param style, you have Args: and the Google-style docstring. And the first thing that I did was I cloned the repo and just started trying shit. I googled around. I asked ChatGPT to do it for me. I looked at Stack Overflow and just tried to figure out what could happen. And I ended up changing a bunch of rules in our linter, which is a Rust-based tool called Ruff. And basically added a few rules, removed a few rules, ended up changing something like 30,000 lines of the MLflow repository. And then PR'ed it just to see how that would look in the CI build.
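For illustration, the two docstring styles being converted look roughly like this on a hypothetical function (Ruff can then enforce one convention or the other via its pydocstyle settings):

```python
def load_rest_style(path, limit):
    """Load records from a file (reST/Sphinx docstring style).

    :param path: Location of the input file.
    :param limit: Maximum number of records to read.
    :return: A list of records.
    """
    ...

def load_google_style(path, limit):
    """Load records from a file (Google docstring style).

    Args:
        path: Location of the input file.
        limit: Maximum number of records to read.

    Returns:
        A list of records.
    """
    ...
```

The conversion Michael describes rewrites every `:param`/`:return` block into the `Args:`/`Returns:` sections shown in the second function.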

Michael Berk [00:24:04]:
And, of course, it exploded. But just within that hour, I got so much amazing information just by playing and just by trying things. And if there was an error, I stopped and tried something else. I had no goal. I was just looking around. From that, I'm gonna take a step back and say, alright. These are the 6 things I tried. This is the pros and cons of each, and this is what I think we should do.

Michael Berk [00:24:28]:
And then I'll PR that and then look for feedback. And so this process is super, super valuable where you can just go and try things and break stuff. Breaking things is incredibly valuable. And you don't always have to have sort of a methodical approach in this prototyping phase. So after that research and building, now we sort of get into the design decisions. What does that look like? Is it this MoSCoW list, or are there additional components of the design doc that should be built out?

Ben Wilson [00:24:58]:
So your design decisions are meant to serve a couple of purposes. One is, during your research, you found a fork in the road. You're like, hey, I tried 10 things in the hour that I gave myself to decide how we might wanna solve this big problem associated with the design. And out of those 10 things, 3 of them kinda worked. And I really like the first two. I hate the third one, but I don't have enough context to make a decision here on what is the right way to do this. So I need to dig in a little bit deeper and start thinking about what are the potential pros and cons of these three options.

Ben Wilson [00:25:49]:
And then you write those down. And writing them down isn't just for you. I mean, it does help you. It helps you kinda think through the problem and look at it. Like, am I capturing the pros and cons here? Like, what did I not think of? What could potentially go wrong? Maybe this decision and its three options that I'm going with has impacts on another decision that I need to make. So let me go down that, you know, one layer deep into the rabbit hole and figure out, is this compatible with this other part that's tightly coupled here? And you might be able to call out, like, one of those. You might be like, hey.

Ben Wilson [00:26:29]:
Option 3 just doesn't work. It worked for, like, the first part of this, but then the second part is totally broken, so I can delete that. But between options 1 and 2, if you look at interrelated things or things that you might wanna, you know, integrate with later on, you never know. Like, we're speaking in very abstract terms, so each case is gonna be different. But you definitely want to think about how is this gonna burn me in the future or burn my team or burn my company. And that's really what putting that stuff down is for: helping you think through those things, helping you see it in front of you in text or in diagrams, whatever it needs to be. And then the follow-on benefit of this is you get that peer reviewed. So these design docs, they're peer reviewed.

Ben Wilson [00:27:26]:
They have to be. It's super dangerous if you're, like, the one person doing all this stuff. Like, nobody is capable of thinking of all of the possible things that could go wrong or go right, or alternatives that you could have. Everybody's got a history of experiences, of things that they've built, things that they've broken, and that's why a big diverse team of people from different backgrounds always builds better stuff than one person working on their own. And that's why the peer review process is so important. It's not just, hey, my boss who's, like, way more senior than me, they signed off on this. They said it's good.

Ben Wilson [00:28:08]:
They're not gonna have all the context. Maybe somebody who's more junior than you in years who's also on your team has a point of view that's actually valid that says, hey, I think this might actually blow up because I did this thing, like, 3 years ago, and I broke something like this. And then you go and explore that, and you're like, yeah, they're right. Cool. Let's not do that.

Michael Berk [00:28:32]:
Alright. Yeah. So I have a bunch of questions on this phase. The first one, I'll set up with a learning that I had a while ago. So you, probably 6 or 8 months ago, first defined the term implementation detail to me, and it really stuck. It's basically the concept of something that you can figure out later. It's as simple as that. And when you're working through a very complex design or just a complex system in general, it's hard to differentiate design decisions from implementation details.

Michael Berk [00:29:11]:
And it's even harder if you're not super well versed in the stack or if you haven't built things like this before. So question number 1, what is your rule of thumb for differentiating a design decision versus an implementation detail?

Ben Wilson [00:29:31]:
So what you just described doesn't come up that often in our teams. It did come up for me at companies I've worked at before, and certainly at companies I've helped out when I was working in the field. You would see, like, implementation details bleed over into design docs in teams that don't have technical competency. It's natural. It's not like, oh, this is terrible; you're an idiot if you do that. It's irrelevant to a team that has high technical competency because everybody on the team and everybody who's reviewing the design decision knows whether something's possible or not. So if you know enough about computer software when you're doing one of these things, you know what you can and can't do with a linked list. You know, in Python, what a dictionary can and can't do.

Ben Wilson [00:30:36]:
You know when to use a dictionary versus when to use a set, when to use, you know, list comprehensions versus for loops. Those are all nuances that are, like, core CS fundamentals, and you ensure that everybody working on software at your company has those competencies when they get interviewed for a job at the company. So you don't need to put all that stuff in there. It's irrelevant. Everybody knows that you'll figure it out. Right? That's the technical aspect of building something. What you don't wanna do is make a decision on something that has to do with architecture and just assume that everybody knows, like, what's gonna go on. So an example would be, hey.

Ben Wilson [00:31:39]:
Are we gonna store metadata about this object that we're recording information on? And the design decision in that situation would be a Boolean: yes, we're gonna store metadata, or no, we're not gonna store metadata. An implementation detail would be, are we gonna store this temporarily in a dataclass in Python, or are we gonna use a namedtuple? Nobody cares, dude. Like, we'll figure that out when it comes to it. And that's an implementation detail that is so irrelevant to the point of whether we're going to save metadata, yes or no. So the design decision is at the heart of what are we going to build, not how are we going to build it.
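Ben's metadata example can be made concrete. The design decision is that metadata is stored; whether the container is a namedtuple or a dataclass is the interchangeable implementation detail. The names below are invented for illustration:

```python
from collections import namedtuple
from dataclasses import dataclass

# Design decision: yes, we store metadata for each run.
# Implementation detail: which container type holds it.

RunMetaTuple = namedtuple("RunMetaTuple", ["run_id", "created_by"])

@dataclass
class RunMetaClass:
    run_id: str
    created_by: str

def describe(meta):
    # Caller code relies only on the attribute contract,
    # so either container satisfies it unchanged.
    return f"{meta.run_id} ({meta.created_by})"
```

Because `describe` works identically with both containers, the choice between them can safely be deferred out of the design doc and into the implementation.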

Michael Berk [00:32:28]:
Yeah. So let me be really annoying here. I have some solutions, but I'll still be annoying. So I think you hit the nail on the head with the comment about when you're hired, you should have some sort of technical baseline. But my question to you is, how do you handle this if you're working out of your technical capacity, or working with people for whom reviewing this will be out of their technical capacity? And taking an extreme example, oftentimes, I'm designing things for customers that are not very advanced, but they use Databricks features that are not generally known to most data engineers, let's say. Or they're small components that they just haven't seen before. And so should I go and spell it out for them, or should I just have enough confidence to trust myself? Or is there some sort of middle ground? And, conversely, on their side, if they don't know about the Databricks features, for instance, how do they go about defining what is feasible, and, thereby, what is a decision versus an implementation detail?

Ben Wilson [00:33:40]:
So that's what you figure out in a prototype. So your prototype code doesn't necessarily need to be in your design, but it's super helpful if it is. You've seen tons of my designs that I've done. They're filled with code. Yeah. That's just how I do them. I'm a nerd that way. I've seen other people's designs where there's not a single line of code because they don't need to do that.

Ben Wilson [00:34:04]:
It's not a requirement, or what they're designing doesn't really need to be explained. Like, the concept can be communicated with just English. And then there's other things where you see somebody's design doc, and you're like, hey, this is, like, 95% code samples. That's cool. Maybe a bit much, but, hey, it works. Just make sure that that complexity fits what you're working on and what you're designing.

Ben Wilson [00:34:35]:
But before you go through and figure out what design decisions there are, you should do a really quick, nasty prototype. And by nasty, I mean it really doesn't have to look good. Write a function that serves the purpose of doing just the bare minimum of what you're trying to prove out. For the example of metadata, right, that's trivial. But if we wanted to show, like, one potential avenue of how we're gonna get information out of this object: write a function that takes in that object and then strips out the information from it, puts it in, you know, a dictionary, and returns that. Is that what the final code is gonna be? Probably not.

Ben Wilson [00:35:21]:
Doesn't matter. But I wouldn't put that code in the design doc, because I know that everybody who would see it would be like, yeah, dude, we get what you're saying here. Take the metadata. We got it. Know your audience, basically. But if it's something like, hey, I'm interfacing with this library that nobody here has any experience with, because we've never had to import it in anything that we do.

Ben Wilson [00:35:54]:
And it's 3 months old, and it's super popular, and everybody's using it, but we don't know it yet. I will go out of my way to do a ton of prototyping on this thing to show myself, and to show the audience who's gonna be reading this document, that I did my homework to validate that these things are actually possible or not possible. And that's one of the hidden purposes of a design doc: to communicate that you looked at all this stuff so that people are like, yeah, this person was thorough. They went through and validated all this stuff, so we're okay with signing off on this.

Michael Berk [00:36:39]:
Yeah. Yeah. It actually hits home when you mention logging metadata. This was something that I've struggled with in some designs in the past, because I am not super comfortable with all the different ways you can log metadata. Basically, how to handle it asynchronously, how to ensure that reads and writes are super fast, how to condense the table after logging the metadata. There are all these components that can be challenging if you don't go and build a bunch of prototypes. But my question to you is, alright, if I'm prototyping, there's theoretically an entire world of software that I would have to learn.

Michael Berk [00:37:19]:
And it could take anywhere from a couple of days up to a few years. How do you define the cutoff between I should figure it out myself versus I should ask someone?

Ben Wilson [00:37:34]:
It depends. So if you're presented with a fairly straightforward problem and you don't know where to even start, phone a friend. Ask a more senior person. You should have a mentor assigned to you, or some tech lead who can kinda point you in the direction: I don't know the exact answer here, but here's what I would do. And usually those people have enough experience that they can give you the correct vector in which to start your journey. But if you're experienced and you've been around for a long time doing this stuff, and this is your 800th design, and you're presented with a problem where there's no clear answer on how you would even begin to solve it, then designs are even more important. Because now you have to start thinking: is there something that exists within the language we're all writing in that supports this mechanism we're trying to build? Is there some package somebody wrote that solves this problem that we can use? What is there that we can leverage to solve this? And if you exhaustively go through all of your resources, knowing where to go, who to ask, getting other opinions, sitting down in a roundtable discussion with a bunch of other senior people, you end up asking: do we have to build this from scratch ourselves? Create a new library that does this functionality? And if that's the case, you write that down, and you get a peer group of very senior people who all agree, like, hey.

Ben Wilson [00:39:18]:
This is not possible unless we do this. We need to write a new C++ library that does this thing, put a Python wrapper around it, and then we'll interface with that. That now becomes a requirement to meet the need of whatever you're trying to do, and then you weigh whether it's worth it or not. In everything that I've worked on at Databricks, 99% of the time when we see something like that, and we have seen requests like that, we're like, well, in order to do this, we would need to build this entire library. We'll go through the motions of listing that out and saying, okay, this is gonna take us 6 months to build. Is this worth it? And then product and engineering management will look at that and be like, yeah.

Ben Wilson [00:40:15]:
Let's not do that. Let's go do this other thing that's kind of similar, that we can get done in 2 weeks, and it's gonna meet 90% of the use cases. And those edge cases that are not supportable, they're not supportable.

Michael Berk [00:40:30]:
Right. Okay. I think I understand. And just reflecting a little bit, it seems like one of the core values of having a wide knowledge base is the speed at which you can design things. If you've built 7 different versions of logging metadata, well, you can pretty quickly say this is the best one we should use, here's why, and build out that pros-and-cons list with minimal prototyping. But if you haven't done that type of thing in the past, sometimes it's a bit more challenging, because you actually have to go in and test things out, and it just takes a longer time.

Michael Berk [00:41:08]:
So, Ben, opening the floor a little bit. Do you have any general tips that you on the MLflow team use to make your team more effective that other people can employ?

Ben Wilson [00:41:21]:
I mean, the big thing to remember is we're not doing designs for every feature that we build. Probably high-nineties percent of feature PRs that are filed and merged from our team have no design associated with them. We don't need it. The team is freaking incredible, and they know how to build stuff. So if it's something that's very clear in 1 or 2 sentences, like, hey, we need to build this thing. And you're on our stand-ups. You know? You see what we work on, how we talk with one another, where we get stuck on certain things, what needs discussion.

Ben Wilson [00:42:10]:
And sometimes you'll see someone file a PR. We open it, look at it for a couple of seconds, and go, yep, cool. We'll take a look and see if we spot any typos or anything. But usually it's like, yeah, it's pretty good. It's knowing when you have to do that for the 4 to 5% of, hey, this is big. This entire feature is gonna be maybe 20 or 30 PRs to build.

Ben Wilson [00:42:39]:
And the reason that you go through that whole design process is not just to make sure you're building perfection. That's illusory, and not real, by the way. There's always gonna be a better way to build something. There's always gonna be a more efficient way of writing your code. Or, are we building for performance, or for readability, or for testability? There are so many questions, so many different interpretations. That's why software development is an art as well as a science. And because of that limitation on perfection, you're never gonna hit anybody's subjective or objective version of perfection with an implementation. What you're trying to do is minimize the loss of time. That's the end goal.

Ben Wilson [00:43:39]:
It's all about velocity, man. How many people do you have? How many hours in a work week can people productively work on new things or on fixing broken things? And how can you minimize the amount of wasted time those people have to go through? Because when you get a bunch of brilliant people together, they really don't wanna be just plugging the leaks in the boat all the time. They wanna be working on cool stuff, building new things, and building value where people actually use the cool stuff they're building. That's the goal. That's why most people are in open-source development, or working in software in general. Most of those people are nerdy people who want to build cool stuff that other people actually use, and that's what gives them joy. It's like, hey, I built this, or we built this together, and look how many people are using it, and they love it.

Ben Wilson [00:44:38]:
This is awesome. It's not, hey, look how smart I am, I'm the best. Nobody has that attitude. Everybody's like, man, I'm an idiot. I'm glad that this actually worked out. This is cool that people are using it. You do the design so that you have less stuff to potentially fix, and to prevent you from releasing something that people are excited about, only for them to use it and realize that it sucks or it's broken.

Ben Wilson [00:45:06]:
You wanna prevent that. Because if you use something that you think should work, and then you try it and go, hey, hang on, this is totally broken, man. Are you gonna try it again? Are you gonna lose faith in the project, or lose faith in the team that's working on it? Yeah. Maybe. So it's all about making stuff that's not totally broken.

Michael Berk [00:45:32]:
Yeah. So can you clarify what counts as a loss of time? I'm just thinking through a list. Maintenance is sort of required sometimes, but it's a loss of time. Bug fixes, definitely a loss of time. Refactors, maybe a loss of time. Regressions, definitely a loss of time. And then things that aren't being used, so features that just don't add value. That's also a loss of time.

Michael Berk [00:45:57]:
Do you agree with all of those, and are there other things?

Ben Wilson [00:46:01]:
Kinda. So a regression is an incident. That's bad. Like, really bad. You broke something that used to work. I've done it. Everybody I know who's worked in software engineering for more than 5 years has done it. When you do that, it's an opportunity to learn, though.

Ben Wilson [00:46:23]:
Like, hey, why did this happen? Not, hey, let's place blame on the person who broke it. It's: how can we prevent this from happening again? Our process is broken in some way. How do we change our process in the execution of delivering code so that we don't have these sorts of failures happen again? So that's a serious incident, and it's usually the fault of whoever was in charge of that process. Right now, that's me. So it's very much on whoever the tech lead is to really take that to heart. Like, this is very bad that we did this, and I'm the reason that this broke.

Ben Wilson [00:47:08]:
And I need to make sure that my team doesn't get exposed to this again. Yeah, it wastes a lot of time, but it also makes everything better. The waste of time is when you build something where every time you have to touch that code or interface with it, it just feels painful. You're like, hey, we wanna add this new feature, but every time we do, we have to touch, like, 40 other places in the code to enable it. That means your code architecture sucks, or you have a very complex product that could potentially benefit from refactoring. But usually, when you're looking at tasks like that, it becomes a real problem when you can't refactor it because you don't have enough time. It's like, hey.

Ben Wilson [00:48:11]:
In order to build this feature, we need to fix the code in such a way that it's gonna take 5 times longer to refactor the code than to build the core feature. That's a problem. And design can help prevent that. But you don't really do that in product design. You do that in PR reviews. So when your peers are looking at the code, they're the ones who should be saying, let's change this to a different code architecture. Let's write this in a different way, because you're putting, you know, 30 actions within this method. Let's break this method up into 10 different methods so that we can test them all, and also decouple all of this logic, because something that's internal to this is probably gonna have to change a bit.
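The decoupling idea above can be sketched in a few lines. This is a deliberately tiny, hypothetical example (the order-processing domain and all function names are invented), showing the same behavior packed into one method versus split into small, individually testable pieces:

```python
# Hypothetical before/after: one function doing several unrelated steps,
# versus the same logic split so each piece can be tested in isolation.

# Before: everything packed into one call. Changing the coupon rule
# means touching (and re-testing) the whole thing.
def process_order_monolith(order):
    total = sum(item["price"] * item["qty"] for item in order["items"])
    if order.get("coupon") == "SAVE10":
        total *= 0.9
    return {"customer": order["customer"], "total": round(total, 2)}

# After: each concern is its own small function.
def compute_subtotal(items):
    return sum(item["price"] * item["qty"] for item in items)

def apply_coupon(total, coupon):
    return total * 0.9 if coupon == "SAVE10" else total

def process_order(order):
    subtotal = compute_subtotal(order["items"])
    total = apply_coupon(subtotal, order.get("coupon"))
    return {"customer": order["customer"], "total": round(total, 2)}

order = {"customer": "a", "items": [{"price": 10.0, "qty": 3}], "coupon": "SAVE10"}
# Same behavior, but now coupon logic can change without touching the rest.
assert process_order(order) == process_order_monolith(order)
```

With 30 actions instead of 3, the payoff is the same but much larger: each extracted function gets its own tests, and internal changes stop rippling everywhere.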

Ben Wilson [00:49:13]:
And the more crap you pack into a single execution statement, the more dangerous it becomes when you have to modify it, which can cause a regression, which is bad. So that's technical competency. That's experience about knowing how to build stuff, separation of concerns, and focusing on code readability is big on that too. If you can understand how something works just by glancing at it, like, yeah, I totally grok what that's supposed to be doing, it's a minimal set of instructions that this function or method is responsible for. Cool. But if you look at something and you're like, hang on a second.

Ben Wilson [00:49:59]:
I'm gonna need about an hour to go through this because there's so much going on here. You know, this one function is 600 lines of code long. What the heck is happening here? You should be scared. And you shouldn't be afraid to tell the person who wrote it: can you please write this so a human can read it? You know? The computer can read it just fine, but if other people don't understand what you wrote, are they gonna fix it when it breaks?

Michael Berk [00:50:29]:
Yes. Code is for humans. That's why we don't write in binary.

Ben Wilson [00:50:33]:
Exactly.

Michael Berk [00:50:36]:
So alright. You touched on everything in that list but one, which is things that aren't being used. A built feature that isn't being used by the customer, or whoever you're building for. Is catching that part of the design doc? Is it the responsibility of the TL to find product-market fit? Whose responsibility is that? And would you count it as a loss of time if you build something and it doesn't get used?

Ben Wilson [00:51:08]:
No. Engineering is not involved in that. Like, that's product. Got it. But the tech lead works with product and should be communicating, and you should be analyzing whether you have customers that are using something. Engineers aren't emotionally attached to stuff like that. I haven't met any who are, and I definitely include myself in this list. Nobody's like, I don't wanna get rid of this feature.

Ben Wilson [00:51:37]:
Like, I spent so much time on it. Every single one of us is like, I wanna kill that code. I wanna delete it. Can we pull the data on usage and see? Are there fewer than 3 people who have used it this month? If so, can we delete that crap? Let them know, like, hey, we're getting rid of this. Slap deprecated on it and say, we have, you know, 850,000 users using these other APIs, and we have 3 people using these APIs. That's a waste of time. And when we have to do maintenance on this and update the code so that it works in a new environment or something, it's just a massive waste of time. So, straight up, you should be ruthless about deleting code that provides no benefit to anybody.
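"Slap deprecated on it" can be as simple as a warning that nudges the remaining callers before deletion. Here's one minimal way this might look in Python; the `deprecated` decorator and `old_api` are hypothetical illustrations, not any library's actual deprecation mechanism:

```python
import functools
import warnings

# Hypothetical sketch: mark an API as deprecated so the few remaining
# callers get a warning before the code is eventually deleted.
def deprecated(message):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated: {message}",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator

@deprecated("use new_api() instead; removal planned for a future release")
def old_api(x):
    return x * 2

# Callers still get results, but see the warning nudging them off the API.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert old_api(21) == 42
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Pair this with the usage data: once the warning has been out for a release or two and the numbers stay near zero, the code can go.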

Ben Wilson [00:52:33]:
Like, the code doesn't care. It doesn't have feelings. And if the people on your team get upset at seeing their code disappear, I would question, like, who did I hire on this team? Why are they getting so emotionally attached to this? Because they should have enough experience to realize that anything you write is ephemeral. You know? Somebody else is gonna improve it, or fix what you screwed up, or make it better, or add features to it, or eventually delete the code. You know, once you get experienced enough, you start thinking back, like, I can't even remember what I wrote 6 months ago. Or you look through a commit history in a large repository. That's the funniest thing that happens to me, actually. I look at some function.

Ben Wilson [00:53:21]:
I'm like, who the hell wrote this? This is kind of crap. Who wrote this? And then I look, and, oh, Wilson. Yeah. That's me. Let me fix that. I think that's funny when that happens. And it's also funny when the opposite happens, where you look at something and you're like, that's really clever, that's pretty cool that we have that in the code base. And you click on it, and you're like, I wrote that? When did I? I'm not that smart.

Ben Wilson [00:53:51]:
Like, when did I come up with this? But when you write enough code, stuff like that happens, and it means you're no longer attached to that code. You don't even remember writing it. You don't care. You're focused on: am I building stuff that people want to use? Is what I'm creating useful to people? That's really all that should matter. And if it's not, delete it. Get rid of it. It's a burden. Yeah.

Michael Berk [00:54:21]:
But what if you used a list comprehension instead of a for loop, and that's super fancy? Isn't that super painful for you to delete, Ben?

Ben Wilson [00:54:30]:
No. No. A note on that. Refactoring for efficiency is a massive pet peeve of mine and many of my peers, as is people making code-review comments about stuff like that where it's irrelevant. So say you're going in and looking at how somebody built something, and you're like, I don't really like how they're doing this implementation, it's not very performant. And then you look at it, and it's called one time on a collection that never grows more than 10 elements deep, it can't grow further than that. Your efficiency improvement between a for loop and a list comprehension with a filter on that iterator, you're saving a tenth of a millisecond. The computer doesn't care.

Ben Wilson [00:55:33]:
Like, you're not saving anything here. So that comment should be focused on: we use list comprehensions because they're easier to read, because everywhere else that we're operating on collections uses list comprehensions. Let's do that because it's more readable, because for loops can get ugly when you look at them in Python. That's totally valid. Let's refactor for readability. But if somebody's going through and saying, I don't think we should use a dictionary comprehension, we should use set operations because that's more efficient for the Python interpreter, nobody cares. Because that'll make the code way more complex, to start throwing a bunch of set operators in there, and you have to cast things to different types and then recast them back to something else.

Ben Wilson [00:56:30]:
So the code complexity increases for almost zero benefit. That's where wisdom comes in when you're doing PR reviews: knowing what is important. At the end of the day, the most important thing is, does the code work as intended? Right behind that is, can we actually read it and understand it? And below that is performance, when it's not directly coupled to whether the thing works at all. And that's a very different conversation when we're talking about collection operations at scale. If we're iterating over 10,000,000 elements in a collection, then those little things matter. Are we doing some sort of filter operation where we need to pop out the first element that we hit? Well, we should be using something that short-circuits, some sort of greedy search that terminates instead of iterating over the entire collection. Otherwise it's wasteful, because now you're talking about something that returns in a tenth of a second versus something that takes 10 minutes to return. Then, yeah, that stuff matters.
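The short-circuit point can be sketched in a couple of lines. This is an illustrative example, not the specific code Ben has in mind; `first_match` is a made-up helper name:

```python
# Over a large collection, stopping at the first match is the
# optimization that actually matters.

def first_match(iterable, predicate):
    # next() on a generator expression stops as soon as the predicate is
    # true, never touching the rest of the collection.
    return next((x for x in iterable if predicate(x)), None)

big = range(10_000_000)

# A full scan like [x for x in big if x > 5][0] would build a list of
# almost ten million elements just to read one. first_match examines 7.
assert first_match(big, lambda x: x > 5) == 6

# With no match, it returns the default instead of raising.
assert first_match(range(10), lambda x: x < 0) is None
```

This is the "tenth of a second versus 10 minutes" category of fix: the algorithmic shape changes, not just a loop's syntax.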

Ben Wilson [00:57:48]:
But all of that stuff is figured out as part of the design doc, to understand the nature of what we're dealing with. Are we dealing with, you know, small amounts of data that need to be transferred across objects really quickly, or massive amounts of data where we need to think differently about how we're implementing this, and then write prototype code to validate? Like, hey, what happens if I do this versus this? Oh, we should definitely go with option B here because it's a thousand times faster. Yeah, that would be in the design doc.

Michael Berk [00:58:28]:
Yeah. Yeah. People tend to undervalue readability and overvalue efficiency because that's what computer science teaches. But,

Ben Wilson [00:58:38]:
yeah. And I find that it happens more with junior devs who sometimes wanna try to flex in code. I don't see it in PRs in MLflow or anything, but I've definitely seen it in other projects I've worked on, where you get somebody who's a couple months out of college, or maybe they're an intern or something, and they just write something that's super complex. You know, a super low-level implementation of something, and it's like they're trying to show you how smart they are in their code. That's their only motivation. And you provide feedback, like, hey, you know what you're doing here? There's something built into the language that does almost the same thing.

Ben Wilson [00:59:28]:
It's a little bit slower, but it's, like, 40 characters to type. Your implementation is 85 lines of code. Yeah, it's much faster, and, you know, cool C function that you wrote. That's awesome. Now we have a compilation stage that we have to do in order to install this, for this one function. Do we need this? What's the real purpose here? Kudos to you for paying attention in your C++ class in college, but this is not needed. So you very rarely see anybody who's somewhat senior ever doing stuff like that.
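As a toy stand-in for the "85 lines versus 40 characters" trade-off (the word-counting task here is invented for illustration), compare a hand-rolled loop against the built-in that does the same job:

```python
from collections import Counter

# Hand-rolled counting: more lines, more surface area for bugs, and
# every reader has to verify the logic by hand.
def count_words_manual(words):
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

words = ["a", "b", "a", "c", "a"]

# Counter(words) is the short, familiar spelling of the same thing;
# any Python reader recognizes it instantly.
assert count_words_manual(words) == dict(Counter(words)) == {"a": 3, "b": 1, "c": 1}
```

Scale this gap up to a custom C extension versus a built-in and the review feedback writes itself: the clever version has to earn its complexity, and usually it doesn't.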

Ben Wilson [01:00:14]:
Are they capable of doing code flexes? Yeah. Probably more so than that junior person is even remotely aware of. They're never gonna do it, though. There's no point. It doesn't solve any problem. It just makes people confused. They're like, why did they do it this way? This is really hard to read.

Michael Berk [01:00:35]:
Yeah. Yeah. I'm excited to one day, a, be proficient enough in software engineering, and, b, be put on projects that are complex enough to warrant seeing a senior person flex. Because I don't think I have yet. One day I will, and it'll be interesting to see.

Ben Wilson [01:00:56]:
I mean, there are tons of packages out there that are super complex when you look at them.

Michael Berk [01:01:03]:
But that warrant the complexity?

Ben Wilson [01:01:06]:
Yeah. Oh, yeah. I mean, if you really wanna get into ludicrous levels of complexity, look at source code for languages. Look at the core implementation of how some languages are implemented. You'll see some crazy stuff in there. Look at solvers for ML algorithms. You may not know it, but you work at the same company where a lot of that stuff was invented by the people who work there. And you look at their source code, and you're like, this is easy to read but insanely complex, because that's the only way to solve it, or that's the simplest way to solve this problem.

Ben Wilson [01:01:48]:
It's just that the problem is so complex that it's like, okay, this is intense. Take something like the matrix-based solvers in Apache Spark, like the core Scala code for that. It's intense.

Michael Berk [01:02:03]:
Yeah. Yeah. So the bottleneck is me not knowing what the hell is going on well enough to appreciate it, not the code existing. It definitely exists.

Ben Wilson [01:02:14]:
Yeah. But even the people writing that stuff are writing it in a way that is as readable and easy to understand as the problem can allow it to be. Right. So if you look at that code, you might not realize at first, like, okay, this is pretty crazy what's happening here. And then you reverse engineer it and walk through: what's going on here? How is this done in the distributed system? Hang on, these are RDD operations, which are then node to node. Workers are communicating pieces of this matrix to one another. And you start writing it out on paper.

Ben Wilson [01:02:51]:
What's going on? Like, man, the person who designed this is a freaking genius, the way they thought through that. And then you ask, who committed this? And you're like, oh, it was Sean Re. You know? Yeah. You can talk to him. He's on Slack. So, yeah, there's crazy stuff out there, but they're senior enough and skilled enough that they write it in a way that is not intentionally flexing. Right. They're flexing, you know, because they made it so simple.

Ben Wilson [01:03:25]:
And a lot of people, if they were presented with that same problem, would write something that would be 10 times the amount of code, way more complex, and less performant. So it looks fancier, but true elegance, and true skill in complex software, is making it seem as simple as possible. That's the end goal.

Michael Berk [01:03:48]:
Exactly. And the first step to simple designs is a design doc. Yes. So with that, I will summarize what we discussed. First things first, one really interesting opinion that Ben has is that process is what makes the great companies great, not necessarily geniuses. People are important, but process is the key to shipping good software. And then when you're building out a design, there are a few steps. First, you should generally define your mission statement or your goal.

Michael Berk [01:04:21]:
2nd, just build stuff: prototype, research, understand the use case. 3rd, you should write decisions with pros and cons, and heavily think about the cons, specifically maintainability and bugs, and basically how people would interact with that feature if they wanna extend it, modify it, or just use it. And then finally, get peer review. Ask people more senior than you to review, ask people less senior than you to review, and just collect as much feedback as possible. And then some tangible tips. There's this concept of an implementation detail versus a design decision, and that cutoff can be arbitrary, but it's typically a function of how technical you are and how technical the audience is. So setting it to the level of the reader is typically really helpful if you want robust reviews.

Michael Berk [01:05:08]:
Asking for help is important when you don't know where to begin. But try things first: try googling, try prototyping. And you don't need to do a design for everything, just the more complex things. So anything else you wanna say before we close?

Ben Wilson [01:05:26]:
No. I think that was a perfect summary.

Michael Berk [01:05:29]:
Well, until next time, it's been Michael Burke and my co host Ben Wilson. Have a good day, everyone. We'll catch you next time.