How to Transition from Academics to Industry - ML 114

Today we speak with Noah Silbert, a former data scientist at Netflix and current data scientist at Tubi. Expect to learn about how company size can impact your role and day-to-day work as a data scientist. We also cover Noah's experience moving from being a professor to handling the scale and complexity of the tech industry.

Hosted by: Michael Berk

Show Notes

Today we speak with Noah Silbert, a former data scientist at Netflix and current data scientist at Tubi. Expect to learn about how company size can impact your role and day-to-day work as a data scientist. We also cover Noah's experience moving from being a professor to handling the scale and complexity of the tech industry.
 


Transcript


Michael_Berk:
Welcome back to another episode of Adventures in Machine Learning. I'm your host, Michael Berk, and I do data science and machine learning at Databricks. Today I'm not joined by my co-host, Ben Wilson. He got a manager-mandated vacation, which sounds like the best thing in the world. But today I'm joined by a good friend, former colleague, and overall badass, Noah Silbert. He has a master's degree in statistics and a PhD in linguistics and cognitive science. After being a professor and research assistant for about a decade, he joined Netflix as a senior data scientist, where he focused on dubbing and subtitles. Soon after, he joined Tubi, which is where we met, and he focused on causal inference and advertising optimization. And so Noah, just getting right into it: what is your favorite research paper you've ever published?
 
Noah_Silbert:
I've ever published? Oh, that's
 
Michael_Berk:
Yes.
 
Noah_Silbert:
a good question. Let's see. All right, I'll go with one that I worked on with a colleague. There was a particular mathematical model that I worked with a lot as an academic. We called it general recognition theory. It's basically multi-dimensional signal detection theory, so it's like a multi-dimensional sort of probabilistic decision-making model. I had used it... I collected a lot of data, used that model, built some multi-level Bayesian versions of it, and fit it to the perception data that I collected, like speech perception data. And then with this colleague of mine, we did some work that was purely theoretical. So we did some mathematical analyses. It had to do with model identifiability. Basically we realized that there were some issues with the model, and by issues I just mean that in the most general form of the model, there's a bunch of different configurations of the model that would produce exactly the same predictions. And so
 
Michael_Berk:
Hmm.
 
Noah_Silbert:
flipping that around, what that means is there are certain, depending on how you structure the model, there are cases where you could fit very different models that would allow you to draw very different inferences, but they couldn't be differentiated by the data. So I don't know if I'm explaining this very well. Anyway. We wrote some papers that were purely theoretical, focusing on exactly these identifiability issues. And it was a lot of fun to do these just sort of like, you know, we were in some simulations, we did some mathematical analysis, and there was something that was kind of fun and satisfying in a way. And so I guess to pick one, there's one in particular, we wrote this one big long paper as part of that originally, there was one section of it that I enjoyed working on that was We decided it didn't fit and then we spun it off and wrote it as a separate paper. And so I'll say that one was my favorite just because it was fun. It was fun, it was an interesting model to work on, an interesting paper to work on. I'm sort of hesitating and hemming and hawing here because I'm also acutely aware that it's, like it's almost certainly one of my least cited papers. Like it probably had basically zero impact, but it was a lot of fun to work on. So
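
To make the identifiability point concrete, here is a minimal editorial sketch (not from the episode or from the papers themselves): two toy two-dimensional Gaussian "perceptual" configurations with different means, covariances, and decision bounds that nonetheless predict identical response probabilities, so response frequencies alone cannot distinguish them. All numbers and names are made up.

```python
# Hypothetical illustration of (non-)identifiability in a GRT-style model:
# two different-looking configurations predict identical response probabilities.
import numpy as np

rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(1)
n = 500_000

# Configuration A: correlated perceptual noise, axis-aligned decision bounds.
mu_a = np.array([0.4, -0.2])
cov_a = np.array([[1.0, 0.6],
                  [0.6, 1.0]])

# Configuration B: transform A by an arbitrary invertible matrix T. Both the
# distribution and the decision bounds change, but the predictions do not.
T = np.array([[1.3, 0.5],
              [-0.2, 0.9]])
mu_b = T @ mu_a
cov_b = T @ cov_a @ T.T
bounds_b = np.linalg.inv(T)  # rows of inv(T) define B's sheared linear bounds

def response_probs(samples, bound_rows):
    """Assign each 2-D sample to one of four response regions and tally."""
    signs = (samples @ bound_rows.T) > 0                  # n x 2 booleans
    labels = signs[:, 0].astype(int) * 2 + signs[:, 1].astype(int)
    return np.bincount(labels, minlength=4) / len(samples)

samples_a = rng_a.multivariate_normal(mu_a, cov_a, size=n)
samples_b = rng_b.multivariate_normal(mu_b, cov_b, size=n)

p_a = response_probs(samples_a, np.eye(2))   # A: bounds are the coordinate axes
p_b = response_probs(samples_b, bounds_b)    # B: different mean, cov, and bounds

print("Config A response probabilities:", np.round(p_a, 3))
print("Config B response probabilities:", np.round(p_b, 3))
# The two agree up to Monte Carlo error: the configurations cannot be told
# apart from response frequencies, which is the identifiability problem.
```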
 
Michael_Berk:
What was it called?
 
Noah_Silbert:
From the perspective of... oh boy. I
 
Michael_Berk:
Yeah, let's go.
 
Noah_Silbert:
don't even remember. I can... I'll send you the title. I don't remember. I mean, I know what the topic was, but it would get too much into the weeds.
 
Michael_Berk:
Okay.
 
Noah_Silbert:
I think they even described, it had to do with optimal decision-making in this modeling framework. That was like kind of the key to it. So I think that was part of what was fun is sort of digging in and understanding, yeah, what it meant to make optimal decisions with this model. And
 
Michael_Berk:
Got it.
 
Noah_Silbert:
when optimal decisions would produce certain patterns of data, of decision-making and data.
 
Michael_Berk:
Cool. Yeah. So we have a pretty wide range of demographics in our listener base, and a good chunk of them are currently in academics, and some might be looking to make a transition into industry. So you worked on a bunch of papers, you worked as a research scientist, but then you decided to go into industry. Can you walk us through a little bit about that decision? What was it like for you? Why did you make the decision, and were there any challenges along the way?
 
Noah_Silbert:
Sure, absolutely. Yeah, there was one really key moment for me as an academic, sort of two related key moments. One of them was I spent a lot of time and effort writing grant proposals in my tenure track position and worked with a consultant, had a lot of good support, got pretty good at it, and ended up getting a grant funded from the NIH.
 
Michael_Berk:
Quick question on that. How much of grant funding comes from writing a good grant versus having a good idea? Can you give a percent breakdown?
 
Noah_Silbert:
Yes. That's a good question. Oh, a percent? I don't know. I mean, you have to, I think you have to have a good idea, but you have to be careful. Like a good idea means in some sense a fundable idea. So it has to be good in the context of a granting agency. Another way to put that is it has to be something that you can convince the panel is a good idea. So it can't just be something that
 
Michael_Berk:
Got it.
 
Noah_Silbert:
you think is a good idea, right? It has to be something that according to the mission of the granting agency. they will also say like, yeah, that's a good idea. Writing the grant though is really, really crucial. I mean, it's like, it's maybe a cop out, but you know, 50-50 honestly, it might even be more on the writing side, just because that whole system. There are way more applications and way more people seeking grants than there are grants to give. And part of what that means is that... When, and this will probably be, I'm trying not to be overly cynical. This will probably still be kind of on the cynical side of things. When a panel is reviewing a grant, I think there's a strong incentive for them to take any flaw that they can find, any reason to not fund a grant, and kind of, and use that and say, all right, like put that one on the trash sheet. Not necessarily on the trash sheet, but like below the threshold for funding. So. Part of what that means is when you're writing a grant, you have to work really hard to not provide them with any easy excuses to not fund it. And that's, you know, the grant, the consultant, the grant writing consultant that I worked with was great. And he like, he drove that point home to me. And I absolutely learned that from him. Like he really, really drove home the point that it's just like, there's a bunch of known things that will make it very likely that a grant won't get funded. And so knowing to avoid those things and working really hard to. write a grant in a way that it won't get dinged for those sorts of errors is really important. And so, you know, he was fun to work with. I worked really hard on it, submitted it, it got scored but not funded. I was able to resubmit it and then it got funded. And the reason I'm bringing this up in relation to my move to industry is I got an email from the program officer at the NIH and she let me know that it was going to get funded. And I was really excited and happy. for about 30 seconds. And then I thought, okay, what's my next grant? And then the next thought was, I don't know that I wanna do this. I don't think I can do this. Like, this is just not, this is not how I want my, like how I wanna do research. So some people, you know, one of my colleagues at the University of Cincinnati has a fantastic track record of getting grants funded and she really enjoys the process and she's clearly really good at it. Based in part on knowing her and talking to her, I worked really hard to shift my mindset and find ways that I could get value out of the writing process, whether or not my grant got funded. But ultimately in that moment, part of what I realized was like, yeah, I definitely got something out of it. I thought long and hard about the design of the experiments I was proposing, how I was gonna apply that modeling framework I was talking about previously to, in this case, disordered voice perception. Like I developed some new kind of angles with this model to try to use it. Some of my colleagues worked So I was in a communication sciences and disorders department at University of Cincinnati and some of my colleagues that I worked with were voice specialists. And so I started getting interested in how perception of disordered voices works. And so I got a lot out of that process of writing the grant. But ultimately, you know, it became clear to me in that moment that even though I got a lot out of it, it didn't feel like it was enough for just like the stress and the... I don't know the costs of it. 
And you have to do it. Like you have to get grants to fund your research.
 
Michael_Berk:
Yeah, necessary evil.
 
Noah_Silbert:
And so part of what was appealing about industry was the possibility of being able to do cool research that I like doing, which kind of by definition is already funded, right? And so to back up again, there was that moment of writing the grant and realizing, like, oh man, I don't think I can do this, I don't think I want to do this. And then a close friend of mine, who was also an academic, got a job at Netflix. And talking to him about that, I thought, this sounds amazing. And then just through him, he connected me with a manager that was hiring and I interviewed, and it did indeed seem like it would be amazing. And so yeah, I got offered a job and took that.
 
Michael_Berk:
Nice. What did you like?
 
Noah_Silbert:
So it was a combination of the... oh sorry, go ahead.
 
Michael_Berk:
No, I interrupted. Go ahead.
 
Noah_Silbert:
Oh, I was just gonna say, so it was a combination of like, looking at the academic landscape, looking at what my job was gonna be like, and feeling like it wasn't really what I wanted, that I wasn't gonna be that happy with it. And then just by chance, getting this opportunity to get a really good industry job, kinda out of the gate.
 
Michael_Berk:
Nice, is it everything you hoped and dreamed?
 
Noah_Silbert:
Yeah, I mean, so one of the things that was interesting about that transition also was I didn't really know what data science would be like. And I worried a little bit. Like I knew if I went into that and then the problems I was working on were not interesting that I would be very unhappy with that. So I was a little anxious. I was a little nervous because I just I didn't know what kind of problems I'd be working on. And then very quickly learned like, oh, yeah, like. These problems are exceptionally challenging and very interesting. And yeah, that quickly, I quickly learned that that was not something I needed to worry about. There was a ton of very cool, very challenging, very interesting problems to work on.
 
Michael_Berk:
What was the team that you were surrounded by? What were they like?
 
Noah_Silbert:
At Netflix? Yeah. I mean, they were great. Very supportive, helping me kind of learn the ropes. It was an interesting team. I'm blanking on what the team was called. It was an interesting kind of odd hodgepodge because the other data scientists on the team and the analytics engineers worked on, in some ways, very, very different problems. And I mean, part of this too is the team got re-org not too long after I. got there and some of the people that were originally on the team were like it split off. So some of the people I worked with closely at the beginning did a lot of work in the experimentation area and worked a lot on some interesting like big data problems with just how to how to like what essentially what to measure and how to store all the data that you needed to store how to store the things that you were measuring in a way that was reasonably efficient but still accurate and could be presented to stakeholders in a useful way. Um, after the reorg, the team I was on, so I worked exclusively in and around dubs. So on the product side, understanding how people, how and when people consumed dubbed audio, but also on the production side, because Netflix pivoted in 2016, I think it was towards producing a lot of their own content. And so they were doing a lot of work to produce dubbed audio. It wasn't just a matter of having some titles that had dubbed, but, you know, they were. producing originals and producing dubbed audio for those originals. And so I got to talk to people in LA. It's like, I went and did a workshop at a dub studio in LA and it was super cool to learn, like see the nuts and bolts of how dubs get produced.
 
Michael_Berk:
Yeah, a quick aside on that. It just popped into my head. I remember during your interview... so I joined Tubi a little bit before Noah did, and I remember the data science team was talking about whether or not to hire him. One of the things that we looked for was whether people showed interest, and I remember one of our data scientists noticed that Noah checked the dubs for like every single language, and noticed that even if you reload the page or switch videos, it still maintained that language. And we were like, this man did his homework. Just a little aside. Sorry, continue.
 
Noah_Silbert:
Yeah, no, that's fine. Sticky settings. It's important to have settings be sticky when they should be and not when they shouldn't be, because it can be super annoying. You know, HBO is an example. Okay, we don't need to go into that.
 
Michael_Berk:
Facts. Yeah.
 
Noah_Silbert:
Yeah, the flip side of that coin is some of my interactions with the HBO web interface. So one of the other data scientists on the team had kind of an analogous job focused on subtitles. But then some of the other people on the team were doing work with, I think one guy did work with, like, the box art, so the thumbnails that would get presented, and there were some cool explore-exploit experiments and modeling that they were doing. One or two people were doing some really interesting work on the creation of trailers, how trailers get made, and trying to automate or semi-automate that process. That was very cool to learn about. So it was really a wide variety of... I think the category that kind of held it all together was creatives as a noun, meaning creative materials that weren't the main sort of movies and TV shows. So dubbed audio, subtitles, box art, trailers, all the things that get produced that are still creative materials but that aren't the main content. It was a very interesting array of people, but yeah, a very cool range of skills and backgrounds. I mean, people with lots of machine learning expertise. You know, I think the guy that was doing the box art stuff, his background was in, I want to say, geology or geophysics. He told me about some of his career before Netflix, and he was doing this work on, like, permeability, with gases, you know, fluids and gases underground, and different types of rock and stuff like that. I don't
 
Michael_Berk:
That's cool.
 
Noah_Silbert:
even remember exactly, but like having to do with, I think probably, natural gas extraction, some of that sort of stuff. So yeah, it was kind of a wild array of backgrounds.
 
Michael_Berk:
Nice. And do you think this diversity helped? There's this classic adage that you want a diverse team because it creates good work, but people don't always define why. From my perspective, the value of diversity is that people can attack a problem from very different angles. Did you find this effective at Netflix?
 
Noah_Silbert:
Yeah, totally.
Yeah, I think that's probably one of the strengths of a place like Netflix for sure, is that it's a bunch of, I think, very accomplished people with, you know, strong backgrounds and a huge array, a wide array of fields. And I think, yeah, that's exactly one of the benefits is that you can get people that just know about modeling techniques or mathematical models or, you know, computer science skill sets or whatever, and just be able to come at a problem from a totally different angle, notice things that maybe you don't even... know to think about. Yeah, for sure.
 
Michael_Berk:
Yeah. So, after Netflix, you joined Tubi. And were there any things that shocked you in terms of differences between the two orgs?
 
Noah_Silbert:
Yeah, I mean, so as a brief aside, after Netflix, I worked at the University of Maryland as a research scientist for a year and a half, and then I joined Tubi. So it wasn't directly after Netflix. But yeah, I have found the differences between Netflix and Tubi super interesting. At a very high level, they're mostly related to the difference in the age of the companies and the size of the companies. When I was at Netflix, I think there were probably about 6,000 employees total. The data org was called DSE, data science and engineering, and that's made up of data engineers, data scientists, and analytics engineers. So those jobs at Netflix: data engineers are data engineers, data scientists tend to work on causal inference and experimentation and/or machine learning, and then analytics engineers do a little bit of light data engineering and then build lots of dashboards. So they're fairly complementary roles with a little bit of overlap. Anyway, so those three roles. And at the time, part of what was funny about Netflix was that everybody's job title was senior whatever: senior data engineer, senior data scientist, senior analytics engineer, because the way the company was at the time, they tended to hire people that were further along in their careers. And there wasn't really any kind of leveling up. If you got hired there, you were senior whatever. My understanding is that since then they've implemented some leveling and started to hire more junior people, but I don't know the details really. Anyway, brief aside.
 
Michael_Berk:
Another... just interrupting one more time. Yeah.
 
Noah_Silbert:
Yeah, yeah, yeah. So, yeah, yeah.
 
Michael_Berk:
So I was just on a plane and I was listening to the hard things about hard things or something like that by Ben Horowitz. And he talked about two competing perspectives on leveling. One comes from Mark Andreessen and one comes from Mark Zuckerberg.
 
Noah_Silbert:
Okay.
 
Michael_Berk:
Mark Andreessen says that you should give people the fanciest title possible because it's free. So if they want to be senior vice president of data analytics when they're doing analyst work, go ahead. It's free.
 
Noah_Silbert:
Right. Right. Yeah.
 
Michael_Berk:
Mark Zuckerberg takes the opposite approach. He says that you should be given a title that promotes equality and immediately indoctrinates you into the leveling process at Meta. I'm probably butchering this overview, but what are your thoughts?
 
Noah_Silbert:
Sure.
Oh, that's a good question. That's a very good question. That's a good question as somebody who has been a senior data scientist at two different companies and has never held a different title. I mean, I think so. I think there's value in having a well-defined leveling system. So my guess is actually that it probably interacts with those two factors I said before about Tubi and the differences between Tubi and Netflix, the age of the company and the size of the company. So let me come back to that. So I think there's value in having levels and having them be reasonably well-defined and understood sort of how, so that people will know kind of how their career can progress within a company. But I also think you have, my guess is you have to be really, really careful to design that system, to avoid, I think, so to avoid a system in which What's the... I'm blanking on the name of it. So to avoid the problem in which the thing you're measuring, so the level ends up like not actually measuring what you want it to and people just chase the label. People wanna level up because they'll get a better paycheck, but it doesn't necessarily reflect actual growth or increased impact or increased value or anything like that.
 
Michael_Berk:
Right.
 
Noah_Silbert:
You don't want it to be just a superficial label. When people level up, I think both for their own growth and career progress, but also from the company's perspective, leveling up should reflect growth and increase impact and increase responsibilities, or, you know, again, however it's defined. And so my guess is it's probably really, really hard to do that well and design that well. And then, okay, so the reason I think it maybe interacts with the age and size of a company is I think it probably really young companies at startups and stuff. I don't know how much sense it makes to have levels because my understanding, like I've never worked at a tiny company or a startup, but my understanding is that anybody that's in a really early stage company kind of has to do whatever needs to be done like that, you know, there's a lot more sort of permeable boundaries between different job titles. And so. Yeah, I think it's harder. It's probably harder to have a well-defined leveling system and stick to it. Whereas, and I think, I mean, Netflix is an interesting example. Again, I don't really know anything about the leveling system, but it's interesting that they switched from a system without that to a system with that as the company grew and aged and got more mature. That they reached a point where they decided, oh, actually, we do need levels. And again, I don't know why. I have no idea why that decision was made or how it was made. But. Yeah.
 
Michael_Berk:
Yeah, you basically encapsulated the point in the book, which is: when you're small, it doesn't matter; when you're big, you have to start designing carefully.
 
Noah_Silbert:
Yeah, and it's so and that I mean that leads back to the question you asked which is you know Some of the differences between netflix and tubi. So, you know, I mentioned dse was about 300 people I think I mentioned it. So the whole everybody that did all the data work was 300 350 people something in that ballpark And the company as a whole was 6 000 um And you know when I interviewed for netflix, I got an on-site interview at facebook also and I went over to that campus and saw that and you know faith that The people that worked on the campus at Facebook at the time, it was something like 18,000 or 20,000 people. So it was pretty wild to see a company that was basically a small town. Like literally, you know, the campus, everything about it was like a small town. It was, I was just like, this is astonishing. And then to come to Tubi and the whole company when I got hired was about 300, 350 people. And that shift was totally fascinating to me. And then I had, there were a couple of things that happened. as I, you know, the first six months or so that I found kind of interesting in this regard too. So one of them was one of the stakeholders I worked with had been at Tubi since it was a company of, you know, 20 people or something like that. And then she moved on, she left, and I saw on LinkedIn that she was at a startup that had, you know, 15 or 20 people. And that helped kind of helped something kind of click in my head, which was, oh, yeah, like, the size and stage of a company is a totally valid thing to think carefully about and know about yourself. And like, yeah, I like working at companies that are like in this stage or this size. And know that, like, so exactly that issue of like, you know, I prefer a role where it's a little more amorphous and I get to kind of create what the roles even are. And I know that that means I'm gonna have to do, you know, a wide variety of things and maybe do stuff that isn't strictly speaking in underneath my, you know, under my purview. but that's a totally legitimate way to want to do your job. And that, you know, it's great like to know that about yourself and to know that you like that kind of role is super valuable. And so a big part of the difference is again, just like, and I can illustrate like another way to illustrate this is you mentioned earlier, and I talked some about it that my role at Netflix was working completely on data science in and around dubbed audio. So, you know, sometimes on the product side the consumption side, sometimes with production. I did some causal inference with observational data, but it was all dubbed audio. And then one thing that kind of blew my mind when I started at Tubi was I'm doing onboarding and our boss says to me, like, okay, we have this pod structure. It's kind of this structure where it's groupings of data scientists and kind of how they map onto stakeholders. And she said, so pick two or three that are interesting to you. And I was like, What do you mean pick two? Like I just get to pick which ones I join. And it's like, yep. And I'm like, there's not like, you don't just have one that you are going to tell me that you need me on. Like, nope, you get to pick which pods you work on. And I was like, that's wild. Like that degree of flexibility to pick and choose, essentially the content that I'm gonna be working on, the problems I'm gonna be working on. I mean, again, when I think back at it, it kind of blows my mind, but it's, you know. And so part of what that helped me realize is Oh, okay. 
I think I like working at companies that are Tubi's size more than Netflix's size. For me personally, it has been enormously helpful in growing my skill set, because I get to work on a variety of problems. So it's been fantastic for me. And yeah, it's just very, very different, and I think it's very closely related, again, to how long Tubi had even been around, how long they'd been doing data science, and just how big the company was as a whole. Because it's impossible to imagine that elsewhere. I think Netflix is much bigger now than it was when I was there, but even when I was there and it was 6,000 people, there's no way they could onboard people and say, hey, pick a team.
 
Noah_Silbert:
Pick a problem to work on.
 
Michael_Berk:
Yeah, yeah, that's crazy.
 
Noah_Silbert:
Like when they advertise roles, like it's, you know, the scope of it is something like, you're going to be working on dubbed audio. It's like, all right.
 
Michael_Berk:
Yeah. And I would imagine that academics are also pretty specific. Like, yes, you have freedom to explore areas, but you have your department, you have your colleagues. And likewise at Netflix, you were sort of pigeonholed into a specific role. How has this transition from structure to ambiguity been for you?
 
Noah_Silbert:
That's a good question. I mean, for me personally, the ambiguity, I think, was really helpful. That flexibility was really helpful in developing my skills and kind of figuring out what sorts of problems I do like to work on and where I can bring my strengths to bear on data science problems. So I've really enjoyed it. And it's interesting too, just still being at Tubi, things are sort of firming up a little bit. There's a little bit less flexibility to pick and choose. Like when we hire new data scientists now, the roles are much more constrained than when I got hired. I think I was one of the last to get hired when it was still that extremely high degree of flexibility, where it's literally like, yeah, pick your pods. All right. And again, I appreciate that. I've started working on some new stuff recently, I guess it's been a few months now, but I was able to shift just internally and shift my focus pretty dramatically. And I felt like I kind of got grandfathered in, like, okay, I'm still from the era of the ultimate flexibility at Tubi, which from my perspective is great. Yeah, I mean, academia is interesting because in theory, and this gets back to that first question you asked me a little bit, in theory you can work on whatever you want to work on, particularly once you get tenure. You can do whatever you want to work on. That's true to a degree. There is some pigeonholing, there are some expectations. But it gets back to that funding issue in some important ways, I think, which is that if you want to do research on a topic, you have to figure out a way to do it. You have to get the money to do it. And so even if you find something interesting and you want to work on it, that doesn't mean that you're gonna be able to work on it. And yeah, there are a lot of pressures to, I don't know if pigeonholing is exactly the right word, but that basic idea of narrowing down your focus and constraining what you can actually work on. And I mean, you know, even at Netflix, even with dubs, there were a lot of different sorts of things to work on. It was constrained in terms of the content area, but in terms of the data science skills and the data science applications and problems, there was still a pretty wide variety of things to work on there for sure.
 
Michael_Berk:
Did the tech stack and tooling differ?
 
Noah_Silbert:
Oh, yeah, for sure.
 
Michael_Berk:
Yes, sir.
 
Noah_Silbert:
That was an interesting one. I mean, Netflix had a very, when I was there again, like pretty advanced, like well-developed tech stack for sure. And it was, I mean, honestly, it was only when I got to Tubi and started to learn the tech stack there that I realized like, oh, okay, some of those tools, like clearly they had people working internally for years and building these kind of amazing tools. They had a thing, I don't even know if they still have it, they had a thing at the time called Big Data Portal. And it was just this interface that made it very easy to. write queries and access to data warehouse. I mean, one thing that I have found interesting about the distinction between Netflix and Tubi, I kind of didn't even know it at the time. And this will actually allow me to loop back to another part of your, I think, initial question. I didn't even know it at the time. Like I didn't, I think I got like a glimpse of a tiny corner of the data warehouse at Netflix. Like a teeny tiny corner.
 
Michael_Berk:
I'm sure.
 
Noah_Silbert:
I think at the time I didn't even really know what the term data warehouse meant. Like I, you know, it was just sort of like. And so to give a little more of my background, I have a lot of training in statistics and statistical modeling. I know you asked at the beginning about some of the challenges moving from academia to industry. One of the biggest challenges was not even knowing the things that I didn't know. And specifically, not really knowing what data engineering was, not really knowing kind of a lot of programming and computer science concepts and ideas, and particularly around cloud computing. And so in some ways, Netflix was probably a great place to land because I was insulated from a lot of it. I didn't have to know that much about data engineering. In retrospect, I would love to get another glimpse at the data warehouse and understand more about how it's structured just because. part of what's interesting and part of what, again, from my own skillset and my own interests, part of what's been great about working at Tubi is that, you know, there's a data engineering team that's separate from data science, but that overlap and the flexibility with kind of how much data engineering type work data science does is very different between Netflix and Tubi. And part of what that has allowed me to do is get better at that stuff and learn more about it. and understand, you know, it's like at 2B, I feel like I've got a much better handle on the overall structure of the data warehouse. You know, from day to day, I still only use bits and pieces of it for sure. But I have a much broader knowledge and much deeper knowledge of the data warehouse at 2B than I did at Netflix. And the tooling can, you know, the way the tooling is, is part of the reason in order to access data at 2B, you know, it's. there's not that sort of insulation between these layers. There's not just a sort of single convenient tool like this is how you access big data.
 
Michael_Berk:
Yeah. It's super funny you say that, because Noah and I basically have opposite backgrounds. I barely got through undergrad. I studied environmental science with a minor in data science, and all of my technical acumen came from side projects and building stuff. So I had very little formal training. Then I went into Tubi, and the hacking part was natural for me, but the statistical side, like, what is statistics, what is math, what is science? All these academic-based concepts, that was the learning curve for me. So that made me work on a bunch of side projects, including this blog where I broke down academic papers, and that sort of helped me get up to speed. Not the equivalent of a master's degree, but at least some of the topics became familiar and I could speak the language. But it's interesting moving from Tubi to Databricks. It's a whole reset into working for an infrastructure company, where I now have to work on Google Cloud, AWS, Azure; I have to be fluent in all of them. I have to be able to spin up Databricks instances and understand the underlying cloud infrastructure that's provisioning them. And it's a lot more software and data engineering heavy. So now a compute resource is no longer magic to me. It's a processor with a fixed size, and it's so cool to actually see what's going on under the hood. But it's interesting how your background can really influence what things you find valuable, because I find this Databricks experience really valuable. I feel like I can build things from scratch. But if I was put in Netflix working on dubs or anything linguistics related, I know I would do horribly, because you need years and years of training.
 
Noah_Silbert:
Nice. Yeah. It's been a very different experience, yeah.
 
Michael_Berk:
Yeah, exactly. It would just be a very steep learning curve. So not only size, but also the industry or the technical product that your company produces can really influence what you work on.
 
Noah_Silbert:
Absolutely. Yeah. And again, like kind of how roles are defined, right?
 
Michael_Berk:
Exactly.
 
Noah_Silbert:
What, what counts as data science at a company. And that's one of the things that's interesting and also sometimes kind of frustrating or bizarre about data science is that you can have exactly the same job title and from company to company, it's just a wildly different job. Um, yeah, I mean, it's, you know, another reflection of this sort of general pattern and talking about our backgrounds at one point in my, when I was at university of Cincinnati, I decided I should learn about machine learning. And so I got some books and I was reading about machine learning. And I even did some posters at some conferences and used some of these tools. Like I was, I built some like naive Bayes classifiers and I was doing some stuff with support vector machines. It's been a while since I thought about it. Yeah, support vector machines.
 
Michael_Berk:
SVMs, yeah.
 
Noah_Silbert:
You know, so I was learning about these models. And at the time I remember I was reading these books and implementing some of this stuff on the datasets that I had and the data I was collecting in my lab. And I remember at the time, and I even read some papers, there's a paper, it's called something like the, is it the three cultures of statistics and machine learning? Or the two, something like that. It's kind of just about different approaches to some of these same problems. And I remember thinking at the time, oh, okay, yeah, I get it. And I think I read a quote that said something like, machine learning is basically statistics done by computer scientists. And I remember at the time thinking like, oh, okay, yeah, I can see. It just sort of makes sense, right? Reading these books on machine learning. same basic ideas that I learned about taking statistics classes. And I should say too, I mean, my statistics training is, was atypical, I think, in a number of ways. The lab I worked in as a PhD student was called the Mathematical Psychology Lab. And so we dug into a lot of like low level details and did a lot of mathematical modeling. And I took a bunch of stats classes from... professors that did similar sort of like cognitive modeling and would build very sort of custom custom models of various cognitive processes. And so I learned a lot of statistical modeling in some, I think, pretty unusual contexts. So it wasn't just a lot of like regression modeling. I mean, I learned about regression modeling and t-tests and ANOVA and all that stuff too. But it was all simultaneously learning about, I mean, basically it was all in the context of cognitive modeling, but it was like building sort of bespoke models for particular sorts of tasks, trying to understand working memory, perception, decision-making under uncertainty, things like that. So it was a little bit of an odd array of classes. And that definitely informed, I think that's part of why when I first started reading about machine learning, I'm like, sure, like it's an eclectic array of problems and it's bringing to bear on those problems, a set of models and approaches to it that, you know, similar enough to the tools that I had learned that it felt very comfortable and I was able to start implementing some of this stuff on some of my problems. You know, like the naive Bayes classifier and SVM stuff I was doing, I was using those to classify essentially cochleograms that I had created, which is a, you take
 
Michael_Berk:
Yeah, what's that?
 
Noah_Silbert:
an auditory input and it's essentially a simulation of what the cochlea does in the inner ear. So it's kind of like a spectral analyzer, but it's... not just in linear frequency in hertz. It's because the way the cochlea is shaped and the way the cochlea works, so it maps frequencies under particular locations on the cochlea, but it doesn't just do it in the linear hertz scale. And I'd read a bunch of work on sort of computational models of how the auditory system works. So I was building these. And the general problem was I was interested in understanding how people perceive speech and noise. And so I would... take the stimuli that I had given to humans in these perception tasks, and I would run them through this system I built to make these cochleograms. And so essentially, you know, it turned into essentially like an image classification problem because I would have these arrays that represented energy at different points in time and at different frequencies or different cochlear frequencies. And then I would try to classify those and see what the models see what signals in the noisy speech the models these models could pick up on and see how that compared to how people behave on these tasks. Okay, so all of this is in the context. Like I read these books, learned about machine learning, thought, okay, this is cool. It's basically statistics done by computer scientists. You know, that's fine. And then one of the big challenges moving to industry was then realizing in industry, oh, it's not just that. Like you also have to have some good data engineering and you have to be able to get truly massive amounts of data into a form that is usable for your models. And not only that, but now you actually also need models that can be trained on these massive data sets. So the scale issue was just something that I never ever encountered in academia. And it's, for an academic, my data sets were pretty big. I mean, I would have people come in and do these long, boring experiments where they would give me thousands of responses to these noisy speech signals. And so, you get multiple subjects doing these experiments and I would have tens of thousands of data points for my experiments, which... I think for a lot of academic work, even in speech perception, it's a lot of data. By the standards of 2B or Netflix, that's nothing. It's a minuscule amount of data. These days, I think of machine learning more broadly, and I include in it essentially the data engineering and the pipeline building and the software engineering and all these other skills that... to do it at scale, you have to have those skills too. You can't just fit the sort of statistical model at the end of all that.
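
A minimal sketch of the kind of pipeline Noah describes, treating cochleagram-like time-frequency arrays as an image classification problem with an SVM. This is an editorial illustration on synthetic stand-in data, not the actual analysis; the array sizes, class structure, and noise levels are invented.

```python
# Hypothetical sketch: flatten cochleagram-like arrays and classify with an SVM.
# The "cochleagrams" are random stand-ins; a real pipeline would compute them
# from audio with an auditory filterbank.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
n_per_class, n_freq, n_time = 200, 64, 100      # made-up dimensions

X, y = [], []
for label in range(2):                           # e.g., two speech categories
    base = rng.normal(size=(n_freq, n_time))     # class-specific spectral pattern
    for _ in range(n_per_class):
        noisy = base + rng.normal(scale=2.0, size=(n_freq, n_time))
        X.append(noisy.ravel())                  # flatten the frequency x time grid
        y.append(label)
X, y = np.array(X), np.array(y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```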
 
Michael_Berk:
Yeah, it was a rude awakening coming to Databricks and realizing that the modeling process itself is probably like maximum 50% of the work,
 
Noah_Silbert:
Right.
 
Michael_Berk:
like maximum, maximum. And that's probably if you're building some deep learning model that's relatively complex to structure. If it's a statistical model, it's like 15%.
 
Noah_Silbert:
Right, right.
 
Michael_Berk:
Yeah, and the other thing... Sorry, go ahead.
 
Noah_Silbert:
You know, I was just gonna say, I've seen plenty of blog posts where people talk about how, as a data scientist or a machine learning engineer, a huge proportion of your time is spent just on cleaning the data and getting the data into a shape where you can use it.
 
Michael_Berk:
Exactly.
 
Noah_Silbert:
And it's absolutely true. And it's true for a good reason. With the massive data sets that we're working with, it's really hard to make sure that it's suitable, to make sure that it's clean enough. It's never gonna be perfect, but you want to make sure that it's clean enough that you feel like you can trust it and that it's worth even applying a model to.
 
Michael_Berk:
Yeah. The other thing that you mentioned that really hit home was academics thinking that a hundred thousand rows is a big data set. And I don't remember how many rows were in Tubi's core table. Do you know how many rows are in Tubi's core analytics platform? I think it was hundreds of billions, something like that, or at least trillions.
 
Noah_Silbert:
It's ludicrous. I mean, at least, yeah. I don't remember the exact number, but it is a truly ludicrous number.
 
Michael_Berk:
Yeah. So when you have this scale of data, and Tubi is still probably relatively small compared to something like Meta, how does that impact how you use data to make decisions in the organization?
 
Noah_Silbert:
That's a good question too. I think part of what it does, whether you're explicit about it or not, is provide a really strong incentive to make sure you really understand what kind of questions are being asked and what kind of decisions need to be made, because you need to make a well-informed decision about what data you even access, right? Where you're getting your data from. And I can give an illustrative example. Part of the onboarding process at Tubi... you mentioned that sort of root table. There's this one enormous table that's at the root of the data warehouse, and then, at least approximately, everything downstream of that is aggregated up to some level. So it's smaller, ultimately, because it's aggregated, whether with respect to time or groupings of users or whatever. So part of that decision-making process for data scientists is knowing where in the data warehouse you're going to write your queries. Like, what tables are you gonna access? And I remember somebody, probably you, probably multiple people, just telling me as I was onboarding: essentially, unless you absolutely have to use that root table, don't. Like,
 
Michael_Berk:
Yeah.
 
Noah_Silbert:
you know, use something downstream. And for most problems, you can use something downstream, but don't use that really, really big table unless you absolutely have to, just because it's so big. Like it's, and I mean, it's big, but it's also challenging to use because of its structure, just for other reasons. But...
 
Michael_Berk:
Yeah. Typically in the warehousing process, you start off with a variety of roots, or maybe one giant root table, and then you go through the ETL process of cleaning and aggregating. And cleaning and aggregating has a purpose; there's a reason why we do it. When you materialize those changes in intermediary steps, it's often a lot easier to work with those clean tables than recomputing from the beginning. But the worst part is when you have dependencies on, let's say, external teams or even yourself, where you have to create those intermediary tables for your project. That can be challenging. And then you do have to drop down to the trillion-row table and do aggregations from it. So it can be a mess.
 
Noah_Silbert:
Absolutely.
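
A hypothetical PySpark sketch of the advice above: answer the question from a downstream, pre-aggregated table whenever it carries what you need, and only fall back to the raw event-level root table when it doesn't. The table and column names are invented, not Tubi's or Netflix's actual schemas.

```python
# Hypothetical sketch: prefer an aggregated table; drop to the root table only
# when a needed field exists nowhere else. All names are invented.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Preferred: a daily, pre-aggregated table that is orders of magnitude smaller.
daily_views = (
    spark.table("analytics.daily_device_viewing")        # invented table name
         .where(F.col("ds").between("2023-01-01", "2023-01-31"))
         .groupBy("content_id")
         .agg(F.sum("view_seconds").alias("view_seconds"))
)

# Fallback: the raw event-level root table, only because the aggregated tables
# (in this made-up example) don't carry the per-event error code.
raw_errors = (
    spark.table("raw.playback_events")                    # invented table name
         .where(F.col("ds") == "2023-01-31")
         .where(F.col("error_code").isNotNull())
         .groupBy("content_id", "error_code")
         .count()
)

daily_views.show(5)
raw_errors.show(5)
```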
 
Noah_Silbert:
I remember at Netflix, there was one time where I was trying to, I was working on understanding abandonment rates. So understanding when people abandon a piece of content. And I was trying to figure out. I'm remembering correctly, it was something like understanding abandonment, but I wanted to only be looking at people that had, like, in some sense, really started watching something. And so I wanted to just make a histogram to look at view times. And I wanted to filter out, basically, people that... maybe accidentally clicked something or watched for just a few seconds. I wanted to get some sense of like, is there a threshold for cases that I can just sort of say, those weren't really watching it.
 
Michael_Berk:
Yeah, we're getting real conversions.
 
Noah_Silbert:
But
once I get past that, yeah, getting rid of those ones where it's sort of like, oh, I didn't mean to click that and then they X out. And I remember all I wanted was just like a super quick histogram just to see like, I expected, I'd seen it before and I expected like there'd probably a peak that's really short and then a much larger peak for people, you know, the distribution of sort of real view times. And I remember... pulling some data and it was too big for my local, like just inadvertently too much data for my machine to handle. Like, all right, so I put some filters on and it was still too big. And then I put some more filters on and it was just like this iterative process of like, come on, like I just want a histogram, right?
 
Michael_Berk:
Hahaha
 
Noah_Silbert:
And it's like, just like repeatedly accidentally getting too much data for my laptop to handle.
 
Michael_Berk:
Yeah.
 
Noah_Silbert:
And like just, the idea that that's even possible. Because again, you know, my academic stuff, that was never, there was never an issue of memory. you know, being able to even handle the data and like hold it in memory to make a histogram. But it was exactly the opposite problem in Netflix because it was like I had to work to avoid that problem. You know, and I'm sure
 
Michael_Berk:
Yeah.
 
Noah_Silbert:
I was like, you know, really? Oh, okay, yeah, I should filter down to some subset or whatever. I don't even remember the filters, but it was very much a "come on, this should be... I should have been done with this five minutes ago" situation.
 
Michael_Berk:
Yeah.
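
One way around the "too much data for my laptop" problem in that story is to do the binning inside the warehouse and only pull back the bin counts. A hypothetical sketch, with invented table and column names:

```python
# Hypothetical sketch: build a view-time histogram without collecting raw rows.
# Bin durations on the cluster, then collect only the tiny table of bin counts.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
bin_width = 30  # seconds per histogram bin

hist = (
    spark.table("analytics.playback_sessions")            # invented table name
         .where(F.col("view_seconds") > 0)
         .withColumn("bin", F.floor(F.col("view_seconds") / bin_width))
         .groupBy("bin")
         .count()
         .orderBy("bin")
         .collect()                                        # small result set
)

for row in hist[:20]:
    lo = row["bin"] * bin_width
    print(f"{lo:6d}-{lo + bin_width:6d} s: {row['count']}")
```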
 
Michael_Berk:
Yeah. Working with that table is a mess. The way we solved it at Tubi, and quote unquote solved, was we sampled every thousandth device and have a sort of sample table that you prototype on. And then you throw it over to a giant compute cluster that will actually do the big job once it's clean. But yeah, it's a mess,
 
Michael_Berk:
But I wanted to touch... sorry, go ahead.
 
Noah_Silbert:
And those, I mean... oh yeah, I was just gonna say, like those. So that general strategy, but then also developing little tips and tricks on your own to implement your own versions of that sort of thing, like prototyping and filtering things down so that you can get a system that works, has been just an invaluable skill set when working with these giant data sets, for sure.
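
A hypothetical sketch of the sample-every-thousandth-device trick Michael mentions: hash the device id so the same roughly 0.1% of devices land in the sample every time, prototype there, then rerun the identical logic on the full table once it works. Names are invented and the real implementation may differ.

```python
# Hypothetical sketch: deterministic 1-in-1,000 device sample for prototyping.
# Hashing the device id keeps the sample stable across days and across tables.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.table("raw.playback_events")                # invented table name

# Keep a device if its hashed id falls in the 0th of 1,000 buckets.
sampled = events.where((F.abs(F.hash("device_id")) % 1000) == 0)

# Prototype the analysis on ~0.1% of devices...
by_day = sampled.groupBy("ds").agg(F.countDistinct("device_id").alias("devices"))
by_day.show(10)

# ...then rerun the same logic against `events` on a large cluster once the
# query is debugged.
```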
 
Michael_Berk:
Yeah. I wanted to touch on something that we were discussing before we started recording, which was essentially how decisions are made with data depending
 
Noah_Silbert:
Mm-hmm.
 
Michael_Berk:
upon the size of the data. So I've been doing volunteer data science for this education nonprofit for about three years now, and their biggest table is about 10 million rows, and that got added about six months ago. Prior to that, their biggest table was something like 50,000 or a hundred thousand rows. So we only work with base tables, and from that you can do lots of really clean analysis really fast and sort of move quickly and break things. But once you get bigger, let's say you take the company public, you start getting shareholders, there's now tens of thousands of people working under you, the process for decision-making changes dramatically.
 
Noah_Silbert:
Hmm.
 
Michael_Berk:
So thinking about the differences between Netflix and Tubi, how is data used differently when making organization-level decisions?
 
Noah_Silbert:
Geez, I'm not even sure if I know how to answer that. I mean, I think, you know, I guess one thought that comes to mind is. how much further along Netflix was with their experimentation system. You know, I had one experience that was interesting. And I should say, I mean, I think Tubi's experimentation system is excellent. And I
 
Michael_Berk:
Really, really good.
 
Noah_Silbert:
think
for the size of Tubi, it's actually kind of like way ahead of where it... So I went to a workshop a year or so ago and an experimentation, and it was really interesting hearing some presentations from people at different companies and realizing, oh, hey, Tubi's experimentation system is... very advanced for a company of 2B size and age. So just to be clear, I think 2B is really impressive. But when I was at Netflix, like, you know, again, they just had a multi-year head start. Part of what that meant at Netflix, I mean, I had a very interesting experience one time where an engineer got in touch with me and an analytics engineer on my team, asking for some help with some segmentation and kind of post-hoc analysis on a dataset. and on the output of an experiment. And part of what was interesting about it was just realizing, oh, yeah, they didn't need us at all to run an experiment, to make decisions around the experiment. It was so automated that the people on that particular product team could do every bit of the process and get to the conclusions and make decisions. And it was probably not fully automated, but it was as close to fully automated as it could be, I think. No, I mean, I think that's probably true to be in a lot of cases too. Like lots of product decisions, I think are similarly automated where product teams can implement some new feature, you know, put together an experiment and run it in our pipeline and our system works well enough that they can do that. But the, the scale of it and the variety of experiments that can, that could be run at Netflix when I was there with that kind of automation was, was fairly impressive to me. And I think that's one of the main ways in which this data is is brought to bear in decision making at either company. I mean, that's the main thought that comes to mind is just thinking about like experimentation systems. I guess the other thought that comes to mind is just when... When you can't run an experiment or when you run an experiment and you want some post-hoc analyses, I mean, then you get into a lot of tricky issues around sort of who should spend time on it, how much time should they spend and what data should you even be looking at? You know, you can do all sorts of sort of observational studies, quote unquote, looking at historical data and maybe building causal inference models or using causal inference techniques to try to inform decision-making. But... And my guess is that those sorts of things are probably pretty similar at Tubi and Netflix. Part of which I mean by that is, that was a weird sentence. Part of what I mean by that is that those are really challenging problems. And... Yeah, it's hard to know how solid your assumptions are with a lot of those techniques. And that I think makes it hard to know how confident you should be in making a data-driven decision based on, you know, some causal inference technique.
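
As one concrete illustration of the observational causal-inference techniques mentioned here, the sketch below estimates a treatment effect with inverse propensity weighting on synthetic data. It is an editorial example, not anything from Tubi or Netflix, and, as Noah notes, its usefulness hinges on assumptions (such as no unmeasured confounding) that are hard to verify in practice.

```python
# Hypothetical sketch of inverse propensity weighting (IPW) on synthetic data.
# A confounder drives both "treatment" and the outcome, so the naive difference
# in means is biased; reweighting by estimated propensities removes the bias
# only if the confounder is actually observed and modeled.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 50_000

engagement = rng.normal(size=n)                      # confounder
p_treat = 1 / (1 + np.exp(-0.8 * engagement))        # treatment depends on it
treated = rng.random(n) < p_treat
watch_time = 10 + 3.0 * engagement + 2.0 * treated + rng.normal(scale=2.0, size=n)

# Naive difference in means is confounded (overstates the true effect of 2.0).
naive = watch_time[treated].mean() - watch_time[~treated].mean()

# IPW: model the propensity to be treated, then reweight each group.
propensity = LogisticRegression().fit(engagement.reshape(-1, 1), treated)
p_hat = propensity.predict_proba(engagement.reshape(-1, 1))[:, 1]
weights = np.where(treated, 1.0 / p_hat, 1.0 / (1.0 - p_hat))
ipw = (np.average(watch_time[treated], weights=weights[treated])
       - np.average(watch_time[~treated], weights=weights[~treated]))

print(f"naive estimate: {naive:.2f}, IPW estimate: {ipw:.2f}, true effect: 2.00")
```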
 
Michael_Berk:
Yeah. Being data-driven is something that everybody advertises, but it's actually really, really hard to do well, because I think the biggest issue is confounding and isolating causal relationships.
 
Noah_Silbert:
All right. Yeah, that's kind
of what, right, that was sort of behind what I was saying about causal inference. I think that's a huge part of it for sure. I think part of it too is, I'm not even sure how to articulate it exactly. I think part of it is... Sometimes it's, so part of it is just communication, I think. So really understanding exactly what decision somebody needs to make so that you can even figure out if there's data that can be brought to bear directly on that decision. Like that can be a remarkably challenging problem. Yeah, like just. Yeah, understanding. And it's, I think, some decisions. I don't want to say I can't have data brought to bear on them, but some decisions
 
Michael_Berk:
Right.
 
Noah_Silbert:
are pretty far removed from data, I think.
 
Michael_Berk:
I would say they can't. I think there are plenty of decisions where you have to use sort of soft, touchy-feely intuition and can't bring the hard numbers, especially in the classic case where there's just no historical data. Even if you just haven't implemented tracking for a given analytics event yet, there just isn't data for it. So you have to sort of use intuition.
 
Noah_Silbert:
Oh sure, yeah.
Yeah, that's fair. So yeah, I think there are cases where it's pretty straightforward to see how data relates to a decision-making. I think a lot of product decisions, particularly around A-B testing, the decisions to graduate a new feature, I think, are pretty close to the data. And you can make pretty solid decisions, or even very solid decisions, based on experiment results. And so those are cases, again, where you can have a pretty scaled up automated system for bringing data to bear on those decisions. But yeah, I've been working a lot on various forecasting problems and prediction problems recently. And it's interesting to talk to people that make some of the decisions that rely on forecasts. And yeah, the decisions are still clearly related to data, but you're also, the data are guesses.
 
Michael_Berk:
Yeah.
 
Noah_Silbert:
Sometimes they're pretty good guesses, but in some important sense they are guesses based on models. And you're not actually going to know how good the guesses are for days, weeks, months, maybe years.
 
Michael_Berk:
Yeah.
 
Noah_Silbert:
So that's a challenging case where the decision is pretty closely related to the data, but the data itself is sort of hypothetical or the output of a model.
 
Michael_Berk:
Yeah. Well, we could keep going for another three hours; we've done it before. So I'm going to wrap, and then we'll hand it over to you for any next steps. There were a lot of really interesting points in this conversation, and I just wanted to highlight a couple. First, if you don't want to apply for grants, you can enter industry and get a research position, but that transition isn't always the smoothest. One issue is not knowing what you don't know. In academics, you have a well-defined sandbox where you work on whatever your area of specialty is, but typically in industry, especially at smaller companies, you need to know data engineering, cloud computing, computer science, the whole nine yards. So it can be really hard to make this transition from specialization to generalization. Another thing is handling that ambiguity: you're not necessarily pigeonholed into a specific role, so you often have to wear different hats. This brings us to the company size aspect. Smaller companies may not need to define titles as carefully, but typically as you get larger and larger, it's important to define these titles. Smaller companies also have a lot more diversity of workload, and as we mentioned earlier, you wear a lot of hats. One tip is that you can't work with base tables after your company exceeds, let's say, 100 people; it's just not possible, don't do it. And one thing that is really interesting when thinking about how a company scales is where you start to automate processes: what you think people should be doing versus what can be a repeatable process run by a computer. So Noah, if people want to reach out, learn more about you or about Tubi data science, where should they go?
 
Noah_Silbert:
Um, where should they go? I don't have much of a social media presence or anything like that these days. I mean, I'm on LinkedIn. That might be the best place. Feel free to reach out for a connection there. Send a message.
 
Michael_Berk:
Cool. All right, well, until next time, it's been Michael Berk and my good friend Noah Silbert. See you next time.
 
Noah_Silbert:
Thanks.