A/B Testing with ML ft. Michael Berk - ML 181
Michael Berk joins the adventure to discuss how he uses Machine Learning within the context of A/B testing features within applications and how to know when you have a viable test option for your setup.
Show Notes
Links
Picks
- Ben- David Thorne Books
- Charles- Shadow Hunters
- Michael- Stuart Russell
Transcript
Hey everybody. And welcome back to another episode of Adventures in Machine Learning. This week on our panel, we have Ben Wilson. Hello. I'm Charles Max Wood from devchat.tv. That rebrand is gonna kick my butt for a while.
Top end devs. Top end devs, folks. Come check us out for our prelaunch special. We're here with our special guest, and that is Michael Berk. Michael, do you wanna introduce yourself?
Let people know who you are, why you're important. Sure. I'd love to. As Charles said, my name is Michael Berk. I have been working as a data scientist full time for about a year and a half now and then part time for a couple years before that.
Currently, I work at a company called Tubi. It's similar to Hulu, and we provide movies and TV shows that are ad supported. And they're, like, quick hitters. It's, like, a 4 or 500 person org, and we have the largest content library out of any movie and TV streaming service. Oh, cool.
And what do you do for them? Data. I work on 2 pods mainly. The first is sort of AB testing infrastructure, and the second is ads configuration. So on the ad side, it's more decision science and data driven ad configuration.
On the experimentation side, it's optimizing the experimentation pipeline and then building out infrastructure. Gotcha. And, before the show, we were talking about some stuff around AB testing and how you're using machine learning there. Do you wanna just kind of explain how that's all set up? Because it sounds really fascinating to me.
And just to give a little bit of background, I've done AB testing, but I usually, like, hook it up to an analytics engine and say, that number is bigger than this number, which doesn't require machine learning, by the way. And then I pick the one with the bigger number. So, yeah, what are you doing different from that? Yeah. So we subscribe to frequentist experimentation.
The other main area is Bayesian experimentation, and they have their pros and cons. But we personally subscribe to frequentist. And so, roughly, what an AB test is, it's a randomized controlled trial with a control group and n number of treatments. And then what you look to do is compare whether the treatment has a statistically significant lift relative to the control. And if that's the case, then we know that there's an actual change.
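(For readers following along at home, here is a minimal sketch of the frequentist comparison being described, using statsmodels' two-proportion z-test; the metric and all counts are invented for illustration.)

```python
# Minimal sketch of a frequentist A/B comparison on a conversion-style metric.
# The counts below are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [1180, 1100]      # successes in treatment, control
exposures = [50_000, 50_000]    # users per arm

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
lift = conversions[0] / exposures[0] - conversions[1] / exposures[1]

print(f"absolute lift: {lift:.4%}, p-value: {p_value:.4f}")
# If p_value < alpha (e.g. 0.05), the lift is treated as statistically significant.
```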
And if it's a positive lift, then it's a good change depending on the metric. If it's a negative lift, it's usually a bad change. And then from there, we can decide whether to graduate the experiment or not. So when you're looking at stuff like evaluating changes, when you have your control group, how do you determine how many users would have to be in that in order to extract the signal? Right.
So there are 2 ways you can go about this. The first is pull a number out of your hat and call it that, and that is sometimes implemented. The second is actually That's what everybody does. Yeah. Yeah.
For sure. The second involves something called power calculations. And we at Tubi have something called a sample size calculator that just uses a, like, front end dashboard that allows you to put in some parameters. And then on the back end, it uses a Python library that solves power calculations. So you essentially put in your type 1 error, your type 2 error rate, and a couple of other things like minimum detectable effect, and that will output a number of unique units that you would need.
In this case, users, devices, whatever it is. And then from there, you can set an experiment duration depending upon how many users you get per week or per month or per day. So the other aspect of the power calculation is the detectable quantity of difference between your subpopulations, but that's predicated on what your sensitivity is. So a lot of times in data science, we deal with alphas of 0.05. Like, hey, 95% certain that this is gonna happen.
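(A rough sketch of the kind of power calculation described here, using statsmodels' power solver; the baseline rate, minimum detectable effect, alpha, and power below are assumed example values, not Tubi's actual settings.)

```python
# Sketch of a sample-size (power) calculation: type I error, power (1 - type II
# error), and a minimum detectable effect go in; required users per group comes out.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10      # assumed baseline conversion rate
mde_relative = 0.03       # assumed 3% relative minimum detectable effect
effect_size = proportion_effectsize(baseline_rate * (1 + mde_relative), baseline_rate)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,            # type I error rate
    power=0.80,            # 1 - type II error rate
    alternative="two-sided",
)
print(f"required users per group: {n_per_group:,.0f}")
```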
And Exactly. When you do a power calculation, one of your variables in there is setting alpha. It's saying, how certain do I wanna be? Can you explain to people what the relationship is between your sample variance and that certainty? Like, if we're measuring something super noisy, how big of a sample do we actually need?
Right. So the simplest calculation for sample size, I found this derived in some YouTube video somewhere, and it changed the game for me. But it's roughly 16 times the variance over something. I forget exactly what. But in the numerator is the variance, and so the larger the variance, the larger the sample size required.
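(The rule of thumb being recalled is the standard two-sample approximation at alpha = 0.05 and 80% power: roughly sixteen times the variance divided by the squared effect size, per group. A tiny sketch:)

```python
# Rule of thumb: n per group ~= 16 * variance / delta^2 at alpha=0.05, power=0.8.
# It comes from n = 2 * (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2 ~= 15.7 * sigma^2 / delta^2.
def sample_size_rule_of_thumb(variance: float, effect_size: float) -> float:
    return 16 * variance / effect_size ** 2

# Doubling the variance doubles the required sample size;
# halving the detectable effect quadruples it.
print(sample_size_rule_of_thumb(variance=4.0, effect_size=0.5))   # 256.0
print(sample_size_rule_of_thumb(variance=8.0, effect_size=0.5))   # 512.0
print(sample_size_rule_of_thumb(variance=4.0, effect_size=0.25))  # 1024.0
```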
And so that actually is a pretty good segue into variance reduction techniques. But long story short, the larger the variance, the larger the sample size, by a good amount. And it's really powerful if you can thereby reduce the variance using information from prior to the experiment, and that can really cut down sample size. So we do something like that as well. That was one of the initial projects that I worked on.
Every major tech company that I know of does it, like the Googles, the Facebooks, the Netflixes. Mhmm. And we've stolen some of their ideas, and they were great. So yeah. So variance reduction, you're talking about subsampling of your populations to eliminate latent factors that contribute to variance.
So that is an interesting point, and that is an option. But another route is essentially forecasting variance using pre-experiment data. Mhmm. And because it's pre-experiment data, there is no confounding within the experiment. A very simple method is called CUPED, c u p e d, which stands for controlled-experiment using pre-experiment data.
And that uses simple, univariate linear regression. But as you can imagine, if you have the tech stack for it, you can throw all sorts of crazy ML at it and just forecast out variance because it's pre-experiment data. So when we randomize, those confounders are evenly split between control and treatment Yep. And we're good to go. So you could use more sophisticated models, not just Yeah.
A simple linear regression. You could say, I wanna use Holt-Winters exponential smoothing, or I wanna use ARIMA or SARIMAX. Or if you don't wanna go through the trouble of dealing with the statsmodels package, you could use Prophet or Greykite. And you can get that predicted value, forecasted out with seasonality factors incorporated into it. That's pretty fascinating.
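(A minimal sketch of the CUPED adjustment being discussed, on synthetic data; the pre-experiment value of the same metric is used as the covariate here, though as noted above, an ML forecast could play the same role.)

```python
# CUPED: use pre-experiment data to strip predictable variance out of the
# experiment metric. theta is the OLS slope of the metric on the covariate.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
pre = rng.gamma(shape=2.0, scale=5.0, size=n)    # pre-experiment watch time (synthetic)
post = 0.8 * pre + rng.normal(0, 5, size=n)      # in-experiment watch time (synthetic)

theta = np.cov(post, pre)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print(f"variance before: {post.var():.1f}, after CUPED: {post_cuped.var():.1f}")
# The mean is unchanged in expectation, so lift estimates stay unbiased,
# but the reduced variance shrinks the required sample size.
```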
Exactly. So Yeah. I'm gonna chime in and ask my noob questions here. So when you're talking about variance, what you're saying is that the numbers that you're trying to predict could be all over the place. Mhmm.
And so what you're trying to do is you're trying to reduce whatever effects are making it go all over the place that don't actually matter to your experiments, so that you can get a result that you can rely on with some level of certainty. Is that what you're talking about? Did I or did I completely miss the boat? You got on the boat and are sailing pretty. Yep.
Okay. So at this point then, what you're saying is you look at your data from before the experiment because you haven't changed anything yet. And you figure out, okay, well, we can ignore some of these effects because they're consistent through the entire thing. And you forecast what our future state would be. So you're, like, hey, we have 2 years of data.
We are gonna run our experiment for 3 weeks or 4 weeks. Let's predict out what the next 3 or 4 weeks of variance is, and then we can strip that out from or manipulate the actual test data to reduce that variance. Right. And that way, we can get a clear answer because we know that this much of the all over the place is stuff that we can reasonably predict is in there anyway. Exactly.
Okay. We're just isolating signal from a bunch of noise. Right. Alright. I'm just making sure I get it, and I know that there's some people that listen to the show that are probably also trying to follow along.
Alright. Go ahead. Sorry. Yeah. And the other aspect of what Michael was talking about with the variance component, for people that are familiar with time series forecasting, one of the first things that you do when you're looking at a time series data set is you decompose that series into your trend, your seasonality, and your residuals.
And those residuals are what drives that variance so much. And it's that noise. And if you can't reduce that noise when you're doing your test, that translates to, oh, jeez, I need to collect more data. And that means money. Right.
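(For reference, the decomposition Ben describes is available directly in statsmodels; the daily series below is synthetic.)

```python
# Decompose a daily metric into trend + weekly seasonality + residuals;
# the residual variance is the "noise" that inflates required sample sizes.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(1)
days = pd.date_range("2023-01-01", periods=365, freq="D")
y = (100 + 0.05 * np.arange(365)                    # slow upward trend
     + 10 * np.sin(2 * np.pi * np.arange(365) / 7)  # weekly seasonality
     + rng.normal(0, 3, 365))                       # residual noise
series = pd.Series(y, index=days)

result = seasonal_decompose(series, model="additive", period=7)
print(f"residual variance: {result.resid.var():.2f}")
```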
Yeah. I guess that was the other thing that you were saying, right, is that your your variance is in the numerator. Right? And so you have to collect that much more data based on, yeah, how noisy your data is. And so if you can account for it and not have to include it in that number, then you can collect less data.
That's what you're saying then. Mhmm. Definitely. Alright. Yeah.
And by the way, I remember now: the numerator is 16 times the variance, and it's over the effect size. Okay. So whatever our minimum detectable effect of interest is, that would be in the denominator. Okay. So when you talk about the minimum detectable impact, who usually sets that?
That that's a great question. That's a trick question. You you get a Yeah. A thousand different answers if you talk to a thousand different data scientists. But, really, there's one answer at the end of the day.
Well, if there's one answer, enlighten us. I mean, it's always the business. Right? Exactly. So it's always Yeah.
Whoever is controlling that product, they're the ones that are saying, hey. We're only gonna pull the trigger on this new feature if we hit a 3% lift. Yep. Exactly. And there are a couple. Just thinking about it from an entire business perspective, one interesting angle to take is, let's say you're the machine learning team and you're looking to maximize a single metric over a year's time.
The smaller your minimum detectable effect, the larger your required sample size and the longer your iteration cycle would take. But if you have, let's say, a much larger minimum detectable effect, you can iterate a lot more quickly, spit out a bunch of different ML models, try them really frequently, and then see if that leads to a greater lift over time. So something that is actually really useful is to look at the empirical lifts over, let's say, a year period for your team and see, if you simulate just, like, random draws from that, what would be the optimum minimum detectable effect. And that can be a good way to empirically choose that number. That's a pretty cool tip for people out there, using analytics on your historical data to inform the business to make a decision that makes sense.
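(A crude sketch of the simulation Michael describes: sample from a team's historical lift distribution and compare candidate minimum detectable effects by yearly throughput. All numbers are invented, and the graduation rule is deliberately simplistic.)

```python
# For each candidate MDE: smaller MDE -> bigger sample -> fewer, slower experiments;
# larger MDE -> more experiments per year, but small wins go undetected.
# Lifts and MDEs are treated in the same units of a normalized metric.
import numpy as np

rng = np.random.default_rng(2)
historical_lifts = rng.normal(loc=0.005, scale=0.02, size=200)  # invented past lifts
weekly_users = 50_000
metric_variance = 1.0   # variance of the (normalized) metric

for mde in [0.01, 0.02, 0.03, 0.05]:
    n_per_group = 16 * metric_variance / mde ** 2
    weeks_per_experiment = max(1.0, 2 * n_per_group / weekly_users)
    experiments_per_year = 52 / weeks_per_experiment
    # Crude rule: only "graduate" draws whose lift clears the MDE.
    draws = rng.choice(historical_lifts, size=int(experiments_per_year))
    shipped_lift = draws[draws >= mde].sum()
    print(f"MDE {mde:.0%}: ~{experiments_per_year:.0f} experiments/yr, "
          f"total shipped lift ~ {shipped_lift:.1%}")
```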
So a lot of times when you're talking about a new feature where, you know, the company that you work for, you're direct B2C. You're a company that interacts with individual account users and, of course, your, you know, your ad revenue groups that are putting ads on your platform. And if you're not demonstrating some new feature that is definitely gonna raise the bar on lift, I'm like, hey.
We wanna increase viewership. We wanna, you know, increase the amount of time that people spend on our platform looking at ads and hopefully buying products from them so that we get more ad revenue. If you're not really moving the needle fast there or iterating quickly, then it's gonna be a tough sell to the business to justify what your experiment is. Yeah. For sure.
Yeah. I think I think of this in terms of because, you know, I've been a freelancer. I've I've worked with companies where, yeah, I've had to go and basically pitch ideas to the business. And what they wanna see is they wanna see how much money can this potentially make us and how risky it is. And so that's where these numbers come in, is yeah.
We have a very high probability, very low risk way of increasing our whatever, which increases our income. Yeah. For sure. So if you were to talk to people in different industries that aren't where you work right now, like, not in the streaming video service, somebody that works for, like, an engineering firm or manufacturing or something, and they've never heard of this concept, how would you break this down to them, convincing them how important it is to use these techniques on everything that they do? AB testing or variance reduction and those things?
AB testing and attribution. Got it. So the main selling point of AB testing, or randomized controlled trials, is that they allow us to determine causality. And there are very few other methods with which you can prove causality. So, theoretically, you could build a model that has all relevant variables in it.
And then from that, you can isolate specific impacts of x or y or whatever it is. But with randomized controlled trials, the key thing is that with a large enough sample size leveraging the central limit theorem, we know that confounders will be equally distributed on average between the treatment and the control. Mhmm. So the only difference between treatment and control is the treatment effect. And so that allows you to essentially determine slash prove causality.
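(A quick synthetic demonstration of that point: under random assignment, even a confounder you never measured ends up balanced between arms as the sample grows.)

```python
# With random assignment, an unmeasured confounder (here, "tenure")
# averages out between control and treatment as n grows.
import numpy as np

rng = np.random.default_rng(3)
for n in [100, 10_000, 1_000_000]:
    tenure = rng.exponential(scale=200.0, size=n)    # hidden confounder
    assignment = rng.integers(0, 2, size=n)          # 0 = control, 1 = treatment
    gap = abs(tenure[assignment == 1].mean() - tenure[assignment == 0].mean())
    print(f"n={n:>9,}: mean tenure gap between arms = {gap:.2f}")
```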
And I think it's definitely the most robust causal framework, and there are other ones that you can argue also prove causality, but, I mean, that's a whole can of worms. I mean, causality modeling in general is something that not a lot of sort of traditional data science people actually deal with that often, about creating simulations and running it through millions of iterations to say, hey, I'm gonna change this thing. How does this impact the output result of this? And it's a highly esoteric realm of data science. There's companies and groups within companies that focus on that entirely.
That's all they do. They don't do supervised or unsupervised learning. But for the other 98% of ML practitioners in the world, nobody ever touches that stuff. And exactly as you said, if you wanna prove that your model is doing something, the only way to prove it is run the test, collect the data, and evaluate it. But what would you say if somebody were to run an AB test and get that causal impact of, hey.
We had a 10% lift on our test versus control, and then they go back and look at feature importances of the model and say, oh, it's this feature that caused this. So I personally don't have a problem with it. I mean, there's a big difference between determining graduation criteria and the why. Mhmm. So graduation criteria needs to be set prior to the experiment.
And so you say, if we see a lift of greater than 2% on this metric, then we graduate. Else, we don't. And then after that, if you wanna go digging and figure out why this happened, I'm very supportive because, a, it tells a pretty story to whoever's listening, and then, b, it also allows for ideation for subsequent experiments. But the key thing is that stat sig will only apply to that initial metric, and everything else will be directional and not necessarily statistically significant. Exactly.
That directionality is important. You can use that analysis to inform further iterations to prove further, you know, relationships of causality, but you can't infer causality from your features because of interaction effects and other confounding things that could happen. The only thing that I can actually prove is not running the model versus running the model result, and here's what we observed. So that's a great Yeah. Way to put that. So I'm kinda curious.
I mean, oh, is there an instance where you've put something together, you know, you ran your experiment, and it, like, really paid off? Yeah. So one of the things that I was hinting at earlier about looking at the empirical distribution of your experiment lifts is that more often than not, a lot of the experiments tend to follow the 80 20 rule. So we get 80% of the lift from 20% of the results. That's, like, a Pareto something number.
But that tends to be the case in a lot of things in life and stocks and experiments as well. And so there's definitely cases where we have big winners that drive the vast majority of the lift. I I personally am starting to run some ads experiments, so we're looking to figure out how to structure ads. So for instance, I'm sure you guys have seen plenty of YouTube or Hulu or you name it. We can structure ads so that they are shorter breaks but more frequent or longer breaks, and there's more space in between each break.
So that's something that we're currently playing with. And, obviously, if you add one ad per break, you increase revenue by, like, 20% or 30% just with that. So there's a lot of movement that we can do in terms of revenue versus engagement. And that's something that we've been playing around with for the past couple years, and we're starting to really make strides in. Have you ever gotten an idea for an implementation from somewhere in management, and they say, hey, we really wanna run this test, and this is definitely gonna change things.
And people are adamant that this is gonna be a game changer. You run the test. You evaluate the results based on your power calculation saying, we should have significance within, say, 2 weeks based on what people are projecting. And you see no real difference. For sure.
Tubi is a very data driven company, so we're licensed to push back when execs say, implement this. And we can say, well, at least let's run an experiment. And the nice thing about experiments is they democratize truth. And you can have the greatest opinion in the world. You can be the highest paid person in the room.
But if the experiment shows no lift or it's neutral, doesn't matter. I don't know. I'm pretty smart. Yeah. Well, other than Charles.
Well, that's what my mother told me anyway. I believe it. So if you're gonna sell this concept of data driven decisions, and it is something that certain successful companies in the tech space have been doing since their inception. Like, a lot of companies are founded on these principles of we welcome all ideas from whoever they are, whether it's the janitor cleaning the bathrooms or it's the CEO. Everybody's ideas are weighed the same.
We let the data decide. But not all companies are run that way, even if they should be. So yeah. What would you say to a company that's like, We know our customers. We know our business.
We can just turn this thing on. How would you defend, yeah, the inverse of that? Yeah. I'm sure you guys are more qualified to answer this than me.
But I take a sort of different approach in that AB testing is the gold standard, and it is not always necessary. It is definitely helpful and important. And if you have a larger company with the infrastructure to do it, it is completely necessary. But oftentimes, in smaller companies, for example, I'm working with a tutoring nonprofit, helping them with their data tech stack. It's called Learn To Be, and they provide free online tutoring to underserved kids, so homeless kids, foster kids. They're just too small to run any type of experiment.
They don't have enough sample size. And so the highest paid person's opinion does win. But, theoretically, as you get to a larger state in your company, you do need to run experiments because results tend to be pretty counterintuitive, and even the best PMs don't have a 100% hit rate with features. So Yeah. But it does occur to me that, sure, you don't have enough data to get to statistical significance, or you don't have enough data to build a model and run some robust machine learning against something.
But you do have enough data, generally, to say, hey, we're gonna try this, because most folks can set up analytics and measure stuff. Right? And just say, hey, we're gonna see over the next 2 weeks if the number goes up. Yeah. Exactly.
So those are called pre and post methods. Uh-huh. And we definitely, so, creating an AB testing infrastructure is a good amount of work. You have to randomize at usually the user level. But there are other alternatives, such as pre and post methods that just look at pre intervention versus post intervention to see if there's a lift.
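(A minimal sketch of a pre/post check, assuming a paired t-test on the same users' metric before and after a change; unlike a randomized test, this can't rule out time-based confounders like seasonality.)

```python
# Pre/post comparison: the same users' metric before vs. after an intervention.
# Cheaper than a randomized test, but time-based confounders are not controlled.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
pre = rng.normal(loc=30.0, scale=8.0, size=500)     # e.g. minutes tutored per week (synthetic)
post = pre + rng.normal(loc=1.5, scale=6.0, size=500)

t_stat, p_value = stats.ttest_rel(post, pre)
print(f"mean change: {np.mean(post - pre):.2f}, p-value: {p_value:.4f}")
```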
Yeah. And, yeah, those can be effective in smaller companies. Right. And then as you grow, then you can start saying, hey. Now we're reaching statistical significance, or now we're getting enough data to where, yeah, we can throw some more robust model at this and say, alright, what are we actually seeing here?
Yeah. And during that journey, you're graduating along the path. I mean, even if you're a single person startup, you should be collecting that data, like, the results of what you're doing. For sure. And as you grow, there's a long journey between just collecting data and manually analyzing it, and then going all the way to, hey, we have an automated AB framework that is part of our CICD where we set the conditions of this experiment saying we need a 10% control and four 10%, you know, slices of these different test groups, set up my experiment for me, do dynamic traffic allocation, and maintain state, and then collect all the data and automatically, you know, analyze it through windowed aggregations over a period.
There's a huge journey there. And that end state, there's not a lot of companies that have that set up to the point where it's sort of autonomous and it just runs as a service. I have seen them, and they're they're pretty impressive. But it's years of work to build one of those. Yep.
Yeah. But even during that journey, you can make steps. And one of those early steps is exactly what Michael was just saying. He's describing that, like, hey, figure out your sample size, and do your first simple control versus test on this, collect it for the amount of time that you need. And what you said with, like, hey, checking for these big lifts or no lift.
You can get that pretty quickly if your expectation is, hey, we expect this to be a 15% lift differential, and that means we just have to run this test for 10 days. At the end of 10 days, if you don't see any difference, move on to the next idea. Yeah.
And, also, in smaller companies, you tend to be able to move lift a lot easier than at bigger companies. I know at the larger companies, you basically operate by working with sub metrics. So let's say you wanna improve retention, you need to improve sub metrics of retention because you just cannot move retention with a new feature. But in a startup, you can make a new website in a day, and that could completely change everything. So Yeah.
It's really hard to move the needle at certain companies when you realize that the potential user base is a significant fraction of the population of a country. Like, hey, our user base is 250,000,000 users. Like, yeah, good luck changing anything with that churn rate. Oh, you're gonna tell people to just have more children to get user numbers higher? Yeah.
So aside from Yeah. One other thing, though Go ahead. Oh, go ahead. No. I was just gonna say, I mean, yeah, if you're talking about, like, raw numbers of, hey, we're growing our user base by so much.
But, you know, maybe it's, hey, we wanna increase our watch time on Tubi or Netflix. Right? How many hours, how many ads they see, or, you know. So there, it may not make sense to go for number of people anymore. It may make sense to optimize for something else, and so you wind up doing that. And sometimes that's the trick that you have to play is, okay.
Well, we can't, yeah, we're not gonna get significant lift in this category anymore, but can we get significant lift in some other category that moves the needle in ways that matter to us? Yeah. For sure. So moving on from AB testing, you're working on this project that we talked about before recording started, which I find incredibly fascinating.
The blog posts that you've been doing, which from what I've seen, you're looking at very interesting white papers that get published and then distilling down what the contents of that are into a highly consumable, pleasurable to read blog post. Where'd you get the idea for this, and how do you pick your source material? Yeah. Thank you for the nice words, first of all. Regarding the idea, I think it was mainly twofold.
The first was I wanted to learn. There's something called the Feynman technique, which is explaining something to a 5 year old. Every time you find gaps in your knowledge, you need to learn more about that topic. So, say you wanna explain string theory to a 5 year old. If you can't do it well enough, then you don't know the material well enough. So it's essentially learning through teaching.
And so that was the first reason for doing it. And the second reason was sort of more of a fuzzy, warm, reason, which is I think there's a lot of really cool tech out there and really inaccessible tech. And this stuff isn't that crazy or that complicated. And if you just have someone explain it to you, it's usually pretty accessible. Now if you wanna understand the back end perfectly, there's a lot going on.
But getting 80 or 90% of it and then knowing enough to implement it effectively isn't super, super hard. So a while back, I thought it would be a good idea to write a blog post a week for a year. It's going. I'm, like, halfway through and haven't missed a week. Don't know how.
But Way to go. That's harder than people think it is, by the way. Oh, I know. I mean, effectively, what you're doing is writing a manuscript. It's the process of writing a textbook.
It's effectively what you're doing every week. It's like a new chapter, and you gotta go do your research. So you're prepping yourself for your first publication. Yeah. There we go.
That's also a lot of work, harder than people think. I will talk to my editor and let him know that you're interested. Perfect. So, yeah, the one liner is looking to bring academic slash white paper research to the data science industry. And there's a lot of stuff that's out there.
Cool. A lot of stuff gets published. How do you sort through the flood of papers coming out of MIT, Berkeley, Stanford, you know, reading through all this stuff and saying, okay. This one seems relevant, or this one's potentially gonna be important. Somebody's gonna build something based on this.
Yeah. That's a great point. There is way too much out there for anyone to master or even roughly learn. So I use a website called arXiv, arxiv.org. And it is Cornell's aggregation of just tons of papers that you can sort by keyword.
They have a nice advanced search feature, and they have a pretty good statistics area. So that's where I get a lot of my papers. And I just do what I'm either interested in or working on at work. So if it's related to a project, it's really helpful to have in-depth knowledge. And then if it's just cool, then it's a fun read.
I'm interested in, like, simulations, and marine energy is another topic that I'm looking into, like, clean energy in general. So just anything that catches my eye. But the key is to do stuff that you like, or else you won't get through it. And, you know, it's funny you say that because that's the same advice I give people about podcasting, about having a side project at all, is, you know, they're like, why? I wanna do this podcast, and I wanna tell people about this stuff.
And I'm like, well, do you enjoy talking about it? Well, I don't know. I said, you're gonna be done by episode 5 because you're gonna be bored. You know? And it's the same thing with the side project.
Well, everybody builds a Twitter clone in Rails, and so I'm gonna build a Twitter clone in Rails. Well, are you interested in Twitter clones? No. Well, what do you do with your time off? Oh, well, I paint minifigures for D and D.
I said, do you like talking about that? Oh, I can talk about it all day. Do that. Build a catalog for minifigures for D and D. Right?
It's your side project. So I love this. Right? Where it's it's, hey, this is stuff I'm really interested in. This is stuff that, you know, I could talk about forever.
This is stuff that I'm really fascinated by. That, yeah, do that, folks. Exactly. And it's tough to take a leap of faith and know that in the future, it'll all work out because I think a lot of these projects are very goal oriented and outcome oriented. But doing stuff you like almost always leads to better results than just what the rest of the people are doing.
That's true. But the other thing is, I think people also have a tendency to underestimate the work that it takes to, like, write a post or build a side project, and then they also underestimate how long it's gonna take for people to really find it. I mean, you actually have to go do work to help people find it. And Yeah. It's like, well, okay.
So if I start the podcast now, or if I start blogging now, then am I gonna have a new job in 2 months? No. No, you're not. Right? I mean, unless you have one of your things go viral and people are going, who the heck is this guy?
And then when they look into you, they go, he's amazing. It's just not gonna happen. The odds of that happening are really low. Instead, what generally happens is you're consistent, like you've been, and then as people move along, it's, oh, this guy's terrific. You know, he's writing great blog posts.
And so you get more and more and more people showing up. You start appearing on more Reddit threads and newsletters and people just talking about you at the conferences. And, eventually, you build into that. Exactly. Yeah.
You know, it was very flattering to be asked on this podcast. I'm pretty early in my career, so it's kinda wild It's good stuff. The blog post. Yeah. Yeah.
It is good stuff. I mean, I I thoroughly enjoyed the few that that I've read. I'll I'll probably go back and read all of them. The one in particular Say less. Thank you.
That was really interesting. The one I read yesterday was on one of the research papers that IBM released recently about explaining model deficiencies, FreaAI. Yeah. Seems like it's proprietary, which is interesting, to see somebody write about a white paper that is a proprietary tool. Most people are writing about, like, oh, you know, there's this new open source tool that people are using, or it just got released.
Everybody check it out. But I find it almost refreshing to see somebody reporting on something that is proprietary. Because it's opening people's eyes up to the possibility of, oh, this is a new technique that I didn't think of before. There's no code in it. And that makes sense from IBM's perspective, because it is proprietary.
But I wouldn't be surprised if within a year or so, somebody will have an open source package that does that. They'll figure it out. Agreed. Yeah. Because you've helped translate something. I've read the white paper as well. It's not one of the most dense and complex ones out there.
Some of them are almost illegible. But still, not a lot of people will actually go and read the original source paper of something, particularly if they've never heard of it. So it's interesting seeing that. Yeah. And I definitely will not be able to take any credit for future open source packages.
But, yeah, it's really nice to have a digestible piece that explains the high level. And I took a page out of my manager's book and basically tried to, in one paragraph, explain the entire topic and see whether someone's interested in it. And then if they actually do wanna implement it, the blog post is not the resource. They should go to the paper. They should go to other things.
Well, the other thing is, with proprietary technologies like that, one thing that I found just over the years, because, you know, I'm a programmer, but I also run a business that runs the podcasts, and I've built custom software for pretty much every part of the podcasting stack. And what I found is that, in a lot of cases, it's easier, cheaper, and better to just use a proprietary solution for some piece of the puzzle. Right? And then I can focus on the parts of it that really matter, that are different from whatever else is out there. But people tend to shy away from it in a lot of cases because there's free open source solutions.
And so I'm also gratified to see, hey, look. You know, if you wanna shortcut this process, save yourself time on these areas, then go pick up this thing, and then move on to the next piece. Crazy thing in the ML space is a lot of the cool tooling that comes out that's proprietary. You gotta buy an entire tech stack for that. You can't just say, like, oh, I want this service from IBM, and then I want this service from SAS, and then I want this other service from H2O.
None of them do that. It's like, hey, you want all these things that IBM has to offer? Here's the contract to sign, and you're in it for millions of dollars per year. So it's a big sell to use that stuff. But illustrating and putting it into layman's terms about, like, hey, this is what this company is releasing and some of the research they're doing in Israel, which seems to be IBM's new, like, massive research headquarters for all their ML stuff.
It's all coming out of Haifa. And it gets people thinking. I'm like, I've been struggling with this, and I've been wondering, because, to refer to that paper in particular that you talked about in the blog post, some models are out there running in production where you have a performance blind spot in a subset of your data. How do you find that?
It's not trivial to find. You have some massive tree based algorithm, you know, you run an XGBoost and you're like, hey, yeah, we're at 88.9% accuracy on this. And then somebody does analytics post hoc after, you know, the real world interacts with this algorithm. You find on a cohort analysis that you have 99% accuracy in 80% of your use cases, and you have 1% accuracy in your most valuable customer base segment, and people are like, what went wrong here? And the tool, like what you wrote about in that paper, is designed to illustrate that. Like, hey, there's this issue here.
It can't predict this correctly. Exactly. Yeah. Anytime you simplify accuracy down to a mean or a sum, you're throwing out lots of potentially useful information. Mhmm.
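(To make the blind-spot idea concrete, a crude slice-level accuracy check on invented data; this is just a pandas group-by, not IBM's FreaAI, which searches for weak slices automatically.)

```python
# Crude blind-spot check: overall accuracy can look fine while one valuable
# segment is badly served. FreaAI-style tools automate finding such slices.
import pandas as pd

df = pd.DataFrame({
    "segment": ["casual"] * 800 + ["premium"] * 200,         # hypothetical cohorts
    "correct": [1] * 792 + [0] * 8 + [1] * 2 + [0] * 198,    # prediction correctness
})

print(f"overall accuracy: {df['correct'].mean():.1%}")
print(df.groupby("segment")["correct"].mean())   # 99% for casual, 1% for premium
```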
So Yep. Alright. Well, I'm gonna push us toward the end of the show because I actually have a work meeting in 8 minutes. So, I have a question for you guys before we end. Go ahead.
Cool. So this is sort of, like, a career advice y type question, and I think a bunch of your listeners could probably relate. I have more of a generalist skill set. I've definitely picked some areas of interest. But I think because I'm new in my career and wasn't the coding math nerd PhD, like, I don't have all the educational credentials that many of the really technical people have.
So I think I'm gonna make my, like, mark, quote, unquote, in data science by bringing data science to new fields. And I hinted at them before, but I'm working in 2 areas for, like, side projects. The first is education, and then the second is marine energy. So converting waves, tides, and currents to electricity. Mhmm.
So do you guys have advice specifically for generalists on how to bring data science and machine learning to new industries that aren't ready for the next giant deep learning model? I mean, I'd say generalists are the only ones that are gonna be capable of making that happen. And Mhmm. I've heard that from a number of people at companies, like, oh, well, we have, you know, 2 hats being worn by people in our data science group. We have some generalists and we have some specialists that are really deep in this one area.
And I don't know, it always confuses me unless you're working in an industry that is hyper focused on one particular use case, such that your industry just doesn't have any need to embrace anything else ever in the future unless some revolutionary thing comes out. I think everybody should be embracing some aspects of generalism in their data science career. And that also means that there are some specializations that I think every data scientist should focus on, which is learning how to write good code, like testable code, knowing how to This is the first time I'm hearing this from you, Ben. Yeah. Knowing how to simplify a problem down to something that you can communicate with lay people at a company and gaining wisdom over time of how to most simply solve a particular problem.
And I think people that see themselves or others see them as generalists are more equipped to do what you are asking. Like, hey, I'm going into this industry, this company, or this nonprofit that I'm working with has never done anything with machine learning or data science or analytics. Somebody with a broad skill set of all the aspects is gonna be able to make that happen a lot easier than somebody who is a specialist on GANs or, you know, is an NLP specialist. Because those Yeah. Those industry groups might have no need for that.
Exactly. Yeah. Yeah. I tell people I'm an adversarial network all on my own. Anyway, my advice on this is mostly from the other end, not necessarily building your ML career, just because, yeah.
I don't, I'm not deep enough into it to tell you, hey, these skill sets. Right? But I can tell you that if you're trying to break into a new industry that does not adopt technologies quickly, then the trick is you have to show them the value before they're gonna move. Right?
In other words, you have to prove out, hey, look. This is going to give you such a competitive advantage that you would be insane not to adopt it. That's when they move. Right? If you come in with the idea, they're gonna be, like, that's nice, but it's worked this way for 80 years.
And they tend to not want to take the risk, even recognizing that if one of their competitors tries it and it works, that they're kind of in a bad spot, just because they are risking capital and time and effort to get it. And so, like I said, if you can prove out, hey, look, if you go adopt this thing or that thing, you know, you'll increase efficiency, or you can basically print money, or you're gonna be getting some advantage over your competitors by reducing your costs in a significant way, those are the kinds of things they're gonna pay attention to. It's not gonna be the idea. It's gonna be some aspect of the implementation that they think they can drastically increase their profits and competitive nature on.
So that's what I would focus on: a, proving that it works, and then, b, how do you communicate that to them and speak in their language as far as what the advantages offered are. Amazing. Yeah. Both of those points make a lot of sense. Cool.
Thank you, guys. Alright. We're gonna do picks really, really fast. So let's go in, like, 30 seconds apiece so I can make it to this meeting. Ben, you wanna go first?
The only thing that I've been focusing on the last week aside from writing and testing code is reading the next book from David Thorne, one of my favorite authors, a humorist, sort of very dark humor of an Australian variety. Check out his latest book and his entire back catalog on Amazon. Guy is a riot. I've yet to meet a developer or a data scientist who doesn't find the guy incredibly funny. Cool.
I'll go next. I have 2 things. 1, I've been doing board game picks. My pick is Shadowhunters. It's out of print, which means you either have to find somebody selling it used on Amazon or go get it on eBay.
But it's Shadows versus Hunters. There are neutrals. They have their own win conditions. If they get it 1st, they win. Otherwise, the shadows win by eliminating the hunters.
The hunters win by eliminating the shadows. And you play with cards, and you move around the board, and you use equipment to kill each other. It's way fun. I'm moving fast. I would sell it more, but I yeah.
The other pick that I have really quickly is Top End Devs. And I'm moving fast and trying to get this content together. I'm focused on putting out content on building your career and building leadership. But I am looking for technologists that wanna help either build out an ongoing series of videos that demonstrate in 15 minutes or less some concept. Machine learning is one of the series I want to put together.
I'm also looking for people if they want to put together, like, a 1 hour tutorial. So this could be, here's how you do this aspect of MLOps, or here's how you do this aspect of building a model, this algorithm explained, or whatever. And then finally, the longer form courses that I think everybody thinks of when they think of online courses. I'd like to have some of those as well. And then I have some other stuff going on that I would love to talk to you about.
If you wanna be an author, topenddevs.com/author. If you're trying to get a blog or podcast or YouTube channel or freelancing launched, I'll also coach you, and that's topenddevs.com/coaching. Alright. Michael, what are your picks? I had a longer list, so I'll condense.
But, basically, there's some interesting ML-specific research out there. There's some research by Stuart Russell at UC Berkeley. He's reframing loss functions for general AI. Super cool. Léon Bottou on Facebook's ML team, he was working on causal inference for modeling and just causal modeling in general, then conformal prediction, and then optimal transport for fairness.
Those are all super cool topics and highly recommend. Awesome. Alright. One more thing. If people wanna contact you online, so that's Twitter, GitHub, LinkedIn.
Where do they find you? LinkedIn is Michael Berk. Medium account is Michael Berk. Website is michaeldberk.com. And those are all of my things, basically.
Alright. Reach out. The note from my boss says don't worry about making it. So we can go a little longer if you want. If you have a longer list, go for it.
I mean, it was more like going into each of those. Yeah. But just one quick point that I would be curious for your guys' opinion on. Stuart Russell at UC Berkeley works on general AI, and his hypothesis is basically that with a fixed loss function, you can't have smart general AI, because people don't operate with a fixed loss function. Okay.
Cool. Yeah. I just wonder if you guys are on board with that. I I think that's a super cool, new paradigm where you not only learn the data, but you learn the loss function. Yeah.
I mean, I think it's a natural evolution, not just applied to deep learning, although it would more aptly apply to something that has back propagation to adjust for and compensate for that. But I think in general, in our profession, that is kind of where tech will eventually go: more closely emulating human decision processes, which are not only relatively inconsistent, but are adaptable to new information as it comes in. And being able to understand interactive elements in a way that, I don't think we even understand how human perception works and how the brain works very well. Yeah. But however it happens to work, more closely emulating that behavioral pattern makes sense.
And, like, fixed loss functions, exactly as you said, it's not a finite, definitive element about how we decide on things. It's fungible, mutable. Yeah. Mhmm. Yeah.
And I feel like it has less application for industry. But for the big ML questions, like, I, Robot style, it's a really cool thought experiment. Yep. I'm trying to explain why I do stupid stuff all the time. Exactly.
I mean, what's funny about that statement, not applicable to industry. People said that about neural networks back in the early 1980s when they first sort of came out. People were like, nobody will ever use this. It computationally takes forever. You know, the specialists required to actually write this code in whatever it was at the time, COBOL or Fortran.
And now look at industry. With the advent of cloud computing and much cheaper GPUs Yeah. Sea change. Everybody's doing deep learning. It's ubiquitous now. So never say never.
Alright. Well, now that we live in a Star Trek world, we're gonna wrap up. Thanks for coming, Michael. This was fun. Yeah.
This was great. Yeah. Thanks for having me. Alright. Well, we'll wrap it up here.
And until next time, folks. Max out. Alright. Take it easy. Bye, everyone.