Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Burke, and I do data engineering and machine learning at Databricks, and I'm joined by my cohost.
I work on software stuff at Databricks, specifically MLflow.
Beautiful. So today we are going to be doing another case study. This is a customer request that actually came up this week for me, and I will be anonymizing it so I can't get sued. But a lot of the underlying principles and themes for this case study apply to most use cases of this flavor. So starting off, we are going to be talking about recommendation engines.
We're going to be basically stepping into my shoes and then leveraging Ben's wisdom. This will very much help me in my job and allow me to deliver a better product. So I'm just being selfish right now, but I'm sure you guys will learn a lot.
So let's say we are a large retailer and we sell, pick something, food. We sell food, all right. And we are looking to create a recommendation model. What this model will do is serve the top N products to a given user at a given time. So let's say they enter the homepage, they already have an account with us, and they have some viewing and browsing behavior.
We're going to leverage everything that we have in our system to give them the top products that would most likely lead them to convert. So, some examples of data that we have: we have prior behavior, so browsing behavior, what they clicked on, what they bought in the past. We also have a lot of users, so we can potentially build similarity matrices that let us say, all right, user X is similar to user Y, maybe they'll be buying the same things.
And then we also have profile-level information about things like their dietary restrictions, or any specific preferences that they have hard-coded in. Good so far, Ben?
I have a lot of questions, but please continue.
Cool. So this is not going to be greenfield; we're going to be discussing an existing model and how to improve it. Currently, this theoretical company, which is actually real, has a multi-step process where they first calculate embeddings, then they do some distance calculations, and then finally, based on those distance calculations, they
coalesce everything into a ranking of 50 products, 100 products, whatever it might be. And this current model needs to serve, let's say, a hundred million users over a year, and we're running it once a day. And then there's maybe another model that runs once every six months. So that's the setup. Good so far? Part two.
What are they embedding?
That is an outstanding question. From my understanding, they're using these embeddings to convert information into a high-dimensional space where they can then use distance calculations.
So is the embedding vector generated by running something like a large language model over the product description, or is it a feature vector that they're generating based on attributes of that product?
Okay. And how clean is the data? Do they get the product information data from a centralized repository that is assigned based on the purchasing department? Or is this something that comes from vendors? Like the people that they're buying product from.
That's an outstanding point. Let's say, I don't know, let's say that it's clean for fun.
So in-house curated perfectly pristine data, which I have never seen, but let's go with that.
perfect in every way.
Yeah, we live in a perfect world for it today.
Okay, so we've got a feature vector. We're using that in some sort of nearest-neighbor similarity search to say, what are the most similar products to this thing? And we're doing that 50 times for each customer who has an account on the website, to give relevant recommendations of products that they've not purchased before.
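A minimal sketch of the kind of similarity lookup being described, using plain NumPy cosine similarity over feature vectors. The function and variable names here are illustrative assumptions, not the company's actual implementation, and a real system at this scale would use an approximate nearest-neighbor index rather than a brute-force scan:

```python
import numpy as np

def top_k_similar(query_vec, product_vecs, k=5):
    """Return indices of the k products most cosine-similar to query_vec."""
    # Row-normalize so the dot product equals cosine similarity
    norms = np.linalg.norm(product_vecs, axis=1, keepdims=True)
    unit = product_vecs / np.clip(norms, 1e-12, None)
    q = query_vec / max(np.linalg.norm(query_vec), 1e-12)
    sims = unit @ q                      # one cosine score per product
    return np.argsort(-sims)[:k].tolist()
```

In practice the scores would be filtered to exclude products the customer already purchased, as described above.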
Correct. Yeah. So I'll bring up two more things. The first is they are currently struggling with the runtime of these pipelines. They're absolutely massive; they're basically throwing everything into one big giant model. The way that they have historically tackled this is they've created subgroups based on dietary preferences. So if you're vegetarian, if you're keto, if you're paleo, whatever it might be, they have smaller groups where they know that these users are already more similar,
and we're just going to lump them together. So that's point one. Good so far?
But the products are still at the individual-product level of granularity. We're not grouping products together into a hierarchy; we're just grouping users into groups.
Yeah. So you're hitting at some very important points, which are, basically, this thing does everything. It calculates distance between users. It also calculates distance between products, for each user. And then I think there was another step where they leverage hard-coded rules based on the profile
to say, again, if you can't eat fish or you're allergic to fish, well, it won't surface fish as a recommendation. So they're using a bunch of different techniques, and this model has adapted and evolved over time.
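The hard-coded profile rules just described can be as simple as a set-based filter applied after candidate generation. A hypothetical sketch, where field names like `allergies` and `category` are assumptions about the schema, not the real one:

```python
def filter_by_profile(candidates, profile):
    """Apply hard-coded profile rules: drop any candidate product whose
    category the user has excluded (e.g. a fish allergy)."""
    blocked = set(profile.get("allergies", [])) | set(profile.get("excluded", []))
    return [p for p in candidates if p["category"] not in blocked]
```

Running the rules as a post-filter keeps them out of the model itself, so a dietary restriction never has to be learned from behavior.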
And the computation of these relationships, is this conducted at the time of model fit or at transform time?
Oh, okay. Yeah, no wonder it takes a day to run this. First thoughts that I have about the design, okay.
Well, hold on before we kick it off, my second point slash question. So Ben, we know you're a smart, nice guy, but can you provide a little bit of background about why we should trust you in this space?
I mean, I've built these things before: the entire thing, designed and put into production many times, these exact engines intended to solve this business problem. It's not a cookie-cutter thing that I did one time and then just copied at a bunch of different companies. They're all bespoke implementations. Sometimes I wrote all the code, sometimes I worked with the team, sometimes I just advised people on how to solve this.
But each one was unique and distinct, because these sorts of recommendation engines are highly specific to a lot of different things. Most notably your business model: what are you doing as a company? Also your sales model: how do people actually buy whatever you're selling? And then thirdly would be the diversity of
So, sorry guys.
What is your footprint, both in product volume and in user volume? Knowing those three things, and having a deep understanding of them, informs the design of what you would actually attempt to build. And it's iterative: you start off as simple as possible and only go more complex once you get validation through hypothesis testing of new ideas. So basically, a feature doesn't
go into the model and stay there unless it's making you money. If it's not making any money, if it's flat, or you're losing money, you rip it out of your solution and try something else.
Yes, so to indirectly answer part of that: the North Star metrics that they're using, there are two of them. The first is minimizing distance in the distance calculations. And the second is they actually have an online A/B testing algorithm that will serve multiple models in real time and then determine which model leads to the highest total number of sales.
Across all people that are seeing the model, or split up amongst all of your different customer profile groups? So, this is an aside for A/B testing when you're dealing with humans that are buying things: you have to disambiguate the differences between your subpopulations within your A/B testing groups. It's not just population A versus population B seeing these different models.
Within A, you have 125 different recency-frequency-monetary value customer groups, so RFM groups. That is: how often do they come to your site and buy stuff? When is the last time that they actually gave you money for products? And how many purchases have they made with you over the lifetime of their account? And you want to be bucketing your test groups
within those. You don't need all 125; you can usually subset that down into something like 25 super-RFM groups, or I've seen people do it with eight. But you basically want to say: within test group A, I have high-value, high-frequency customers. These are people that, you know, if we're talking about buying food, are coming
and placing orders twice a week, maybe once a week on a big order. And they've been using your website for probably a year or two, and you are effectively the place that they go to buy groceries. All the way down to your least interactive customers. So this is somebody where the last time they bought something from you was a year ago;
they bought 10 things, and they've come back and viewed your site recently, but they're effectively a reincarnated customer. If you looked at their data six months ago, you would have said, oh, they're dead. Not physically dead (maybe physically dead), but as a customer, in the marketing sense, they're a deceased customer. But they've come back recently. So their RFM value is going to be in this very weird bucket, where
you're not going to have high confidence that they're going to actually make a purchase on that particular day that they come back. They might, but it might not be a big order. Or even if it is a big order, there's a low probability that they're going to convert into that high-value, high-frequency customer. So knowing this about how retail works, about these RFM groups:
say we define 10 supergroups of customer types. We're actually running 10 A/B tests in parallel with our data, and we want to determine within each of these subpopulation groups what the lift is. If we average them all out, we can actually get an invalid A/B test, because this isn't a fixed resource that's bounded by time. Where you used to work, at Tubi, you're doing A/B testing based on viewership.
There's a fixed resource that exists within each customer whose data you're looking at: they can't watch more than 24 hours of TV in a day. It's impossible, unless they have multiple accounts open all at the same time and they're doing something weird. But those should be excluded from A/B testing; if you get data like that, you're like, something funny is going on here, I don't want that polluting my data. When you're talking about capital, money, and products...
You'd be surprised.
There isn't a bound, or rather, the bound exists within the individual customer; it doesn't exist in a global sense. So with nearly infinite resources and money, somebody could buy your entire product catalog in one transaction. It's theoretically possible. Somebody could come in and say, I'm buying this entire store with this transaction. It might be an $850,000 bill for that one transaction, but nothing in your system is restricting that.
In fact, most people would be happy about that: you bought all of our product in one transaction, awesome. But what if that person was in group A while you're running an A/B test? Group B is going to look like garbage, right? You're like, we saw this massive negative lift in this test. Well, that's because you have this outlier. That outlier is super easy to detect, and everybody listening in right now is like, Ben, don't be stupid, come on, man, we would filter that out. Well,
what if it wasn't that? What if, in group A, you had 40 people come in and make abnormal purchases that were 40% larger than what they normally would do, and in group B nobody did anything abnormally large? And what if those 40 people were all in one group: all high-value, high-frequency, and medium-monetary?
Well, because a large portion of your control group shows abnormal behavior for a particular subset of users, it now pollutes your test, and you're going to get invalid results, or you're just increasing unexplainable variance. But if you split it out and assign those customers to a particular group, you can do that analysis and look at all 10 of these groups and say, wait a minute, nine of the groups are showing a lift on our change, and this one
group looks super weird. What is going on? And you plot it out and you're like, hang on, these 40 people did something really weird that has never happened before. Let's remove them from the experiment real quick and then look at the data. And you just start looking at variance, at the standard deviation of what's happening within that population group. So my aside is now over, but that's how you can more effectively figure out
what the actual impact of your change is when you're grouping people with similar behavior. That statistical analysis of your subpopulations needs to be done beforehand and decided on, so that you can do a stratified split across those groups for the A and B arms. You can't just randomly sample A and B and hope for the best. You have to do an A/B split within each of those population groups so that your analysis is correct at the end.
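The within-group split Ben describes can be sketched in a few lines. This is an illustrative implementation, assuming each customer record already carries a precomputed RFM group label (field names are hypothetical):

```python
import random
from collections import defaultdict

def stratified_ab_split(customers, group_key="rfm_group", seed=42):
    """Split customers into A/B arms *within* each behavioral stratum,
    so every RFM group is evenly represented in both arms."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for c in customers:
        strata[c[group_key]].append(c)
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)             # randomize within the stratum
        half = len(members) // 2
        for i, c in enumerate(members):
            assignment[c["id"]] = "A" if i < half else "B"
    return assignment
```

Because each stratum is shuffled and halved independently, no single subgroup's unusual behavior can land disproportionately in one arm, which is exactly the failure mode described above.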
Stratified sampling, lots of fun. Yeah, cool. All right, so I'm sold. If we're going to be evaluating this, you should look at the whole distribution and at pre-experiment-defined groups: let's say these are your power users, these are your casual engaged users, whatever it might be. Great. So evaluation is sort of clear: we're looking at sales, and then we're looking to minimize our loss metrics. But going back to the model itself,
what are your thoughts? How would you approach this type of problem? Are we using a good architecture? What are you thinking?
I mean, we've talked about this on the show before, about where you start with a project and how the algorithm doesn't matter. It's more like: what problem are we trying to solve? Are we trying to sell more apples, sell more bananas, whatever it may be? Or is this recommendation engine, in its eventual use case on the website or in marketing campaigns, for retention? Are we trying to just
serve as a reminder of relevant products, to say, hey, these things are going on sale, and we know you really like these? If that's the intention, we don't need fancy math here. We just ask, what are the most frequently purchased items that are going to be on sale for each customer? That's a SQL query that we generate. But that's probably not the use case here. We're trying to increase sales, or increase awareness that
our company offers products these people might not be aware of in the individual market that they shop in, if we're talking about local stores that they can buy from. Let's see.
Well, let me just answer that in the annoying business way that it was probably presented: we want to increase revenue. Full stop. Oh my God.
What does that mean? There are so many ways of doing that. When I've had this exact explanation given to me, I immediately go into: what sort of revenue do you want to generate? Do we want to generate raw sales, meaning more transactions of the things that we're selling? And usually the answer to that is, yeah, we want that. And I'm like, or do we want profit to go up?
Because sales and profit don't always move together. So we start talking about, well, how do we calculate what our raw profit is? And that conversation will usually evolve, over a couple of minutes, into: nobody in the room knows how this is calculated. And then I always say, can we get someone from finance in the room?
We're not going to be doing data science talk right now; we just need an expert in the room who can tell us how this stuff is calculated for this company. And that finance person, if they're available, usually comes in and gives this wonderful and amazingly educational speech to everybody in the room. I usually tune out during it because I have no context: I don't work at that company and I don't know anything about how their business is run.
But everybody else in the room, their eyes open. They're like, we had no idea that was how this works. And it has to do with the economics of how they're actually purchasing products, and what the loss ratios are. What is the theft ratio from stores, if we're talking about brick and mortar? And if it's purely online, what are the fraud rates for certain products, and what do we take as losses when fraud occurs?
All of these things factor in, along with return rates: people saying, I don't like this thing, or, this went bad, you had a sticker on it that said it was good for 10 days, I bought it two days ago, and there's mold all over it. Some people return that stuff to the grocery store, believe it or not. So that finance person is going to know all of that, and they'll be able to inform decisions made in the early planning stages of building a solution
using data science techniques, where you can start thinking about: oh, we need this data to optimize this problem, and we might not have access to it. Maybe those numbers the finance person is quoting, those equations, only exist in their spreadsheets. In fact, there's a very high probability that that's the case. That very deep knowledge that people working in finance have about calculating
profitability for the company, for public accounting purposes if it's a publicly traded company: the chances that those algorithms exist in the data engineering pipelines are pretty slim. So that might have to become part of the feature engineering pipeline, while a separate project gets data engineers to create that data set as a table.
But for the purpose of the project, we need to define what is our target metric. What are we trying to optimize here? And once we really understand that, that lets us know, do we need to build a recommendation engine that's custom to our customers so that they will come and buy more of what they like? Or is it something that's custom tailored to our business where people are buying things that make us more money?
I got you.
There's a huge difference in how you would implement both of those.
Yeah. So in the spirit of just moving forward, let's say we want to optimize for raw revenue. The quick rationale is that if we have a bunch of money coming in, we can theoretically optimize supply chain processes and other things to turn that revenue into profit. So let's just operate with that. I see your face. Yeah. We want sales. So I will define this.
That's a big assumption, but okay. So we just want sales to go up. All right.
Over a month-long window, we want the sales per user account, let's call them households, the sales per household, to be maximized. And sales are defined as the total amount of dollars spent on our platform.
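The metric just defined, total dollars per household over a monthly window, could be computed with a simple aggregation. A sketch over a hypothetical orders table (the field names are assumptions):

```python
from collections import defaultdict

def sales_per_household(orders):
    """Total dollars spent per (household, month).
    orders: iterable of dicts with 'household_id', 'month', 'amount'."""
    totals = defaultdict(float)
    for o in orders:
        totals[(o["household_id"], o["month"])] += o["amount"]
    return dict(totals)
```

At a hundred million users this would of course be a SQL or Spark aggregation rather than an in-memory loop, but the grouping logic is the same.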
Okay, there's a lot of conditionals in there that you would have no control over, but let's operate under that assumption.
Cool, just for clarity, are any of them very problematic and blocking or are they just sort of things we'll need to account for?
It's a goal where you don't know the maximum ceiling on any of the things that were just said. You don't know how much revenue each of your customers brings in on their own. You don't know what their household income is. You don't know what their storage space for food is in their home. You don't know what their ratio of cooking at home is versus
Do we need to know that though?
You don't need to know that, but if your objective is to maximize the spend for each customer, the assumption is, well, if they're buying milk from our competitor, we want them to buy milk from us, right? But you don't know if they even drink milk. And if we're selling food to humans, there's a total maximum capacity.
Alright, you're right.
Unless somebody's trying to shoot for the Guinness Book of World Records as the largest-mass human in history. Provided our customers aren't trying to do that, or competing in some sort of professional eating competition, there's a maximum amount of food that a human is going to be able to consume in a given week or month. Where do we get that from? Do we know what the caloric intake is
for each of these people? Because anything they buy in excess of that is going to be spoilage, and they're going to throw it away. They'll be like, man, I shouldn't have bought all this stuff; now I have to throw it all away or put it in my composter out in the backyard. There are so many factors in that. So if the target is, we want to maximize spend for each of our customers, how do we know whether they're already getting all of their food from us?
Maybe they don't ever order takeout and don't ever eat anything not purchased from us. We don't know. So if your target is to maximize, and you're already at a maximum, you're not going any higher. Any effort you put into that is going to be wasted, and you'll just end up getting frustrated, saying, sales aren't going up. Well, that's because you have really loyal customers in this particular group. And that's where a pre-analysis happens, where you're looking at those RFM groups and saying,
for these customers, if I look at this RFM group over time, what is its trend over a period of 10 years? Is this group stable over time? Do people move from this group to another group? All of that analysis I would do way before talking about building a recommendation engine, so that you have a true understanding of your customer base.
All right, I will rephrase maximize to increase to some arbitrary level. Is that better?
They just want more money. It makes sense. Yeah.
We want more money. Yes. Cool. So that is the definition of what we're trying to optimize. And again, we laid out the model architecture that exists. The thing that I was scoping on the call is their job runtimes. But because ML is super sexy and you like talking about high-level concepts and algorithms and things like that, how would you go about restructuring this algorithm setup
to be more performant, but ideally also to better optimize our metrics of interest?
I would do something super annoying, which is I would ask the front-end developers in the room and the product people in the room: can I see the interaction-rate charts for each of these RFM groups with the widget that you have right now? I want to see how many people actually interact with the current, existing one. What's the click-through rate? What's the rate of
persistence on that page after clicking, and then what's the conversion ratio? And hopefully they have all that data ready, because if they're talking about working on this project or improving it, they should be tracking all of it. And then I would look at what those numbers actually are. Do 80% of people use this as a primary navigation aid? If so, that's super important:
all hands on deck for making this thing as good as it possibly can be. Is it a carousel that they're interacting with? Is it an entire page that's generated for this? Is the entire site customized based on the output of this algorithm? How is it being used is the big question that I would have, because that informs the volume of data that you would need to generate if you're talking about making this better.
Do you need to do product-category-level recommendations with a single algorithmic solution? Like, do we need to provide a recommender on every product grouping page? When I go to dairy, do I need my top 20 dairy items recommended for me? Do I need that for each division within the site? And if I'm going to do that, then what do I do with sections of the site that
That's a good point.
this customer has zero data on.
Or the worst case scenario: it's such an esoteric section of the product hierarchy that you have a dearth of data on anybody really interacting with it, except an extremely small subpopulation. How do you populate that? If there is no data, or not enough data, for a particular population group within your customer base, where do you get it from? Is it most popular? Highest sales?
And that leads into: how do we do fallback? When a customer arrives and their generated list is insufficient, and it doesn't have high-probability affinities, do we salt it? Do we add in products that are just, hey, this stuff's on sale, we hope it sells? Or is it stuff that's been trending, like highest sales in this product category over the last 24 hours? There are lots of questions to
be asked, and then hopefully answered, at that point.
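One possible shape for the fallback logic being discussed: pad a thin personalized list with on-sale and then trending items. The ordering (sale items before trending) and the function name are assumptions for illustration, not a stated requirement:

```python
def pad_with_fallback(personalized, on_sale, trending, target_size=20):
    """Pad a short personalized list with on-sale and then trending SKUs,
    skipping duplicates, until target_size is reached."""
    result = list(personalized)
    seen = set(result)
    for pool in (on_sale, trending):
        for sku in pool:
            if len(result) >= target_size:
                return result
            if sku not in seen:
                seen.add(sku)
                result.append(sku)
    return result
```

The personalized items always keep their positions; the fallback pools only fill whatever slots the model couldn't confidently populate.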
Yeah. Piggybacking off one of your points: I was careful in defining the metric to ignore long-term, retention-based value, because oftentimes, especially with recommendation engines, you can cannibalize future revenue by optimizing current revenue. So, as Ben said, let's say there are some products that are missing there; we have tiles on our home screen that are not ready to be filled.
Well, we could throw the most expensive things possible into those buckets, and theoretically users would buy them. That might be a method of increasing revenue, but long-term it might not provide value. That scenario wasn't perfect, but there's often a trade-off between long-term engagement and short-term gains. But to answer all of those questions concisely, so we can keep moving: our model is being served on the homepage.
Literally everyone on the site will go to the homepage first. There are no product categories. And I think that's everything.
So the probability exists that if I've only come to buy dairy products from you, that when I go to the homepage, all I'm going to see is dairy. So I think you're a dairy company.
I mean, hopefully not. Hopefully our algorithm knows that there's a family behind the screen, or maybe there isn't, maybe it's just an ice cream company. But hopefully it can infer what we actually want, and it wouldn't over-index on prior purchasing.
But if I've only ever interacted with that, and I'm interacting with a site that sells a hundred thousand different products... I've been to those websites before. It's almost painful when I see them, when I see a really terrible recommendation implementation. It's just like, really? I know what you're doing here. I know why I'm seeing this.
It's not easy to fix, but it just requires some more creativity in the implementation. And that's more on the product side; it's not the data science side. The data science side fuels the product's ability to do that, but the product design of what you interact with in recommended products doesn't happen on the data science side. The pool of things to recommend is generated on the data science side. So if there are a hundred thousand products, when I go to that site,
the element on the page that would contain things for Ben Wilson: behind the scenes of that, there should be a thousand products that get generated. Maybe a hundred, maybe 200, depending on what site it is. But where the sausage is made, in the ingredients for what is going to be presented to me, there should be a much larger pool of things that I would have an affinity for, which then on the front end,
on the product side, gets filtered. So I'm only going to see 10 or maybe 20. But behind the scenes, in that array of data, those SKU IDs that the front-end developers have access to, there they have logic to say: well, we predict that he's going to like these five product categories, and we have
That makes sense.
10 products for each of these categories to show him. The first time that he comes, I'm going to show him the first element in each of these arrays, index position 0. When he hits refresh or comes back to this page, I'm going to show him position 1. When he refreshes again, if he hasn't interacted with it, I'm going to show him position 2. And we continue doing that. And there could be other business logic that gets put on there to say: I want to promote things that have
25% or more taken off right now. Something's on sale at a 25% price reduction, and I want to push those up higher, even if there's a lower probability of the user interacting with them, because that's compelling. People are like, oh, that's cool, that's on sale, I don't normally get that. Money talks, right? Sometimes that's something that can force somebody's hand.
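The refresh-rotation logic described here might look something like this on the serving side. This is a simplified sketch with hypothetical names; a real system would also track impressions and interactions rather than a bare visit counter:

```python
def pick_display_items(candidates_by_category, visit_count):
    """Show one SKU per predicted category, rotating through each
    pre-generated candidate array as the visit count increases."""
    display = []
    for skus in candidates_by_category.values():
        if skus:
            display.append(skus[visit_count % len(skus)])
    return display
```

Because the candidate arrays are pre-generated by the model, this rotation is pure front-end logic; the data science side only has to keep the arrays fresh.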
If they think they're getting a good deal, you would promote that over something the user might have a 99% affinity to. So that informs the algorithm design, once you know what the product target is, and product will tell that to the data science team. The data science team shouldn't be thinking of all this stuff themselves. The head data scientist should be thinking of this, working with the product person every single day
to figure all this stuff out: running a bunch of tests, coming up with lists of hypotheses, doing whiteboarding sessions, loads of things. But the individual data scientist who's selecting an algorithm to solve this, that's not what they should be doing. They should be getting that list of product design requirements: this needs to do X, Y, Z, and all this stuff. And then they have to sit there and think: what
algorithm could I use to solve this problem? I'm going to test out 10 things and figure that out.
What are some of the 10?
I mean, algorithm-wise, you will get pretty far with, exactly as you said at the beginning of this podcast, matrix factorization. It's a good initial step, though it requires some amount of coercion with respect to understanding your customer groups and what their affinities might be. If you do one big meta-model, you're going to get a bunch of garbage, particularly if you're talking about
a company that sells a diverse array of products. It's going to be very hard to generate a single model that's performant across all of your potential users when you're talking about, say, a grocery store, because some things are infrequent purchases and some things are purchased every week whenever somebody interacts with that grocery store.
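For reference, the matrix factorization approach mentioned here can be sketched as a small gradient-descent factorization of a user-item interaction matrix. This is a toy illustration under simplifying assumptions, not a production recommender; real systems would typically use ALS or an off-the-shelf library with implicit-feedback weighting:

```python
import numpy as np

def factorize(R, rank=2, steps=500, lr=0.05, reg=0.01, seed=0):
    """Gradient-descent matrix factorization of a user-item matrix R,
    where 0 means 'unobserved'. Returns user and item factor matrices
    U, V such that U @ V.T approximates the observed entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    mask = R > 0                      # fit only observed interactions
    for _ in range(steps):
        err = mask * (R - U @ V.T)    # residual on observed cells only
        U += lr * (err @ V - reg * U)
        V += lr * (err.T @ U - reg * V)
    return U, V
```

Scoring a user then amounts to taking that user's row of `U @ V.T` and ranking unpurchased items, which is where the popularity-bias problem discussed next shows up if staples and infrequent items share one pool.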
You're going to be buying consumables, literal consumables: fruit and vegetables every week, meat probably every week, and bread and milk and stuff like that. Those are staples that are going to come up over and over again with a high degree of frequency for all users. But then when you're talking about deodorant, or soap, or
kitty litter, dog food, the frequency of purchase is driven by the volume of the product and how quickly it goes bad or is consumed; the depreciation rate of these things over time dictates how often they're bought. But most algorithms that deal with these affinities, or that look at similar items, don't account for that. If you don't use a hierarchical approach that captures these different consumption rates, if you put it all into one big pool, then the things that are most frequently purchased, with high purchase rates, are always going to bubble to the top. So you're going to get a completely pointless algorithm. It's like, oh, we're promoting 2% milk this week to everyone. Yeah, because 90% of your user base is buying milk every week. If that's the case, then just
make a very simple promotion tool: hey, whatever brand of milk is on sale, show that to people, because it's a staple. But if you want to focus on the less frequent, higher-value items, and on letting people know that your place of business actually sells products they might not have been aware of, you would need to create a family of algorithms based on these hierarchies. So you might go department-based: if there are 12 departments within a particular grocery store, that's 12 models covering your entire user base, but with users grouped by some sort of RFM bucket, because you don't want the high-value people influencing the purchasing behavior of the lower-value people.
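As a rough sketch of the RFM-style grouping Ben describes, something like the following could assign users to coarse segments before training per-department models. All the field names, thresholds, and segment labels here are made up for illustration; in practice you'd derive the cutoffs from your own purchase data:

```python
from dataclasses import dataclass

@dataclass
class UserActivity:
    user_id: str
    days_since_last_order: int   # recency
    orders_per_month: float      # frequency
    avg_monthly_spend: float     # monetary value

def rfm_bucket(u: UserActivity) -> str:
    """Assign a coarse RFM segment; thresholds are illustrative only."""
    if u.orders_per_month >= 4 and u.avg_monthly_spend >= 500:
        return "high_value_weekly"
    if u.orders_per_month >= 1:
        return "regular_monthly"
    return "lapsed_or_infrequent"

# Train one recommender per (department, segment) rather than one global model,
# so the $800-a-week shoppers don't dominate recommendations for everyone else.
users = [
    UserActivity("a", 3, 4.5, 900.0),
    UserActivity("b", 20, 1.0, 200.0),
]
segments = {u.user_id: rfm_bucket(u) for u in users}
print(segments)  # {'a': 'high_value_weekly', 'b': 'regular_monthly'}
```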
So if you get somebody who comes to your store or your site once a month and spends 200 bucks, versus somebody who's spending $800 a week, every week, what they're buying is probably very different. You'd want to group the $200-a-month sort of people together, and maybe try to get them to behave more like the people who are coming every week, but first you'd ask: what are they buying already? Is it perishables, or are they buying other things entirely? So understanding the user base, and grouping people together within specific subgroups of your actual products based on what those products are, is how I would approach it at first in order to get more relevant recommendations. But then the other goal is:
What are you actually trying to get them to do? Do you want them to become aware of things that they might like that people like them also like?
So that's algorithms like FP-Growth, which stands for frequent-pattern growth. It's a market-basket analysis technique where the algorithm finds people similar to you based purely on purchasing habits.
It's a fantastic algorithm. It works really well, and it doesn't need a whole lot of data; it basically just needs product-to-frequency data. It figures out the rest by building matrices internally on the similarity of the patterns it detects.
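To make the input/output shape concrete, here is a brute-force frequent-itemset miner over baskets. This is not FP-Growth itself (the real algorithm, e.g. `pyspark.ml.fpm.FPGrowth` in Spark, computes the same result far more efficiently via an FP-tree), and the basket data is invented for illustration:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(baskets, min_support=0.4, max_size=2):
    """Count the support of every itemset up to max_size and keep those
    above min_support; FP-Growth produces the same output without
    enumerating every combination."""
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["milk", "deodorant"],
    ["bread", "eggs"],
]
for itemset, support in sorted(frequent_itemsets(baskets).items()):
    print(itemset, support)  # e.g. ('bread', 'milk') appears in half the baskets
```

From frequent itemsets like `('bread', 'milk')` you can then derive association rules ("people who buy bread also buy milk") to drive recommendations.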
Are there any algorithms that you would really steer clear of, other than the obvious things that are just not applicable?
Anything where you're not a hundred percent sure that your data is absolutely clean. I've heard a number of people, and read a couple of blog posts recently, where people say, oh, I'm going to use ChatGPT to do this thing with my product descriptions. And I think the idea people are going for is well-intentioned, but in practical use cases it's very hard.
It's very hard to actually use something that is magical in nature.
What I mean by that is: the magical nature of LLMs is not in how they actually work, because they're not magical. The magical aspect is somebody thinking that this technology they don't fully understand is going to perform a magical feat, which is to clean their data for them in such a pristine way that everything will just start working.
So they see a new tool like this, and they don't quite grok what's going on, and they're like: well, we couldn't figure this out ourselves, because we can't standardize our product descriptions to the particular format we need for some sort of NLP-based application. Maybe this magical bot can do it for us. And maybe it can.
You've been doing work recently with prompt engineering, instructing it to standardize text.
Does it get it right every time?
No. I mean, it's pretty darn good, but you're telling it to do something very specific: change formatting, do not change content, fix grammatical errors if you see them, within reason. But when you're talking about standardizing product descriptions for something that's being sold to people, the description that comes from the manufacturer is going to be wildly different, and it's going to depend on who's selling it. What country are they from? What language is natively spoken by the employees at that company who are generating those product descriptions? And what are their standards for product descriptions? Look at some product descriptions on Amazon Marketplace if you want to see just how crazy these things can get.
If you look at any Amazon Prime listing that is sold by Amazon directly, all of them conform to this perfect structure. They are immaculate. You can read through them; they all look very similar; the content is very clear; the grammar is flawless. You just understand: I know what this is for, I know where it was made, I know a bunch of attributes about this product without having to figure anything else out. But the marketplace is just random companies selling their stuff on the Amazon site. Look at some of those product descriptions sometimes. You're going to find grammatical errors so atrocious that you're like, was this written by a precursor to GPT-2? What is this nonsense? There's repeated text in it. The punctuation's all over the place.
You'll have stuff like SKU information in it. And the worst-case scenario, which anybody who's ever worked at a retailer knows: when you're getting data from the actual seller for a large volume of things that frequently change, it's just wrong. The SKU matched to the product description is incorrect, because some human entered it and hit Cmd-V on the wrong cell in their spreadsheet. So what is actually a cantaloupe has a product description for, you know, 30-weight engine oil. If you use an automated magical bot to clean all this stuff up, it doesn't fix your actual root-cause problem. You have to put in the work to fix your data. And then if you want to use an LLM to say, I just want an embedding vector here from this product description text,
and then I want to search for similar items to this? Great. Put it in a vector database. And then, for everybody's top 50 items, you can have another recommender layered on top: hey, here are other products that also match criteria the business is setting. Hey, we need to burn down inventory of this thing nobody's buying right now; we bought too much of it. Or hey, we're going to make a lot of profit if we sell this brand of milk because we got it super cheap through some contract we negotiated, so we need to promote it. Well, if that now starts popping up in the recommendations of everybody who really likes that one type of milk, and we make $2.50 in profit from this one brand but $4.50 from this other brand, the algorithm can automatically determine who to promote that change to.
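A minimal sketch of that kind of margin-aware re-ranking might blend the recommender's relevance score with a per-item profit margin. The blend weight, item names, and numbers below are all invented; in practice you'd tune the weight via an online A/B test:

```python
def rerank_with_margin(candidates, margins, margin_weight=0.3):
    """Re-rank a user's candidate items by blending the recommender's
    relevance score with the normalized profit margin per item.
    margin_weight=0.3 is an arbitrary illustrative choice."""
    def blended(item_score):
        item, score = item_score
        return (1 - margin_weight) * score + margin_weight * margins.get(item, 0.0)
    return sorted(candidates.items(), key=blended, reverse=True)

# Two brands of milk the user likes almost equally; brand B earns more profit.
candidates = {"milk_brand_a": 0.90, "milk_brand_b": 0.85}
margins = {"milk_brand_a": 0.25, "milk_brand_b": 0.45}  # e.g. $2.50 vs $4.50, scaled
print(rerank_with_margin(candidates, margins))  # brand B now ranks first
```

The point is that business constraints (margin, inventory burn-down, promotions) live in a re-ranking layer on top of the relevance model, rather than being baked into the model itself.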
Right. Okay. Yeah. As you guys can see, Ben could keep going for hours, if not days, on this. There are so many edge cases. When thinking about these machine learning models in general, it's really important to ask whether this is actually going to be useful for the business, and to determine how, once it's implemented, users or whoever will actually have their behavior changed by it. So a lot of these questions are really, really valuable in the recommendation setting.
And I know we were trying to keep this one short and sweet, so I will wrap and then see if Ben has any last thoughts. So today we talked about a real-life case study. It's something that I actually came across at work this week. We anonymized a lot of the information, but the underlying concepts still apply.
So first, when you're scoping these use cases, it's important to think about how the algorithm will be used. As we were discussing: is the recommendation going to be surfaced on a homepage where everybody sees it, or in some little window somewhere that maybe 10% or even 1% of users will see? That will dramatically influence how much the business should be investing in it.
On the evaluation side, have good North Star metrics that are tied to real value, so revenue, profit, those types of things, and evaluate them via an online A/B test. On the algorithm side, you can just optimize your loss metric, and hopefully that'll do pretty well. But when you're actually evaluating online, make sure you look at the whole distribution of your users: look at groups of users and make sure that certain users are not adversely impacted relative to others.
Then finally, on the algorithm side, matrix factorization is a great tool, as is looking at similarity between users, or really unsupervised learning of any type. In this example, the organization started off with KNN and then moved to an embedding-based approach. There are many different ways you can calculate similarities between users, the idea being that similar users carry good recommendation signal.
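One simple way to compute that kind of user similarity is cosine similarity over purchase-count vectors. The vectors and user names below are made up; real systems would use learned embeddings and an approximate-nearest-neighbor index rather than this exhaustive scan:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two purchase-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows: users; columns: purchase counts for (milk, bread, dog_food).
purchases = {
    "user_x": [5, 4, 0],
    "user_y": [6, 5, 0],   # similar staples shopper
    "user_z": [0, 1, 7],   # pet-owner profile
}
target = purchases["user_x"]
sims = {u: cosine_similarity(target, v) for u, v in purchases.items() if u != "user_x"}
print(max(sims, key=sims.get))  # user_y is the nearest neighbor
```

Items that user_y bought but user_x hasn't seen yet then become recommendation candidates for user_x.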
And then finally, an algorithm we were chatting about before recording that I've not used is FP-Growth. Basically, it finds people similar to you based on purchasing habits. A couple more tips: when you're working with such a complex set of products, it's really important to understand that the trends learned overall might differ between subcategories within those products, so think about segmenting products into groups.
And also think about, for instance, the natural life of a product. The example that immediately came to mind was fish versus cleaning solution. Fish is not going to last you a year; cleaning solution might last you many years. So the frequency of these purchases is sort of built into their nature.
Anything else Ben?
I mean, fish cleaning solution might last you a lifetime.
Oh man, I don't think I've ever cleaned fish with a solution. Noted. Cool. Well, until next time, it's been Michael Burke and my co-host, and have a good day, everyone.
Ben Wilson. Catch you next time.