Michael Berk:
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I do machine learning and data engineering at Databricks, and I'm joined by my amazing, lovely co-host.
Ben Wilson:
Ben Wilson, I help make your prompt engineering dreams come true by making it simpler at Databricks.
Michael Berk:
And that sounds enticing. Uh, it's cool. So today we have Roman Grebennikov. He got his PhD in computer science, and after graduating he worked as a big data engineer and ML lead focusing on recommendation and search, and is now the principal engineer, there's only one allowed, at Delivery Hero. So Roman, after graduating from school, you jumped into the deep end of the big data world, working in very complex ecosystems. How did you pick up the skills coming out of college to be successful in that realm?
Roman Grebennikov:
Just being curious about how things work. There's no specific type of school you can attend after university where you just get your skills. Just natural curiosity.
Michael Berk:
That's it. Oh man.

Ben Wilson:
I think that's probably the best answer I've ever heard somebody give to that question. I think that's what makes a good software engineer, period, good developer, ML engineer, whatever you want to call it: that curiosity. In my experience, that's what sets the truly amazing people who just excel at building stuff apart from the people who are collecting a paycheck or punching the card, or just building whatever people ask them to build. So, excellently put.
Roman Grebennikov:
Yeah, thanks.

Michael Berk:
All right. Well, I have a question though. So I'm a very curious person by nature and I love rabbit holes. For instance, we just got a request from a customer to explain all the different compression types for Delta, which is the data format at Databricks, and basically what the pros and cons of each are. And I was like, this is amazing, and I could easily sink, I don't know, 15 or 20 hours into just learning how this stuff works. So do you ever experience that curiosity can kill the cat?
Roman Grebennikov:
Hmm, so once I started an open source project and it took me two years, and it's still going on. Yeah, so this happens. I can give you some info about this project; probably you've seen it in my CV. It's Metarank. So I'm a person doing mostly information retrieval, search, ranking, and all that stuff. And the problem I observed multiple times in multiple companies is that everyone is just reinventing the wheel when things are related to ranking, but there's no open source tooling available for that. I've seen exactly the same systems in different companies, written in different languages, doing exactly the same thing: some basic feature computation, separately offline and online, then some inference happening in real time. In the e-commerce world, for example, even the features are the same: different ways of counting clicks, a bit of embeddings, all that cosine similarity, BM25. You just collect everything together, put some machine learning on top to predict the ranking, and call it a day. So I decided, why don't we make an open source project about that? And I thought that maybe in a month I would have something which would work. But it's already been two years, and it's kind of working, but it's just the beginning.
Michael Berk:
Nice. So what is the core infrastructure of MetaRank Labs? What is your core offering?
Roman Grebennikov:
You know, MetaRank Labs is just two guys in a basement doing open source in their free time. So it's not like, you know, a lab with people in white coats doing science; it's more like a computer-science type of science, where you read some paper and think, why don't we just, you know, train our own embedding model? Why not?
Michael Berk:
So you guys are two guys in a basement, but what exactly, if I go and sign up for your product, does it offer?
Roman Grebennikov:
Everyone is trying to treat it as a product, you know, like a startup that went through Y Combinator where you can get SaaS or just pay for something. But you can go on GitHub and get the whole offering in a single tarball. Product-wise, how to compare it to existing things: it's an opinionated way of doing ranking. You ingest your data, what you're actually going to rank, ingest your telemetry, how people interact with this data, and define what is important for ranking, literally just ranking features. It's a wide, rich collection of different templates: click-through rates and all that stuff, user agent parsing, some embeddings, bi-encoders, cross-encoders. And then you train the final model, like LambdaMART, on this offline set of data, and go online with hopefully better ranking. Usually people do it by themselves in some scary Python scripts, but you can have a set of scary Scala scripts instead of your scary Python scripts. But still, a good thing about the Scala scripts is that they're maintained by someone else.
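For readers who want to see the shape of that offline step, here is a minimal sketch using LightGBM's LambdaMART-style ranker; the feature names, data, and session grouping are invented for illustration and are not Metarank's actual code.

```python
# One row per (session, candidate item) pair, a click label, and a
# LambdaMART-style objective; reranking online is just predict-and-sort.
import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 3)             # e.g. [ctr_7d, bm25, embedding_cosine]
y = np.random.randint(0, 2, size=1000)  # clicked or not
groups = [10] * 100                     # 100 sessions, 10 candidate items each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200)
ranker.fit(X, y, group=groups)

# Online: score one session's candidates and sort by the score.
scores = ranker.predict(X[:10])
reranked = np.argsort(-scores)
```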
Ben Wilson:
Very well put. So I'm interested in the process that you took, and I'm glad that you said it's an opinionated take. As somebody who's built a couple of these before in different companies, the core fundamentals are exactly as you said: it's the same thing. It's not like you're re-implementing some bespoke algorithm from scratch and saying, hey, we need to get super creative here. There's only so many ways that people can interact with a ranked, ordered set of material, and you can only collect data in so many different ways, valuable data at least, that informs how people are interacting. Like, do we attach click-through on this, or the amount of time paused in an app when they're looking at items? You can collect data like that, or purchases for e-commerce. But when you boil it down, the implementations that happen at different places in different companies, or even within the same large company in different use cases, the code's different, but the fundamental structure is very similar. What was your process for distilling what that core set of components and features needed to be when you approached a project like this?
Roman Grebennikov:
So usually, if you feel almost physical pain writing yet another take on implementing click-through rate on a window over historical data, that's the perfect time to make it an open source bit of code. For me, this click-through rate was just counting clicks in different ways: offline, online, rolling window, within some data storage, or just in Spark directly. That was the last straw, I couldn't handle it anymore, and then, screw it, I'll just make an open source thing about that. That's why it's opinionated. When you've had a couple of rounds implementing the same thing in different companies, you just see it as the same thing, and if someone asked me to implement it once more, I would probably struggle a lot.
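As an illustration of the feature Roman keeps re-implementing, here is a hedged pandas sketch of a rolling-window click-through rate computed offline from an interaction log; the column names and window size are assumptions, not Metarank's schema.

```python
# Per-item rolling CTR over a 2-day window from a toy interaction log.
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-05-01", "2023-05-02", "2023-05-03"] * 2),
    "item": ["a", "a", "a", "b", "b", "b"],
    "impressions": [100, 120, 90, 50, 60, 40],
    "clicks": [5, 9, 3, 1, 2, 0],
})

events = events.sort_values("timestamp").set_index("timestamp")
rolling = (
    events.groupby("item")[["clicks", "impressions"]]
    .rolling("2D").sum()                 # 2-day rolling window per item
)
ctr = (rolling["clicks"] / rolling["impressions"]).rename("ctr_2d")
print(ctr)
```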
Michael Berk:
Got it. Do you have a list of sort of core components, both Ben and Roman, when you think about developing sort of a search algorithm or a rank algorithm?
Roman Grebennikov:
Can you give an example of which components those might be?
Michael Berk:
Sure. So there's the classic feature engineering, then build a model, then serve the model. But within that, are there subject specific technologies or structures that you always use when working with search or rank algorithms?
Roman Grebennikov:
For the data structures, it's a complicated topic. But for me, if you build something related to ranking, probably 95% of the time you will spend just messing with data: making it properly collected, ingested, cleaned, processed in a way that you can compute your features offline and online. And all the machine learning is like a cherry on top of the pie: you just throw XGBoost at it and that's all the machine learning you did. And surprisingly, it works. So most of it is about feature engineering. Even the cool ways of doing ranking with neural networks can be seen as a way of doing feature engineering. If you're using LLMs by way of cosine similarity between embeddings, it's just a number you put as a feature in the ranking model. It's a quite advanced number computed on a GPU, but at the end it's still just a ranking feature.
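To make Roman's "it's just a number" point concrete, here is a small sketch where an embedding cosine similarity is appended as one more column alongside classic features; all names and data are invented for illustration.

```python
# The GBDT ranker never knows a GPU was involved; it just sees one more float.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_emb = np.random.rand(384)           # e.g. from a sentence-transformer
item_embs = np.random.rand(10, 384)       # candidate items for this query

classic_features = np.random.rand(10, 2)  # e.g. [ctr_7d, bm25]
semantic = np.array([[cosine(query_emb, e)] for e in item_embs])

X = np.hstack([classic_features, semantic])   # shape (10, 3), ready for the ranker
```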
Ben Wilson:
Yeah, I would a hundred percent agree with the component aspect of that. The core of it: even if you're using traditional rank-and-search algorithms that don't use any deep learning, you're using matrix factorization techniques to compute the effective probability that user A is going to interact with item 317. Even when you look at that algorithm and its source code, it's fairly complex. There are a lot of moving pieces in there, and okay, I'm going to have to look at my old grad school textbooks to remember how all this math works. But the thing is, somebody already built it. It's already out there. It's a standard. So that aspect of the project, the ML aspect, is really just: are you crafting the data in such a way that the algorithm can accept it, and are you going through and filtering out bad data, garbage data, identifying users that are effectively poisoning your data set because somebody's trying to figure out how your system works and created a bot network that's spamming repeated patterns over and over. You have to identify that problematic data and remove it. But then, as you said, Roman, some of the cool stuff, like the sentence-transformers library using a BERT model, where you're taking unstructured text that describes, say, a product you're selling, and you want to say, okay, we're out of product B, we sold out of it two days ago, what do we put in front of the customer that has that product as their number one highest probability of interaction? You need to do similarity search, and then you start getting into the non-data-science, more pure software engineering side of things, where you're like, okay, I need a vector embedding database, I need something in memory that I can do a search lookup of a fixed-length embedding vector against. But most of those algorithms that do all of that, and the math behind them, it's not like you're doing pure data science work and implementing that stuff from scratch. There are packages that do that. You just import it and use it.
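A hedged sketch of the software-engineering piece Ben describes, using the sentence-transformers library and a plain in-memory top-k lookup; the model name, catalog, and sold-out scenario are assumptions, not anyone's production system.

```python
# Encode product descriptions once, keep the vectors in memory, and answer
# "what is closest to the sold-out item" with a top-k cosine lookup.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings

catalog = {
    "product_a": "Organic dark roast coffee beans, 1kg bag",
    "product_b": "Espresso roast coffee beans, whole bean, 1kg",
    "product_c": "Stainless steel electric kettle, 1.7L",
}
ids = list(catalog)
vectors = model.encode([catalog[i] for i in ids], normalize_embeddings=True)

def top_k(query_text: str, k: int = 2):
    q = model.encode([query_text], normalize_embeddings=True)[0]
    scores = vectors @ q                  # cosine similarity, vectors are normalized
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]

# Product B sold out: find the closest descriptions in the catalog
# (the item itself will rank first in this toy example).
print(top_k(catalog["product_b"]))
```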
Roman Grebennikov:
Yeah, current data science is often just gluing open source libraries together, hitting deploy, and then hearing some sirens. You know.
Ben Wilson:
I then go back and realize that it's always a data issue. It's very seldom an algorithm problem.

Roman Grebennikov:
Yeah.
Roman Grebennikov:
Yeah, actually, probably that. For the data, it's one of the reasons why Metarank is not in Python. Python is the lingua franca of data engineering and data science. I also write Python, but I'm not proud of it, because you need to build a lot of safety nets all around you to make sure that you are not doing bad things with types, for example. Schema, different validations, you need to do manually; in a stricter, type-safe language it usually comes handled by either the compiler or a library. It works fine with Python when you're just doing a greenfield project: you write some code, it somehow works. And then this code grows to 100,000 lines in a couple of years, and then you decide to do refactoring, and then, oops. You can't just easily refactor because there are too many dependencies. In strict languages, you do some refactoring, then just make it compile, and usually it works. With Python and less type-safe languages, you need to be sure it's covered well in tests and have a lot of integration tests, all that stuff. Tests aren't replaced by a strict language, but you can focus more on the business side of things in those tests, and not on checking that you put a string here and it should emit a string there.
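As one example of the "manual safety nets" Roman means, here is a hedged sketch using pydantic for runtime schema validation in Python, something a Scala compiler would reject before the code ever ran; the event schema is invented for illustration.

```python
# Schema validation you have to opt into with a library, and which only
# fires at runtime rather than at compile time.
from pydantic import BaseModel, ValidationError

class RankingEvent(BaseModel):
    session_id: str
    item_id: str
    clicked: bool

try:
    # A dict with the wrong types sails through ordinary Python code;
    # only this explicit validation layer catches it, and only when it runs.
    RankingEvent(session_id=42, item_id=None, clicked="maybe")
except ValidationError as err:
    print(err)
```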
Michael Berk:
Yeah, Ben, is that your experience that Python requires more tests than typed languages?
Ben Wilson:
Uh, as somebody who did the inverse of what you did, Roman.

Roman Grebennikov:
Ah yeah, I get it. I guess people just aren't writing those tests because, you know, it takes time. Whatever, I just do something, deploy it, and see it in production.
Ben Wilson:
Yeah, I'd say, like, I don't work on ML projects anymore; I work on the tools that people use to build ML projects. The main code base we work on is MLflow, which is 100% in Python. But I did the opposite of what a lot of people do. I started in Scala; that was my first ML programming language, and I was building my projects in that. I also did Python back in the day, very bad Python. But Scala is that safety net, exactly as you said. You get away with fewer tests just by code coverage alone. You don't have to do as many of what I'd call unit-test-plus or integration-test-minus tests. They're somewhere in between the two, where you're simulating large chunks of your code base to have validation that, hey, I'm calling this higher-level API with data that simulates how a user would use it, and I'm not evaluating the functionality of a single function or a single method. I'm instantiating an object and running three or four method calls on it to see how it behaves across all of those steps. It's not a full integration test, but it's definitely not a unit test. You have to do stuff like that in Python, or else you're going to get burned with the whole, yeah, I put a type hint in there and it'll complain within an IDE that, hey, this doesn't really make sense, but you can override that at runtime. There's no control saying, hey, you can't do this. You can write code to enforce it, and you can turn Python into a pseudo-type-safe language by putting loads of boilerplate in your code base. But then you're not getting that compile-time benefit; you're just creating an exception-throwing factory, and it's going to annoy users. Like, hey, this used to support an int32, I'm now passing an int64, why does this not work anymore? Well, we have an assertion here in our code that says it must be this exact type. It's annoying. But in a language like Scala, or anything really based on the JVM, you have to declare all that stuff. You can't skip it. I mean, in Scala you could pass in an anonymous type like just Any or AnyVal, but hopefully whoever's reviewing your code is going to slap your hand if you put stuff like that everywhere in the code, and the compiler should complain.
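A toy version of the "exception-throwing factory" Ben describes, where a hand-rolled runtime type check rejects an int64 that a compiler-checked signature would have flagged much earlier; the function is hypothetical.

```python
# Hand-rolled "type safety" that only bites at runtime.
import numpy as np

def score_items(ids: np.ndarray) -> np.ndarray:
    # Reject anything that isn't exactly int32.
    if ids.dtype != np.int32:
        raise TypeError(f"expected int32 ids, got {ids.dtype}")
    return ids * 0.1

score_items(np.array([1, 2, 3], dtype=np.int32))       # fine
try:
    score_items(np.array([1, 2, 3], dtype=np.int64))   # the annoying failure
except TypeError as err:
    print(err)
```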
Roman Grebennikov:
There are even plugins for automatically slapping your head, hence why you do it.

Ben Wilson:
Yeah.
Michael Berk:
Yeah, there's a startup idea right there.
Roman Grebennikov:
It's already there on GitHub, so it shouldn't be a startup idea.

Ben Wilson:
Yeah.

Michael Berk:
Valid.

Ben Wilson:
Yeah, you can put it in like IntelliJ and have it automatically just tell you, like, I'm not going to compile this because here's some issues that I see in your code.
Michael Berk:
Got it. Yeah. I was going one step further and actually getting a physical hand, but I guess it didn't land real well. Um, so I wanted to also ask Roman, how do you think about serving these search algorithms? Because typically search applications require really low latency. I forget the actual stats, but Google saw massive drops in conversion rate or, um, click through rate with even a millisecond increase in latency. So how do you achieve this super high performance?
Roman Grebennikov:
I actually tried to find that research about the dependency between latency and conversion. It seems to exist, but no one's actually seen it. And I observed another thing. Once, at my previous company, we were doing e-commerce search, so you control the whole pipeline, the whole conversion funnel, and you can measure how things happen. We did an A/B test: we added a thread sleep of 200 milliseconds for a small segment of traffic and tried to measure the effect. We had free-tier customers, and when you're not paying for a product, you might expect that you are the product, so we were experimenting quite a lot on those customers. We just added 200 milliseconds and tried to measure whether conversion would drop or not. It didn't. Everyone has gotten used to everything being so slow on the internet; 200 milliseconds, no one cares, because your single-page web app, which pulls one megabyte of JSON, takes three seconds to load. Who cares about an extra 200 milliseconds in practice?
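A hedged sketch of the kind of experiment Roman describes: injecting an artificial 200 ms delay for a hash-bucketed fraction of traffic and tagging each request so conversion can later be compared between buckets. The bucketing scheme and percentages are made up for illustration.

```python
# Stable hash-based bucketing plus an injected delay for the treatment group.
import hashlib
import time

DELAY_SECONDS = 0.2        # the 200 ms "thread sleep"
TREATMENT_FRACTION = 0.05  # small segment of traffic

def in_treatment(user_id: str) -> bool:
    # A user lands in the same bucket for the whole experiment.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < TREATMENT_FRACTION * 100

def handle_search(user_id: str, query: str) -> dict:
    treated = in_treatment(user_id)
    if treated:
        time.sleep(DELAY_SECONDS)          # inject the extra latency
    return {"query": query, "bucket": "treatment" if treated else "control"}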
Ben Wilson:
There was a study done at Northeastern University, I think over a decade ago, where they actually tried to do a proper statistical analysis of this exact phenomenon. And that was the internet back then, before modern cellular networks that behave basically like broadband internet now. I don't have the exact data in mind right now, but there was a big threshold leap at five seconds and another one at 20 seconds, where that's when people started to get annoyed. And I think that's relative to what they're expecting elsewhere on the internet. If I go to Amazon and I'm searching around, yeah, their response budget for latency on Amazon's main page or their search, they're measuring that in tens of milliseconds for each of the different phases. If you look at total browser load time, they're going to freak out if that page takes longer than about 200 milliseconds to load. But most websites can't afford that infrastructure; it's expensive to run stuff like that. And if similar websites are not that performant, you don't stand out that much, and people are kind of inured to it. Like, oh, if I go to this clothing company's website, I'm expecting it takes three or four seconds for things to load on the page, or the app takes five seconds to load the next page. People get used to it and they don't have gross expectations. So I've always found it kind of annoying when that conversation comes up, and it has with me many times, where somebody says, here's the budget for this response to this REST API for the ML team, you only have 50 milliseconds to return results. Like, how long does the page take to load, in total? Well, the page loads in 270 milliseconds. Great. Can you tell the difference between a page that takes one second to load and one that takes 270 milliseconds? No, it looks almost instantaneous to people. You can tell the difference between one second and 20 seconds. It's definitely worth looking for that study if anybody's curious, because they did a very comprehensive study focused around e-commerce.
Roman Grebennikov:
I guess five seconds is like a threshold: you can just sit and stare at your phone, waiting until the page loads. After five seconds you start doing something else, but still keep the page in the back of your mind. And if it won't load in half a minute, probably you just go, okay, whatever, I'll try next time.
Ben Wilson:
Yeah, you think it's broken. Like the website's just not responsive or some app has crashed or something. Yeah.
Roman Grebennikov:
That's not full of...

Ben Wilson:
All right. So people will reload the page multiple times.
Roman Grebennikov:
So for latency, if we're speaking about search, it's usually two parts: first is retrieval, the second is ranking. Retrieval is usually not that bad if you're not doing bad things in your search, like handling everything on a single Postgres instance because Postgres has full-text search. And for the ranking, it also depends on what you're doing there. But at the end, it's usually a balance between your budget, latency, and precision. You're trying to balance this triangle, choosing two out of three. You can go faster and make results more relevant, but you need to throw a couple of stacks of dollars at Amazon to make it work.
Michael Berk:
Yeah, that's interesting. I remember back at my prior company, Tubi, which is video streaming, like movies and TV shows, we were working on search, and I don't know if they're still doing this because my information is becoming out of date, but we used Algolia. We were prototyping, and essentially what would happen is a data scientist or a data analyst would go in and determine, offline, what rules should be served, and then it would just be a simple rule-based system. So if you type in this query, like XYZ, it will return these five results; if you type in sharks, it will return these 10 results. That was in the spirit of minimizing latency, and it would just be a massive lookup table. It was interesting that that was a very core offering of Algolia, and I'm sure they have other products as well. But there are so many different ways to approach this. It's kind of cool.
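A toy version of the rule-based approach Michael describes, where curated queries map straight to pinned results and everything else falls back to a ranker; the rules and the fallback function are hypothetical.

```python
# Curated query -> pinned results, with a model-backed fallback path.
pinned_results = {
    "sharks": ["shark-doc-1", "shark-week-3", "jaws-1975"],
    "xyz":    ["title-a", "title-b", "title-c", "title-d", "title-e"],
}

def rank_with_model(query: str) -> list[str]:
    return []  # placeholder for the learned-ranker fallback

def search(query: str) -> list[str]:
    key = query.strip().lower()
    if key in pinned_results:          # O(1) lookup, essentially zero latency
        return pinned_results[key]
    return rank_with_model(query)

print(search("Sharks"))
```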
Roman Grebennikov:
I always considered that Algolia were quite focused on latency. And in the community of search engineers, Algolia are kind of a weird company, because they were sharing quite a lot of what they were doing internally before, I don't know, maybe 2016. Then they started growing, and they've kept silent since then. Two years ago they published an article saying that neural networks for search don't work, like, no way. And a year ago they published another article saying that they do indeed work. And now it's like, are you sure they work? Because you just wrote a giant article arguing that, no, the industry shouldn't move in this direction.
Ben Wilson:
Yeah, I think it's already moved.
Roman Grebennikov:
It's already moved and probably that's why they decided to have another take on this topic.
Ben Wilson:
Yeah, that sounds like a marketing department that went off the rails. And then some engineers like actually we need to, we need to revisit this because people are making fun of us. Yeah.
Roman Grebennikov:
Which they're still doing.
Ben Wilson:
Yeah, it's always weird to me when people backtrack like that on tech statements. They're trying to be prescient and saying, this is never going to work for this thing. It's kind of like that Luddite mentality.
Roman Grebennikov:
It depends.

Ben Wilson:
You know, they know this one thing and they don't want to change themselves, but they also don't want anybody else to change or innovate.
Roman Grebennikov:
I guess with Algolia and their commercial growth, when you start growing fast enough, marketing and management won, and the technical people are just there to keep things running. That's probably the reason they stopped posting anything technical about their internals. And for the neural networks, I got the impression that... So Algolia, before all this, used an internal search engine. It's not Lucene, not Elastic; it's a fully internal one, with all the search implemented in C++, so a pretty scary thing to do. They probably spent quite a lot of time optimizing the core, this term search, so they built something that's better than Lucene in the end. And then these neural networks came along, and people were saying, okay, you're already obsolete, no one cares about term search, this GPT is the future. And you can't just pivot that fast into a completely different area, so you start trying to find reasons why you're not doing the pivot, and you start saying, okay, that's probably not the best idea for us and for the industry to move there, because, you know, it's hard to beat good old term search. But it seems that in some areas there are a lot of other opinions on this topic.
Ben Wilson:
Yeah. I mean, that's an interesting topic to discuss just in general as it applies to search. How do you think the landscape is going to look in five years? Like, yeah, GPT-4 is awesome, DaVinci is even better. These massive large language models can do so much, but they're not open source. They're highly closed source because they cost so much money to make, and these companies want to make a profit off of their SaaS offerings. But then you have Meta, who's releasing Llama, putting its core out there, saying, hey, we're not in the business of making money off these things, we want to help the world innovate, and the best way to do that is to open source this model as well as the code that built it. So here you go, world, go nuts. And there are already people retraining that architecture, and the results are even better than the initial models. With that process, and the new research coming out of places like MIT and Stanford for the next generation that succeeds transformers, when these models get smaller and cheaper to actually deploy, do you see search having a paradigm shift globally? Like, an industry where we're no longer thinking of it as tabular data, or vectorized data in the form of features, that we're trying to retrieve relevant information from, but moving more towards natural language, where somebody can use the speaker on their phone in an app and just ask for something as they would ask an employee in a store, and actually relevant things come back from a series of these large models. You know, the front end is Whisper, which has a language head on the end of it, which then goes into a generative model that's been instructed and trained to provide an itemized list of things that are available within a table somewhere. Do you see that as something that's going to take off, or do you think that's just not really going to go anywhere?
Roman Grebennikov:
In some areas, probably yes. But you go to some specific store and ask for support, and you get a response that, as an AI language model, I cannot tell you, because we decided that this topic is taboo. Or, from another perspective, the AI language model will hallucinate about things it doesn't know: okay, you need to do this and that. And should you really? Like, okay, which medicine should I take right now, because I have these symptoms? Okay, take this, highly recommend it. Which you really shouldn't trust.
Michael Berk:
Yeah, it's interesting. With existing web pages there's sort of a disclaimer where you're free of blame; the onus is on the user to figure out what's relevant and what's correct. But with an LLM you're only provided one answer, so now the burden sort of shifts to the LLM. Maybe that could be a blocker in the adoption of this.
Roman Grebennikov:
I've seen ideas of a sort of hybrid approach, where the LLM is not used for the actual retrieval. You still have documents. For example, you're a lawyer and you're looking for a specific case matching your query. It's not like the LLM will hallucinate a specific case for your query, because that would be quite awkward. But it might expand your query. So, like, okay, that's the problem I've got, and it generates a query against an existing corpus of documents, which is later used for retrieval. At the end it's the same documents, no hallucinations, but the retrieval step is expanded or optimized by the LLM. It might get some extra context from you, maybe with a couple of questions, like, okay, in which state, for example, or some other questions regarding the case, and then formulate a specific query. You might not even see this query at the end. You just have a conversation and you get your results. But those results are not generated by ChatGPT.
Ben Wilson:
Yeah, I can confirm: I had a meeting on Monday this week with a team that's trying to use something we just built in MLflow for supporting integrations with these different services, and it's almost exactly what they're doing. They're using the OpenAI ChatGPT-4 prompt, and they basically provide a prompt that says: I want you to do the following things. When somebody asks you a question related to these topics that are relevant to the product we're building, I want you to expand this to at least 400 words, but no more than 1,000 words, in a response that describes as much as you can about the nature of what they're asking for. And it does that. It generates these blocks of text that describe in incredible detail unique attributes associated with what they're trying to search for. Then they get the embedding from that: they take that text, embed it, and now they have a search vector, I guess 384 in length, so basically an array of floats. And then they use that against a vector database to find which documents in this massive repository are closest; their key is that search vector, and then they have metadata that says, okay, where on S3 is this file that we need to retrieve and return. And it'll do that: hey, I want top-k 10, give me the 10 closest documents. It is crazy how good it is. It blows their preexisting thing out of the water. Their big concern is, hey, this is going to be expensive for us to turn on, because for each request to ChatGPT they've got to pay OpenAI. So their desire was: can we do this with an open source model?
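A hedged sketch of that pipeline, assuming the openai>=1.0 Python client: expand the question with a chat model, embed the expansion, then do a top-k cosine lookup against precomputed document vectors whose metadata points at S3. The prompt, model names, and document store are stand-ins, not the team's actual code.

```python
# Expand -> embed -> top-k lookup. Assumes OPENAI_API_KEY is set in the environment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def expand(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Expand the user's question into a detailed "
                                          "400-1000 word description of what they want."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def top_k(question: str, doc_vectors: np.ndarray, doc_paths: list[str], k: int = 10):
    # doc_vectors: (n_docs, dim) matrix built offline; doc_paths: matching S3 keys.
    q = embed(expand(question))
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [doc_paths[i] for i in np.argsort(-scores)[:k]]
```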
Roman Grebennikov:
Probably the answer is yes, but you're going to still pay quite a lot for GPU inference
Ben Wilson:
Yes, the hosting of all of those is going to be big. I mean, they're a big company, they have a lot of money, but they just don't want to be paying a service for something that they feel like they could just pay for a VM to run it.

Roman Grebennikov:
Yeah.
Roman Grebennikov:
For European companies it's also a bit of a problem from a privacy perspective. If it's a search query, who knows what I'm typing into your search box; maybe I'm putting my passport number there or whatever. Should you really send it overseas to the ChatGPT API? It's a questionable thing. Formally you can do this with a lot of different safety nets from lawyers, but usually when product people hear the idea of sending it, they have absolutely scared faces, scared of the interactions with their legal department and how it can be handled properly. So better not. And if you can host it inside your perimeter, that's wonderful. And technically, training, or rather fine-tuning, LLMs is not that complicated a thing in the end. You can figure it out maybe in a week. And if you're a large company, you probably won't have any issues just renting 10 GPU instances for a week. So it's not a big deal; even hobby machine learners can afford that. For me it's a used 4090 here under the desk, but for companies it's just renting it on the cloud. And rent prices are going down.
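For a sense of what "figure it out in a week" can look like, here is a minimal, heavily hedged sketch of LoRA fine-tuning with Hugging Face transformers and peft; the base model, data, and hyperparameters are placeholders, not a recipe anyone in the episode used.

```python
# LoRA adapters on a small open causal LM, trained on a toy text corpus.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "facebook/opt-350m"                       # stand-in for a larger model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

texts = ["Why did the search engine cross the road? ...", "..."]  # your corpus
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tok(ex["text"], truncation=True, max_length=256), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```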
Ben Wilson:
Mm-hmm.
Roman Grebennikov:
On Amazon, not likely, not going down that much. But a V100 is like two dollars per hour right now outside of Amazon, on more obscure, new-age hostings. Still, two dollars an hour, and training is like a couple of hundred hours, so it's manageable.
Michael Berk:
Right, now...

Ben Wilson:
Yeah, one training event and VM hosting of your app for an entire year is probably what you would pay a software provider who's offering that service for a week of usage, at the volumes you're looking at.
Roman Grebennikov:
Yeah, but usually inference is much cheaper. Like, I don't know, the cheapest Amazon instance with a GPU, with a T4, is 300 or 400 bucks a month. So not that much, and it's enough to handle some basic-sized models. Llama is probably okay; something with 65 billion parameters, not okay. But it's doable. I saw a team during a hackathon, they got a data set from the subreddit called Dad Jokes, so horrible jokes, and they fine-tuned Llama on that. It wasn't prompt engineering, but you might start typing something and it tries to make a very bad joke out of your prompt. And it seems to work quite well. But still, it's not covered by grants or anything; it's just people running experiments on their own hardware.
Ben Wilson:
Yeah, it's truly the democratization of deep learning, like effective, useful deep learning. Historically you could take a pre-trained CNN; those were the big things eight, ten years ago. It was like, oh my God, CNNs are amazing, they can classify images. Yeah, they can. But if you're taking an ImageNet-trained model, it's good at generalized classification of images because it was trained on billions of images at Google. When you try to retrain it on your own image sets, you notice there are certain things it just doesn't do well with, and you rapidly learn that, okay, in order to really bring the accuracy up on our bespoke use case, this is going to cost a lot of money to retrain. So you're only looking at big enterprise companies, who have a vested interest in that project becoming successful because they know they're going to make a profit off of it, saying, yeah, here's a $1.5 million budget to retrain this. Not a lot of data science teams have ever heard something like that. But now, with these LLMs and the transformer architecture, even for vision models and audio models, yeah, you can train it on a desktop computer. It's crazy how far we've come. And from what I hear, from the new stuff that Stanford's working on, in the next couple of years you'll be able to retrain that on your cell phone,
Roman Grebennikov:
Oh.
Ben Wilson:
at that size, and not need anything extreme. I mean, you're still going to need things that can process tensors, which phones now have built into their SoCs anyway. But they'll be small enough that the same predictive power and inference capability of something like GPT-3.5 or Llama can fit on a cell phone in memory. It's pretty crazy.
Roman Grebennikov:
There were folks fitting Llama on a Raspberry Pi, so it's already quite close.
Ben Wilson:
Mm-hmm.
Michael Berk:
Yeah, we recently had a hackathon at Databricks, and my team didn't have enough cool stuff to submit a final video, but what we were working on was a transformer-based time series forecast. With one of the smallest GPU instances on Azure, we were able to train in like 15 minutes or something, with a reasonable number of epochs, with hyperparameter tuning and everything. So yeah, the democratization of this is really cool, and hopefully it'll lead to a lot of small sub-products and pieces of innovation built off of the transformer framework. At Delivery Hero, I had a quick question about that: do you guys look to use transformers a lot, or what exactly are you focusing on in terms of cutting-edge tech?
Roman Grebennikov:
Hmm, so it's a pretty large company, so it's hard to speak for everyone, and I don't know everyone; it's like 3,000 people at the developer level. And I'm only in the specific domain of search. I know people are using transformers here and there, but I can't say a lot because it's a publicly traded company and I have some fear of saying too much. But, like, why not; at the end of the day, everyone is using them.
Michael Berk:
Okay, and what is it?
Roman Grebennikov:
For me personally, I see it this way. So, what Delivery Hero is: as Americans, you've probably never heard of it. It's more like an umbrella company of different food delivery brands across the world; I don't know how many there are, like 80, in many countries, but not in the United States. You can think of it as DoorDash, but outside the US. And there's the problem of multilingual processing: you have a lot of languages, and all of them are important. You don't have the imbalance of the internet where everything is in English; not everything is in English, especially if you're talking about food delivery, because everyone wants to get food, not only people speaking English. But building pipelines for all the languages is usually a major pain. English is fine, but good luck with Hebrew or maybe traditional Chinese, because it's just not yet well developed in open source. There are different projects, but they're still not mature enough, and with transformers you get all these things for free. So it's just a good baseline. It doesn't mean it's the best thing in the world, but with a transformer approach to search you already get a very strong baseline, sometimes even better than what Elasticsearch gives you with out-of-the-box analyzers.
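A hedged sketch of that "strong baseline for free": one multilingual sentence-transformers model embeds queries and dish names into the same vector space, so the same cosine-similarity ranking works across languages without per-language analyzers. The model choice and data are assumptions for illustration.

```python
# Cross-lingual matching: Hebrew and Chinese queries against English dish names.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

dishes = ["Margherita pizza", "Chicken shawarma wrap", "Beef noodle soup"]
dish_vecs = model.encode(dishes, normalize_embeddings=True)

for query in ["פיצה מרגריטה", "牛肉麵"]:          # Hebrew and Chinese queries
    q = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(q, dish_vecs)[0]
    best = int(scores.argmax())
    print(query, "->", dishes[best], float(scores[best]))
```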
Michael Berk:
And then what are you specifically focusing on at Delivery Hero?
Roman Grebennikov:
That's a good question. I'm usually just running in circles and screaming. I don't always spend much time on specific products; it's more about convincing other people to do the right thing, especially when they're doing something weird. That's kind of...

Michael Berk:
Yeah, do you have an example for us? Or are you just gonna leave it at that?
Roman Grebennikov:
So, for example, it's more about making other people do the hard work; you're just trying to give a proper hint or ask a question so they can grow. Say you've got latency issues on some specific servers, and you ask: have you ever tried to use, I don't know, the Java async-profiler? And they're like, no, is it a good thing? You just show them, and they see that 80% of their time is spent on a single function call. And they're like, whoa, interesting. Then something happens, and they rewrite that function call, and their service becomes five times faster. No problem. Did I fix the issue? No, they did it. But that...
Ben Wilson:
I'm sorry, that basically sums up the job role of a principal engineer perfectly.
Roman Grebennikov:
Yeah, but you know...
Ben Wilson:
You're there for wisdom.
Roman Grebennikov:
In a month or two you start seeing that this tool is being used by another team, so it just starts spreading across the company. You throw good things at specific people, and people start spreading the knowledge, and that's it. So it's mostly talking, sometimes some prototypes.
Michael Berk:
Nice. Ben, is that your experience in a similar sort of role at Databricks?
Ben Wilson:
I used to be in that role. Now I'm a code monkey. But yeah, a principal in the field at Databricks, that's generally what you do. You might prototype something, you might do a proof of concept, or you might come in to help out a team that's really struggling, but a lot of the time it's like somebody just pasted some code in Slack and they're like, hey, can you tell me if this is good? Like, empirically good? Do you just want some tips on how to make it a little bit better, or are you asking whether I would have written this? And the answer to that is generally no, most certainly not. But here are some tips and some things to think about to make this not suck so much. And the goal of that isn't just fixing that one problem. It's exactly as Roman said: it's to impart knowledge that you know is going to spread virally to the people they talk to. And then, all of a sudden, more and more people get better and better. I think that's the role of any super-senior IC tech person. You're there just to make your organization better by virtue of telling people all the things that you screwed up when you had to learn it the hard way.
Roman Grebennikov:
Yeah, I keep some statistics for myself, like what apps do I use, where do I spend my time, how much do I work. It's not shared with anyone but me, because I don't like anyone knowing what I'm doing. But I found that the most frequently used tool in my arsenal is Slack. That's like three or four hours a day just shitposting in Slack; that's what you usually do.
Michael Berk:
That's crazy, wow. Interesting, well.
Roman Grebennikov:
To relax, I do shitposts in other non-corporate areas, so it's not like only Slack.
Michael Berk:
Got it, yeah. Cool, well, I know we have a hard stop coming up pretty soon, so I'll wrap and we can continue on with our lives. A couple of interesting notes that I heard throughout this call: typed languages typically require fewer unit tests, because the compiler will do a lot of the type checking for you, so you don't need as structured and extensive a test suite. Also, the idea that search latency tanks conversion is, per Roman's experiment, an absolute lie, and any company basing decisions on that should stop right now. That said, research suggests that latency tolerance is relative: maybe 15 years ago, when average latency was five seconds, an extra five seconds might have been tolerable, whereas now it would be completely intolerable. So it's more about relative latency, not only compared to competitors but also to the page load itself. And finally, for search algorithms there are two core components: retrieval, which typically occurs via an API, and ranking, which can be offline training with online inference. So Roman, if people want to get in contact with you, where should they go?
Roman Grebennikov:
Hmm, LinkedIn, probably.
Michael Berk:
Cool. There you have it. Until next time, it's been Michael Berk and my co-host.
Ben Wilson:
Ben Wilson.
Michael Berk:
And have a good day, everyone.
Ben Wilson:
Take it easy.
Roman Grebennikov:
Bye.