MICHAEL_BERK:
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I do data engineering and machine learning at Databricks, and I'm joined by my cohost.
BEN_WILSON:
Ben Wilson. I help build the tools that make your ML life easier at Databricks.
MICHAEL_BERK:
And today we are speaking with Praveen Paritosh. He is currently a research scientist at ML Commons, and he actually just left Google, where he spent over 12 years, also as a research scientist. Prior to joining Google, he worked in industry on large scale distributed data mining, and before that, he was a research assistant at Northwestern University, where he coincidentally also got his PhD. So today we're going to be talking about some cool stuff. Specifically, we're going to be focusing on the cycles of innovation, both for data and ML. So Ben, do you mind teeing up what we've been talking about for the past 30 minutes prior to starting the recording?
BEN_WILSON:
Yeah,
PRAVEEN_PARITOSH:
Woof woof.
BEN_WILSON:
usually before we start recording, we do, you know, like a five minute chat with a guest, be like, hey, what do you want to talk about? And like, who are you? What have you done? And we almost blew our entire time slot because of what we were just talking about with Praveen. It was one of my favorite intros before recording that we've ever had. But we were talking about the hype cycle, and we were also talking about how this isn't the first one. Everybody's aware, everybody's listening in on the newest LLM, large language model, hype that's been built up about this massive new capability. And as we were saying before, Praveen, I don't want to steal your thunder about paradigm shifts in the public perception and industry perception and research perception of what these models are now capable of and what these engines can do. But it's not the first time this has happened. This has happened many times. And then we were also talking about what are the ways that we can look historically at when these hype cycles have started. When did the bubble get to its biggest point? When did it burst, and when did disillusionment cover everybody who pays the bills? In order for us to be able to learn from the past and prevent a winter from happening again.
MICHAEL_BERK:
So Praveen, do you mind sort of walking us through the typical stages of the cycles of innovation?
PRAVEEN_PARITOSH:
Sure. Thank you for having me. This is really fun, just getting started to talk about some of these issues. I'm going to borrow Thomas Kuhn's Structure of Scientific Revolutions as the basic template for these. Thomas Kuhn spent a lot of time doing case studies of major shifts, like the Copernican shift: what took us from one framework, one way of thinking, to another. And what he found was something really interesting. As opposed to kind of a linear arrow of progress, he saw that science was moving in cycles. The cycle would begin with a new framework being offered to explain some phenomena, whether it's, you know, where the celestial bodies, the moon and the sun and the planets, are going to be in the sky, and there is a framework that explains that. And once that framework explains that, it gets people excited, and people want to apply it to more and more problems, and it becomes very important. And so you can go very far, but then you run into some problems, perhaps because this framework is insufficient to explain them. Imagine, you know, the other paradigm shift that Thomas Kuhn was interested in was going from classical mechanics to relativistic mechanics. Classical mechanics explains a lot of what happens in the world when we throw objects and drop glass vases, and we can predict this. However, when things are very small and move very fast, then it fails to predict such things. So there are limits to a paradigm, to a framework, however powerful. So think about a framework as its value being its explanatory power, how many things in the range of phenomena it addresses it can explain. So that's the beginning of this paradigm. Thomas Kuhn was the person who introduced the words paradigm and paradigm shift to talk about this framework that allows us to understand the world in some way. And then it matures, as more and more people are trained in the application of that framework, whether it be classical mechanics or pre-Copernican astronomy, or, we'll talk about frameworks in AI in a moment. Then you have grad students that get trained in those techniques, faculty that have tenure in those areas, and the paradigm has power at this point in time. It has a maturity and an inertia. Then it can continue going. And of course, as we expect, as we broaden the scope of the paradigm, we run into problems. We run into parts where it doesn't quite work well. And in this period of maturity, the paradigm slows down paying attention to these little anomalies. However, Kuhn saw with various paradigms like that, over time these anomalies accumulate, places where the paradigm either fails to explain or where it's too much work to do something really simple. And as they line up, at some point in time there's a feeling that this paradigm is not quite living up to its promise, it must go. But it doesn't go until another paradigm shows up, another framework that explains some of these anomalies that were not being accounted for by the previous one, as well as some of the things that the previous one did. And this is the new paradigm now that hopefully has picked up some of the good ideas of the previous one and has addressed some of its shortcomings. And then the cycle grows and repeats itself. This paradigm too will mature and resist, you know, the attacks on the weaknesses that will crop up. And so he took this as kind of a sobering matter of fact about how science makes progress. 
This is not about accusing people of being too conservative or too slow or being hyped up on pre-Copernican astronomy. I think it is just saying that this is the way our thinking, collective thinking, develops through these kind of Kuhnian cycles.
BEN_WILSON:
And it also has parallels to technology as well, not just to scientific research. For another example, if anybody's curious, think about electric cars. When the first ones came out in the 1980s, maybe even predating the 1980s, there were some attempts made, and with that first generation, people got really excited. They were just coming off of a time period, at least in US history, where there were oil shortages and people were really worried about gas prices and gas shortages. So the promise of electric cars coming out was very appealing. They made them, people bought them, and they were garbage. They didn't have very good range. They broke down, and when they did break down, they were very expensive to fix, or they caught fire and burned cars and houses and people. And look around today: how many Teslas are on the road? It's interesting how the initial hype, I don't want to say hype, the initial promise of a new technology or a new thought pattern that comes out of, hey, we have this idea, let's execute on it, or at least attempt to execute on it, that first pass that goes around isn't always super successful. Maybe people built it up too much and it just didn't fulfill its promises. But then eventually, if it is a lasting idea and the foundation of that idea is good, the execution can actually catch up to that and become so widespread that we're talking about entire states now, you know, passing laws mandating that if a car is going to drive on the roads of this state by 2035, it has to be powered by electricity, like no internal combustion
PRAVEEN_PARITOSH:
Yeah.
BEN_WILSON:
engines permitted.
PRAVEEN_PARITOSH:
And
BEN_WILSON:
So
PRAVEEN_PARITOSH:
across
BEN_WILSON:
it,
PRAVEEN_PARITOSH:
the world. Yeah.
BEN_WILSON:
yeah, it's amazing how the trough of disillusionment happened so powerfully for electric cars that it was basically a non-starter for anybody to even try to do it for almost 20 years, until finally the technology, you know, caught up and a visionary said, hey, let's do this for real.
PRAVEEN_PARITOSH:
It's as if we take rides through these dreams over and over again, right? Like, you know, you can think about Icarus as the story of the first man who tried to fly, and that didn't quite work. And then we keep going, over and over again, until we figure out some critical pieces of science and execution. And it takes a while, but we can dream. But when we dream, we don't quite have those answers yet.
BEN_WILSON:
My question to you, as somebody who's been a researcher in industry and has seen the academic side from within institutions, doing your PhD research in this topic, and then moving into industry to continue research and build amazing things at one of the most successful AI-powered companies the planet has ever seen: how would you see things like that within your own experience, where you understand the core quality of an idea, of some sort of research? You're like, hey, I think this has promise. What are you worried about? Or what were you worried about? And what are you most worried about right now when you see that new idea that people are talking about or trying to work on, and want to make things out of that or build things out of that? What comes to mind first for you?
PRAVEEN_PARITOSH:
There's definitely different pressures depending upon how AI research is funded. So for instance, you know, I have seen AI research that was funded by the government. When I was doing my graduate research, it was funded by the Office of Naval Research, and it was a basic research grant that allowed lots of latitude in what we could do. And I have seen research that is funded by DARPA, which is a little more applied research, that some of the other students in my cohort were getting. And there's already a difference in the structure of these two types of funding. And I think the military even has levels of funding. They have basic research, they have, like, on the battlefield, like your hands might have blood on them, and there's a whole spectrum of funding. And most of the funding right now is in kind of the middle tier, where you still have to show applications. So every six months or a year, you actually have to do some demonstrations that are in service of something that they will then hand off to defense contractors to build. And in industry, it is actually the product. So at Google, clearly search is the product that I worked closely with, and they have metrics about how successful we are: how often are we getting search answers incorrect, how often are we showing the user something that is not relevant. And you want to show how whatever research you're doing, let's say it's AI, let's say it's language modeling, how it can help that product. And that's a very different question. That's a very different way of framing and funding that question. Of course, search is a really valuable product. Billions of people use it every day to get answers to, you know, trivial questions like, where can I watch this movie, to very, very serious questions like whether I should take this medication or call a doctor at this point in time. So it is really, you know, a useful service. So there are different kinds of scientific perspectives, and you could say, well, how would I help this product? Let's say if you had a better understanding of some small slice of such queries, let's say you could understand recipe queries just a little bit better, then you can make improvements in the product that will show up in the customer satisfaction and quality metrics that the corporation already has in place. So that's the kind of alignment with the product that is necessary for figuring out research. You cannot say, this is really interesting to me as a scientific phenomenon, I'll pursue that. You have to do the job of connecting it to how this might improve the product. It turns out that if the product is broad enough, then there is always a place to find some alignment with the tools and the techniques you might have. But you're working inside the confines of that.
BEN_WILSON:
Which is a perspective that, based on my history of interacting with tech-focused companies, is like the polar opposite of how some companies that aren't, like, tech first operate. They might be in an industry that has absolutely nothing to do with technology, so to speak. They might be a logistics company, or they might be in, you know, the medical field, but not necessarily in genome sequencing or stuff that's data intensive. These sorts of companies that aren't tech first, sometimes when they see the promise of a new implementation of research that's out there, like, hey, there's this new Python package that uses this hardware that we have available, we read these blogs or these reports or this book that said that this company solved this problem using this technology. If it worked for them, it's got to work for us. They take this tool, this technology, and install it and start working on it, which is exactly the opposite of what I see from the tech-first companies, where every single one of them says exactly what you just said. It's like, what are we good at? What is our core product, or what is our suite of products, and how do we evaluate and determine if a new technology fits within our business model, and can it actually prove out benefit to our users, our bottom line, our efficiency? Are we going to make our own employees happier? You know, is there some positive benefit of using this tech? If not, you know, there's nothing wrong with doing a research spike on it. Like, hey, the engineers are curious, the scientists are curious, let them take a week to play around with it and figure it out, see what they can learn about it. But it's not, hey, this tech is out there, we have to use this somewhere, so let's force it onto a product or let's try to create a product based on this technology. And that's why a lot of the tech-focused companies are so successful. You know, Google is successful because of its product focus. It's exceptionally good at that. You know, you're best in the world at search, best in the world at all these different technologies that are out there. And I think it's that product-focused mindset that makes it successful. But I've seen that even at non-tech companies, in isolation, with some data science groups that look at problems the same way, where they say, okay, whatever problems we have internally, whatever our internal customers need solved from our team, let's go evaluate technology and let's select what works for this particular use case. As we were talking before we were recording,
that could mean I'm using tech from 200 years ago.
PRAVEEN_PARITOSH:
Mm-hmm.
MICHAEL_BERK:
Got it. So it sounds like academic and government focused funding can be a lot more broad and then product focused funding is very specific and looks to solve issues that make revenue essentially. From your guys's perspective, which style of funding is more impactful for innovation in society?
BEN_WILSON:
Uh, geez, that's a great question. Impactful, do you mean in a positive way or just impactful in any way?
MICHAEL_BERK:
So going back to this innovation cycle, we start off with a framework. We build up sort of anomalies where the framework falls apart, and then we iterate, or if the new framework is different enough, we call it a new framework. Which of those methods leads to the biggest jumps in innovation, minimizing the amount of time between jumps?
BEN_WILSON:
I mean, for the innovation and
funding question to be answered at the same time, and not to sound morbid, but some of the biggest advances, I'll take two examples from United States history in the 20th century of incredible amounts of funding and a public focus on innovation, whether people were aware that it was being done or not: a blank check to a research group to solve a problem. It's amazing what humans can do when they're just told, hey, money's not an object. You figure this out. We're going to get 1,000 of the smartest people that we can find who are specialists in this area, and you're going to do this. For use case one, I don't know how many were actually involved, I don't know if anybody knows exactly that number, but the Manhattan Project. It was pretty impactful, and that was working off of that same cycle that Praveen was talking about, building off of research that started back in pre-Marie Curie days, through her discovering the fact that radioactive isotopes are a thing and ruining her own health in the process. But people were standing on the shoulders of giants that entire time, and you get enough people and enough focus and the need is there. Whether that was actually needed is debatable, but they sunk this money into this and said, hey, this is the way that we're going to end this conflict as quickly as possible. Was it the best way to do that? Probably not, but they built an atomic weapon. They built several of them and then initiated the era of fission, which then rapidly moved to fusion research, which then brought us into the Cold War and redefined the geopolitical landscape of the planet Earth for the next 60 years. To this day, it's still playing out. There's a conflict going on in Europe right now, and the reason that people can't intervene the way they would have back in the early 20th century is because somebody's got a lot of nuclear warheads. It changes political dynamics, and that's all because they put a couple of thousand really smart people together and said, hey, go figure this out. Philosophical discussions aside, that was pretty impactful. The other one that I think of as a positive shining example is the Apollo program and NASA being created as a thing, which is an extension of the US Air Force and research that happened during and following World War II, and people saying, hey, let's go to the moon and let's figure this out. Think about how many hundreds of thousands of people were involved in that, galvanizing an entire industry to invent new technologies and solve problems. And there was a winter that was associated with that as well. You know, after we got to the moon a couple of times, nothing really in the public consciousness or in funding was maintained at that level for very long after that. And then it started picking back up with the space shuttle being created and, you know, oh, we're going to fund all of these trips up to create this international space station, it's going to make peace between these nations. And, you know, it's interesting how that stuff has built on each other over the decades. And yeah, now that I'm thinking about this, this example
PRAVEEN_PARITOSH:
Hehehe
BEN_WILSON:
that you gave, Praveen, about this whole cyclical way that things happen. It's sort of reprogramming how I think about how this stuff works.
PRAVEEN_PARITOSH:
It's really interesting that both of your examples are, you know, somewhat military and war related. In the US, the space research was a publicly funded research program that was televised and, you know, engaged with society. The corresponding program in the USSR was a military program that was not publicized, and I think it was just based on where the funding money was coming from that determined how the people were engaged. But I think it's really interesting that these are both military, government projects with somewhat clearly defined goals. So that's one thing. And they're projects, right? You put a bunch of people in a room together, you give them a problem, and you have an outcome. And you can imagine a whole bunch of other such examples, such as, you know, the nail-biting scientific finishes, like people sequencing the human genome. There are a whole bunch of projects where, at some point in time, there's a bunch of people with a clearly defined goal. We're really good at that. If we have a
BEN_WILSON:
Hehehe
PRAVEEN_PARITOSH:
clearly defined goal and we can measure progress, then we can really iterate. And that works on projects. But I think there is something less clear when there isn't even a well-defined project. Let's say computers are really good at giving me precise answers, but humans are really good at coming up with educated guesses when they don't know the exact answer, and I'm interested in understanding how to make computers make educated guesses. That was my thesis project, and I just use it as an example of something where the criteria are not so clearly defined. IBM, for instance, worked on this project where they competed with the Jeopardy winner, Ken Jennings, and won with their Watson system. But still, that's a well-defined problem. They had 60 years of Jeopardy games. They knew what kind of performance you had to have to actually make it, and what decisions. That is much more precise, so you can know when you have a system that's good enough to get on stage. So let me see if I can go back to Michael's question and see if there is a slightly different answer that is, you know, maybe less morbid. I think what you're asking is really the big question, the trillion dollar question for scientific funding, right? You're saying we're going to fund science, and we have a bunch of different mechanisms for funding science. There's VC-funded science, there is big corporate research funded science, there are startups doing science. Even in industry, there are different types of science. So for instance, VCs have like a 10 year or so time window. Startups might have a single digit year time window. Corporate research, you know, is both applied and more abstract. Both Microsoft and Google, these big companies, have small research wings that take on very long-term projects like quantum computing and things like that. And there is a class of research that might have a shorter time window, like can you show improvement in the product within a year or two? So we have these different mechanisms of funding science, and they have pros and cons. And what you're asking, Michael, is maybe which one is the best, or how should we build a portfolio? Because it might be that we want to balance out, hedge our bets, have stocks and bonds. So there's clearly some outcome to this government funded basic research, or even corporate funded, like Microsoft Research and Google Research and many of these places, and famously AT&T Bell Labs. They can create small spaces of pure basic research and playful, curiosity driven scientific exploration. And that is a different kind of gamble, right? I would put that more in the stocks category than the bonds category. If you luck out in that gamble, you would come up with information theory like Claude Shannon did at Bell Labs, which ends up becoming the framework for understanding everything for the future of, you know, communicating information and storing data. And that won't happen in another model of funding that doesn't give that level of freedom and space for curiosity. But then if you have a clear product, if you have a precise goal and you can actually solve that goal, that has value too. And that requires doing a different kind of research, the product-focused research that we were talking about. 
And sometimes what can happen in this basic research is that the rubber might not hit the road. These ideas might be good, but they might not actually improve any real things in people's lives. So it also depends on what aesthetic predisposition you have, what matters more to you. There are scientists to whom understanding phenomena matters more, and there are scientists and engineers to whom actually building products and improving people's lives matters more. And so that's also a personal choice, I would say. So, I don't know. I would say that if you go back to the Kuhnian paradigm cycle, most of the money, most of the funding, goes into the middle stage: once the paradigm has been formed, to grow it, to apply it to solve different problems, to keep building this structure. You have the foundation, now you can build ADUs and build on top. That is most of the work, I think. The day-to-day bread and butter of scientific and engineering work, whether it's in academia or industry, is working within the paradigm, and either extending it to new areas that it hasn't been taken to before, or fixing problems that it has. And that's, I would say, maybe 70% of research. That's what makes things tick. And then these paradigm shifts, the new paradigms, come out. And that doesn't take a lot of money. What it takes is, you know, a secure, safe space to explore. And you have to take high risk: out of hundreds of such ideas, maybe only one works out. And that's the advantage that these disruptors and shifters have, right? They don't necessarily need as much funding. With like 10% of the funding, you can shift the paradigm. But to keep the paradigm, you probably need like 90% of the funding. So it still doesn't answer your question of which one is the best, but I think that we have to accept that we need all of these. And I would say that maybe we should shift a little bit of our portfolio to slightly higher risk science. I think that, you know, we do the day-to-day maintenance of the paradigms, but figuring out problems with the paradigm, integrating it with other paradigms, I think this is the stuff that gets underfunded. And so I would guess that if we shifted a little bit more of the portfolio in that direction, we should see an increase in impact.
BEN_WILSON:
Yeah, I'd like to just 100% agree with what you just said and give an anecdotal story for that, one that has nothing to do with software or ML. A company I used to work for created a new paradigm for light emitting diodes. And when you went down onto the factory floor, which is a clean room, I think it was like a 10 micron class or something, you go down in there, and I was working as a process engineer, part-time working with R&D. So they would come up with some sort of new structure to grow in the metal organic chemical vapor deposition chambers and say, hey Ben, we need you to try to hit this target wavelength in nanometers for the LED. So tune the recipe to get that. I'm like, where did you come up with this structure? And they're like, don't worry about it. Just build it. And then, you know, over a long enough period of time, I started working with them more and I was like, oh, I'm going to try this thing. And they're like, yeah, go ahead. You can do whatever you want. It's just a test. But they did this interesting funding, which is very similar to what you just said, Praveen, which is they didn't do like a 90/10 split of R&D funding, like, hey, 90% is just product focus, just try to make these cheaper or brighter, you know, use less voltage. They didn't just take 90% of the money or 90% of the staff and do that. It was much more of a 60/40 split, but in the opposite direction of what most people would think, particularly people who are coming from a tech background. It was 60% greenfield research, where the scientists that were in R&D were allowed to just play jazz. They could figure out whatever they wanted to do, and I'd say their scrap rate was probably 90%. Whereas on the engineering side, we were making product: what they had invented two years before, we were putting into production. We were eking out and focusing on, hey, I need to get 5% higher yield on my next run. Or, hey, I had too much of a color shift in this run, I need to figure out how to change my gas flow rate and temperature stability. Or, hey, we need to do something different with the recipe to make this more stable. But on the R&D side, people are like, hey, what would happen if we just took that structure and copied it four times? And if you asked the engineers, they'd be like, there's no way that's going to even light up. What are you doing? They do it anyway. And they put it under the test bench and it basically pegs the tester so high that it looks like nothing is coming out, because it saturated the sensor. It was that bright. So they were doing stuff like that where they were learning properties of physical matter, because they were allowed that breathing room, that space to just do pure research. And I remember asking one of the lead scientists one time, what's the difference between doing this at NC State in the materials science lab and, you know, doing this here? They're like, here I just have, you know, better tools and I don't have to ask people for money. They pay my salary and I can do whatever I want. And that was one of the guys that had invented that LED structure that you can buy at Home Depot in light bulb
PRAVEEN_PARITOSH:
Well.
BEN_WILSON:
form. And the only reason that was a thing was because the whole team was allowed to do that with that funding structure. And I've seen that at other physical engineering and physical sciences companies I've worked for in the past, where a huge budget, or a large portion of the research staff at the place, was allowed to take a lot of their time to make those connections and figure that out. I'm curious, from your perspective, how often you think that happens in the tech world.
PRAVEEN_PARITOSH:
It's interesting. I think that's probably quite variable across the whole sector, and across time as well, because we have these periods of excitement in which more doors are open for this kind of greenfield work. Before I try to answer your question, I would say that there are some challenges in evaluating it. That's one thing we have to pay attention to. As a funder, you want to invest in a project and you want to be able to know that, year over year, quarter over quarter, we are getting closer to the goal that we have set out for ourselves. And it's much more well defined if you have a product, for instance, as we were talking about. You could ask, have we improved customer satisfaction? And you could have a way of measuring that. And I think it's less defined with this greenfield, open-ended, curiosity driven research. Like, how do we know? We might develop something; you know, the famous story of the guy who invented the Post-it glue at 3M. He was working on inventing a kind of glue, and what he came up with was, like, problematic. He classified it as a failure, right? Because it only stuck to something when it was at a certain temperature; otherwise it would just peel off. And he was just sitting with that for a while, and it turned out that it was actually one of the most successful products that 3M has ever made. And it took a while between the discovery of that glue that was classified as crummy glue and it becoming the most successful product that 3M had ever invented. So it is one of those tricky things: the payoff might not be immediate, the progress might be hard to measure. You might have done something cool, but how do we know what its value is? And so there are reasons why funders hesitate. And there are some funders who are more familiar with this kind of pattern, like the government; the military has funded more long-term projects. And the corporations often are more anchored: they know their product, they know their customers, they want to make improvements in that. So a lot of the research that happens, the bread and butter of research at tech corporations, I would guess is driven by the products that they already have. However, there is a significant fraction of tech research on products that don't currently exist, with the idea of creating new products in new sectors that don't quite exist. So right now we're happy with the search engine, but we don't have a search engine that we could chit chat with. And if somebody can come up with a search engine that we can chit chat with, that's not a product today, but that could be a product tomorrow. Or the self-driving car: it's not a product today, but if the promise were to be delivered, it will be a product tomorrow. So I think there is some investment, because many of these tech companies have large caches of cash, and there is definitely some desire to invest, going back a long way. I think Microsoft Research was a premier research center much before these new companies came into the fold, and even before that we have the history of Bell Labs. So tech was able to create, you know, these pockets of greenfield research, and that history, that lineage carries: some of the leaders from Bell Labs were my leaders at Google as well, people that came out of that kind of research environment. 
So there is a fraction of research that is more open-ended, looking to create new products, create new sectors. And the challenge is, you know, it's hard to measure that and fold that within a value system that is very product-guided. So it often ends up being, like, Herculean efforts by highly motivated people, as opposed to something that might seamlessly fit into the overall strategy. And it is hard, we should acknowledge that. It is really hard to evaluate this kind of basic research and its value, because sometimes it takes a long time for the value to play out. With government-funded public science, we are fine; as the public, as funders of research, we are happy to wait and find out. The research in mRNA was being done decades ago, and we benefited from that because COVID happened and mRNA was able to create a vaccine. And now we are having a lot more research in mRNA that might demonstrate its efficacy against HIV and other diseases that we have no idea about. But it takes decades of prior work. Like it wasn't,
BEN_WILSON:
Mm-hmm
PRAVEEN_PARITOSH:
and like the people who invested in that decades ago, they had to wait a long time to see a payoff, if they even had a payoff from that research, right? And so it's really hard for funding agencies; I wouldn't blame them. It's really hard to fund, measure, and manage these kinds of basic greenfield research projects.
MICHAEL_BERK:
And that segues beautifully into my next point. So Praveen, you left Google last month to work at ML Commons, and their tagline is sort of machine learning innovation to benefit everyone. And they have the big players in industry and academia. So how does ML Commons think about these types of problems? How does it reward innovation?
PRAVEEN_PARITOSH:
It's really interesting. ML Commons is trying to look at the landscape of the work that is being done, and it's a consortium of people from all of these companies. So it's a great place. It's a public organization; I welcome you to join. You can sign up for different working group meetings and attend community meetings and see what they're doing. And it's been a really nice experience: for the last few years I've been working with ML Commons, running a working group within ML Commons that is focused on data and the measurement of data. And what it allowed me to see, first of all, is all these different voices. The voices of people from different parts of the industry, so how do people at Meta and Microsoft and Nvidia think about the same questions that I do, but also the voices of academic researchers and grad students working on the same topics. And working together made us realize, we came to kind of a shared conclusion, that each one of these entities has some particular focus, some particular agenda, but they are all in the same boat. We all overlap in our interests and the larger problems that we're trying to solve. And so if we look outside of our particular context, what could be the general tools? Could we put out datasets? Could we put out test sets? Could we put out measurement systems? What could we put out there as commons, as publicly available infrastructure and tools, that would help this process, help this shared goal that we're working towards? And one of the things we identified is, as you will notice, the current wave of machine learning is learning, which, you know, implies learning from data, right? Learning from the past. The foundation of this machine learning today is, fundamentally, data. And part of the success we have had with the same techniques is due to having very, very large datasets and having large compute possibilities, even with the same models and algorithms that were discovered decades ago. And this is what is unprecedented: we have lots of data now. However, a lot of this data is unexamined, what we've called found data. You know, you scrape the web and what you get is found, right? You discovered this stuff and we didn't curate it. We don't know what it contains, and we don't really have enough time to really examine the entire web crawl. Likewise with images: most of the state-of-the-art computer vision models are trained with publicly available images, and one of the big parts of that is the Flickr dataset, which one of my colleagues helped release long ago. And you know, that is a specific dataset of people who were taking creative photographs and making them publicly available. And that just feeds into this foundation. We even call these models foundation models. So what is in our foundation is very unexamined at this point in time. And what we at ML Commons think we must do is seriously examine this foundation, because we are seeing that there are downstream problems showing up as a result of the problems in the foundation. The first problem is that sometimes, you know, your solution just does not do the job. So for instance, in ML and healthcare applications, you can have a promising technique, you can have a promising dataset, but until you can get FDA approval, you cannot actually commercialize it. And we 
have seen a lot of difficulty in getting to that phase of deployment in many products, especially healthcare. The Lancet, the premier medical journal, wrote an article last year: there were 450 ML-and-AI-in-healthcare papers that they received last year, and zero of them were accepted for approval.
BEN_WILSON:
Wow.
PRAVEEN_PARITOSH:
And so we have these fields, such as the medical field, with, like, the highest bar of scrutiny. We have journals, we have clinical trials, we have bodies like the FDA, we have lots of structures in place to stop people from making wild claims, right? And so you have to demonstrate that it actually works. So that's one place where we're seeing that these ideas maybe don't make the last mile of actually working. So that's one problem: your investment doesn't actually get to the point at which it can make money. The second problem is that the system has picked up things from the data. Data, by definition, is the past. And so if you build a system that has learned from the past that nurses are female and doctors and surgeons are male, that is a statistically true fact about the past. But we don't want to live in that world. We want to live in a world where it's equally possible for professions and genders to be in any combination that people might want. And these associations are carried over by the system. So a system that is doing autocomplete, if you say, I met the nurse, it will automatically come up with something like, she said something. And that pronoun is a baggage of the past that we have learned from the data. And there are other problems, like facial recognition systems having different false positive rates for African Americans than for other demographics. And these all have consequences. And so I feel that where we are right now is we have built fantastic, powerful machinery. These models are really powerful, really sophisticated. And the way we have scaled this GPU hardware architecture to be able to run these things at scale is amazing. So we have built these amazing monuments, but they're built on a foundation that is unexamined. And the cracks as a result of that are beginning to show up. So what ML Commons and my group, DataPerf, want to do is go back to the lesson that Lord Kelvin taught us, which is: if you can't measure it, you can't improve it. And we need to measure data. All this time we've been measuring models, we've been measuring machine learning. And we have started a bunch of efforts that are around measuring data quality by itself. A couple of years ago, we wrote a paper called Everyone Wants to Do the Model Work, Nobody Wants to Do the Data Work. We presented it at CHI and it won the Best Paper Award. Andrew Ng held it in his hand when he was defining this data-centric AI movement, as an example of a paper that inspired him. But that was the problem we were reporting, the phenomenon of what's happening. Of course the models are cool; that's where the rocket science is. And the work we put into data is treated like a kind of second-tier work. It's not where the rocket science is. It's where the ops is. It's where you contract outsourced workers. It's where you put things into spreadsheets. But that is the foundation, and I think an unexamined foundation is going to cost us a lot. And so what we are doing is a bunch of challenges, a bunch of ways to engage that are just about data. One of the things that we have done is launch dataperf.org, our website where you can participate in a whole bunch of challenges that look like Kaggle challenges, the typical machine learning challenge where you have a leaderboard, you submit a system, and if it solves the problem better, you get a higher position on the leaderboard, except what you submit is data. The model is fixed. And 
The only thing you can manipulate is some aspect of the data. Maybe you sample the data differently, maybe you create new datasets, maybe you create new test sets. Any aspect of data is what gets you a position on the leaderboard. So in the simplest case, consider the training set task. We have a task right now where there is speech data, a big public noisy corpus that was released by Google and a bunch of other people a few years ago. And you have to build a classifier for a word like hello and find the 500 best examples in the data. The dataset has millions of examples, but many of them might be noisy, not clear. And the classifier is built on our side. So you submit the examples; the dataset is the only thing you submit. And the winner will be the people who can identify the best examples for an arbitrary label. And so we are trying to do these experiments, engage people in a community where we are really just trying to examine and scrutinize and characterize data, because we believe that's where our blind spots are.
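To make the shape of that kind of submission concrete, here is a minimal sketch of one possible baseline for a fixed-model, pick-the-best-training-examples task. It assumes the candidate examples arrive as a pandas DataFrame of precomputed audio embeddings with claimed labels, plus a small trusted seed set; the column names, the seed/candidate split, and the agreement heuristic are illustrative assumptions, not the actual DataPerf interface or scoring.

```python
# Baseline heuristic for a data-centric "find the 500 best training examples"
# challenge: train a simple classifier on a small trusted seed set, score each
# candidate by how strongly the classifier agrees with the candidate's claimed
# label, and keep the top k. Low agreement suggests a noisy or mislabeled clip.
# The DataFrame columns ("embedding", "label") are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def select_top_k(seed: pd.DataFrame, candidates: pd.DataFrame, k: int = 500) -> pd.DataFrame:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.vstack(seed["embedding"]), seed["label"])

    X = np.vstack(candidates["embedding"])
    proba = clf.predict_proba(X)

    # Probability the model assigns to each candidate's *claimed* label.
    label_idx = np.searchsorted(clf.classes_, candidates["label"].to_numpy())
    agreement = proba[np.arange(len(candidates)), label_idx]

    return candidates.assign(score=agreement).nlargest(k, "score")
```

A real submission would likely layer cross-validation or influence-style scoring on top of something like this, but the core idea is the same: the only lever you have is which rows of data you hand back.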
BEN_WILSON:
I mean, that's a two birds, one stone solution, I think. If you get enough people gamifying that, you're going to get highly curated datasets. You're going to get documentation and blog posts around what that process was like for somebody who won that competition, which is a greater contribution to the general knowledge in this domain, which I think is awesome. But if you get enough people doing this, you're creating a shortcut that an experienced and seasoned data scientist or ML engineer had to learn the hard way from a decade of industry work. Everything you just explained about data quality problems, anybody who's put, you know, 50 production systems, models, into production and left them running for years on end knows this. That's where you spend 95% of your time. It's not on the models. Usually you automate that, like, okay, automated hyperparameter tuning, I don't care what the model is doing, it's fitting. What I care about is the data. And my code coverage and test quality and CI is usually where you focus a lot of that time, but more like 80% of that extra time is just data. That's what I used to spend most of my time on: trying to understand or answer those very difficult questions that you get from the business when something is out there in production and somebody notices that there's something wrong with the model. It's not always what people would think, like, hey, the model predicted something crazy that doesn't make any sense. It's usually, hey, I think your model has a problem with it, because it's classifying fraudulent behavior on our transaction system from three separate zip codes in the United States at rates way higher than anything else. Why do we have these outliers, and why do we keep on getting customer service calls because of people's transactions getting declined? Then you look at it and you're like, well, historically there was slightly more fraudulent activity in these zip codes. Then you look at the underlying data and aspects of that, and then join and figure out that, hang on a second, we're augmenting our data with US census data in order to get more features in here. And that US census data is hyper biased, to the point of almost intentionally creating an economic hardship zone in this one part of this city. You look at the demographics and you're like, wow, my model's racist, because this aspect of the data reflects our common history of terrible things that we've done to one another under the guise of government or social contracts, and that perpetuates into something as silly as a fraud detection model or something. You spend so much time reverse engineering that and figuring out, what do I need to do to this data to eliminate that? I don't want that signal that's just based on where somebody has their house or their shipping address. That's completely unethical.
PRAVEEN_PARITOSH:
Yeah.
BEN_WILSON:
I want to detect the actual signal that this person is committing fraud based on their individual activity. That's going to say, I'm looking at this person as a unique human being, which is the ultimate judge of fairness. I don't see anything but the fact of who they are as a human. So it's a really interesting idea.
PRAVEEN_PARITOSH:
Yeah,
BEN_WILSON:
And I think more
PRAVEEN_PARITOSH:
it
BEN_WILSON:
people
PRAVEEN_PARITOSH:
is.
BEN_WILSON:
should go and try this out, particularly junior people or people that are in academia that want to enter into this and continue on research in industry, I think it's super important to learn those skills.
PRAVEEN_PARITOSH:
Yeah, yeah. I think, you know, the point you made is very much under-acknowledged. I think it's the lived experience of somebody who has worked with machine learning and data: that's where the bulk of the time goes. But I think that's not the perspective out there. The perspective of, like, you know, young, starry-eyed grad students is to focus on the silver bullet magic of the model. They don't realize that 90% of the work is in data. And likewise with the way the investors and funders think about it as well. A lot of the focus is on the kind of rocket-science magic that is attached to the Greek letters and the math of the modeling, while the other stuff is treated like lower-caste work that is not given that degree of respect. And so one of the things that we're working on at ML Commons, with Ce Zhang, who is a professor at EPFL, and a whole bunch of academics, is we're trying to create a top-tier venue for publishing research on data. Right now, most faculty will say that they wouldn't even advise their best students to work on data, because that will not get them a good job. They would advise them to work on sexier models, because that is what is going to look good on their CV and resume. And the people who are really worried about this and still want to work on data said, let's get together and create a top-tier venue for data. What we're seeing increasingly at most of the top conferences, like NeurIPS and ICML and ICLR, is that we have a bunch of workshops that are data focused, data-centric workshops. We organized the first NeurIPS data-centric workshop with Andrew Ng in December 2021. Something like 150 people showed up and 150 papers were submitted. There's a lot of interest. So there's been a whole bunch of people, you know, noticing this, grumbling about this, worrying about this, and the way they found these venues was through some of these little workshops and things that we created that brought these people together. So we had kind of a nascent community out there, and right now, with Andrew Ng's kind of excitement and support about it, data-centric AI is kind of the moniker that puts all of these people together. And we are pushing NeurIPS, which now has a datasets track, which is specifically there to make sure the dataset papers get published, but it is still, mind you, considered a little less than the main track. And so this is ongoing work for us to figure out, because this is important. It's 90% of the engineering. It's actually probably 80% of the research as well, but that's where we're not paying attention. So what we want to do is somehow catalyze that. So with these little experiments, like the data-centric challenges: on dataperf.org you can not only participate in these data-centric challenges, you can also create a data-centric challenge of your own. So we are trying to engage the people who care about data, to use these incentive structures of competitions and contests and see what we can learn from that community.
BEN_WILSON:
I have one final interesting question for you, from the perspective of what you do right now and the focus on sort of data integrity and data evaluation. When you look at tooling focused around the sexy models, and I'll be the first to admit that's what I do day in and day out, working on making that easier for people with MLflow: abstractions over some of this stuff to make it simpler for a user, so that they don't have to read through hundreds of pages of API documentation to understand what's going on. They can just solve their problem in an easier way, and we make tools to make that simpler. There's not that many tools out there, or there's just not as many people, focusing on that data aspect. Do you see that as something that is going to happen? Where people, you know, look at the difference between your old company's massively powerful and popular framework, TensorFlow, and what pre-1.0 TensorFlow looked like versus the post-2.0 merge with Keras, where you have these higher level APIs that make building models, even very complicated models, much simpler. You don't need, you know, a PhD CS background in order to craft a custom bespoke LSTM that solves a problem. You can do it with Keras in a couple of weeks. So with simplifying the complex cool stuff that people have been focusing on for the past decade, do you see, in the future, inevitably, with some of this work that you and your organization are doing, do you see more tooling, people building the de facto standards for that?
PRAVEEN_PARITOSH:
I think so. A lot of these problems we deal with in ad hoc, inconsistent ways because there is no standard tooling available. So if you have a dataset, you can dump it in whatever form. You can have a spreadsheet. You can have a JSON file. You could have triples. It doesn't matter what encoding, because you're just saving it for your own use. The day you decide that, okay, I want to save this dataset in a format that can be read by anybody else at all, now you're certainly talking about having a schema that is self-explanatory enough that if somebody opens this file, they'll know what rows and columns to expect and how to read that data. And then you're talking about having a storage format that is compatible and can be used by other people and other operating systems, other environments, and so on. And you can keep going like that for all these parts that are not the model. We publish the architecture for, you know, BERT and all of that, but all the other parts are just secret, right? My dataset, the way I store it, is up to me. So one of the projects that ML Commons is also working on, my colleague Kurt Bollacker is leading this working group, is trying to come up with standards for just publishing datasets, because there is no standard at this point in time. And datasets can be very different, right? Some datasets can have images, some datasets can have video or audio, some datasets are just text or numbers. So that itself requires having the tooling to take your dataset, convert it into a standard format, and publish it. Because that doesn't exist, the ability to search for datasets and use external datasets is really bad, and everybody has to repeat the same things for data cleaning and all those first parts of handling the data. I believe that we need better open, shared tooling for many of those steps. And that's maybe what is resulting in this debt, this repeated work of lower quality, instead of doing it once well.
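As a rough illustration of the self-describing dataset idea Praveen is pointing at, here is what shipping a small machine-readable descriptor next to the data could look like. The field names and layout are hypothetical, invented for this sketch; they are not the MLCommons working group's actual standard.

```python
# Sketch of a self-describing dataset: a small JSON descriptor that travels
# with the data, so a stranger (or a dataset search index) can tell what the
# files are, what the columns mean, and how to load them without guessing.
# All field names here are illustrative, not a real published standard.
import json
import pandas as pd

descriptor = {
    "name": "keyword-spotting-clips",
    "description": "Crowdsourced spoken-word clips labeled with the keyword spoken.",
    "license": "CC-BY-4.0",
    "files": [{"path": "clips.parquet", "format": "parquet"}],
    "fields": [
        {"name": "clip_id", "type": "string", "description": "unique clip identifier"},
        {"name": "label", "type": "string", "description": "keyword spoken in the clip"},
        {"name": "duration_s", "type": "float", "unit": "seconds"},
    ],
}

with open("dataset.json", "w") as f:
    json.dump(descriptor, f, indent=2)

def load(descriptor_path: str) -> pd.DataFrame:
    """Load the tabular files listed in a descriptor instead of guessing formats."""
    with open(descriptor_path) as f:
        meta = json.load(f)
    frames = [pd.read_parquet(entry["path"])
              for entry in meta["files"] if entry["format"] == "parquet"]
    return pd.concat(frames, ignore_index=True)
```

The point isn't this particular schema; it's that once the schema is explicit and machine-readable, dataset search, validation, and shared cleaning tooling all become possible.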
BEN_WILSON:
And also stuff like bias detection frameworks and fairness evaluators. When you're talking about a tool that can go in and look through a data set and say, hey, I noticed that this is Boolean, and I don't have a concept of what that Boolean data is, but I noticed that when evaluating against your target variable, there is a massive discrepancy here. And then if I combine that with this other Boolean, or this other nominal value, I see this one extreme outlier. Not to say, hey, I need to filter this out and have automation to do that. Just saying, dear data scientist, are you aware that your data contains this pattern? Is this what you're expecting? And that sort of tooling, I've seen that, and I've helped create stuff like that at previous companies, because I'm lazy and I hate manually doing that work many times over. It's not anything that would be open source, because it was highly specific to that one company, or, like, the nature of the data, exactly as you said, the formats. Like, oh, these datasets are stored for some reason in CSV files, and then we have Parquet over here, and then this is our Oracle database. So we build the connectors and the readers and the serializers and deserializers for all of that. But the core algorithms, it's always just hitting up, you know, packages like statsmodels, saying, hey, here are all my statistical evaluation algorithms. They're in open source, but I can count on both my hands the number of data scientists I've interacted with at companies where I've said, hey, can you show me your EDA? Just show me the notebook, your Jupyter notebook or Databricks notebook, whatever. Show me that analysis. And I either get these looks of panic and concern, like, what do you mean, EDA? I displayed the data frame for you. Like, yeah, but it's 13 billion rows of data. You're not
PRAVEEN_PARITOSH:
Hehehe
BEN_WILSON:
going to scroll through that. You know, nobody's got time for that. And they're like, well, I plotted X by Y. Like, no, no, what are your actual, you know, entropy calculations? You start cutting this data in different ways against your target: where are all those plots? Did you do that analysis? What's the score? And it's amazing to me how many people are putting things into production right now that just haven't been exposed to that. And I think it's because of exactly what you said: it's not seen as sexy work, but it's probably the most important thing you can do as a data scientist. In
PRAVEEN_PARITOSH:
Yeah,
BEN_WILSON:
my opinion.
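As a rough sketch of the kind of scanner Ben is describing, one that only reports suspicious patterns against the target and never filters anything automatically, here is one possible shape in Python. The column names, threshold, and toy data are all made up for illustration.

```python
# Sketch of a "dear data scientist, did you know?" scanner: it reports
# low-cardinality columns whose groups have very different target rates,
# but never changes the data itself.
import pandas as pd

def scan_for_target_discrepancies(df: pd.DataFrame, target: str,
                                  max_cardinality: int = 20,
                                  gap_threshold: float = 0.3) -> list:
    """Flag low-cardinality columns whose groups show a large target-rate gap."""
    warnings = []
    for col in df.columns:
        if col == target or df[col].nunique() > max_cardinality:
            continue
        group_rates = df.groupby(col)[target].mean()
        gap = group_rates.max() - group_rates.min()
        if gap >= gap_threshold:
            warnings.append(
                f"Column '{col}': target rate ranges from {group_rates.min():.2f} "
                f"to {group_rates.max():.2f} across groups. Is this expected?"
            )
    return warnings

# Example usage on a toy dataset with a binary target.
df = pd.DataFrame({
    "is_premium": [True, True, False, False, False, False],
    "region":     ["north", "north", "south", "south", "east", "east"],
    "converted":  [1, 1, 0, 0, 1, 0],
})
for msg in scan_for_target_discrepancies(df, target="converted"):
    print(msg)
```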
PRAVEEN_PARITOSH:
yeah, we are, you know, there's, there's a bunch of things behind that. Another analogy quickly that my friend, Danny gave me once that he, a few years ago, he bought a house in San Francisco and he was telling me that, well, he didn't have time for inspections because if you tried to get inspections, then he wouldn't be a competitive offer. And so you just basically have to waive inspections. So I was like, wow, that's like really crazy. And he was like, that's what happens in a hot market. If the market is hot, then people don't have time for inspections and scrutiny. And I was like, well, then what happens if you move in and there's a problem? And he said, well, I can put it back on the market because the next guy won't have time for inspections either. So there is a downside of like, you know, we are in a hot market when it comes to AI and the downside of the good upside is, There's lots of exciting things happening. There's lots of cool projects. There's lots of funding for interesting research. But the downside is that we have lost a little bit of appetite for inspection and scrutiny. And in particular with this data-centric models, this is all built up on top of data. We are in a very serious problem. Even things like, you said EDA, I think maybe not too many people even know what that means. I think Tukey's book. sits on my shelf, like, you know, because it's so important, like exploratory
BEN_WILSON:
Got mine over there.
PRAVEEN_PARITOSH:
That's what you need if you're going to learn what is in the data. I think we are really treating the data as a black box. We say the model is a black box, but the model's mathematics are actually pretty well understood; what's really the black box is the data, and I think we need to do a lot more hammering, a lot more work, here. I would say there are a few hundred researchers over the last five or so years getting more and more focused on this, and there's a lot of work coming out every year. As I said, there are new venues: we are organizing a workshop at ICML, which is in Hawaii in late July, and we're creating a new journal called DMLR, Data-Centric Machine Learning Research. So there is work happening to create space for publishing this kind of research and legitimately recognizing its importance. And just a quick reminder: if you are interested in data-centric challenges and competitions, check out dataperf.org. We've done some really cool, interesting things. I'll tell you one example we did before this, a data-centric challenge where state-of-the-art computer vision models were fixed, and what you had to generate was a test set. We gave you types of objects, teacher, chopsticks, bus driver, a bunch of types, and you had to generate a test set. You won if your test set tanked the performance.
BEN_WILSON:
That's awesome.
PRAVEEN_PARITOSH:
So the best test set is the one on which the model's performance is closest to zero. We ran this competition a couple of years ago, got around 15,000 submissions, and were able to build a data set of 10,000 images on which the state-of-the-art performance was zero, even though the reported performance across the whole original data set was around 0.9, 0.92. So you can have a lot of fun with just the data. And now this test set serves as a really hard challenge for computer vision researchers, because by definition it is a test set they get wrong. So we want to keep playing with data, keep deeply examining the data. All of this empire is built on data, and unexamined data is probably worse than an unexamined life.
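Here is a toy sketch of how that kind of "tank the model" scoring could work, with the model frozen and the lowest accuracy winning. The stand-in model and scoring harness are invented for illustration, not the actual DataPerf setup.

```python
# Toy scoring harness for a data-centric challenge: the model is frozen,
# submissions are labeled test sets, and lower accuracy ranks higher.
from typing import Callable, Sequence

def score_submission(model_predict: Callable[[Sequence], Sequence],
                     examples: Sequence,
                     labels: Sequence) -> float:
    """Return the frozen model's accuracy on a submitted test set (lower is better)."""
    predictions = model_predict(examples)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Example: a trivially "frozen" model that always predicts "teacher".
frozen_model = lambda xs: ["teacher" for _ in xs]

submission_examples = ["img_1", "img_2", "img_3"]
submission_labels = ["chopsticks", "bus driver", "teacher"]

accuracy = score_submission(frozen_model, submission_examples, submission_labels)
print(f"Submission accuracy: {accuracy:.2f} (closer to 0.0 ranks higher)")
```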
BEN_WILSON:
Well put. And I love that idea of white-hat hacking a test data set. It's something I've done when mentoring data scientists before. They're like, Ben, I've got this model, it's 99.8% accurate. I'm like, no, it's not. They're like, no, no, it is. I'm like, no, it's not. Show me the query that generates your training and evaluation data. Then I go back and cherry-pick a bunch of outliers that I know were in the database and are legitimate. How does it do on those? It got everything wrong. So yeah, don't worry about accuracy. Worry about reality and the data as it reflects reality, and then adapt accordingly. Stop focusing on the model. So, very well put. This was fun. We could do this for another hour.
PRAVEEN_PARITOSH:
I'm gonna go.
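As a hedged sketch of the reality check Ben suggests: pull known edge cases out of the source data and score the model on just those rows, rather than trusting the headline accuracy. The column names, quantile cutoff, and stand-in predictions below are hypothetical.

```python
# Sketch of building a small "reality check" evaluation set from known
# outliers and scoring a model on it. Values and names are illustrative.
import pandas as pd

def build_edge_case_set(df: pd.DataFrame, column: str, quantile: float = 0.99) -> pd.DataFrame:
    """Select rows whose value in `column` sits in the extreme upper tail."""
    cutoff = df[column].quantile(quantile)
    return df[df[column] >= cutoff]

# Toy example: flag transactions with extreme amounts and score a model on them.
df = pd.DataFrame({"amount": [10, 12, 9, 11, 10_000], "label": [0, 0, 0, 0, 1]})
edge_cases = build_edge_case_set(df, "amount")

model_predictions = [0] * len(edge_cases)  # stand-in for a real model's output
accuracy_on_edges = (edge_cases["label"] == model_predictions).mean()
print(f"Accuracy on cherry-picked outliers: {accuracy_on_edges:.2f}")
```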
MICHAEL_BERK:
Yeah, so I'll quickly wrap. I personally have 700 more questions, but time is limited. Today we started off talking about the structure of revolutions and innovations. There's a book by Thomas Kuhn if you'd like to check that out; it's titled The Structure of Scientific Revolutions. The Kuhnian paradigm cycle starts with an initial framework that explains something about the world. After that framework is released to the public, people start putting it through its paces, and we see where anomalies and issues with the framework start to build up. Then, over time, we iterate, and once those iterations are sufficiently different from the initial framework, we call it a new one. We also talked about funding: iteration is typically best facilitated by lots of money and very specific goals, but those zero-to-one paradigm shifts that people find so attractive typically work best in a creative and safe space, where money and clear goals don't always help. One other important note on iteration: historically it's been really effective, and humans are pretty good at it, but that zero-to-one paradigm shift is still sort of elusive. Finally, we discussed data. Data is typically the less sexy part of AI and machine learning, and the ML Commons view is essentially that most data is unexamined, so we should go back and examine what the last hundred years of research has been built upon, to see if any of it is biased, or if there's leakage, or things like that. If you want to get involved, check out dataperf.org. It's basically Kaggle for datasets, where we hold the model constant instead of holding the data constant. And then the last plug: if you're looking to catch the next wave of AI, LLMs are done, sorry to tell you, so maybe start thinking about the data itself. So until next time, it's been Michael Burke and my co-host.
BEN_WILSON:
Wilson.
MICHAEL_BERK:
and have a good day everyone.
BEN_WILSON:
See you next time.