Cows, Camels, and the Human Brain - ML 182
What do cows and camels have to do with the human brain? The latest developments in machine learning, of course! In this episode, Michael and Ben dive into a new white paper from Facebook AI researchers that reveals a LOT about the future of modeling. They discuss “cows and camels”, the question of predictive vs causal modeling, and how algorithms are getting scary good at emulating the human brain these days.
In This Episode
Why Facebook’s new research is VERY exciting for AI learning and causality (but what does it have to do with cows and camels?)
The answer to “Is predictive or causal modeling more accurate?” (and why it’s not the best question to ask)
Not sure if you need machine learning or just plain data modeling? Michael lays it out for you
What algorithms are learning about human behavior to accurately emulate the human brain in 2022 and beyond
Show Notes
The answer to “Is predictive or causal modeling more accurate?” (and why it’s not the best question to ask)
Not sure if you need machine learning or just plain data modeling? Michael lays it out for you
What algorithms are learning about human behavior to accurately emulate the human brain in 2022 and beyond
Transcript
Hey, everybody. Welcome back to Adventures in Machine Learning. I'm one of your 2 hosts, Ben Wilson. And Michael Burke. How's it going?
And today, we're gonna be talking about a white paper that Michael read a while ago, and we're gonna talk about cows in fields and deserts. The most exciting topic out of all the topics. Cool. So I'll set up the problem statement and quickly discuss the method, and then we'll get into the philosophy behind it as well as the nitty gritty of the algorithms and the solvers. So this method was developed, I think in 2019 or 2020, by Facebook's AI research team, and it's really cool.
And the basis of it is they're looking to develop a causal learning algorithm that is better than all the other ones. Other standard causal learning algorithms use DAGs or try to just get creative with the data, essentially, and this method does the latter. What it does is it creates environments, and it looks to develop causal inference based on those environments. So what does that mean?
Let's take a little bit of a deeper dive. So like Ben said, they use the example of cows and camels; both have four legs and are animals. But one thing that is specific to cows is they often have a green background, so pasture, grass, whatever it is. Camels, on the other hand, often have a sandy background. And this is a classic example where a machine learning algorithm, without any adjustments, would train on the background color.
And that's actually what happened. I forget the exact numbers, but they trained a classifier on a bunch of images, and it turns out that when they changed the background for a cow or a camel, it just predicted the opposite label. So as you can see, this is a problem for generalization. Instead, what we're looking to do is develop a causal framework; it basically uses counterfactual logic, and we'll get into that in just a sec.
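As a rough, self-contained illustration of the failure mode Michael is describing (the feature names and numbers below are invented for the example, not taken from the paper), a plain classifier trained where the background tracks the label will lean on the background and fall apart when the backgrounds are swapped:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Toy stand-in for the cow/camel setup (all names and numbers invented).
# label: 1 = cow, 0 = camel. "greenness" is the background hue, which
# tracks the label 95% of the time in the training environment.
# "animal_signal" stands in for the actual animal features: causal, but noisier.
label = rng.integers(0, 2, n)
greenness = np.where(rng.random(n) < 0.95, label, 1 - label) + 0.1 * rng.normal(size=n)
animal_signal = label + 0.8 * rng.normal(size=n)
X_train = np.column_stack([greenness, animal_signal])

clf = LogisticRegression().fit(X_train, label)

# New environment: backgrounds flipped (cows on sand, camels on grass),
# animal features unchanged.
greenness_flipped = np.where(rng.random(n) < 0.95, 1 - label, label) + 0.1 * rng.normal(size=n)
X_flipped = np.column_stack([greenness_flipped, animal_signal])

print("training-environment accuracy:", clf.score(X_train, label))
print("flipped-background accuracy:", clf.score(X_flipped, label))
```

On the flipped environment the accuracy typically collapses well below chance, because the model has leaned almost entirely on the background.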
Good so far, Ben? Yeah. It's a good overview of the intro and, like, what they're trying to do. And I think anybody who's listening in who's ever tried to build a CNN classifier, regardless of how sophisticated they're getting, will run into the same thing I have in the past, where you have images from the real world that people have looked at and said, okay, I know what's in this image. I'm labeling it as x.
This other one is labeled as y. And when you train the model, it's gonna learn things that are dependent on what it sees in the data. And whatever the simplest path is to differentiate between things, it's gonna learn that pattern. Yeah. That's a really good point.
Algorithms tend to, well, not tend to; they're optimized to take the simplest path. So if there's some correlation or some feature that, in this sample of the dataset, is really powerful, then that feature will be very important. And when you exit that data space, the model won't generalize at all. So basically, what the method looks to do is develop environments based on the sampling schemas.
So, in the example of cows and camels, one could think about the country that they're sampled in, and that might be indicative of the background. One could also think of the time of day, or other pieces of information about how the data are collected. And then what you look to do is leverage a counterfactual framework, so let's get into that in just a second. For those of you that don't know, a counterfactual is essentially the data trend that we would have observed without a treatment.
So you can think of it as the control group in an A/B test. If we see a data trend and, I don't know, we change the button from red to green, well, that's the treatment, and in the control group we want to see what the data would look like if the button had stayed red. With that framework, we can look at average treatment effects and differences between those two groups and then use that to develop a causal framework.
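As a tiny, hypothetical illustration of that A/B-test framing (the outcome numbers are made up):

```python
import numpy as np

# Hypothetical button-color A/B test: 1 = converted, 0 = did not.
control = np.array([0, 1, 0, 0, 1, 0, 1, 0])    # original button
treatment = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # new button color

# Average treatment effect: difference in mean outcome between the groups.
ate = treatment.mean() - control.mean()
print(f"estimated ATE (lift in conversion rate): {ate:.3f}")
```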
They sort of borrow the same idea, but instead of a single treatment they create, let's say, 10 environments, each from a different country: Tanzania, Ireland, the sandiest and the least sandy places. From those, what they do is they look to develop rules, and any time one of those rules is broken, the model is penalized heavily. So you can sort of think of it like a regularization term that's added into the loss function.
And that's exactly what they develop: they look to find instances where a rule is broken, and therefore that rule cannot be true. So if a cow ever appears on a sandy background, then a sandy background cannot cause camel. And vice versa, if a camel appears on a grassy background, a grassy background cannot cause cow. And that's sort of how humans think: if a rule is ever broken, then causality is broken.
So instead of optimizing for correlation, we optimize for causality, and that's basically the high-level framework. And it has this very interesting effect, when they implemented this in PyTorch and ran it through some of their tests, where it showed something that most people who are familiar with training supervised learning would call worse: their training metric went down compared to a traditional optimization technique. I think in the paper, with their colored MNIST data, it was, like, 74 or 76% accuracy on training.
And then when they took that to test, it was, like, 10% accurate, which is pretty bad. And then with their approach, accuracy was in the sixties on both training and the holdout validation. So this approach of making those restrictions within those environments, where the penalization term is so strong that I'm gonna deweight this part of the optimization, is saying: this isn't something to rely on, because it's not causal, even if it's correlated, based on the inference that's going on. Yeah. And that's a really good point.
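For anyone who wants to see roughly what that penalty looks like in code, here is a minimal sketch in the spirit of the paper's IRMv1 formulation; the model interface, batch format, and penalty weight are placeholders rather than the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    # Gradient of this environment's risk with respect to a fixed "dummy"
    # classifier scale of 1.0. If one shared classifier were already optimal
    # in this environment, this gradient would be ~0, so its squared norm
    # measures how much the environment "wants" a different classifier.
    # y: float tensor of 0./1. targets, same shape as logits.
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, y)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

def irm_loss(model, env_batches, penalty_weight=1e4):
    # env_batches: list of (x, y) tensors, one entry per environment.
    # model is assumed to return one logit per example.
    risks, penalties = [], []
    for x, y in env_batches:
        logits = model(x).squeeze(-1)
        risks.append(F.binary_cross_entropy_with_logits(logits, y))
        penalties.append(irm_penalty(logits, y))
    # ERM term averaged over environments, plus the invariance penalty.
    return torch.stack(risks).mean() + penalty_weight * torch.stack(penalties).mean()
```

The intuition: environments whose spurious features would prefer a different classifier pay a large penalty, which is the deweighting Ben describes.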
The purpose of most predictive models is to generalize past the dataset. I think that's pretty agreed upon. And if your data are not perfectly representative of future data, then you might have issues. And even if they are, you might have features in there that for whatever reason are very indicative of the y value, but in the real world are not. And so by ensuring that you're learning causal relationships instead of spurious correlations, you generalize a lot better.
But as Ben said, that can kill your training accuracy because you're not picking up on random features. Yeah. Definitely. And you can see this with even open source datasets that are very simple, where you're trying to predict, you know, something relatively simple. Like, oh, I wanna do bike rentals in San Francisco.
I wanna use that dataset. And you might throw in some data in there to say, well, I want, you know, month of year in there, and I want adjacency to a holiday, and I want cloud cover percentage. It's gonna pick up on whatever that strongest correlation might be. And maybe it's precipitation. You know, people don't rent bikes when it's raining out because that's a terrible experience.
But with data, a typically spurious assumption a lot of people make is that train and test are IID. It's very challenging to get that, particularly if you don't have a truly ludicrous amount of data to train and validate on. If you're not very careful with the split that you're doing, and making sure that these two samples are of the same distribution and are independent of one another, you're gonna get some really weird results. The model is gonna learn something that you didn't want it to learn.
And this happens all the time. So this paper and what we're talking about is a means of simplifying the process that you would otherwise have to go through to build a causal model. I don't know if you've ever had to build one of those, but they're extremely difficult to build just due to the amount of time and effort and computation that you have to put in. Yeah. I have.
I remember one of the first projects I was working on was a modeling project, and I talked to our resident expert in causal inference and causal modeling. I was like, which is harder, predictive modeling or causal modeling? And he was like, that's a dumb question, causal modeling is. Predictive modeling tends to be more towards optimization, and with causal modeling, this was his argument, you have to actually know what's going on under the hood and make sure everything makes sense and then explain it. Predictive modeling can be a black box, and if accuracy is 95%, we're happy.
Mhmm. And it's interesting to see these two fields merge because, at least over the last several years, black boxes have become less acceptable, both from a model trust standpoint, if stakeholders don't trust it as much, and also from a generalizability standpoint. You can back test, obviously, but we want to know that future predictions won't be that bad. So having causal knowledge of what's going on is really effective. Yeah.
And having that causal relationship helps with rejecting potential features, whether we're talking about images or just new data that's coming in for prediction, this recording and measuring of the real world in order to generate predictions; the model would be aware of that penalization effect on spurious features. So if true drift happens, and every model is gonna experience that, there's gonna be some foundational shift in your feature data at some point. When that happens, this approach can actually counteract that to a large degree, based on what it learned from the training data.
Exactly. And Ben started to get into one of the more technical pieces, and I think we've sort of painted a high-level picture, so we can start descending into the weeds. He started talking about the assumptions of modeling. This method is called invariant risk minimization, and standard machine learning is called empirical risk minimization. Basically, you can think of risk as sort of like a loss function, and empirical means it's based on the data that we observe.
Invariant refers to the causal framework, so that it must invariably minimize risk. And for empirical risk minimization, there are three core assumptions that the paper cites. The first is that data are IID, so independent and identically distributed. And as Ben said, especially within a train test split, that's often not the case. And if you just think about it, when you sample from different environments and then treat those environments as IID, that's invalid.
So with train test splits, or even if we just look at training accuracy itself, that assumption is often invalid and that leads to problems. The second assumption is that as we gather more data, the ratio of significant features to sample size should decrease. So as Ben was saying, if we have a bunch of random features that have very spurious correlations, then this assumption might not be true. But if we're truly learning causal relationships, then as we gather more high quality and more correct data, those causal relationships should become more apparent and more clear. And then the final assumption, which we sort of have to assume, is that we have a model that can explain the relationship between our features and our y.
If you can't do that, then that's a problem. But often, we just have to assume that. So what the paper basically said is 1 and 2 are not valid, and then 3 we just assume because we have to.
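For reference, the paper's objective (notation lightly paraphrased here) asks for a data representation $\Phi$ such that one classifier $w$ is simultaneously optimal in every training environment:

$$\min_{\Phi,\; w} \;\sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi) \quad \text{subject to} \quad w \in \arg\min_{\bar{w}} R^e(\bar{w} \circ \Phi) \;\;\text{for all } e \in \mathcal{E}_{tr}$$

In practice this is relaxed to the IRMv1 form with the gradient penalty discussed above, an ERM term plus a per-environment penalty:

$$\min_{\Phi} \;\sum_{e \in \mathcal{E}_{tr}} \Big[\, R^e(\Phi) \;+\; \lambda \,\big\| \nabla_{w \,\mid\, w=1.0}\, R^e(w \cdot \Phi) \big\|^2 \,\Big]$$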
Yeah. And that first assumption, particularly, as they cited in the second half of the paper in their coffee shop dialogue, which is really amusing: the big lie of ML is IID. And that holds true, just as you said, with the train test split, but I think it holds even more true when we're talking about utilizing an ML model on data that's never been seen. So when we're doing daily scheduled predictions, there's no guarantee, unless you're looking at a fixed, closed system that you know everything about, and you've really pared down your feature set to things that are observable.
You're looking at machinery operating, you're measuring the vibration analysis and the RPM, and, you know, it's a fixed set of measurements, but look at the entire environment of that, all the latent factors that are associated with it. So even in those cases, you're still not guaranteed IID over time, even though we all say that, yeah, it is. We're taking a sample from the population at training time; that's what we're using.
And then when we're predicting in real time, or in batch over the next n number of weeks, months, years, we assume it's gonna be the same distribution. Well, generally not. Things change. Entropy happens and things fall apart. And that's true of pretty much any model.
But your third point, the assumption that the model can actually explain the relationship between features and predictions... Yeah. Those are really good points. You sort of hinted at one thing that I wanted to highlight, which is that causal learning isn't always the best option, maybe. Could you give some examples of data generating environments where straight-up optimization might be the best option?
Oh, that's a good one. And I think you hinted at it. It's a knowledge of the data generation mechanism and knowledge of future data generating mechanisms. So if it's a physics-based system or a machine, we might be able to assume that prior data will look exactly like future data. But in most product applications and most big tech applications, that might not be the case.
So I was wondering if you had any ideas on that. Yeah. Definitely. I was just trying to think of some of the use cases I've been involved with over the last couple years. They're very few and far between where we can say, hey, the distribution is the same for training and everything that we're predicting on.
We can validate this, as you said, through back testing. We'll look at random samples of the data over a period of years, and we see, yeah, there are some variations, but the variance within each of the features being collected stays consistent, even over extremely long periods of time, because we're tightly controlling the data that we're collecting. It's sensor data. It's measuring Newtonian properties of things. So in use cases like that, we're measuring equipment or physical properties of something.
That data being collected is bound by environmental laws, which they kinda get to in the paper as well when they're talking about environments. If we're constraining what we're measuring and what we're predicting within those boundaries, you have fixed limits when you're measuring physical systems. Say we're talking about measuring and predicting whether a rotating piece of machinery is going to fail. We know that we're not able to measure below 0 Kelvin, and we're not gonna have a temperature value that is hotter than the core of the sun. So you have these constraints, and you know over time that you're gonna be in this range for the temperature of this piece of equipment at this place on Earth, because there's a melting point to any piece of equipment.
And then that environment constrains itself to the point where you can say, yeah, I have IID across training, test, holdout, and prediction datasets. But that's an infinitesimally small subset of the ML applications that I've seen. And a lot of those aren't even handled with supervised learning. They're handled with physics models. You know that relationship. You already have that causal understanding.
So you just programmatically write that and say, hey, if I have these conditions, this is a problem; otherwise, continue not sending me alarms about it. Supervised learning is still used for some of these because it can be easier to implement than writing a physics model, and it can also allow you to do a lot of them in parallel without having to write a bunch of custom code for monitoring systems.
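A trivial sketch of that write-the-rules-yourself approach (the sensor names and limits here are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Reading:
    temperature_c: float
    vibration_rms: float
    rpm: float

# Hypothetical limits you'd get from the equipment spec sheet, i.e. from
# causal/physical knowledge rather than from a trained model.
LIMITS = {"temperature_c": 95.0, "vibration_rms": 7.1, "rpm": 3600.0}

def check_alarm(reading: Reading) -> list[str]:
    """Return the names of any physical limits this reading violates."""
    return [name for name, limit in LIMITS.items() if getattr(reading, name) > limit]

alarms = check_alarm(Reading(temperature_c=102.3, vibration_rms=3.2, rpm=1800.0))
print(alarms or "all clear")  # -> ['temperature_c']
```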
But what you alluded to: the production use cases in industry that people are applying machine learning to, a lot of them have to deal with humans. Like, how is this person going to interact with my company's products? Or what is going on in this region? What is the risk of this thing? And is this person trying to do something to attack my company?
All of these sorts of things have so many latent variables that you can't collect. So isolating the data that you are collecting down to a causal relationship is the big gain that this paper talks about. Yeah. Exactly. And I think you made a really interesting point in that most physics-based systems use physics models, which makes a lot of sense.
And for that to work, well, first of all, the goal of modeling is to predict the future. Right? And if you have a really good understanding of the system, then you probably don't need machine learning, like Ben hinted at. In that case, you can just develop some mathematical model that captures the relationship between x and y. But with machine learning, we're looking to develop an estimate of that relationship, so if you can't write out a mathematical relationship between x and y, causal learning is probably a really good option for you.
If you don't have perfect understanding of the data generating mechanisms, then causal learning can improve generalization to new datasets. So that was basically the gist. Yeah. I couldn't agree more. And as you mentioned earlier, when you asked your colleague, like, hey, which one's more challenging?
I kind of chuckled at that because with causal learning, when you're building those sorts of models, you need to understand the problem space. You don't need to understand it perfectly; that's the point of the modeling, to say, hey, what actually impacts what? But you have to understand the problem space to such a degree that you can actually build the relationships that you're gonna be testing with empirical data. And that involves creating DAGs, saying, if this changes, then it changes this, and this has a relationship here, and these don't have a relationship.
And you have to design all of that and then run a bunch of experiments saying, hey, start varying things for me, selecting vectors of data, and tell me what the relationship is. It's very complex. It requires domain expertise. Whereas, as we've been saying, you know, with traditional ML you don't need to have all of that. It helps if you do, and it's really beneficial if you work with somebody who really understands the domain.
In fact, you should be doing that, but you don't have to understand it to the degree that you would for causal modeling. Completely agree. And even with causal modeling, there's another layer of whether you need statistical significance for the causal relationships. That's often pretty easy to add in, but just another note there. And then, before we redescend into the weeds and talk about optimizations, I just wanted to quickly touch on the philosophy behind this, because I think it's super, super interesting, and, I don't know, there's just something interesting about it.
So a long time ago, people studied the brain and said, hey, we have neurons. That's a cool framework. And then in the eighties or nineties or whenever it was, neural networks were proposed as a method for solving problems with computers, essentially. But back then, we didn't have the infrastructure, the computing power, or anything like that, and so we were unable to run large scale neural networks. But with the rise of computing, we started using neural networks for basic things and we were able to sort of approximate the brain.
Causation is not a real thing. You can't go touch causation, you can't go grab causation, you can't look at causation. Causation is a construct of the human brain, made through the relationships between neurons. And so what we're looking to do is take neural networks, and black boxes in general, a step further and make them more like the human brain. Essentially, what we're looking to do is approximate what humans do as children.
So a child sees a pan on the stove and is like, I might touch that. They get burnt and they say, oh, things that are on the stove are hot. Well, a year later, they come back and they touch the stove, but the stove is off. So the stove can't be the cause of burning my finger. Maybe there's some other relationship.
So they go out and test different things. And what this paper is essentially looking to do is build that trial and error learning framework into the loss function. What they do is they create test environments of different scenarios, touching the pan when the stove is on, touching the pan when the stove is off, and they ensure that the same cause holds in each environment. If it doesn't, well, then it can't be a cause. It might be a strong predictor or a strong correlation, but it can't be a cause.
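A toy version of that elimination logic, with made-up observations and nothing like the paper's actual optimization, might look like:

```python
# Candidate rules of the form "feature X causes the outcome". A single
# counterexample in any environment eliminates the rule.
environments = {
    "home_kitchen":       [{"pan_on_stove": True,  "burner_on": True,  "burned": True},
                           {"pan_on_stove": True,  "burner_on": False, "burned": False}],
    "restaurant_kitchen": [{"pan_on_stove": False, "burner_on": True,  "burned": True},
                           {"pan_on_stove": False, "burner_on": False, "burned": False}],
}

candidates = ["pan_on_stove", "burner_on"]

def survives(feature: str) -> bool:
    # "feature causes burned" survives only if, in every environment,
    # the outcome agrees with the feature in every observation.
    return all(obs[feature] == obs["burned"]
               for observations in environments.values()
               for obs in observations)

print({f: survives(f) for f in candidates})
# -> {'pan_on_stove': False, 'burner_on': True}
```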
So I just thought that was freaking awesome, and it's cool to see how algorithms are starting to become more like humans and borrow the best parts of humans. So we'll see how that goes. That's an awesome analogy, actually. With traditional ML, looking at that use case of touching the stove and getting burned by a hot pan: a traditional supervised model, if it doesn't have all of those combinations, like, when was the last time something was cooked in that pan on the stove?
What is the state of the knob for the burner of that stove? What type of pan is it? You know, there's a lot of data that you could collect about that, but if you don't have all of those permutations in your training data, and you don't have enough samples testing, like, hey, is it hot, yes or no, based on the state at that moment, you run the risk of it doing what the child did at phase one: just saying pan on stove means hot, burn.
Yes. It sees that the pan is on the stove, versus the pan being in the cupboard, and says it's burning hot. Even though in reality, we know from our experience, and our brains can understand, that we can, you know, put our hand above the stove and ask, do I feel heat coming off of this? So the model, if you're not careful with your training data, is gonna learn a spurious correlation that you don't really want it to learn. Yeah.
Exactly. And as data science practitioners, that's where we all are. Whenever we're building a model, we're fighting against that. Yeah.
And just think about what would happen if you got all of your data about the stove from the kitchen in your house, versus a kitchen in a restaurant, versus a kitchen in a restaurant at peak hours between 7 and 9 PM. The data generation is so, so important. And again, what this framework looks to do is overcome crappy data generation mechanisms or biased sampling, because you can almost never have perfectly unbiased sampling. Yeah. There are too many. We just came up with a handful of them, and there's an infinite number of latent variables that could affect that. We could spend the next several hours just whiteboarding a bunch of ideas.
Like, what other data could we collect that could explain whether a pan is too hot to touch or not? And that's a reflection of the reality of any project that you work on: you're never gonna collect all the data. It's impossible. You're never gonna have true causal understanding of a problem, due to the fact that you can't collect all the empirical evidence to explain a causal relationship in a way that would be accurately represented in traditional machine learning. So this approach says, I'm gonna sample things, hold this one element constant, change these other things, and see what that actually does to the prediction, to the optimization that I'm doing.
And how does that impact how I should deweight this one particular feature? It's a brilliant approach. Reading through the paper a few times this week after you sent it to me, I was like, this is so cool. And it's actually not too dissimilar from how ML explainability algorithms work. The SHAP algorithm, for instance, is using a somewhat similar technique, where we're holding certain indexes in our feature vector invariant and synthetically varying other features to determine what the actual importance of these factors is.
How much do they impact this trained model and what it's actually predicting, saying, oh, this is the directional magnitude of this one feature, this is how it actually impacts the prediction. This paper that we're talking about is sort of taking that idea and saying, let's just put that into the training process. Hold elements, a whole vector index, invariant, change other things around it based on our sampling, and ask, do we detect conditions where, if we change this one thing, the prediction is completely wrong? Those are the really big switches, dramatically impacting the prediction.
Those need to be weighted very differently than the ones that are just, oh, there's maybe a weak correlation here, a positive correlation. It's the ones where, hey, if I just switch this one thing, it completely nullifies the prediction. That's the improvement here. My curiosity, and we talked about it before we started recording today, is, like, what is the computational complexity of doing this?
Where we have to build these environments and these synthetic elements, whether it be a convolution over a matrix or vector processing, these are a lot more iterations that we would have to do in order to determine that relationship. So it'll be really interesting, if this does take off, what the value is to people in industry comparing these two things. Because it's gonna be added cost.
There's maybe added accuracy. So I'm curious to see where people's cost benefit analysis will fall out on this. Yeah. It's a good question; I have no idea.
And just another note on industry adapting to this method. This paper only covers classification, and they're working on regression topics, but that was left for future work because it's apparently a lot more challenging from a mathematical standpoint, and I could see why. But, yeah, right now, at least as of the date that this paper was written, they only have it for classification. Mhmm. Yeah.
I imagine this would be nontrivial to do for a continuous predictor, because you would have to estimate the magnitude of the change and how much that deviation matters. And you'd have to run a whole lot more iterations in order to determine that. It's like that in the SHAP package; that's the difference between your tree solver and your kernel solver. If you're running with a tree-based algorithm, that thing executes pretty quickly even on a relatively enormous dataset. And then if you can't support that and you have a model type that isn't tree-based, you have to use something like a kernel solver. There are also linear solvers, but they're restricted.
And so that kernel solver is brute force. It is: I have this black box that I'm throwing a vector of information at, and it's going to provide a prediction. And I'm just gonna mutate a bunch of stuff, throw it at that, and see what the differences are between these. And it takes a long time to run.
It can take a very long time to run. It's on the order of, like, 100 to 1000x longer than it took to train the model in the first place, just to be able to explain it. Yeah. That makes sense, but god damn. Yeah.
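For anyone who hasn't hit this, a rough sketch of the two paths Ben is contrasting (the model and data here are placeholders; on real workloads the kernel path is the one that gets slow and expensive):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(1000, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
model = RandomForestClassifier(n_estimators=100).fit(X, y)

# Fast path: TreeExplainer exploits the tree structure directly.
tree_values = shap.TreeExplainer(model).shap_values(X[:100])

# Slow path: KernelExplainer treats the model as a black box and perturbs
# features against a background sample, which is the brute-force approach.
background = X[:50]
kernel_values = shap.KernelExplainer(model.predict_proba, background).shap_values(X[:10])
```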
Yeah, and on pretty beefy hardware. Some of our customers at Databricks who need model explainability because of legal requirements, sometimes they don't wanna, or they try to, use certain types of explainers, and they're like, yeah, that's cool, we know everybody's using that, but we don't wanna use it because it's not a really good fit for our use case. But we need explainability, and then they get down the road of, like, alright.
We're gonna use SHAP. And they're, like, wow, why does this take so long? And, like, this costs thousands of dollars. So, yep, sorry.
Like, we're working on it. When we figure it out, you'll be the first to know, but it's computationally intense. So it'll be interesting to see this take off and eventually see packages being written, or pre-trained classifiers that Facebook can offer to the world based on these papers, saying, hey, here's the way to do this, here's the algorithm, and we created a structure that supports it. It'll be interesting to know whether you basically need a hundred GPUs, a cluster of them, in order to run this even for relatively simple data. Yeah.
So we have weekly speakers: either someone internally presents an ML paper or we have someone come in externally. And one of the people that came in was a founder of a startup that was trying to basically sell us his product, unfortunately, but it was all about model explainability, and there are so many new companies that are working on model explainability. Obviously, not all of them will succeed, but it's such a hot topic right now because of, a, generalizability, and, b, trust. Also, with the general AI discussion, everybody's like, you tell an algorithm to make a bunch of paper clips and Terminator happens.
Well, if we know what's going on internally, maybe that won't happen. So I think there's a bunch of different reasons for why it's such a hot topic. And Databricks is, I'm sure, gonna put out some great products, but it is also challenging. Yeah. Challenging, and it's expensive to do it.
Yeah. Exactly. That brings me to another meta question, not Meta the company, meta the concept. At what point does ML become such a commodity because of the advances and the general acceptance that people have? Like, algorithms such as what we've been talking about get to a point where, hey, they're really good and they're really generalizable and people trust them.
We can explain them pretty well. When ML becomes a commodity, and anything that becomes a commodity becomes optimized for price, what are the performance considerations or needs that come up when this becomes something that just everybody does? So the premise is machine learning algorithms become something you can buy at the grocery store, basically. You can make them really quickly and they work really well.
Maybe not quickly. Just that they're reliable enough that people trust them, and they're ubiquitous throughout industry. Every company has their suite of ML models that are running every single day. And how much is that gonna cost the company for things that are just expectations? And what is that load gonna be on cloud providers?
I see it pretty much every day. There are allocation problems on all three of the major clouds that we have, where somebody's like, hey, I wanna do modeling. You know, we were doing this one model, but we found that if we split it up based on region and we create, you know, 1,800 different models, they're all more accurate than that one meta model. And it's like, yeah, sure.
You're creating different ecosystems, these different environments, basically, for each of these models. You're eliminating a bunch of latent factors that were influencing the performance of the one meta model. But then they're like, whoa, we tried to run it asynchronously. We tried to kick off a job that did all the training in parallel.
We couldn't do it because we hit a cloud provider limit. They're like, do we need to go to a new region? I'm like, no, you need to talk to the cloud provider to get more allocation, but what's your budget for this? They're like, whoa, we didn't know it was gonna cost that much. Like, yeah, it's expensive.
But if everybody's doing stuff like that and everybody is going down that route, how big are data centers gonna be? How much silicon are we gonna be mining and forming into wafers? It's an interesting thought to consider, where these models become more powerful, more predictive, more reliable, because they don't require, you know, a team of 8 statisticians and 4 ML engineers to build a solution based on a bunch of statistical analysis to determine spurious correlations before you even start feature engineering work. A model can just say, yeah, you can throw some relatively garbage data in here, and then we're gonna find these spurious correlations and reject them from the model's influential weights. That's what I mean by commodity.
Gosh. When it becomes easier, then it becomes more expensive. Yeah. And even with this method, you could argue that you should throw in more data, because theoretically it will only find causal relationships. So let's throw in ESPN data.
Let's throw in Twitter's most recent feed. Why not? And so, yeah, that's a really good point that future cloud computing demand will no doubt increase just due to the scaling of industry. But as we have more robust algorithms that can handle really crappy data, there's a lot more automation in this model fitting process, so people will just throw everything at the model and see what it can output.
So yeah. I mean, arguably, people are already doing that. That was the big promise of XGBoost when it was released. It was like, hey.
It'll autonomously reject poor correlations. You can throw whatever you want at it. It's only gonna look at the most important features. And, yeah, that is true.
That's how that algorithm works. However, it's going to optimize for the greatest correlation that it has. And if you throw a bunch of junk into it, it's going to optimize for pan on stove means hot, and that's the only thing it's gonna look at. All those other features that you throw in there, it'll just ignore them to such a level that it's really only looking at one or two features.
So that was an issue, and it still continues to be an issue for that algorithm. And then other companies have followed suit. After XGBoost, big tech companies have released their own sort of versions of extreme gradient boosting. So you have LightGBM, you have CatBoost; you know, these algorithms do similar things, trying to solve kind of the same sort of problem in slightly different ways. But people got the wrong message, I think, like, you shouldn't just be throwing garbage into a model and expecting it'll just magically do some stuff for you.
But this paper could theoretically do that. It could say, hey, these things are pretty important. Instead of just saying I'm finding these two features that are of greatest correlation, it could be rejecting the things that have extreme correlation with your target in favor of 30 other features that are actually less correlated but have a better causal relationship to that prediction. And you could get away with throwing in just tons of data. Like, hey, I'm gonna throw a thousand features into this thing and 10,000,000 rows of data.
And on holdout validation, it probably will beat any other approach that's out there. Yeah. Just to be perfectly clear, this isn't an XGBoost equivalent, because there's a lot of design in how we create our environments, and there's lots of subject matter knowledge; that's a pretty labor intensive process. So in the cows and camels example, we first of all knew that background was gonna be a meaningful feature, and a potentially correlational feature and not a causal feature. Second, we knew a ton about what different locations have in terms of their natural landscape.
So based on being just everyday people and having subject matter knowledge about different countries and where cows and camels live, we're good to go. But think about more complex examples. I don't know that it will be that easy to develop those environments, and it's tough to come up with an example off the top of my head, but it assumes a lot of, a, subject matter knowledge, and then, b, it assumes that those environments are very good examples of the sampling schema. Right. Again, we assume it's IID.
It's almost never IID, but then we have to say, alright, it's not IID; what is the true sampling schema? And that's a really tough question, in my opinion. And, personally, other than compute power and computational complexity, I think that's one of the biggest issues with this method.
So it's not a short term, quick and easy fix. It's for really complex, really important models where we do know the data generation scheme a little bit. Yeah. I mean, definitely. And you have to not only understand what data is being collected; for our image example here about cows, we would have to understand what the potential latent factors are in that environment that we'd have to collect data on to explain the causal relationships.
So if we're generating pictures, it's not just the data augmentation that people normally do with CNNs, where we're like, oh, I'm gonna flip the image, I'm gonna invert it, I'm gonna mess around with the color space, I'm gonna throw grayscale versions of this in there to try to counteract it. That's traditionally what you would do with a CNN to get it to generalize better and hope that it doesn't learn the things you don't want it to learn, like, oh, if it's green, then it must be a cow in the background.
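That "traditional" augmentation is usually just a transform pipeline along these lines (the parameters are arbitrary examples, not a recommendation):

```python
from torchvision import transforms

# Standard augmentations: flips, color jitter, occasional grayscale. This
# makes the model more robust to these specific perturbations, but it does
# not stop it from keying on the background the way IRM-style environments
# are meant to.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
])
```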
But what happens when you put a cow on a boat? That's an environment that a cow is not normally in. They don't live in the ocean. But what happens if you're transporting a bunch of cows and somebody takes a picture of it?
Do you want it to detect cows? And if it was only looking at the background, what would that classifier do? It would have an extremely low probability of membership to anything that it had seen before, and it would probably get pretty confused. It'll give you a prediction in most frameworks, but it won't be something that you can rely on. But with this approach, you're building those environments and saying, okay, I need all of these different picture environments in order to give it the ability to disambiguate between cow and camel.
Do you also superimpose pictures of cows in space, on the moon? Those are extreme examples, but for any supervised learning that we're doing, we should be thinking about that. The application of an algorithm like this means understanding that problem space. For predicting what a customer's next action is gonna be when interacting with our product, what are all the different possible avenues? What if we're estimating probability to churn based on how frequently we're sending push notifications to their phone?
Well, what are all those latent factors that go into that? And there are a lot. So, yeah, interesting philosophical thought path. It sure is.
All right, man. This was fun. Good discussion on a really cool paper. Next week we're talking to a pretty fun group of people: it's going to be the open source MLflow development team, some colleagues of mine, and it'll be a panel discussion.
Michael will be grilling all of us on all the questions he wants to know about developing one of the most successful open source toolkits that's ever been made. That should be a fun one. So, yeah, make sure to prep your questions, man. Yes, sir.
Cool. Yeah. That's great. And I'll see you next week. And also thank you everyone for tuning in.
Take it easy. Have a good one.