Navigating Common Pitfalls in Data Science: Lessons from Pierpaolo Hipolito - ML 183 - Adventures in Machine Learning -

Navigating Common Pitfalls in Data Science: Lessons from Pierpaolo Hipolito - ML 183

Welcome to another insightful episode of Top End Devs, where we delve into the fascinating world of machine learning and data science. In this episode, host Charles Max Wood is joined by special guest Pierpaolo Hipolito, a data scientist at the SAS Institute in the UK. Together, they explore the intriguing paradoxes of data science, discussing how these paradoxes can impact the accuracy of machine learning models and providing insights on how to mitigate them.

Hosted by:

Special Guests:

Charles Max Wood •

Pier Paolo Ippolito

RSS Spotify Apple Podcasts YouTube Amazon Music

Show Notes

Pierpaolo shares his expertise on causal reasoning in machine learning, drawing from his master's research and contributions to Towards Data Science and other notable publications. He elaborates on the complexities of data modeling during the early stages of the COVID-19 pandemic, highlighting the use of simulation and synthetic data to address data sparsity.

Throughout the conversation, the focus remains on the importance of understanding the underlying system being modeled, the role of feature engineering, and strategies for avoiding common pitfalls in data science. Whether you are a seasoned data scientist or just starting out, this episode offers valuable perspectives on enhancing the reliability and interpretability of your machine learning models.

Tune in for a deep dive into the paradoxes of data science, practical advice on feature interaction, and the importance of accurate data representation in achieving meaningful insights.

Transcript

Charles Max Wood [00:00:09]:
Hey everybody. And welcome back to another episode of adventures in machine learning this This week on our panel, we have Ben Wilson. Hello. I'm Charles Max Wood from Top End Devs. And this week, we have a special guest, and that's Pierpaolo Hipolito.

Pierpaolo Hipolito [00:00:22]:
Hello. Hello, everyone.

Charles Max Wood [00:00:23]:
Do you wanna introduce yourself real quick? Let us know who you are and what you do, why you're famous.

Pierpaolo Hipolito [00:00:28]:
So, yes, I'm here and, I'm currently working as a data scientist, at the SAS Institute in the UK. And, part time I my spare time, I also contribute to towards data science, Indian publication online, and some have to also be published on other online publications such as free web camp or kittinguggets and and so on.

Charles Max Wood [00:00:49]:
Awesome. So you're a data scientist? I'm sorry. What did you say you were working? I didn't quite catch that.

Pierpaolo Hipolito [00:00:55]:
It's our since then. Salesforce is off by the store.

Charles Max Wood [00:00:58]:
Okay. Yeah. Very

Charles Max Wood [00:00:59]:
cool. Your company headquarters is about 6 miles from my house.

Pierpaolo Hipolito [00:01:03]:
Yes. In Cary. Yes. Uh-huh. Yeah.

Charles Max Wood [00:01:06]:
Awesome. Well, you wrote an article talking about the paradoxes of data science. And in particular, yeah, you you outlined 4 different paradoxes or 4 different things that you can, run into with your dataset. And, as Ben is often fond of pointing out, your data kinda determines how good your, machine learning models are. And so it was I I thought, oh, this would be interesting to dive into and talk about, okay, you know, how can we get a misread on our data and, yeah, how does that happen? So do you wanna first, before we dive into these, give us a little bit of background on why you wrote this article or, you know, was there something that prompted this?

Pierpaolo Hipolito [00:01:45]:
Yes. So I think the idea originally came from, the the service study I did for my master degree. So I did a master in artificial intelligence, at the University of Southampton. And for my master, I decided to cover a topic called causal reasoning in machine learning. And, part of the topic that I've covered during the the project, that's last about 3 months, was researching about how society can be embedded in machine learning models and, what kind of problems you can have between getting stronger relations between society and correlations in datasets, variables. And, for example, how can you shape your files in the case of COVID or things like that? When you for example, like, as as you might remember at the early of beginning of 2020, like, February, January 2020, there were also people trying to, for example, think, oh, let's see how AI can help with pandemic development and try to stop or or predict the number of cases and these kind of things. But, at the beginning and now now they can do it also quite well, but the beginning was quite impossible because we didn't have any actual data to to use to train their model or to create model in the first place, and that was sort of a paradox. And, therefore, you know, the data is always probably not to work.

Pierpaolo Hipolito [00:02:58]:
Did a did a lot of different techniques in order to augment the data or creating models, vehicle environment as a synthetic environment in which people moving around and spreading the virus or use mathematical models to to see how people can change states and so on. And that's what prompted them the the writing of the article and and so on. I think also the book by Julia Baird by the book of why was also a really good starting point where he also is famous in some paradox, which is one of the 4 paradoxes I worked in the access on.

Charles Max Wood [00:03:28]:
Very cool. So do you find then and I'm just curious as we get into this. Do you find then that people tend to they get more or better data? Are they able to overcome some of these issues with understanding the data, or is this something that you fall into even with a good data set or a more mature data set?

Pierpaolo Hipolito [00:03:48]:
Obviously, that's if you have some process of data quality in the first place, that that makes the work of the data scientist or, decision maker itself easier. But, you can still go into this kind of pitfalls. So so you you need to have some form of basics of knowing how the underlying system you are trying to model really works, and then you can model it so that you can make predictions on on it.

Charles Max Wood [00:04:11]:
Makes sense.

Charles Max Wood [00:04:11]:
It's a very interesting set of topics that you brought up with simulation modeling when you you have a sparsity of data coming into a problem. And this is something that, as data scientists, we deal with quite a bit where you might have a new project that your boss is saying or somebody at the company says, here's a new initiative we need to solve for. Go figure it out, data scientists. And even if it's a very well defined problem, which most of the time it's not, but sometimes it is, you go back to look in your data warehouse or in your your data lake, and you suddenly have that sinking feeling of, like, we don't even have this data. Talk to the front end devs or back end devs, like, hey, we're trying to look at this, this signal here, and they're like,

Charles Max Wood [00:04:55]:
oh yeah, we we were

Charles Max Wood [00:04:56]:
gonna capture that or put that in the logs, but we figured nobody would ever use it. So they turn it on, and you might have a month's worth of work where you're trying to get everything started doing your research, and then you just have a month's worth of data to play with. And to make an effective algorithm, you actually need years of data maybe to solve this problem. So what are some of the techniques that you found for doing that that simulation of saying, like, you said, people have been doing it with COVID, and how do they do that? Or how have you seen that done?

Pierpaolo Hipolito [00:05:27]:
Sure. So, yes. In with with regards to COVID, like you mentioned, I I I let alone there are 2 main approaches you can take. So one is, with, like, sort of, epidemiology models, which are based on using, differential equations. So in in this case, for example, you could start by creating a set of ordinary differential equations, and, each equation represents a compartment into the model. And then, as, you simulate and run these equations against time, then you can see how different, people are in, number of people in the real population. So moving in a compartment to other into the simulation. And in this case, for example, most simple model that you think of is could be a SIR model, which is a model with 3 compartments.

Pierpaolo Hipolito [00:06:16]:
People that have susceptible to disease, people who are infected from disease, and people who are recovering from disease. And therefore, you have discreting compartments with these 3 three equations representing different compartments. And depending on the number of people in the compartments and the one compartment to another. You can then run a simulation for which you can get the sort of, curves that's one of the scene, quite a lot around the which, they pick up the the pandemic, and and so on. That have been around the least part of them by different governments. For example, I know I know that at the early stage of the pandemic, we'll take our first season. The team from the Imperial College of Economics using a code. A sort of a compartmental model with, something like 20, 24 biopsies.

Pierpaolo Hipolito [00:07:00]:
So how complex this thing can go in order to make more actually dissemble the actual variable virus. Since that, you can take into account many as many virus as you want. So instead of having just susceptible, in fact, they recovered a comparison for people who actually died because of disease, compartment of people who have been vaccinated and before they're immune from disease, compartment of people who have time limited immunity. If they got COVID, they are immune for just a certain lots of time, and then they can get it again and so on. So you can have adjustment compartment as you want in order to try to make it to look as as close as possible as the real world. And the other approach is that you can take is by using, agent based modeling. And this is used, not just, mini video video model, but, also, like, for example, for training military applications or for also our main vehicle simulations. I can also it's more like taking using the Formula 1, raising or simulate tracks and so on and how a car can perform track.

Pierpaolo Hipolito [00:08:01]:
And, in this case, yes, basically, the main approach here, one of the most under one could be, like, to to get a class or non g training class and, which are centered to, for example, a person in the population and that's all the different characteristics which a person can have such as age or a certain type of, commute that he might might might do within a city And other factors such as, I don't know, being older, younger, female, or male, and so on. So that to make sure that it's, for each of this factor then it's, associated with a certain probability of having getting this more self remote from our disease or not and so on. And then, you can instantiate as many people as possible using the class and, obviously, run an algorithm to simulate how they interact each other within a artificial environment so that you can get the exact same response that you can get from a technology that more knowledge based.

Charles Max Wood [00:08:56]:
So this is rooted in game theory and Bayesian probability statistics where you're saying, hey. For a given population, I have n number of, say, if we're talking about COVID, we start with a a loci of, say, 10 infected individuals, and we say, map out how many people you're going to come in contact with, and then what is that multiplicative effect or that geometric increase in the number of people that might be infected if people don't change their behavior based on, you know, the current state of reality. And you can kind of approach a bunch of different problems with doing this as well. I know there I mean, I I've personally worked with companies that have done this sort of approach with tracking, like, cell phone location data of customers where they're saying, hey. We know where these people are in general. We don't know who they are. We just know a cell phone had an IP address ping that was triangulated to this street corner in Manhattan, and we know where they're kind of going, and we can simulate where they're gonna go and and where similar people to them are gonna go next week or next month and then sell that data effectively. So it it's interesting to hear about these approaches for something that's really important like a pandemic.

Charles Max Wood [00:10:12]:
What would you say is an effective strategy if somebody's thinking about applying these principles to their business from a data sciences perspective? It's this isn't traditional supervised learning. This isn't traditional deep learning where everybody seems to be focusing on those two things in the data science world. How would you recommend people get started in this this sphere of data science work?

Pierpaolo Hipolito [00:10:36]:
Yeah. I think that's much more a niche type of application since that, for for many cases, you can sort of gather data easily. But, from other cases such as for the beginning of the pandemic, you are you don't have anything. And that could be also for other applications, which has, for example, predicting, escaping rules for volcanoes or things like that. Or you want want to know model, how many people can escape from a CV if a power kind of rocks. That's because you don't have any much data about anything or things like that. You can just use this kind of model simulations, and and that's where they come useful. But when you update, obviously, it's better to produce the real world data to the publicating some form of artificial data, which then you use to make the models.

Pierpaolo Hipolito [00:11:19]:
And in in order to to to do it also, if anyone here is interested, in my case, for my project, I mainly created everything from from scratch, and that was a custom application, etcetera. But, there are also some open source packages for library engineers such as Mesa, which is Python, or ASH, which has also JavaScript and and c plus plus API which you can use. And they also provide you with, some form of for for some common place common common problems. So it could be, for example, some example the ad was, modeling the logistics, so optimizing the logistics for a store, for example, for modeling the the how a wildfire spreads in the forest. And for this kind of simulation, they also provide other suggestions that we can also focus the just on creating the best model instead of having also to create a fancy 2 d or 3 d graphs, you know, to showcase how the actual model evolves over time.

Charles Max Wood [00:12:13]:
Yeah. There's a a a customer of my companies that has been working on tracking the population density of both predators and prey in in national park, and it is interesting talking to them and asking them a couple of questions, like, hey. How do you do this and stuff? And they they walk through it, and it's almost identical to what you're saying. It's like agent based modeling. We're creating a a set of conditions based on mathematical models, and they're basically just passing partial functions around inside this class saying, hey. Here's the probability that this happens based on what we know of the interaction between these species on a very limited observer basis. But they wanna forecast, like, what should we recommend to the federal government of the number of prey that humans can hunt versus how many are available for the predators, and how do we get an ecosystem balance. And it's one of those things that you can't actually run those tests in the real world because it has a huge impact.

Charles Max Wood [00:13:11]:
And you don't you can't just run experiments, do, like, a design of experiments, but, like, well, this year, we're gonna allow hunters to cull 40% of these of these creatures. You know, you destroy an ecosystem doing that. So the the way that they get around that is this simulation of saying, let's just run a 1,000,000 simulations and see what happens. Yeah. That's that's

Pierpaolo Hipolito [00:13:30]:
that's an also a quite common, other application for because they I don't even know the name, but there are a set of 2, 3 equations which represent a span of the situation that are commonly used in in the in the style of each, like, oh, there are essential equations.

Charles Max Wood [00:13:43]:
Yeah. It's really, really interesting for anybody who's listening who is really curious about some of these mathematical models. Some of these equations are very old that predicts what stability will be for a particular population, or a particular, like, natural phenomena, but it applies to a lot of other things as well. Trying to remember that one that one, magical equation that gets a periodicity repeating. But there's an equation out there that if you feed in a certain value into it, and I think it's in the denominator, it'll based on what that value is, what that k value is, it will generate either a repeating binomial series or multinomial series that actually maps to observed data in populations. And it also translates to stuff like routing of vehicles, like how many paths can we actually successfully do in a steady state situation. So it's pretty fascinating stuff. So what are some of the other applications that you've seen for handling early stage serious problems that we need to solve now, such as the beginning of a pandemic and saying, what should we be focusing on, or how important is this?

Pierpaolo Hipolito [00:14:54]:
Yeah. I I think these these are the the most common ones. So then for another thing I commonly see after, for example, could be that, for some cases, where you don't have any data to start with, but you know that you can get data faster. You can also in order to try to tune these models, you can also start with the model in simulations. And then for some indication of COVID, after 6 months when you actually get the data, then compare compare the curves that were predicted with the partial ones that you can find from the models like other parameters almost, and and then get better results. So that's definitely another one. And then, for example, apart from that, I see now there is also for supervisor models or unsupervised models to create actual synthetic data, for example, for art form showing more problems which could be like over sampling techniques, under sampling or ebooks for example. I I saw also different type of some big names in financial solution creating synthetic data or financial enhancement data and actually serving them as a product.

Pierpaolo Hipolito [00:16:01]:
So I think there is interest in this topic. Also, we especially considering that, data can be also quite expensive to to buy and creating your own synthetic data can be can be also an edge on can be your signage basically on on other people.

Charles Max Wood [00:16:17]:
Yes. So synthetic data create creation, when we're talking about supervised learning, it can be a technique to sort of reduce the probability that your model is gonna receive completely unexpected data and not know how to handle it. And we're talking about linear models. Your your space is effectively contained by what your training set feature vector is. And if you get something that is such an outlier on enough of those vector positions, if they are dominant in that linear equation, you're gonna get a prediction that is not gonna make much sense. So and tree based approaches are even worse where you're confined exactly to that that sort of label space as well as the feature space outside of that. It it's not gonna extrapolate what could potentially happen. So simulations can can sort of curtail the possibility that you're gonna not respond to a black swan event, such as a pandemic, crippling the financial industry of of many companies.

Charles Max Wood [00:17:20]:
Is are there any downsides that you've seen for people that are using your company software, which is I've only ever seen really SaaS users, do this because it's pretty sophisticated stuff to intelligently create synthetic data. But are there any downsides that you have? Like, warnings, like, hey. Don't do this for this use case or, yeah, definitely do this.

Pierpaolo Hipolito [00:17:41]:
I don't know. I think it's quite user specific. So you always have to pay attention closely to what they're doing for for example, an analytics center that we offer, with SaaS is a response to create actual synthetic data. And I don't know how precisely that works, but it's something that I've been working the last few months for RND for now. And, but it it's also a quite important to remember in his in his case, like, quote from George Box, which he said, that all models are essentially wrong, but some of them are useful. So, there is always because you are trying to make a prediction on a population from a sample, there is always gonna be uncertainty and errors. So just keep in mind that, keep that in mind and be sure that, you always check for drifting the data or anything after you deploy the model because because, the model is not an actual recognition environment, but just an approximation of it.

Charles Max Wood [00:18:38]:
And for most use cases, when we're talking about even something like modeling out pandemics, there's a lot of latent factors that go into that. I'd say the latent variables that are actually incredibly important for modeling that out are uncollectible data, unless we have, you know, some sort of weird surveillance state going on, in countries around the world where everybody has a camera on them at all times. We know exactly what you're doing, who you're talking to, or what the proximity to physical interaction is. And but for a lot of use cases that people are doing in industry, it's similar. You are trying to predict whether a customer's gonna churn. You're trying to predict whether this activity is broad. You're trying to figure out if somebody's gonna buy this thing. The actual underlying influences, those latent factors that really drive what somebody is gonna do or that what the classification of that action is are unknowable.

Charles Max Wood [00:19:34]:
It's not gonna be in your training data. So how do you handle that? If you're talking to a layperson at a company and as a data scientist, and they're like, well, why doesn't the model get this right? How do you communicate that to them?

Pierpaolo Hipolito [00:19:46]:
Yes. Like, obviously, you need to make sure that they when you ask about they just they start by asking a business question. Once you set up your business question and you know what you're trying to achieve, you need to make sure that you have, all the elements made that possible because if you don't have the the data necessary on the, basically, it's necessary in order to achieve that. Then, all the other way you can try to go for working out a solution could be sort of to find a proxy variable or something that can communicate with the original variable that you are trying to interpret, but you don't have, but it's connected with so that you can sort of correlate between the 2. That's, I think, the main way you could try to to focus on that. And, for example, one way to do that, is by using Kalman filters and particle or particle filters. That's something that can can be used, especially since you see, like, in robotics or reinforcement learning type of applications where you you are trying to, for example, one actual in the example is, like, if you're trying to track the position of, of a car where it's moving around the map or in the world, and there is gonna be points in which, for example, the car might go under a tunnel or might go in a real real post in a, like, a forest or something, like, a street signal. And in order to try, for example, when you go under the tunnel to to to to predict how where is the car, You just take measurements of before and after you have information and tries to work out what's going on in the middle, basically.

Pierpaolo Hipolito [00:21:20]:
And that's how this kind of particle filters or carbon filters can get more, you know, in order to to estimate things with the proxy variable that you're on top. And but that's just a a very specific application. So, like, if you're working out with the standard supervised learning techniques, then if the, like, feature engineering with the the features you couldn't have and trying to see if you can work out how can they relate most to your business base is probably the best solution.

Charles Max Wood [00:21:49]:
So feature interactions in particular of extrapolating the the covariance associated with 2 separate features to provide further information. Is is there a limit to that of of how many elements or how how many features in a vector that you would say, hey. Don't go crazy. Don't do something like, hey. I'm gonna throw a 1,000 features into this model, and I'm gonna interact them all. What sort of output would you expect from something like that in supervised learning?

Pierpaolo Hipolito [00:22:18]:
No. I mean, I the general advice I think would always be to try to solve a really complicated prob problem in the most simple way possible. So Mhmm. Especially as even if you have seen, really good results at the end and you have thousands of features, then, and you actually want to deploy it. You only have to maintain it and make sure that you can sort of understand how the model works. And especially in some countries, like, for example, in the EU, there is a g the GDPR act. There is an act to perform the right of explanation of a customer. So, for example, that could be if you create this model for, deciding if someone should get a loan or not, and, they a person that come up, doesn't get a loan, and you go to the bank and ask, okay.

Pierpaolo Hipolito [00:23:04]:
I didn't get the loan to get it back. Can you explain why or what can I do today? And then the guy will say, okay. There is this model of 2,000 variables that does decide things to us. Okay? I don't think that's a a really good explanation.

Charles Max Wood [00:23:17]:
Yeah. A 100% agree. Even if somebody has 10,000 variables, should be going through feature reduction and saying, I'm gonna do exploratory data analysis. I'm gonna interact features. I'm gonna iteratively go through and do, covariance estimates and cross validation of different feature subsets and saying which ones actually solve the problem to keep it as the the simplest artifact possible. That's that's something that I don't see nascent teams doing very often, unfortunately, and that's something that people assume, well, I can throw, you know, 10,000 features into XGBoost, and then I can calculate feature importances, and then I can just take the top end. And then you have to kind of explain to them, like, that's not how you do it. You need to use other tools, statistical tools to It's

Charles Max Wood [00:24:05]:
the term sounds so smart, though.

Charles Max Wood [00:24:08]:
It sounds lazy, and it as engineers, a lot of us are lazy. Right? We wanna get to the next thing pretty quickly, and it seems like it's a good idea or that it can save you a lot of time. It actually just wastes a ton of time from every time I've seen somebody try to do that. You run into, another interesting problem with supervised learning where the curse of dimensionality. As that feature space grows, you now need more rows of training data for a model to even detect a correlation signal correctly, and that becomes, you know, exactly as you said, Pierre. It's it becomes expensive to start having to have that volume of training data, and gets complicated.

Pierpaolo Hipolito [00:24:48]:
Yeah. I do think it's even more interesting now with, how the whole field of transformers are evolving with semi supervised learning. Yeah. So and personal how they started with, NLP, and now they are going out. So also vision data or sound data. So I I think that's also gonna be interesting next year. So that that is gonna go on.

Charles Max Wood [00:25:09]:
With the sound data. So can you talk a little bit more about that, about what that research is like?

Pierpaolo Hipolito [00:25:14]:
No. But I just said because I I know that that's hugging face, which I'm really punch and if you can solve that. So, yeah, basically, yeah, you can use cell. So, like, transform, you know, you can transcribe, NLP data and then process it or engine it up. And I know, for example, so also Unity, the computer gaming engine is is doing quite a bit of research on that and also in concentrated data since that, obviously, they they have this sort of modeling environment like you mentioned before, like, the data used for games. So they're trying to make take advantage of them also for, you know Interesting.

Charles Max Wood [00:25:47]:
So because you have hardware that's running that game engine that has generally free cycles of a pretty beefy GPU sitting in a desktop, people are running that. You can just have a portion of those those threads in CUDA running a simulation at any given time so that you get dynamic gaming effects that respond to player input and how they're playing the game.

Pierpaolo Hipolito [00:26:11]:
Yeah. I also know that Unity can use quick Python in order to pay touch for the RL models. So they they are exploring on that. I don't know how precisely they're gonna use it since I I never worked for them or I didn't know anyone more. But, for sure, they they have interest since I since I did they developed SDK and everything. They they have findings everything in that area.

Charles Max Wood [00:26:32]:
That'd be an interesting metamorphosis of of video games in general, if you can have any what's that?

Pierpaolo Hipolito [00:26:39]:
No. It's like also, like, you know, unreal engine with optimizing, for example, great racing hosting. So that means I'll say, I got this in gaming. So that's that's and maybe see how personally you can ping ray tracing to machines that that we could be able to automate. And so, yeah, the the the there is definitely quite a quite a lot of room of improvement for this kind of application.

Charles Max Wood [00:27:00]:
Yeah. Currently, ray tracing on on, GPUs, that's a brute force algorithm. It's going through and calculating every point of a light source to every, like, every surface that it's rendering. If you can have an AI solve those.

Pierpaolo Hipolito [00:27:14]:
Yeah. Where that make that make it much fun.

Charles Max Wood [00:27:17]:
Yeah. Yep. So it doesn't have to do that path and calculation. That's pretty interesting. It's almost like using an autoencoder to say, here's what my expectation should be from this point source in this location. Now just generate the like, render the the frame for me on the next cycle. That's pretty cool.

Charles Max Wood [00:27:32]:
So one thing that, you know, kinda pulling it back toward this article a little bit, do you find that these paradoxes because you you gave visualizations. And so for me, I looked at them and I was thinking, well, I can see how I could come to that conclusion based on what I'm seeing on this graph. But do you find that these paradoxes affect the accuracy of machine learning models, or is it just something that we fall into as we look at the data and kind of try and find a pattern ourselves?

Pierpaolo Hipolito [00:28:03]:
I I think, like, if you leave, the models, the all the variables, it needs to actually solve the paradox, then it also can get trapped in it. Like, in this case, for example, we were neglecting the in the case of the example on the article where the cutting per neglect in the h as a variable, like, in the paradox. And then if you get this, then it can just give the labels and the exercise script and it neglect the h then the part the model could could get confused since that's the it just looks at the how the different features are compared to each other, correlate to each other, and and make prediction based on that. It doesn't have any external knowledge like we, human can actually do.

Charles Max Wood [00:28:41]:
Oh, got a bunch of dead space. That'll be edited out. Sorry. I was just looking over your the article real quick because I couldn't find the, the exact link to that one. But Simpson's paradox, that's one of my favorite ones to explain.

Charles Max Wood [00:28:55]:
Yeah. Again, right, it it gets back to because, yeah, I looked at it, and I kinda did the lazy evaluation and was was going, yeah, it looks like it says this instead of actually applying a little bit of rigor and going, oh, this is what these numbers actually tell me.

Charles Max Wood [00:29:11]:
Yeah. So one of the things that I I really like about that article is not just the explanation, which is great, and I find that I am fascinated by your reading style because I I read a bunch of other things that you had posted over the the past couple of days. But the plots, the visualizations that you use to actually tell that story are something that I don't see a lot of people doing. I see a not a lot of, like, newer data scientists, I'll call. Like, people that have gotten into this field in the last 5, 6 6 years. I see these explanations coming, or I used to see them all the time back when we used to be called analysis engineers or status you know, statisticians, but this used to be the bread and butter of everybody's analysis before you even talk about modeling. You have to go through and in this article, you're in Simpson's paradox explanation, going through and and creating this visualization where you're displaying the Pearson's correlation coefficient of these these two variables and saying, this is how they're related. Here's my evidence of fitting a line to this and how correlated they are to one another and extrapolating from that what is the interaction between these these features? Like, how closely related are they? How critical do you think it is for pretty much any ML to go through this this exercise that you did in explaining and teaching this on just general features? And when do you think that's an appropriate time to do it in an ML project?

Pierpaolo Hipolito [00:30:38]:
Yeah. I particularly, like, for creating good data. I guess that's one of the my favorite things when I'm thinking about it, creating analysis basically. So and that's also why, for example, I, tell my favorite task to your sensor model so that we wanna go more where we see where I also have to go in which I I really took time also to create interactive suggestions. So that, for example, some of them that I also embedded in not the defaults. You can actually rotate or they wait.

Charles Max Wood [00:31:04]:
Mhmm.

Pierpaolo Hipolito [00:31:04]:
For example, for explaining how the parameters works and how to find the best combination of private parameters or work without encoders. Creating a recording, for example, on an encoder using on an x and actually be able to draw on it on a web interface. That's the type of experience. You can actually test the model and see how it integrates the output and then work out what's going on, because you make it more interactive, and easier to understand. Especially because I as I mentioned before, like, I think obviously it should always be to to try to to create the simplest solution, to a complex problem. And, in in order to that, yeah, you need to dive deeper into what what you're trying to model, first of all. And using sanitation is one of the easiest way to do that because otherwise, you can also, you know, create some exercises and things that can help you, but without actually seeing how the the things works out in monograph, I I think that's just the part to the the view that you might miss missing out, especially, like, as I said before also, like, reducing the dimensionality of your data using PCA, other folders, or anything that are also really helped, especially when you work with pretty sparse data or new text data that's up to convert into my format. And I I think that's also a really simple icon or, like, I yeah.

Pierpaolo Hipolito [00:32:23]:
So I think that data preparation, so engineering your features and, you actually be able to to expect that now. But, also, you as a user are creating the model. You can understand how how you would predict a basement data. That really helps work out. So basically, and that's what we present, like, how I would really make this to, how they create a works. Like, how doing the how do you think really well? 80% to the outcome or something like that?

Charles Max Wood [00:32:54]:
Yeah. Totally. And creating those those charts and plots from my own personal experience, I don't know if I'm just old school, but I look at the charts that you created for these articles, their correlation plots. I'm like, man. I do that too. So, like, I but I I do it in that way, and I'll color code based on, you know, label cuts that I do on the data. And, you know, I'm looking at continuous series, but I'll save them all off somewhere so that after I've gone through, you know, testing a bunch of different model approaches, trying to find the simplest, most performance, cheapest solution generally, I always go back and look at those when I'm trying to do interpretability analysis from running, say, SHAP on top of that that model and saying, hey. Figure out you know, I'm gonna sample a a 1000 rows from the dataset, and I wanna see the force plots for SHAP for these.

Charles Max Wood [00:33:43]:
But when I'm looking at those, I have the correlation plots open right next to me with all of the color coding, And I'll look and I'll say, oh, yeah. That makes sense why the model has learned that because the correlation relationship between these is this. And even creating synthetic rows to run it through SHAP testing and then referencing these correlation plots, that can help figure out where that space is. And, yeah, it it's super critical, I think. And, hopefully, more people can read and look at the examples that you've done in these in these articles because it's an insight into how it's supposed to be done, and you're going through a deep analysis of of your data.

Pierpaolo Hipolito [00:34:23]:
Thank you.

Charles Max Wood [00:34:24]:
Yeah. That's Simpson's paradox. That is that's probably my favorite one that can be the most interesting paradoxical ex like, discussion that you can have on a dataset because it exists pretty consistently, I've found, in real world data where if you just plot a continuous variable against another continuous variable or if you're doing, you know, an ANOVA analysis of of categorical data and you're visualizing that, it can be pretty misleading because there's some confounding variable that is actually explaining that that noise that exists in that plot. And if you don't actually color code that or map that or split that and do that per subsampled subgroup analysis, you might lose that signal in explainability. You use the the age versus, cholesterol level in in your example, but it's pretty rare that I've found an actual real world use case where I don't see interactions like that happening. So to the listeners out there, I highly recommend reading that article in particular from him, because it really does break down some of the ways that you can get burned in ML, without really understanding your feature data. Oh, we had another pause.

Charles Max Wood [00:35:34]:
Yep. I'm just wondering, like, what other gotchas can you run into? And is there is there a good way to make sure that you're not running up against that stuff?

Pierpaolo Hipolito [00:35:43]:
I yeah. I think, like, the accuracy paradox is is working very compared to the overfitting. So I'll try to, for example, focus too much on that, move things and then stay away from the bigger picture, make you lose points when trying to attribute on real world data. So and that's gonna be that is also another thing to do is why common because at times, especially, you might try to use yeah. When you try to use more com more compact com models more complicated than necessary, then that's what usually what happens. Like, you overfeed to the actual training visa, and, then when you try to make predictions on out of sample and data points, you you you get false performances. So and, basically, that's also why it's a paradox because you when you train when you try to create a model when we're actually trained, you try to improve, the accuracy as much as possible. But, when you test it in a app set on the test set or as a validation set, you just notice that trying to improve the accuracy in the 1st place then drove worse accuracy when you actually did it, later on in in production.

Pierpaolo Hipolito [00:36:56]:
And and and that's why we yeah. It's really important to to not focus too much on, like, if you make up your own target, the only thing that matters, then it stops being a good target. Like, because you you that's there is your way to to natural process, which would be in between creating a a new product.

Charles Max Wood [00:37:20]:
Yeah. I mean, it's I'd say that it's endemic, and I have a name for it that I use. I call it the Kaggle paradox even though it is the accuracy paradox. But it's almost the fetishization that data scientists have these days of competing, of, like, sort of winning this the accuracy game at the expense of maintainability and explainability. I've seen people spend months on projects that they could have solved something at an accuracy value for classification, for instance, Like a binary classifier, they could have had an an area in our under ROC of 0.94, but and they could have done that in a week's worth of work of building a very simple model, just doing their their feature engineering to get it to be somewhat accurate. And, on a holdout validation, they would have a result like that. Instead, they spend 7, 8, 9 months working with an ensemble of you know, an implementation of basically somebody's white paper that hasn't been proven or referenced, And they install PyTorch, and they build this graph embedding solution that I think is gonna, you know, work really well. And after that almost year of work, their result on the same holdout validation is 0.957.

Charles Max Wood [00:38:40]:
It's like, what was that really worth it? I mean Yeah. It it's cool, I guess, to build something like that, but how expensive is that gonna be to maintain? And can you explain that? Like, how it arrived at that that prediction?

Pierpaolo Hipolito [00:38:54]:
Also, in terms of car carbon footprint, it's probably not the best. Uh-huh. Mhmm. And, yeah, I I saw about it a few actually a few weeks ago, a tweet or something online or maybe it was a LinkedIn of someone saying, like, yes, if you are a data scientist now and, you have to solve a problem, of a model you have to take in production and you have to solve a problem. There are 2 two things you need to do. 1st, create a linear model which can go into production, and then spends months creating a really weird and sophisticated model so that you can just put it on your CPU

Charles Max Wood [00:39:34]:
later. Yeah. I mean, resume driven development is a thing. I've seen entire groups at companies just dedicate themselves to that. They just wanna do the most cutting edge thing, and it it does a disservice to that company when you have an entire team with that mentality, when that baseline, that first attempt I don't even use linear models to do my my baseline, by the way. I'll do a simple fit of, like, a simple regressor if I'm doing, you know, a, a regression style problem. If I'm doing categorical, I'll just use an an if else statement based on EDA. I kinda have a a breakdown in my head of, like, okay.

Charles Max Wood [00:40:14]:
Here's the relationship. If the values are between this and this, it kind of correlates to this for a prediction, and write a very simple decision tree in SQL. But that's my baseline to compare everything else against, while I'm doing experimentation, but I timebox everything too. I'm like, hey. I if I'm testing out an algorithm, it it gets 48 hours of attention. That's it. If I can make it work and it seems promising in that 48 hours, it's a candidate. If it's too complex to even or the API just sucks so bad that it's almost unusable, it's out of contention because it's gonna take too much time and money.

Charles Max Wood [00:40:50]:
And as you said, carbon footprint, that's a real thing. Electrons are precious. They come from fossil fuels, you generally, so don't waste them.

Charles Max Wood [00:40:58]:
But but I like electrons.

Charles Max Wood [00:41:00]:
Everybody does.

Charles Max Wood [00:41:02]:
Yep.

Charles Max Wood [00:41:03]:
But the overfitting, underfitting stuff with that that effect in in the pursuit of accuracy at all costs. It also comes with it a process that you need to adhere to while you're developing stuff, where, you know, you see people doing stuff like hyperparameter tuning with grid search and cross validation, where they're running through and they're doing random splits. Maybe they're stratified, maybe they're not. They're doing under sampling, oversampling if they have a massive class imbalance. And when you're doing that that random iteration of cutting for train and tests and over and over and stuff, what do you recommend for that final evaluation when you're talking to people about avoiding the accuracy trap of this situation? Are you saying, hey. You're you're pursuing the ultimate, you know, cross validated accuracy measurement above all else. How do you make sure to say, hey. This at least you have confidence in production that this is gonna work pretty well?

Pierpaolo Hipolito [00:42:01]:
In in this case, probably, yeah, if it's a classification problem, make sure that, there is a good balance between different classes, first of all. Then, then you say by and sensitivity, guide you not not not just accuracy. And, then, yeah, look at the and also the leads or the spanometrics, you know, to to get track of, how they vary from training to prediction to actual testing. Yeah. And, also probe your model with, with the testing. Yeah. Also, come up with some good out of sample, measurements, etcetera, that you can actually take it on to see if, the model can get weird to really from some specific edge cases like like you would do when creating a span of a piece of software that you think of edge cases that might free pay your model from. It.

Pierpaolo Hipolito [00:42:50]:
And so that you can go also, you know, where where are actually the issues of bandwidth that your model is using there and where you can can go wrong or not or how to change the margins or these kind of things. And that's can be also important, for example, a security point of view because, indeed, when you talk about cybersecurity attacks with ML, there are actually enough spyware people trying to probe these models to to actually understand where they draw a line at the prediction and then try to go pass them in this way. So that's actually something yeah. That's also a sort of can be a penetration text environment, like, making sure that you understand how your model can attribute it wrong. So it's, yeah, you can actually model the input production in some ways and reason for that.

Charles Max Wood [00:43:31]:
That's funny that you mentioned that. I think we talked about that a couple weeks ago, Charles. One of one of the guests was we were talking about fraud detection and how how hackers actually how criminals criminal hackers actually try to detect what they can get away with of, you know, running brute force simulation attacks, and that's what they're actually doing. Oh. They're building this sort of table of saying, I'm gonna do these actions in this order. Try that with a burner account. See what happens. Now I'm gonna try another thing.

Charles Max Wood [00:44:03]:
Where do I actually not get detected for doing something illegal? And that's where they'll operate. Like, hey. Your your system doesn't detect this, and I'm gonna I'm gonna steal a couple $100,000 from you. You're not gonna even know. And it it actually does happen. And so you brought up a great point. That's something that a lot of financial institutions do with that simulation. They have white hackers effectively that are trying to defeat that that algorithm and say, hey.

Charles Max Wood [00:44:30]:
This is definitely some suspicious behavior. Let's see how the the model behaves with it. And if it doesn't detect it, then they know, okay. We gotta retrain or continuously retrain and monitor for those those edge cases.

Pierpaolo Hipolito [00:44:42]:
Yeah. And I'll show you some also how creating overly complicated models that we saw sort of a low for unintended consequences when, some lead has a a triggering input to then cause a second super important structures, which you wouldn't have expected and everything goes down. Mhmm.

Charles Max Wood [00:45:00]:
Yeah. With some of the interactions that happen with with super complex architectures, yeah, there's no way that a human can reason about how the internals of that work. And that's why my advice to people is always, if it's something that is so important that there's a lot of money on the line or your team's existence is on the line as a data science team in a bigger company, It has to be something that you actually understand how it's reasoning through the correlations that you're providing it. Because if you can't think about the worst case scenario where people rely on the output of this implementation to make a decision that influences, say, 10% of your company's revenue, which could be 50 times or a 100 times your entire team's budget. Think about what an executive is gonna think about if you cause that amount of damage to the company because of your model. So understanding how it works is far more important than, as you said, that resume candy. Like, oh, we built this cool thing, and I'm gonna put it on our CV. It's like people respect the fact that you put out 10 linear models that actually worked and provided value far more than that super cool reinforcement learning system that was so overkill for predicting churn of a co you know, a customer.

Charles Max Wood [00:46:15]:
Yeah. But I wanna prove how cool I am.

Charles Max Wood [00:46:17]:
I mean, I tell people that's what the weekends are for. Nights and weekends do some cool stuff, and that's actually what Kaggle's for. Like, use that use that tool for what it is. It's your way of showing off and and building a portfolio if you if you wanna be that that person. But when you're working for a company, you gotta take it seriously, and you need to produce something that is maintainable, solves the problem, and is explainable.

Charles Max Wood [00:46:42]:
Yep. Cool.

Charles Max Wood [00:46:43]:
That seems to be a common theme in almost all of your articles is, like, talking about that foundation aspect of of data science work. And I personally find it fascinating seeing somebody who comes from academia sort of rigor of how to do stuff, and you've converted it into something that I don't see a lot of people do, which is real world practical advice on here's how this should be done, but your posts aren't filled with pages and pages of mathematical proofs. They're more like, here's some charts and visualizations. Here's me explaining why this is a a thing, and here's a link to the proofs and stuff for you know, you wanna read about it, here's where you go read about it. But here's something that, you know, everybody can can consume. I think it's fascinating.

Charles Max Wood [00:47:28]:
I hope

Charles Max Wood [00:47:28]:
you write a book, man, because, I like your style.

Pierpaolo Hipolito [00:47:30]:
I'm reminded to make sure.

Charles Max Wood [00:47:32]:
Yeah. When you're when you're a famous, world renowned data scientist, remember us?

Pierpaolo Hipolito [00:47:38]:
I don't know. I think that's my my name will happen.

Charles Max Wood [00:47:42]:
Alright. Well, we're kind of getting toward the end of our time. I just want to, you know, call out the article again because I thought it was really well done. Ben's pointed that out as well a couple of times. If people want to find more stuff from you or, you know, maybe connect with you and ask you questions, usually, we're looking at, like, LinkedIn or GitHub or Twitter or something like that. And then it looks like you're also on Medium. Do you wanna just let people know where they can find you?

Pierpaolo Hipolito [00:48:10]:
Sure. So they can message me or anyone can message me or link to the next day if you need to, or they can also go to my website where I have a contact page, if you want to send an email that you can reach them to the and and I also have a newsletter, so you can get a subscriber to to get updated when a new article comes out. Awesome.

Charles Max Wood [00:48:29]:
Nice. Alright. Well, let's go ahead and do some picks. Now picks are just shout out stuff we like, stuff we want to let people know about. Ben, do you wanna start us out?

Charles Max Wood [00:48:38]:
Oh, I've got a nerd pick this week.

Pierpaolo Hipolito [00:48:40]:
Woo hoo.

Charles Max Wood [00:48:41]:
And it's what's consumed a lot of my time this week, which is, working on libraries in Python. And I hadn't used it before that much because a lot of the stuff that I've been doing over the last several years were sort of applications engineering, like using somebody else's APIs and now working on stuff that Mhmm. Building from scratch for other people to use, getting into the Pydantic library and, doing something that is run time, type checking for Python. That is a fantastic library. Massive kudos to the the engineers that built that and open sourced it. It it powers a lot of the stuff that's in fast API, which is a also amazing, package that's out there, but type checking is good. It helps you write far easier and more more knowledgeable and readable unit tests and catches a lot of errors that you might otherwise miss with a non compiled language.

Charles Max Wood [00:49:40]:
Yep. Very cool. I'm gonna jump in with a few picks of my own. I think last week, I picked the 360 degree liter, but I'm gonna pick it again just because I've wound up doing a bunch of coaching, and I answered some questions on our weekly q and a call, which, you can find at top endevs.com. Just scroll down. It's under events. But, anyway, people have been asking, how do I manage this, or how do I deal with this, or my boss is doing this? And, this book really does kind of outline, hey. Here's how you deal with people who are, you know, above you.

Charles Max Wood [00:50:17]:
Here are the people here's how you deal with the people that are around you, and here's how you deal with the people that work under you. And it really, is just a terrific book that outlines a lot of that and a lot of the approaches and how to be a leader even if you don't have, like, lead dev or manager or CTO or something for a title. You can still lead on your team. And I think it's an important skill to have to be able to just let people know, hey. Look. You you can make a difference wherever you're put. So, I'm gonna shout out about that. Also, top end devs.

Charles Max Wood [00:50:53]:
So as this goes out,

Pierpaolo Hipolito [00:50:55]:
because I think this will

Charles Max Wood [00:50:56]:
go out next week. It's either next week or the week after as we record this. Top end devs is getting a new website, and I've really just gotten a lot of clarity on what I would like top end devs to be and, you know, what our mission is for the dev community and things like that. And so the the couple of things that are going on there, one is is that I am working with people to put together courses on, various things related to software development, but I'm also, tapping into I have a lot of friends who work not necessarily in tech or dev, but are experts in some of the soft skills that I believe technical people need. And so we're probably gonna have some master classes and or courses on things like networking and not not like IT networking, but, like, networking with people or speaking at conferences or some of the leadership stuff that I talked about with 360 degree leaders or how to find a job or how to collaborate with other people and pairing and stuff like that. And then I do intend to also have the technical content. So if you're interested in authoring any of these courses, go to top enddevs.com/ author. You know, we're looking at video content and some audio content on the premium end.

Charles Max Wood [00:52:11]:
The podcasts aren't going anywhere, but this is kind of the next stage of where I see things going, for us and for ways that we can make a difference for all of you. So, anyway, I'm gonna pick that. And then if you are looking for sort of 1 on 1 walk through coaching, typically, people are getting on because they want to talk to me about creating a media channel like a YouTube video or a, podcast. So I've been coaching people on that. I've also been coaching people on, going freelance, and I've been coaching people on just kinda taking the next stage of their career. So if you're interested in any of that, you can also go to top endevs.com/coaching and see what's offered there. So I'm gonna shout out about all that. And then I am gonna do a board game pick.

Charles Max Wood [00:52:58]:
I need to start keeping track of what I'm picking so that I don't repeat. But last Saturday, I was teaching people how to play games at a video game convention here in Utah. It was a rather small one. And I did it as part of a marketing push for a friend of mine who owns a board game store here in Utah. And the game we played or I we taught, like, 6 games, but the one that I learned on Saturday before I had to take my turn teaching was, Planet x Search for Planet x. So if you like logic puzzles and you kinda like the dynamic of trying to figure out what the answer is like Clue, they kept comparing it to Clue, but this game is way more fun. Go check it out, the Search for Planet x. You do need at least one smartphone in order to play it, but, essentially, what it is is it's just the it's the piece of the game that gives you the clues.

Charles Max Wood [00:53:51]:
Right? So it's, hey. Here are the clues you're gonna get, and then when you do research or you survey part of the night sky for items, right, it'll it'll fill you in and say, okay. You know, there are so many asteroids here or whatever. But it was way fun, and so I really, I I enjoyed it, and I'm gonna pick it. So, go check out The Search for Planet x. We'll have links to all that in the show notes. Pierre, what are your picks?

Pierpaolo Hipolito [00:54:14]:
For me, it would be the the book of why, like I mentioned before, by Peter Pan, in case you are interested into digging more about causality or the sort of paradoxes and why it's impossible to to make models experimental, and that they can embed some form of external knowledge of the world. So, yeah, that probably is my my take for this space.

Charles Max Wood [00:54:34]:
Alright. Very cool. Well, thanks for coming. This was a lot of fun. I almost asked you where where people could connect could connect with you again. We'll make sure all that winds up in the show notes. But, yeah, I just love thinking through some of this process and going, okay. How do I make sure that the foundational data for my models is correct and giving me the right information? And then from there, yeah, Now how do I build this so that it's useful to people? So this was really helpful.

Charles Max Wood [00:54:59]:
Yeah. It's great.

Pierpaolo Hipolito [00:54:59]:
Sure. And thank you very much for, inviting me today.

Charles Max Wood [00:55:02]:
Yeah. Alright, folks. We're gonna wrap it up here. Until next time. Max out.

Pierpaolo Hipolito [00:55:06]:
Bye. Thank you.

Navigating Common Pitfalls in Data Science: Lessons from Pierpaolo Hipolito - ML 183

0:00

0:55:07

Playback Speed: