Michael_Berk:
Welcome back to another episode of Adventures in Machine Learning. I'm your host, Michael Berk, and I'm joined by...
Ben_Wilson:
Ben Wilson.
Michael_Berk:
And today we are going to be talking about distributed time series, which is just time series at an extremely large scale. And this is actually a topic that I have been really looking forward to, and I'm really excited about. And the reason is Ben is arguably a world expert in this field. So the plan for this podcast is I'm going to ask lots of dumb questions and Ben will be enlightening both you, the listener and myself. And by the end, we should have a better understanding of the basics of distributed time series and have some practical advice for thinking about and implementing these methods. So Ben, before we get started, do you mind walking us through your experience in the time series world and specifically the distributed time series world?
Ben_Wilson:
Sure. So I don't know about the world expert thing, but about a year and a half ago, or maybe it was a year ago, we started to see at Databricks that a couple of our customers were trying to do something that everybody's always wanted to do and strove to do: generating forecasts of inventory and whatever else they needed for their business, to try to predict the future with time series data. And historically, people had aggregated data to a higher level and then wrote some business rules that said, well, if I have this location that sells this thing, I can't build a model for each of the things at that location. So maybe I'm gonna aggregate to a regional level and build fewer models, because it just takes so much time to generate those, and it's really complex to write all the code and maintain that infrastructure. So with the advent of distributed computing, and it becoming more available and accessible to people, a lot of people have gone to saying, well, we've got Apache Spark, we've got pandas UDFs, we can just run a ton of our forecasts across a Spark cluster and it'll be fine. And then they realized, oh, we need to log information about this. We need to store the model artifacts, the object that was created from a training event. And we need to store the metrics and parameters that were associated with that, because we have audit needs or we need to see how stable this is over time. So they start using MLflow, and they start writing out, for 500,000 models, each of those 500,000 events: the artifacts themselves, and then each model might generate, you know, six different metrics and 20 different parameters, and it quickly overwhelmed the systems. And most of their computation time was spent just writing to a database, and writing really small data in high volumes to a database, which databases typically don't like. So we saw this pattern, we analyzed how people were doing what they were doing, and realized that there were two libraries that everybody was using: Facebook Prophet, which is now Meta Prophet, and pmdarima, which is an automated tuner for statsmodels time series models. And then there were a couple of other people that were using statsmodels-based stuff. And we set out to try to figure out a way to make this work for people so that they didn't have to write 20,000 lines of bespoke code in order to run time series forecasts on their business. And we came up with an open source package, released about seven months ago, that does handle the aspect of interfacing with tracking services on MLflow. It'll log all of your data in as few steps as possible and doesn't overwhelm MLflow's tracking service. That morphed into us asking internally in engineering, like, hey, could we make this run stupidly fast? Making it so that we could do a million Prophet models in less than an hour. Could we do 10 million in less than an hour? And we wrote some prototype code that some people looked at and they're like, wow, you need to patent this. So we filed some patents, the patent office looked at that and they're like, wow, yeah, this is original work. Here's your patent. And then about four months ago, I was told, hey, take those ideas and actually build it for real. So I don't know about world expert. It was just an interesting problem that a lot of people were struggling with, and in the process of that, we figured out a lot about time series modeling and how to do it at extreme scale.
Michael_Berk:
Got it. And before we get into a case study, just for context, Databricks has done a lot of distributed computing in general; that's sort of the bread and butter of the organization. So this was all in a Spark, Databricks environment, and it lent naturally to some of the work that we were doing. But cool. So just one more piece of background before we get into the case study. In prior roles, I had been tasked with developing KPI forecasts, and a KPI is a key performance indicator. So it's a bottom-line business metric that is super important. And based on those KPIs, people might change business strategy, or it might just be nice to know what your revenue will be in a year or two years. This was sort of a challenging endeavor, because maybe we have one base KPI that's the most important, but then there's a whole swath of 15 other KPIs that are also really important. And developing a time series model for each of them... I mean, the first one is kind of fun, the second one less fun. By the 15th one, you want to put a bullet in your head. So ideally there would be some automated way to have this work efficiently and accurately. So by the end of this, hopefully you'll have some ideas on how, if you do want to implement something like multiple time series forecasts for KPIs in your company, you can maybe do that in a few hours or a weekend. But also we'll go into the crazy scale that Ben has seen. So here's the case study that we will use to anchor our understanding. We are a hot dog stand company based out of New York City. As you might know, there are lots of people in New York and lots of people that eat hot dogs. And we want to run, let's say, 500,000 forecasts a day on ingredients. So let's say ketchup, mustard, sauerkraut, buns, dogs, you name it. The incoming data we have is a batch process that is hourly, so we have hourly data coming in. And then going out, let's say we want to develop a weekly adjustment to our ketchup acquisition strategy, and that will be informed by these forecasts. Starting off, Ben, how would you think about this in the most simple and dumb way? Or simple and... you know what I'm saying.
Ben_Wilson:
So the simplest thing that I would start with is to look at it from a company perspective. And this is where most people get started. Like, hey, we just need to make sure that our main warehouse has enough ketchup and mustard and relish on hand, and we need to make sure we have enough hot dogs and buns to meet the needs of all the boroughs of New York City. So that would be like a central warehouse forecasting, inventory management system. And we would train that on each individual item over a long period of time. We might also want to do an aggregate of just general sales as well, like how many hot dogs were sold, and then look at what the similarity is between that forecast and individual products. Because at the hot dog stand, we might not just be selling hot dogs. We might be selling French fries, and maybe people want ketchup with their french fries or mustard with their french fries or whatever. So we might wanna understand that, but that's gonna be a relatively small number of models, and it's probably doable by a single person to maintain that. Step two after that would be: how do we get the right number of products on each cart each day or each week? Wherever those carts are stored at the end of the night, there's probably a very small warehouse in a district within the city that's being rented out, where they're keeping stuff on hand for that week. So that could be an optimization of the operations side of the business, saying from the main warehouse, hey, we have 30 crates of ketchup, where does each of these need to go across these 15 different, you know, mini-warehouses that we have? And if you have a forecast, you know, oh, well, demand's typically going to be high at this location, they're probably going to need three. This other one might only need one. This other one might only need one every three weeks. So by having those forecasts and having, you know, projected output, and putting that in the hands of the operations people who are deciding who gets what and what's the status of everything, that could be a logical next step. And then the final level of complexity with that would be what we're really talking about today, which is each cart, each ingredient. I need to know exactly, you know, where to actually put the smallest unit across each of these storage locations too.
Michael_Berk:
Practical question: so we have sort of a hierarchical structure, and I think this is really common in tangible ingredient time series forecasts and just other things in general. Would you take individual cart ketchup forecasts and sum them up and have those be the warehouse requirement, or would you have a time series model for the warehouse, a time series model for the distribution center within the district, and then a time series model for each cart?
Ben_Wilson:
It gets really hairy when you start summing up the forecast prediction estimates of individual models. I've seen people do it. And when the models are trained really well and the forecasts are all really tightly controlled, where there's no, like, ridiculous forecast happening, it's hard to tell the difference between aggregating the raw data into a single series and training a model versus aggregating all of the individual predictions. But the other 99% of the time, when you try to do that, you get ridiculous data. Like, the forecasts just don't make any sense. Because you might be really far off on 10% of your individual unit forecasts. Maybe there's extreme variance on a new stop; maybe we're putting a cart for the first time in a part of the city where we've never operated before, we only have a limited amount of data, and we get burst traffic of people discovering it for the first time. And then you just get this lull in the data for days on end, where there's just not much traffic coming there. When you don't have an established amount of stability, like a baseline, those models... let's say that 10% of those models produce ridiculous things. Now, you can write programmatic controls into your logic for that, saying, hey, disregard these things. But that requires a lot of work and a lot of human intervention, which most people are not gonna be doing. You usually have better things to do with your time. So if you're not culling that from a summation, it's gonna be very risky when you sum those up and you're like, hang on a second, it's saying that next week I need more hot dogs than we needed in the last six months. Like, what is going on? If nobody's checking that and testing it, and of course in this theoretical company that we're looking at, nobody's going to actually put that order in, nobody's going to automate that. Like, they shouldn't; there should be somebody looking at that and being like, wait a minute, that's a lot of hot dogs, are we sure? And if you do something like that, and the model's output, or the meta output of a bunch of different models, is so bad, when people look at that result, they're just gonna be like, yeah, the data science team, they're a bunch of idiots, we're not using this anymore. Like, look at this number, this is ridiculous. So it's risky, which is why I've never done it before. I'll just build another model. If somebody wants, like, hey, how many hot dogs for Brooklyn over the next six weeks, I'm going to create a Brooklyn hot dog model. That's just, you know, a summation of the last year or two years or three years of data, summed by location for those keys, and then I build a model on that. That model is going to take maybe 30 seconds to train, it's no big deal: generate an output, and that's just going to be a grouping condition key to present.
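As a rough illustration of the approach Ben describes, aggregating the raw cart-level data up to a borough-level series and training a single model on that rather than summing per-cart forecasts, a minimal sketch might look like this (the file path, column names, and the choice of Prophet are assumptions for the example):

```python
import pandas as pd
from prophet import Prophet  # assuming a Prophet-based pipeline

# cart_sales: hypothetical long-format data with one row per cart per hour.
# Columns: ds (timestamp), location (e.g. "Brooklyn"), cart_id, y (units sold).
cart_sales = pd.read_parquet("cart_sales.parquet")

# Aggregate the raw data for one borough into a single daily series,
# rather than summing the outputs of many per-cart models.
brooklyn = (
    cart_sales[cart_sales["location"] == "Brooklyn"]
    .set_index("ds")
    .resample("D")["y"]
    .sum()
    .reset_index()
)

# Train one model on the aggregated series; this takes seconds, not hours.
model = Prophet(weekly_seasonality=True, yearly_seasonality=True)
model.fit(brooklyn)

future = model.make_future_dataframe(periods=42)  # six weeks out
forecast = model.predict(future)
```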
Michael_Berk:
Yeah, that's a really important insight. If your model throws out a million hot dogs and you start placing an automated order for a million hot dogs... actually, a million might make sense, so let's say a trillion hot dogs. That's going to be a problem, and you will lose a lot of faith.
Ben_Wilson:
and
Michael_Berk:
But
Ben_Wilson:
probably
Michael_Berk:
Ben
Ben_Wilson:
your
Michael_Berk:
hinted
Ben_Wilson:
job.
Michael_Berk:
at... Yeah, probably. But Ben hinted at a really interesting point in that... Individual time series tend to be a lot more volatile, but not just that. The prediction intervals around those time series, if you sum them up, you almost always have an unusable forecast in my opinion, or at least in my experience. That's because in, let's say, a time series with three observations or something, you can have crazy wide confidence intervals. You just don't have the sample size in certain... food trucks, let's say, to support summing all of that up. So if you're using prediction intervals, it's a lot better to aggregate and then train a single time series versus having many summed up time series. Just a quick practical note, because it's bitten me before, and I wish I knew.
Ben_Wilson:
Yeah, I mean, I've seen people's results of that. And I've tried it. I'm not gonna say, like, oh, I knew this from the beginning, like I'm so knowledgeable. No, I've definitely tried it. When we were doing product forecasting at an e-commerce company that I worked at, I ran a bunch of models and I was like, I wonder if I sum all of this up, does this get me out of having to write a bunch of ETL to create this other model that somebody's asking for? I tried it and then I saw the results. I'm like, wow, that's funny. That's broken. Okay, can't do that. Lesson learned.
Michael_Berk:
Yeah. So bringing it back to the case study, let's actually start off small. If a listener out there is looking to develop, let's say, 15 KPI forecasts that are automatically going to retrain every X amount of time, how would you approach that from an infrastructure and a programming library perspective?
Ben_Wilson:
So the interesting part of the case study that you brought up is the periodicity of incoming data. And if we're getting hourly dumps of sales, and we're very confident that we're not missing data, there's not a lot of late-arriving data; maybe for that first hour after things come in, we're going to have data still trickling in for the hour before. The thing that I always talk to people about with time series forecasts is: as a data scientist or machine learning engineer, when you're generating these things, you need to think of it not from your perspective of, yeah, we know that a lot of effort and a lot of time has gone into building this. You've tested out probably dozens of ideas, maybe hundreds, and it's painful to go and train these things to get them to perform well and maintain them. But look at it from the other side, as a consumer of your results, of your output. They're going to be looking at your crystal ball, and to them it is a crystal ball. Okay, there's this magician, you know, telling me what my future is going to be. And if you're really far off, and it continues to get worse over time, they're going to lose faith in that implementation or in you as a practitioner. And it's probably not even your fault. It's like a latent variable that's really screwing up the forecast. So what I talk to people about is: think about how to consume that real-world feedback that you're getting as it's coming in, and check it against what your internal customers would be seeing. There's a data set out there that says this is what our forecast is, and we have hourly forecast data. If we're doing this in hour chunks of sales by hour, we have last week's forecast. So today's Friday and the model generated a forecast on Sunday; well, what did the Sunday forecast say Friday was going to be, as the actuals come in? And how far off are we? Now if you're right on the money, then cool. If you're within a couple of percent, awesome. If you're more than, you know, 30 or 40% off of what the actuals are, and it's consistent, like there's just this down trend the forecast didn't catch, and the real data coming in is tanking compared to what the forecast was, it's time to think about kicking off another training session on this new data. It might not have to be immediate, but that's the great thing about nighttime. Because people have to sleep, you can have all that stuff kicking off at night, like a retraining event. If it's detecting, like, hey, we're really far off from where our forecast said we were going to be, we should retrain and republish that data set to say, here's the updated values. But if it's really dead on, you know, within 1%, why retrain it? It's fine. Don't need to kick off another run. It'll eventually degrade. It's not like you can generate a forecast for hourly data eight weeks out in the future and expect it to be accurate eight weeks out. That's not how this stuff works. You might be good for maybe a week, maybe two weeks, kind of okay on hourly frequency, but generally it's not gonna work that well for really long-term forecasts. But that's the big thing that I tell people: hey, think about what the perception is of the quality of what you're producing, and if you have data coming in that you can trust, and you can compare it against what your forecast said, use it. Set up a very simple automated job that's just checking that.
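A minimal sketch of the kind of automated accuracy check Ben describes, comparing last week's published forecast against incoming actuals and flagging a retrain when the error drifts past a threshold (the file names, column names, and the 30% cutoff are assumptions):

```python
import pandas as pd

def should_retrain(forecast: pd.DataFrame, actuals: pd.DataFrame,
                   threshold: float = 0.30) -> bool:
    """Compare a published forecast against incoming actuals on matching
    timestamps; return True when mean absolute percentage error exceeds
    the threshold."""
    merged = forecast.merge(actuals, on="ds", suffixes=("_pred", "_actual"))
    if merged.empty:
        return False  # nothing to compare against yet
    mape = ((merged["y_pred"] - merged["y_actual"]).abs()
            / merged["y_actual"].abs().clip(lower=1e-9)).mean()
    return mape > threshold

# Hypothetical nightly job: both frames have columns ds (timestamp) and y.
forecast = pd.read_parquet("published_forecast.parquet")
actuals = pd.read_parquet("hourly_sales.parquet")
if should_retrain(forecast, actuals):
    print("Forecast drifted past 30% error -- kick off a retraining run tonight")
```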
Michael_Berk:
That's such good advice, actually. I just got one of those light bulb aha moments. Cool. Anyway, so let's say we're doing those 15 time series forecasts. We want an automated retrain based on a trigger. Let's say you have some sort of infrastructure in place already. What sorts of libraries would you look to use to have a distributed, automatically retrained time series model?
Ben_Wilson:
I mean, for simplicity at extreme scale, where I wouldn't have to be setting anything, I'd be looking at pmdarima, Prophet, and maybe that deep learning Prophet library that we were talking about last week. Anything where somebody is running optimization of hyperparameters for me, particularly with an automated system. Because if you lock in your hyperparameters, or your terms for any of these models, as your data changes, things are not going to go so well. You're going to start getting really bad results really quickly. And if you're not watching that, or you don't have conditional logic that's checking to say, hey, I just tested these things, how does it compare to the settings that were used prior to this, how bad is it becoming... yeah, things can go poorly. So definitely look at automated libraries, and then infrastructure for execution of distributed training. I mean, personal bias, of course: Apache Spark. It just makes it simple. But that isn't a knock on the other stuff that's out there, like Ray from Anyscale. If you want to do distributed Python training, it's a great library, it's a great infrastructure. Definitely take a look at that. If you're really up on base Kubernetes operations, and you wanna have AKS or EKS or whatever handle your Kubernetes deployment, you can set up pods to execute individual time series models. Yeah, there's a lot of options out there for handling orchestration. But you have to pick something. You can't just say, hey, we're gonna run this on a single VM in the cloud, or even worse, hey, I'm gonna run this on my laptop and it'll work, it'll keep running. Cause it won't.
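For the automated hyperparameter search Ben mentions, a quick pmdarima sketch might look like the following (the data source and the daily seasonality on hourly data are assumptions for illustration):

```python
import pandas as pd
import pmdarima as pm

# y: a single KPI series, e.g. hourly sales for one location (hypothetical data).
y = pd.read_parquet("kpi_hourly.parquet")["y"]

# auto_arima searches the (p, d, q) x (P, D, Q) terms for you instead of
# locking in hyperparameters by hand.
model = pm.auto_arima(
    y,
    seasonal=True,
    m=24,               # assumed daily seasonality on hourly data
    stepwise=True,      # faster heuristic search
    suppress_warnings=True,
    error_action="ignore",
)

# Forecast two weeks of hourly values along with prediction intervals.
preds, conf_int = model.predict(n_periods=24 * 14, return_conf_int=True)
```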
Michael_Berk:
Yeah, so with those 15 KPI forecasts, you might be able to get it to run on your laptop with no problem. But let's start talking about scaling it to 500,000 forecasts. What sorts of problems do you run into at that volume relative to the 15 forecasts?
Ben_Wilson:
I mean, the first problem with 15 forecasts on a laptop is that it's running on a laptop. Nothing that touches the eyes of another human at a company should ever run on your personal computer, ever, as a data scientist. There should always be CI/CD involved, unit testing, and orchestration management. That's all production ML stuff. And if you're running something on your own computer, talk to a software engineer at your company and get some ideas, or a DevOps person, and they can train you up. Or just read some books on how to do that stuff. It's not that complicated. But for scaling up, let's say we had something that was running on a VM in the cloud. How we would probably write that is a for loop. I would say: for the keys, the names of these data sets that we're trying to forecast, for each of those, fetch that data from wherever it's stored, it's in some database somewhere. And then we have some metadata that we're applying to define what that series is, where we're going to be writing our outputs to, where we want to log metrics and parameters to, and where we want to store the object of what we just trained, so that we can go back in time and look at it if we need to, whether for validation or just to see, hey, how good was this over time? When you're scaling from that level, from a for loop of 15 models to 500,000 models, you can't stick with a for loop. It will take a very long time to run. And before we were recording this today, we were talking about a test that I was doing for this package that we talked about earlier in the broadcast. It's called Diviner. And in order to do testing of this at extreme scale, I had to write a data generator last night. So I wrote two versions of it. One intentionally stupid, which was, if any of you listeners are familiar with Apache Spark, I basically wrote a for loop on the driver that created a synthetic data set out of NumPy, put it in a pandas data frame, and then once that was complete, created a Spark data frame from that and wrote it to Delta. And that's all driver bound. It's single threaded. That for loop is just taking forever to run. And actually I started that at 10:15 last night. It is right now 1:03 p.m., and it's 81% complete. Right after I kicked that off, I rewrote that implementation using pandas UDFs, which uses all the executors on a cluster, so I get full distributed access. I did a repartition of the data set from a primary grouping key to make sure that each partition would match with a core available on the cluster. And for the for loop, I did 10,000 synthetic series. For the pandas UDF, I did 5.5 million, and it finished in 17.3 minutes on the exact same hardware. So that's the difference between a for loop, like synchronous, blocking, single-threaded operations, and using 164 CPUs across a cluster. Things go fast like Sonic when you use all of that hardware and use an infrastructure framework and a platform like Apache Spark to distribute that.
Michael_Berk:
Yeah, that...
Ben_Wilson:
So that's what you have to start thinking about: how do I use all of these CPUs, and how do I make sure that my IO, my CPU, and my memory are all utilized in the most efficient way possible?
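A minimal sketch of the grouped pandas UDF pattern Ben describes, training one model per series key across the cluster's executors; the paths, column names, and Prophet settings are assumptions, and this is not the internals of Diviner:

```python
import pandas as pd
from prophet import Prophet
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.getOrCreate()

# df: Spark DataFrame with columns series_id, ds, y -- one row per series per hour.
df = spark.read.format("delta").load("/tmp/hotdog_sales")

schema = StructType([
    StructField("series_id", StringType()),
    StructField("ds", TimestampType()),
    StructField("yhat", DoubleType()),
])

def forecast_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Runs on an executor for one series: train one model, forecast one week."""
    model = Prophet(weekly_seasonality=True)
    model.fit(pdf[["ds", "y"]])
    future = model.make_future_dataframe(periods=24 * 7, freq="H")
    out = model.predict(future)[["ds", "yhat"]]
    out["series_id"] = pdf["series_id"].iloc[0]
    return out[["series_id", "ds", "yhat"]]

# Repartition on the grouping key (as Ben describes) before fanning the
# per-series training out across the executors.
forecasts = (
    df.repartition("series_id")
      .groupBy("series_id")
      .applyInPandas(forecast_group, schema=schema)
)
forecasts.write.format("delta").mode("overwrite").save("/tmp/hotdog_forecasts")
```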
Michael_Berk:
Yeah, it's the difference between 17 minutes and 17 hours slash probably longer.
Ben_Wilson:
It'll probably be a whole day.
Michael_Berk:
And that's 10,000 versus 5.5 million. I didn't hear that the first time. That's insane. One note on that as well: I've built some synthetic time series data generators, and if you're ever using pandas for those computations, you will be screwed. Sticking with NumPy, even if you don't go for a full Spark implementation, and with Numba, which compiles simple loops down to machine code, or at least some lower-level code, you can get a lot more performance...
Ben_Wilson:
Yeah, it's optimized.
Michael_Berk:
...and then parallelize it using something like Ray. But pandas will kill you with data generation. Another very useful tip.
Ben_Wilson:
Mm-hmm. Yeah, I mean, effectively a pandas data frame is a dictionary. So it's hash key entries of NumPy objects. A row of data is an array, or it's series-based, so a column of data is an array. But for array manipulations, if you can use the underlying C libraries, and that's what Numba does, it's all stupidly optimized to the particular hardware that it's on. Yeah, everything's faster. But when we're talking about extreme scale stuff, we're like, hey, we need to generate hour-level data over a 30-year period, and NumPy can generate that array in less than a second. That's completely trivial. But when you need to do that 5.5 million times, you're no longer CPU bound; that's not the issue. It's more about moving data from machine to machine, and when you leverage stuff like pandas UDFs in Spark, you're using PyArrow to do the serialization and deserialization across the JVM to Python and back again. And Arrow is an awesome piece of software that allows very efficient serialization of Python objects down to primitives, which are then byte-encoded and shipped over the wire. So yeah, it's really good, really fast.
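As a rough illustration of the NumPy/Numba point, a compiled generator for one synthetic hourly series might look like this (the trend, seasonality, and noise terms are made up for the sketch):

```python
import numpy as np
from numba import njit

@njit
def synthetic_hourly_series(n_hours: int, seed: int) -> np.ndarray:
    """Generate one synthetic hourly series: trend + daily seasonality + noise.
    Compiled to machine code by Numba, so generating millions of these
    stays cheap on a single core."""
    np.random.seed(seed)
    out = np.empty(n_hours)
    for t in range(n_hours):
        trend = 0.01 * t
        seasonality = 10.0 * np.sin(2.0 * np.pi * (t % 24) / 24.0)
        noise = np.random.normal(0.0, 2.0)
        out[t] = 100.0 + trend + seasonality + noise
    return out

# 30 years of hourly data for one series -- effectively instant once compiled.
series = synthetic_hourly_series(24 * 365 * 30, seed=42)
```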
Michael_Berk:
Yeah, so if the last five minutes sounded like French to you, that's okay. The main takeaway is: leverage a library that allows you to run parallelized code sort of automatically, so you don't have to write it all from scratch. If you are writing it from scratch, it can still work, but there are good solutions like Spark that will do it for you. So just do a little bit of research on parallelized libraries, because that can save you tons of time and tons of performance down the road. Is that the one-liner, Ben?
Ben_Wilson:
Yeah. And I've seen people who want to get a little bit too clever trying to eke out as much performance as possible. I've been guilty of it in the past, where I'm like, hey, I have parallelization here, I have multiple machines that can all communicate to a master node. Well, what if I try to get even more parallelization through the use of thread pools or process pools on executors? And you have to be very careful when you're going down that path. I always recommend starting on a single machine that just has a single core, or multiple cores, but a single CPU. And if you're gonna be writing multi-processing or multi-threading operations, run that at significant scale in that context, and make sure that it works the way that you want, that you're handling retries, that you're not getting into a race condition with process management, and that you're not starving a thread pool. Because that stuff is very, very challenging to troubleshoot when your kernel dies. You're like, well, Linux just crashed. Great. Where's the stack trace? Well, there is no stack trace, because the entire kernel is dead. Your logs are gone forever. So just get familiar with where the limits are going to be, instead of just copy-pasting stuff from blog posts that you might see about somebody implementing something that's meant for, like, a REST API service, where you're like, well, see, they're using multiprocessing pools and stuff. It's like, yeah, because that operation is expected to respond in milliseconds, whereas the ML task that you're doing is going to be taking 10 minutes to run. When you lock threads for that amount of time, weird things can happen on your operating system. So just be cognizant of that when you're trying to do that stuff with ML.
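If you do experiment with process pools for long-running training tasks, a cautious single-machine sketch along the lines Ben suggests could look like this; the train_one_series function is a hypothetical stand-in, and the worker count and deadline are arbitrary:

```python
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def train_one_series(series_id: str) -> dict:
    """Hypothetical stand-in for a single long-running training task:
    load the data for one series, fit a model, write artifacts."""
    return {"series_id": series_id, "status": "ok"}

def main() -> None:
    series_ids = [f"cart_{i}" for i in range(100)]
    # Keep the worker count explicit and bounded, and give the whole batch a
    # deadline so a hung fit fails loudly instead of silently starving the pool.
    results, failures = [], []
    with ProcessPoolExecutor(max_workers=max(1, (os.cpu_count() or 2) - 1)) as pool:
        futures = {pool.submit(train_one_series, sid): sid for sid in series_ids}
        for future in as_completed(futures, timeout=3600):
            sid = futures[future]
            try:
                results.append(future.result())
            except Exception:
                failures.append(sid)  # record and retry later rather than crashing
    print(f"trained {len(results)} series, {len(failures)} failed")

if __name__ == "__main__":  # required for process pools on spawn-based platforms
    main()
```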
Michael_Berk:
Cool. So let's talk about models. We have the classic do-nothing approach: create a case statement of, if it's this price, then we'll call it this price; if it's that price, we'll call it that price. That is maybe a good baseline, but it probably won't get you the performance and the accuracy that you need. So some common alternatives are using ARIMA-based models, so basically autoregressive models on univariate data with some tricks. There are also some deep learning methods. And then Prophet is a worthy mention as well. So how would you think about developing these types of models for something like 15 KPI forecasts versus 500,000 time series forecasts?
Ben_Wilson:
For 15, I would probably err on the side of accuracy. Because I've built enough of them in my time doing this profession, I'd probably start with statsmodels, because I'm a masochist. I'm just familiar with those libraries now after having used them so much. And I know what I need to do to decompose a trend, analyze residual values, and figure out where an autocorrelation and partial autocorrelation factor needs to be to inform where I would even begin to test. And I know the relationships for it. I can analyze the data set and understand what would be good combinations to test for the parameters submitted to some of those models. That being said, if you can manually tune something like Holt-Winters or SARIMAX or SARIMA, if you really understand how all those things work, and you understand your data set and the stability of variance over time within it, you're probably going to beat any other implementation that's out there. You're going to do better than Prophet does. You're going to do better than pmdarima. You're even going to do better than the most advanced LSTM implementation that's out there. It just takes time and a lot of effort, and those numbers need to change over time. And every time you do retraining, you should be testing out a different set of them. So I would probably wrap each of those implementations in something like Optuna, where that Bayesian optimization would go through and test combinations that I would want to validate as data is changing over time. And whichever one wins out, I would choose that to generate the forecast with each scheduled retraining. That doesn't work when you start scaling up. When you start talking about even 100 models, there's no way I'm manually tuning them. Sorry, it's just not tenable. Could one person maintain that? Sure. You're not going to do very much else at your job, though. Because every day you'd be coming in looking at how is this performing. Or even if you set up automation to do alerting, you're probably going to get, you know, five to 25 alerts a day on 100 models that are running every week. You're going to have to be spending time retraining and then pushing that result. Your code repository for that project would just look like, you know, five PRs a day of updating hyperparameters based on testing. And that's all you would be doing. So when we start going to that level... I mean, I'd probably even balk at 15. If it was, like, five, I'd be like, yeah, I can do that. Anything above five, I'm like, yeah, maybe I should automate this. Cause if it's good, people are gonna want more. So in that case, I'm looking at things that can automatically do optimization, even though I'm sacrificing accuracy. Because if we're doing, in this use case, 500,000 individual forecasts, the amount of impact to the business of high-variance, poor predictions on an individual item is much less than for those 15 that we're talking about. Cause those 15 would probably be stuff that the business is using to make decisions, like, hey, should we buy 20 more hot dog carts? Should we hire 50 more people? What are we doing with our business based on what the forecasts are showing? Whereas if I'm predicting mustard consumption for the next two weeks at this one hot dog cart outside of Grand Central Station, I'm not going to be too concerned with, hey, that's 5% less accurate than if I had manually tuned it, or 10% less accurate. It'll be okay.
It's just mustard. So that's where I'm thinking about using something like Prophet or pmdarima. I would probably test both of them and let them do what they do. Maybe I use both of them and average the results together, so that they can sort of balance one another out. But I would proceed down that path. And then, in the process of building all of these and evaluating everything, as I'm building that framework that runs all of this, I would be thinking about how I take that incoming data that's coming in every hour and compare it with whatever I'm producing, so that I can set alerts. And what sort of code do I have to write to alert me that there are particular individual series that are problematic in a large retraining run? Like maybe it's potato chips at, you know, hot dog cart number 732. Okay, it's predicting 100x what any other cart was showing in that run. Does that make sense? Give me the ability to flag that, put it into some sort of an audit log where I can then get the actuals of that historically, and then get the prediction of those problematic ones, in a visual way, show it on screen, and let me know, hey, maybe I should flag this. Or maybe I should have automation set that, hey, we have low confidence in this prediction, don't use it for decision making, use what you would normally do. Or some way of alerting people that, hey, this is probably bogus.
Michael_Berk:
Yeah, one simple way.
Ben_Wilson:
And automating that is going to be important.
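For the Optuna wrapping Ben described a bit earlier for the handful-of-KPIs case, a minimal sketch with a statsmodels Holt-Winters model might look like this (the search space, the holdout window, and the data source are assumptions):

```python
import numpy as np
import optuna
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# One KPI series, e.g. daily sales; hold out the last four weeks for validation.
y = pd.read_parquet("kpi_daily.parquet")["y"]
train, valid = y[:-28], y[-28:]

def objective(trial: optuna.Trial) -> float:
    # Search over Holt-Winters configurations instead of locking them in by hand.
    trend = trial.suggest_categorical("trend", ["add", "mul", None])
    seasonal = trial.suggest_categorical("seasonal", ["add", "mul", None])
    periods = trial.suggest_categorical("seasonal_periods", [7, 14])
    fitted = ExponentialSmoothing(
        train,
        trend=trend,
        seasonal=seasonal,
        seasonal_periods=periods if seasonal else None,
    ).fit()
    forecast = fitted.forecast(len(valid)).to_numpy()
    actual = valid.to_numpy()
    # Mean absolute percentage error on the holdout window.
    return float(np.mean(np.abs(forecast - actual) / np.abs(actual)))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```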
Michael_Berk:
Yeah, you can just, as Ben said, put rules or bounds around your forecast and say, well, we know that it can't be 2x the maximum historical number of mustard packets, so we can flag that and say, oh, use a default of, like, the past seven-day moving average, or whatever you're feeling. But yeah, creating bounds on that is super valuable. And then moving on to the 500,000 forecasts. So here, you mentioned with 15 you would maybe do some manual tuning, maybe use Optuna to optimize as well, but make sure that accuracy is super, super high. Can you translate high accuracy to 500,000 forecasts in the same way?
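A minimal sketch of the bound-and-fallback rule Michael describes here (the 2x multiplier, the seven-day fallback, and the series layout are assumptions):

```python
import numpy as np
import pandas as pd

def apply_guardrails(history: pd.Series, forecast: pd.Series,
                     max_multiplier: float = 2.0) -> pd.DataFrame:
    """Flag forecast values above max_multiplier times the historical max and
    fall back to a trailing seven-day average as a safe default."""
    ceiling = history.max() * max_multiplier
    fallback = history.tail(7 * 24).mean()       # last week of hourly actuals
    flagged = forecast > ceiling
    adjusted = forecast.where(~flagged, fallback)
    return pd.DataFrame({
        "forecast": forecast,
        "adjusted": adjusted,
        "low_confidence": flagged,               # surface this flag to consumers
    })

# Hypothetical hourly mustard-packet history and a forecast with one wild value.
history = pd.Series(np.random.poisson(40, 24 * 30).astype(float))
forecast = pd.Series([38.0, 41.0, 4_000_000.0, 39.0])
print(apply_guardrails(history, forecast))
```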
Ben_Wilson:
No way. I mean, if you had a problem that had almost no residual signal in it, and it was comprised almost exclusively of seasonality effects and a trend, then by fitting a Prophet or pmdarima or an ARIMA-based model manually with fixed parameters, you could get almost a perfect prediction. Totally plausible. But if what you're predicting is that predictable, you probably don't need to be doing forecasting. You probably already know the reasons why it's going up or down. It's like, oh, we made money because we opened the cart on this day, and we have fixed inventory that sells out every time we open the cart. Predicting something like that is stupid, because you already know that historically, every time we open, we sell all of our product out and then we close down for the day. It's only useful when you have, like, unbounded conditionals on the data that's actually being evaluated and a lot of latent factors that you're trying to ignore to get to the signal. So in the real world, you're never gonna get that level of accuracy in legitimate time series forecasting. You're going to have to sacrifice some gains in accuracy when doing extreme volume. It's just the nature of it. You know, if you put enough time and effort into it, maybe you could mitigate that with clever code and clever decision logic, and increase your code complexity to a point where it can handle all of those conditionals that you would have at that extreme level. But most people have better things to do with their time. So usually you just communicate to the business, like, hey, we might flag some things in predictions that might seem ridiculous. So please expect some things that don't make sense, but if you see this flag next to this, know that, hey, we're not certain about this. If you communicate that beforehand, it does a lot in buying you a bit of forgiveness from the people that are looking at the data. They're like, oh, they know this is garbage. Okay, I'll ignore this.
Michael_Berk:
Yeah, and another way to do that is you can use plus or minus on your estimate based on some confidence level. Usually 95%, which is like the A/B testing 0.05 alpha.
Ben_Wilson:
Mm-hmm.
Michael_Berk:
That's often way, way too conservative for a forecast and you get really wide prediction intervals. But maybe like the 50th percentile, like plus or minus 25%, that would be another way to display that information.
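In Prophet, for example, that interval width is a single parameter, so the tighter 50% band Michael mentions is just a different setting; a quick sketch on a made-up series:

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Tiny synthetic daily series just to make the example self-contained.
ds = pd.date_range("2022-01-01", periods=180, freq="D")
df = pd.DataFrame({"ds": ds, "y": 100 + np.random.normal(0, 5, len(ds))})

# Default interval_width is 0.80. 0.95 gives very wide bands on noisy data,
# while 0.50 communicates a tighter "we expect to land roughly here" range.
model = Prophet(interval_width=0.50)
model.fit(df)

future = model.make_future_dataframe(periods=14)
forecast = model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
print(forecast.tail())
```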
Ben_Wilson:
Yep, and the other aspect of that is: whoever's going to be consuming the results of something like this, where you know there's probably going to be some garbage in there, if you expose that data in raw tabular format or in a visualization, make sure that the consumers of it understand what everything is on that. And that doesn't mean sitting down with each individual user at their computer and boring them to death with explaining what confidence intervals are and what a probability score is. They don't care, or they might care, but they don't want to listen to you talk about that, generally. If they're just an operations person trying to order hot dogs, they don't care. But on the visualization, it's important to make that as clear as possible. There are legends you can add, color coding, and some very succinct text on there that explains: hey, in these areas that we're shading, we could fall anywhere in these bounds. Or, hey, there's an 85% chance or a 90% chance we're going to be somewhere in these bounds. And point to the actual predicted values that are coming out and explain what those are: this is the highest-probability value, but we're not trying to predict the future here, we're just giving a probability estimation. And if you can convey that to an end user, they might see it and be like, oh, okay, I'm in a range here and not just at this one value.
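A quick sketch of the shaded-band presentation Ben describes, using matplotlib; the forecast frame here is a stand-in with the same ds, yhat, yhat_lower, and yhat_upper columns a Prophet forecast would have:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for a Prophet-style forecast frame with uncertainty bounds.
ds = pd.date_range("2023-06-01", periods=28, freq="D")
yhat = 100 + np.linspace(0, 10, len(ds))
forecast = pd.DataFrame({
    "ds": ds,
    "yhat": yhat,
    "yhat_lower": yhat - 8,
    "yhat_upper": yhat + 8,
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(forecast["ds"], forecast["yhat"], label="Most likely value")
ax.fill_between(
    forecast["ds"], forecast["yhat_lower"], forecast["yhat_upper"],
    alpha=0.3, label="Expected range (50% interval)",
)
ax.set_title("Weekly ketchup demand: we expect to land somewhere in the shaded band")
ax.legend(loc="upper left")
plt.tight_layout()
plt.show()
```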
Michael_Berk:
Yeah, that's a really good point. Cool, so we're...
Ben_Wilson:
We just like to believe in magic, man, as humans. Everybody loves magic. And when you see...
Michael_Berk:
No, it's so, it's so true.
Ben_Wilson:
And there's nothing that pisses people off more than a magician that exposes that they're not actually doing magic. So make sure that people don't think of you as a magician.
Michael_Berk:
Yeah, honestly, that's really freaking good advice. It keeps re-hitting that light bulb moment for me, because there's been so many times when people are like, what the... you said it was gonna be this, how is it not this? And like, well, no, that's what the model said, not me. And there's probability within that estimate. It's really helpful to hedge your bets and sort of explain, without going too deep, how these processes operate and the underlying assumptions. Cool, so I'll quickly recap, and then Ben, if you have any additional tips and tricks, we can quickly get to them. Today we talked about working on large-scale distributed time series forecasts. So when you're forecasting 500,000 time series, or even, at a business, doing something like 15 KPI forecasts, it's not really tenable to do manual training and managing and monitoring and all those things. In this process, the first step is always understand the use case and understand how it will be consumed. The second step is, as you start going about building this, it's important to use a parallelized library on the cloud, not a laptop. We advocate Spark, but there are plenty of other methods for doing this. If you're going for that 15 KPI goal, something like statsmodels with Optuna and some manual oversight is a really good approach. But one caveat is this does require subject matter knowledge and pretty deep knowledge of statsmodels. So if you are just trying to slap something together and hope it works, it might not work, just a heads up, but Prophet is a really good default for business use cases. And then if we're moving into the 500,000 ingredient forecasts for our many hot dog stands, something like a parallelized Holt-Winters or ARIMA, something that does not need any manual intervention and leverages something like Optuna or Ray, that could be a good option for you. And then just a couple quick practical tips. The first is to hedge how you portray the forecast. It's not magic, there's uncertainty in the forecast, and confidence intervals are fundamentally a tough concept to understand if you don't speak the statistics language, so having empathy for that is important. And then finally, setting it up for consumption, using color coding and clear definitions, can save you a lot of Slack, as in Slack messages, and email questions. So really keeping the end consumer in mind is super important. Is there anything else, Ben, that you wanted to shout out?
Ben_Wilson:
The other thing that's burned me in the past, and it was something we discussed before we started recording, was black swan events. When you get an event that's going to affect your time series, you might not know what the heck is going on while it's happening. Later on, you know, in a post-hoc investigation of that event, when you define what actually went down that affected the data, you should be cleaning that data up for time series. You don't necessarily want to clean that up for supervised learning, because that is data from the past that's something to learn from, if there are other variables that you collected that can explain what was going on there. But for time series, there's no reference of that event unless you're using exogenous models, and then you would have to have some sort of estimate of that happening in the future, and that becomes ridiculous. We talked about that last episode. But when you're using these standard models, I always go through and clean that data up. And the analogy that we used is: hey, we wanted to celebrate some holiday, National Hot Dog Day in Central Park. We took three food trucks on National Hot Dog Day, and we gave out free hot dogs for four hours in the afternoon on Sunday. If we were doing ingredient estimations of those, it would look like, whoa, sales were really high there, because we basically cleared out the carts, because people were lined up to get free hot dogs. But those events should be removed from the ingredient forecasting, as well as from our sales or order-number forecasting, because those might be logged in the data as an order, but profit was zero because it was free. So it's important to go back and clean that out, or else the forecast is going to get pretty confused, because it's going to see this massive residual spike. Some libraries or algorithms can handle that just fine; other ones will start to get a little confused, particularly if it's a very recent event. It'll just increase variance and make stability go nuts. So always keep that in mind.
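A minimal sketch of that kind of cleanup, masking a known giveaway window out of the training data so it doesn't distort the forecast (the date, file path, and column names are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sales for one cart, with columns ds (timestamp) and y.
sales = pd.read_parquet("cart_sales.parquet")

# A known black swan / promo window identified after the fact, e.g. the
# free-hot-dog giveaway in the example (this date is made up).
promo_start = pd.Timestamp("2023-07-16 12:00")
promo_end = pd.Timestamp("2023-07-16 16:00")
mask = sales["ds"].between(promo_start, promo_end)

# Option 1: drop the affected rows before training.
cleaned = sales.loc[~mask].copy()

# Option 2: blank out the values instead; Prophet, for example, treats missing
# y values in the history as gaps rather than as signal to learn from.
sales.loc[mask, "y"] = np.nan
```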
Michael_Berk:
Yeah, these time series models are univariate, so we can't use exogenous variables, and it's important that the time series is representative of the trend. And when a black swan event occurs, it's by definition not representative, so it needs to be dealt with. Cool. Well, I guess that's it. It's
Ben_Wilson:
Yeah.
Michael_Berk:
been Michael Berk.
Ben_Wilson:
And Ben Wilson.
Michael_Berk:
and we'll see you guys next time.
Ben_Wilson:
Alright, take it easy everybody.