Michael_Berk:
Hello everyone, welcome back to another episode of Adventures in Machine Learning. It is Michael Berk, and I am joined by my co-host.
Ben_Wilson:
Ben Wilson.
Michael_Berk:
And today we are going to be having a panelist discussion on time series models. It should be a really fun, high-level discussion, and we'll also have some practical tips and tricks for how you can get the most out of time series forecasting. Diving a little bit deeper into what we're planning on covering, and who knows, it could change at any moment, but we're planning on covering some high-level methods for thinking about time series models. Then we're going to go into some specific models and talk about the models that we have used and seen success with. And then finally, we'll talk about how to prepare data for those models. Sound good, Ben?
Ben_Wilson:
Sounds great.
Michael_Berk:
Cool. So let's dive right in. One thing that I have learned the hard way throughout my experience working with time series models is that they're actually really simple, and a lot of them borrow techniques from linear regression. So just to provide a face to the name: I did my senior thesis on coral reef health, and I was looking to forecast how coral reefs in the Caribbean would be doing five years, ten years out. More specifically, I was looking at species in the Caribbean Sea and what their population rates would be five and ten years out. With that, I started off with autoregressive models, moved into seasonal ARIMA models, and also tried to get Prophet to work, and that had mixed results. But Ben, what is your experience in time series, and can you elaborate a bit on whether you've used any ARIMA models or anything of the like?
Ben_Wilson:
Yeah, my first exposure to time series modeling was somewhat of a comical story. Somebody who was high up in management, not naming names or places, but at a place I worked, was like, hey, we need to forecast our sales. We're currently doing it in finance and it's just not working the way that we want to. And I was like, well, what do you mean, not working the way you want to? And they're like, well, we have to adjust all of the spreadsheets every week. I was like, can you explain more? What do you mean, adjust the spreadsheets? And they showed me this financial spreadsheet in Microsoft Excel, and it's full of all of these equations. And most of them are just modifications to differencing terms: like, hey, what was it the previous quarter? I'm going to add this fudge factor in here based on actual sales. And they were constantly adjusting certain terms that were the baseline of these, and it was projecting out in a predictable way. And I kind of looked at it and I'm like, well, yeah, this doesn't seem right. It seems like you're just fudging the numbers in order to get these sort of forecasts, and there's a lot of human intervention here, constantly. They're like, yeah, we want it automated. I'm like, okay, I know nothing about this, but like I always do when I'm presented with a new problem that I have no understanding of, I went and did research, read about it, asked some people, hey, how do you do this at your company? How do you do forecasting? A lot of my buddies in industry that I talked to all came back with, I don't know, finance uses some Excel macros to do that stuff. Okay, somebody's got to have a better answer here. And then I talked to one person who worked at a very reputable, very revenue-heavy company that I had worked with before. He's like, oh, you want to look at this Python package, statsmodels. That's what we built for our finance team. It's got ARIMA, it's got SARIMA. You want to be careful about using the tunability of SARIMAX, though. I'm like, dude, you're speaking ancient Greek to me. I have no idea what you're talking about. He's like, oh, just read the Wikipedia page, read the statsmodels docs, you'll get it figured out. All right, okay. So I did that over a weekend. Monday morning rolls around, and I find out I have no idea what the heck I was reading, but it sounded cool. So I started playing around with some of our data sets for sales, ran them through the examples from within statsmodels, and found that everything was garbage. It was predicting things that went from positive infinity to negative infinity. Some of them, because I was just randomly throwing APIs in there, came out like, oh, it's this decay function to zero, and hopefully that's not real or I've got to update my resume. So I went back to the drawing board, read some more stuff, got a book on the topic and read it, and I was like, oh geez, now I think I get it. It's exactly as you said, Michael: it's just a linear equation. That forecast, that model that we call a time series model, it's just coefficients in a linear equation that explain the general shape of the data. And once I grokked that, I was like, oh, each one of these different models in these software packages is just a different way of figuring out what those equations need to be. And each one is designed for its own use case.
Or they built upon previous art and made it better and better. And when developers of those algorithms do that, it's very rare that you go from prior art, make improvements to it, and all of a sudden those improvements replace the prior art. There are always sacrifices that have to be made. It'll work really well for this one set of problems, but it's less generalizable than its predecessor, so it's really bad for these other sorts of time series. That's kind of what I found.
Michael_Berk:
That's a good point. Time series tends to be very sensitive to any type of shift in the data, and you can get predictions up to infinity. My favorite is the flat line. That's always a good one. And then you can also get into weird negative territory, depending upon the method. So yeah.
Ben_Wilson:
Ah, the flat line, that's funny.
Michael_Berk:
Oh, it's bad.
Ben_Wilson:
In case anybody's wondering what that is: your moving average term is basically one, and the autoregressive term that came out of the fit is zero. All of the terms are zero. So it didn't fit, it didn't learn anything, so it just says use prior, use prior, use prior, use prior. You get this flat line from where the series left off. It's a really bad model. But it's better than, hey, our sales next month are going to be higher than the GDP of Earth.
Michael_Berk:
Badass.
Ben_Wilson:
Which I've seen.
Michael_Berk:
Yeah, cool. So let's leverage a case study or a scenario and talk through it, and with that, hopefully we can explain some models, how they work, and maybe some of our experience and best practices. So let's say we are looking to forecast how many salespeople we need for next quarter. We are a software-as-a-service company and we are selling the best product on the market. And we need to know how many salespeople we need on the floor to get this product into the right hands. Last year we had 100 people and we have been doing really great so far, but we need to know what we're going to need to hire for this year. So Ben, how would you start?
Ben_Wilson:
It really depends on what our target is going to be. So when we approach a problem like this, you could do it from a best guess of, say, hey, let's just use the number of salespeople historically. That introduces a lot of bias, and it assumes that we're managing our headcount very efficiently. Not a good assumption to make, regardless of how great we think our hypothetical company is at having salespeople at the right targets. Conversely, you could look at it from the other side and say, how many deals are we closing for the product that we're offering? We can model out the number of deal closures over time and say, this is what our expected deal closure is going to be. Then we can figure out how well aligned the number of salespeople we had in previous years was to the deal closures we had at that time. And that'll work. In fact, I would definitely do that and start very simply. I wouldn't even necessarily use ARIMA at first. I might just do a moving average, a very simple window differencing function, and say, what's my trend? Is it positive? Is it flat? What is that rate of change? And then, if we needed it to get more and more accurate, introduce more and more complex models. But as a first pass, I'd want to understand that data and see, what are our residuals here? Is there a seasonality associated with this? And use some tools to figure that out. But with the raw data that we have at hand, I would probably sit down in a room with you and we'd say, man, what can we do with the data that we have? Let's think about this problem and say, what if we combine the terms of deal closures and number of people? And what if we also figure out some sort of metric that's associated with headcount availability? Can we extrapolate from the data we collect how many deals didn't go through because we took too long to respond, or because we didn't have enough time to dedicate to the sales cycle for a particular customer? If we can capture those, that can tell us, hey, we were understaffed at this time, or we were overstaffed at this time. And if we can mathematically factor that into the deal closure rate, then we get a metric that we can actually forecast. So yeah, the first step is thinking through all that stuff, then making sure that we're collecting that data, and then playing jazz with some stats packages and saying, let's try some stuff out. We don't have to get super accurate, we don't have to be super perfect from the get-go. We just need to understand, through an analysis of the data, what this data is actually all about. So decompose the series, which would take out the trend term; we would get the seasonality terms out of that over time, and also get the residuals. If we have those three elements, that's what makes up a seasonal time series. And then we can do some tests on that and say, hey, is this data stationary? And by stationary, we mean: can we remove the trend from the data? Is there a trend there such that, if we remove it, that y equals mx plus b term, the data all of a sudden becomes flat, and you just have these spikes or a sort of sine wave present in there? If we do that manipulation and we don't get that, the data is not stationary and there's something else going on. Then we've got a problem, and we have to go in and do some clever things with transformations of that data to enforce stationarity.
Michael_Berk:
Yeah.
Ben_Wilson:
Because if you fit that ARIMA on garbage data, data that's not stationary, that's when you get the crazy stuff like, hey, it's going to positive infinity. For our sales team next year, we need to hire all humans on planet Earth. Every single one of them needs to work for us in order to hit our targets. It sounds ridiculous because it is, but a model trained on improper data, or improperly cleaned data, will do crazy stuff like that.
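[Editor's note: a minimal sketch of the decomposition and stationarity check Ben describes, using Python's statsmodels. The synthetic series and the weekly period of 7 are illustrative assumptions, not from the episode.]

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

# Toy daily series: trend + weekly seasonality + residual noise.
rng = np.random.default_rng(42)
idx = pd.date_range("2020-01-01", periods=365, freq="D")
y = pd.Series(
    0.1 * np.arange(365)
    + 5 * np.sin(2 * np.pi * np.arange(365) / 7)
    + rng.normal(0, 1, 365),
    index=idx,
)

# Decompose into the three elements Ben mentions: trend, seasonality, residuals.
parts = seasonal_decompose(y, period=7)

# Augmented Dickey-Fuller test: a large p-value suggests non-stationarity.
_, p_value, *_ = adfuller(y)
print(f"ADF p-value, raw series: {p_value:.3f}")

# Differencing is the usual transformation used to enforce stationarity.
_, p_diff, *_ = adfuller(y.diff().dropna())
print(f"ADF p-value, differenced: {p_diff:.3f}")
```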
Michael_Berk:
So let me try to recap a bit and provide some colorful examples. It sounds like step number one, and this is honestly step number one in most things in the world, is to understand how the system works. You need to know how your target will be used, what your target actually represents, and how your target interacts with the rest of the components in the system. So if you're forecasting salespeople, for instance, you need to know: what are the bottom lines? How do we know how many salespeople we need? In our historical data, were they working above capacity or below capacity? All these things should be taken into account. You can skip that step, but then it's really, really hard to produce a useful model.
Ben_Wilson:
Mm-hmm.
Michael_Berk:
It could be accurate, but it might not be super useful. So from there, we have, let's say, a time series of five years. So the company's been around for five years. And we've... sort of grown pretty linearly or exponentially or whatever you want to call it over the past five years. There's some seasonal components, so there's sort of a sine wave, an up and down trend throughout this curve. And we're looking to then start building a model that will capture this structure and look forward to forecast how many salespeople we actually need. So Ben, you said that you would start with maybe a moving average or like an AR model. What are those?
Ben_Wilson:
That's basically just a lag function on a series. So in the world of BI or data engineering, when you do a window function, you can put in sort of a differencing term. You say, hey, I want to know what the data is on this row or this particular event, minus seven days ago, or minus yesterday, or minus three days ago, whatever that offset is. In an MA model, a moving average model, you're effectively setting that term. There's also the differencing term, which in Python's statsmodels goes by lowercase d for ARMA or ARIMA, and a lag-order term, lowercase p, which gives you some additional flavor on how far in the past the model should be looking. But when you're doing a moving average, you're just looking at what the rate of change is over time, and you can extrapolate that into the future with that very simple equation, that lag function. Say I want to know what the rate of change is of today's data versus seven days ago. I'm going to get an answer there, so I can add that term to my current value to project the next seven-day window. So, today's Friday for us, I don't know when this is going to be released, but let's say we run a hot dog stand and we're selling hot dogs. And we want to know, hey, how many hot dogs are we going to sell next Friday? A simple moving average approach would be: how many hot dogs did we sell last Friday, and at the end of today, how many did we sell? Let's say we sold 40 more than last week. That becomes our term: 40. So whatever today's sales were, when we predict next Friday's sales, it's going to be today's plus 40. And we walk through the data that way. On Saturday, we do last Saturday versus this Saturday, take the difference, add that to Saturday's number, and that gets us next Saturday. Those simple models, it's shocking how well they work for stable data where your only real influence is the trend over time. And there are lots of things that can be explained in that way.
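[Editor's note: a tiny pandas sketch of the naive lag-and-difference forecast Ben walks through, where next Friday is predicted as today plus the change since last Friday. The sales numbers are made up for illustration.]

```python
import pandas as pd

# Two weeks of daily hot dog sales, Monday through Sunday.
sales = pd.Series(
    [310, 295, 330, 360, 420, 510, 480,
     325, 300, 345, 370, 460, 540, 500],
    index=pd.date_range("2022-06-06", periods=14, freq="D"),
)

lag = 7                          # compare against the same weekday last week
weekly_change = sales.diff(lag)  # e.g. this Friday minus last Friday
# Naive forecast for the same weekday next week: today's value plus that change.
next_week = sales.iloc[-1] + weekly_change.iloc[-1]
print(weekly_change.tail(3))
print(f"forecast for next {sales.index[-1].day_name()}: {next_week:.0f}")
```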
Michael_Berk:
What are some stable data sets that you've seen?
Ben_Wilson:
Ooh, stuff that has enough interaction with a chaotic environment that you have enough data points that the natural chaos of the system doesn't influence anything. Say we were doing foot traffic through Grand Central Station over time, monthly foot traffic, or the number of people coming into LaGuardia Airport, provided that there are no black swan events, no COVID-19 and shutdown of air travel and stuff. In steady-state operations, there are so many latent factors that would influence a single person's decision to come to that airport on that day. That's impossible to model. You can't do it; there are too many variables. But when we're talking in aggregate about the number of people getting on airplanes, or going to meet people arriving on those flights, there are so many people, over such a wide collection of latent variables, that in aggregation it tends to pretty much work out. So simple models work in systems like that. Or, if there are any gardeners listening: if you have a garden that bees and butterflies like, imagine trying to predict how many bees are going to visit a single flower in your garden. Say you have a half-acre garden, like a rose garden, and you're looking at a single flower: oh, I'm going to try to predict how many bees are going to visit this one flower today. Good luck getting within an order of magnitude of accuracy on something like that. How do you know what's going to determine that? But if instead you change your perspective and say, I don't care how many come to that one flower, I want to know how many come into my garden every day, then with that number of events happening, the randomness associated with however bees decide which flowers to go to, at which time and on which day, all of that sort of neutralizes itself, and you now have this much more stable, less noisy trend.
Michael_Berk:
Yeah, that's a phenomenal analogy actually. The relevant piece here is sample size. If you think about a variance calculation: the larger the variance, the wider that confidence interval, that sort of bluish tint around your line, and the less certain you'll be in the precision of your estimate. But if you're measuring a phenomenon with a large sample size, it tends to be more closely centered around whatever the actual underlying trend is. So yeah, birds and bees, good call.
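[Editor's note: a quick numpy illustration of the aggregation point. The flower counts are simulated; the claim shown is just that the relative noise of a sum over many independent units is far smaller than that of any single unit.]

```python
import numpy as np

rng = np.random.default_rng(11)
# One year of daily bee visits to each of 1,000 flowers in the garden.
daily_visits = rng.poisson(lam=3, size=(365, 1000))

one_flower = daily_visits[:, 0]          # a single flower: very noisy
whole_garden = daily_visits.sum(axis=1)  # the whole garden: much smoother

def cv(x):
    """Coefficient of variation: standard deviation relative to the mean."""
    return x.std() / x.mean()

print(f"single flower CV: {cv(one_flower):.2f}")   # roughly 0.58
print(f"whole garden CV:  {cv(whole_garden):.3f}") # roughly 0.02
```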
Ben_Wilson:
and hot dogs.
Michael_Berk:
And hot dogs, can't forget about hot dogs. And then before we move on, I just want to dispel something that took me months, if not years, to learn. The moving average term is not a moving average in the everyday sense of the word. It's not a rolling window where you take the average; it's a regression on the past error terms. That was so freaking confusing to me when I was reading a textbook, and I was like, this is not a moving average, I see the formula right here, and that's not how you calculate a moving average. I'm not sure why they named it that. I guess it is kind of a, well...
Ben_Wilson:
Sort of.
Michael_Berk:
I don't even know that it's roughly a moving average, but it is not the typical moving average you think of. So for AR, ARMA, or even ARIMA models, the moving average component is not a moving average. If you take away one thing, please take that away.
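[Editor's note: for anyone who wants the formulas behind Michael's point, here is the standard notation. The AR term regresses on past values of the series; the MA term regresses on past forecast errors, not on a rolling mean.]

```latex
% AR(p): regress on past values of the series itself.
% MA(q): regress on past one-step forecast errors (not a rolling mean).
\begin{aligned}
\text{AR}(p):\quad & y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t \\
\text{MA}(q):\quad & y_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}
\end{aligned}
```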
Ben_Wilson:
Yeah, and that's why ARIMA has that term "integrated": the model works on the differenced series, and the differences get integrated back up to produce the forecast, coupling this series of regression equations together into a single equation that explains the data. And then you can go even further than that, into exponential smoothing. You can do first-order exponential smoothing, you can do second-order, and then you can do third-order, which has its own special name, Holt-Winters exponential smoothing, which is one of the most powerful of the classical time series models. It's one of the last ones that was developed, and it's kind of the pinnacle of classical modeling in terms of how to do that. Those make really good models. But there's a trade-off that we'll discuss a little bit later, maybe next time we do an episode on this topic: the trade-off between quality of model versus complexity of training. There is a trade-off.
Michael_Berk:
Yeah, that's the topic I'm excited about. Well, probably not today, we'll see. So, one way to think about this class of models is as a linear model where we incrementally add terms. The simplest is AR, which is autoregressive: we just lag prior values. And oftentimes series are univariate, meaning we only have one variable. The next step in complexity is adding a moving average term, which, as we said, looks at the past errors and puts coefficients on them. So AR is one component, and MA is another component. The third component is the I in ARIMA, which stands for how much you're differencing the series. An order of one means you take term one and term two and difference them once; an order of two means you difference the differenced series again; and so forth. With that all assembled, we have the classic ARIMA models. Those are super famous, and if you've worked in time series at all, you've probably seen them. Then, as Ben started hinting at, there are a bunch of other additions on top of that, and it's really important to note that we're still fitting linear relationships, but we're transforming the data so that the coefficients can actually fit seasonal up-and-down components. One that I have used with, let's say, mediocre success is SARIMA, or seasonal ARIMA. In statsmodels I think it's just a lowercase pdq, or maybe an uppercase, it just changes the case. Do you know?
Ben_Wilson:
It's both. There are two order tuples: open paren, lowercase p, comma d, comma q, close parentheses, comma, and then the next tuple is capital P, capital D, capital Q, and there's an s term in there for the seasonal period. They're complex to tune if you're doing it manually.
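[Editor's note: a minimal statsmodels SARIMAX sketch showing the two order tuples Ben spells out. The specific orders, the weekly seasonal period s=7, and the synthetic series are illustrative assumptions.]

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(1)
y = 100 + np.cumsum(rng.normal(0, 1, 365))  # stand-in daily series

model = SARIMAX(
    y,
    order=(1, 1, 1),              # lowercase (p, d, q): non-seasonal terms
    seasonal_order=(1, 0, 1, 7),  # uppercase (P, D, Q) plus seasonal period s
)
fit = model.fit(disp=False)
print(fit.summary())
forecast = fit.forecast(steps=28)  # four weeks ahead
```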
Michael_Berk:
Yeah. And then from there, you can add on external variables. Ben and I were actually chatting about this before we started recording: it's rare that that is super effective, because if you're using an external variable to forecast, well, how do you know what the external variable will be at the time you're forecasting? It's great when you're training. But take the classic example, weather. I've actually done this for commodity pricing, and I realized, oh, I don't know what the weather is a month from now, so how can I use the coefficient on weather to figure out what the price is? So Ben made a really insightful point, which was: if you have scheduled events, like a concert or this or that, then you can use exogenous variables in your time series forecast. But if not, exogenous variables aren't super helpful. For inference they're useful, but for forecasting, not so much.
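[Editor's note: a sketch of this exact pitfall with statsmodels SARIMAX. Fitting with an exogenous regressor is easy; forecasting then requires future values of that regressor, which works for a published schedule but not for weather. All data here is synthetic.]

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(7)
n = 200
game_day = rng.integers(0, 2, n)  # 0/1 event schedule, knowable in advance
y = 100 + np.cumsum(rng.normal(0, 1, n)) + 25 * game_day  # lift on game days

fit = SARIMAX(y, exog=game_day.reshape(-1, 1), order=(1, 1, 1)).fit(disp=False)

# Forecasting 14 steps ahead demands 14 *future* exog values. Easy for a
# team's home schedule; impossible to know exactly for next month's weather.
future_schedule = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1])
forecast = fit.forecast(steps=14, exog=future_schedule.reshape(-1, 1))
```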
Ben_Wilson:
Mm-hmm.
Michael_Berk:
And that's basically the extent of what I've used in practice. Ben has used more complex ARIMA-based structures.
Ben_Wilson:
I mean, I've definitely built my fair share of ARMAX and SARIMAX and ARIMAX models. Sorry for all the acronym soup, everybody; it's just how this stuff is named, and nobody reads out the entire acronym. But I have used exogenous variables, and just like you said, they're really important for inferencing. And what we mean by inferencing... let's go back to hot dogs. I don't know if everybody loves hot dogs, but let's just say everybody loves hot dogs.
Michael_Berk:
There's a few vegetarians out there, so probably not.
Ben_Wilson:
Oh, our stand has vegan hot dogs too.
Michael_Berk:
Excuse me. Sorry.
Ben_Wilson:
Let's see, we've got our hot dog stand in Central Park, and we're trying to forecast how many hot dogs and buns we're going to need over the next three months, because we have to place some big order: here's how many we're going to get shipped in every week from our supplier, and they need to know three months in advance. We don't want to throw these things away, we don't want to heavily discount them, we just want roughly what we need for each weekend, and that weekend rush is really what we care about. So, we could try weather. Good luck. You might be accurate for the next seven days, because we'd have a forecast probability of rain from a bunch of different weather services. We could put that in, so we'd have one model with that exogenous regressor of rain percentage per day on hot dog sales, and we'd have that for the last three years. Then for the next seven days we could say, hey, next weekend the forecast is a 22% chance of rain, and that'll adjust the forecast, if the model is tuned well, to say, maybe don't order so many hot dogs, there aren't going to be that many people walking up. But a better idea, which you alluded to: what if we pull in the Yankees' home schedule, and maybe the Mets' as well, who knows, if anybody likes the Mets? If we have the schedule of when the team is home and when there are going to be games, we have all that data going back 80 years or whatever, since the Yankees were founded, and we can get that schedule. A lot of those things don't change, so we know, for the entire next year, when each of those games is going down. If there's a correlation in the series with those events, that's something that isn't explained by seasonality. And newsflash: that's not seasonality. Most scheduled things that people do aren't repeatable in that way. It's not like, hey, the Yankees are home playing in their stadium every Wednesday or every other Wednesday. It's going to be different times, they're going to have road games, they're going to be back home, there's going to be an all-star game, there's all this stuff that happens. And there could be other events happening in our immediate vicinity, festivals going on in Central Park, where we might want to say, hey, let's capture when a music festival is going to be going on. If we have that schedule three months in advance and there's a correlation, that becomes useful. But for inference, you could use that for simulation. If we go back to the weather thing: what if we did a worst-case-scenario and a best-case-scenario forecast? That could give us error bounds on our own forecast, the one that doesn't have exogenous regressor terms. We could take estimates of what happens if every Saturday and Sunday for the next three months is rained out and there's a hurricane on one of the weekends, versus what happens if every weekend is sunny and 74 degrees for the next three months.
That could give us the best-case scenario and worst-case scenario with those terms and that historical data, so that when we run our main model, which could be something completely removed from ARIMAX or SARIMAX, we can look at the results coming out of it, and if any point in its forecast falls outside of those best-condition versus worst-condition bounds, we can flag it and say, maybe we need to look into this, or maybe our model's not that good, go back to the drawing board. So that's what I've used stuff like that for in the past: building a forecast that provides boundaries, guardrails, for the main production model. And that main production model could be a single model, or, in the case of things I've built in the past that were really important to get right, I've done something that's going to sound super hacky, but it's shocking how well it works: take five or six different models, all on the same data, that all solve in a different way, and do a weighted average of their actual forecasts. From ultra simple, something you could effectively do in SQL, all the way to the output of an LSTM or a PyTorch model, or the output of Prophet or pmdarima. I just average all of those together, and based on backtesting of all of those, I'm actually tuning what the weights should be, based on their performance in backtesting.
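[Editor's note: a hedged sketch of the weighted-average ensemble Ben describes. The three toy forecasters stand in for the real slots (statsmodels, Prophet, an LSTM, and so on), and the weights are placeholders that would come from backtesting.]

```python
import numpy as np

def naive_last(y, horizon):
    return np.repeat(y[-1], horizon)               # the "use prior" baseline

def moving_average(y, horizon, window=7):
    return np.repeat(y[-window:].mean(), horizon)  # smooth recent level

def drift(y, horizon):
    slope = (y[-1] - y[0]) / (len(y) - 1)          # overall linear trend
    return y[-1] + slope * np.arange(1, horizon + 1)

def ensemble(y, horizon, weights=(0.2, 0.3, 0.5)):
    """Weighted average of dissimilar forecasters; weights tuned by backtesting."""
    preds = np.vstack([
        naive_last(y, horizon),
        moving_average(y, horizon),
        drift(y, horizon),
    ])
    return np.average(preds, axis=0, weights=weights)

rng = np.random.default_rng(3)
history = 100 + np.cumsum(rng.normal(0.3, 2.0, 365))
print(ensemble(history, horizon=28))
```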
Michael_Berk:
That is hacky. I have also seen it perform really well. But, just thinking out loud: why? That's not a common practice in other machine learning applications.
Ben_Wilson:
No, you don't do that in other machine learning applications.
Michael_Berk:
It is done. Yeah.
Michael_Berk:
Why do you think that's the case with time series? My initial reaction would be that different models pick up different components of a time series, and univariate time series data is surprisingly complex.
Ben_Wilson:
Yes.
Michael_Berk:
So if you have a big squiggle, and then a bunch of little squiggles, and then some more squiggles, all in one time series, different models will pick up different squiggles better than others. Is that why it works so well: we're essentially doing signal processing, decomposing a super complex signal, and different models pick up parts of the signal better than others? Or do you have other insights?
Ben_Wilson:
That's always been my explanation. It could be completely wrong, but based on anecdotal practitioner evidence of trying it on lots of real-world data, I've seen it work out better than doing one mega-complex, state-of-the-art implementation. I've seen and helped people revisit these problems at companies that I've interacted with. They're like, well, we have this cool new LSTM structure, it's got an embedded graph on the LSTM, and we have a thousand different feature terms in it, and it helps explain all this stuff. Okay, well, how does it work on your backtesting, on cross-validation? And they're like, what do you mean? We trained on the data up until today. I'm like, all right, let's retrain. Let's zero all the weights out, retrain everything up to three months ago, and see how well it performs on the past three months of data. And they're like, oh, it's probably not going to do well. I'm like, just do it, show me the results. And it's usually not that great, just okay. And then I show them: hey, what if we used a couple of these auto-tuning frameworks? Let's use pmdarima, which is a wrapper around statsmodels SARIMAX and ARIMA and stuff like that. Let's use Prophet, let's do Holt-Winters, let's do a couple of these classics, some very simple models, and some of these automated ones, and just average the values together. And when you overlay all of them onto the same plot, you'll see exactly what you said, the squiggles. They start canceling each other out, and what you're starting to do is generalize better to the overall trend. And that's the end goal of a well-trained supervised learning model: generalization. That's what we want. We don't want a hundred percent accuracy; you don't want that overfit, because when the model sees something it hasn't seen before, it's going to go nuts and predict insanity. Averaging all of those out helps, because some of them are not going to respond well to recent changes in the data and will do crazy stuff. One might start kicking up exponentially towards positive infinity at some future date, or another might just flatline six weeks out, and you don't want that. But other ones won't do that, so the average of all of those generally does pretty well. An important thing to bring up with that, though: be careful how far out in the future you're trying to predict. Forecasting becomes less effective the further out you go. I always think of it like the inverse square law from physics, and I remember it by point-source radiation. You can stand 16 feet away from an operating nuclear reactor and you're going to have a bad day, maybe a bad couple of months. If you stand 16 inches away from that reactor, you're going to have a very short day, and a very short life after that point. The closer you are to something, the more intense and the more accurate it's going to be; the further away you get, the less of a signal you're going to have. So I think of that with time series forecasting: close to your most recent data, where the series is actually generated, the better the signal you're going to have of what's going on. The further away you get, the less aware you're going to be of what reality is.
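[Editor's note: a minimal rolling-origin backtest of the kind Ben asks teams to run: re-fit on data up to a cutoff, forecast the held-out window, and score it. The fold count, horizon, and naive baseline are illustrative.]

```python
import numpy as np

def backtest(y, fit_predict, horizon=30, n_folds=4):
    """fit_predict(train, horizon) -> forecast array of length `horizon`."""
    maes = []
    for fold in range(n_folds):
        cutoff = len(y) - (n_folds - fold) * horizon
        train, test = y[:cutoff], y[cutoff:cutoff + horizon]
        pred = fit_predict(train, horizon)
        maes.append(np.mean(np.abs(pred - test)))  # mean absolute error per fold
    return float(np.mean(maes))

rng = np.random.default_rng(5)
series = 200 + np.cumsum(rng.normal(0.1, 1.5, 500))

# Score the "use prior" baseline; swap in any real model's fit/predict here.
mae = backtest(series, lambda train, h: np.repeat(train[-1], h))
print(f"naive baseline MAE: {mae:.2f}")
```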
Michael_Berk:
Inverse square means it's a squared relationship between the distance from the reactor and how much radiation you get, so, how likely you are to die?
Ben_Wilson:
Yeah, for every unit of distance you go out, the radiation hitting your body falls off with the square of that distance. Sorry, nuclear engineering analogy.
Michael_Berk:
Cool. That's awesome. Interesting. Good to know. Yeah, that's awful; I'll steer clear of my neighborhood nuclear reactor.
Ben_Wilson:
As long as you're not in the reactor compartment, you're fine. But try to stay away from them when they're actually on.
Michael_Berk:
Noted. Yeah, I've heard they can be dangerous.
Ben_Wilson:
A bit.
Michael_Berk:
Cool, so that sort of wraps up the class of linear-based models. We've been hinting at Prophet and LSTMs. Ben, which would you like to tackle next?
Ben_Wilson:
I mean, Prophet's pretty cool, simply because... well, not simply because; there are a lot of reasons Prophet's cool. It has an auto-solver associated with it that is pretty sophisticated and leverages some very fast solvers from other open source packages. I think there are two different backends now as of the 1.1 release. I can't remember, being a little out of touch after being on paternity leave, but I think they did a release while I was out, and there's a new backend you can select. That solver automates all of the tuning that you would normally have to do with ARIMA-based models. You don't have to define your p, d, and q, or your s and all those capital P, D, and Q terms; you don't have to think about any of that stuff. And if you've ever tried to hand-tune a SARIMA model without using an optimization framework, you'll know what I'm talking about, how painful that is. You're like, okay, I have these terms that seem independent of one another, but they're actually not. I can't just set my (p, d, q) to (1, 0, 2); if I go to (1, 1, 2), taking the differencing term up one, the other terms don't have the same effect anymore. Going from (1, 0, 2) to (2, 0, 4) is not the same kind of change as going from (1, 1, 1) to (2, 2, 2). Relationships exist among those terms in how they affect what the forecast, what the model, is going to be. Trying to wrap your head around that, tune it, and test out all of the different permutations for a particular series is really complicated. So Prophet does that for you. It figures out what those effective terms need to be in order to get the actual coefficients for the forecasting equations.
Michael_Berk:
Can you elaborate a bit on how it works under the hood? Because with the linear models, we talked about how there's an AR term, an MA term, an integration term, and you can keep going. But what are the components of Prophet?
Ben_Wilson:
for its arguments?
Michael_Berk:
No, for like the... so, back to the analogy of having many squiggles: what are the different ways that it captures squiggles?
Ben_Wilson:
So, under the hood, if I remember correctly, and I might not be right here, it's doing a Fourier transformation in order to effectively smooth out that term in the series. And it's much easier with transformed data to determine what those terms are going to be, the things ARIMA is actually using for figuring out what a lag term is and what the autoregressive terms would be. So it's doing some clever things there, and in its process of acting on that fast Fourier transformation, it's got some clever tricks up its sleeve for solving that in the most generalizable manner. It doesn't always work, but for a lot of the real-world use cases I've played around with, it works pretty darn well.
Michael_Berk:
Yeah, one thing that I have seen drive success with Prophet is the fact that it breaks down seasonality with a Fourier decomposition. Essentially, it fits an additive series of sine and cosine terms of increasing frequency, at least that's my understanding of Fourier series, I'm not super well-versed. You keep adding terms in an additive manner, at different levels of complexity, and after a while you can approximate pretty much any curve. And because seasonality repeats, that's a really effective way to fit it. I thought that was a very smart approach, and I remember actually listening to, what was his name, Sean something, Sean Taylor? The original creator of Prophet, it was made at Facebook,
Ben_Wilson:
Yeah.
Michael_Berk:
yeah, Facebook Research. It was interesting to hear him talk through that process: he thought this was a good idea, but the project was absolutely failing. Then he brought in someone with a lot more stats expertise, and they were able to make the components work together. So that Fourier seasonality, along with the other terms, like the trend term, really made the model effective. But one thing that I think is pretty overhyped, and I'd be curious for your opinion, Ben, is the ability to manually add changepoints. Changepoints are essentially locations in the time series where we expect the coefficients of our model to change, because of a black swan event or just some systematic change in the structure. Theoretically, subject matter experts can go in and say, hey, we expect it to change tomorrow or the next day. I've never actually seen that benefit models in practice. Have you, Ben?
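[Editor's note: a hedged sketch of the two Prophet features just discussed, Fourier-based seasonality and manually supplied changepoints. The synthetic data, the changepoint date, and the parameter values are illustrative assumptions; ds/y are Prophet's required column names.]

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Two years of synthetic daily sales: trend, a weekend bump, and noise.
rng = np.random.default_rng(0)
ds = pd.date_range("2019-01-01", periods=730, freq="D")
y = 100 + 0.05 * np.arange(730) + 10 * (ds.dayofweek >= 5) + rng.normal(0, 2, 730)
df = pd.DataFrame({"ds": ds, "y": y})

m = Prophet(
    changepoints=["2019-09-01"],   # a hypothetical SME-supplied structural break
    changepoint_prior_scale=0.05,  # how flexible the trend is around breaks
)
# Extra seasonality is fit as a Fourier series of the given order.
m.add_seasonality(name="monthly", period=30.5, fourier_order=5)
m.fit(df)

forecast = m.predict(m.make_future_dataframe(periods=90))  # one quarter ahead
```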
Ben_Wilson:
So one thing that's very important for practitioners of these open source packages to remember is: where did it come from? Who built it? I promise you that Facebook, now Meta, was interested in forecasting potential ad revenue: how many people do we think are going to be logging into the site and viewing these pages? Their solution was built to solve a problem at that company. So you're not really going to find a better implementation for that kind of network-activity data than Prophet. If you're talking about visitors to your site, clicks on things in emails, the actions of humans on something that's really accessible to them, that framework is going to outperform pretty much anything else out there, at least in my experience; it's awesome at that. Now, if you're trying to use a tool like Prophet to forecast weather, or the stock market, or inventory for products that you don't sell frequently... If you're like, hey, how many sales are we going to have for this tool my company makes that costs $850,000? You're probably not selling 10 of those a day, or 100, or 1,000. You're probably selling one every two weeks. So if you're doing daily modeling of that, you have a bunch of zeros in the data. Prophet is not designed to handle that, because when you do the Fourier transformation on that term, you don't get any information out of it. The slope is zero, or the slope is extremely negative or extremely positive, and the model will attempt to fit a spline between those points, but it'll be so noisy and non-useful that when you do the forecast, the data looks like a crazy person built it. Because it's not designed for that; it's designed for relatively stable, heavily seasonal data. And social media platforms are very seasonal, however you cut it, even by locality, like a data center, during periods of the day. What are your peaks? Early in the morning, in a city with good public transportation, everybody that's commuting is on your platform. Around lunchtime, peak. Right after work, peak. Maybe in the last hour of the workday while people are trying to look busy at their desks, peak. Right after dinner, peak. Right before bed, peak. And then you have this huge trough while people are sleeping. So there's a seasonality component by hour of day, and there's also a seasonality component by day of week. You always have a baseline, but you'll have peaks of traffic, like people communicating with one another a lot on Thursday evening, planning to do stuff together on the weekend, or doing a lot of communication over the weekend. So you have usage patterns that are very predictable. But if you're trying to apply that tool to something that doesn't share that underlying nature of the data, it's going to be garbage. Don't blame the tool; blame your data. Try something else out. That's my advice on that.
Michael_Berk:
Yeah, and recently, as in the past couple of years, a deep learning extension of Prophet-style generalized additive models came out: NeuralProphet. It essentially adds a couple of deep learning terms. I actually don't think we should get too deep into it, but if you are interested in doing some fancy deep learning time series while keeping some interpretable components, the NeuralProphet library has some cool tools. I've used it a couple of times and it's kind of fun.
Ben_Wilson:
And don't forget about our friend pmdarima. It's like AutoML for time series: it auto-tunes the statsmodels models for you. It's pretty awesome; the maintainers built something really cool.
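[Editor's note: a short sketch of the auto-tuning Ben mentions, using pmdarima's auto_arima. The toy series, the weekly m=7, and the stepwise setting are illustrative; swap in your own data.]

```python
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(0)
y = 50 + np.cumsum(rng.normal(0.2, 1.0, 300))  # toy trending daily series

model = pm.auto_arima(
    y,
    seasonal=True, m=7,     # weekly seasonal period, assuming daily data
    stepwise=True,          # stepwise search keeps the tuning cheap
    suppress_warnings=True,
)
print(model.summary())      # shows the (p, d, q) and seasonal orders it selected
forecast = model.predict(n_periods=14)
```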
Michael_Berk:
Yeah, 100%. And then let's move on to the final class of time series models, at least the last one I'm familiar with, which is deep learning. Ben, where do you see deep learning used in time series?
Ben_Wilson:
To be frank and honest with you, out of the many, many dozens of companies that I've interacted with on this topic over the last four years, I've seen exactly one project in pseudo-production that uses a deep learning model for forecasting. Just one, out of the hundreds of projects I've seen.
Michael_Berk:
Time series only, right? Yeah.
Ben_Wilson:
Yes, only one that uses an LSTM. And it made sense that they were using it, because of the complexity of what they were trying to model. They could explain a great deal of the latent factors that actually influenced that time series trend, and it wasn't a problem that was well conditioned for traditional time series modeling. So it made perfect sense. I was like, yeah, this is what LSTMs are built for. This is awesome, great work, I'll help you get it into production, write some unit tests for you. Almost all the other ones I've seen have been demos or research-led efforts, people who wanted to use TensorFlow and PyTorch to do this. This isn't to say it's not a good idea. There just wasn't enough of an accuracy gain versus something much simpler. Once they tested something simpler, classical time series modeling using statsmodels or Prophet or something, the LSTM implementation might have been 5% more accurate, but it cost 5,000 times as much money to train and run every day than the classical modeling does. So when budget comes into play, plus code complexity, retraining complexity, maintenance, and the fact that you've got to keep this thing running in production, the cost-benefit analysis just didn't support deep learning.
Michael_Berk:
Yeah, that's super interesting. And just so we have sort of a one-liner on how deep learning usually approaches time series: you take a recurrent neural network structure, where you essentially stack neural nets horizontally across time steps. The most effective and modern version of this, at least one of the effective ones, is the LSTM, which stands for long short-term memory. It's basically a recurrent neural network with gates, and those gates determine how long the network should maintain information about prior observations. So that's the one-liner, but it's super interesting that you've mostly not seen it work in practice. Do you have any ideas on why that's the case, other than it's dumb expensive?
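[Editor's note: a minimal PyTorch sketch of the LSTM-for-forecasting idea Michael describes: sliding windows of a univariate series in, next value out. The window size, layer width, training loop, and synthetic data are all illustrative assumptions.]

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        # Univariate series -> one input feature per time step.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # map last hidden state to next value

    def forward(self, x):  # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])

# Toy sine-plus-noise series, cut into (window -> next value) training pairs.
series = torch.sin(torch.linspace(0, 20, 500)) + 0.1 * torch.randn(500)
window = 24
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

model = LSTMForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for epoch in range(50):  # short demo training loop
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
print(f"final training MSE: {loss.item():.4f}")
```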
Ben_Wilson:
It's just more complicated than the classical approaches. And keep in mind, this is a biased sample: I haven't talked to every company on the planet that does forecasting, not by any stretch of the imagination. It could also just be that the companies struggling with this were struggling because it's complicated, and I wasn't talking to the people who figured it out and were running it in production. So there's a lot of bias in that statement. But the team I worked with that actually was doing it, they had tried classical modeling first. They went through it the right way, in my opinion, which is: let's try the simplest stuff, does it work? They showed me all sorts of results and reports. Yeah, we tried this, here's the result. Okay, that's garbage. How about this one? Yeah, we tried that, here's the result. Okay, that's also garbage. And they're like, then we came to RNNs, and we were trying to determine what the trend was going to be and apply that to the RNN, and it didn't really work that well. So then we tried this LSTM structure, modified it, and created additional memory within that architecture, so we built our own custom model here. I looked at the code: that's pretty clever, let's see how it works. We did the tests, and yeah, that was the best solution. And that worked because what they were predicting was so important to get pretty much dead on that it didn't matter how many people they had to throw at it to implement it, maintain the code, and keep it in production. It was worth it.
Michael_Berk:
Got it. Yeah. So LSTMs, and deep learning in general, tend to have a lot more fitting power, because they're super complex; they're neural networks with tons of weights. So it would make sense that they can fit a lot of the complexity of a time series. But one potential explanation is that time series are often univariate, so it's just one signal, essentially, and it's a pretty strong assumption that the signal has enough information in it to forecast really accurately if you fit the crap out of it. If you had exogenous factors or external variables, maybe fitting that intensely would be beneficial. But there's so much noise in a time series that
Ben_Wilson:
Mm-hmm.
Michael_Berk:
deep learning, I think, tends to overfit, and we want general trends instead of the minute changes. That could definitely be an explanation for why deep learning, unless you have a really structured signal, doesn't perform as well as some of the simpler, smoother models.
Ben_Wilson:
Yeah, and in the implementation they were dealing with, they had control over the environment. They weren't modeling human activity with their business; they were monitoring equipment.
Michael_Berk:
Oh, well there you go, yeah.
Ben_Wilson:
They knew where all these signals were coming from. They had all of the temperature data, all of the vibration data, all of this data coming in. And because there's so much autocorrelation occurring among these different factors, these exogenous regressor terms, it made sense to incorporate all of that into a very complex deep learning model.
Michael_Berk:
That's super cool. Cool, well, we're just about at time. I'll do a quick recap and then we will let you continue your amazing lives. First, starting from the beginning: it's really important to pick a good target. You have to know how the system works and how the target that you're looking to forecast interacts with the components in that system. Second, after you have a good understanding of the problem, it's a good idea to start simple. That's always best practice unless you really, really know your stuff and know that simple will not work. I tend to start with an AR model, just an autoregressive model, but you can build out an MA or ARMA model, go into ARIMA, SARIMA, all sorts of crazy things. Exogenous variables tend to be good for simulations with those linear-based models, but you don't know what an exogenous variable will look like in the future, so it's really hard to use them for forecasting. For inference and explaining why things happened, though, super good. Next, we talked a little bit about Prophet. It's a bit more complex: it moves away from a purely linear relationship between prior terms, it fits a Fourier series to capture seasonality, and it has a nonlinear growth term to handle trend. It tends to do really well on business data, because that's what business data is structured like: usually a growth term with some sort of complex seasonality. And then finally, deep learning models: they're often recurrent neural network based, the more advanced version being the LSTM, and on top of that there are tons of more advanced time series deep learning models we didn't really get into. But often, because time series are univariate and there's not a lot of signal in a given curve, the simpler models tend to generalize to new data a lot better. It does depend on your use case, though. And yeah, did I miss anything?
Ben_Wilson:
No, you got it all.
Michael_Berk:
Beautiful.
Ben_Wilson:
And stick around for version two of this talk, where we cover the other half of the list we compiled before recording. We'll get into some really fun stuff, like how do you scale it if you've got to sell hot dogs in every park in America?
Michael_Berk:
Yeah, so cool. Well, this has been fun. Until next time, it's been Michael Berk.
Ben_Wilson:
and Ben Wilson.
Michael_Berk:
Have a good day everyone.
Ben_Wilson:
Take it easy.