Dave_Kimura:
Hey everyone, welcome to another episode of Ruby Rogues. I'm Dave Kimura, and today we are doing a very special episode: a crossover with Adventures in Machine Learning. From their panel we have Michael Berk.
Michael_Berk:
Hey everyone, how's it going? I'm a resident solutions architect at Databricks. So that means I do a bunch of random crap, whether it be data engineering, machine learning, platform advisory stuff all over the place. Excited to be on.
Dave_Kimura:
And we also have Ben Wilson.
Ben Wilson:
Hey everybody, I'm Ben. I work on the machine learning open source software team at Databricks, and I'm one of the co-maintainers of MLflow and a number of other open source projects.
Dave_Kimura:
Alright, so I guess y'all are both pretty heavy in Python? Or what is the main language that you use for machine learning?
Ben Wilson:
I'll take this one first. The company that we both work for is Databricks, the creators of Apache Spark, Delta, MLflow, and a bunch of other tools. Back when I was hired at Databricks, which was just over five years ago, we had a pretty good split among our early customers between predominantly two different languages. So, Python: PySpark for ML use cases. We also had users using R. That's not quite my cup of tea, but I've had to maintain some of that code in the past. And then we also had Scala, which is the core language of Spark. So we all had to know two of those three very well, and one of them extremely well. But nowadays,
Dave_Kimura:
Hmph.
Ben Wilson:
it's split between internal infrastructure code that I sometimes contribute to, which is mostly in Scala, and everything I work on in open source, which is Python.
Dave_Kimura:
It's pretty cool. So when we talk about machine learning, there are really two different aspects of it. You have the AI portion, where it can generate stuff, and then you also have deep learning. My exposure to AI and machine learning as a whole is very limited. It started just a few months ago, when I got an NVIDIA GPU and was able to actually start doing some of this stuff, and it's really intriguing. There is a lot of new technology being created around AI, and I just want to dive into some of it, because it's really interesting and it's going to change our future in one way or another. For better or for worse, we are going to see changes, whether that is improving medical diagnosis or having deepfakes interfere with election processes. So I think our trajectory is definitely going in one or the other direction, or maybe a bit of both. But if we start with the more consumer side of using machine learning: at a high level, when would you want to use machine learning versus just a straight algorithm to calculate something? And then the other side of that is, how do you actually get into the deep learning aspect of training models, creating models, and that kind of stuff? Those are the ideas that I want to touch on in this episode, if y'all wouldn't mind educating me a bit.
Michael_Berk:
Yeah, I'll take the first part, or at least a portion of this question, because Ben and I have been doing a few episodes on naming conventions in the industry. Our perspective is that artificial intelligence is a blanket term that sort of means nothing. It's for marketing purposes, and it's really cool. I think it originated from artificial general intelligence, so that's AI that can actually do things like a human: talk, think, theoretically have emotions, be creative. And we're pretty far away from that at the moment. What we do have are pretty smart language models, and that's probably the reason for a lot of the recent hype. ChatGPT from OpenAI has been released, the public version is making strides in the industry, and people are really excited. Essentially, what it does is distill information into a text format, and GPT-4 actually supports audio and a few other things. But to your question about using machine learning versus deep learning versus an algorithm: all of them are interrelated. Algorithms are the underlying backbone of everything, whether it be machine learning, statistics-based causal models like linear regression, or any other predictive method. So it's important that we define what we're working with here, but I'll hand it over to Ben to continue the answer.
Ben Wilson:
Yeah, it's a fun topic. Particularly it reminds me of a question that somebody asked me a couple of days ago, just randomly. They're like, hey, Ben, what do you think is the most important software algorithm that's been invented in the last 20 years? And would either of you two like to guess what my response was? Just for fun. The most important algorithm. It's not even
Dave_Kimura:
Um.
Ben Wilson:
that complex.
Michael_Berk:
bubble
Ben Wilson:
It's
Michael_Berk:
sort.
Ben Wilson:
not even that big. Huh?
Michael_Berk:
Bubble sort.
Ben Wilson:
No, I think that goes back over 100 years. The concept of that, that's like Ada Lovelace stuff.
Dave_Kimura:
I was going to say something to do with like mp3s and compression.
Ben Wilson:
Interesting. Do you want to take a serious guess, Michael? It has to do with ML.
Michael_Berk:
Transformers.
Ben Wilson:
Close, but even more general. My answer to this person, almost instantaneously, was: oh, backpropagation. That's the most important algorithm that's been invented. And they're like, wait a minute, isn't that what powers deep learning? I was like, yeah. So neural nets were a thing. They've been a thing for a really long time. Like, hey, I have a bunch of inputs, and I connect them to a bunch of hidden layers. Every one of those connections means I have some weight in an equation. Think of it as a linear regression equation: each of my input nodes is a different value that I'm passing in, and it's going to get multiplied by this weight. The output of that then goes into another layer of nodes that are all connected to one another, and it continues. You can still use this data structure, this algorithm, in a lot of the open source ML libraries that are out there. I think in Spark we call it the multilayer perceptron classifier. All it is is a neural network that just passes data through. It updates the weights that each cell, connected to every other cell, multiplies against the incoming data. You can have pooling layers in there. You can say, hey, I want a softmax layer that's going to give me basically a logistic output, a zero or a one, just binary, based on what the input is and some threshold. But the key research that came out was somebody saying: what if we take this data structure and, as it sees data and gets to a certain point, it propagates back through all of those layers and adjusts according to whether it got something right or not? So that's how we train these deep learning models. And backpropagation is what makes it super efficient for the network to adapt without us, as humans and programmers, having to go in there and manually, sort of brute force, adjust and train something like this.
You can still do that with the old-school neural networks. You can run cross-validation. You can send it a bunch of training data and a bunch of test data and say, hey, use stochastic gradient descent to update your weights. It takes forever to train these things the old-school way, though. Backpropagation means it's kind of autonomous: it does it in epochs, it's got batch processing in it, and it can do very efficient weight updates of itself as it's going through training iterations. So that's probably the thing that set it apart. When people say, oh, deep learning and AI, they're actually referring to this algorithm. It's prior art that has this new clever thing. I can't even remember who it was that came up with the algorithm, but they're a freaking genius for coming up with something so simple that's so impactful to an entire industry and to the world in general. Without that algorithm, none of what you said earlier, Dave, would even be remotely possible. Before we started recording, you showed us your website where you're doing image generation based on text input. That generative model is not possible without backpropagation.
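A minimal sketch of the loop Ben describes, assuming a tiny network with one hidden layer, sigmoid activations, and the XOR toy problem (none of which come from the conversation). The forward pass multiplies inputs by weights layer by layer; the backward pass sends the error back through the same layers to work out how much each weight contributed; a gradient descent step then applies the update. The function name `train_xor` and all hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_xor(epochs=2000, lr=0.5, hidden=3, seed=0):
    """Train a 2-input, one-hidden-layer network on XOR.

    Returns the mean squared error recorded for each epoch.
    """
    rng = random.Random(seed)
    data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
    # w1[j][i] is the weight from input i to hidden node j.
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            # Forward pass: weighted sums pushed through each layer.
            h = [sigmoid(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
                 for j in range(hidden)]
            out = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
            total += (out - y) ** 2
            # Backward pass: chain rule from the output error back to
            # every weight (computed before any weight is changed).
            d_out = 2 * (out - y) * out * (1 - out)
            d_h = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            # Gradient descent step using the gradients found above.
            for j in range(hidden):
                w2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_h[j]
                for i in range(2):
                    w1[j][i] -= lr * d_h[j] * x[i]
            b2 -= lr * d_out
        losses.append(total / len(data))
    return losses
```

Calling `train_xor()` returns one loss value per epoch, so plotting or printing the first and last entries shows whether the weight updates are reducing the error. Real frameworks such as PyTorch compute the backward pass automatically; this is only meant to make the mechanics visible.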
Dave_Kimura:
That's really cool. One thing that I want to dive into is OpenAI and a lot of the uses out there. I hate it, so I'll just start with that, because we are basically vendor-locking ourselves into this third-party company that, either through government regulation or some other situation, could go away at any given point in time, or become unavailable, or have speed or uptime issues. ChatGPT has had some pretty horrible uptime just based on its mass usage, and that's understandable. So I don't really want to get into consuming the APIs of a company that is hosting the model. I'm more interested in actually creating and using my own models, or models available from somewhere like Hugging Face.
Ben Wilson:
Yes.
Dave_Kimura:
So if someone is going to get started, where is the best place to get started? That's kind of a two-part question: one, using existing models, and two, pre-training or training your own models from scratch.
Michael_Berk:
Ben, I think it would be relevant for you to explain the work that you've been doing at Databricks recently and where your perspective might come from.
Ben Wilson:
Yeah, depending on when this episode airs, this is going to be cool for me to talk about, or somebody's going to talk to me about why I talked about it, but I'm going to talk about it anyway. We are working on an integration between MLflow and Hugging Face's Transformers library. It's a fantastic product, and the vision of the founders of that company is amazing. It democratized sort of what you were just talking about. You can have a research organization at a massive company that has ludicrous amounts of resources, both extremely intelligent and extremely creative people, and use cases that can benefit from the vast amount of data that these companies have. If you look at any of the major model types on Hugging Face, and you go through the structures for the large language models, some of those are kind of old. They're all based on the Transformer architecture, which means they're much smaller than models used to be before Transformers, and easier to train. That's because of a revolution that a group had in coming up with that architecture. But think about what it takes to train one of those core models. You look at BERT, or you look at BART; BART is a derivative of BERT. You look at DistilBERT; DistilBERT is a derivative of BERT itself. These foundational large language models are all made by massive companies that have huge compute resources. So Microsoft, Google, Facebook (now Meta), Twitter, LinkedIn: these organizations have data science research groups with people who are not just, air quote, titled data scientists. They are PhD scientists in neuroscience or physics or computer science, and they're working on pure research. They come up with a new approach or a new algorithm, work out how to code that up in PyTorch or TensorFlow, and vet it against a massive training set.
When you create something like that from scratch, and you're writing the actual structure of that algorithm in base TensorFlow or base PyTorch code, it's basically the dumbest it's ever going to be at that point. The weights are zeroed out. There's no data there. It hasn't learned anything. When you take that architecture from zero to being somewhat useful, you're talking about training cycles that operate on terabytes of data, sometimes petabytes or even exabytes. The data set that was used to train the base language models for stuff like GPT-3.5 is a lot of data: all of Wikipedia, a lot of public forums, pretty much all of open source GitHub. We're talking about trillions of lines of text that this thing is trained on.
Michael_Berk:
Dave, can you guess, just out of curiosity, how many unique tokens GPT-3 was trained on? A token is like a word, for instance.
Dave_Kimura:
I would say well over 13 billion
Michael_Berk:
That is correct. 499 billion.
Dave_Kimura:
Wow. That's a-
Ben Wilson:
And GPT-4 is even bigger. And the next generation of stuff they're working on is an order of magnitude bigger than that. That process of learning to recognize patterns in human speech in multiple languages, and understanding both human languages and software code, takes a long, long time and a lot of hardware. You were talking about OpenAI and its GPT-3.5 and GPT-4. These are trained on a supercomputer somewhere that Microsoft built. It's a truly impressive piece of hardware. It took months to train this thing, months of running on it. And the price tag on that entire server infrastructure, we're talking tens of millions of dollars just for the server racks. The electricity bill every month is probably more than most companies' IT budgets, just for that one system. So not a lot of people have that hardware available. If you want to get into building something that is a competitor to that, or similar in functionality, most people are just economically restrained from doing it. Most companies don't have the budget for something like that. So the beauty of Hugging Face, and the altruism of these massive tech companies, is this: they spend all this money and build something that's useful for them, and they'll keep the one that's trained really, really well on their own data. That powers their website or something else in their business. Everything you see out there on Hugging Face that's given by one of those companies, they're using a much better version of it in production, I promise you, or they've used it as a stepping stone to build something even more amazing. But they give these away for free to the world, to be like, hey, use this. The thing that's on Hugging Face already has its weights applied, and it's ready for you to do something called fine-tuning.
So you can take a much smaller data set and, through the Trainer object in Transformers, set it up to say, hey, train on my data. Here are my inputs and here's what I expect as an output. Try to learn this nuanced data set. But it already knows the structure of language. It already knows which words come after other ones. It knows context. It has this attention mechanism that it can fetch data with; that's the whole Transformer aspect of the architecture. It understands how to retrieve that data in a way that's meaningful for people to read the output and make sense of it. So your fine-tuning of that is just like: hey, I want to create a Q&A bot for my website, and I have a whole bunch of examples of questions that people have asked, and I have the responses that I've gone through and said, yeah, these are good responses to these good questions. I want this model to learn this aspect of my business. So you can take that pre-trained model and just say, hey, learn this. It'll adapt, and it becomes sort of accustomed to what you're expecting. But it doesn't have to go and learn English from square one anymore. You're stripping out five to six months of training on thousands of tensor processing units, because it already kind of understands the data that's coming in, and that's the beauty of it. So, to go way back to your first question, what's the best way to get started? That website and the Hugging Face Hub. If you want to get into large language models, or vision models, they have those too, by the way. They have a lot of cool stuff like audio processing, speech-to-text, multimodal models where you can pass in an image and it'll return what it sees in that image, and generative models that you can create a very simple chatbot with, and it's pretty clever. The precursor to GPT-3 is... not ChatGPT.
It's GPT-2, and you can create a conversational chatbot with Hugging Face Transformers. It's a fantastic set of APIs. It simplifies things to the point where, if you want to get started and build something pretty quickly, those APIs allow you to do that. And pretty soon you'll be able to log it to MLflow. But if you want to go even deeper and do something like, hey, I want to retrain this thing, not from scratch, but on a billion entries of text, or on hundreds of millions of images that I've personally labeled, you can do that with their APIs. You don't necessarily have to go and learn PyTorch from scratch, although most deep-learning-focused data scientists already know how to do that, and they can go in and tweak stuff and add layers. They can remove layers that they know they don't need, or slap a language head on top of a deep learning model, or swap a different one in. So there's lots of stuff that you can do if you know that low level, but I would always recommend that people start with those high-level APIs. They're great, and you get something useful out of them pretty quickly.
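As a rough illustration of the high-level APIs Ben recommends starting with, here is a sketch that wraps the Transformers `pipeline` function for text generation with the publicly hosted `gpt2` checkpoint. The helper names `build_generator` and `reply` are hypothetical, the import is deferred so the module loads even where `transformers` is not installed, and the first call to `build_generator` would download model weights.

```python
def build_generator(model_name="gpt2"):
    """Create a text-generation pipeline (downloads weights on first use)."""
    # Imported lazily so this module can be loaded without transformers.
    from transformers import pipeline
    return pipeline("text-generation", model=model_name)


def reply(generator, prompt, max_new_tokens=40):
    """Return only the newly generated text that follows the prompt."""
    outputs = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline returns a list of dicts; by default "generated_text"
    # contains the prompt followed by the continuation.
    full_text = outputs[0]["generated_text"]
    return full_text[len(prompt):].strip()
```

Typical use would be `generator = build_generator()` followed by `reply(generator, "Q: What is Ruby on Rails?\nA:")`. A base GPT-2 model has not been fine-tuned for conversation, so the answers will be loose continuations of the prompt, which is where the fine-tuning discussed above comes in.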
Michael_Berk:
And Dave, before we started recording, we were chatting about sort of your journey from Ruby into Python. You said it was the worst thing ever. Could you elaborate a bit more about your experience and your motivations for exploring the ML world?
Dave_Kimura:
Yeah, and maybe I'll just keep it very simple. I don't want my language to dictate when I need a space or not. I don't like the whitespace in Python. I hate it. It's very annoying, and it doesn't flow naturally to me. As I'm getting more accustomed to it, I found that using four spaces instead of two for indentation made it a bit easier to read, but it's still very annoying from that aspect. And I think a lot of it is just familiarity. I've been doing Ruby for coming up on 15 years now and had never touched Python before this. So I have a lot of bias toward and exposure to Ruby and some other languages, but very little in Python. I think that's where a lot of the disconnect for me was. But I will say I've been using ChatGPT, their Plus version, to help me in my Python learning. I don't like copying anything from ChatGPT, pasting it in, and moving on. I want to understand what it's doing and what it's trying to accomplish, because more often than not, it has given me incorrect things that I had to go through and debug. So that debugging process, working through the junk that ChatGPT gave back, has really helped a lot in my Python learning. But one thing that I found in fine-tuning, and it's funny that you mentioned it, is that I was going through a lot of the historical videos that I've created on Ruby on Rails development and that kind of stuff, and I wanted to add closed captioning to them. So naturally I go to Hugging Face, and I find the OpenAI Whisper model, which is actually really good.
Ben Wilson:
Hehehe
Dave_Kimura:
It has pretty good accuracy compared to online services. Amazon has a transcription service, and when I used that several years ago, it was at about 80% accuracy, which isn't horrible, but not where I would want an automated process to be.
Michael_Berk:
That's pretty
Dave_Kimura:
And
Michael_Berk:
horrible, in
Dave_Kimura:
yes.
Michael_Berk:
my opinion.
Dave_Kimura:
Yeah, it wasn't good. So I had to then manually go through and edit everything. But it did at least break it up by timestamps, which made that a lot easier. However, using OpenAI's Whisper model, now through Hugging Face, I was able to get something like 95-plus percent accuracy. Where it had its main problems was when I was speaking out code or using a proper noun. And I think that's understandable, because it wasn't trained on that kind of data. So where I'm going with this is, I actually made a Ruby on Rails application. It's a data set maker. I found that the biggest problem with fine-tuning is generating your own data sets to fine-tune the model on, because it's not easy. If you have someone who's not very tech savvy, or someone who really wants to get in and start fine-tuning a model but that's not really their wheelhouse, and they still want to be able to use AI or deep learning for a company product, then they have to, number one, find all this data and then convert it into a data set. And from what I understand, and keep me honest here, each model has its own kind of format or structure for the data set that you need to feed it in order to properly train it. Otherwise, PyTorch is just going to throw all kinds of errors.
Ben Wilson:
Yep.
Dave_Kimura:
So I built this data set maker where you upload a video or audio file, MP3, MP4, it doesn't matter. It'll break it up into chunks by looking for pauses in the audio. It assumes that a pause in the audio is the end of a sentence. That way you can create data sets: go through and manually transcribe the different chunks that you're getting from an audio or video clip, and then export that into an already computed data set that has all the numeric representations of the waveforms and all that. So that's something that I've been working on, and it works pretty well. I would say it got me a couple of extra bumps in accuracy while destroying other bits of it. But from y'all's experience, what is the easiest way to create your own data sets to fine-tune models? Because I think that's the biggest disconnect in fine-tuning for someone who doesn't do this for a living but wants to leverage some of the deep learning features and functionality.
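The pause-based chunking Dave describes could look something like the sketch below. It is not his Rails implementation; it assumes the audio has already been decoded into a list of amplitude samples between -1 and 1, and it treats any sufficiently long run of low-amplitude samples as a sentence boundary. The function name, the threshold, and the minimum pause length are all hypothetical, and a real tool would more likely measure energy over short windows than test individual samples.

```python
def split_on_pauses(samples, sample_rate, silence_threshold=0.02,
                    min_pause_s=0.3):
    """Return (start, end) sample index pairs for chunks between pauses.

    `end` is exclusive. A pause is a run of at least `min_pause_s`
    seconds where every sample's magnitude is below the threshold.
    """
    min_pause = int(min_pause_s * sample_rate)
    chunks = []
    start = None     # index where the current chunk began, if any
    quiet_run = 0    # consecutive quiet samples inside the current chunk
    for i, sample in enumerate(samples):
        if abs(sample) >= silence_threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_pause:
                # Close the chunk at the first quiet sample of the pause.
                chunks.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # Audio ended mid-chunk; drop any trailing quiet samples.
        chunks.append((start, len(samples) - quiet_run))
    return chunks
```

Each returned pair can be used to slice the original samples, and each slice then becomes one clip to transcribe by hand and pair with its text in the exported data set.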
Michael_Berk:
Yeah, so you're talking about a concept called supervised learning, and supervised learning means that there's a label attached. There's a whole other branch of machine learning called unsupervised learning, or clustering, which basically tries to find structure in high-dimensional data. In this case, for supervised learning, labeling data sets is one of the hardest things in machine learning, in my opinion, because it usually just takes a ton of people. The process is just to sit down and label stuff. And you're trying to find shortcuts, and best of luck to you.
Ben Wilson:
Thank you.
Michael_Berk:
And you can certainly get creative about it, but this is just one of the more challenging aspects, and sort of one of the softer aspects, of machine learning. But I'll kick it over to Ben, per usual.
Ben Wilson:
No, I couldn't agree more. It's perfectly said. Take what you were describing with the Ruby tool that you built, probably with a pretty slick front end. If you were to farm that process out to a thousand people and say, hey, pick 50 videos, I want you to watch them. And you have a scrub and rewind point where it's like, hey, I'm going to play you a 15-second clip at a time, and I want you to type out exactly what you hear. And then you set up a quorum, where for any given clip at least seven people have to do the transcribing, and then you vote on which transcriptions match the most. As long as you have a majority, say four people who agree and three who don't, those four people's values are selected. That process is how you create refined data sets. It really is just brute force, humans doing it. That's if you want really high-quality data. Have I traditionally always done that when I need to label data? No, it depends on how important the project is. If I were working on something that was going to be open sourced, and it's going to be used and critiqued by tens of thousands of people, I'm going to go the extra distance to make sure that each bit of data used for fine-tuning or retraining, or whatever you want to call it, is really, really good. If it's some internal use case where I'm teaching myself something, like, oh, I want to learn how this new architecture works, or it's a new algorithm that somebody built and I want to play around with it, I might just do a random data generator. I'll whip something up real quick in Python or in Spark and be like, hey, I just want 20 terabytes of data within these constrained spaces in this columnar storage format, or I want to generate
numeric arrays between these values, just so I can have something to teach myself with at extreme scale. It doesn't matter how accurate that is. Use that for performance benchmarking or something; that's fine. But most people fit somewhere between those two extremes. At one end is manually curated data that's been evaluated by experts, which is the most expensive type of labeled data that you can get. Think about it: if we're building an LLM from scratch and providing it training data, I guarantee you that at any of those companies that have put one of those models out onto Hugging Face for use by the broader world, that team has software engineers, data scientists, and machine learning engineers, and there are MLOps engineers involved in all of it. There's DevOps managing the deployment of this stuff internally as they're training it. But there are also linguists. These are PhD linguists who understand the nuances of language, and they're evaluating what the rule set is for which patterns should be presented to this language model. They're also doing qualitative analysis on the output, asking what the common issues are, evaluating the outputs of thousands of these things at a time, and saying, well, it doesn't quite get prepositions right. And they'll identify that. Now, if you just sent that to a layperson or to a software developer, well, not many in our profession hold dual degrees, like a science degree plus a language degree or an MFA in writing. Those Venn diagrams don't overlap, so you're not going to get that identification of language issues. So you need specialists.
So that's the most expensive sort of labeled data that you would have: having that expert go in and say, here are the issues with the data that's coming out of the model, so we need to correct these on the next training iteration, make sure that we're getting enough examples of the correct behavior, and adjust our scoring algorithm to look specifically at this problem that we're seeing. It's very expensive to do that, like ludicrously expensive. But most people don't fall on that side, and most people don't fall on the side of generating garbage random data just to test something out. They're usually closer to the more expensive side, but not many companies, and particularly not hobbyists, are going to be able to hire people to come in and be like, hey, could you do this? Like, get the transcriptions right on my videos for YouTube. Unless you work for a company that IPO'd and you've got a spare $50 million in the bank that you just want to blow. Then yeah, hire a bunch of people to do that. Amazon Mechanical Turk, go nuts. But most practitioners are going to be doing this at a company, and you're going to have specialists at your company who understand the data or the use case you're working on. You're going to try to involve them, incentivize them, ask for their help, and thank them profusely when they do a great job of preparing this data set. Then you use that to train. The data is not perfect. It's very hard for it to become perfect, particularly with language, but you want it good enough, and enough of it, so that when you do that retraining, it's going to understand. Well, understand is a poor term.
That's anthropomorphizing an algorithm. But it'll pick up on the patterns that occur most frequently in the language it's seeing, versus things that are outliers, where, hey, maybe there's one person who's not so great with writing and has grammatical errors or something in what they're providing.
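The quorum scheme Ben outlines earlier in this answer (at least seven transcribers per clip, and a transcription is accepted once four of them agree) can be sketched in a few lines. The normalization step, which lowercases text and collapses whitespace before comparing, is an added assumption so that trivial differences do not split the vote; the function names and the default of four votes are illustrative.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences match."""
    return " ".join(text.lower().split())


def quorum_label(transcripts, min_votes=4):
    """Return the most common transcript if it reaches min_votes, else None."""
    if not transcripts:
        return None
    counts = Counter(normalize(t) for t in transcripts)
    label, votes = counts.most_common(1)[0]
    return label if votes >= min_votes else None
```

Clips that come back as `None` have no majority and would need to be sent out for more transcriptions or reviewed by hand.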
Dave_Kimura:
That's interesting. So let me run a scenario by you. I used to work in the time and labor market, so I created time and labor software. That's basically employees clocking in and out based on a schedule, and then it would calculate overtime and accrual information like PTO. Let's say you wanted to incorporate machine learning in there using a deep learning model. It would have to be something rather niche, so probably a language model or a predictive model from scratch. And you want to basically flag outlying behaviors. Someone is supposed to work eight to five every day, and the model has learned that at this company, or at this kind of location, these people usually clock in late and clock out late. But then you have some outliers, where maybe this person actually clocked in early, and you want that to be flagged. So it's going to vary based on patterns at certain locations or companies and that kind of stuff. Number one, and I guess we always have to ask ourselves this: would a machine learning model or a deep learning model be the right fit to solve this task? And if it is, how would you go about creating a predictive analysis model like that? You're not really going to be able to use a pre-trained model, because your use case is probably a bit more specific. Or in predictive analysis, is it pretty much all the same, and you just fine-tune it with what you are experiencing and what the expected output is?
Ben Wilson:
Michael, logistic regression please.
Michael_Berk:
Yeah, so let's workshop this a little bit. So the use case is we are looking to predict when people will clock out late or clock in early. Is that correct?
Dave_Kimura:
out of the normal, what they normally do.
Ben Wilson:
So flag,
Michael_Berk:
So
Ben Wilson:
like
Michael_Berk:
we expect,
Ben Wilson:
this is abnormal or normal.
Michael_Berk:
got it.
Ben Wilson:
Yeah, that's why I said logistic regression.
Michael_Berk:
Yeah, so that would be a binary label. What we would look to do is essentially have a one if there will be an issue and a zero if there will not. One of the core design principles, just in life, is keep it simple. So using a pre-trained model from Hugging Face and then fine-tuning it would not be a very good solution, in my opinion. It might even be adequate to think of a deterministic system and look for patterns. Machine learning has this fancy hype because it can learn, it can train itself. But at the end of the day, it's just a set of weights, or a decision tree, or coefficients attached to variables. And you can create that decision process manually. You can create it through intuition. You can create it through understanding a system really well. That's honestly the first step that I always take: try to see whether there's a solution that we can create without actually having to train something. But that's a boring answer, so let's keep going. Let's say we actually do want to train a model. What is the intended use case? Would we just have it run once a day and flag, let's say a day before, if we expect employee X to be outside of the time box?
Dave_Kimura:
Yeah, something like that. So, I mean, the use case for this could be predictive labor yield. So in the food industry, you want to make sure that you are profitable at a restaurant, that your employees aren't too far outside of their scheduled hours, because then your labor yield goes up, which means you make less profit because you're spending more on labor. Or it could be approaching overtime: this person is probably going to hit overtime because, based on the predictive data that we're gathering with their clock-in and clock-out cycles for the past six months, we see that this pay period they're going to approach it, and that could cause problems for companies that are very small and can't afford that kind of thing.
Michael_Berk:
So typically you would want to construct a data set and then start simple and work your way up if the simple models don't lead to accurate results. So in this case, we are predicting a day ahead, so all of our data should be lagged by one day. So let's say we have employee track record. We have maybe employee age. Maybe that's a relevant predictor. Employee, I don't know, you name it: profession, what area they work in. And all of those can be used as predictors to see whether there would be signal in what they will do the following day. I think also day of the week is probably relevant. I'm way more likely to be late on a Friday than on a Tuesday. So getting creative with the feature engineering and the data set that you input, that's super relevant. And then you sort of want to sequentially increase complexity, like we said before. So the simplest model that would do this, again, is just like a group by statement in SQL. You say, hmm, is this predictor valid and does it help? But one step above that is something called logistic regression, which is just based on linear regression, but it uses a logit link function. And so it's part of the family of GLMs, or generalized linear models. And this is an example of something that's been around for 200 years, or probably like 100 years, that's still super powerful. So that's where you would start probably, and then maybe work your way up to a decision tree. So some classic ones are XGBoost. For Ruby people, that is probably a logical step because it handles a lot of the complexity for you. But there's also Random Forest, which I'm a personal fan of. And then at the very, very, very end is a deep learning model where you would sort of have a binary activation function that converts your continuous weight into either a zero or a one. So that would be the general process, but yeah, again, we try to keep it simple and deep learning is super cool. 
But as Ben was talking about earlier, it's really expensive to build the system for it and then to train.
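Editor's sketch: to make the "lag by one day, then logistic regression" step concrete, here is a tiny pure-Python version. The logit link means the model is linear in the log-odds, log(p / (1 - p)) = b0 + b1*x1 + b2*x2, and the sigmoid turns that back into a probability. Everything here is an assumption for illustration: the toy history, the two features, the five-minute lateness cutoff, and the hand-rolled gradient descent. In practice you would most likely reach for a library rather than write the solver yourself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_lagged_rows(history, late_cutoff=5):
    """Pair information known the day before with the next day's label.

    history: list of (weekday, minutes_late) in date order, weekday 0=Mon..4=Fri.
    Features: [was_late_on_previous_shift, next_shift_is_friday].
    The weekday of the target day is known in advance, so it is safe to use.
    """
    rows = []
    for (_, prev_late), (weekday, late) in zip(history, history[1:]):
        features = [
            1.0 if prev_late > late_cutoff else 0.0,
            1.0 if weekday == 4 else 0.0,
        ]
        rows.append((features, 1 if late > late_cutoff else 0))
    return rows


def fit_logistic(rows, lr=0.5, epochs=2000):
    """Fit weights and bias with plain batch gradient descent on log loss."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in rows:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of log loss with respect to the log-odds
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict_proba(weights, bias, x):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical three weeks for one employee who tends to be late on Fridays.
history = [
    (0, 0), (1, 2), (2, 0), (3, 1), (4, 20),
    (0, 0), (1, 0), (2, 3), (3, 0), (4, 15),
    (0, 1), (1, 0), (2, 0), (3, 2), (4, 25),
]
rows = make_lagged_rows(history)
weights, bias = fit_logistic(rows)
print(predict_proba(weights, bias, [0.0, 1.0]))  # chance of being late on a Friday
print(predict_proba(weights, bias, [0.0, 0.0]))  # chance on an ordinary day
```

The fitted coefficients are readable in the way Michael describes: each one is a weight attached to a variable.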
Ben Wilson:
Yeah. And think about the amount of time that you would be devoting to each of those steps that Michael discussed. If I were to guess, say the three of us were working on it and Dave, you're our intern and also our front-end UI, full-stack developer who is helping out with creating the app. And you're just like, hey, I want to learn from the ML engineers how they're doing all this stuff. So we give you these tasks, and this is about how much time you would spend on each of them. So, testing out that SQL query. Well, actually, the first step is collecting the data. That's probably a couple of weeks of going through and joining maybe 30 different tables in the data warehouse to get a consolidated data set that has all the data that you need. And then maybe another week or two of validating that data. So that's doing statistical analysis on each feature, saying, okay, I have 200 potential features that I can collect about all of these employees and about everything that we can brainstorm and think about. Well, which of these features are correlated to one another? You don't want to put that in a model, particularly a linear model. It has sort of an amplifying effect when you have things that are moving in the same direction at the same rate when you plot them on a chart. If they're correlated with one another, that's kind of bad if you have too much of that, because any of the other signals that might be more relevant get kind of buried. So it's like, oh, there's a strong signal when this thing changes, these other seven things change along with it. I'm just focusing my weights on that because that's the biggest amplitude change. Stochastic gradient descent is going to be sensitive to that, which is the
Dave_Kimura:
So
Ben Wilson:
solver for most of these.
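Editor's sketch: one simple way to act on Ben's point about features that move together is to measure pairwise correlation and keep only one feature from each highly correlated group. The feature names, the toy values, and the 0.9 threshold below are assumptions for illustration, and the helper does not handle constant columns (zero variance would divide by zero).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical features: the second is the first in different units.
features = {
    "hours_scheduled": [8, 6, 8, 4, 8, 6],
    "minutes_scheduled": [480, 360, 480, 240, 480, 360],
    "commute_km": [3, 3, 18, 7, 12, 25],
}
print(drop_correlated(features))
```

Which member of a correlated group to keep is a judgment call; this sketch just keeps whichever comes first.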
Dave_Kimura:
is that how like bias is introduced in those kind of situations?
Ben Wilson:
Somewhat. Bias would be like an irrelevant point of data or a sensitive point of data that you don't want the algorithm to learn. So in this model, if we had gender in there, and all of a sudden the model starts predicting that some bad condition is always happening more predominantly with men versus women or women versus men, that's bias. It's not intentional. It's just because of the data that we had. Maybe we have a poor selection of training data, or there's some other confounding data point that is the real root cause. You know, maybe it's saying that women show up to work later on Mondays. Well, if we were to drill down into that, into hidden data that we don't have access to, it could be that they're a little bit late to their job on Monday morning because there's more traffic getting to daycare, and it's young women who have children who are doing that child drop-off to preschool or to daycare. So the bias would be, hey, women are showing up later on this day, but it's actually not that. It's mothers who have young children who need that drop-off, and the reason for it is that traffic is really bad around the daycare. So it's not anything to do with them.
Dave_Kimura:
Mm-hmm.
Ben Wilson:
It just so happens. But if you didn't have that delineation within your data set to say, does have children, how many children, ages of children, you're not going to have that data. So if you put gender in there, that bias is going to apply to all women, regardless of whether they show up late on
Dave_Kimura:
Gotcha.
Ben Wilson:
Monday, that's how the model is going to generalize. So that's, that's bias or like one aspect of bias. The one that's more sensitive is stuff like race.
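Editor's sketch: a very basic check for the kind of bias Ben describes is to compare how often the model flags people in each group. This is an illustration under assumptions (the group labels, the toy predictions, and the helper names are made up), and a gap on its own does not prove bias or explain it. As Ben points out, the real cause can sit in data you never collected.

```python
from collections import defaultdict


def flag_rate_by_group(predictions):
    """predictions: list of (group, flagged) pairs, with flagged being 0 or 1."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, is_flagged in predictions:
        totals[group] += 1
        flagged[group] += is_flagged
    return {group: flagged[group] / totals[group] for group in totals}


def max_rate_gap(rates):
    """Largest difference in flag rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical model output for two groups of four employees each.
predictions = [
    ("group_a", 0), ("group_a", 0), ("group_a", 1), ("group_a", 0),
    ("group_b", 1), ("group_b", 1), ("group_b", 0), ("group_b", 1),
]
rates = flag_rate_by_group(predictions)
print(rates, max_rate_gap(rates))
```

A large gap is a prompt to go look at the features and the training data, not a verdict.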
Dave_Kimura:
Mm-hmm.
Ben Wilson:
Uh, when that's either intentionally or unintentionally added to a model, that's bad. It's how we perpetuate discrimination through the lens of opacity of algorithms. That's because our data is flawed and our decisions as humans are flawed and our data is based on those decisions. So that perpetuates. So there's tools that you can use to detect that and eliminate that, but that's part of that feature engineering stage of you saying, let's really analyze this data. What's in here? Are we collecting stuff we shouldn't collect, or do we not have the data that we really need? And that's what that analysis is, and that can take weeks. But once you do that and you get to the first thing that Michael said, like the group by, can we create our own decision logic based on heuristics? We're probably gonna know within a day whether that's a waste of time or not, maybe a couple of days. And that would be the three of us looking at the results and being like, eh, this kind of sucks. Or like, hey, in order to handle all these use cases, our decision logic is going to be the mother of all SQL scripts. And we don't want to maintain, you know, 15,000 lines of SQL. Nobody wants to maintain that. So we might say, hey, this is ripe for ML. It's like what ML was built for. Logistic regression, Michael and I could probably get that done in two hours, easily. That's with, you know, putting something like Hyperopt or Optuna around it, automatic hyperparameter tuning. And it's because each of us has done that hundreds of times in our careers of building models like that. We would probably also, within a couple of days, test out decision trees, random forests, XGBoost, LightGBM, you know, a bunch of traditional statistical-based ML packages. Might take us a week or two to go through a bunch of them and test out which ones are different. And then we could find one where we're like, this one's good. 
Now we go back to feature engineering and let's start culling that data set so we can start fine-tuning how well this thing performs. That iterative process is probably going to take another couple of weeks before we have a model that's like, yeah, we're pretty good with this. We can QA this in staging and have outside people come look at our work and tell us how much it sucks, so we can go back and redo it. But if we were going to go down that deep learning path where we have a temporal effect, where you're like, hey, I want this particular individual and I want to know what they did over the previous three months of work performance by day of week and all of these other extenuating exogenous factors that go into that vector of data that's passed in, we're probably looking at three to four months of R&D, because that would be a custom model that we'd build. We would be using PyTorch probably, or TensorFlow 2.0, maybe Keras. But we would not be crafting anything that's like a large language model. We wouldn't be having billions of input parameters. So it would be much smaller, with far fewer layers that we'd be creating. But we'd have to do experimentation and say, well, what if we added another hidden layer? Does it start learning something different? What if we remove that layer or switch this from one pooling layer to this other one? What if we use a different solver here? All of those iterations, that's four to six hours of training each time we want to test something. And then we come in the next day. All right, let's try these two things today. See how they work. And the whole time that we're doing that, we're racking up some pretty hefty cloud bills, because we're running it on GPU instances in order to get training to not take five days. So we're burning through money. 
Whereas if we tested all of those first algorithms, logistic regression, decision trees, random forest, XGBoost, all of these, you know, traditional ways of approaching this, all of that computation time combined is probably equivalent to two hours of deep learning cost. So it's like, where do you want to spend your time and money? Now there's
Dave_Kimura:
Mm hmm.
Ben Wilson:
plenty of times where you're working on a problem like that, where let's say we built something based on XGBoost or we did an ensemble model. Michael did something on logistic regression, I did something in XGBoost. And then you did post-processing logic where you're using Python to write your own algorithm for, basically, the output of the model: you're supplementing it with additional data and decision logic. We have this thing running and it's 92% accurate and we're super happy with it. It's been running for six months in production. Then somebody says, we really need to get this from 92% to 99.5% accuracy. Here's a $5 million budget for you guys for the next year. This is top priority for us. That's when you're going to hear Michael and I say, we need to build our own deep learning model from scratch. Because that's the only way to solve that.
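Editor's sketch: the "ensemble plus post-processing" setup Ben describes might be as simple as blending two models' probabilities and then letting business rules have the final say. The function names, the equal weighting, and the approved-leave rule are hypothetical and only there to show the shape of the idea.

```python
def ensemble_probability(p_logistic, p_boosted, weight=0.5):
    """Blend two models' probabilities with a simple weighted average."""
    return weight * p_logistic + (1.0 - weight) * p_boosted


def decide(p_logistic, p_boosted, on_approved_leave, threshold=0.5):
    """Post-processing: decision logic applied on top of the blended score."""
    if on_approved_leave:
        # Hypothetical business rule: never flag someone on approved leave.
        return 0
    blended = ensemble_probability(p_logistic, p_boosted)
    return 1 if blended >= threshold else 0


print(decide(0.8, 0.6, on_approved_leave=False))
print(decide(0.8, 0.6, on_approved_leave=True))
```

The models supply a score; the rules around them encode things the training data never saw.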
Dave_Kimura:
Mm-hmm.
Ben Wilson:
But you need that time allocated and that budget.
Michael_Berk:
And just to put it into perspective, Ben, how many models have you trained in your career? Roughly.
Ben Wilson:
There's no way I could even answer that.
Michael_Berk:
5, 500, 5,000?
Ben Wilson:
How many individual training runs have I kicked off?
Michael_Berk:
Sure. Yeah.
Ben Wilson:
I mean, it's in the millions. Like, the high millions.
Michael_Berk:
how many of those have been deep learning models?
Ben Wilson:
How many projects is probably another question for that. I'd say it's between 25 and 30. But
Michael_Berk:
so
Ben Wilson:
when we
Michael_Berk:
Dave.
Ben Wilson:
say, when we
Dave_Kimura:
Hahaha
Ben Wilson:
say training, for that it's automated. So with those other ones that we talked about, like logistic regression, we're going to control the iterations of that training. And when we build that model for that one instance on this one copy of the data that we're going to be using for training, we're going to tell it, like, hey, I want you to train 10,000 times, and here's your search space for hyperparameters. Go nuts. That is automated and you are training it, but that could finish in five minutes or 10 minutes or something. And then I can do another iteration or I can use another data set, and I can just crank out many dozens, if not hundreds, of these in a single work day. But when you talk about deep learning and working on something like that, I mean, just writing the code for the architecture of the model, and this is writing an original one, not something where I'm looking at somebody's GitHub account and seeing how they built it and then copy-pasting stuff and building. That doesn't count. When we're talking about building it from scratch, that is: I understand PyTorch and its APIs, I understand them mentally, and I've written down notes of, okay, this is the structure that I want, and this is how I want this to behave and how I want to control training for it. And then here are all of my parameters that I'm going to be using. That just takes time. I mean, it can take you a couple of weeks to craft something like that, just in code, and making sure that it works properly. And that's before you even start training.
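Editor's sketch: the automated "here's your search space, go nuts" loop Ben describes for the simpler models can be pictured as a search over hyperparameters. Dedicated tuning libraries use smarter strategies than this; plain random search is swapped in here as the simplest stand-in. The `toy_evaluate` function is a made-up substitute for "train a model and return its validation score", and the parameter names and ranges are assumptions.

```python
import random


def random_search(evaluate, search_space, n_trials=200, seed=0):
    """Sample hyperparameters uniformly and keep the best-scoring set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            name: rng.uniform(low, high)
            for name, (low, high) in search_space.items()
        }
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Stand-in for training a model and scoring it on held-out data.
    # By construction it peaks at learning_rate=0.1 and l2_penalty=1.0.
    return (
        1.0
        - (params["learning_rate"] - 0.1) ** 2
        - (params["l2_penalty"] - 1.0) ** 2
    )


space = {"learning_rate": (0.001, 1.0), "l2_penalty": (0.0, 10.0)}
best_params, best_score = random_search(toy_evaluate, space)
print(best_params, best_score)
```

Each trial here is instant; with a real but small model each one might take seconds, which is how thousands of training runs fit into a work day.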
Dave_Kimura:
Wow.
Michael_Berk:
Yeah, so the purpose of that question, if that wasn't clear, was to illustrate how rarely deep learning is actually used. It's big, it's bad, it's powerful, but 99% of use cases or whatever, 5 million over 35 or whatever the number was, that percentage is pretty representative. And typically you can get a lot more bang for your buck, especially if you're not a large company with millions of dollars to blow. You can do a lot more work with simpler models.
Dave_Kimura:
That's cool. So I know we've avoided OpenAI in a lot of these discussions because I'm more interested in the underlying technology instead of using someone else's API. But I think it is important that we do talk about the legality of it, because even playing with Stable Diffusion, I was generating artwork, and the tags that I used for this were children's book, art cover, monkey hanging from tree, and it generated some really cool looking images. But then in one of them I saw a watermark. It generated a watermark, so this was obviously generated from images that they did not have the rights, or I guess the legal right, to use within their training data. So where does something like that begin and end from a legality standpoint? And I don't expect y'all to know the legal ramifications and things, but just your inputs and thoughts on it. Because right now, you have someone suing GitHub and Microsoft over GitHub Copilot not giving credit for their code that was just copied from their open source project. And I'm sure we're going to see many more cases with artists and copyright infringement, even though Stable Diffusion 3 had an opt-out window where you could tell it to not use your artwork for its training data. But I mean, what's already been created and pushed out there is out there.
Michael_Berk:
That's an awesome find, by the way. A watermark generated by AI, holy crap.
Ben Wilson:
Yeah, I've seen examples of that online in news articles and stuff. I mean, speaking from the perspective of a published author, when I find that people are ganking my book in PDF format and sending it out for people to just download, personally, I don't care. I'm just happy that people are reading it and reaching out and being like, hey, I learned so much from your book. This is super cool. Thanks for writing it. I don't really see that much money from it anyway, because with book publishing for tech books, you don't make a ton of money. It can fund a hobby or two. But somebody who's complaining about a picture that they took, that they put their copyright logo on, and then saying, oh, you know, this AI used this thing and this big tech company is making money off of it. Yeah. I mean, you're already putting your stuff out on the internet for other people to use, and that's the price of, you know, sharing things with the world. People are going to use them. And sometimes they can build some really super cool things from them. Am I annoyed that GitHub Copilot has used my code to make itself smarter? Hopefully it didn't make it dumber. No, I'm not annoyed. I think it's awesome. But that's also speaking as somebody who spends most of his working life contributing to open source projects that our company makes no money off of. I don't make any money off of that aside from my salary. So I don't know. I don't personally see the problem. I think it's people trying to stand in the way of, or potentially hindering, the progress. Right now, it's stuff like image generation. It's gimmicky. It's cool. It's fun. It's like this neat little tool. Right now, we're saying that about that stuff. We're like, hey, I can pass in some words and it makes a picture. People are like, yeah, that's cool. How could you use that for anything really useful? 
Um, but everybody was saying that about generative text models five years ago. When GPT-2 came out, people were like, yeah, it's kind of dumb. That was just the precursor, the stepping stone for making something that now people
Dave_Kimura:
Mm-hmm
Ben Wilson:
are like, it's going to take our jobs. Oh my God. It's like, pump the brakes. You're just going to have a different job 10 years from now. Like, stop worrying about that. But eventually the tech will become something that people look at and are like, whoa, this is paradigm-shifting technology. Do I think that ChatGPT is paradigm-shifting technology right now? Not yet. Do I think that in the next four or five years, as these models get smaller, faster, smarter, better trained, more useful, with more people like you building, you know, Ruby apps that go around these sorts of APIs and allow you to build something super cool that interfaces with them, it's going to redefine not just technology and our industry? It's going to redefine the human race, with how we interact with technology and how we automate away things that we hate to do or things that we're not so good at. It's going to augment us. So it's weird when I hear people complain about that sort of thing, like, it stole my stuff. It's like, yeah, you could say the same thing about people complaining about the fact that, you know, their tax money went to the Apollo program and we put astronauts on the moon. It benefited humanity in so many ways with all the technology that came out of the Apollo space program. I think of it that way. We're all in this together as a species, and we're learning and growing and building more things. You know, solving problems with technology is what we do as a species. I think it's just great. But the legality aspect, I will leave that up to Congress.
Dave_Kimura:
Yeah. There is one really cool website that I actually discovered today. So there's the website Have I Been Pwned, which is a way that you can go out there and see if your email address has been exposed in one of the many different data breaches that have happened over however long. But there's also one called Have I Been Trained? And you can upload an image to it and it'll look, I guess, in the data sets that the model was trained on. I guess it's just for Stable Diffusion, and it'll show whether or not a similar image was used in that training. So it's pretty cool. I'll post a link to it in the show notes, but it's kind of neat.
Michael_Berk:
My Gmail has been pwned three times.
Dave_Kimura:
Ha ha ha.
Michael_Berk:
Apparently.
Dave_Kimura:
So I think that was a great outtake, Ben. Very well said. And I think we're kind of at the time now to wrap things up. But is there anything that y'all want to give as a final thought?
Ben Wilson:
I mean, I want to give just a further shout out to the people over at Hugging Face. I've been interfacing with your APIs over the last couple of weeks and integrating with them. And the altruism that I see with that company, and what their mission statement is about democratizing these tools, is what's going to build the next generation of people that are curious, that are just geared towards building the next generation of things. And the fact that it's free, anybody can contribute to it. And you can retrain one of these things on a specific use case and build your own pipeline and configure it and then push that back so that other people can use it. It's great, it's amazing. And if you're listening to this and you wanna learn how to get started with deep learning, check them out. Just search for Hugging Face, it's one word. They're very popular on the internet. And if you're more of a developer-focused person who just likes to look at APIs, just check out Transformers, the package. And you can look at the API docs and learn how to interface with this and get started. I think you'll be amazed at some of the cool stuff that's on there that people have just given away for free. It's awesome.
Dave_Kimura:
Yeah, I mean, without Hugging Face I wouldn't have been able to get any machine learning done, because you would basically have to start with a model from scratch, I guess, and then train up that model with a bit of data and then try to use it, and, like y'all said, it's going to be very dumb. You know, when I first started out, I got all giddy because I was like, oh look, there's GPT-2 out there on Hugging Face, I can use that. Its responses were horrible. I gave it a very simple prompt, like "what is Ruby", and it went into this dialogue that got darker and darker and just, like, vile
Ben Wilson:
Yeah, it hallucinates.
Dave_Kimura:
And so I'm like, man, this thing... what did it train on, just Reddit comments? Like, it was pretty bad.
Ben Wilson:
I think it was Twitter is
Dave_Kimura:
Oh
Ben Wilson:
what GPT-2 was.
Dave_Kimura:
It's just
Ben Wilson:
I don't
Dave_Kimura:
as bad.
Ben Wilson:
know for sure,
Dave_Kimura:
Yeah
Ben Wilson:
but that's a very old model. That
Dave_Kimura:
Mm-hmm.
Ben Wilson:
model was the start of generative text models. It was the big thing that the group that later formed OpenAI, that was like their big starting point and that's what got them funding by Microsoft to say, we like where you're headed. How about we build some hardware for you and we're going to invest in you guys. And we'll give you data too, like all the data you need and look at where it brought them.
Dave_Kimura:
Yeah. Well, I mean, Microsoft could not have done any worse than Tay, so...
Ben Wilson:
They learned their lesson with that one. I mean,
Dave_Kimura:
Yeah.
Ben Wilson:
that was a major stepping stone for everything that's come after that
Dave_Kimura:
Mm-hmm.
Ben Wilson:
of saying, hey, if we're going to release something to the internet, we need guards in place, we need safety controls. And the research department over at the Boston office for Microsoft Research, in Cambridge, they have worked for many years now on building amazing tech that detects, you know, malicious intent and bias in text, and removes problematic responses from generative AI models and a bunch of other models as well. So if anybody's interested in reading more about that, check out some of the blogs and research that have come out of Microsoft Research, and particularly their fairness department. Really cool people, really smart people.
Dave_Kimura:
Awesome. Well, let's go ahead and wrap this up. Ben and Michael, if people want to follow you online or see more of what y'all are doing, do you have a link to a blog post or a Twitter handle or Mastodon?
Michael_Berk:
I think just check out Adventures in Machine Learning podcast. You can find both of
Dave_Kimura:
Oh
Michael_Berk:
us
Dave_Kimura:
yeah.
Michael_Berk:
on LinkedIn, but yeah, we're the hosts of that. It's also a Devchat.tv podcast. And yeah, so if you're curious about OpenAI and ChatGPT and building models, we actually just did an episode with Chuck on how to transition from software engineering to machine learning. So if any of that is interesting to you, just check it out.
Dave_Kimura:
Awesome. Well, I really appreciate y'all coming on today. It was a lot of fun. I definitely learned a lot and confirmed that I am not as smart as I thought I was from the machine learning perspective. But y'all did a great job of explaining things. I caught like 50% of what y'all said and it made sense, but I'm excited to just keep hacking at this stuff.
Ben Wilson:
I mean, so are we. Every week at work, every new project that I work on, even though it's not building models anymore, it's building the tooling that helps people build models. Anything involving ML and AI, it's a fire hose that just never lets up. And the more that you learn about it, the bigger the fire hose gets and the higher the pressure behind it,
Dave_Kimura:
Mm-hmm.
Ben Wilson:
which is great for people that love to learn. So continue the journey.
Dave_Kimura:
Yeah, one more quick question. Just a quick opinion from both of y'all. For someone who does not know anything about machine learning and they want to get started, you say go to Hugging Face, and that's the best place to get started. But what about the hardware? So can y'all give just a quick couple of sentences of thought on this: should someone who is wanting to get into machine learning go out and buy hardware? Or should they leverage what they have, if they have something? Or should they use cloud services?
Michael_Berk:
I have a strong opinion on this. Use your laptop until it doesn't work, and then use either cloud services or hardware. Cloud services are more scalable. Hardware gives you more control.
Ben Wilson:
Just be very careful about your cost quota limits on certain clouds. Make sure that those are set up. Because you really, really don't want that bill or that notification at 6 a.m. because you left a TPU instance running on GCP for 18 hours idle, and be like, hang on a second, I owe what? $17,000? So just be careful, people.
Dave_Kimura:
Awesome. Well, thanks again for that advice and thank you so much for coming on and talking.
Ben Wilson:
Our pleasure.
Michael_Berk:
Yeah, it was a lot of fun.
Dave_Kimura:
All right. Well, thank you all for listening and we will talk to you all again.
Michael_Berk:
Have a good day everyone.