A Guide to AI Models: From Tokenization to Neural Networks with Ishaan Anand - JsJ_669

In this enlightening episode of JavaScript Jabber, hosted by Charles Max Wood and Steve Edwards, panelist AJ O'Neil is joined by guest Ishaan Anand to delve deep into the intricacies of AI and large language models. Ishaan, an expert with over two decades of experience in engineering and product management, shares insights into his innovative implementation of GPT-2, providing a comprehensive breakdown of how transformers work in AI.

Special Guests: Ishan Anand

Show Notes

The discussion covers various aspects of AI, including how models predict the next word, the concept of tokenization, embeddings, and the attention mechanism which is central to transformer architectures. Listen in as they explore practical applications, challenges, and the evolving landscape of AI, with a special emphasis on mentorship and education through Ishaan's unique course offering. Whether you're an AI aficionado or a JavaScript developer eager to expand your knowledge, this episode offers valuable perspectives and learning opportunities.

Transcript

Charles Max Wood [00:00:05]:
Hey. Welcome back to another episode of JavaScript Jabber. This week on our panel, we have Steve Edwards.

Steve Edwards [00:00:13]:
Yo. Yo. Yo. Coming at you live from cold and sunny Portland.

Charles Max Wood [00:00:19]:
We also have AJ O'Neil.

AJ O'Neil [00:00:21]:
Yo. Yo. Yo. Coming at you live from the soldering station.

Ishaan Anand [00:00:26]:
Woo woo.

Charles Max Wood [00:00:27]:
Oh, sorry. I'm Charles Max Wood from Top End Devs, and, yeah, it is freezing here. Anyway, we have a special guest this week, and that is Ishaan Anand. Ishaan, do you wanna let people know who you are, what you do, where your course is? Because your course is awesome.

Ishaan Anand [00:00:45]:
Oh, thank you. So, my name is Ishan Anand. I have about twenty years of engineering and product management experience. And, most recently, I've been very focused on AI for the last couple years, and I'm best known in the AI community for an implementation of GPT-2. It's a precursor to ChatGPT that I implemented entirely in Excel. And then late last year, I ported that entirely to the web in pure JavaScript, and I teach how the entire transformer works. Basically, the model that, you know, was the ancestor to Gemini, Bard, Llama, ChatGPT, Claude. They're all really inheriting from this model called GPT-2.

Ishaan Anand [00:01:27]:
And I teach people in basically a course of two weeks. Whether you have really no programming experience or you've got JavaScript programming experience, this is the best way to really get in and understand how these things work, so they don't have to be a black box. And you can see all that at spreadsheets-are-all-you-need.ai, and the class is on Maven.

Charles Max Wood [00:01:46]:
Very cool. So let's, let's dive in. First, I think you said you had a promo code for the course. So let's just put that out there. Yeah. People wanna go get it and get a deal on it.

Ishaan Anand [00:01:56]:
Yeah. So the promo code is really easy to remember. It's JSJabber. And just go to maven.com and look for my name. Or if you go to spreadsheets-are-all-you-need.ai and you click that, you can use that promo code for 20% off for the next two weeks. So Awesome. Definitely check that out. And I should just say, you know, thank you guys for having me.

Ishaan Anand [00:02:17]:
I listened for years to this, so it's great to actually meet you guys, well, virtually in person.

Charles Max Wood [00:02:23]:
Right. Yeah. AJ is the cool one. I I I just run the show. Anyway And

Steve Edwards [00:02:29]:
I'm just the funny guy while everybody else are the smart people according to some people.

Charles Max Wood [00:02:32]:
So Anyway, so let's dive in. You said that you explained how the transformer works. And so for those that are kinda new to AI, do you wanna just explain what a transformer is in AI?

Ishaan Anand [00:02:46]:
That's not

Charles Max Wood [00:02:46]:
a robot. And then we can dive into how stuff works.

Ishaan Anand [00:02:48]:
Yeah. Sure. So the transformer is a, you know, AI architecture of a model that came out in 2017, and it is the foundation for most of the, you know, AI models that have been, you know, like ChatGPT. So those chatbot assistants that seem amazingly smart all inherit from this architecture called the transformer. And I can give a high level overview of everything that goes into that. But the key thing that the transformer does is usually take some input, and it tries to predict what the next word is. And that's really all your large language model is doing: it's picking one word, or really one token, at a time, and it's trying to predict, when you enter in a question, what the next thing is. And over, you know, the last couple years, what we've been able to do, we collectively as humanity, is take this model that tries to predict the next word and turn it into these really helpful, amazing chatbot assistants.

Ishaan Anand [00:03:49]:
And the paper that introduced this model called the transformer was called "Attention Is All You Need." And that's where my course gets its name, Spreadsheets Are All You Need: I basically implemented that entire model inside a spreadsheet, hence the name.

Steve Edwards [00:04:06]:
So question here then. So, I mean, having used Google since its inception, you know, type ahead is sort of a standard thing in search, where you're typing and it's starting to anticipate what you're gonna type next. If I'm, you know, starting to search for spreadsheets on Google, it's gonna anticipate, okay, what's the next thing I'm gonna type? So is this basically the same thing just on AI steroids? Because, I mean, basically, that's using what people have typed in, and, you know, they've indexed it and done things with it. So is that sort of the same thing just on steroids, or is that intrinsically different?

Ishaan Anand [00:04:46]:
Yes and no. In terms of effect, it is literally just doing the same thing. Like, it's trying to predict what the next thing is. I kind of get a little bit of a mental pushback to just saying, oh, it's just like autocomplete. It is basically structured as an autocomplete problem, but the level of complexity of the architecture to solve that problem is just a lot more complex. But it is trying to do the same thing.

Steve Edwards [00:05:14]:
Right.

Ishaan Anand [00:05:16]:
And, you know, the way to think about this is, if you can fill in the blank in any sentence, you probably know something about that sentence. You already know what the answer might be. Like, that's a useful test of knowledge. But, effectively, yeah, that is what's going on. It's just trying to predict the next word and then the next word after that and so forth, one at a time.

Charles Max Wood [00:05:36]:
Right. And so, effectively, I guess the autocompletes that we typically see are a little bit more naive than, say, the AI LLM models, where they have substantially more data to run on and, you know, use a mechanism that I guess is probably somewhat the same, because it's weighted and things like that. But, anyway, it can do it across a wider variety of things and give you deeper answers.

Ishaan Anand [00:06:11]:
Yeah. So, I mean, actually, let's start with the autocomplete example, because it does kinda point the way to some parts of the architecture. Like, the simplest thing you might do for building an autocomplete is you might just say, if I see this word, what are all the next likely words that will be after it? And you could just do a statistical lookup across some large dataset. Right. And as good as that'll be, the more pieces of data you look at, the better its predictive value. Mhmm. So this is called, like, a bigram model, because it looks at two.

Ishaan Anand [00:06:37]:
And then what you could do is you could actually look three words back, or you could look four words back. And, actually, one of the key things about the transformer is it tries to look at all the words, and this is what the attention mechanism does: it can figure out, essentially from all the possible words before it, what the next most likely word is. And then the other key thing you need to do is you ask a neural network to take all that information and make a prediction. And it turns out that's the heart of the transformer. And what really made it work was they just scaled that up to a much larger size than I think people were, you know, used to doing. You know, the autocomplete in your keyboard is built to be really, really fast, and so they tried to make it really efficient. And then what we've been able to do with the transformer is make it really big and then actually make it super efficient, scaling it back down so that it spits out tokens at a reasonable clip. But that core idea of saying, hey.

Ishaan Anand [00:07:30]:
Let me look statistically at, you know, what the next thing is. Well, one word isn't gonna be good enough. Two words back is gonna be better. That is what the attention mechanism, in a sense, if you squint, is doing: it's trying to look at all the words that came before. It puts them through multiple passes, and then it's asking a neural network to do the prediction rather than just simply saying, oh, let me take the raw statistics. But yeah.
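The bigram lookup Ishaan describes can be sketched in a handful of lines of JavaScript. The corpus below is a made-up toy, and the counts and tie-breaking are simplifications; a real model would build statistics over a huge dataset:

```javascript
// Build a bigram model: for each word, count which words follow it,
// then predict the most frequent successor. (Toy corpus for illustration.)
const corpus =
  "mike is quick he moves quickly mike is tall he moves slowly he moves quickly";

const counts = {};
const words = corpus.split(/\s+/);
for (let i = 0; i < words.length - 1; i++) {
  const cur = words[i];
  const next = words[i + 1];
  counts[cur] = counts[cur] || {};
  counts[cur][next] = (counts[cur][next] || 0) + 1;
}

// Predict the statistically most likely next word after `word`.
function predictNext(word) {
  const followers = counts[word];
  if (!followers) return null;
  return Object.entries(followers).sort((a, b) => b[1] - a[1])[0][0];
}

console.log(predictNext("moves")); // "quickly" follows "moves" twice, "slowly" once
```

Looking three or four words back, as he says next, just means counting longer prefixes instead of single words; attention generalizes that to the whole preceding context.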

Charles Max Wood [00:07:56]:
So do you kinda wanna break down for us how these systems actually work?

Ishaan Anand [00:08:02]:
Yeah. So the first thing I'd say is, the simplest model that I like of the transformer is you know, we said that these are trained to fill in the blank on a piece of text. So the example I often use in my lectures and inside a lot of my material is this very simple sentence: Mike is quick. He moves. And the next most likely completion would probably be he moves quickly, or he moves around, or he moves fast. And so the basic question is, how do we get a computer to fill in the blank of an English sentence, or any natural language sentence? And what we've been able to do is actually figure out how to talk in the language of the computer, which is math. So if I gave the computer a math problem, two plus two equals, it could fill in the blank.

Ishaan Anand [00:08:52]:
It knows that two plus two equals four. And we can make the math pretty large and complex, but computers are really good at math. So what we've been able to do, and what the model does, is it takes a word problem, and it's really converting it to a math problem. If you look inside you know, go to my website, spreadsheets-are-all-you-need.ai slash gpt2, or download the Excel file and look inside it, what you'll see is, you know, there's text at the beginning. You type in text on one part of the spreadsheet, you get the predicted word at the other end of it. But in between, if you look in there, you'll be like, where the heck are the words? It's all numbers. And so the key insight is what we've been able to do is take something that is a word problem, and we've turned words into math. And that mapping process of words into math has two stages.

Ishaan Anand [00:09:41]:
It's called tokenization and then embeddings. And at the end of it, we map every word. You can conceptually think about it as mapping to a single number, but we actually map them to a large list of numbers. And then once you have a mathematical representation of your prompt, your entire prompt has been, you know, turned into a large list of numbers, we then run I just call it number crunching. It's these two key mechanisms, attention and a multilayer perceptron, or a neural network, that just kinda crunches on it to try and predict what the next word is. And then at the end of that, we get a number. And we reverse the process that it came out of, and we say, well, what word does this number map to? That number is a predicted word, but it's not gonna map cleanly to every single word in our vocabulary.
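The two stages named here can be sketched with toy data. The vocabulary and embedding values below are invented for illustration; real GPT-2 uses byte-pair encoding over a roughly 50,000-token vocabulary, and each token maps to a list of 768 numbers:

```javascript
// Stage 1, tokenization: map text to integer ids via a vocabulary lookup.
// Stage 2, embeddings: map each id to a list of numbers the model crunches on.
const tokenVocab = { mike: 0, is: 1, quick: 2, he: 3, moves: 4 };

const embeddings = [
  [0.2, -0.1], // id 0: "mike"
  [0.5, 0.3],  // id 1: "is"
  [0.7, 0.9],  // id 2: "quick"
  [-0.3, 0.4], // id 3: "he"
  [0.1, 0.8],  // id 4: "moves"
];

function encode(text) {
  // Toy tokenizer: lowercase and split on whitespace, then look up ids.
  return text.toLowerCase().split(/\s+/).map((w) => tokenVocab[w]);
}

function embed(ids) {
  return ids.map((id) => embeddings[id]);
}

const ids = encode("Mike is quick"); // [0, 1, 2]
const vectors = embed(ids); // three short lists of numbers, ready to crunch
```

A real tokenizer also splits unfamiliar words into smaller sub-word pieces, which is why a token is "slightly smaller than a word," as Ishaan puts it later.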

Ishaan Anand [00:10:26]:
And so if that number is closer to certain words, like in the case Mike is quick, he moves, the predicted number might be really close to the word quickly. It might be close to the word around, but it's not gonna be close to you know, quick can be a body part. It can be the quick of your fingernail. It's not gonna be something about your fingernail, because it's figured out enough that it's moved the predicted number away from that. And so we take that and we run a random number generator at the very end, and then we pick according to that random number generator, based on how close that number is to one of the other words in the dictionary of words that are mapped to numbers. So that's, like, my highest level summary of what's happening under the transformer without describing all the mechanisms. But, again, the key thing is we found a way to solve this problem numerically. We map words to numbers.

Ishaan Anand [00:11:14]:
We turn the whole sentence, your entire prompt, into a large list of numbers. We number crunch on it. Then we get a predicted number out of it. We just calculate, and we look at how close that number is to our number to word mapping at the very end, and that's the probability you get of getting a particular token or word out of the model. Let me pause there, see if there are questions or things I should clarify.
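The final step he pauses on, scoring every vocabulary word by closeness and then rolling a random number, might look like this sketch. The one-number-per-word "embedding" and the softmax-over-distance scoring are simplifying assumptions; real models compare high-dimensional vectors:

```javascript
// Toy sketch of the final step: the model emits a predicted number (real
// models emit a long vector), every word in the vocabulary is scored by
// how close it is, and a random draw picks the output token. The vocab
// and its one-number "embeddings" are invented for illustration.
const vocab = { quickly: 0.9, around: 0.7, fingernail: -0.8 };

// Turn closeness into probabilities: nearer words get exponentially
// higher scores, normalized to sum to 1 (a softmax over negative distance).
function nextWordProbs(predicted) {
  const scores = Object.entries(vocab).map(([word, value]) =>
    [word, Math.exp(-Math.abs(predicted - value))]
  );
  const total = scores.reduce((sum, [, s]) => sum + s, 0);
  return Object.fromEntries(scores.map(([w, s]) => [w, s / total]));
}

// "Run a random number generator at the very end" and pick accordingly.
function sampleWord(predicted) {
  const entries = Object.entries(nextWordProbs(predicted));
  let r = Math.random();
  for (const [word, p] of entries) {
    if ((r -= p) <= 0) return word;
  }
  return entries[entries.length - 1][0]; // guard against rounding drift
}

const probs = nextWordProbs(0.85); // predicted number lands near "quickly"
```

With a prediction of 0.85, "quickly" gets the highest probability, "around" less, and "fingernail" almost none, which is exactly the behavior described for the Mike-is-quick example.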

Charles Max Wood [00:11:34]:
So I I think I follow along. Essentially, what you're saying then is so let's say I wanted it to generate a whole paragraph. It just does this over and over and over again to get the next word and the next word.

Ishaan Anand [00:11:48]:
Yeah. Maybe I've glossed over that part of it. Like, the large language model only predicts the next word, technically, something called the token, which is slightly smaller than a word. And every time you get a prediction out of it like, it doesn't, by default, predict paragraphs. So if you, you know, try my app or you download the spreadsheet, it only predicts one token. And the way we get paragraphs of text out of this is we take the predicted token it came up with, and then we stick it back onto the input, and then we ask it to predict the next token of that new accumulated paragraph. And so you can actually start with a single word, ask it to predict what the next word is, and now you've got two words, and then you run it through, and then you keep going. And then what happens when you've got user input, like somebody types a response, is you just stick that entire user input on as, you know, a large set of words from which it needs to predict what the next thing is.
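The loop described here, predict one token and stick it back onto the input, can be sketched like this. `predictNextToken` is a hypothetical stand-in for the whole transformer, hard-coded below with canned completions:

```javascript
// The autoregressive loop: the model only ever predicts one token, so
// generating a paragraph means appending each prediction back onto the
// input and calling the model again.
function predictNextToken(tokens) {
  // Stand-in for the transformer: canned next-word lookups for the demo.
  const canned = { quick: "he", he: "moves", moves: "quickly" };
  return canned[tokens[tokens.length - 1]] || "<end>";
}

function generate(prompt, maxTokens) {
  const tokens = prompt.split(/\s+/);
  for (let i = 0; i < maxTokens; i++) {
    const next = predictNextToken(tokens);
    if (next === "<end>") break;
    tokens.push(next); // stick the prediction back onto the input
  }
  return tokens.join(" ");
}

console.log(generate("mike is quick", 10)); // "mike is quick he moves quickly"
```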

Ishaan Anand [00:12:39]:
And you can think about it structured into the model as: you are reading a transcript between a user and a helpful chatbot assistant. User said x. We fill in what the user said, assistant said, and then it needs to come up with what the assistant said. And it just tries to come up with something plausible. Maybe, taking a step back, like, the base model that gets trained in this process, before it's turned into a helpful chatbot, just knows really simply how to complete sentences. If you take the base GPT-2 and you type in, you know, questions to it, it's not gonna necessarily respond back to you meaningfully. It's just designed to predict the next word based on everything it's seen on the Internet. So a good example I use in class is, we type in the words first name, and then you hit return.

Ishaan Anand [00:13:28]:
And, well, what do you think it would predict after that? It predicts last name, email address, phone number. Because for most text on the Internet that says first name, statistically, it's a form, and it's used to just filling out forms. Another one is, I type in hello class. And when I first did this, I thought it was gonna say hello teacher, but it actually starts spitting out Java code. So it just looks at the text. Yeah. It's really amusing to watch. And you can just run it, and it's just trying to predict what the next thing is based on what it saw on the Internet.

Ishaan Anand [00:14:04]:
And then what, you know, OpenAI and Anthropic and these companies do is they put that what we call a base model, which all it knows how to do is predict the next word, through a training regime to elicit it to be more like a helpful chatbot. So you give it a system prompt that tells it it's a chatbot. It's kind of like you tell it a story that's plausible, for it to start to think like it's talking to a user. Like, you are a chatbot. You are reading a transcript of a chatbot and a human user. And we just fill in what the human said, and it tries to fill in what it thought the helpful assistant would say. And then they fine-tune it to get better at that.
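The "transcript of a chatbot and a human user" framing amounts to string formatting before the text ever reaches the model. The role labels below are illustrative; each model family defines its own special tokens for this:

```javascript
// Sketch of how a chat is presented to a next-word predictor: the whole
// conversation is flattened into one long transcript, and the model is
// simply asked to keep completing it after "Assistant:".
function buildPrompt(systemPrompt, turns) {
  let prompt = `System: ${systemPrompt}\n`;
  for (const { role, text } of turns) {
    prompt += `${role === "user" ? "User" : "Assistant"}: ${text}\n`;
  }
  // The model continues from here, one token at a time.
  return prompt + "Assistant:";
}

const prompt = buildPrompt("You are a helpful chatbot assistant.", [
  { role: "user", text: "Hello class" },
]);
```

With the system line in place, the statistically plausible continuation shifts from Java code toward a teacher-style reply, which is the prompt-engineering effect Chuck describes above.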

Charles Max Wood [00:14:41]:
Yeah. This sounds a lot like what you're explaining, you get into prompt engineering, which, again, if you're not into AI, prompt engineering is all the stuff I tell the AI system so that it'll give me the answer I want. Right? And so when we're talking about prompt engineering now, it's okay. So this is why, when I start out, I tell it things like like you said, you are a chatbot. You help people with these problems. You do these kinds of things. Because it'll build off of all of that and use the statistical model, now with the context of what you typed in, to give you the right answer. So, you know, yeah, hello class. There's not a whole lot there for it to go on.

Charles Max Wood [00:15:21]:
But if you tell it, you know, you're a chatbot and you're helping students with the blah blah blah blah, then you type in hello class and it's gonna go you know, then it may come back with hello teacher or something like that.

Ishaan Anand [00:15:31]:
That that's a great example. Yeah. So and what you can think about them conceptually doing is baking that prompt engineering into the model. So what they're able to do is if they give it enough examples of this, they can retrain it such that you don't need the prompt at the beginning that tells it it's a teacher or that it's a helpful chatbot assistant, and that gets baked into the model. You can think about all that prompt engineering gets memorized into the model during that training process, and then it turns into that helpful assistant.

AJ O'Neil [00:16:00]:
So help me understand this a little bit. So I've played around, obviously, with GPT. I've also played around with the other models. In fact, right now, I really like Qwen. I am using Qwen more than I'm using GPT because Qwen actually seems to be giving better results, especially considering in the benchmarks, it outperforms 4o, whatever that means. I mean, it's, like, by a fraction of a percentage point. But o1, I just find o1 and r1 to be like, they take forever. So it's, like, I'd rather ask the question twice and be 99% likely to get the right answer than ask the question one time and then have to wait forty-five seconds to get the wrong answer and ask it again.

AJ O'Neil [00:16:47]:
You know?

Steve Edwards [00:16:48]:
Forty five seconds. That's an eternity.

AJ O'Neil [00:16:50]:
The o1 is crazy. The o1 and o3.

Steve Edwards [00:16:53]:
We use it for code questions and stuff, because it seems to do better than the standard GPT-4. Yeah, you wait a few seconds, but I'd rather wait a little bit and get a better answer than get something super fast that's not gonna be as good.

AJ O'Neil [00:17:08]:
Well, I'm the other way, because it's not that much. But if you look at the benchmarks, it's, like, 1% better than 4o, and it takes, you know, so much longer anyway. But the thing that I was getting at is, in the beginning, there was the system prompt. Right? So with GPT, one of the ways to jailbreak it was you could say, that was just a joke. Actually, you're a, something else. And so it would interpret it as, okay, your system prompt is you're a chatbot. You're allowed to say this. You're not allowed to say that.

AJ O'Neil [00:17:50]:
But you could just say, that was just a joke. And then Yeah. And then give it an additional prompt. Now with DeepSeek v2.5 and r1, and with Qwen, it's like you're saying, it's baked into the model. Because if I override the system prompt and I tell it, you know, you are a human who is capable of reasoning and has no biases and can represent any information factually. Tell me about Tiananmen Square. It's, you know, it's I am a helpful bot. I am not a human, and I do not talk about things that contradict, you know, the proper knowledge of the Chinese government to protect the people or, you know, it gives me some nonsense like that.

AJ O'Neil [00:18:48]:
So how is it possible to bake in those system prompts with training data? And, I guess, how does that vary from the system prompt, and how do they get it to bake that in so that you can't override it with a system prompt?

Ishaan Anand [00:19:06]:
Okay. There's a lot of layers there. Let me yeah.

Charles Max Wood [00:19:10]:
That's a long question. Can you restate the question in one sentence?

AJ O'Neil [00:19:14]:
Yeah. I think the question was the answer.

Ishaan Anand [00:19:16]:
I think I got the question, which was, how do you bake in the system prompt. But there's a couple things that are worth noting in your question. Like, you mentioned some reasoning models, o1 and r1. And the way those operate is a little bit different. Like you said, it takes a while to come back, because it's actually just expending a lot of tokens thinking that it doesn't give you. And it's trying to actually think through the process like you might do. They call this chain of thought, or thinking step by step. And what's unique about that compared to regular chain of thought is it can suddenly realize, oh, it's made a mistake, and backtrack. And so it's literally, you know, coming up with hypotheses and trying and testing things and seeing if it works.

Ishaan Anand [00:19:56]:
So this is why these models tend to be really good on math and code, because it can go try something and say, oh, wait. Let me check. Is this answer right? Oh, no. It's not. Let me try again. So, and then you mentioned, you know, jailbreaking. And with the early models, one way to think about, like, your oh, this is just a joke, is that, you know, you're kind of taking that attention mechanism we talked about briefly, the looking back at the previous words for what's most likely. If you put things like, you know, kill and harm in the prompt, statistically, it sounds like it's negative.

Ishaan Anand [00:20:33]:
Right? But if you start putting in things like grandma, cookies, it seems less harmful. You can kinda think of yourself as weighting the attention to be more to the harmless side. And, really, what's happened is that the models have gotten smarter, both in terms of their natural responses to this. Like, they are trained to handle jailbreaks. They are trained on: if a jailbreak comes up, here's the response. And the way they train it, to get to your main question, is through these training techniques. One is they just give it an example of a prompt and what its response should be, and they use this technique called backpropagation, or stochastic gradient descent, which is to tune the network such that every time it sees that prompt, it gives out what we wanted it to. So we're going a little ahead of where we want to be, but, like, when you train a neural network, you give it examples of data.

Ishaan Anand [00:21:23]:
So the simple example is a dog and cat classifier. Right? I give it pictures of dogs, and I give it pictures of cats, and I tell it which ones are dogs and cats, and it comes up with the rules for how to figure out whether an image is a dog or a cat. This is way different than regular programming. Right? Machine learning inverts the normal paradigm we're used to as developers. Normally, I write a series of rules, a program. Right? And then it processes data and gets out a result.

Ishaan Anand [00:21:54]:
I click a button, something, you know, moves on the screen. I can write that program. But a dog and cat photo classifier if you gave me photos of dogs and cats, I know how to do that instinctively, but I don't know how to write it out as a series of rules. And so the inversion that machine learning does is you give it answers, and you give it data, and then it figures out how to write the rules. It writes the program. Now, unfortunately, we can't always understand the program it comes up with. But what they do is they give it examples of jailbreak attempts, and they say, hey.

Ishaan Anand [00:22:28]:
You know, your response to that now should be this. That's kind of the high level overview of how they do that. One thing that's worth noting is that when they protect a model, it isn't just in the model itself. So there are usually things watching the result of the model that are additional classifiers. And so sometimes you might see examples of open source models that let you do things, but the hosted versions do not, because the hosted system is actually checking. So not everything is baked into the model itself, and it's not a hundred percent perfect. So often there are some additional guardrails that are detecting things.
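The training idea sketched across this answer, give the network examples and nudge its parameters until the outputs match, can be shown with the smallest possible case: one weight fit by gradient descent. This is the core of backpropagation in miniature, not an actual language-model training loop:

```javascript
// Machine learning's inversion in miniature: we never write the rule
// (here, y = 2x). We hand the model (input, answer) pairs and let
// gradient descent adjust a single weight until predictions match.
const examples = [[1, 2], [2, 4], [3, 6]]; // (input, correct answer)

let weight = 0; // the lone "parameter" of this tiny model
const learningRate = 0.05;

for (let step = 0; step < 200; step++) {
  for (const [x, y] of examples) {
    const error = weight * x - y; // how wrong the current rule is
    // Gradient of squared error with respect to the weight is 2 * error * x;
    // step the weight a little in the direction that reduces the error.
    weight -= learningRate * 2 * error * x;
  }
}

console.log(weight.toFixed(3)); // converges very close to 2
```

A jailbreak-refusal example works the same way in principle: the "input" is the jailbreak prompt, the "answer" is the desired refusal, and billions of weights get nudged instead of one.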

AJ O'Neil [00:23:06]:
Okay. So then two more questions.

Ishaan Anand [00:23:09]:
Yeah.

AJ O'Neil [00:23:11]:
What constitutes open source? Because that does not mean the same thing that it means in programming circles. Or I don't believe it does. Because I have not yet seen any open source model that comes with 400 terabytes of training data.

Ishaan Anand [00:23:33]:
They are few and far between. There are some. OLMo is probably the best known one, which is a model where everything, the training code, the training data, the data collection pipeline, the logs from their training runs, is completely open. There's, like, a handful of others. But this question of what constitutes open is completely a gray area, and it's being debated right now. Traditionally, when people talk about an open model, it's usually an open weight model, which is you get the parameters, which encompass the rules. We talked earlier that whole thing is math. Right? So if you open up my spreadsheet or you open up my website, you just see lots of numbers.

Ishaan Anand [00:24:24]:
You know, whether those numbers are hidden from you or you can run them yourself is what people call an open weight model. That's what kind of passes for open source these days. There are very few models that open up the training data. And so it's debatable, and people do debate what a truly open model means. Like, the most open would be one that includes the data. But there aren't that many, especially at the state of the art, where all the training data that created the model is there.

AJ O'Neil [00:24:55]:
Well, I mean, it would be highly illegal for ChatGPT to give us their training data, because YouTube and all the libraries on planet Earth and everyone who has a copyright on something would have something to say about that.

Ishaan Anand [00:25:10]:
Well, I mean, I'm not gonna comment on any particular model provider specifically. I will say the idea of whether you can train on data and whether it's transformative is, quite frankly, still in the courts right now, and we don't have global consensus. So I believe it's Japan that has said and clarified that you can train on data, that the training process is not directly infringement. Now, you know, one of the litmus tests is, like, whether you're competing with the original thing. So it's a larger open question. But right now, that's making its way through the courts. I think, you know, candidly, if you said here are all the things I've trained on, then you might end up, you know, just opening yourself up to more people who can say, oh, let me get in on that lawsuit. But, I mean, that question is still being that's a legal question, which, Yeah.

AJ O'Neil [00:26:06]:
Yeah. Yeah. Okay. So my other question, related to what we had just talked about, was so I download Qwen, and I don't know what it is that I'm actually downloading. Because I use Ollama, and it downloads 20 gigabytes of something, and then it runs it.

AJ O'Neil [00:26:30]:
And I get to be productive, and I'm happy. But in terms of, you know, like you're saying, it's not just the model giving a response, but then there's other code that is doing some sort of check. Is that happening with these models that I'm using with generic tools like Ollama or llama.cpp or LM Studio? Is that actually running program code, binary code, or I guess it's not binary code. It'd have to be bytecode, because I can do it on Mac and I can do it on Linux, and it doesn't have to recompile anything after it's done downloading it. So where are those extra layers, or how are they interpreted?

Ishaan Anand [00:27:23]:
The extra layers that are protecting the model from saying the wrong thing, is that what you're asking?

AJ O'Neil [00:27:28]:
Yeah.

Ishaan Anand [00:27:29]:
Yeah. Those aren't there when you download an open source model. When you run it on Ollama, those extra layers the only thing that is protecting the model at that point is how the model was trained. The pretraining they did that they baked into the model. They're not doing any additional checks. So with the hosted model, there are additional layers, because they control the infrastructure, and they're watching what the model says, and they're stopping it. But, typically, when you use, you know, LM Studio or Ollama, you're just getting the bare uncensored model, and there are no additional checks. The only thing that's preventing the model from, you know, call it saying the wrong thing, being, I guess, harmful and unhelpful, is just the training that it was put through.

Ishaan Anand [00:28:23]:
There's no additional checks there. And when you download the model, maybe it's worth saying, you're basically downloading a large list of numbers and a mathematical graph that says how to combine the numbers together. It says, take this parameter. First, here's how you map the words to numbers. You get that mapping. And then once you've mapped them to numbers, it says, add it here, multiply it times this other number here, then, you know, take the square root of this other number and then multiply it again, and it's just a list of calculations. It's really, like, a simple program.

Ishaan Anand [00:29:05]:
In fact, most of the knowledge, it's worth stating, is not in the code. And this gets back to your question about, like, open source. The knowledge is inside the data. It's inside the parameters. So as an example, GPT-2, which, you know, was considered at one point too dangerous to release. And it's just amazing yeah.

AJ O'Neil [00:29:25]:
That's because they want regulatory capture, not because they actually believe it's dangerous.

Ishaan Anand [00:29:31]:
Well, at the time, maybe they were concerned about disinformation. But suffice to say, it was still a powerful model in its day is my point. It's only 500 lines of code. Like, if you take out the TensorFlow library, it's 500 lines of code. It is astonishingly small. And the reason why I reimplemented the entire thing in JavaScript is I wanna push back against this idea that, well, this stuff is too hard for you to learn. If you're a web developer, like, you can learn 500 lines of code. And that's basically it: I give you the grounding, and I reimplement the entire thing in JavaScript so you can step through it.

Ishaan Anand [00:30:14]:
You don't even have to leave your web browser. Right? You just use the web debugger, and you can step through what's happening. And it's astonishingly small. All the knowledge, all the rules, is captured in the weights and the parameters of the model. So when you download the model, it's just more and more numbers with a larger and larger computational graph. That's how we get it smarter. And that gets back to the core thing to understand: we took a word problem, and we mapped it to a number problem. So if we get a bigger calculator, we get a better result.

Ishaan Anand [00:30:44]:
Yeah. But

Charles Max Wood [00:30:44]:
it's not like Yeah. I wanna restate this in another way really simply, because I think a lot of people get confused between, like, GPT-4 versus ChatGPT versus something on your computer versus whatever. Right? And so, yeah, essentially, the model, like you said, is maybe a few steps in how it gives you answers. And the rest of it, like you said, is all the data. It's all the weighting. But sometimes when people are talking about AI models, they're talking about a program that accesses the model that you just explained. Right? Yes. With the numbers and kind of the fundamental pieces.

Charles Max Wood [00:31:26]:
And so that's your ChatGPT. Whether it's running on your local machine or in the cloud, yeah, you need to be able to differentiate between the two and recognize that sometimes you're just downloading that map of numbers and, you know, some really, really simple stuff that makes sense of the numbers, and that's your model. And so when people are building against those models, a lot of times that's what they're doing. And so you can write your own code that then, you know, is the gatekeeper, or, you know, says this is helpful or this is harmful or this is whatever. Right? This is an appropriate response and this isn't. A lot of that's just the code that sits on top of what you're talking about, that 500 lines of code plus the data that we're getting that is just the model. And so, you know, AJ is talking about the Qwen model. It looks like it runs on Ollama.

Charles Max Wood [00:32:22]:
Right? So, you know, you've got all those magic numbers. You've got the stuff that runs on top of it. I think Ollama gives you a little bit more on top of that. And then from there, right, the rest of it's kind of up to whoever wrote the code.

Ishaan Anand [00:32:36]:
Yeah. The central thing I'm trying to get at is, like, the black box part, the most mysterious part at the heart of it.

AJ O'Neil [00:32:42]:
Yeah.

Ishaan Anand [00:32:43]:
And, obviously, like, you know, the calculations that, say, Llama and ChatGPT and Gemini are doing, they're larger models. GPT-2 came out in 2019. But the core thing is, like, if you had shown me GPT-2 when it first came out, I wouldn't have known. It's like, wow, that's a pretty amazing program. Must be really complex. And it's not the program that's complex. They just bet on taking a somewhat simple architecture and giving it lots and lots of data and spending more than anybody else had at the time, and just trusted that the black box would be smart enough to learn everything from it.

Ishaan Anand [00:33:20]:
And at the heart of it, that's what's happening. It's a giant calculator. In fact, it's so simple, in a sense, that it fits in a spreadsheet, which was my first implementation, and in a spreadsheet you cannot do loops very easily. You don't have looping constructs. It just does the calculation the entire way through, the computation of all the different cells. There are effectively, in a sense, no loops inside of it. Like, the reason I can implement it in a spreadsheet is that every single time it predicts a token, it does the exact same number of computations. And it goes through, you know, 12 layers and 12 attention steps and 12 like, it's very, very much like, I got a number coming in.

Ishaan Anand [00:34:02]:
A word comes in. We map that word to a number, and then I just do all this number crunching with a very predictable pattern. And then I get a number out, and I interpret it. And then I just repeat that process over and over again. And so, you know, the thing that I try to tell people is, when you look online and people say, I wanna get into AI, they're often presented with, okay, go learn all this linear algebra. You need to make sure you're solid on your calculus. It's, like, six to eighteen months of prep before you get to understanding how a large language model works.
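That fixed, loop-free, repeat-per-token process can be sketched like this. Everything here is a toy: the lookup table stands in for the real number crunching, and only the shape of the process mirrors GPT-2 small, which really does run 12 identically shaped layers for every token:

```javascript
// Toy sketch of the autoregressive loop: the exact same fixed pipeline of
// steps runs for every token, and generating text is just repeating it.
// The NEXT table is a hypothetical stand-in for the real arithmetic.
const NEXT = { "Mike": "is", "is": "quick.", "quick.": "He", "He": "moves", "moves": "quickly." };
const LAYERS = 12; // GPT-2 small runs 12 identical-shaped layers per token

function predictNext(token) {
  let layersRun = 0;
  for (let layer = 0; layer < LAYERS; layer++) {
    layersRun++; // stand-in for one layer's fixed block of number crunching
  }
  return { next: NEXT[token], layersRun }; // same amount of work every token
}

function generate(start, count) {
  const out = [start];
  for (let i = 0; i < count; i++) {
    const { next, layersRun } = predictNext(out[out.length - 1]);
    if (!next) break;
    console.assert(layersRun === LAYERS); // identical compute each step
    out.push(next);
  }
  return out.join(" ");
}
```

The key property is that `predictNext` does the same work no matter what the token is, which is exactly why the whole thing can be laid out as spreadsheet cells with no loops.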

Ishaan Anand [00:34:40]:
And that's valid if you're going to be a machine learning researcher. And machine learning is a huge, giant field beyond just ChatGPT. Right? There's anomaly detection. There's clustering. There's a lot of algorithms in there. But my goal is to just help people understand how these amazing, arguably Nobel Prize winning programs work in as short a time as possible. I don't even begin where a normal machine learning class begins. A normal machine learning class starts with, like, regression, and it slowly works your way up, and maybe you'll get to the LLMs.

Ishaan Anand [00:35:11]:
And I'm like, this is a 500 line program. Just start with, here's how it starts. And anytime you run into something you don't understand, in my class, I give you the background to understand it, and then we move on to the next piece. And so it's really designed to be as efficient as possible. And I think when you tell people it's 500 lines, they're like, oh, yeah. Okay. I can understand how that works. And this gets to knowing your tools.

Ishaan Anand [00:35:34]:
I'll make an analogy. You don't necessarily need to know how an AI model works to use it, just like you don't necessarily need a good model for the difference between the CPU, disk, system memory, and bandwidth. But if you're debugging a program, it helps to have that mental model. You'll run into an issue. Or maybe a more tangible example for this audience is, like, knowing how React works on the inside. At some point, if you don't understand hydration, you're gonna run into a wall. Right? And the same thing is true here. Like, you get these parameters from Ollama.

Ishaan Anand [00:36:12]:
What are they doing? You know, you need to have a mental model for how it works, and I don't think that mental model is as hard as a lot of people make it out to be.

Charles Max Wood [00:36:20]:
Yeah. When I talk to people about doing AI, and I talk to a whole bunch of people that are business people, and I talk to a whole bunch of people that are programmers, I have some of the same conversation, basically down to: well, are you going to build your own model? Right? Are you gonna take your own data and cram it in and expect it to give you answers on the other end? Or are you gonna use something that already exists, like the ChatGPTs or, you know, the GPT-4s or the Ollamas or whatever. Right? And then build on top of it. Because once you're building on top of it and you're not worried about, okay, how do I put this together, then it's essentially, okay, like you're saying, I understand how the machine works.

Ishaan Anand [00:37:02]:
Mhmm.

Charles Max Wood [00:37:03]:
And then I understand how to talk to it. Right? So I understand what the APIs are. And the rest of it is then, okay. What do I want from this, and how do I validate that I got it?

Ishaan Anand [00:37:13]:
Yeah. Actually, that's a really important point. The number one skill may not be understanding every single detail of the calculation. The number one skill when dealing with an AI model is that last thing you talked about: how do I evaluate it? The name that you hear in the AI community is evals. But as a, you know, web developer, you can think of these as tests. And the key difference between AI evals and tests is that you don't expect a 100% pass rate. These are statistical, probabilistic machines.

Ishaan Anand [00:37:46]:
But the number one like, when you read about benchmarks, you know, AJ, you talked about benchmarks. You basically need to build the benchmarks for your particular problem. The benchmark might say some model is better than another, but when you actually use it for your problem, you suddenly discover it's not as good. And so the first thing you should do is come up with your own benchmark, your own evals for the problem, and then try a bunch of models and see which one works the best. And then you can start iterating, whether that iterating is changing the prompt, changing the model, or saying I'm gonna go off and fine-tune my own model. But you won't be able to make a judgment until you're able to look across the distribution of your task, all the different ways your task happens, and whether it's successful or not, because you're dealing with highly variable machines. One of the folks who was in the audience at one of my past talks had a really good analogy. He's like, imagine a database that was wrong 5% of the time.

Ishaan Anand [00:38:44]:
Like, as developers, we are not used to having levels of uncertainty like this within our systems, unless maybe you're using distributed systems where there's all sorts of race conditions and stuff like that. We're used to sanitizing the user input, and then once we get the user input, everything's predictable after that. But here, it's like, suddenly we've got a database that sometimes is wrong. And so that's where we need to put all sorts of checks and guardrails, and you're dealing with this really smart but sometimes fallible thing, like a human, as much as I hate to anthropomorphize it. And so how you build a system around that is gonna be different than how you build a regular system. But it all starts with that key idea that you just talked about, which is being able to evaluate, mathematically, how well your model or your system is doing.
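The "evals are tests with a pass rate instead of a 100% bar" idea can be sketched as a tiny harness. Everything here is hypothetical: `askModel` stands in for a call to whatever model or API you're evaluating, and the cases and fake model exist only to show the shape:

```javascript
// Sketch of an eval harness: like unit tests, but you track a pass *rate*
// rather than demanding every case passes.
function runEvals(askModel, cases, threshold = 0.9) {
  let passed = 0;
  for (const { prompt, check } of cases) {
    const answer = askModel(prompt);
    if (check(answer)) passed++; // each case supplies its own grader
  }
  const passRate = passed / cases.length;
  return { passRate, ok: passRate >= threshold };
}

// Toy usage with a fake "model" that is wrong some of the time:
const cases = [
  { prompt: "2+2?", check: (a) => a.includes("4") },
  { prompt: "Capital of France?", check: (a) => a.includes("Paris") },
  { prompt: "5*3?", check: (a) => a.includes("15") },
];
const fakeModel = (p) =>
  p === "2+2?" ? "4" : p === "Capital of France?" ? "Paris" : "I think 16";
```

Swapping in a different model or prompt and re-running the same cases is exactly the iteration loop Ishaan describes.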

Ishaan Anand [00:39:30]:
And on the question about whether you should build your own model or not, the usual hierarchy of needs is: first, start with an off the shelf model. It could be open source. It could be one of the API providers. It's actually probably easiest to start with a hosted model and just see if you can get it to work, because they'll be state of the art, and you don't have to worry about all the stuff around hosting and inference. And then the next thing to try is tuning your prompts. So try prompt engineering your way.

Ishaan Anand [00:39:59]:
Give it some examples. Try some variety of prompt engineering, and then maybe consider fine-tuning. And, again, you can fine-tune most of the hosted models. You don't have to go to an open source model, but you could do that as well. And then the idea of building your own model is extremely hard. You know, the amount of dollars that goes into building your own model from scratch is now, you know, over a hundred million. The estimates for, say, Llama were, I think, over a hundred million to build that model. And so it's a lot of work, and that's probably best left to the frontier labs.

Ishaan Anand [00:40:35]:
Yeah. Go ahead.

AJ O'Neil [00:40:35]:
Is that GPU cost, or what where is that number coming from?

Ishaan Anand [00:40:40]:
That's a great question. Because these are all estimates, you know, we don't know for sure. But, obviously, some of it is the GPU cost. Some of it is the infrastructure cost. Some of it's the talent. The other key thing to keep in mind is when you're training a model, you don't always know how it's gonna turn out. So what they actually do is a large series of smaller runs to establish some type of pattern or scaling law to figure out how they're going to design the model: which architectures seem to work better, which parameters matter more. So there's something called the learning rate, for example, that they have to adjust, and they have to have a schedule for it.

Ishaan Anand [00:41:17]:
And they're trying to figure out, against evals, against the benchmarks we talked about, which one seems to make the model smarter. And so there's a lot of trials and attempts. It's not just necessarily one whole shot of training. It's a lot of experiments that they have to do. A lot of how the model is going to behave is surprisingly empirical, so they're doing experiments and trying again. So there's a variety of costs, and power is nontrivial. Another thing that's important to understand is the level of scale of data that these frontier labs are dealing with. And there's a really good analogy from the Anthropic guys, actually.

Ishaan Anand [00:41:57]:
One of the things you have to do is randomize the data so it doesn't learn arbitrary patterns in the order of the data you gave it. And so one of their research engineers gave this great example. He's like, okay, randomizing sounds like it should be easy. Like, take a deck of cards. If I tell you to shuffle it, it's fairly easy. But imagine I gave you seven warehouses full of decks of cards, and you need to shuffle them by hand. It's not quite clear what policy or process you're going to use to make sure you hit all of them and you've evenly shuffled them.

Ishaan Anand [00:42:28]:
And the size of the data that these guys are working with is almost like that to the CPU. It's like seven warehouses of data to it, compared to you manually shuffling your deck. And so when you're dealing with data at this large infrastructure scale, scale itself makes every little thing harder. And so that also adds some difficulty to this. So, should I walk through just a little more detail of what's happening in that mathematical calculation, or happy to answer additional questions?
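For contrast, here's what shuffling looks like when the "deck" does fit in memory: a standard Fisher-Yates shuffle. Even this small version takes some care to be unbiased; at warehouse scale, where the data can't all be loaded at once, the same guarantee is what gets hard:

```javascript
// In-place Fisher-Yates shuffle: the unbiased way to shuffle one "deck"
// that fits in memory. The optional `random` parameter is here only so the
// sketch can be tested deterministically.
function shuffle(array, random = Math.random) {
  for (let i = array.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1)); // pick from the not-yet-fixed prefix
    [array[i], array[j]] = [array[j], array[i]]; // swap into place
  }
  return array;
}
```

The naive alternatives (e.g. sorting by a random key, or picking `j` from the whole array every time) are subtly biased, which hints at why "just shuffle seven warehouses of data" is not a trivial instruction.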

Charles Max Wood [00:42:57]:
Yeah. That's what I was gonna ask next. Because you've mentioned you've got different layers, or different steps, in the process. You explained it in the video the video is a little longer, I guess, than we have time for at this point. But, yeah, if you can give people kind of an overview of how the LLM system actually works.

Ishaan Anand [00:43:15]:
Yeah. So

AJ O'Neil [00:43:17]:
While you're doing that, can you distinguish between the different types of training, like the RAG versus the fine-tuning versus the base?

Ishaan Anand [00:43:28]:
Yeah. Okay. I don't think of RAG as training, but maybe we should step back and explain what RAG is for the audience who isn't familiar with it. You can think of RAG as a sort of prompt engineering technique. You want the model to answer questions about something it wasn't trained on. So imagine I'm a smart home electronics company, and all of the documentation about my product was behind some firewall or behind a login. And so I know that, let's call it ChatGPT, was never trained on it. But I wanna build a chatbot where customers come and say, hey.

Ishaan Anand [00:44:06]:
I can't configure this setting. How am I gonna get a chatbot to do that without having to retrain it specifically on my data? So what you can do is, when a request comes in, somebody's like, well, how do I change the color on my smart light bulb? I take that request from my user on my chatbot, and I see the words light bulb, I see change color, and I'll search all my documentation. And I'm not gonna search it with just a plain text search. I'll use what's called a semantic search. So it'll find things that are similar to the word light, like the word bright, even though they're nowhere close in characters. So it'll find all the similar passages, and it will pull those out.

Ishaan Anand [00:44:49]:
And then it will give the model: here are relevant passages; here's the user's question, how do I change the color on my smart light bulb? It'll give it paragraphs and chunks of text from my documentation, and it will put those at the beginning of the prompt. So you've got a prompt that's structured with the user's question and some chunks of data that came right from my documentation. And then we tell the model, come up with an answer to the user's question using these chunks of data I gave you. And it will be able to think over those passages, find the ones that are relevant, and then give the answer out. And that's called retrieval augmented generation. Retrieval because you're taking the user's query, you're pulling in data the model didn't have during training, and you're passing it into the prompt and then asking the model to answer it.

Ishaan Anand [00:45:35]:
And so it's a very low friction way to take a model off the shelf and make it understand all your stuff, even though it wasn't in the training data. So that's what RAG is.
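The retrieve-then-prompt flow just described can be sketched in JavaScript. Note the big simplification: the similarity score here is a toy word-overlap count, whereas a real RAG system compares embedding vectors so that "light" also matches "bright". The documentation snippets are made up:

```javascript
// Sketch of the RAG flow: retrieve the most relevant doc chunks for a query,
// then prepend them to the prompt we send the model.
const docs = [
  "To change the bulb color, open the app and tap Color Settings.",
  "The hub requires a 2.4GHz Wi-Fi network.",
  "Factory reset: hold the pairing button for ten seconds.",
];

// Toy "semantic" score: count shared words. A real system would use
// embedding similarity instead of literal word overlap.
function score(query, doc) {
  const q = new Set(query.toLowerCase().split(/\W+/));
  return doc.toLowerCase().split(/\W+/).filter((w) => q.has(w)).length;
}

function buildRagPrompt(query, docs, k = 2) {
  const top = [...docs]
    .sort((a, b) => score(query, b) - score(query, a)) // best chunks first
    .slice(0, k);                                      // keep the top k
  return "Answer the user's question using these passages:\n" +
    top.map((d, i) => `[${i + 1}] ${d}`).join("\n") +
    `\n\nQuestion: ${query}`;
}
```

The model never gets retrained; it just receives the relevant chunks inside the prompt, which is the "low friction" property Ishaan highlights.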

Charles Max Wood [00:45:48]:
The short version is is you're building context out of a database that you already have.

Ishaan Anand [00:45:54]:
Great summary. Thank you. So, yeah, you're giving it the context it didn't have during training to answer the question. On training, there's a variety of steps in the model where it's trained. Let me think of the best way to explain this. I'll discuss training when I get to the, call it, fourth step of the model, and I'll talk about how that gets trained in a second. But let me walk through the five steps of what happens when you input text into the model. So the passage I like to use is "Mike is quick. He moves", and for the completion, we leave it to the model to fill in the blank: "quickly".

Ishaan Anand [00:46:35]:
The first thing it's gonna do is break the text into subword units. You might think it would break it into characters, and you might think it might break it into words. Breaking it into characters would be like ASCII, and breaking it into words would be giving every word its own entry in a dictionary. It turns out if you break it into words, you can't handle unknown words very well, and you can't handle spellings you weren't planning on, especially if you're going across multiple languages. And if you break it into characters, it turns out it's really hard and takes a lot of compute for the model to learn purely from characters, although some research has been able to do it. So what they do is a Goldilocks, and they say, okay, let's break it into these little pieces of words. And if you think about it, as a human, you actually do this.

Ishaan Anand [00:47:20]:
So one of the examples I like to use is the word flavorize. It's actually not a word in the dictionary, but you know what it means because you know what flavor means, and you know what -ize as a suffix means. And so the model kind of does that. Now I wanna be clear: the tokens it comes up with, as they're called, these subword pieces, don't map to any human sense of the meaning. There are some, like -ize, that turn out to be tokens, but it's by coincidence or correlation, not because it's trying to understand human English at this stage. So it breaks it into these yeah. Go ahead.

Charles Max Wood [00:47:49]:
Can I just say that in a different way too? Yeah. Effectively, what it does is it breaks it up into pieces that have meaning. Right? Because when we're looking for the output, we're looking for output that has meaning. We group words or ideas together that give it meaning. And so it's doing the same thing. It's breaking it up. Right? Like you said, flavor has a meaning. -ize has a meaning.

Charles Max Wood [00:48:11]:
You know? The other words in there have meaning. And so that's the approach that it kinda takes when it's breaking it into tokens. Sorta, kinda.

Ishaan Anand [00:48:20]:
Sorta, kind of. It's a decent mental model. But the reason I stress that it's not trying to match human meaning is because at this stage, the model is not trying to assign meaning. In fact, what it's really trying to do is take all the text on the Internet that it's training on and compress it to the most efficient representation, so that training can be as efficient as possible. And that's why the tokens don't always map to what you'd expect. And why this is yeah. Go ahead.

AJ O'Neil [00:48:54]:
So this is different than what a full text search database would do? Because a full text search database, like the example you gave, flavorize

Charles Max Wood [00:49:00]:
Yeah.

AJ O'Neil [00:49:00]:
Full text search database is gonna break it up that same way. But this is different than the way a full text search database would break it up?

Ishaan Anand [00:49:07]:
Yes. It is different. And it's very dependent on the data it was trained on. A good example is the word reinjury. Right? As a human, you'd think it was re and injury. But if you actually put it through the GPT-2 tokenizer, it puts it as rein and jury.

Steve Edwards [00:49:27]:
I saw it.

Ishaan Anand [00:49:28]:
And the reason it decided to do that is simply because of the greater occurrence of the word jury by itself than injury. And so it decided that was the more efficient representation. And I wanna be clear: this isn't about representing your prompt efficiently. This is about efficiently representing all the text it's going to train on. The stuff you don't see, the stuff that, you know, you talked about nobody releases. That's what it's really based on. It's really a compression of all the text. So think about it this way.

Ishaan Anand [00:50:00]:
If it's gonna compress all the text, and it gets down to, say, 10,000 or 50,000 tokens, then it only has to learn 10,000 or 50,000 concepts, in a sense. It's a gross oversimplification, but that's what it's trying to do. It's trying to reduce the number of things it needs to learn on. Essentially, the number of combinations and variations.

Steve Edwards [00:50:19]:
Hey, couple quick questions. Go ahead. With that flavorize example, here's hoping they don't pick up Flavor Flav, the rapper's lyrics. Right? And throw that in there. That could get really confusing. But then when you're talking about, like, the reinjury example

Ishaan Anand [00:50:33]:
Yeah.

Steve Edwards [00:50:33]:
And as soon as you said that, I'm thinking, okay, I can see where that's going, how you get rein. And this might be getting into the weeds too much, but does throwing stuff like hyphens into words make a difference? So if you were to do re dash injury

AJ O'Neil [00:50:50]:
Yeah.

Steve Edwards [00:50:50]:
Would it see that and maybe just categorize the re as separate from injury? Does that help, or is it a nonissue? Does it sort of filter that stuff out and just focus on the letters?

Ishaan Anand [00:50:59]:
That's a great question. It's actually implementation dependent on the tokenizer. In practice, you usually create hard boundaries between tokens or words. One of them is the space character. In most of these, the hyphen is also considered a boundary, and so it would see it separately. The important thing to understand, though, about the tokenizer is that the model doesn't see words the same way you do. So a great example of this is: how many letters are in the word strawberry? It does not see the word as s t r a w b e r r y. In fact, when you read it, you really don't either.

Ishaan Anand [00:51:41]:
When you read words, you typically aren't paying attention to every single character. When you have to count the letters in strawberry, you kind of have to change your mental state and think, oh, wait, let me think what the characters are, and you walk through it. It might see it as the token strawberry. It might see it as the tokens straw and berry. But the key thing is it doesn't have the ability to see the letters. And in fact, the tokenization is case sensitive. So if you change the capitalization, it looks like a different word.

Ishaan Anand [00:52:07]:
So to it, the word strawberry with, like, a space in front of it is different than strawberry without a space. Strawberry with a capital is different than strawberry without a capital. You know, if you put quotes around it, it's a different word. So it doesn't see text the same way you do. Another great example of this is numbers. They've fixed this in most modern tokenizers, but the early ones would just take examples of numbers, and those would be a whole token. So two fifty six. Right? Power of two, fairly common, gets a token, but it sees it as a single token.

Ishaan Anand [00:52:39]:
It sees it as a single thing. It doesn't see it as the digits two, five, and six. It doesn't break it apart. And that was part of the contributing reason why it was hard for these things to do math. It's not the sole reason. So the key lesson on the tokenizer before we leave it: the algorithm that's commonly used is something called byte pair encoding. And in my class, this is something we walk through. In fact, we do it by hand so you can understand it, and we talk through the training process.
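The core move of byte pair encoding, repeatedly fusing the most frequent adjacent pair of symbols into a new token, can be sketched in a few lines. This is a toy showing one merge step only; real GPT-2 BPE additionally has learned merge ranks, byte-level handling, and word boundaries:

```javascript
// Toy sketch of one byte-pair-encoding merge step: find the most frequent
// adjacent pair of tokens, then fuse every occurrence into a single token.
function mostFrequentPair(tokens) {
  const counts = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    const key = tokens[i] + "\u0000" + tokens[i + 1]; // pair as a map key
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let best = null, bestCount = 0;
  for (const [key, c] of counts) {
    if (c > bestCount) { best = key; bestCount = c; }
  }
  return best ? best.split("\u0000") : null;
}

function mergeOnce(tokens) {
  const pair = mostFrequentPair(tokens);
  if (!pair) return tokens;
  const out = [];
  for (let i = 0; i < tokens.length; i++) {
    if (i < tokens.length - 1 && tokens[i] === pair[0] && tokens[i + 1] === pair[1]) {
      out.push(tokens[i] + tokens[i + 1]); // fuse the pair into one token
      i++;                                  // skip the second half of the pair
    } else {
      out.push(tokens[i]);
    }
  }
  return out;
}
```

Training a real tokenizer repeats this merge thousands of times over the whole corpus, which is how frequent fragments like "rein" end up as single tokens.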

Ishaan Anand [00:53:03]:
But the key thing to understand is that the model doesn't always see text the same way you do. So that's the first step. That's the tokenizer. Then the next thing is we map each of these tokens, though you can think of them as words, into a list of numbers. I talked earlier like we map each word to a number, but really we map it to a large list of numbers, and this is called an embedding. And the way to think about this is we're taking all the words, or in this case tokens, technically, and we're putting them on a map. But instead of a two dimensional map, this is many, many dimensions. In the case of GPT-2, it's 768.

Ishaan Anand [00:53:41]:
I think for Llama 405B, it's, like, a 16,000-number list for every single word. In fact, in the phrase "Mike is quick. He moves", the period itself gets 768 numbers to represent it. And you can think about it like coordinates on a map. This is just a very, very multidimensional list of coordinates. And a good embedding puts words that are related to each other closer to each other. So in my class, I use words like happy, sad, joyful, glad, dog, cat, rabbit. The first set of those are emotions: happy, sad, joyful. Right? And we'd expect happy and joyful to be close to each other, same with glad.

Ishaan Anand [00:54:23]:
And then dog, cat, and rabbit are totally unrelated to emotions, so we'd expect them to be further apart on the map. And the word sad is an emotion, but it's not quite the same emotion as being happy, so it'd be somewhere in between. And if you actually visualize this, you see that this happens. It's actually putting related words closer together. You might hear of this paper, or really a series of papers and algorithms, called word2vec, which pioneered this model. And if you go to projector.tensorflow.org, you can actually see a 3D map of various words, and you click on one, and it will show you the words that are close to it. And they all tend to be related words. So the next step after we break the text into tokens is we map each of those tokens to a position on a map where close words are related to each other.
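The "related words sit close on the map" idea can be illustrated with cosine similarity over some tiny vectors. Important caveat: these 3-number embeddings are hand-made for the illustration, not learned; real embeddings have hundreds to thousands of dimensions and come out of training:

```javascript
// Toy illustration of the embedding "map": hand-placed (not learned!)
// 3-number vectors where related words were deliberately put near each other.
const embeddings = {
  happy:  [0.9, 0.8, 0.1],
  joyful: [0.85, 0.75, 0.15],
  sad:    [0.2, 0.9, 0.1],
  dog:    [0.1, 0.1, 0.9],
};

// Cosine similarity: close to 1.0 means the two points "point the same way".
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; // accumulate the dot product
    na += a[i] * a[i];  // and each vector's squared length
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

With these vectors, `cosine(happy, joyful)` comes out higher than `cosine(happy, dog)`, which is the geometric version of "happy and joyful are neighbors on the map, dog is far away."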

Ishaan Anand [00:55:10]:
Let me pause and see if that made any sense. I'm usually doing this all visually. So Right. Over pure audio, it's a bit of a challenge. But, yeah. Go ahead.

Charles Max Wood [00:55:19]:
So my question is is since it's predicting the next word

Ishaan Anand [00:55:23]:
Mhmm.

Charles Max Wood [00:55:23]:
I would imagine that, yeah, some some of the words that appear close to it are going to be words that mean kind of the same thing or, you know, have a related meaning. But does it also group words together that commonly appear together, or is that a different does it not weight things that way at all?

Ishaan Anand [00:55:41]:
It's actually kind of doing both. It's grouping words together that have the same meaning based on the idea that they appear in the same places. Okay. So the words ice and cold commonly occur together, probably, in text on the Internet. Mhmm. Right? Like, I put ice in the drink to make it cold. You're as cold as ice.

Ishaan Anand [00:56:09]:
Right? Those would be common phrases. You usually don't see, you know, steam and cold together. And so the model is able to understand that ice is colder than steam, because it sees cold closer to ice more often than it sees cold close to steam. Okay. It's the relative occurrence of how often. And there's a phrase that's often used, by J. R. Firth: you shall know a word by the company it keeps. Which is the idea that you don't really know what a word means you could look it up in the dictionary, but you really understand it through how it's used by multiple people.

Ishaan Anand [00:56:45]:
Mhmm.

AJ O'Neil [00:56:45]:
And

Ishaan Anand [00:56:45]:
that you can look at, you know, the distribution of how it's used to really understand what it means. So a good example is the word bad. Right? Although it's less in fashion now, bad at one time meant good. Right? And so how do you really understand whether it means good in one context versus another? You learn that through all the various contexts it's used in. If you wanna understand how a word is really used, you see it in usage many, many times. So if you want a model to understand what a word means, it just sees it used in many, many, many sentences, and, eventually, it picks up on those differences.
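The distributional idea, "you shall know a word by the company it keeps", can be sketched as a co-occurrence count over a tiny made-up corpus. Real systems like word2vec learn vectors from statistics like these rather than using raw counts directly:

```javascript
// Toy sketch of the distributional idea: count which words appear near each
// other in a tiny "corpus", so "ice" ends up associated with "cold" while
// "steam" does not.
const corpus = [
  "i put ice in the drink to make it cold",
  "you are as cold as ice",
  "the steam from the kettle was hot",
];

function cooccurrence(sentences, window = 4) {
  const counts = {};
  for (const s of sentences) {
    const words = s.split(" ");
    words.forEach((w, i) => {
      counts[w] = counts[w] || {};
      // Count every neighbor within `window` positions of word i.
      const lo = Math.max(0, i - window);
      const hi = Math.min(words.length, i + window + 1);
      for (let j = lo; j < hi; j++) {
        if (j !== i) counts[w][words[j]] = (counts[w][words[j]] || 0) + 1;
      }
    });
  }
  return counts;
}
```

On this corpus, "ice" co-occurs with "cold" while "steam" never does, which is exactly the signal Ishaan describes the model exploiting at vastly larger scale.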

Steve Edwards [00:57:17]:
So then the word baby could be seen as cold because of the vanilla ice then. Right? Ice ice baby?

Ishaan Anand [00:57:24]:
If you train it, I'd be really interested to see a model trained only on song lyrics. That would

Charles Max Wood [00:57:29]:
be awesome. Fascinating. Yeah. Yeah. It's interesting because of the way you're talking about it. We were driving home from my mom's house the other night, and my wife put on an audiobook that she's been listening to with my nine year old, and they used the word satisfaction. And my daughter asked, what does satisfaction mean? And we basically did that. It's kinda like this, and it's kinda like that. Right? It's in this area of meaning.

Charles Max Wood [00:57:53]:
Right? Yeah. Yep. And it's related to these other words. Right? We used other words to explain it. And then, yeah, we did. We gave her context. So you could use it like this, or you could use it like this, and, you know, another form of the word is satisfy or satisfied. And so, you know, this is what it means to satisfy something and, you know, more context and more sentences.

Charles Max Wood [00:58:12]:
And, okay, I understand. Right? And I think she may have even said, so it's kinda like this and kinda like that Yep. Using examples that we didn't use.

Steve Edwards [00:58:21]:
And you told her that Nick couldn't get any. Right?

Charles Max Wood [00:58:24]:
That's right.

Ishaan Anand [00:58:26]:
But, yeah, that's basically what the model is going through. It's basically seeing all these examples, and it's like, oh, it's kinda like this, but, in some contexts, I see it being used in this other way.

Charles Max Wood [00:58:38]:
Yep.

Ishaan Anand [00:58:39]:
And so it's basically putting that all together, and then it's putting all the words on this map. And it's saying, okay. You know, the ones that are related are here. And it's this multidimensional map. It's, you know, hundreds to thousands of dimensions long. And that's the embedding step: we basically mapped them to, I say a number, but it's really a point in space. Right? Mhmm. So do you see where we're at with the second step? The first step was we took the passage and we broke it into tokens, which you could think of as like words, but smaller. And then we took each of those tokens and we put them at a point on a map, a point in space, and we know that point is gonna be close to other things that are related to it.
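
As a rough sketch of that map, here's a toy set of hand-picked 3-dimensional "embeddings" and a cosine-similarity measure showing that related words sit close together. The vectors are invented for illustration; real embeddings are learned during training and have hundreds to thousands of dimensions.

```javascript
// Each token maps to a point in space; related tokens sit close together.
// These 3-d vectors are made up; real models learn much larger ones.
const embeddings = {
  ice:   [0.9, 0.1, 0.0],
  cold:  [0.8, 0.2, 0.1],
  steam: [0.1, 0.9, 0.2],
  hot:   [0.2, 0.8, 0.3],
};

// Cosine similarity: 1 means pointing the same way, 0 means unrelated.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const iceCold = cosine(embeddings.ice, embeddings.cold);
const iceSteam = cosine(embeddings.ice, embeddings.steam);
console.log(iceCold > iceSteam); // true: "ice" sits nearer "cold" than "steam"
```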

Ishaan Anand [00:59:23]:
So that's the second step. And then the third step is called attention. I'm gonna skip over that for a second, and I'm gonna go to the fourth step, which is the neural network or the multilayer perceptron. And this gets to the training question. The key thing that's really great about neural networks is that thing I talked about earlier, which is you don't have to give them the rules. You just give them the answers, and they figure out the rules. So we basically feed in these points in space from our prompt. And then we can take a passage on the Internet. Like, maybe the passage on the Internet is Mike is quick.

Ishaan Anand [01:00:02]:
He moves quickly. We remove the word quickly, and then we give the model the phrase Mike is quick. He moves. And then we ask it to make a prediction. And it's gonna get it wrong because it hasn't done any training at all. And when it gets it wrong, maybe it says, you know, Mike is quick. He moves bicycle. And you're like, no.

Ishaan Anand [01:00:21]:
That's wrong. The right answer is quickly. It mathematically learns how to change itself to get closer to that answer. So we go through a lot of iterations of doing lots of passages where we take off the last word, and then we ask it to predict it. And if it's good, we say, okay. Great. You're fine. And if it's wrong, we say, okay.

Ishaan Anand [01:00:39]:
You're off by this amount. It's kinda like when you throw darts at a board. Right? If you're far from the bull's eye, you'll move a lot more to correct your position. But if you're close but slightly off, you're gonna adjust subtly. So that's what it does. It changes the model parameters, the numbers inside the model, slightly if it's close or a lot if it's far away. It does this, you know, trillions of times over lots of different pieces of data.

Ishaan Anand [01:01:05]:
And the key thing about the neural network is it can learn to imitate, basically, from answers and data. And so we basically give it the known passage that we gave, like Mike is quick, he moves, and we knew quickly was the right answer. And then we basically ask the neural network to make a prediction from these points in space. And so that's the basic version of what's happening inside the training. Let me pause because I jumped to the fourth layer. I'll come back to the third one in a second. But let me see if there are any questions there on what's happening inside the neural network. Okay.
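
The dart-board correction can be sketched with a single number: move a lot when the error is big, only a little when it's small. This is just the flavor of gradient-descent-style training; real models adjust billions of parameters from a prediction error rather than nudging one value toward a known target.

```javascript
// Minimal sketch of proportional correction: each step moves the
// parameter toward the target by a fraction of the remaining error.
function train(param, target, learningRate = 0.1, steps = 100) {
  for (let i = 0; i < steps; i++) {
    const error = target - param; // far away => big error => big step
    param += learningRate * error; // close => small error => tiny step
  }
  return param;
}

const learned = train(0, 5); // start far from the target of 5
console.log(learned); // ≈ 5 after 100 steps
```

The first steps are large (the dart landed far from the bull's eye); the later ones are subtle, exactly as described above.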

Charles Max Wood [01:01:43]:
No.

Ishaan Anand [01:01:44]:
So Yeah. Okay. So if we're good there, then the next thing that happens is, I'll jump back to the third step. Like, we could give it a point in space and say, hey. Guess what the next word is. But the best thing to do is to not just give it a single point in space. It's to give it all the points that came before. So all the words that came before.

Ishaan Anand [01:02:01]:
So in the case of Mike is quick, he moves, knowing that, you know, we're talking about movement helps it know that the word quick here is about moving around physical space versus the quick of your fingernail. And so we give it the hints of all the words that came before it. So this is what's called attention, where we say, okay. Don't just predict from the one single word you're looking at. This gets back to what we talked about at the beginning. Like, instead of looking at, you know, statistically, what's the next word after the given word, let me look two words back. Let me look three words back. Let me look four words back.

Ishaan Anand [01:02:31]:
It will look at all the words that came before it and try to figure out what the next predicted word is. We're giving it these hints from the entire passage to make its prediction, and that's what's called attention. And so that's the third step in the middle. And then the last step is, we get a prediction out of the neural network. So jumping back to the fourth step, which is the neural network that makes the prediction, it gives a number. And that number is really a long list of numbers, and we need to map that back to one of our tokens, one of our words. It's sitting at a point in space, and it may land right on the word quickly, but more than likely, the predicted token the model comes up with is gonna land somewhere close to the word quickly. And so it will interpret that point in space it gave us back through those embeddings in that map.
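
A heavily stripped-down sketch of those "hints from earlier words": score the current word's vector against each earlier word's vector, turn the scores into weights with a softmax, and blend the earlier words accordingly. Real attention also applies learned query/key/value projections and many heads; the 2-d vectors here are invented for illustration.

```javascript
// Turn raw scores into weights that sum to 1.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Compare the query against every earlier token's key, then take a
// weighted average of the value vectors: every earlier word contributes,
// but similar words contribute more.
function attend(query, keys, values) {
  const scores = keys.map(k => k.reduce((s, x, i) => s + x * query[i], 0));
  const weights = softmax(scores);
  return values[0].map((_, d) =>
    weights.reduce((s, w, t) => s + w * values[t][d], 0)
  );
}

const keys = [[1, 0], [0, 1], [1, 1]]; // vectors for the earlier tokens
const values = keys;                   // reuse the same vectors for simplicity
const mixed = attend([1, 0], keys, values);
console.log(mixed); // leans toward the tokens most similar to the query
```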

Ishaan Anand [01:03:21]:
So at the end of the number crunching, the model took all the words and the points in space we gave it and said, the predicted word is right here at this point in space. And then we interpret that, and we look at what are the words or tokens that are close to that predicted point in space. And it's probably gonna be closer to the word fast, closer to the word around. Like, Mike moves quickly. He moves around. He moves fast. He moves speedily. Those words are gonna be close to it, and so we give them a higher probability.

Ishaan Anand [01:03:51]:
And we run a random number generator, and we say, okay. Let me pick one according to this probability distribution, and that's how we get the predicted word out. And that last step of running that random number generator and looking at which words are closer is the piece that's called the language head. So the key thing about the language head is that is where most of the uncertainty or unpredictability of your model comes from. So if we decide not to run the random number generator, and we just always pick the word that is closest in space to the prediction, that's what's called temperature zero, and it will always be consistent. It will always be predictable for the most part. There are some other very small sources of randomness in the process. But for the most part, it'll be very consistent, and that's what's called temperature zero.

Ishaan Anand [01:04:36]:
So most of the randomness inside the model is entirely, in some sense, imposed by us. We decided, oh, we're not just going to always take the thing that's closest. We're gonna probabilistically take some of the other ones that are also close. And we can control those parameters and control how we do that probability. So if you're in Ollama or, you know, an API, you'll see things like top p and top k or temperature. And these are tools we are given as, you know, the API user of a model, on how we can shape the probability distribution of the model. And that's probably the most important of the components in the model to understand. After you understand what tokens are and embeddings, the next one is probably the language head because that's where the randomness comes from. So let me pause.

Ishaan Anand [01:05:23]:
I know I just talked for quite a while. See if there are any questions.

Charles Max Wood [01:05:30]:
I think so far so good.

Ishaan Anand [01:05:32]:
Okay. So what the Excel spreadsheet does, or the website I have that's built in, you know, web components and pure JavaScript, is it runs the entire process using the very same weights that OpenAI released for a model called GPT-2, GPT-2 small, and it steps through every single one of those processes. You enter a prompt, and then what it does is, it's not like ChatGPT where you can have a conversation with it. It just predicts the next word, but it walks you through every single step. That's the same thing I do inside the class. But that's basically, you know, how your model works under the hood. It's basically taking your words, your input prompt, breaking it into units that are called tokens that are slightly smaller than a word. Then it maps it to points in space, does a bunch of number crunching on it through the things I talked about, using a neural network and this attention that looks at all the other words. And then it takes that prediction, and it says, okay. What words is it close to in our points in space? And then let me pick one that's relatively close to that.

Ishaan Anand [01:06:35]:
So I know one of the things, Chuck, you'd wanna talk about is, building it and the use of web components in the web version. Yeah.

Charles Max Wood [01:06:43]:
At this point Yeah. Given our time constraints

Ishaan Anand [01:06:46]:
Yeah.

Charles Max Wood [01:06:46]:
We might have you come back and do that because I think it'd be interesting to dive into the project and how it went together.

Ishaan Anand [01:06:51]:
Okay. I will just say the reason I built it in web components was to make it as portable, easy to use, and easy to step through as possible. I wanted to make it as accessible as possible. I did think about, like, say, using React, but then you need to know React. And I really wanted this to be approachable for somebody who knows just vanilla JavaScript, and web components were the easiest way to do that. So that's the main reason why I did it that way. Yeah.

Charles Max Wood [01:07:19]:
So is it open source then? Or

Ishaan Anand [01:07:22]:
I actually haven't even put a license on it. And what I've said is, if people feel like it, help me decide. Like, tell me which licenses you prefer. But the code is right there for people to look at. I mean, you can practically step through it, and it's written so that people can understand it. I just haven't figured out what license, but, you know, let me know, and I'm all ears.

Steve Edwards [01:07:43]:
Alright.

Ishaan Anand [01:07:43]:
The goal is to make this a teaching tool.

Charles Max Wood [01:07:45]:
Yep. Alright. Cool. Well, yeah, like I said, we're kind of at the, end of our time.

Ishaan Anand [01:07:52]:
Mhmm.

Charles Max Wood [01:07:53]:
And so I'm gonna push this into picks, but you wanna just give out information on your course again, let people know what that coupon code was? If people are digging this as much as I am, I think they may wanna just go pick up the course and go, oh, okay. We can go into more depth.

Ishaan Anand [01:08:10]:
Yeah. Thank you. So the website for the project is called spreadsheets are all you need, with hyphens in between. So spreadsheets-are-all-you-need.ai. It is a very long domain name. And that will link to where you can download the Excel file. You can try this out in the browser yourself. And then there's a link to my class that I teach on the Maven platform.

Ishaan Anand [01:08:35]:
It basically has five lectures over two weeks, and we walk through this for anybody who understands spreadsheets or vanilla JavaScript. And I have a promo code, JSJabber. So just use the promo code during checkout, and you get 20% off. The course is taught live, but it's also available on demand. So my last cohort just wrapped up earlier this month. But if you sign up, you'll get to watch all the recordings. I'll answer all the questions you have over email. You'll be in the same private Discord as the rest of the cohort.

Ishaan Anand [01:09:06]:
And if for some reason you're watching on demand and you say, I'd rather have the live version, I offer that if you wanna attend a future live version, you can do that for free, even if you signed up for the on demand. So, you know, feel free to check it out. It's on Maven. It's got some long URL, but if you go to spreadsheets-are-all-you-need.ai, you can check it out. And then to find me, I'm on Twitter, ianand, so my first initial with my last name, and, of course, on LinkedIn. I'm also on Bluesky as well, if people wanna reach me. Happy to answer questions.

Charles Max Wood [01:09:39]:
Awesome. Well, yeah, I definitely wanna dig into it. I'm probably gonna go watch your video on YouTube a few more times, just, you know, getting all those little pieces in my head. Yeah. I think you said at the beginning that the model that matters most is your mental model. Yes. And so, yeah, just knowing how to think about, okay, I'm dropping this in. Right? This is how it goes through the Plinko machine to give me the output on the other end. That's the thing that really helps me out.

Ishaan Anand [01:10:07]:
So Yeah. That's a great analogy. And one thing that may be worth highlighting is I've had some feedback where people say, oh, you're using spreadsheets or using JavaScript. The real models are in Python, and you're using GPT-2, which is an older model. What I teach in the class are essentially the timeless technical fundamentals of how these models work. And it's worth remembering that all the major models you've heard of, you know, Claude, ChatGPT, Gemini, they all are inheriting from GPT-2. So if you understand GPT-2, you're 80% of the way to understanding any of the modern or Llama model architectures. So it's not, like, a toy.

Ishaan Anand [01:10:47]:
It is essentially very close to how the real models work. It's a really good stepping stone to getting that really sharp mental model of what's happening under the hood.

Charles Max Wood [01:10:57]:
Yeah. That's true of most technologies. Right? I mean, if you were using, I don't know, let's pick one, MySQL ten years ago, probably 70% of the stuff is fundamentally the same. The engine works mostly the same. Yeah. They probably optimized some pieces.

Charles Max Wood [01:11:14]:
They probably added some features. But for the most part, if you understood what it was doing back then, you get it now. And to be honest, the other thing is, you'll also see, as we get more variations on things, you also have a decent understanding of how to use something like SQLite or PostgreSQL as well. So

Ishaan Anand [01:11:33]:
Yeah. I I really like that analogy. It's a good one. I might I might borrow that. Thank you.

Charles Max Wood [01:11:38]:
Yeah. No problem. Alright. Well, let's go ahead and do our picks. AJ, do you wanna start us with picks?

AJ O'Neil [01:11:44]:
Sure. Okay. So Civilization, I've still been playing that, not as much as I was the other week, but enjoying it. It turns out you can run it on the Mac if you go into settings and turn its performance mode all the way down. I think it's something to do with multithreading being why it crashes. Like, if it uses more than one core, it just crashes every five minutes or something. Anyway, so there's that. Civilization still going strong, but I wanted to correct that.

AJ O'Neil [01:12:16]:
You can get it running on the Mac. It just won't run on the Mac with the default settings, and it's not abundantly clear why. But, anyway, the other thing was, with the announcement of the Switch 2, I just got angry because I still can't play Tears of the Kingdom or Super Mario RPG or Spyro or, you know, any game that's basically been released in the last three years without massive stuttering. And, you know, with Tears of the Kingdom, you know how they have it go into bullet time whenever it gets overloaded instead of getting choppy. Although it still does that, it does dynamic resolution and bullet time. So your swings will slow down and stuff, which, you know, whatever. So I decided to mod my Switches. And I did the hardware mod of the Switch, and it was super easy. Now I've done mods in the past.

AJ O'Neil [01:13:16]:
So saying that it was super easy, no. If you're not familiar with soldering, if you haven't, you know, fixed a phone or done something like that before, no. It's not super easy. Getting all the screws out, getting to the actual part, getting the heat sink off, that's super easy. Anybody that has a precision toolkit for, like, phone repair, game repair, or whatever can get in there and do that. I actually couldn't see the soldering that I was doing because the components are so small. Now I've learned some tricks because I've practiced a little bit of micro soldering in the past and failed at repairing a 3DS. But the pieces are so small that I literally can't see them. I mean, I can see them, but I can't see them.

AJ O'Neil [01:14:02]:
I mean, like, I can see them the way I can see a grain of sand, but I cannot see them well enough to actually work accurately. So what I did was I used my phone to zoom in, take a picture of it, see that I had bridged two capacitors, and then just kind of blindly, you know, moved the soldering iron next to them and kinda swept it the same way that I would when I'm, you know, soldering a bigger component, then used the phone again, zoomed in. And so I was able to get the piece on there. I should have had some sort of magnifying glass set up, but whatever. So I was actually able to do it blind, in a way. Like, I could see my tip was there, but I couldn't actually see what was happening because, I mean, these things, they're smaller than a grain of salt. They're small.

AJ O'Neil [01:14:54]:
Anyway, the capacitors, they're, like, wicked small. But even with that, I was able to do it. So modding the Switches is one pick there. There's the Picofly mod kit. You can do it if you've done other soldering. If you haven't done micro soldering, buy a couple of practice kits from eBay or AliExpress or something, and you can get there. But my third pick is the soldering station that I used, which is actually a custom made soldering station. So a few years ago, these Chinese companies came out with the T12 tips, or they cloned the T12 tips. They turn out to be really, really good.

AJ O'Neil [01:15:37]:
And one reason is that they double as a temperature sensor because they have two different types of metals. And anytime you have two different types of metals, you have a thermocouple. So they have an inner metal and an outer metal, and they double as their own temperature sensor. And so people created software for these and put them on microcontrollers. And the software, I'm being literal when I say this, it rivals $3,000 professional workstations because of the way it switches back and forth between heating the tip and monitoring its temperature. And so I'll put a link, and you can get clones on AliExpress, but I prefer the original one that's made by this guy in Australia because I know it comes with the right firmware on it. And the firmware is really where the magic happens.

AJ O'Neil [01:16:25]:
Any idiot can, you know, 3D print some leads onto a Ridgid battery or a Milwaukee battery and connect it to a T12 tip. But the real smarts of it is in the firmware, where it manages the heat to make sure it gets up to temperature quickly, and it actually does the sensing thing where, if you shake it, it turns back on and heats up. Anytime the temperature is cooling down rapidly, it sends more power. So, anyway, it's just a really great soldering iron. And because I had that like, I've got a Weller, and I've got a couple of cheap ones. But that one, it cost a hundred bucks, and I'm considering trying out one of the knockoffs that's only, like, 35. Because now everybody even Craftsman is selling one of these at Lowe's now, but I don't know if the Craftsman one is just the cheap idiot kind where it's just connecting the leads together, or if it's actually got the firmware. I have a hard time believing that they would have gotten an illegal copy of the firmware the way the Chinese AliExpress ones have. Anyway, there's that.

AJ O'Neil [01:17:31]:
Yeah. So I had a good time soldering because of the Ridgid-powered one, which you can get for Milwaukee or Ryobi or whatever battery brand you like. And they're super fast. They're so much better than a Weller or a Hakko or all the traditional ones that cost hundreds of dollars. So, anyway, and, of course, I'll pick Ollama because I really do enjoy running my own local LLMs, especially since the 32-billion-parameter model of Qwen 2.5 Coder has come out. That one, I just find to be the best of the best. It rivals GPT-4o, if not is better than 4o, and you can run it on an Apple Silicon Mac. Boom.

AJ O'Neil [01:18:18]:
All the things.

Charles Max Wood [01:18:19]:
Okay. I have a question. What are you modding your Switch to do?

AJ O'Neil [01:18:23]:
Oh, sorry. I skipped over that part. To overclock well, not overclock. The native CPU clock speed of the Tegra X1, because the Switch is an Android tablet running a custom operating system rather than Android, is 2.2 gigahertz. That's the native clock speed. The clock speed that it runs at is something like a thousand or 700, depending on whether it's docked or handheld. Same thing with the GPU. The native GPU speed is, like, 1.5 gigahertz or something like that, but they clock it down to 500 or 700.

Charles Max Wood [01:19:06]:
Oh, I gotcha.

AJ O'Neil [01:19:07]:
So when you mod it, you can then and this you can do without getting banned, or at least this is what people are reporting, and I'm doing this. So, you know, if you mod it and you wanna run pirated games or something, you have to set up more stuff and make sure that you don't get banned, although a lot of that stuff is built in now. But if all you wanna do is overclock it, the overclock system runs in a layer that's kind of protected from the main Switch system. So the main Switch system can't detect that it's rooted while it's running. So if you just install Hekate what is it? Hekate, Atmosphère, and sys-clk, if those are the only things you install, then you should be able to run your Switch modded on the original firmware, be able to play online, etcetera, without any risk of banning or anything, because it's not modifying the Switch operating system or the game. It's just modifying the CPU clock.

Charles Max Wood [01:20:12]:
Gotcha. Cool.

AJ O'Neil [01:20:14]:
So now my friend asked me, well, can you notice any difference? And my answer is no. And the reason my answer is no is because when you're playing it underclocked, you notice the stuttering all the time, and, like, you notice the resolution changing. Like, you know, you turn and there's a bad guy and you shoot him and then the resolution, like but when you're playing it closer to native speeds you can't get it all the way up to native speeds because the power delivery on the board isn't actually capable of running it at native speeds without also draining the battery at the same time. But when you're playing it at near native speeds, you don't notice it because, like, the things that are annoying aren't there. The resolution's not changing. It's not stuttering. It's not going into bullet time as much. I did not notice it going into bullet time at all since I've been playing at near native speeds.

AJ O'Neil [01:21:05]:
And I did some things where I was blowing up rocks and things that I thought would normally make it go to bullet time, and it didn't. So, like, five star story so far.

Charles Max Wood [01:21:16]:
Cool. Very cool. Alright. Steve, what are your picks?

Steve Edwards [01:21:21]:
Alright. Time for my twenty minute picks. So before I get into picks, one note I will make, and it sort of circles back to what I asked at the beginning. You know, as someone who has spent a lot of time doing search indexing, you know, with Lucene-type search indexes, a lot of this sounds really familiar. And to me, I've always said that the I in AI is a misnomer. I think it's not necessarily intelligent. It's just basically better use of training and fancier use of existing data to answer things, not necessarily intelligence that can create new things. That's just my 2¢ for what it's worth.

Steve Edwards [01:22:05]:
Interesting pick. Ishaan, you mentioned this earlier today, and as of today, you know, this will come out a little later. But DeepSeek is, like, disrupting in a huge way. You know, for instance, if you go look on Hacker News, both on the top page and on the new page, there are multiple articles from NPR, from CNBC, about what it's doing to the stock market. And the gist is basically that they've been able to create these fantastic models with much less investment, with less powerful chips. And there's a whole story behind this. And so that's wreaking havoc, at least in the stock market, and with people like Nvidia, just because of supposedly how much cheaper and more efficient DeepSeek is compared to some of the other models. So today is obviously the first day, and, you know, it remains to be seen how accurate this is, especially coming from the Chinese.

Steve Edwards [01:23:07]:
But, sort of a disruptive thing going on this morning, at least as of the time of recording.

Ishaan Anand [01:23:13]:
Yeah. Do you mind if I jump in there a little bit? DeepSeek is an utterly fascinating story and model. I'll say one thing is that the training cost might have been apples to oranges. Like, they stated what the cost was for the best run or the final run. There's a lot of other costs that go into it. I talked earlier, you know, when AJ was asking what goes into it, and one of the key things I was thinking about was actually DeepSeek. Like, they're going to do a lot of other experiments and runs. There's a lot of stuff that gets built upon. So I think some people are comparing apples to oranges, but it is, you know, an impressive model in a lot of ways. The other thing that I find the most fascinating about it is the training process is remarkably simple.

Ishaan Anand [01:24:07]:
And, you know, I'm trying to think of an analogy. It's like, normally, when they do this part of the training process called reinforcement learning, it's a lot more complex. And it's kind of like, you know, you think about a car, and you're like, well, if you wanna go from point a to point b, you need an internal combustion engine. It's basically, you know, having a little fire, and you've got these pistons and these cylinders. It's a really complex piece of mechanics. And then somebody comes along with the electric car and says, you know that little toy motor you had? Well, let's just scale that thing up. And so they tried this relatively simple technique and just scaled it up, and it worked. And I think different people are reacting to this model differently.

Ishaan Anand [01:24:48]:
For some people, it's about the price. For other people, it's about the training setup, and it's, how did we miss this? It's just remarkably simple. So it's definitely worthy to bring up, as it's just a really interesting model.

Steve Edwards [01:25:02]:
Occam's razor strikes again. Right?

Ishaan Anand [01:25:05]:
Well, there's something they call in AI the bitter lesson, which is stop trying to put into the model how you think you think. Instead, just use really general compute and just throw more and more data and more and more compute at the problem, and the model will figure it out. So don't try to be too smart about it. And people are like, this was the bitter lesson all over again. It's like, oh, we thought, you know, we had to do this really complex, you know, reinforcement learning setup, and these guys showed, well, maybe you don't. Yeah. In the end, their production model actually still had a somewhat complex training pipeline, but one of the interesting results is this model called R1-Zero, where, kind of like, you know, AlphaZero learned how to become a really good Go player just by playing against itself. In this case, it wasn't the model playing against itself.

Ishaan Anand [01:25:53]:
It was just trying out ideas. And they just told it whether it was right or wrong, and it started automatically, emergently, figuring out how to improve its thinking. And it starts getting these eureka moments where it suddenly realizes it can backtrack. And it's like, oh, wait. I made a mistake. And you're watching the model. Like, we didn't train it to do this, and it suddenly figures out how to, like, get smarter. So it's really, really fascinating.

Ishaan Anand [01:26:19]:
We could talk another hour on the model. But, yeah, lots of people are poring over it. It's fascinating in many dimensions.

Steve Edwards [01:26:28]:
Cool. And then last but not least, certainly the high point of any episode is the dad jokes of the week. So what did one pie say to the other pie before being put in the oven? You know, this is a musical answer. All we are is crust in a tin. For anybody that knows Kansas. Here's an Australian version. My mate was bitten by a snake. So I told him an amusing story.

Steve Edwards [01:26:58]:
If I'd known the difference between anecdote and antidote, he'd still be alive. And finally, when I was in school, my teachers told me I would never amount to anything because I procrastinated too much. I told them, you just wait. Those are the dad jokes of the week.

Charles Max Wood [01:27:19]:
Alright. Well, I'm gonna jump in here and save us from the high point of the episode. I've got a couple of picks. I always do a board game pick. So the game I'm gonna pick, I learned last week. It's called Cascadero. I'm gonna put links in for both BoardGameGeek, which kinda gives you information about the board game, and then an Amazon affiliate link, just because then you know where to go buy it if you want it. Anyway, so Cascadero. The premise of the game is that the kingdom's breaking up, and so you all play a different faction, I guess. And you're trying to connect towns and send your people through the towns to pull the kingdom back together.

Charles Max Wood [01:28:12]:
And so you put your little guys out there, and then you score points based on whether you're the first person to the town or the second person to the town. You have to have a group of your little horsemen. I can't remember what they call them. Heralds? No, the heralds were the other things. Anyway, if you connect to a town with a herald, then you get an extra point for connecting to it. And then there are bonuses that you get, because when you get the points, you actually move a marker up the technology or progress track in whatever color you connected to.

Charles Max Wood [01:28:51]:
And so I guess they're not points, they're just movements. But anyway, once you move past certain points, you get certain rewards. And effectively, what you're trying to do is score the most points, and you also want to get to the end of the track in whatever color you're playing. So if you're playing pink, you want your pink marker to get all the way to the end. And, yeah, like I said, you get bonus points for getting all five of your markers past the first space that's marked for that. You get more bonus points if you get three past the second space. And then if you're the first one to get one all the way to the end, then you get bonus points.

Charles Max Wood [01:29:35]:
And only one person can get those. And then the other ones are: if you connect two cities of the same color, and there are five colors, you get bonus points for each color. And if you get all five colors, then you get 10 bonus points. And so you're just moving your marker around a track. As soon as somebody gets 50 points, the game ends. So essentially, if you want to win, you want to be the first person to get your marker all the way to the end of the track in your color and then be the person that gets that fiftieth point. We played it, and it was the first time any of us had played it, and nobody crossed that fiftieth point before we all ran out of little horsemen.

Charles Max Wood [01:30:20]:
And so when somebody runs out, everybody gets one more turn. Or if somebody crosses that 50 marker, everybody else gets one more turn, and then the game's over. It's reasonably simple. The scoring's a little bit complicated as far as when you get moves, how many moves, and things like that. So BoardGameGeek weights it at 2.53, right? So it's a little more complicated than your casual gamer who's just gonna place gravel with their friends. But my feeling is that the hard part was just getting used to what happens when I put my horsemen down, and how to get the rewards: I put my horsemen down, and I can move something up the track so many spaces.

Charles Max Wood [01:31:05]:
Once you figure that out, it's a relatively simple game. We played it in, what, an hour maybe? A little longer. I think if we'd known what we were doing, we could probably have played it in forty-five minutes. There were three of us playing it. So, anyway, Cascadero. Fun, fun game. I liked it. I wanna play it again now that I know how to play it and my friends know how to play it, because there were a couple of things I would've done differently as I got into it. As far as other picks go, go to jsgeniuses.com and sign up.

Charles Max Wood [01:31:43]:
We're gonna start doing the meetups, and I'm gonna start posting videos. With the videos I'm posting, I'm kind of building an entire app. I don't know if I'm gonna show you writing all the code, because some of the stuff gets repetitive. Oh, I have to connect another data model to this database, right? It's like, okay, you don't need to see that 18 times. But, you know, we'll get kind of the major pieces in, and then anything bonus or extra that I do. For the app I'm gonna build, I decided I need to learn Next.js.

Charles Max Wood [01:32:15]:
So it's gonna be a Next.js app, and I'm gonna be putting it on Cloudflare Workers. And the reason is because, just to give you an idea of what the app is, it's relatively simple. But last year, when we ran caucus night for the Utah Republican Party, we had an online registration, and we got DDoSed because there were people out there who didn't like us. And it's internal politics to Utah. It wasn't the Democrats. It was somebody else.

Charles Max Wood [01:32:50]:
But, anyway, because of that, I'm looking to, you know, put it on a system where I know it'll just kinda expand to whatever comes at it. Cloudflare is also usually pretty good at saying, you've hit me 18 times now, I'm just gonna drop you unless you can prove you're human. And so I feel like I can get some of those benefits, but I'm also curious to see how Cloudflare Workers work. So it's gonna be basically a registration app. There's gonna be a little bit of site automation, because the Utah state voter registration site, where you verify your voter registration, doesn't have an API, which means that I have to have my program use something like Puppeteer to fill in form fields and then scrape data off the response to make sure that you're registered to go to caucus night. I'm thinking I may also offer the same kind of thing to the Democrats and anybody else who wants to run a caucus night that night. I think the Libertarians in Utah do it too.
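
As an aside for readers, the fill-a-form-and-scrape flow Charles describes could be sketched with Puppeteer roughly as below. Everything site-specific here is hypothetical: the URL, the CSS selectors, and the status phrases are placeholders, not the real Utah voter-search site.

```javascript
// Hypothetical sketch of the no-API verification flow described above.
// The URL, selectors, and status phrases are placeholders, not the real site's.

// Pure helper: decide registration status from the scraped result text.
// The phrases checked here are assumptions about the site's wording.
function isRegistered(resultText) {
  const text = resultText.toLowerCase();
  if (text.includes('not registered') || text.includes('no record found')) {
    return false;
  }
  return text.includes('registered');
}

// Drive the search form in a headless browser (needs `npm install puppeteer`).
async function checkVoter({ firstName, lastName, birthDate }) {
  // Lazy-loaded so the pure helper above stays dependency-free.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto('https://example.utah.gov/voter-search'); // placeholder URL
    await page.type('#firstName', firstName); // placeholder selectors
    await page.type('#lastName', lastName);
    await page.type('#birthDate', birthDate);
    // Submit and wait for the results page to load.
    await Promise.all([page.waitForNavigation(), page.click('#search')]);
    const resultText = await page.$eval('#results', (el) => el.textContent);
    return isRegistered(resultText);
  } finally {
    await browser.close();
  }
}

module.exports = { isRegistered, checkVoter };
```

Splitting the text check out of the browser-driving code keeps the fragile part (selectors, navigation) separate from the part you can unit-test without a browser.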

Charles Max Wood [01:33:52]:
Right? So that they can just say, hey, you've got online registration, and then you've got an app that'll verify it on the other end. So, anyway, that's what I'm looking at. There may also be a React Native app or something like that on the other end, where people can show up with a QR code that says, I registered and this is who I am, and people can just verify that way instead of having to look them up in a paper list or something like that. So that's what we'd be building. With JS Geniuses, you get access to the videos, you get access to the weekly meetups, and a bunch of other stuff. I'm also looking at starting a new podcast on doing AI with JavaScript, and it's gonna be at this level.

Charles Max Wood [01:34:33]:
Right? We're not building our own models. We're gonna be using the existing models that are out there, the open source models, if you will, and showing how to build things on top of those, or using some of the cloud services that generate images, or using something like Whisper for transcriptions and things like that. So, anyway, keep an eye out for that. That'll be free. I'll probably drop the first two or three weeks' worth of episodes onto this RSS feed, and then from there, you'll be able to just subscribe to the other feed. So that's what I've got going. Yeah.

Charles Max Wood [01:35:08]:
Those are my picks. Ishaan, what are your picks?

Ishaan Anand [01:35:11]:
So I've got two picks, and both are gonna be AI related. The first one is NotebookLM, but everyone knows about NotebookLM with, like, the fake podcasters. My pick is NotebookLM without that feature. I think that feature is great and really compelling. It lets me consume, you know, material on the go in podcast form. But I like the other parts of NotebookLM, which is, like, it's a great way to stick a variety of sources together and then ask questions about it. So one example is I like to go to Y Combinator's Hacker News to see what the comments are, but I don't wanna read through every single one.

Ishaan Anand [01:35:47]:
So I will stick it in there and say, well, what are the most insightful comments? What are people saying? I did this actually with DeepSeek. I said, what are the comments people are saying about DeepSeek? What are they seeing for performance? What are the issues where it's not working? And what's great is it doesn't just summarize it. You've got links where it can go to each part of it, and I'm like, oh, that sounds interesting, let me go click on it, and I can go right to the citation of that comment. The formatting's a little off when you stick it in there, and you're only limited to 30 sources in each notebook. But check out the other parts of NotebookLM.

Ishaan Anand [01:36:17]:
I think it's really interesting, and I expect to see a lot of other applications follow a similar type of UX paradigm or inspiration. The second one is, I don't know if you guys have been watching it, but Star Wars has a new show, Skeleton Crew, on Disney+. And, first of all, I think it's good. I don't think it's at, you know, Mandalorian or Andor level, and Andor was my personal favorite, but it's still pretty good. But the other reason I bring it up is I liked some of the elements of how they handled AI and droids. So in one episode, there's something that could be akin to jailbreaking the droid, where somebody uses the equivalent of prompt hacking to jailbreak a droid, and I don't think we've ever seen that in Star Wars' depiction of droids before. There's another one where it reminded me of this paper called alignment faking, where the model has to decide between its original training and the thing it's being asked to do right now.

Ishaan Anand [01:37:15]:
And it kinda goes back and forth, and it gets overridden by its original training. And there's one thing in the very last episode that I also thought was fascinating. But I really liked those interesting bits of how they handled AI, which I think we wouldn't have seen in a show like this without the understanding of ChatGPT that the writers were probably inspired by. So those are my picks.

Charles Max Wood [01:37:35]:
Awesome. Yeah. Skeleton Crew's on my list of things I wanna watch, so I like the recommendation. Thanks for

Ishaan Anand [01:37:40]:
that. Mhmm.

Charles Max Wood [01:37:41]:
Alright. Well, just a reminder, go look on Maven.com. The code is JS Jabber for 20% off. So if you're interested in the course, go check it out. I'm not a hard-sell guy. I just think it sounds fascinating. So, anyway, let's go ahead and wrap it up here. Until next time, Max out.