How Does ChatGPT Work? - ML 107 - Adventures in Machine Learning

Michael_Berk:

Hello, everyone, welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Berk, and I'm a resident solutions architect at Databricks, and I have my co-host with me.

Ben_Wilson:

I'm Ben Wilson, former data scientist and machine learning engineer, who now focuses on building MLOps tooling at Databricks.

Michael_Berk:

So today we're going to be talking about ChatGPT, and the one-liner of what it does is that it's essentially a chatbot that you can interact with. It will write code, it will distill knowledge, and it will summarize text. The two crazy facts that I found on the internet were, first, it got a B on the Wharton MBA final exam. So that's one of the most reputable MBA programs in the world, and it could pass that test, which is pretty mind-blowing. Then also, it passes the Turing test. There was a study that tested whether people could determine if an article, I think it was a two-hundred-token article, was written by a person or by ChatGPT, and the success rate was fifty-two percent, so it was exactly a coin flip. So ChatGPT is an impressive technology, to say the least, and today we're going to be talking about what it is and how you can learn more. We're going to be using sort of an FAQ structure, and we're going to start off with the first question, which is: what is OpenAI? Now, Ben, have you heard of OpenAI?

Ben_Wilson:

I have. Very impressive company. Their first big claim to fame, where I became aware of them, was the original GPT-1 model that was out there, and some of the pretty impressive things that it was capable of doing. There was sort of a light buzz around them around that time, and they started just dropping more and more trained models out there for people to use that were far more advanced than anything that was out there at the time.

Michael_Berk:

Got it. Yeah, since then they've iterated and developed some new products, so on their products page they list three. The first one is Whisper, which is a speech recognition and transcription framework. The second is DALL-E, and if you guys were paying attention six months ago, DALL-E was the craziest thing you'd ever seen. It can generate a Monet-style painting of mice on a beach playing [unclear], and it was pretty legit.

Ben_Wilson:

Hm.

Michael_Berk:

So DALL-E is image generation from written descriptions, and then ChatGPT is text generation, summarization, and analysis. So those are the core products. It was founded in twenty fifteen, has grown dramatically, and is at around six hundred employees right now, according to LinkedIn. One final point that's interesting is that Microsoft has a forty-nine percent stake, so they're probably going to be a Microsoft company, if forty-nine percent isn't already enough to make them a Microsoft company. So there's

Ben_Wilson:

Yeah,

Michael_Berk:

point number one. For point number two, let's talk about the pricing of ChatGPT. There are a variety of ChatGPT 3 versions. The most powerful parent model is called Davinci, and this is priced at twenty cents per one thousand tokens, and a token is a word. But there are other ones like Ada, which is the fastest and least powerful one, but it's still pretty robust, and it is point zero zero zero four cents per thousand tokens. So depending upon your use case, you can get speed versus robustness. Ben, have you ever worked with the parent model Davinci at all?
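To make the per-thousand-token pricing concrete, here is a minimal sketch of the arithmetic. The prices are the figures as quoted in this episode, converted from cents to dollars; they are not checked against any price list, and the function and dictionary names are made up for illustration.

```python
def estimate_cost_usd(num_tokens: int, usd_per_1k_tokens: float) -> float:
    """Return the cost in US dollars of processing num_tokens tokens."""
    return num_tokens / 1000 * usd_per_1k_tokens


# Assumption: prices exactly as quoted in the episode, converted to dollars.
QUOTED_PRICES_USD_PER_1K = {
    "davinci": 0.20,     # "twenty cents per one thousand tokens"
    "ada": 0.000004,     # "point zero zero zero four cents per thousand tokens"
}

if __name__ == "__main__":
    tokens = 1_000_000  # hypothetical monthly volume
    for model, price in QUOTED_PRICES_USD_PER_1K.items():
        print(f"{model}: ${estimate_cost_usd(tokens, price):.4f} for {tokens} tokens")
```

Whatever the real prices are, the shape of the trade-off is the same: cost scales linearly with token volume, so the gap between the largest and smallest model is a constant multiplier.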

Ben_Wilson:

Personally, no. I've talked to people that subscribed to their service, and the anecdotal response they've given is that it is sufficiently more advanced than the free thing that's out there, which is the ChatGPT 3 that you hear about in the news and see people playing with, and some of us have extensively played with it. But you've got to pay to play with any service like this. So if people are really serious about using a language model like this, and want to dedicate a bit of time to session-level contextual awareness to get Davinci to respond to problem statements in a way that's going to be useful for your company, it is by far, in my opinion, the most advanced thing out there. It's great, from what people have told me. I've just played with ChatGPT 3 a lot, and it is impressive.

Michael_Berk:

And so you said one of the advantages, theoretically, of Davinci is contextual knowledge given in a single session.

Ben_Wilson:

Yep,

Michael_Berk:

What else could ChatGPT be better at, that Davinci theoretically would fill in the gaps for?

Ben_Wilson:

One thing that I noticed, actually yesterday, because we're working on a project right now that interfaces with some of the pre-trained models that they've released in the past. And if anybody's listening from Databricks, this was after hours. I wanted to see if I could train a ChatGPT 2 model from Hugging Face Transformers. I wanted to retrain it on some data, and then I wanted to have a user ask it questions and see what the responses are from both the pre-trained raw model and then from me retraining it a little bit. The pre-trained raw model didn't actually answer questions all that well. Even within the context of a session, it got pretty confused about some things, and then after retraining it, it definitely improved quite a bit.

But then I took the output, I took the actual code that I wrote, and I took the simulated input for a user asking questions of the bot, and pasted it into ChatGPT 3. It knew what I was doing. It knew what libraries I was using, and it was like, "Oh, you're using the Transformers library and you're using the predecessor to me, which is ChatGPT 2." And this session had gone on for a while, so it knew kind of the sort of questions I was asking, and it analyzed the code and was like, "Oh, you're having a user loop through a bunch of interactions, but you're not waiting for the responses from the bot." So it understood the code that I was writing, and it actually wrote that it understood I was doing this for testing purposes. Then it rewrote the code a little bit as an example and said, "If you wanted to actually open this up for a human to interact with it, you would not do this. You would do this instead." And I was looking at it, and I was like, yeah, that's exactly how I'd write it. It's pretty cool. And I asked it, "What do you think the output is going to be from these questions?" And it generated answers that were so unbelievably flawless that when I told it what the output was, its response was, "I can see why that would be the case. I'm trained on much more data. That's why I answered it in this way." And it was like clever responses that it had. It was something that I would expect a witty and mildly sarcastic person to answer, because I told it to interact with me in that fashion and it was starting to pick up on that.

And then I asked it, "Hey, if I wanted to take this model and put it into a pipeline, how would I do that?" So it wrote the code for me, and then I looked at it and was like, "Well, wouldn't it be more efficient if we did this instead?" And it responded with yes. It was like, "Oh, you're right. This would be more efficient. And here's the code for what you just explained to me." So it adapted in that way from that context. And then I asked it something that I knew was impossible within the library, because I started asking it: Okay, how would I save the model? How would I save the configuration for the model? How would I save the tokenizer? How would I save the configuration for the tokenizer? And then, within the trainer object that I'm creating, how do I save the trainer? So, of those questions: saving the model, it knew the API, gave it to me, and wrote a function around it. I was like, yep, that's awesome. The configs, it pulled those out correctly. It was the exact API. I knew for a fact that you couldn't save the trainer object. There's no API for that, because it wraps a bunch of C libraries and there are some partial functions and lambda expressions that you can't save the state of. It got creative and it generated a save_trainer method and wrote code as though that would be an API that would be part of the package. It actually wrote, like, "This is how you would save a trainer object, and a trainer object contains these aspects, and this is the method that you would apply a path to on your local file system." And I looked at it, just sat there staring at the screen like, I can't believe it just created a BS method. This is mind-blowing.

Michael_Berk:

Uh,

Ben_Wilson:

Like, it should know, because it has access to this repository. It should have that knowledge of what methods are available. I looked in the GitHub history to see: did Hugging Face ever have a save trainer method? No history of it whatsoever in the entire history of the repo. So I called it out and I was like, "Hey dude, there's no method for this, and this is the exception that you get if you try to do that." And it responded with, like, "Yes, you're right. I was simply inferring that this might be something that would exist. Now I understand. Thank you for telling me that this is not possible."

Michael_Berk:

theoretically

Ben_Wilson:

That's what the context is for these things. It saves the state of that memory of your entire session based on a user ID token, and that's how it knows what to respond when there are a thousand different users using an instance of the model: what to respond to each individual user, based on that contextual history.
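The idea of keeping a separate conversational context per user can be sketched in a few lines. This is a toy illustration only, assuming the history is replayed as part of the next prompt; `SessionStore`, `add_turn`, and `build_prompt` are hypothetical names, and nothing here is claimed to match how OpenAI actually stores session state.

```python
from collections import defaultdict


class SessionStore:
    """Keeps a bounded conversation history per user ID (toy example)."""

    def __init__(self, max_turns: int = 20):
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.max_turns = max_turns
        self._history = defaultdict(list)  # user_id -> list of (role, text)

    def add_turn(self, user_id: str, role: str, text: str) -> None:
        turns = self._history[user_id]
        turns.append((role, text))
        # Drop the oldest turns so only the most recent max_turns remain.
        del turns[:-self.max_turns]

    def build_prompt(self, user_id: str, new_message: str) -> str:
        # Only this user's history is included, so sessions never mix.
        lines = [f"{role}: {text}" for role, text in self._history.get(user_id, [])]
        lines.append(f"user: {new_message}")
        return "\n".join(lines)


if __name__ == "__main__":
    store = SessionStore(max_turns=4)
    store.add_turn("alice", "user", "How do I save my tokenizer?")
    store.add_turn("alice", "bot", "Call its save method with a directory path.")
    print(store.build_prompt("alice", "And the model config?"))
    print(store.build_prompt("bob", "Hello"))
```

Bounding the history is the design choice that matters here: a model can only attend to a limited amount of text, so older turns eventually have to be dropped or summarized.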

Michael_Berk:

Got it. And Davinci, theoretically, would not do that and would have a better answer?

Ben_Wilson:

Ah, it's got access to a whole lot more information. Token embeddings are done in the data in these models instead of within the actual model framework itself, so it's got more data that it has access to, but it also has an order of magnitude more nodes with which to access the historical reference of token relationships as it's passing through these sequences, so it can craft more complex sentences. It can understand more. When any of these models are generating text that goes to an end user, they'll generate a sentence that is of some sort of fixed length. They can't produce more than the number of nodes that are available in the output. So it does that, it will generate that one sentence, and then it uses the context of what it just generated to generate the next sequence of text. Davinci is capable of doing that over a much larger history, so it can generate maybe a paragraph at a time that has more lucid connections between each of those topics. ChatGPT 3 is fantastic at creating paragraphs of text because of how well it can refer to previous state, but Davinci is just going to be far more advanced at that.

Michael_Berk:

So yeah, to summarize, within the ChatGPT product offering there are a variety of different price points, depending upon latency requirements and how good you want your responses to be. So next, moving on, let's talk about what GPT actually stands for. GPT can be called a general purpose technology, but in this case it stands for generative pre-trained transformer. Ben, do you have a one-liner ready for how those work?

Ben_Wilson:

I mean, the generative aspect is that it's creating, you know, tokens. So we read in a sentence that is in UTF-8 encoding, so we're writing words. No language model or deep learning model, or machine learning model in general, has a concept of what to do with text. There's no defined relationship between words that you can mathematically manipulate, so that text gets tokenized, and these tokens then go through the model architecture to figure out what the relationship is within that collection of tensor objects, those indexes that get passed in. The generative aspect of that is looking at the output of it. The architecture says, I need to find relationships between the tokens that I'm producing, based on what was sent to me, to provide some sort of response. The architecture is pretty complex for that model. I have no idea what the Davinci architecture is. That is their bread and butter. I don't think anybody who doesn't work for OpenAI is ever going to know it. I imagine it's incredibly complicated, because those people are geniuses and they've worked really hard at that. But the idea is that it's going to produce sequences of tokens that have relations between them that have been trained to be relevant in response to a collection of tokens that come in. And it has, like, a memory, a saved state. So it's like, hey, I generated this sequence, and when I generate my next one, I need to generate another, you know, set of sequences. So there's a concept of recursion that's happening until it knows that it has exhausted the idea that it's trying to convey.

Michael_Berk:

Got it, so sequence in, sequence out.

Ben_Wilson:

Yes, sequence in, multiple sequences out.
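The loop described above (turn text into integer token IDs, produce the next token from what has been generated so far, repeat until a stopping point) can be sketched with a deliberately tiny stand-in model. The assumption here is a bigram lookup table in place of a transformer, so this shows only the generate-then-feed-back loop, not how GPT models compute anything; all names are illustrative.

```python
import random
from collections import defaultdict


def train_bigrams(token_ids):
    """Record, for each token ID, which token IDs were seen right after it."""
    followers = defaultdict(list)
    for current, nxt in zip(token_ids, token_ids[1:]):
        followers[current].append(nxt)
    return followers


def generate(followers, start_id, end_id, max_new_tokens=20, seed=0):
    """Autoregressive loop: each new token is chosen from the last one produced."""
    rng = random.Random(seed)
    output = [start_id]
    for _ in range(max_new_tokens):
        candidates = followers.get(output[-1])
        if not candidates:        # nothing ever followed this token
            break
        next_id = rng.choice(candidates)
        output.append(next_id)
        if next_id == end_id:     # the "idea is exhausted" stop signal
            break
    return output


# Text in -> integer IDs -> generated IDs -> text out.
corpus = "the cat sat on the mat . the dog sat on the rug ."
words = corpus.split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
inverse = {i: w for w, i in vocab.items()}
ids = [vocab[w] for w in words]

followers = train_bigrams(ids)
generated = generate(followers, start_id=vocab["the"], end_id=vocab["."])
print(" ".join(inverse[i] for i in generated))
```

A real model conditions on the whole preceding context rather than only the last token, which is what lets larger models keep longer passages coherent.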

Michael_Berk:

So with all of that, we've been talking about how it's an absolutely massive set of knowledge. The model itself is giant, and the training data must have been just almost incomprehensible, so I actually did some research to figure out what that training data looked like. First, from a compute perspective, Microsoft built a supercomputer for OpenAI, and this supercomputer is, according to the listing of the top supercomputers in the world, in the top five. In terms of power, it has two hundred and eighty-five thousand CPU cores, ten thousand GPUs, and four hundred gigabits per second of network connectivity for each GPU server. That's a pretty big computer. But

Ben_Wilson:

Yes,

Michael_Berk:

with all that, it still took months for the model to train. Ben, what's the longest training time you've ever seen for a model you've been working on?

Ben_Wilson:

For a model that I've personally been working on, that was part of a project that I had to do, ten days is the biggest, and we used quite a bit of GPU hardware for it. It was a deep learning problem for image classification. But for projects that I've worked with teams on, we haven't come to this level. I personally haven't seen something that's this big. Usually you have a dedicated team of engineering specialists working with the data science team, so you don't call a vendor to help you with that, you just hire the people. That's what OpenAI is. They don't need help. They need help, you know, working with Microsoft for all that hardware, because it's not cheap to build that server and keep it running, and the AC bill alone is probably astronomical. But for stuff that I've worked with teams on, we've had stuff that has run for weeks on end for deep learning training. Usually NLP stuff. If you're doing transfer learning, you're probably not going to see training times that long. But if you're implementing a novel architecture for the first time ever, you start small. You're like, hey, I'm going to train on, you know, ten thousand training set entries against a hundred labeled test and validation entries. You do that to make sure that your model architecture just works, that it doesn't blow up, and that it's showing some sort of improvement on each epoch. But then when you're ready to actually train the thing, you're talking about millions of labeled events, sometimes even billions. And when you're talking about model architectures of this size, with the number of nodes that have weight connections between them, each of those is a mathematical computation that has to be done for each iteration, each epoch, as it's learning over that training set over and over and over again. It takes a long time.
These are expensive things to build from

Michael_Berk:

Yeah,

Ben_Wilson:

scratch.
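The "start small and check that each epoch improves" step can be shown with a toy model. This sketch assumes a one-variable linear model trained by full-batch gradient descent as a stand-in for a deep network; the data, learning rate, and function name are all illustrative choices, not anything described in the episode.

```python
def train_linear(samples, epochs=5, lr=0.1):
    """Fit y = w * x + b by gradient descent; return the loss seen at each epoch."""
    w, b = 0.0, 0.0
    losses = []
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = total_sq_err = 0.0
        for x, y in samples:
            err = (w * x + b) - y
            total_sq_err += err * err
            grad_w += 2 * err * x
            grad_b += 2 * err
        losses.append(total_sq_err / n)   # mean squared error before the update
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, losses


# A deliberately tiny subset: the goal is a smoke test, not a good model.
small_subset = [(i / 10, 2 * (i / 10) + 1) for i in range(10)]
_, _, losses = train_linear(small_subset)

# The check described in the episode: it runs, and the loss keeps going down.
improving = all(later < earlier for earlier, later in zip(losses, losses[1:]))
print("loss per epoch:", [round(v, 4) for v in losses])
print("improving every epoch:", improving)
```

If a check like this fails on a small subset, it fails in seconds rather than after days of expensive compute, which is the reason for running it first.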

Michael_Berk:

Speaking of expensive, I looked up how much it cost to train, or at least the estimate, and experts say it was around four point six million dollars.

Ben_Wilson:

I believe it.

Michael_Berk:

And building the supercomputer is not included in this cost at all. That's just for running it.

Ben_Wilson:

Ah, there's no way. Probably, I would estimate that four of those server racks in that supercomputer, not server modules, not like a 4U module or an 8U module, but the full-stack, you know, rack, are probably around five million dollars. They probably have four hundred full-sized racks for that computer, maybe five hundred. That's a big machine.

Michael_Berk:

It's an expensive piece of hardware.

Ben_Wilson:

That's a lot of network cards, too. It's a lot of fiber optics. But yeah, that's an investment, but it's an investment for the future. Even if that computer is a full hundred million dollars to build, and it's probably in that ballpark, Microsoft is all in on making investments like that right now, for the future. They know what these things are going to be ten years from now. It's going to be your pocket assistant for helping you navigate modern life. I mean, it's already doing stuff like that. If I don't want to write a bunch of boilerplate code for something that I'm testing out, and I certainly don't do this for PRs, but if I'm doing a demo of something for myself, like, hey, I'm trying to learn the API and I want to see what's possible, I could go to four or five different websites, read through the docs, look at the source code, and hack something together in thirty minutes. Or I can very clearly define to ChatGPT 3 what library I'm using and what I want to do, and have it generate code for me that I can then copy and paste into a new notebook environment and just run the test and see what happens. It generates that in fifteen seconds for me. And it teaches me if I ask it questions, like, "Hey, I don't quite follow how this thing works," or "Are there any other ways that I can manipulate this part of the code?" It'll teach me. It'll be like, "Yeah, you can do that, and here's how to do it." So it's like search on steroids. It's the next evolution of Stack Overflow, where it's basically an amalgamation of all the, you know, one-thousand-plus-upvote answers that are out there, because it just works most of the time.

Michael_Berk:

Most of the time. Yeah, it's been interesting to see how it's changed some of my friends' work lives. One friend is applying to new jobs, and they have started every single cover letter with a ChatGPT-written entry, and then they edit it accordingly. They have also been using it sort of in place of Google to summarize definitions in finance or whatever it may be. So it's

Ben_Wilson:

Hm.

Michael_Berk:

It's been incredible to see how it's already changing some workflows. So this supercomputer is absolutely massive, built custom for OpenAI. But let's talk about the training data that went into this supercomputer to actually build ChatGPT 3. For context, there have been three iterations of ChatGPT, at least publicly released. ChatGPT 2 was trained on ten billion tokens, and a token you can basically think of as a word. ChatGPT 3 had four hundred and ninety-nine billion tokens, and just looking at the breakdown, four hundred and ten billion of them were Common Crawl, so essentially just web text. Another nineteen billion was a separate set of web text. Then there were two sets of books, one with twelve billion tokens and one with fifty-five billion tokens. And then finally they trained on just a little bit of Wikipedia, at three billion tokens. So Ben, can you start to describe the scale of this training data? Or is it even possible?
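As a quick sanity check on the breakdown just quoted, the five sources do add up to the stated total. The sketch below uses only the figures as spoken in the episode, in billions of tokens; the dictionary keys are informal labels, not official dataset names.

```python
# Token counts in billions, as quoted in the episode.
quoted_token_counts_billion = {
    "Common Crawl": 410,
    "other web text": 19,
    "books (set 1)": 12,
    "books (set 2)": 55,
    "Wikipedia": 3,
}

total_billion = sum(quoted_token_counts_billion.values())
shares = {name: count / total_billion for name, count in quoted_token_counts_billion.items()}

# Ratio against the ten billion tokens quoted for the previous iteration.
growth_factor = total_billion / 10

print(f"total: {total_billion} billion tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"growth over the previous iteration: {growth_factor:.1f}x")
```

By raw count, web crawl data accounts for roughly four fifths of the tokens, with books and Wikipedia making up a comparatively small remainder.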

Ben_Wilson:

So one thing to be clear on with the term, if we're talking about classifiers and talking about token counts: there aren't four hundred and ninety-nine billion words in existence. The English language has a lot of words, even if you include slang and stuff in there, but not that many. So what they're referring to is relationships between tokens. Think of it as sentences. The token, you know, maybe the word "a" is token zero. It's not usually, in these models, but it will have some sort of embedding associated with it. So you have this vocabulary that is a list of words in language X, and that's mapped to basically an integer index position, so that corpus of definitions is common between the tokenizer, where you convert words into the indices, and the model, which has those references locked in as well. And then when you decode the output of any of these language models, you're converting it back into collections of words, or a single word, or whatever. But the training data has these sequence components, so you'll have, as that sentence is constructed, different combinations of tokens that are in there. So when you talk about all those permutations of, hey, this is how a sentence could be crafted, it's not the raw sentence that we would be generating if we were having a conversation or communicating back and forth in Slack or something. There are a lot of terms that are dropped that are just not important. They don't have contextual reference, and that's to limit how big the model would have to be. So there are preprocessing stages that happen in any of these models, where it's like, hey, stop words, get rid of those. I don't care about words like "the," "an," "a," and "and." And then, you know, certain pronouns are, not dropped, but more collapsed into a single term.

So there's cleanup that happens, and usually with these models, if you're using a pre-trained one, eighty to ninety percent of your work is doing that stuff: cleaning your data up, cleaning up your tokens, and making sure that you're training on just the relevant things that you want the model to learn. Which is one of the reasons why building these things from scratch, if you've never done one before, is a truly insurmountable task, unless you have a team of, you know, four hundred world-class engineers working on it.
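The vocabulary-to-integer mapping and the stop-word cleanup described above can be sketched as follows. This is a toy, whole-word version written only to illustrate the idea; the stop-word list and function names are assumptions, and production tokenizers typically work on subword pieces rather than whole words, so this should not be read as how any specific model preprocesses text.

```python
# Assumption: a tiny hand-picked stop-word list, taken from the words named above.
STOP_WORDS = {"the", "an", "a", "and"}


def preprocess(text):
    """Lowercase, strip simple punctuation, and drop stop words."""
    cleaned = (w.strip(".,!?") for w in text.lower().split())
    return [w for w in cleaned if w and w not in STOP_WORDS]


def build_vocab(words):
    """Map each distinct word to an integer index, in order of first appearance."""
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return vocab


def encode(words, vocab):
    """Words -> integer indices (what the model actually consumes)."""
    return [vocab[w] for w in words]


def decode(ids, vocab):
    """Integer indices -> words (what happens to the model's output)."""
    inverse = {i: w for w, i in vocab.items()}
    return [inverse[i] for i in ids]


words = preprocess("The cat and the dog sat on a mat.")
vocab = build_vocab(words)
ids = encode(words, vocab)
print(words)
print(ids)
print(decode(ids, vocab))
```

The important property is that the tokenizer and the model share one fixed vocabulary, so an index always refers to the same term on the way in and on the way out.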

Michael_Berk:

Right. Yeah, so those tokens are not individual words. I just looked it up: there are a hundred and seventy thousand words in the English language.

Ben_Wilson:

Yep,

Michael_Berk:

So it's about the combinations of the tokens, and we'll hopefully get into that in just a sec, on how it actually maintains, quote unquote, attention and self-attention between different tokens. But before we get into that, I just wanted to also chat about some implications of ChatGPT. I think there's a lot of discussion about how it's going to take jobs, or at least change jobs, but another area of concern is security and cybersecurity. It's clear that ChatGPT can generate text, and it can use that to develop phishing attacks at a very large scale. So I can create a hundred thousand emails, all customized to a given user, send them out, and hopefully I'll get some credit card numbers or passwords back. But beyond that, what else can ChatGPT do in terms of security, in your opinion?

Ben_Wilson:

I mean, if you're a nefarious actor who wants to use some sort of automated system to generate phishing attacks or just generate spam, there's nothing that ChatGPT 3 does that ChatGPT 2 can't do. I mean, ChatGPT 3 isn't out there in the wild for you to just take and retrain, but its predecessor is. There are loads of language models out there on the Hugging Face marketplace that could do all sorts of stuff. Some are pre-trained models from the marketplace that a user took, trained further on a certain data set, and then re-uploaded, so the model does this one unique thing because it was trained to do that one unique thing. I saw one two days ago, when I was trying to figure out what output certain, uh, tokenizers could produce on decode with certain different language architectures. This was actually a Hugging Face employee who did it, and they wrote a lyrics generator from ChatGPT 2's base core model, which is just the GPT-2 model. You type in the name of a famous artist, and basically, if they've ever been on the Billboard Top 100, it's been trained on their body of work. So the training set's pretty big. But you type in the artist and then you say, hey, generate ten iterations of song lyrics that seem like they would make sense from this artist. Is it perfect? No. Did it generate some where you're like, I could totally see this being a song that this person would write? Yes. So you wouldn't hook something like that up in an automated fashion and be like, hey, connect the outputs of this and send it to my SMTP server so I can start, you know, spamming out to all these email addresses. A nefarious person would know that that would get your IP address blacklisted by all major mail carriers within seconds anyway. So they would create all of these, you know, bot networks globally, running against VPNs, that will be changing their IP address after every ten emails that get generated, and they'll be using, you know, some mail service that is anonymous so that they can't be tracked. But you could do that today.

It seems scary. You know, the media blows this up and people are like, "Oh my God, it's really hard to tell if this is a person or not." But people doing illegal stuff are already doing this, and some of them are fairly sophisticated. One thing to keep in mind, if you're using a mail service like Gmail: I can promise you that the company that runs it has incredibly sophisticated filters, based on deep learning, that are searching the contents of every mail that's coming through. It doesn't take long for it to adapt and start detecting this stuff. So even if somebody used ChatGPT 3 to create all this stuff, there are really smart people at your mail provider. They're going to start blocking it.

Michael_Berk:

Got it. The second component of this question is: all right, well, if Gmail can adapt and learn that ChatGPT is spamming lots of people, well, ChatGPT can write code. So can't it adapt?

Ben_Wilson:

No. You can do some clever things with it, as I've done with software stuff, where I've kind of taught it how to write code in certain ways, just to see what would happen. Like, "Hey, are you familiar with this language?" "Yes, I am." "Okay, I'm going to give you some Scala code and I want you to rewrite this in Python." And it's really great at doing stuff like that. And then when you find an issue, you tell it, like, "No, that's not quite right," or "Hey, I don't like the way that you wrote that. Can you perhaps make that more efficient?" or "Can you make it more compact?" or "Hey, can you add, you know, exception handling to this?" It does that, and it does it very, very well, uh, arguably better than most humans. So it can do all of that stuff, but you can't tell it, "Hey, can you write me a nefarious bit of code that will try to hack AES-encrypted hashes?" And it will, well, I actually haven't tried that. I don't know if it'll do that. Maybe it will. Don't do that at home, kids. But even if it were to do that, you would still have to extract whatever it generates and execute it in your own execution environment, which is illegal, by the way. You can't tell it to execute code against the outside world. In fact, you can't tell it to search for something that happened yesterday. You can't say, like, "Hey, this thing happened in the news," and ask it, "Do you have any details on this thing?" It doesn't have access. Its knowledge base is cut off at some time in twenty twenty-one, because that's when its training set ended. So it doesn't have reference to modern things, because it's not actually, you know, alive. It's a model that is basing its text generation on data that it's been trained on, that's in its vocabulary. Some new things, it doesn't have reference to. It also doesn't have the ability to instantiate an execution environment and execute code on its own. That's not possible. It can write text that's formatted like code.

So it looks like it's an IDE, but it's not an IDE that can execute that code. If it could, uh, I would start being concerned. I don't think the sponsor of OpenAI, Microsoft, would allow that to happen even if they wanted that to happen. Who knows how dangerous something like that could get, if you could coerce it into iterating through, and you could tell it something like, "Hey, I want you to try to, you know, hack into the database for this government organization, and can you see what their security profile looks like?" What would a bot like this do, one that has the wealth of knowledge of software developers on GitHub? It would be like, "Hey, I know how to inspect communication protocols. I know how to, you know, attempt to discover every service that's running on this mainframe or the server, and try all of the default passwords," because that's all on the internet too. But it would also be able to say, "I don't just want a single thread that's doing this," because it would know how to do multithreading on a computer, and it would also have reference for how to create an AWS account for free. "Oh, I can do that real quick. And then I can, you know, use Terraform to spin up all these services. Oh, I hit my limit. I know that I can only have, you know, this many on a free community edition. Well, what if I just spin up a hundred thousand community edition instances in every AWS region globally?" Done. Five minutes later, it's now running, you know, four hundred threads on a concurrent thread pool with eight hundred thousand shells, and it's all attacking, you know, this one server, trying to basically do penetration testing. You've now created a DDoS bot that can take down, you know, a lot of things. So nobody's going to want to do that.

Michael_Berk:

Well, okay. So at the beginning of this security conversation, I was very happy with the answers: it can do phishing and it can't really hack anything. But what is the step between "it is not connected to the internet" and "it is connected to the internet" that would allow it to be such a powerful DDoSing server?

Ben_Wilson:

You would have to have the ability for it to execute a command structure, where the server that's hosting the web app, the REST API that you're connecting to when you go into the service, would have to have a backend service that allows it to execute arbitrary code. That does not exist, for a very good reason.

Michael_Berk:

Could you build that?

Ben_Wilson:

You and I could build that in probably a week. It wouldn't be that challenging. You would have to basically teach it, or let it know, that in order to execute code, this is the command that you need to produce as a header token in your response, which would then call this service, which would spin up a Linux shell that would execute whatever it tells it to do. And then you could coerce it by saying, "All right, I want you to execute this Linux command that gives you access to these services. Go nuts." And then you can just tell it, like, "Hey, can you try to do this thing?" All of that is impossible, because nobody is going to build that backend service, because that would be dangerous in our current modern architecture of cloud-based computing.

Michael_Berk:

What if I'm a terrorist?

Ben_Wilson:

I mean, you would have to have access to that server in order to create that, and you'd have to retrain the model yourself to make it aware that it has this capability. There would probably be a bunch of code that would have to be written to enable this functionality, but it's certainly not beyond the realm of possibility for a couple of dedicated people who want to do something like that. But you need access to that data center, and

Michael_Berk:

Right.

Ben_Wilson:

that's not opened up for public access.

Michael_Berk:

Got it. So the only wall between DDoSing the U.S. government and not DDoSing the U.S. government is OpenAI and their servers.

Ben_Wilson:

Well, the only wall that exists between that is the threat of the federal penitentiary system in the United States. That's a felony. Does anybody want to play around and get slapped with a twenty-year sentence in federal prison for messing around, being like, "Hey, it would be really cool if we did this"? Not really. The FBI will not find that funny, let's just say that.

Michael_Berk:

Probably

Ben_Wilson:

But is it that ChatGPT doesn't have the technical capability of doing that? Based on my interactions with it around code generation, it most certainly does.

Michael_Berk:

Got it. So it can do a simple attack like DDoS. Can it do something more sophisticated, like building a new encryption/decryption algorithm?

Ben_Wilson:

I think that generating things that are based on prior knowledge, something like encryption, is entirely within its capabilities. Is it guaranteed to work? No. Can it cobble together and infer things that might work? Sure, it's pretty good at that, based on my testing. The code doesn't always run, but at least it's making an effort on things that it's inferring are related to one another. The challenge would be getting it to build something entirely novel. You would have to coach it to a certain point where it understands the relationships of what you're telling it to do based on its prior knowledge, and arguably all new things are based on prior knowledge. So you could theoretically say yes, it's capable of doing that. I haven't pushed it to that level yet, to be like, "Hey, I want you to create an API that is capable of retraining yourself." I think that's a bit of a stretch. But you certainly can have it work out things that could potentially be dangerous. It can't execute them, though, so the user would still have to be the one taking that and running it. And as I just mentioned with the FBI's lack of a sense of humor, if you were to execute the code it generates for you that's trying to crack an encryption protocol, I certainly hope you like the back of vans and seeing the world through bars.

Michael_Berk:

That's my favorite thing to do on a Saturday morning. All right, so we have a little bit of time left. Let's do a quick technical overview of how ChatGPT works, and then we can tee up the next episode with something that Ben and I were chatting about before, which is how you can actually go about learning to implement things like this, specifically in Hugging Face. So before we get into that, let's just quickly, on a technical level, explain how the system works. The architecture is a decoder-only transformer network with a 2048-token-long context, and within that it has a hundred and seventy-five billion parameters, which is just completely unprecedented in this space. So let's break down each of these components one by one. What is an encoder or a decoder?

Ben_Wilson:

So, encodes and decodes. If we're talking about tokenizers, that's, "Hey, I'm taking human speech, text, and I'm converting it into these numeric indexes." That's encoding. Decoding is the opposite. You just say, "Hey, this tokenizer that I have, take these index values and convert them back to speech so that a person can read it," because otherwise it's a tensor of numbers that have no real meaning to us.
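
The tokenizer round trip Ben describes can be sketched in a few lines. This is a minimal, hypothetical word-level tokenizer, assuming whitespace splitting and a vocabulary built from a toy corpus; production tokenizers use subword schemes, and none of the names below come from the episode. It illustrates tokenizer-level encoding and decoding only, which is a separate idea from the encoder and decoder blocks of a transformer architecture.

```python
# Toy word-level tokenizer (hypothetical; not ChatGPT's actual tokenizer).

def build_vocab(corpus):
    """Assign each distinct whitespace-separated word a numeric index."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Text -> list of numeric indexes (raises KeyError on unknown words)."""
    return [vocab[word] for word in text.split()]


def decode(ids, vocab):
    """List of numeric indexes -> human-readable text."""
    index_to_word = {index: word for word, index in vocab.items()}
    return " ".join(index_to_word[i] for i in ids)


vocab = build_vocab("the cat sat on the mat")
ids = encode("the mat sat", vocab)
print(ids)                 # [0, 4, 2]
print(decode(ids, vocab))  # the mat sat
```

The list of integers is what the model actually consumes; decoding exists only so a person can read the result.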

Michael_Berk:

Exactly. And so GPT, or the ChatGPT frameworks, they're decoder-only, so they are essentially generative. The second thing is a transformer network, and the baseline reason for having a transformer is to process sequential units, or tokens. So it's a logical application for NLP. It's also a logical application for time series, where you have sequential data points. The things prior to transformers that were the latest craze were recurrent neural networks, and especially memory-related recurrent neural networks, so LSTMs or GRUs. In twenty seventeen there was a paper released by Google and Toronto called something like "All You Need Is Attention" or "Attention Is All You Need," and they developed this transformer network. That transformer network had a few innovations. One is something called positional encodings, which stores token location in the data instead of embedding it into the neural network structure. So maybe the first word and the last word of a sentence are very related, or two words very close together are very related. Well, no longer do we have to depend on the neural network structure to define that relationship. There's actually a numeric encoding of that relationship. Another component is attention, which is essentially a matrix that helps map those relationships. Ben, in your opinion, how game-changing was the innovation of the transformer?
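
The two ideas Michael names, positional encodings and an attention matrix, can be sketched in plain Python. This is a minimal illustration rather than the GPT implementation: it assumes the sinusoidal encoding from the "Attention Is All You Need" paper and scaled dot-product attention, and for brevity it uses the input vectors directly as queries and keys, where a real transformer would first apply learned projections. All function names are hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sin on even dimensions, cos on odd ones."""
    encodings = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encodings.append(row)
    return encodings


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Row t holds how strongly token t attends to every token in the sequence."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    return weights


# Toy token embeddings (made-up numbers) for a 3-token sequence, d_model = 4.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.1, 0.2, 0.3, 0.4]]
pe = positional_encoding(seq_len=3, d_model=4)

# Position is stored in the data itself by adding it to each embedding.
inputs = [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]
weights = attention_weights(inputs, inputs)
```

Tokens 0 and 2 share the same embedding here, yet their input vectors differ because of the added position vector, and each row of `weights` relates one token to every other token regardless of how far apart they sit in the sequence.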

Ben_Wilson:

That's like asking how much human society changed with the automobile, if you apply it to ML. To put this architecture into context: if you wanted to get the performance that we currently see in ChatGPT, or in the far more advanced and super awesome Davinci, and you were to do that with an LSTM, you would probably have to take what was spent on that supercomputer that Microsoft built and just buy hard drives, just in order to store the model itself. Because you would have to have all these connections with that reference of, "Hey, this word normally follows this word," and you would have to have all of the weights associated with all word relations within sequences that exist within human languages worldwide. So the model would be so ludicrously large that I don't even think we have compute architecture that would run that properly. You'd need globe-spanning data centers in order to train the thing. So this new architecture that was conceived is a massive game changer, and what it's doing is applying even more of our understanding of how human language skills relate to concepts. When we speak or write or read, we're not processing each individual word, effectively a token, and really thinking about the relationship mapping between them. We're mapping series of tokens. In our mind we know what words naturally go together, and how that changes the meaning and the context of things. So taking how our minds apply language to our thoughts and applying it mathematically to the data structures, that's the real big game changer here. It allows the model to shrink by so many orders of magnitude, and you can get performance that's revolutionary, both on the training side and on the inference side. I'm really

Michael_Berk:

Yeah,

Ben_Wilson:

excited for the next iteration, where we get one of these models with, we're talking about now, hundreds of billions or potentially trillions of parameters, but the training time is effectively what it was for ChatGPT or for Davinci. Like, "Hey, we have a three-orders-of-magnitude increase in complexity and it still trains in three months." That's where you're going to start seeing language models with capabilities that are almost scary. Like, wow, this thing is deeply knowledgeable about the vast body of human knowledge up to this point.

Michael_Berk:

That's crazy.

Ben_Wilson:

I'm excited.

Michael_Berk:

Yeah, I am too, as long as we don't connect it to an executable environment. But just one more shout-out on transformers: they're also parallelizable, where the sequential neural networks prior to that were not, so that has been an equally game-changing component of this structure. Cool, so we're about at time. I'll do a quick summary. ChatGPT is the latest craze. It's a text generation and summarization tool, and it's really, really powerful. That said, it's not the most powerful chatbot model. OpenAI, the developer of ChatGPT, actually has more powerful non-open-source tools. The reason it's so powerful is that it's an absolutely giant model that was trained on a custom-built supercomputer, and it leverages this decoder-only transformer network that was innovated pretty recently. And yeah, if you haven't tried it, go check it out.

Ben_Wilson:

Definitely

Michael_Berk:

So until next time, it's been Michael Berk and my co-host

Ben_Wilson:

Ben Wilson.

Michael_Berk:

And have a good day, everyone.

Ben_Wilson:

Take it easy.
