Challenges for LLM Implementation - ML 126
In today's episode, we speak with Anand Das, the CTO and co-founder of bito.ai, an LLM-powered code assistant. Expect to learn about managing LLM context, keeping LLMs up-to-date, common user pitfalls, and much more!
Special Guests:
Anand Das
Show Notes
In today's episode, we speak with Anand Das, the CTO and co-founder of bito.ai, an LLM-powered code assistant. Expect to learn about managing LLM context, keeping LLMs up-to-date, common user pitfalls, and much more!
On YouTube
Sponsors
- Chuck's Resume Template
- Developer Book Club starting
- Become a Top 1% Dev with a Top End Devs Membership
Socials
Transcript
Michael (00:01.558)
Welcome back to another episode of Adventures in Machine Learning. I'm one of your hosts, Michael Burke, and I do data engineering and machine learning at Databricks and I'm joined by my terrific cohost.
Ben (00:12.349)
Ben Wilson, I troubleshoot integration test failures at Databricks.
Michael (00:17.834)
And today we have an episode that I'm very, very excited about. Let me explain why. So today we have Anand Das and he entered the workforce sort of as a software engineer, focusing on high performance compute. Um, but then he realized he enjoyed being a CTO. So he did, he did some CTO positions and the first CTO position that he took was for Pubmatic, which is a sell side ad auction product and.
For those of you that aren't aware, ad tech is sort of one of the most complex and data-intensive industries out there. Uh, I worked adjacent to it. And for context, Pubmatic manages over 300 billion ad auctions a day that generate over a hundred terabytes of data. So I mean, that scale is just ludicrous. Currently he's the co-founder and CTO of Bito, which is essentially an LLM-powered code assistant.
And the tool currently has around a hundred thousand developers and so on. And I was wondering if you could sort of explain some of the challenges that you're looking to solve with this tool.
Anand Das (01:22.567)
So Michael, generally the LLMs that are available right now, they answer questions based on generic data, right? Which is okay if you're solving simple problems. And even if you're solving complex problems, it gives you generic code. But for a developer who's working within an organization or a project which already exists, there is already existing code. So even if you get generic code, you spend more than 50% of the time taking that code and fitting it into the...
project that you've built, with the modularization that you need. So that basically sucks. Some people actually start using it, and then over a period of time, they reduce their usage of AI. So what we want to do is utilize the context of the code that you already have: using the LLMs which are already available, provide them the context of the code that you already have,
and then generate code so that rather than spending 50% or more, you're spending 5% to 10% to get this code up and running within your project. Because then it will be really useful. The other thing that we're trying to do is there are simple things that people don't have time to do. We all talk about unit test cases. We all talk about having pipelines which actually look at whether the code is optimal or not, like code reviews. Right?
Most of the time, code reviews are pending because people don't have time. And then you get "looks good to me" kind of comments, right? And then you get bugs in production, right? And nobody to blame, but people just don't have time. So we can utilize LLMs out there with some training to actually do automated reviews so that at least people will get quick feedback. And it also removes the human bias, right? You'll see that
some guy made a mistake once. And then one senior developer is like, this guy's in my crosshairs. Anything he does, I'm going to scrutinize it. But when you have LLMs, they're looking at code. They're not looking at people. They are giving review comments on the code. And you can actually automate a bunch of things, like generating test cases for code. But when you give existing code, you basically give some context. And then you say, this is what I want to try to do,
Anand Das (03:44.615)
get me the test cases, and then based on the test cases, you can generate unit test cases. Obviously, doing all of this at a high level seems easy, but LLMs are not that easy to integrate. So we're building tools which actually do that, so that people can take their ideas and automate stuff that they usually don't want to do, and automation takes care of it rather than humans. So that's where we are going. Yeah.
Michael (04:11.026)
No, as I said, this is sort of a very interesting time for LLMs. As we've discussed a few times on the show, uh, basically LLMs can do a lot of simple tasks really well, but they struggle to handle complexity. And before we started recording, we were actually talking about an issue that you're currently facing that's not perfectly solved right now. Um, and that is sort of managing context length. So.
Can you explain how that can be a challenge with LLMs and how you guys are approaching solving that?
Anand Das (04:41.367)
Yeah, I wouldn't say that we have solved it completely, to be very frank, right? Because there is always a limit on the number of tokens per request that you make to an LLM, right? And depending upon the LLMs that you're using, like if you use OpenAI, the max can go to 32K tokens. And if you basically say three characters per token, which can be off the mark because sometimes two characters also make a token, you don't have a large set of code that you can provide, right?
And although we say all the code that is written right now, or in the last two to three years, is very modular, there is a lot of code that people are still working with which is monolithic. So a single file will not fit into your request. So out there, the challenge is for coding, not for English text. English text is different; you can summarize and then give context. But for code, you cannot basically just chunk it
by saying I'll chunk like four lines, five lines, and then get embeddings, and then get the query, and then figure out the nearest neighbors or cosine similarity, and get the chunks, and then just give those chunks, and then ask the question. The reason for that is you might just get like four lines of the function because that is what matches. And then you leave the portion in between, and you might get the ending of the function because that is what matches in the index that you create using embeddings.
Context management is a bigger problem. So what you basically need to do is figure out needles in the haystack, and then figure out, based on that needle, the surface area around it, which you should pass as the context. So if you get a function, ideally, if you're within a function, this function might be useful. OK? Now, because this function is here, somebody is calling this function, and this function is calling something else. Right?
Getting that piece of information, if you're modifying that particular function, will actually help not only in giving you the answer for modification, but then telling you that this modification might break these things, or you might have to go and modify these things. Otherwise you'll get the answer, okay, modify this function, but then you modify and start testing, and it's not gonna work properly. And this is severely limited by the context, because the code is huge.
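To put a rough number on that token budget: here is a minimal sketch, assuming the tiktoken tokenizer and the 32K request limit Anand mentions, of checking whether a source file even fits in a single request.

```python
# Minimal sketch (not Bito's actual code): check whether a file, plus room for
# the question, fits under a model's per-request token limit.
# Assumes the `tiktoken` tokenizer library; 32K matches the limit mentioned above.
import tiktoken

MAX_TOKENS = 32_000  # model-dependent per-request limit

def fits_in_context(path: str, reserved_for_question: int = 1_000) -> bool:
    """True if the file plus a question budget fits under the token limit."""
    enc = tiktoken.get_encoding("cl100k_base")
    with open(path, encoding="utf-8") as f:
        source = f.read()
    return len(enc.encode(source)) + reserved_for_question <= MAX_TOKENS
```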
Michael (07:00.059)
Right. Yeah.
Anand Das (07:06.139)
Even if you get the context right, the whole context may not fit into the token limit that the AI takes. So then what you need to do is actually get a portion of the answer, take that whole thing, and summarize that context. When you're summarizing, make sure that the relevant portions of the code which need to be complete are not summarized; they're given as is. And then continue the output. That looks very easy at the high level.
But when you actually run it in production, you kind of see issues. And then you have to kind of tune things around. And then you have to go language-specific also, because in JavaScript, you can have anonymous functions, right? And you can write REST APIs. C and C++, which are very structured, make it easier. So you need to basically figure out, like, you need to understand the language grammar so that you can generate context which is more effective.
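One language-aware way to do that in Python, sketched here as an illustration rather than Bito's actual method, is to let the standard ast module hand back complete function definitions, so retrieval always works with whole blocks instead of fragments.

```python
# Illustrative sketch: use Python's standard `ast` module to extract complete
# function blocks, so the context you retrieve is never a torn-off fragment.
# This is one possible approach, not necessarily what Bito does internally.
import ast

def complete_function_blocks(source: str) -> list[str]:
    tree = ast.parse(source)
    blocks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            blocks.append(ast.get_source_segment(source, node))
    return blocks
```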
You can start with brute force, which is similar to treating a code file as a text file and then chunking it based on characters. That will get you decent results, not like way off, but at the same point in time, it's not going to be highly accurate and complete over a period of time. So those are certain things that we are kind of doing internally. And I wouldn't say we're done; we are, I would say, getting good results, but we are 20% there,
like 80% of work still remaining to get it to the right level.
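For contrast, the brute-force baseline Anand describes looks roughly like this: treat the code as plain text and cut fixed-size chunks with some overlap. The sizes here are illustrative, not Bito's settings.

```python
# Sketch of the brute-force baseline: fixed-size character chunks with overlap,
# so adjacent chunks share some context. A boundary can still fall mid-function,
# which is exactly the failure mode described above.
def chunk_text(source: str, chunk_chars: int = 1_200, overlap: int = 200) -> list[str]:
    chunks = []
    step = chunk_chars - overlap
    for start in range(0, len(source), step):
        chunks.append(source[start:start + chunk_chars])
    return chunks
```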
Michael (08:39.226)
Yeah. And for those of you who are not super familiar with this problem, or even the LLM space: context length refers to, basically, if you have an active session, how much information can you feed into a single prompt? And obviously you can't feed everything; that's what the training is for. And so managing the size of that, especially, for instance, in chats, as context length grows over time, that's a real challenge, and I've run into that a couple of times actually with projects at work, and, uh,
I think your solution is fascinating sort of starting with a local area and then searching around in that file and returning what you think will be relevant results. How do you determine what additional text is relevant versus not?
Anand Das (09:23.551)
So if you're within a particular function or a module, then you try to basically see how much of that module you should give. If the context generated is already huge, then you basically limit yourself to the function itself, or that particular module, maybe a class, maybe an interface, maybe a module. Then if there is space remaining, you kind of look at who's calling it, where it is instantiated, and then give those code blocks. So you go from the center and then kind of...
spread out. And you spread out until the amount of context available is used up. And the input actually has the question and the context that you're giving. So if the question is large, your context reduces. So then if you want to actually answer that question properly, you might have to break down the question into multiple parts and then give context for each part of the question and get the answer, like divide and conquer: get multiple answers, and then combine them together
Anand Das (10:23.307)
to try to get the relevant results. Again, the LLM will not always give you the right results, so you have to then tune for hallucination and the other parameters out there. So yeah, a lot of things go on. And as I said, the language-specific thing is also required, because you can do brute force with this, right? Before going to language or anything, you just say that if I get a hit somewhere, let me actually get four or five chunks
around it, right? Two above, two below, in sequence. And if you look at LlamaIndex and the other guys, they have chunk overlap, right, so that there's some continuation. And if you hit one area, you have something surrounding it. But if the surrounding thing is not complete, you get funny results, right? You might get a function which does something, and then you're like, it shouldn't do this, because this is what I do out here.
But it gave you that answer because it only had a portion of the function, not the complete function. So it just hallucinated about what the remaining part was. So that is what you are trying to avoid, give a complete block which is relevant out there. And out there, language-specific thing helps you to actually get a language-specific block which is concise and required.
rather than a portion of it.
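The "go from the center and spread out" idea from a moment ago can be sketched as a simple budgeted loop. count_tokens, get_callers, and get_callees are hypothetical helpers standing in for a tokenizer and a code index; this is an illustration, not Bito's implementation.

```python
# Hedged sketch of center-out context assembly: start with the target function,
# then add callers and callees until the token budget is spent.
# `count_tokens`, `get_callers`, and `get_callees` are hypothetical helpers.
def build_context(target_fn: str, budget: int,
                  count_tokens, get_callers, get_callees) -> list[str]:
    context, used = [target_fn], count_tokens(target_fn)
    for block in get_callers(target_fn) + get_callees(target_fn):
        cost = count_tokens(block)
        if used + cost > budget:  # stop spreading once the budget is used up
            break
        context.append(block)
        used += cost
    return context
```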
Michael (11:50.458)
And then how have you seen other sort of companies or just engineers in general solve this context length problem?
Ben (12:00.597)
I don't know if anybody has a perfect solution for it right now. Um, I can speak to it, I mean; while you've been talking, I've been writing down a bunch of questions I want to ask you. Uh, but the context awareness is something that we use GPT for, like, a lot. Uh, not like, hey, write all my code for me; clearly that doesn't work. But we use it for
validation, because we're building tools that interface with that system and some of these other providers. And we basically want to understand what the capabilities of these things are, so that the APIs that we're building kind of do things that help to solve the problem that somebody's trying to solve, like pre-crafting a prompt for somebody for a task like, hey, I have these two functions, please write a unit test for me using this framework.
You know, so prompt engineering type stuff, and then test that out and see what the results look like. And...
I think when you're just testing something out for the first time and playing around with it, you're going to do very simple, hello world style things. That's what a lot of people I think did the first few months of these things being out there. Like, oh, look, it can write this test for me. It's so awesome. It's like, yeah, but that wasn't really complex. It's not really real world. So we're doing real world stuff. We're saying like, hey, I've got an entire module I'm going to try to paste in block by block.
Anand Das (13:27.665)
Yeah.
Ben (13:37.425)
And I'm going to give it a very specific instruction. Like I want you to create a new API endpoint for me based on the context that I'm giving you. And it must adhere to these, like do these things, but also not do these other things. And, you know, wait for that to generate and then say, my next prompt is I would like testing to be conducted using the PyTest framework, I want unit test mock patches to be done.
so that I'm not calling this other thing, and I need parameterization of at least 15 elements to validate that this works correctly. Provided that you craft that prompt, it does pretty well. But one thing that we've noticed is there's almost a proliferation in context-aware prompt engineering, where if you're giving it context of garbage, it starts to proliferate that garbage. So like,
We've intentionally sent bad coding practices into it and said, hey, I want something in this style. And it seems on a new, clean session, it'll fix code, but the longer that the context session goes on, if you keep on giving it junk, it starts producing bad code, like unreadable nonsense, like overly complex things where it's like, yeah, the computer can understand that. But if I ask my buddy,
Hey, can you tell me what this code does? They're sitting there for like two or three minutes, just reverse engineering it in their mind. What are your thoughts on that side effect of people becoming more dependent on LLM-based or any sort of just code gen systems where the source that you're providing these engines might not be optimal?
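For readers who want to see the shape of what Ben's prompt asks for, here is an illustrative example of the kind of test it might yield: pytest parameterization plus a unittest.mock patch so the external call is never hit. The function names are hypothetical stand-ins, not real APIs.

```python
# Illustrative only: the rough shape of a test the prompt above might produce.
# `charge` and `process_payment` are hypothetical stand-ins, not real APIs.
from unittest import mock
import pytest

def charge(amount, currency):           # stand-in for an external call
    raise RuntimeError("should be mocked in tests")

def process_payment(amount, currency):  # hypothetical function under test
    if amount <= 0:
        raise ValueError("amount must be positive")
    return charge(amount, currency)

@pytest.mark.parametrize("amount,currency", [(10, "USD"), (250, "EUR"), (1, "GBP")])
def test_process_payment_ok(amount, currency):
    with mock.patch(f"{__name__}.charge", return_value={"status": "ok"}):
        assert process_payment(amount, currency)["status"] == "ok"

@pytest.mark.parametrize("amount", [0, -5])
def test_process_payment_rejects_bad_amount(amount):
    with pytest.raises(ValueError):
        process_payment(amount, "USD")
```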
Anand Das (15:29.935)
Yeah, I think that problem is huge, right? Because we are seeing the way people are using it. And obviously, we don't track anything on our side. But sometimes when we are kind of like helping users when they get into situations or like, hey, this is not working, the way we see people working is they'll ask AI to generate some piece of code, right? Then they'll try it out. It doesn't work.
then they will basically just copy paste the stuff, saying that the code is there because the context is there. Then they'll say, here's the error that I got. Then the AI will suggest, OK, go and modify these things in the code. So they just keep repeating it. Over a period of time, they are copy pasting. They have forgotten that they're coding. They have forgotten that this is what I was trying to do. And they get into this scenario as like, I have to ask the question, and this guy is going to answer, and I'm just going to try this answer.
Early on, if somebody starts using AI, they are not learning programming. They are basically going through this. I have to ask a question, and I have to verify whether the answer is correct. If the answer is not correct, I'll basically say that, OK, this is the issue, and then get me the next answer. And some people believe that is programming, which is not. On the other side, the problem that you mentioned, like garbage in garbage out. That is like, you know.
like well-known stuff. So if you have bad code, it will seep in. Forget about the bad code: if you basically have something mentioned in the prompt which is not right, that also gets carried forward in the context. Because what it is doing is referring to what you have given before; that is its knowledge base. It might use something from what the model learned, but the bigger focus is on what you have provided. So ideally,
what we suggest to people is: before you generate test cases and stuff, or before you modify the code, try to run some checks on the code. Like, is it performant? Does it have any modernization issues? Should we refactor it? If you can do that, then do that before you run the other stuff. Otherwise it is going to go and provide you code in the same
Anand Das (17:53.727)
fashion. And, as you said, it percolates down, so the code will get worse and worse. LLMs are very good at repeating the mistakes that you have made. And yeah. So humans have to take care. We haven't automated that as of yet, because that's also a philosophical problem.
Ben (18:07.389)
Yes, can confirm.
One thing that I...
Anand Das (18:23.015)
Why I say philosophical, not technical, is because you can do that, but some people don't want it. Like, don't touch my existing code, please, all right? And I want to keep it the same, out of fear, or, you know, lack of resources to actually manage the change; maybe that's the case. But some people are like, I don't want the existing code to change, I want everything to fit in, and there are people who are okay with ripping and replacing; those are fine with, you know, doing those things. But
in practicality, I've seen that happening only for new code that is being written. For any existing code, like in an organization, or anybody who's basically working as a contractor, they say, I'm not going to touch existing code. I'm just going to make the new code and somehow fit it in into what was working. So yeah, that's the thing. Yeah.
Ben (19:12.069)
Yeah, one thing that, anecdotally, I noticed that is related to what we're talking about: in some of the earlier versions of OpenAI's APIs, you could start prompting and generating code where it seemed like it was just following instructions. So one of the games I played with it was
trying to get efficient comprehension of a collection and saying, hey, I have this list of this size and it has these nested components in it. And eventually get it to the point where it's following along with what I'm explaining to it. And I'm trying to intentionally be vague at first. But the point of it was, can I get this thing to create a tail recursive loop that will crash a computer?
you know, antagonistic code that if somebody were to execute it, it's going to brick their computer. If you just ask for that on like GPT 3.5 turbo, it wouldn't do it. It'd say, I don't recommend doing this. First generation stuff back in November last year, it would just produce, oh, you want to crash a computer. Here you go. Here's the code. And now there's safeguards and protections against that. But 3.5 could only...
I eventually got it to generate some fairly, not truly malicious code, but sort of trolling code, to the point where you execute it and, you know, it crashes Python because of recursion limits. And then other things like just filling up, you know, memory allocation space and just creating so many objects that Python just dies, the kernel crashes.
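The recursion case is the tamer of the two: in pure Python, unbounded recursion does not actually take down the machine, it trips the interpreter's recursion limit and raises RecursionError, roughly like this.

```python
# What "crashing Python with recursion" usually looks like in practice:
# the interpreter's recursion limit (about 1000 frames by default) is hit
# and a RecursionError is raised rather than the machine going down.
import sys

def recurse(n: int = 0) -> int:
    return recurse(n + 1)

if __name__ == "__main__":
    print("recursion limit:", sys.getrecursionlimit())
    try:
        recurse()
    except RecursionError:
        print("hit the recursion limit, as expected")
```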
Repeating that same exercise with GPT-4, when it got to a certain point where it kind of grokked what I was trying to get it to do, it stopped. It was like, I know what you're trying to do here. I don't recommend you do this and I'm not going to give you the code to do that. And I was like, man, that's clever. Great job on them for catching that. However they had to train that, I don't know. But whatever that mixture of experts is, I don't know if there's a gatekeeper expert that's in there. It's like...
Anand Das (21:29.818)
Yeah.
Ben (21:41.361)
What is this code actually trying to do? Uh, do you see any sort of dangers out there of somebody trying to do something similar to that for sort of zero day operations where, you know, it's, these things are capable of generating relatively sophisticated code. And if you have an idea that you understand the inner workings of an operating system or a kernel to say, Hey, I wanted the ability to get root access here, can this thing.
do it in a way that it doesn't have reference in its training data for what I'm actually trying to do.
Anand Das (22:17.095)
I would say for all these models which are supported by bigger organizations, like OpenAI, Anthropic, Amazon, Google, they will put, I'd say, restrictions on top of the model and also train the model not to give it. But I'm not too sure about all the open source or other models that are available out there, because that requires a good amount of work. And as you rightly pointed out,
previously, there was a guy who was using a prompt on our side, like, give me five ways to break the code. So now, break the code means multiple things. From his perspective, he was trying to break the code so that he could basically write test cases. So he's like, give me five ways to break the code. And the next prompt was, generate test cases for this. Suddenly, one day, it stopped working. Because break the code means you might be breaking a code,
which can be an access key, right? Or something like that. So it just stopped. So we had to change the question to: give me five ways to test this code, but what I'm looking for is negative test cases, right? But the fun part was, sometimes "five ways to break the code" was working, and sometimes it was not working, which also meant that they were updating the model at that point in time; it hadn't seeped through yet, right?
Ben (23:17.213)
Mm-hmm.
Ben (23:28.541)
Yes.
Anand Das (23:43.747)
And it also depended upon the context, because they try to identify what you mean by the request. So they started doing that first. And then the other thing is you can change the question. So if you're just looking at the question, or what the request is, you can prevent that request from happening. But I can change the request so that I don't let out the intent, but I'm saying: try it. Give instructions. Do this, and this,
whose output might lead to something which is wrong. And now some of the algorithms have started putting restrictions on the output that is generated, so they verify the output. But again, these guys are, like, OpenAI, or any massively used models; they have to do that because the risk is higher of them allowing somebody to break something. But in open source, it's not like open source models are bad or something like that.
But with an open source model, people will have to take extra effort, like whoever is maintaining it, to add that layer in. Even today, I think on some website or something, somebody had mentioned that, hey, there is this model, there are no restrictions, there's no check, you can actually give it questions and it generates code, right? Like the dark side, or the dark web, as I call it. There are some models still there
Ben (25:05.033)
He he.
Anand Das (25:09.807)
you know, which basically have no restrictions, and you ask them anything and they're going to answer. But as an industry, I would say that we need to basically work towards making sure that doesn't happen, because it's going to kill us at some point.
Ben (25:25.297)
Yeah. I mean, imagine what it's going to look like three, four years from now, assuming we're still on the Transformers architecture, which we probably won't be, with how many great minds are working in this field now. But what happens when we have, you know, a 10 trillion parameter model and it's been trained on so much code
Anand Das (25:35.291)
Yeah.
Ben (25:54.537)
that nobody's really capable of doing QA on the input training data.
And something like that is just provided for, for anybody to use. It's like, Hey, so long as you have this, you know, whatever it might be, 800 gigabyte, you know, GPU instance that you can load this model onto, uh, four years from now, provided you can, you can start it up and run inference against it. It can do stuff like, you know, you ask it, Hey, how do I.
You know, I have this particular, you know, control software. You don't tell it where, where it's installed. You just say, Hey, it, you know, this esoteric programming language that's used for controlling devices. And I, I need an instruction set to troubleshoot and interface with this system and it's like, Oh, I, well, I do have reference to that language and it's
You know, it utilizes profibus connections and here's the instruction set to gain access to it. And then here's some code to run this diagnostic that you want. And nobody realizes that this person has jumped over a fence or cut through a chain link fence at an electrical power substation. And they plug in their connection cable and are sending instruction code to this thing that then takes out a large part of the power grid in a country.
To get that knowledge prior to one of these tools, you would need a decade of experience of working with that type of stuff. And while you're acquiring that decade of experience, there's probably government security involved in making sure that you're not a terrible human being, in order to be able to be exposed to that. So from my perspective, stuff like that is kind of scary with some of this stuff.
Ben (28:01.767)
Uh, if there aren't security controls built into the model architecture itself, and that being a standard, I think we're going to see stuff like that in the next decade.
Anand Das (28:12.379)
Yes, absolutely. Totally agree. And right now, all the big guys, they have a reputation as well as security to maintain, so they have to do it. But those who don't have to do it, they're not wrong either, right? Researchers are basically publishing things for the good. They don't have a huge amount of time to actually put these restrictions in place. But if there is a framework on top of which they build, which actually provides these
restrictions, then it makes it easy. But again, as a programmer, what I think is: today we have hackers, both white hat and black hat. If you have something which is open source, and there is a module which is basically adding these restrictions, somebody who has a bad mind can actually take that code, modify it, and still try to do what it is not supposed to do. So there are
two schools of thought, right? Like, should you have it as open source, or what part of it should be open source and what part should not be? But at the same point in time, it's code. So somebody can actually, if they want to, at some point in time take away the security aspect of it. And that is scary. That is scary, yeah, to be very frank. So
we need more good human beings, and we need to have checks, I would say, because anything in code can be changed by somebody who knows how to code. And that's the issue out here, if we have some bad guy who has a bad mind. Like, yeah.
Michael (29:56.774)
Yeah. Oh yeah. So I have a question. Uh, Ben and I were chatting about this yesterday and I would love your take on it. One of the things that we frequently run into with LLMs is keeping the models up to date. They're incredibly expensive to train, but having really recent information, especially when you're working on sort of cutting-edge tools or techniques, is really valuable. And I found that, for instance, I
was building a new Databricks tool and there's open source documentation, but maybe the documentation isn't great. So I'm sort of scratching my head: how can I actually build this thing? I go to ChatGPT and say, hey, do you even know what this is? And it just spits out nonsense because it was trained in 2021. So how do you think about keeping these models up to date, and specifically, how do you think about cleaning data and bringing that in? But also,
how can you retrain in a cheap manner, sort of at frequent intervals?
Anand Das (30:59.295)
So the thing is, we're using, I'd say, commodity models, OpenAI and stuff like that. The moment you train, your cost goes up. And we are cheap guys; I don't want the cost to go up. So we use a different mechanic. We believe you would want to train the model, and when you say training, from our perspective, it's to change the behavior of the model, how it is answering questions. Or if you write code in a certain way,
like you put a space before curly braces whenever you open one, those kinds of things. That's where I'd train the model. But if I wanted to use more recent information, then what we tend to do is, rather than depending upon the model to update its data, which costs anyway for OpenAI and the other guys, and they do it maybe quarterly, half-yearly, or yearly, what we try to do is, depending upon the
things that you're planning to do, like for example, you're using Angular right now: if I have Angular documentation which is properly indexed using embeddings, then provide that context when you're writing the Angular code, for things that have been modified from the version that the model is on. That's a quicker way. But you have to also modify the prompts to make sure that the hallucination is not there about
where to use what. So for example, if a particular thing is deprecated and a new thing has come up, then you have to mention that this was deprecated; instead of that, you can use these things. Otherwise, the answers are bad. But if you want to train, if you're training specifically, not on generic stuff but on your own code, so it's a limited data set, then it's an easier task. But at the same point in time,
if you want the training to be fast, you need to actually come up with the right question and answer set that you provide as the input. And that's a learning in itself. When we first started training a particular model, everybody said 1,500 inputs is good. We did 1,500 inputs. We figured out that the model basically didn't do anything different with the training that we had given.
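The documentation-retrieval approach Anand describes a bit earlier, embedding the framework docs and pulling the closest passages into the prompt with deprecations called out, can be sketched like this. The embed function is a hypothetical placeholder and the prompt wording is illustrative, not Bito's.

```python
# Hedged sketch of retrieval over indexed documentation. `embed` is a
# hypothetical embedding function (text -> vector); nothing here is Bito's code.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, doc_chunks: list[str], embed, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(doc_chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    docs = "\n---\n".join(passages)
    return (
        "Use only the documentation below. If an API is marked deprecated, "
        "use its replacement instead.\n\n"
        f"Documentation:\n{docs}\n\nTask: {question}"
    )
```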
Anand Das (33:25.787)
I had to give more than 100K inputs for it to start showing me results of training, right? Even then it was not correct. And then, you know, I had to give a huge number of questions, right? Then we kind of figured out, and this is different for different models, but some models are very good if you give them more negative examples. And some models are good if you give them a mix of positive and negative examples, to kind of train faster.
So in some cases, you say, do not do this. And in some cases, you say, this is OK, this is bad. But it differs from model to model, right? Although you're using the transformer model and stuff like that. And this you basically figure out after doing it a couple of times: OK, this is what works here, this is what works there. And ideally,
we would like to automate that, to be very frank. Portions of it are automated, but we don't have something which is like, OK, this is the model, let's try to train it and see which mechanics work. And these are the things that we have learned the hard way, doing it and then figuring out it doesn't work, and then figuring out what works. But there are other models out there that we haven't tested, so the mechanics might be different. And obviously, the 1,500 number, like "1,500 inputs is good enough to train", is
kind of a marketing input. Yeah, that's fake. And then, generating these questions, the other thing is you do need human inputs. You can generate the question and answer using AI itself. But, as Ben pointed out, you can feed wrong things also. So in training, although you might use AI to generate questions and answers, you need to go through a human review
Michael (34:53.479)
Noted.
Anand Das (35:16.631)
and put some checks and balances in place before those questions and answers are used. If you're doing a demo or smaller things, then go ahead with it. But if you're doing an enterprise thing, you'll have to put these things in place. Otherwise, it will work well in a demo, but when you're using it in practice, there are things that will break, or it's not basically giving you the type of answers that it should.
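As a rough illustration of the pipeline Anand describes, generated question-and-answer pairs, a mix of positive and negative examples, and a human-review gate before anything is used, a training file might be assembled along these lines. The chat-style JSONL layout follows the common OpenAI-style fine-tuning format; treat the field names as assumptions and adapt them to whatever model you actually train.

```python
# Sketch only: write out reviewed question/answer pairs as chat-style JSONL.
# The `human_reviewed` flag is a bookkeeping field used here to gate examples;
# it is stripped before writing, since it is not part of the upload format.
import json

def make_example(prompt: str, answer: str, human_reviewed: bool) -> dict:
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ],
        "human_reviewed": human_reviewed,
    }

def write_training_file(examples: list[dict], path: str) -> int:
    kept = [e for e in examples if e["human_reviewed"]]  # drop unreviewed items
    with open(path, "w", encoding="utf-8") as f:
        for e in kept:
            f.write(json.dumps({"messages": e["messages"]}) + "\n")
    return len(kept)
```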
Michael (35:42.13)
Yeah, it's a really interesting concept that sort of relates to drift detection, where if you're constantly retraining, how do you monitor whether the model will start hallucinating more or whether quality will change? And so obviously you need good training data each rerun, but, um, you can also just monitor that over time. So it's a really cool concept, applying drift detection to LLMs as well.
Anand Das (36:00.988)
Yeah.
Anand Das (36:05.711)
Yeah. And that has to be put in place. If you're doing it without that, then suddenly, not now, but six months down the line, you start seeing things where you're like, this shouldn't happen this way. It's like asking a kid to just learn on his own; you need guidelines so that he doesn't go down the wrong track. So that is where you need to basically put,
I'd say human review or some other mechanics. In some cases, you can automate it. In some cases, you need human inputs.
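One simple way to put such guardrails in code, sketched here as an assumption rather than anything Bito has shipped, is a fixed evaluation set that every retrained model is scored against, with an alert when quality drops past a tolerance.

```python
# Sketch of drift checking across retrains. `model` is any callable taking a
# prompt and returning an answer; `passes` is whatever check you trust
# (exact match, unit tests, a rubric). Both are placeholders.
def eval_pass_rate(model, eval_set: list[tuple]) -> float:
    passed = sum(1 for prompt, passes in eval_set if passes(model(prompt)))
    return passed / len(eval_set)

def regressed(new_rate: float, baseline_rate: float, tolerance: float = 0.05) -> bool:
    """True if the new model's pass rate dropped beyond the allowed tolerance."""
    return new_rate < baseline_rate - tolerance
```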
Ben (36:45.361)
Yeah, that's the really interesting thing that I've seen. Uh, you know, as some of the listeners who have been, like, tuned into the show for a long time know, I used to be a data scientist and build models and solutions and stuff at companies. Um, and when we were talking about NLP a decade ago and were using the old-school techniques, it's like, hey, I need all this training data to go into this model that can find, you know,
Ben (37:17.485)
word groupings and try to do auto completion with finding probability of next word. It was really slow, but you could get it to do certain things that were pretty clever. And you'd see, if I just look at these n-grams and then check out what is a potential probability for the next word, it could do stuff like clean up messy text.
Ben (37:45.757)
people, you know, there's grammatical errors in it, but there's a million entries. I don't want to sit there and rewrite these by hand. Perfect application for that. But when we were training stuff back then and even the precursor to Transformers talking about like LSTMs, you build something and...
So much time and effort went into the training data where you have to go in and manually clean a bunch of stuff and write a bunch of algorithms to figure out where there's problems in the data and cull certain things and identify it for usually the data scientists to go in and fix all this stuff. And then you have the sacrosanct data set that you then train on.
Ben (38:36.541)
A lot of people never did that before because they're just now coming to the party. They're like, hey, these things are easy. I can just, you know, use GPT-4 to generate my training data for fine-tuning. It's like, yeah, you really don't want to do that. Or they say, well, we can't scale, you know, creating a training data set. And the question that hits my mind is like, what do you mean you can't scale? Is it time?
And they're like, well, we don't have the people or the time to do that. Like, we didn't 10 years ago either, but that's why it took a year to get the model out, because it had 20 people working on it. This is not like a new problem. This is machine learning 101. Like, this is how you get good things: you put a lot of effort in. Do you see, or do you hear, people talking about that, specifically about what your company is working to address, where people are saying,
Anand Das (39:19.25)
Yeah.
Ben (39:34.353)
why, I have this esoteric programming language that, you know, the tool on its own doesn't really do well with. Like, oh, I need an LLM for Haskell. And you're like, okay, uh, we don't have a ton of training data on that. Or maybe you do, I don't know. But, um, do people want to see, like, a quick result from something like that? Do you get that feel from customers?
Anand Das (40:00.059)
I think, yeah, from customers: some of the customers are kind of new and haven't basically dirtied their hands by actually trying it out. They think that training should be like two days' worth of work, right? I'll give you all the data and somehow you fit it in. And people who know about it, they'll talk sense, saying, how are you going to actually use this much data? How many questions will you generate?
If I update my code and you want to train it, what is the lag time? Is it going to be a week, or two weeks, or a month? So people who know, and people who know are also people who have solved these problems before. Most of the other people are like, OK, this is new tech, and I basically asked it a question to generate some test data based on the code that I've given, and it gave me 50 records. Great. It can generate any question and
Ben (40:41.362)
Hehehe
Anand Das (40:53.887)
answer, right? Like, you ask 10 questions and then, you know, yeah, cool, that should be easy, right? You just run it in a loop. But the answers that are coming: do you know it can hallucinate? And what if the temperature value is 0.7 versus 0.2, right? You don't want the same question and answer repeating multiple times. Have you seen the output if I run it, like, a thousand times? Look at what kind of questions and answers you're getting.
And then sometimes you have to... people are so much into it, and they believe so much in all the Twitter stuff. Somebody is saying that I did this, and they're like, it's so easy. This guy wrote an open source piece of code and it does this. Now you have AutoGPT; I can give it a question and it solves it. Then you ask them, how many hours does it take to solve this complex problem? Because it's doing trial and error. Then you give them an example, like, okay, let me run
these questions and answers, and you see the questions and answers. And now that you look at it, do you think that it is as easy as just generating questions and answers and feeding the data? We can do that, but it's not the right thing to do. But sometimes you have to show and tell people, and then they understand. Because people are all in: this is going to change the world, and we just need to use it somehow. But people who are kind of real and have dealt with these problems before, they're like,
we understand: okay, training is there. Training is costly. You have to give a huge amount of data. How are you going to generate the data? How are you going to have, like, negative and positive? What do you think will work, and will this change if I'm using a different model? Because enterprises are different, right? They don't want, like, a model from OpenAI or this thing, because they're still thinking that it's not secure. And obviously, if they're using the APIs which are out there, by default, data is used for training.
So they want to take an open source model, train it, run it in-house. And when you do that, it also depends upon the model that you're using: how much data it already has, the kind of training input it can take, like the token limits and so on, because you have to make sure that the questions and answers that you give fit into that, right? That they're more complete, rather than you spreading them across, because then training is more effective. And you have to also figure out, like, one is doing the initial training, then you have to figure out,
Anand Das (43:17.667)
anytime things are modified, what is the differential learning that you give on a periodic basis? And how much time is it going to take? So there is going to be a cold start problem, and there is going to be repetitive training that goes on. Say that an enterprise buys a company which has a set of APIs and suddenly everybody has to use it internally.
Like, say, for authentication now: anytime you write code, you're not basically using some open source stuff; it's this authentication function that the company has bought, right, that you have to use. That is what has to show up in your code. That is, uh, something that will require a good amount of work. It's not as easy as that, but you have to show and tell some people who are, you know, blind believers, I'd say, because it holds promise, but, you know, like,
there's a lot of marketing and fluff, far from the reality. And those people who have done it before, they're like, no, it's not that easy. They clearly know it's not that easy. Yeah. So we see both.
Ben (44:26.913)
Yeah, can confirm. You know, I talk to a number of our customers who are trying to get into it, and I can tell within 30 seconds of jumping onto a meeting whether people have built stuff at scale with sufficient complexity in the ML space. I'm like, yep, these people know what they're talking about, because of the questions that they're asking and the comments they're making. They're like, yeah,
hey, GPT-4 is cool. We ran it through some paces to do this sort of problem. Okay, I'm talking to a team of actual data scientists and ML engineers and software engineers who know, in order to get the vision that the business wants, this is a two-year project to use this massive open source model, and they understand what's involved. There are other people that are just like, yeah, this thing's great.
Anand Das (45:14.814)
Yes.
Ben (45:25.893)
It can do no wrong. Like, I asked it this one question. It didn't have context. And yeah, I usually give them a little bit of homework and say,
test it out for what you think it's going to do, but start asking it questions in ways that are not generic, but are specific to your business, and see how well it does. And then they're like, oh yeah, it has no context. It doesn't understand what any of this stuff is. I'm like, because nobody outside of your company understands what this is. That acronym that you use, hopefully it's unique and it can just say, I don't know what this is. But if
that is some cool little marketing term that you took from English vernacular and it means something completely different, check to see how it responds to that. It'll think that you just had a typo, or you accidentally left the caps lock on while typing this in, and it will go absolutely insane with its response.
Anand Das (46:27.603)
Absolutely, the fun part is when people say, hey, training is easy and stuff, one fun thing that I ask them to do is take a piece of code and ask OpenAI to basically make it performant. Take the performant code and then again, ask the same question, make it performant and keep on going. Because it will always give you changes. And it never stops. And the fun part is if you're not...
Ben (46:46.025)
Mm-hmm.
Yes.
Anand Das (46:53.423)
If you're looking at it in detail, then you'll see that some of the code that was modified before for performance reasons has been re-modified, and then it'll go back to some things that it has told you before in the next one. And it just keeps on going in an infinite loop. The reason for that is it thinks that it needs to answer. So if you don't train it properly, it's never gonna stop. And then once they go through it, then they're like, oh, okay. Because, you know, it doesn't stop in one place, and even if you regenerate, it is gonna regenerate different answers.
Then you ask them, what do you expect it to do? Give you something which can stick around. Or sometimes you're writing a new piece of code. So you want to look at what are the different ways in which you can do this same stuff. But then hallucination can creep in. There is no solution for it right now. Once in a while, it will creep in. Whatever prompt you give, when you say, don't use this, use this only, it's going to creep in, that we have seen. So there are a lot of challenges. As you said.
Anand Das (47:54.272)
It's not easy. But people think it is when their whole premise is based on two, three usages that they have done manually. Yeah.
Ben (48:06.069)
That's the same thing that I found. In fact, just yesterday evening, I was playing around with Chat GPT-4. I had just finished doing a unit test implementation for something that's pretty low level and complex in Python, believe it or not. But it required interfacing with memory state of the actual hash tables and being able to...
test around mocking some responses that it would have for object references, which is something that you don't typically do, because it requires using sort of dev APIs in Python that change with every Python release. So good luck, you know, keeping that compatible. But because I was trying to like diagnose something, I was going down to that level and saying like, what is going on here? Like, why do I have this extra reference here? So I was like...
I figured it out and got the result. Then I wanted to see how GPT-4 would do with that. I was like, hey, I need to do this thing where I'm overriding the state, and it's something that's going to be influencing how the global interpreter lock works, and I need to get this memory address. It gives me something that's basically read out of the official Python docs about how you would access this. Then I said, well, how would I
override this? And it did not hesitate. It started producing unit test mock things for patching. And I was like, all right, I'll humor it. I'll run it and see what happens. And of course it's completely wrong. So I took the response and then I explained to GPT-4, like, hey, what you're doing is referring to this actual process's reference to itself. So you can't actually do that;
it's self-referential. I need something that's not in this process thread that's in another instance of Python running in order to get this access. And eventually, you know, the answer is, you can't use Python for this. I need to use a Unix command to inspect this. And it was just crazy how insane it became with just doubling down on
Ben (50:29.465)
more and more complex, more and more broken things. And it brought me back to the other thing that I did this week, which I think these systems are absolutely amazing at, which is learning something for the first time. And I'm blown away by how great these things are as teachers, provided that you tell it like, I know these languages.
I have no concept of how to do this, you know, X language proficiently. Please teach me the fundamentals and provide me examples of how I can do these tasks in this language. Cause I'm trying to learn React right now. I'm not a front end guy, but I need to learn it to do a big project. So I'm using it to say, show me a good way to do this with this, you know, front end stack. And it's
Anand Das (51:14.556)
Oh, same as me.
Ben (51:26.857)
guiding me through and I'm learning way faster through examples than I would trying to read through a book and use the getting started guide. But once you hit this, there's like this threshold that you hit when you know enough about that language's behavior where you can instantly spot where the response is total garbage. But it's kind of scary when you think about if somebody were to ask me...
Ben (51:56.189)
Hey Ben, I need this credit card payment app to be written in React. And you got a month to do it. And I'd be like, I have no idea how to do that. I've never done that before. I don't even know React well enough or JavaScript well enough to do something like that. And if I go over to GPT-4 and start asking it to like, Hey, can you generate all this stuff for me? I needed to do this and I needed to process this credit card number properly. And.
and make this call out to Stripe properly and then safely secure this, it could be creating something that's going to basically build a lawsuit for your company. That's what's kind of scary.
Anand Das (52:40.859)
Yeah, yeah. For learning, I'd say it gets you to an escape velocity, as I think about it. Like if you want to, like I was learning Angular. I use Angular. I'm a complete back end guy. But sometimes when you have to fix things and you don't know, then what I've seen is when you have specific questions or.
Ben (53:01.407)
Yes.
Anand Das (53:06.235)
You know, people like us, when we learn or even kids, I'd say, who are learning programming languages, for them, reaching to an escape velocity or like the base model level is good. But if you want to do some things which we want to do, right, in our brains, but the documentation doesn't support it and there's no knowledge about it, right, like the hackers' mentality, if I may say so, then these guys fail because nobody has done it. Or like, you know.
Ben (53:30.111)
Mm-hmm.
Anand Das (53:35.671)
They'll basically paste you documentation. They'll tell you, based on whatever they know, they'll generate something which may not work. Because sometimes I think that they learn a lot, but at the same point in time, sometimes I think that the conceptual part is not there. You know what I mean? It's like, yeah, you memorize a lot of things, and you figure out the sequences in which, or order in which, these kind of like.
correlate with each other, right? Sequences and series like, you know, like this memory maps, but you know, this is the concept dude. And then, you know, based on this, you can only do this. You shouldn't do this or, you know, things like that, that it kind of misses out. So initial start is pretty fast and you know, solving a problem when you can describe what the problem is. And in your mind, you have a solution. This is how I want it to be. Then it makes it easier.
But if somebody is basically saying that, you know, I'll just go to it, it's my like, my, you know, hidden programmers, right? Like, you know, then it's not good because like, you don't know what to do. You're expecting this guy to figure out what to do. GPT-4 to some level, you know, you can give specifications and it will kind of understand, but not like, you know, if you're writing complex piece of code, then not so good. And again, we talked about prompt engineering.
Ben (54:30.814)
Yes.
Anand Das (54:59.975)
But frankly speaking, I think people who have built a schema in their mind, or, I'd say, a logic of how to solve the problem, they can provide the prompt properly. Whereas people who are like, I want to solve this problem, but haven't basically figured out in their mind the logical order or sequence of how the thing should work, their answers will never be right. But people who have that sequence correct, right, their prompt will actually yield better results.
And that's practical stuff. Like, prompt engineering is like, OK, if your logical thought process is good, you'll get it. If your logical thought process is not good, then yeah. Sometimes it's like writing pseudocode, right? And asking it to generate code, in a way, at a high level.
Ben (55:47.869)
Yeah, the best unit tests that I've had it write for me are the ones where I'm writing more English in bullet point instructions than it would have taken me to actually write the test. And I'll only do it if, if I'm like, all right, I'm doing something that is, I don't need to have interfaces with other parts of my code. I have this, you know, sort of pure function that's just doing this one thing.
And it's calling maybe one other function and that's it. So I'll paste both those functions saying, this is the intended performance of this, this code. Here are the data structures that I'm expecting to have come in. I want you to write positive validation for these eight data structures. I want negative validation for 12 others. And I want to make sure that all of my exceptions that I'm raising within these functions will be triggered with.
what I'm expecting is invalid results. And then here's how I want you to write these tests. How many of them I want you to write the structure, which test framework to use. And I write that all out and like, I just wrote like three paragraphs of text, geez. But if I say, generate 20 tests for me, and it'll sit there and iterate and write that. And then I look at the result and I'm like, all right, that saved me from writing 900 lines of basically.
you know, script, I see that as a win. But when I have to manipulate it, like, 'cause I'm just doing this for testing, but I'm like, hey, I need this thing to do X, Y, and Z, and I want it to, you know, use these input types, and I don't give it a lot of context, you generate it, you're just having to correct it constantly. And what's really funny to me is when
I'm in a prompt session and I'm going through and asking a series of questions that are using parts of a language that aren't frequently used or I don't think are frequently asked about because I don't think a lot of the training data is on framework code. I think it's more on the applied side because that's what most users are going to be asking. But if you're talking about interfaces to the operating system and system, you know,
Ben (58:13.133)
related things and stuff with subprocesses and threading, you start getting down this hole of, hey, I need to access this low-level component and use it in some way. When you start asking things and asking for examples to be generated to solve a problem, it starts writing its own framework code, which always just blows my mind when it hallucinates in that way.
It creates its own API that doesn't exist in the language. But then you look at it, you're like, that should exist in the language; this is so useful. And you start thinking, how did it come to this? How did it come to generate this answer? You realize, oh, that exists in C++, and it exists in Java, but it doesn't exist in Python, you know, or vice versa.
Anand Das (58:47.515)
Yeah, yeah.
Ben (59:06.057)
It's like, that's why it thought that this was a thing because of all of this other context that it had.
Anand Das (59:09.979)
Yeah. It's fun because, you know, sometimes you kind of ask it to write, say, a Node.js or Python thing, right? The cool example is, I want to use tiktoken in, like, Node.js, right? So then it will basically give you something that doesn't exist. Then it's like, let me write Python code that will work,
and then try to use that as a module, like, you know, I'm making some extra calls from Node.js. Okay, fine, dude. But, you know, over a period of time, it started giving that as a result. I'm like, okay, fine, but you can use something else other than this. You can look at the libraries which are there that you can readily use in Node.js, like, you know, use a GPT-3 tokenizer or stuff like that. But it's kind of funny when it hallucinates at that level.
If you ask it to use a particular CLI or an API which it doesn't know about, and say, okay, using this API, do this, it'll actually make API calls. It will make REST calls and it looks all proper. Then you start using it. Then you're like, this doesn't work. Yeah. I'm like, yes, because it doesn't know about this. You said that there exists an API, so it just crafted its own and gave you a piece of code. Now you have to go and figure out which of the APIs actually does what it was,
Ben (01:00:21.202)
Mm-hmm.
Anand Das (01:00:35.375)
like, what this code generated, right? It said, okay, add an element, but does the API have add-an-element or add-a-list? Then you have to go and modify the code. Which is funny, because people go into that black hole, as I said; they'll spend hours. Some people will say, okay, I give up, I'll write the code on my own. And some people, the ones who learn, will actually say, hey, by the way, this is how the API is, these are the APIs available, can you modify the code to use this API?
Right. So it's pretty interesting. A lot of hard-earned lessons, I would call them, which are funny also.
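The fix those more experienced users land on, pasting the real API surface into the prompt before asking for a rewrite, can be as simple as the sketch below; ItemStore and its methods are entirely made-up examples, not any particular library.

```python
# Sketch: paste the real API surface into the prompt so the model stops
# inventing methods. ItemStore and its methods are hypothetical examples.
REAL_API = """
ItemStore (the actual library API):
  add_item(item: dict) -> str           # returns the new item id
  add_items(items: list[dict]) -> list[str]
  get_item(item_id: str) -> dict
There is no add_element() and no add_list().
"""

BROKEN_SNIPPET = 'store.add_element({"name": "widget"})'

prompt = (
    f"{REAL_API}\n"
    "The following line was generated against an API that does not exist. "
    "Rewrite it using only the methods listed above:\n\n"
    f"{BROKEN_SNIPPET}\n"
)

print(prompt)  # send this to whichever chat model you are using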
Michael (01:01:13.29)
Oh yeah.
Ben (01:01:13.597)
I've actually learned to use some of that hallucinated output in exactly that way when designing an API. I'll ask it, hey, are you familiar with this framework? And it will say yes, or it will say, no, I don't know. And then I'll say, hey, here's the domain client API for this, and here are the methods that are available, and I need to add in something that does X. Can you write
the applied usage of this new API? It's interesting to see. Or I'll say, hey, give me three examples of how to use this, and you start to get them generated. I know I'm on the right track with my own design when it largely agrees with most of the things that are generated, because it's almost intuitive. It's like, okay,
Anand Das (01:02:08.317)
Absolutely.
Ben (01:02:09.309)
it has learned how other things are built, and if I deviate away from that too much, there's no context for a human user to intuitively understand how this is going to work. I'm waiting for somebody to build a tool that does exactly that: hey, use an LLM to basically generate proper API designs.
Anand Das (01:02:19.323)
Yeah. And it will also give you...
Anand Das (01:02:28.323)
Yeah. And the fun part is it will also do versioning and backward and forward compatibility. Because we were trying to build an API layer, and we had to start with a prototype API, and everybody was happy that it was up and working. And then we did the same thing: we asked both OpenAI and Anthropic, hey, generate an API to do blah, blah.
And then suddenly, you know, one guy's like, shit, we didn't add versioning.
Like, this thing is putting versioning in, and I'm like, good, you learned something, let's go and add it. So that's very interesting. And there are these simple things that even programmers who have been around for a while miss, because they're focused on getting the solution out. So when they see these kinds of things, suddenly it's like,
Ben (01:03:09.625)
Yes. Put that V1 in there.
Anand Das (01:03:30.211)
oh man, let's put this in. But if you do it on day one, it's great, right? You don't have to modify or change the mechanics later on. These simple things, like code smells, go away when you utilize it in the right fashion.
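That "put the V1 in there" lesson is cheap to apply from day one. A minimal sketch, using FastAPI purely as an illustrative framework since the conversation doesn't name one:

```python
# Minimal sketch of versioning an API from day one. FastAPI is used only for
# illustration; the point is that routes live under a /v1 prefix so a /v2
# router can be added later without breaking existing clients.
from fastapi import APIRouter, FastAPI

app = FastAPI()
v1 = APIRouter(prefix="/v1")

@v1.get("/items/{item_id}")
def get_item(item_id: int):
    return {"item_id": item_id, "api_version": "v1"}

app.include_router(v1)
```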
Ben (01:03:45.797)
Yeah, I had a question that I wrote down in the first three minutes of our chat that relates to this. Something one of the members on my team implemented for our GitHub repo is an integration where I can tag a code block in a GitHub PR and it'll send it off, along with context around it, to OpenAI. And we can ask questions of it, like, hey, is this the most efficient way of doing this? Or
could you rewrite this for me in a more efficient way? Or, this is too complex, can you rewrite this for readability? That's the one that I love the most; I love that prompt. I'm very much about readable code, huge adherence to that. But one of the things that's really interesting is: what are the challenges for your tooling, and at the company that you're working at,
where it becomes not so much a technical problem but a philosophical question, or a human behavioral problem? If you have something like that running, what are the impacts to the team, psychologically, of having a dad in the room checking everything that you're doing? How do people see that?
How do you feel people will see that going forward if they lose faith in it?
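The kind of PR integration described above, sending a tagged code block plus surrounding context to a model with one pointed question, might look roughly like this sketch. The prompt wording and the helper are assumptions for illustration, not the actual tool the team built.

```python
# Rough sketch of a PR-review helper: take a code block from a pull request
# plus some surrounding context and ask the model one focused question.
# Uses the openai>=1.0 Python client; reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

def review_snippet(snippet: str, context: str, question: str) -> str:
    """Ask the model one focused question about a tagged code block."""
    prompt = (
        "You are reviewing a pull request.\n\n"
        f"Surrounding context:\n{context}\n\n"
        f"Code block under review:\n{snippet}\n\n"
        f"Question: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example, with the reviewer's favorite prompt:
# review_snippet(diff_hunk, file_context,
#                "This is too complex. Can you rewrite it for readability?")
```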
Anand Das (01:05:19.891)
So there are a couple of things, right? Let's assume for a second that whatever it is giving is better than what is written, and that it works. All of these come with a lot of caveats; it works is one caveat, and it being better than what you have is another. But we get asked this question when we are talking about building a PR tool or something like that. People ask,
is it going to run at the user level, or is it going to run as a git hook and post comments publicly? And there's this difference too: junior folks don't care, they're learning. Senior folks who've been there for a long time, they
are, I would say, a bit apprehensive about what it is going to put in. There are two schools of thought. Some say, okay, I'll learn more. And some say, we did this a particular way for a reason; it may not be performant, but this is the context of why we have done it, and we know it's not performant. So if you're changing the code purely for performance, that's the wrong thing.
And then some people just want to be able to get that input and fix it, but without it being known publicly. So we see all three things. And people do ask us, hey, if it's at the user level, it's great, because when I check in, you tell me and I fix it before it goes in. From the organization perspective, though,
the engineering managers are like, if I have it as a git hook, then it's good, because I can stop something from going in even if somebody hasn't installed your tool. But then when you do that, people are like, I'm getting commented on my work, not by a human, and it doesn't have the context, and so on. So we see all of that.
Anand Das (01:07:30.259)
But you know, most developers would say: if I can get it before I commit something, that's much better than it going out there. The engineering managers say, I want to stop everything going in that isn't good, so I want it at the other level. And then some of the senior folks ask, can we provide it the context of why something was done a particular way, so that it doesn't give comments that are right
from a performance aspect, but that, based on how we did things, are not something we're going to act on? If you can feed that in, then we can use it; otherwise we can't. So we see all of those things out there. Again, it's philosophical, and I'll call it a muscle-memory kind of scenario.
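The user-level versus git-hook distinction can also be made concrete. A git pre-commit hook is just an executable file in .git/hooks, so a local, developer-only version of this kind of review might look roughly like the following sketch; it is illustrative only and is not how Bito actually ships its tool.

```python
#!/usr/bin/env python3
# Sketch of a *local* pre-commit review hook, saved as .git/hooks/pre-commit
# and made executable. It prints model feedback on the staged diff to the
# developer only, in contrast to a server-side hook that posts comments
# publicly, and it never blocks the commit. Illustrative only.
import subprocess
import sys

from openai import OpenAI

diff = subprocess.run(
    ["git", "diff", "--cached"], capture_output=True, text=True, check=True
).stdout

if diff.strip():
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Briefly review this staged diff for bugs or code "
                       "smells:\n\n" + diff,
        }],
    )
    print(reply.choices[0].message.content)

sys.exit(0)  # advisory only; the commit always goes through
```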
Ben (01:08:21.695)
Mm-hmm.
Anand Das (01:08:23.623)
People just expect it to be a certain way, and then they believe that if you put this in, then you might have a problem.
Ben (01:08:32.276)
Something I see with the widespread proliferation, the explosion, of LLMs as it affects software engineering as an entire industry is that a lot of people are like, oh, it's going to automate our jobs. No, it's not; we're a century away from anything like that, at a minimum. But what it is going to do is what it's doing right now for teams that are willing to adopt it, which is increase productivity,
Anand Das (01:08:49)
Right.
Anand Das (01:09:01.98)
Yes.
Ben (01:09:02.041)
and just have another perspective there. It's not a creative human brain sitting right next to you commenting on what you're doing, saying, are you sure you want to do that? But it is another perspective, something with vast knowledge resources, saying, I have not seen this pattern before; I've seen what I think you're trying to do done this way 800 times, have you thought about this sort of thing? It can catch errors, which it's fantastic at.
And I mean, I use it every single day. It's such an amazing technology and tool for software engineering productivity. However.
What I think is actually going to change about code is what you just said. It's something I've started to do more: when I'm providing code samples of something that I want it to modify or interface with, my inline code comments have become more verbose. It's almost like I'm crafting a prompt within my code comment, to let it know this is why this is the way that it is, and we cannot change this
due to X, Y, and Z. And I find that, at least with GPT-4, it starts to sort of grok what the heck is going on, and it won't modify that in a way that will break integrations with other parts of the code. So that's what I think the big side effect is going to be: code bases are going to get bigger with text, not with code.
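A small, hypothetical illustration of that "comments as prompts" habit: the comment spells out the constraint so a model later asked to modify the function knows what it must not change. The function and the contract described are made up for illustration.

```python
# Hypothetical example of an inline comment written verbosely enough to act as
# a prompt for any model (or human) later asked to modify this code.
import hashlib


def cache_key(user_id: str, query: str) -> str:
    # NOTE for reviewers, human or LLM: this key format is part of a shared
    # cache contract. Downstream services parse the "v2:" prefix and expect an
    # md5 digest here; changing either breaks cache invalidation in other
    # repos. Do NOT "modernize" this to sha256 or alter the prefix.
    digest = hashlib.md5(f"{user_id}:{query}".encode("utf-8")).hexdigest()
    return f"v2:{digest}"
```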
Anand Das (01:10:34.095)
Yeah. Text, comments, or supplemental data you provide that correlates to it one way or the other. But yes, as you rightly pointed out. And I think nobody can take away the human who has the logical mind and decides. It can maybe solve any problem, but you need to figure out what problem to solve; it's a human who is going to pose the problem: this is what I want to solve.
The other thing is, there can be infinite ways to solve a problem, but it's only a human who will say, these are the restrictions that I have, this is the way I would like to approach solving this problem. So the logical thinking is not going to go away. The real programmer is not going to go away. If you're basically copy-pasting and being a language translator in coding, yeah,
then you are affected. But as long as you are using a logical mind, I think, as you said, it's not going to go away soon. It is generative AI; we don't say thinking AI, and there's a reason for that. So the human mind is still required. That is what I believe. The way I think about all of this is: we had abstractions. We had assembly, then we had C.
Then we had Java and a bunch of languages which abstracted away a lot of complexity. So now you can basically give instructions in English. But again, what it is doing is generating code that you need to run; it's not taking your English and executing it directly. So it's a level of abstraction to get you information. Over a period of time, that abstraction will keep getting better, so that it can solve problems and challenges based on the input that you give;
it'll try to figure things out based on the input you've given. It might ask you, is this the context that you're talking about right now? It does that to some level today, but not a whole lot. And probably the context length will increase; I think that is a major thing, because if that is solved, it can help with a lot of things. Right now, you have a window of 32k, or you have a window of 100k, and you have to see how to slide it properly so that you get the answers.
Anand Das (01:12:55.479)
Right? If you can't fit the context into the input, then even though you do a bunch of things, it's a challenge. Summarization is a challenge. And I wouldn't say that we have solved everything. But we're seeing what kinds of issues it creates, and as we see them, we are fixing them one by one, and we are learning over a period of time. Yeah.
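One common way to cope with the 32k or 100k windows mentioned here is to chunk the input on token boundaries before prompting. A minimal sketch with tiktoken, where the budget and overlap values are arbitrary illustrative choices:

```python
# Minimal sketch of fitting long input into a fixed context window by chunking
# on token boundaries. Budget and overlap are arbitrary illustrative values.
import tiktoken


def chunk_by_tokens(text: str, budget: int = 3000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks of at most `budget` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + budget]
        chunks.append(enc.decode(window))
        if start + budget >= len(tokens):
            break
        start += budget - overlap  # slide forward, keeping some overlap
    return chunks
```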
Ben (01:13:21.237)
And that's really the way to make any successful ML product: brilliant humans brute-forcing these problems, going through and saying, all right, that's a new problem we created because we fixed these other five, let's tackle that. That iterative process is how you make great stuff.
Michael (01:13:48.91)
Yeah, so I'm going to jump in here. We could keep talking for at least six hours, maybe even up to 24 hours, but unfortunately we have jobs and lives that we have to get back to. So I'll... yeah. Feel free.
Anand Das (01:14:00.255)
I was thinking Ben and I would code together, right? Or something like that. We should get on that someday, yeah.
Michael (01:14:08.826)
Yeah, just Saturday dates with you two over VS Code. Sounds good to me. Cool, so I will quickly summarize. Today we talked about a ton of really interesting and salient points regarding the LLM industry, and some of the things that I keyed in on were the challenges that we talked about. So, context length, which is essentially the memory of a current session; if you're not familiar, you can loosely think of it as like cookies for a web browser.
Ben (01:14:09.055)
definitely.
Michael (01:14:38.978)
Maintaining state and maintaining context about what you're talking about is really challenging, because there's finite memory. One thing that I thought was absolutely fascinating is that Anand's team is looking to solve this limitation via intelligent search of code dependency graphs; super smart implementation. Another issue that we frequently run into is keeping models up to date. Retraining is really expensive and there aren't enough GPUs, and so
you can solve this either using context or retraining with human-in-the-loop feedback. On the user side there are a million challenges, but a couple of things that we discussed were: one, people blindly using the tool, and the solution for that is just don't do it; and two, people nefariously using LLMs. The gatekeeping approach of stacking a model that asks, hey, should I actually be returning this to the user, is a great solution, but with open-source models
a lot of that burden goes to the developer that's creating the model, and so it's sort of an unsolved problem. So yeah, those were a few things, and there was lots more. But Anand, if people want to learn more about you or your work, where should they go?
Anand Das (01:15:50.431)
You should go to bito.ai. We build a developer assistant using AI. It's still in the early stages; it gets some things right, not all of them, and we believe that over a period of time it will get better. We just launched the functionality to actually answer questions based on your code. That means any code that you open in your IDE,
we index it and then answer questions based on it. Scale is a challenge, but we are getting there, and it will improve over a period of time.
Michael (01:16:29.666)
Awesome. And that's B-I-T-O dot A-I. Cool. Well, until next time, it's been Michael Burke and my co-host and have a good day, everyone.
Ben (01:16:38.889)
Ben Wilson.
See you next time.
Anand Das (01:16:42.995)
Thanks everyone.