The Past and Future of Language Models with Andriy Burkov, Author of The Hundred-Page Machine Learning Book
Andriy Burkov is the author of three widely recognized books: The Hundred-Page Machine Learning Book, the Machine Learning Engineering book, and, most recently, The Hundred-Page Language Models Book. His books have been translated into a dozen languages and are used as textbooks in many universities worldwide. His work has impacted millions of machine learning practitioners and researchers. He holds a Ph.D. in Artificial Intelligence and is a recognized expert in machine learning and natural language processing. As a machine learning expert and leader, Andriy has successfully led dozens of production-grade AI projects in different business domains at Fujitsu and Gartner. Andriy was formerly Machine Learning Lead at TalentNeuron.

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.
Key Quotes
People need to care about reinforcement learning because this is the closest type of learning to how people, how humans or animals in general, learn. No one trains a dog by showing it many, many pictures of how to chase prey; the dog, after watching a million pictures, doesn't conclude, now I know how to chase prey. You actually shoot a prey and you say, go get it. And once it brings the prey back, you give the dog a reinforcement.
Agentic AI looks like robots are working for you and you just sit and wait until it's done. The problem with this is that LLMs aren't agentic, so they don't have agency themselves. The LLM doesn't know, at any moment in time, that it's in the process of solving any problem. It was fine-tuned to generate the next token, and then we also fine-tuned it to generate the next token so that it replicates some patterns that we find useful. It doesn't become agentic just because you proclaim it.
Key Takeaways
Understand that AI is not a singular intelligence but a collection of algorithms producing intelligence-like outputs, which can be applied across various domains.
Recognize that AI's history is extensive and not limited to recent advancements like language models; foundational techniques like logistic regression remain crucial in production AI due to their efficiency and effectiveness.
Explore the resurgence of recurrent neural networks, which, despite being overshadowed by transformers, are proving to be efficient and effective in certain contexts, especially with new architectures like xLSTM and minLSTM.
Transcript
Richie Cotton: Hi, Andriy. Welcome to the show.
Andriy Burkov: Hi, Richie. Thanks for having me.
Richie Cotton: To begin with, I'd like to know: what do you think is the most common misconception people have about artificial intelligence?
Andriy Burkov: Probably there is more than one, but I think the biggest one is that it's actually intelligence. It's a scientific term that was invented by scientists to talk to other scientists. And scientists often invent some really funny terms just for the sake of having a common term that everyone agrees on.
This artificial intelligence was one of those. But then, when artificial intelligence became very popular and available not only to scientists but also to a regular person, people started operating with this term as if it was, you know... I remember news articles or news segments on TV saying, AI does this, and now AI does that.
As if it was some specific AI that previously did this and now does that. But it's just different algorithms that people apply in different domains, grouped under the umbrella of AI. It's not one AI that does all this, and it's not really an intelligence.
It's more like intelligence-like outputs that you get when you apply those algorithms to some problems.
Richie Cotton: So I guess, relatedly, in your book you go back an awful long way in history. It goes right back to the 1950s.
So talk me through why did you decide to start back in the 1950s?
Andriy Burkov: I think this was to show that AI is not language models and language models are not AI. The book was released at this specific moment in our history, two years after ChatGPT was released, as I like to call it, where suddenly hundreds of millions of people have been exposed to an avalanche of information about artificial intelligence.
But 99% of the information that people received in these two years was about language models. And it's okay, because currently it's the most powerful algorithm out there. But I didn't want this book to be about large language models like the pre-trained transformers, because it would give the impression that this is what AI is.
Like, okay, there is a book on language models. You open it, it starts with the transformer, then how we pre-train it, how we fine-tune it, and how we use it. And people would say, okay, well, I know AI now. But AI is much bigger than that. And it didn't start two years ago. It started in, I would say, the 1940s or 50s.
And it started small. But over the decades the field has grown quite significantly. My PhD, for example, was 15 years ago, and it was on agents and multi-agent systems. So when I see people today talking about agents without really understanding what they are... Because there is a scientific definition, well, not one but several definitions, of what agents are.
And if you don't start with definitions, if you just use words that everyone uses, then it's not science, because your agent can be different from someone else's agent. Your AI may be different from someone else's AI. So I wanted to show not just the evolution of language models, but at the same time to show that transformers and language models are just a continuation of what's been going on for many decades.
So now people who read my book will say, oh wow, I'm interested in language models, but there are other things. So where do I go next? And you can go in many directions.
And I wanted to cover at least the most important ones in the book.
Richie Cotton: That's very true that it is helpful to have an understanding of the history so you understand like why these things happen and it's important that, yeah. AI didn't start with Transformers back in whenever it was 2017. I'm curious though, are there any parts of sort of 20th century AI research that are still relevant for people working in data or AI today?
Andriy Burkov: Plenty. For example, logistic regression. It's been around, well, I might be wrong, but for about 60 years. Currently, it's still one of the most used algorithms in production AI. Why? Because it's fast and because it's implemented very efficiently. So you don't have any trouble training logistic regression, whatever problem you want to solve.
Well, it's only a classification algorithm, we might say, but classification is one of the most important problems that we usually solve, because a classification is a decision. For example, when you decide whether a letter is spam or not spam, you make a decision: spam, not spam.
And to make this decision, you train a classifier. And currently, again, people might not know that there are other algorithms that would allow them to classify things, and they would start straight away by taking some pretrained transformer and trying to fine-tune it, or just instruct it: classify these things.
But this would be overkill. You know, it's like you want to dig a hole 50 centimeters deep, and instead of taking a shovel, you pay for an excavator. It doesn't make sense. So logistic regression is just one example, but there are also support vector machines. Support vector machines are very powerful for text classification, together with logistic regression.
And by the way, in my book I wanted to demonstrate, and demonstrated very clearly, that if you use logistic regression as a classifier and you use scikit-learn, which is a kind of basic machine learning library, you can train a very strong classifier that is very hard to beat with a state-of-the-art language model fine-tuned on the same data that you used for logistic regression.
So it's not just faster, it's not just cheaper: in many cases, especially for text classification, logistic regression and support vector machines are still state of the art and even better. I've shown that even a baseline logistic regression implementation is very hard to beat with a transformer.
Well, of course, depending on the documents and the data sets, maybe a transformer will be better. But there are data sets, realistic ones, where these simpler algorithms are still state of the art.
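To make this concrete, here is a minimal sketch of the kind of scikit-learn baseline Andriy describes: TF-IDF features plus logistic regression. The tiny dataset below is a placeholder for your own labeled documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data: 1 = spam, 0 = not spam.
texts = [
    "win a free prize now", "cheap meds, click here",
    "meeting moved to 3pm", "please review the attached report",
]
labels = [1, 1, 0, 0]

# TF-IDF features + logistic regression: the classic strong baseline.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["free prize inside"]))  # likely [1]
```

On realistic text classification data sets, a baseline like this trains in seconds on a CPU, which is exactly the shovel-versus-excavator point.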
Richie Cotton: Excellent. Yeah, I totally agree about the use of logistic regression and how many of the world's businesses are running on these sorts of older techniques. And there's the math in there. This is one thing that we've been going backwards and forwards on at DataCamp for a long time: how much of the math do you teach people? So what parts of linear algebra do you think are important for people working in AI to know?
Andriy Burkov: Well, people often ask me this question, and I would say the only thing that is critical to understand what we call modern AI, which is mostly neural networks, is the first derivative of a function. It's a part of calculus; usually you learn this in college or university, in the first year, I think.
Why it's important: if you really want to understand how a neural network is trained, you must understand the concept of gradient descent. It's the algorithm that in 99% of cases is used to optimize the parameters of your model.
And this algorithm works on functions. Basically, your model is just a function, like y equals f of x. And this f of x can be defined very simply. For example, a linear function is, let's say, wx plus b. So x is your input, for example the letter you want to classify as spam or not spam.
And those w and b are just numbers. So let's say you want to find what those numbers should be for your specific problem so that your classifier is the most accurate. You use this gradient descent algorithm, which will take your data set and adjust those w and b iteratively, so that in the end those w and b converge to some values that you then put in production and use.
To understand this principle, you should understand the notion of the first derivative. And the first derivative of a function shows, for every point, like, if you have a curve, for every point on this curve, whether the function grows at this point or decreases. It's a very simple concept.
So if you understand this concept, then you understand gradient descent. And if you understand gradient descent, then you understand how a neural network is trained. And when you understand how a neural network is trained, you understand how transformers are trained. So it all depends on this first derivative in the beginning.
And I don't consider myself a strong mathematician. I understand many things, but it takes a lot of time for me to understand them. By the way, maybe we'll talk about this later, but I really appreciate LLMs as an assistant in understanding things. For example, previously, to understand something...
You read a book and you're stuck on an equation. You don't understand why the author brought up this equation, why you should understand how it works. And if you don't understand why, you're not motivated to continue, because you're just continuing for the sake of continuing.
You don't know where you're going. With LLMs, you can select a chunk of text, paste it, explain the context, and say: why is this equation here? And it'll say, ah, it's because... It may be wrong. But if it gives you the right intuition, or even a wrong intuition, you can compare your intuition with this equation and say, ah, okay, I see now.
Or you can say, no, I still don't see, you rephrase the question, and eventually you understand why and you continue. For me, it has simplified learning from articles and books quite a bit. This is all to say that now it's not really difficult for a newcomer to the field to understand the first derivative, even if they didn't understand it in college. It's one of the simplest concepts out there. Integrals, for example, are much harder for people to understand, because their physical meaning is not as obvious as the first derivative's. With the first derivative, you just say the function grows, the function decreases. Very simple. But the integral is an area under a curve.
And why would this area under the curve be important? Why would we measure it? That's much harder. But fortunately, to understand modern machine learning, you don't need integrals. Well, in some cases, yes, but not to understand neural networks.
Richie Cotton: That's kind of cool, that you can get a really long way with some simple linear algebra and understanding derivatives, essentially basic calculus. And I also like the idea that you can use AI to explain things that you don't know. In fact, I have to read a lot of books; it's part of my work.
And often it's like, yep, using AI to summarize them if it's going too slowly. Really useful. In your case, I feel like you packed a thousand pages' worth of AI research into a hundred pages, so not much summarization needed there. But I like the idea of using AI to change the pace and the difficulty level of whatever you're reading.
So, related to that, you cover a lot of less recent techniques. I think there's a whole chapter on recurrent neural networks, which were really hyped in the early 2010s, and they've been shunted aside a bit with the rise of transformers. Are some of those techniques still useful?
Andriy Burkov: Well, I didn't want my book to be about some expired tech just for the sake of it. I was really happy to create a path from the beginning to the end where you not only discover the history, but you discover it by learning about stuff that is still state of the art, or at least used in some contexts.
So, recurrent neural networks were replaced by transformers in the beginning. Why? Because transformers are naturally parallelizable. In a recurrent neural network, to make the model predict the next token, you first need the model to read the previous tokens. And this reading of previous tokens in a recurrent neural network is done one by one.
You read the first one, you update the neural network's internal state, then you read the second one, and so on until you reach the end of your prompt, and then it starts generating the continuation. This was not inherently wrong; it was just hard to parallelize, because you cannot predict the next token for every time step at once. The way we train transformers, we predict the next token for all previous tokens, not just for the last one, but for all of them. And in a recurrent neural network, you would have this recurrent reading and prediction, which is slow. So just because of this, transformers have shown that you can train very fast and the model will become very good.
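A toy sketch of that sequential bottleneck, with made-up sizes and random weights standing in for a trained model:

```python
import numpy as np

# An RNN's hidden state at step t depends on the state at step t-1,
# so a prompt must be read token by token, in order.
rng = np.random.default_rng(0)
d = 8                              # hidden/embedding size (arbitrary)
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1

tokens = rng.normal(size=(5, d))   # 5 placeholder token embeddings

h = np.zeros(d)
for x in tokens:                   # inherently sequential loop
    h = np.tanh(W_h @ h + W_x @ x)

print(h.shape)  # final state after reading the whole prompt
# A transformer instead attends to all positions at once, so the
# next-token loss for every position can be computed in parallel.
```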
And you can train at very large parameter counts. But it doesn't mean that recurrent neural networks are worse in any way. And we have seen, just in 2024, several recurrent neural network architectures proposed, like xLSTM and minLSTM, that don't have this recurrent dependency hardwired in the architecture.
So now you can train a recurrent neural network that is as good as a transformer. And in some cases you can win on speed, because recurrent neural networks are usually smaller than transformers, so you can run the inference, token by token, much faster. For this reason I included recurrent neural networks. But it's also a very important learning step, because if you start from scratch and go directly to the transformer, it'll feel very overwhelming.
And many people will be intimidated: oh no, I will never get this whole thing myself. But if you start with count-based models that don't use any neural network, you understand the principles: what is a language model, how does it work?
And these count-based language models aren't expired either. For example, on your smartphone, when you type something to your partner or your friends, often it's repeated stuff. Let's say you have a tradition of going to a bar every Friday.
And you type to one of your friends, okay, do we go to X? And this X will eventually start to be suggested as the continuation on your keyboard. But this is not a neural network, because no one will retrain a neural network on your smartphone. And even in the cloud, it's overkill.
So they still use these count-based models that just count how many times this X follows the preceding words. And if it's the most frequent continuation, it'll just show it to you. And that's okay, because for autocomplete you don't need a very long context to predict the next word.
So count-based models are still used. Recurrent neural networks are still state of the art. And then transformers. So you start with count-based models to understand the principle. Then the recurrent neural network shows you the most straightforward way to implement a neural network as a language model.
And once you understand the whole matter in the most straightforward way, you say, okay, now let's consider some less straightforward way. But the principles will be the same. So the reader will not feel as intimidated as they would if they started with transformers from scratch.
Richie Cotton: Okay, that does make a lot of sense, that you build up the concepts gradually rather than trying to dive straight into the cutting-edge stuff, getting overwhelmed, and it not making any sense. And I do like that those count-based models, I guess there are things like bag of words, is that the sort of idea? Like the very simple natural language processing models are still in use.
Andriy Burkov: Well, yeah, though it's not quite bag of words. It's more like n-grams. For example, you have an n-gram that says 'this evening we will watch a'. And what will be the next token? Most likely you will watch TV, probably, or a film. So you just take a data set of different texts.
You chunk them into n-grams, for example five-grams. So it's five words following one after another. And then you say, okay, I will take the first four as my context, and for the fifth, I will count how many times this fifth word follows these previous four words in my overall data set.
And then, when it's time to predict, you just take this four-gram from a dictionary, you take the continuation counts, and you say, ah, okay, for these four words, this continuation is the most likely, and this is what you predict. There is no math other than counting involved, but it's still very useful on small devices where you don't need to retrain anything.
You just update counts, and it follows your usual conversations with your friends.
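A minimal version of such a count-based model, using a tiny placeholder corpus and trigrams:

```python
from collections import Counter, defaultdict

# Count which word follows each two-word context, then predict the
# most frequent continuation.
corpus = "do we go to the bar do we go to the gym do we go to the bar".split()

n = 3  # trigrams: two words of context, one word predicted
counts = defaultdict(Counter)
for i in range(len(corpus) - n + 1):
    context = tuple(corpus[i:i + n - 1])
    counts[context][corpus[i + n - 1]] += 1

# Most likely continuation of "to the" in this corpus:
print(counts[("to", "the")].most_common(1))  # [('bar', 2)]
```

Updating the model when the user types something new is just incrementing a counter, which is why this works on a phone with no retraining.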
Richie Cotton: That's kind of cool, because there is a sort of sense that these big neural networks are just eating everything in terms of technique. So it's nice that some of the older, less computationally intensive things still have some value. Alright, so you mentioned that there are these new recurrent architectures for large language models.
What's happening in the world of large language model architecture? Like, what are the big recent developments?
Andriy Burkov: Well, the biggest today is, of course, what we call thinking or reasoning models. They are also known as inference-time compute or test-time compute models. Basically, models that, instead of giving you the answer right away as ChatGPT initially did...
These models generate what we call a chain of thought. It's an internal discussion of the model with itself that helps it come to a better response than it would if this discussion didn't happen. This has been around for probably a year. ChatGPT, I think, was the first to release a thinking model.
I think it was with ChatGPT o1 that they released it. OpenAI didn't want to show what's going on in this thinking process, and everyone guessed that maybe they had invented something complex to make it happen. And the models that were thinking really were better at many tasks.
So what was the most impressive thing that happened in the past two months? It's DeepSeek, the Chinese company, that released the model they called R1. They trained this model to generate this chain of thought, and they have shown what this chain of thought looks like, which would already be a very good hint for scientists trying to reproduce something like this.
But they also published a paper that explains how their thinking model was made, and they put it out. And the algorithm there, which anyone can reproduce, they call GRPO. They actually talked about this algorithm in previous papers as well, where they explained how they trained a model for math.
They called it DeepSeekMath, and many people had already implemented it and confirmed that it works very well for math. But R1 made so much noise because they released this chain of thought and made it kind of a demo, where anyone can read this chain of thought, confirm that it actually resembles a thinking process, and reproduce it.
So, for example, I took a very small, by modern standards, language model. It's Qwen 2.5 with 1.5 billion parameters; it's considered tiny today. And I took a publicly available data set with math problems and answers, without the solutions, just the problem and the answer.
So I implemented this GRPO algorithm from their paper, and I fine-tuned this model to generate this chain of thought and the final answer. And just with this small model, on these school-level math problems, the model was capable of solving 90% of them.
And I didn't even use the whole training set that I have. The model converged at this 90%; it wouldn't improve. But the only reason it wouldn't improve is that it wasn't large enough. If I took a model with 3 billion parameters, or 7, or 20, or 72, it would go higher and higher.
And eventually it would converge to something maybe around 95 or 97%. So you can say that anyone living in their parents' basement can rent a quite affordable node of GPUs, for example a node with eight GPUs with 80 or 100 gigabytes of memory each, and fine-tune whatever model they want to become the best model in the world at solving this specific kind of problem: school-level math, college-level math, some mathematical proofs, or whatever.
Whatever data set you can get, you can use it. it's achievable by a random person. And this is what it was a wake up for the entire industry because if previously we could at least kind assume that Open had something. Now we don't assume that they have anything, so anyone can do exactly what, what they do.
So, it was kind of, this is why, you know, there was this fluctuations in Nvidia stock prices because people were like, okay, if we don't need so much compute to create state-of-the-art models, who would need all those GPUs? But again, this reasoning was a little bit not well thought because like I think now for this reasoning models, we need even more GPUs because we do this additional compute during the inference.
So yeah, for me this was the most important development, and this is why, by the way, I decided to work on a book on reinforcement learning as my next book in the hundred-page format. By the way, just recently Sutton and Barto received the Turing Award for reinforcement learning, and I'm a hundred percent sure that the very recent result with R1 was one of the reasons why they were selected, because it's a very real improvement in AI quality.
It's not that we take a larger data set or increase the number of parameters. No, it's a smart algorithm that doesn't cost much and results in a very high boost in AI performance quality. Well, only for math, logic, and reasoning, but that's not a negligible part of what people use.
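For a sense of what makes GRPO cheap, here is a sketch of its core idea, group-relative advantages, as described in DeepSeek's papers. This is just the normalization step, not their training code; the policy-gradient update itself is omitted:

```python
import statistics

# Sample several answers per problem, score each with a simple reward
# (e.g. 1.0 if the final answer matches the known one, else 0.0), and
# normalize rewards within the group. No learned value model is needed.
def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled solutions to one math problem, 2 got the right answer.
rewards = [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```

Answers better than the group average get a positive advantage and are reinforced; worse ones are suppressed. This is why a data set of just problems and final answers is enough.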
Richie Cotton: That just seems like quite a huge shift, then, in terms of the market for large models. I think there was a sort of trend towards there being a few just huge foundation models, and now you're saying, well, actually maybe the trend is going the other direction. So everyone gets their own, I guess, almost disposable model, just something that's cutting edge for solving a specific problem.
Do you think that's gonna be the future then having many personalized models?
Andriy Burkov: Well, they will complement each other. Just recently, if you follow the news, maybe yesterday or two days ago, it was Qwen, from Alibaba. They released this QwQ-32B, which is a reasoning model, and they compared it to R1 and showed that on those reasoning tasks it's very comparable, but it's much smaller.
Many people, without really testing long enough, started to speculate that, okay, now we can have an R1-grade model, or let's say a ChatGPT-grade model, in 32 billion parameters, which is not true. Yes, for math, for logic, for reasoning, where you can get data sets in task-solution format, with reinforcement learning you can train a very strong model, and it'll be small, like the 1.5 billion one I trained to reach 90%.
So of course, if you have 32 billion, you will reach higher than 90, and for more complex problems. But it doesn't solve the overall problem these models have with hallucinations. Hallucinations happen for many reasons and on many occasions, but one of the very frequent use cases people use language models for is to ask factual questions.
Like, for example, what is X, or who is this person, or what is this person known for, or why is this person so important in the history of this country, and so on. These smaller models don't have enough parameters to store factual information. Basically, when you do the gradient descent update, the update adjusts some parameters so that the model becomes capable of predicting the next word in this specific context.
So if you have a lot of parameters, somehow, and it's still an open scientific question how, but somehow this gradient descent doesn't erase previous information. It goes to the parameters where it can still put some new information without erasing the old. But if you only have, let's say, ten parameters to train, every new piece of data you put in will erase and replace everything that was there before.
It's always instructive to look at the extremes. If you have one parameter, you will not remember anything. If you have millions of parameters, your model will remember something. And if you have billions, or dozens or hundreds of billions of parameters, then your model can remember quite a lot.
So this 32-billion-parameter QwQ is good for math, but if you ask it factual questions, it hallucinates just as much as any model of this size. Currently there is no practical solution to hallucinations in small models; the only solution we know is to grow the model size. So the large models will coexist with small models.
People will train small models for very business-specific tasks, using reinforcement learning or even just supervised fine-tuning. But if you want to create an assistant that is knowledgeable in multiple domains, you will still want to work with as large a model as possible.
This is, by the way, why this GPT-4.5 that OpenAI released recently climbed quite high in scores. It's not smarter at math or logic, but it's much more factually accurate, because it's much larger than the previous model.
Richie Cotton: Okay, so that's really interesting, that you're going to need different types of model for different use cases then. You mentioned that if you're answering questions with some kind of factual answer, then you need that large model, because it needs to have that sort of broad data input. And then something where...
Andriy Burkov: You can use a larger model, or you can train a smaller model to access some data, a kind of knowledge base, pull relevant facts from this knowledge base, and then just summarize them, what we call retrieval-augmented generation. The problem is creating a high-quality knowledge collection like this.
It's not easy, because knowledge, if you just pull it from the internet, can be wrong, because people don't always verify what they post online. That's one of the reasons for hallucinations too. So if you really want a very small model that is capable of talking about a very wide range of subjects without too much hallucination, you need a high-quality knowledge collection, which not many can have.
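A minimal retrieval-augmented generation sketch. Real systems typically use neural embeddings and a vector database; TF-IDF stands in here so the example stays self-contained, and `call_llm` is a placeholder for whatever model API you use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny stand-in for a curated knowledge collection.
knowledge_base = [
    "The Hundred-Page Machine Learning Book was published in 2019.",
    "Logistic regression is widely used for text classification.",
    "GRPO is a reinforcement learning algorithm used to train reasoning models.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(knowledge_base)

def retrieve(question, k=1):
    """Return the k knowledge-base entries most similar to the question."""
    q = vectorizer.transform([question])
    scores = cosine_similarity(q, doc_vectors)[0]
    return [knowledge_base[i] for i in scores.argsort()[::-1][:k]]

question = "What algorithm trains reasoning models?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = call_llm(prompt)  # placeholder: your LLM call goes here
print(prompt)
```

The quality ceiling here is the knowledge base itself, which is exactly the point Andriy makes: retrieval only helps if the collection is trustworthy.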
Richie Cotton: Since you mentioned reinforcement learning: it always feels like the sort of unloved cousin of supervised learning and unsupervised learning. Why do people need to care about reinforcement learning?
Andriy Burkov: Well, people need to care about reinforcement learning because this is the closest type of learning to how humans, or animals in general, learn. No one trains, let's say, a dog by showing it many, many pictures of how to chase prey, with the dog, after watching a million pictures, going, oh, okay, now I know how to chase prey.
You actually shoot a prey and say, go get it. And once it brings the prey back, you give the dog a reinforcement. This is, by the way, where the name comes from. You reward the agent, and in this case I'm really talking about a real agent, not a language model renamed into an agent.
You reward an agent for achieving the expected result, and you punish it. Well, you don't punish the dog, but you punish your robot, for not achieving the expected result. So how do you do it? You cannot give a reward to a robot after it makes a very, very tiny movement. Like, okay, it made this tiny movement...
...I give it a 0.1 reward, and then another movement, and I give it 0.2. No, you would like to give the reward at the very end of the task. And reinforcement learning is what allows you to propagate this final reward across what we call the states of the environment.
So the robot operates in some environment. If you ask it to bring you an apple, the robot should find the shortest path in this environment, detect an apple somewhere, in a bucket or on a tree, take it, and bring it back to you. And then you say, good robot, here is a reward of 100.
And if it brings you a banana, you don't want to punish it entirely by saying, no, I take the hundred back. You'll say, well, for a banana it'll be zero. But the robot will still learn that at least it brought something and it's not a total disaster. Bringing something is a good thing. Reinforcement learning allows you to propagate this reward distribution across all the states, and all the actions the robot made in all the states of the environment it was in.
The first time, it'll not bring anything; it'll get a negative reward. But the first time it brings you something, the reward will be propagated, so the actions that led to this final reward will have a higher probability of being chosen by the end of this.
This is why it's very important to have simulations in which you can train simulated versions of the robot. Those simulations should be as close to the real environment as possible. In this case, you can run millions of copies of the same environment and millions of copies of the same robot.
And they will all, in parallel, try to execute some tasks. And even if one out of a million eventually succeeds, you propagate its reward to all the actions it made, and you update what we call the policy, the function that is used to select actions in each state.
This policy can then be copied to all the other replicas of this robot, and they all simultaneously become a little bit smarter. You continue this way with millions of simulations, and eventually at least one, or two, or three will start to systematically solve the problem exactly as expected.
You can imagine that in supervised learning, showing all possible situations, here you do this, and here you do that, is impossible. It'll not scale. If you want robots that live with us in a typical house, the robot must learn on its own. It'll bring you a beer on Friday evening, and you will say, wow, good one.
It'll get the reward from this 'wow, good one', and it'll learn that bringing you beers on Friday is a good thing.
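A toy tabular example of that reward propagation, with a made-up five-state environment where the reward arrives only at the end:

```python
import random

# Q-learning on a line of states 0..4; action 0 = left, 1 = right.
# Reward 1.0 arrives only on reaching state 4, then spreads backwards
# through the values of the states that led there.
n_states, n_actions = 5, 2
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # step size, discount, exploration

for episode in range(500):
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:
            a = random.randrange(n_actions)                       # explore
        else:
            a = max(range(n_actions), key=lambda act: Q[s][act])  # exploit
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Temporal-difference update: pull Q toward reward + discounted future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([round(max(q), 2) for q in Q[:-1]])  # values grow toward the goal state
```

After training, states closer to the goal have higher values, which is the propagation Andriy describes: the final reward leaks backwards into every state and action that led to it.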
Richie Cotton: That seems very appealing, having my own pet robot to bring me beers on a Friday evening. I like that idea. Oh, and since you mentioned policies, I guess this is where we come back to group relative policy optimization from the large language models. So again, I think it's a reinforcement learning technique.
It's nice that there are some good examples of use cases for reinforcement learning. I think a decade ago it was all about playing games, like, oh, well, this is how you win at chess or Go using reinforcement learning. But now it's robots, it's large language models.
So lots of like really important business use cases there. Okay. Alright. So, I don't think we can get away without talking about AI agents. So they've been hyped for months. They're kind of starting to appear in the wild. What do you think the big use cases for AI agents are?
Andriy Burkov: Okay. Well, first of all, as I said in the beginning, we operate with terms, and I say we collectively, all people, we operate with terms that we didn't define. AI is used in a very broad sense, like something that can think and solve all sorts of problems. And the same goes for the agent.
There is a very popular, and also important, book that is used as a textbook in most universities worldwide, from Russell and Norvig, called Artificial Intelligence: A Modern Approach. This book is kind of the door you open into the world of AI when you go to university.
In this book, there is a chapter specifically about agents. The agent is defined as an actor that acts in a certain environment: it perceives the environment's states, it can execute actions in this environment, and it receives feedback from the environment, usually in the form of rewards and an observation of the next state of the environment into which the previously executed action brought it.
So this is how the agent is defined. Now, if we define the agent according to the news, it's nothing different from just an LLM that you fine-tune to call functions. So you say: you are a useful assistant and you can help the user get some information when they need it.
When the user asks for, let's say, information about the best deals to buy a TV online, or a car, then you must call the function that connects to, let's say, Amazon or Google Marketplace and pulls the search results. And then you summarize it. And you explain that this function has these inputs, for example the topic.
It might have other inputs: the minimum and maximum price the user is ready to pay, and some others. And then, once you have these inputs, you call the function and you get the result. But note that the LLM itself cannot call a function. The LLM just says: at this point in the conversation, this function with these parameters must be called.
So then you, I mean the developer, parse this suggestion by the LLM to call this function with these parameters. You parse it, you actually call the API, you submit the parameters, you get the result back, you put it in the context, and the conversation goes on.
So this is kind of an assistant agent. There are other agents that people use to work collectively on something. For example, we write a document, and we need multiple pieces of information to write this document. So there is one agent responsible for making sure that the overall structure of the document is as the user wants.
There are other agents that see the state of the document and say: okay, here we need to add additional numbers, so I will go online, pull some numbers, and put them there. It kind of looks like robots are working for you and you just sit and wait until it's done. The problem with this is that LLMs aren't agents; they don't have agency themselves.
The LLM doesn't know, at any moment in time, that it's in the process of solving any problem. It was fine-tuned to generate the next token, and then we also fine-tuned it to generate the next token so that it replicates some patterns that we find useful.
Because the model is not an agent, you can say 'you are a very capable agent', but it doesn't become an agent just because you proclaim it. So what's the problem? The problem is that none of those agents actually know that they are solving a specific problem.
They just follow patterns that we either provided in the prompt or fine-tuned them to follow, so they can be arbitrarily wrong and still continue to follow the pattern. None of those agents will say: whoa, stop, we were supposed to write a document about bananas, but we are now writing about potatoes.
Is it something we should probably reconsider? No. If it goes in some direction, and it's the wrong direction, they will continue to go in the wrong direction. And you don't need agents to test this; you can just ask an LLM to explain something to you.
You just need to make sure that you understand the domain. For example, when I work on my book, I ask it: explain this equation to me. And I know how it works; I just want to see an explanation, and probably draw inspiration and write something similar, probably better.
So it explains something to me, and I know that it's wrong. Because I know the domain, I say: no, no, this is not this, this is that. And it says, oh yeah, right, this is that, and it rewrites it. But this will never happen with agents. The agents will take the output of other agents as real, take this output as their input, and the process will continue based on this wrong input going forward.
So this is why, for example, you heard about this world's first AI software engineer.
Richie Cotton: Oh, Devin.
Andriy Burkov: Yeah. When they just started talking about it, they already proclaimed it the world's first AI software engineer. It was such a strong claim that, I don't know why, the rest of the world didn't laugh right away, but the hype was so high that everyone expected miracles.
Yeah, well, this will be the first AI software engineer. And this one will be the world's first AI lawyer. And this one will be the world's first AI doctor. But if you are a doctor, you'll ask it questions about the human body and you will find so many issues in its answers.
You will say: no, I will not trust it. Yes, in some domains AI is good. For example, when you show it an x-ray with some cancerous cells and it can circle them, this is cool, and I will use it. But I will not talk to an AI to decide what treatment to apply, because that is crazy.
Like, one time out of two it's wrong. The same goes for this Devin, which is an agent because, well, you say: I need this problem solved, this is my environment, these are my passwords and credentials and so on, now go for it and bring me the solution. And it's been already more than a year, and we don't see any Devin deployed anywhere in production.
Some recent tests have shown that up to 25% of tasks were solved. And knowing that we never reach a hundred for whatever problem it is: if after a year of effort you manage to reach 25%, and even those 25%, I didn't see them, but I think if you really dig deeper, you will see that the code is not good.
There are plenty of bugs, yet it compiles and somehow the test cases pass. But it's not maintainable; the code is so wrong that no programmer will say: I will continue to maintain it, or I will continue to ask Devin to maintain it. So agents will be good in situations where the data and the problem itself can be described verbally.
And this verbal description can easily be found online, because this is where the LLMs get their data for pre-training. So if your document or your problem can be described verbally, and there are plenty of other descriptions of such problems online, the agent will be good. Why? Because it has already seen plenty of patterns of this problem.
And it knows what to do with those patterns. Otherwise, I don't know if we can still call it hallucination, but in many cases it'll take actions without meaning, because the pattern itself is not familiar to the agent. It'll just do something for the sake of doing something, which kind of corresponds to hallucination.
But we probably need to invent a better term for these stupid actions. And those stupid actions will accumulate, and the more agents there are, they will not cancel out one another's problems; they will amplify them. So this is my stance on it. It's not entirely useless; it's just that you must find use cases that are very similar to the web. For example, web scraping, probably.
Yeah. Searching online, probably, yeah. But solving, let's say... I say: okay, I have this cold fusion reactor, but I cannot contain the plasma for more than half a second using my algorithm. This is my algorithm. Hey Devin, fix it; my plasma should now be containable for 10 seconds.
Well, this will never happen. It has never seen any example of plasma-controlling code. So no, it'll just fail.
Richie Cotton: This is really interesting. You make it sound conceptually easy: it's an LLM, it's doing a lot of, I guess, test-time thinking, it's using some tools like APIs. But then in some cases, like you mentioned, you start off with a shopping example and it seems to work okay there, but it's not going to replace engineers or doctors anytime in the near future, which I think is probably quite reassuring for a lot of people who are worried about their jobs being taken.
Andriy Burkov: Yeah, it's good news for many who don't understand how the thing works. If you understand how the thing works, you are not worried. But if you don't, yeah, it's good news.
Richie Cotton: I guess one of the well-hyped things over the last few months has been the idea of these deep research tools. So it's like AI doing PhD-level research. I don't know how much of that is marketing hype and how much is really possible.
Andriy Burkov: Yeah. Well, first of all, this PhD-level AI, it's a new marketing term that they pulled out of their heads. I say they, I mean the totality of venture capitalists, CEOs, and influencers who want this hype to continue forever, because they have investments and likes. It's kind of a symbiotic relationship between them.
So now we talk about this PhD-level AI. First of all, what does PhD level mean? I was a PhD. I worked on a very, very tiny direction in overall AI, call it game-theoretic multi-agent systems, or computational game theory. It was a very narrow direction, and in the world maybe a hundred other scientists worked on the same problem as myself.
And when I read some new paper in the domain, I knew all the people on the author list, because we were the same group of people working on this very narrow domain.
So this is what a PhD does. A PhD doesn't launch spaceships to Mars every day; we don't invent new physics every day. We work in very narrow domains, and on its own, this very narrow domain is not particularly difficult. You just need to know a lot about this specific domain and you'll feel comfortable in it.
So this is the definition of PhD-level work. But it doesn't come from pre-training an LLM on web data and saying: now you can do this very narrow research. Show me one example where they took an LLM and said: okay, now you are a specialist in computational game theory.
Write me a paper that solves this problem. No one can do this, and I doubt very much that it'll be done, because, again, an LLM cannot do anything that wasn't in the training data. It can extrapolate a little bit. For example, let's say in some scientific domain some algorithm was used to solve some problem, and you work in a different scientific domain, but your problem is somehow similar to the one solved in the other domain.
If the LLM was trained on the former, then for the latter it'll be able to say: oh, you can use this algorithm to solve your problem. Why does it know? Because your problem is similar to that problem, and this similarity comes from what we call similarity between embeddings.
If your problem is embedded in the same embedding space as that problem, of course the suggestions will come from the solution to that problem. But if you work on something very narrow, where only a hundred people work on it, the chances that this PhD-level AI will suggest something that a PhD can actually use are very slim, or even none.
So maybe what they mean by PhD-level AI relates to exams. When you want to become a PhD student, you need to pass some exams, and these exams include knowledge from different domains. For example, in computer science, it's about compilers, about proofs of theorems, about creating, let's say, systems that don't fail, so there are algorithms that introduce redundancy in how you encode information. There is a lot of knowledge that you should have to be admitted to a PhD program, and to test that you have this knowledge, you pass, well, I passed, some exams. So maybe this is what they mean.
Maybe their LLM is good at solving these exam problems that PhDs take. But again, as I said, if today you have a large enough model and you use reinforcement learning, and you have the problems and solutions for these exams, you can fine-tune the model to do it. And then you can call it PhD-level AI, but it will not do PhD work.
And by the way, in my case, the PhD work was still creative work. I advanced the field by proposing new algorithms, new solutions to existing problems. But there are many PhDs who don't produce any new information; they just aggregate a large body of disparate pieces of information.
Just recently there was a woman who worked on, I think, the use of smells in literature to describe racism: when they described people of color, they used smell references and so on. It's very important work and really quite interesting to read. And for her to be able to write a thesis...
She would have had to read plenty of books and find passages where smell was used: how, in what domain, at what stage of our history, before a revolution, after a revolution. So this is a very important PhD domain, but it's so narrow that maybe only ten people in the world will read it and say:
yeah, this is really good work; this is worth giving the PhD title to the scientist. This is not something you would ask an LLM to do. Well, again, if you train it on all the books of the world, it might write you something, but it'll also hallucinate a lot.
And you will have to reopen those books, reread those passages, and make sure that it's actually the real deal and not something it just imagined, because it samples tokens from a distribution.
Richie Cotton: It sounds like there's a big difference, then, between the PhD idea of doing novel scientific research and the PhD idea of doing something that's deeply technical and requires a lot of domain knowledge. So I guess you need to be very clear on what kind of problem you're trying to solve before you start diving into making use of LLMs for that sort of thing.
Andriy Burkov: Yeah, just recently Google, I think, released a kind of assistant for scientists and their scientific work. I think this is where LLMs will really be super helpful, because, as I said, there are so many publications right now. Especially if you work in computer science and AI, thousands of new publications are put on arXiv weekly.
It's impossible to read all of them. But maybe some of them contain algorithms that might be useful for your research, and you don't know about them, because you are incapable of reading all of it. And even if you could read all of them, you cannot keep all this in your head.
Your head is not like a toolbox where you put different stuff and keep it forever. Well, some scientists, the great scientists, this is how they operate: they have so many algorithms and different approaches to solving different problems in their heads that when you present them with a new problem, they already see what kinds of tricks they would try first to solve it.
Because they have a lot of stuff in their heads. But most of us remember things we use frequently; once we don't use something frequently, it gets lost. So in this case, the LLM is a very good assistant, because it remembers everything. And even if it hallucinates, it doesn't matter, because you are the scientist.
So you say: okay, this is my problem. I already have this algorithm, but I would like to prove that this algorithm is efficient. How do I start to prove something like this? And the LLM might say: oh yeah, I saw this article, and it was a similar algorithm to yours, but applied in a different domain.
And this is how they proved that the algorithm was efficient for the problems they applied it to. And you'll say, show it to me. The LLM will show you the algorithm. You will not just copy, paste, and publish; this is not how scientific work works. You'll take it, you'll implement it, you'll see whether it actually applies to your specific case.
And if you validate everything, and you've consulted with colleagues and everyone says it's the real deal, you publish. So this accelerates the research cycle a lot. But right now there is plenty of low-hanging fruit like this. Eventually, everything that was easily transferable from one domain to another will be transferred.
And then we will again confront a situation where there is nothing easy left to do. We've already transferred all possible algorithms to all the possible domains where they can be applied. Now what do we do? We once again have to start doing science from basic principles, the way we did it in the past.
So for some time scientists will benefit, maybe this time will be a decade, we don't know. But at some point we will reach a situation where everything already invented by the people before us has been used in all the domains we care about, and we'll have to start thinking once again for ourselves.
Richie Cotton: Okay. It certainly seems like quite an exciting intermediate time, though, being able to transfer best practices from one domain into another. I'm sure there are a lot of domains, think about the social sciences, that could probably do with being a little more quantitative or technical in some cases, because it just hasn't been explored enough yet.
So yes, definitely some exciting times in the near future, but I guess you've just got to be careful about what the right use case for a large language model is.
Andriy Burkov: Yeah. And I just remembered a very funny case from a couple of years ago, well, I say a couple, maybe five years ago. People on Reddit discovered that some group of scientists in healthcare needed an algorithm to optimize a function. But because they weren't really technical, they reinvented gradient descent, and they published it: okay, we propose this function, and we also propose a new algorithm for optimizing the parameters of this function.
And they just described basic gradient descent and published it as if it was the discovery of a lifetime. So at least with LLMs, the LLM might say: well, no, it's not a discovery; it's already been invented.
Richie Cotton: Although it is very cool that they invented gradient descent, because it's obviously an incredibly important algorithm. But yeah, I can see how that might be a little bit embarrassing, the fact that it got through peer review as if it were new.
Andriy Burkov: I guess so. Well, it's because it's a domain where they don't usually do any math. It's maybe some social research, but they needed to quantify some results, they didn't know how to do it, and so they invented it.
Richie Cotton: Nice, that's very funny. Alright, so before we finish, I'd like to hear a little bit about what your tech stack is, because it seems like there are so many different tools now for building with AI. So just tell me about what your favorite tools are.
Andriy Burkov: I don't consider myself a software engineer, but I know how to code, and in my book I put a lot of code, and I am quite confident that this code is good. Why? Because I tested it with different language models and asked them to simplify this code and make it as simple as possible,
but at the same time to keep the educational aspect. So it shouldn't be, you know, genial one-liners no one can read. It should be something the reader can read line by line and see the logic. So I can write good-quality code, but I don't use, for example... I know that many people today use Cursor or VS Code.
I use a regular text editor, Sublime Text. I've been using it for a decade now, and I'm quite satisfied. You can run code in Sublime Text and you can debug in it. So this is my go-to place to write code. And also, for example, for working with an LLM on coding, I created a script.
For example, let's say I work on a web app. For my book, I needed to create an interactive website that shows some very interesting animations, where I applied a machine learning algorithm to turn elements of the illustrations. It looks very fun.
So what I did is create a script that takes all the files you work with in your project. For example, if it's a React project, there are some standard files and a standard directory structure that you should respect when you work on a React project.
This script takes all the scripts, CSS style files, and HTML files and just concatenates them into a text document. So next time, if I want to quickly modify something in this app, I take the content of this document, I paste it into an LLM, and I say: okay, improve this, or replace this with that, or add this additional element on the webpage, or make this button work in this different way.
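A sketch of what such a concatenation script might look like; the project root and file extensions here are assumptions to adapt to your own project:

```python
from pathlib import Path

# Concatenate a project's source files into one text document
# that can be pasted into an LLM as context.
PROJECT_ROOT = Path("my-react-app")          # hypothetical project directory
EXTENSIONS = {".js", ".jsx", ".css", ".html"}

with open("project_dump.txt", "w", encoding="utf-8") as out:
    for path in sorted(PROJECT_ROOT.rglob("*")):
        if path.is_file() and path.suffix in EXTENSIONS:
            out.write(f"\n=== {path} ===\n")  # header so the LLM sees file boundaries
            out.write(path.read_text(encoding="utf-8"))
```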
So I just describe it. It says, okay, cool, and writes me the full change. And then, if I really don't want to think and just want to do something very fast, I say: no, write me the modifications in a 'find this, replace with that' format. And then you don't even need to use your brain.
It just says: okay, find this, replace with that. So you search in your original code, you find it, you replace it. Eventually you replace everything and you compile the code. If it doesn't compile, you say: well, this is my error, fix it. And eventually you get to the result, or not, because, as I said, if your problem is out of the distribution of the data that was used for training, you can try very hard and not get what you want.
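Applying those 'find this, replace with that' edits can itself be a few lines of script. A hypothetical sketch, where the edit pairs and file name stand in for whatever the LLM returned:

```python
# Each edit is an (old, new) pair as suggested by the LLM.
edits = [
    ("color: red", "color: blue"),
    ("font-size: 14px", "font-size: 16px"),
]

with open("styles.css", encoding="utf-8") as f:  # hypothetical target file
    code = f.read()

for old, new in edits:
    assert old in code, f"not found: {old!r}"    # catch stale suggestions early
    code = code.replace(old, new)

with open("styles.css", "w", encoding="utf-8") as f:
    f.write(code)
```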
But again, the good thing is that now we have alternatives. For example, today my go-to model for coding is OpenAI's o3-mini-high. I have this paid plan for $20 a month, and it includes, I don't know how many, 10, 15, 25 interactions with the high model. Then let's say I've exhausted those interactions and still haven't found the solution.
Well, the LLM didn't give a solution. So I can take this code and go to my second go-to model, and today it's Grok on X. Grok is spectacular; it's amazing how fast Elon Musk managed to train such a good-quality model. So, I don't want to wander from one to another randomly.
My principle is that if, after three back-and-forths, the problem is still not solved, you have two choices: start from scratch with the same model but explain the problem better, or go to a different model, and there are very high chances the different model will just give you the solution right away.
Because they were trained on different data sets and, especially, fine-tuned on different problems. So maybe some model was fine-tuned on a problem similar to yours, and you will just get the output right away. So for me, it's three back-and-forths; if I still see we're not going anywhere...
It says, okay, this time I fixed the problem, it'll work. You run it; it doesn't work. You say, okay, no, it doesn't work. So if the third time didn't work, you say: okay, I restart, or I go to a different model. And now I have Claude, I have Grok, and I have OpenAI, like o3-mini, or even o1 is good too. So it's a very vanilla setup.
But I know that many, many developers use Cursor or similar tools. And if I were a software engineer, if I did this 24/7, this would be my bread and butter, and of course I would use these tools. But for my cases, I test new algorithms; I build the websites for my books myself.
The way I use it is simple, and it's also simple for anyone who has never used these tools, who has never programmed at all, never wrote code by themselves. If you showed these people an IDE, a state-of-the-art IDE like VS Code, it would be hard for them just to set up the environment for a new project.
People who are outside would prefer to work this way: you just put something into an LLM, you explain what the problem is or what change you want, it writes you the updated code, you put it in place, it works, and this is how you proceed. But eventually, if you want to be more and more productive as a software engineer, you would gradually move towards these specialized IDEs.
Richie Cotton: That's really fascinating, that you say you can get away with just a stripped-down setup. I mean, you mentioned Sublime Text, which is pretty basic stuff, and then just having this AI assistance. But the trick there is to use lots of different models that are going to give you assistance in different ways.
Andriy Burkov: Yeah, don't spend much time with one model. Test all of them very quickly. There are high chances that one of them will solve it right away; or you can spend an entire day trying to squeeze something from one model and never get anything that works.
Richie Cotton: So final question. I'm always in need of new people to follow. Whose work are you most excited about at the moment?
Andriy Burkov: This is a hard question, because I rely a lot on my X bubble, and my X bubble is created automatically, so I don't follow anyone in particular. Of course, there are people that everyone follows, for example Andrej Karpathy,
who worked with language models and neural networks before everyone else. Even I was still in classical machine learning when he was already training language models from scratch. So Karpathy, of course, everyone follows him, but you've heard of him, and everyone has heard of him.
But for the rest, as I said, I trust my X bubble, and for now it doesn't disappoint; I see all the important news in my bubble. So, yeah, well, I recommend you follow me, and if I react to or repost something, you'll see it's something I think is worth spending time on.
Richie Cotton: Alright, so just follow everyone with variations on the name Andriy, then. Nice. Okay, cool. Wonderful, thank you so much for that. I feel like I learned a lot from this session. Really great stuff.