
End to End AI Application Development with Maxime Labonne, Head of Post-training at Liquid AI & Paul-Emil Iusztin, Founder at Decoding ML

Richie, Maxime, and Paul explore misconceptions in AI application development, fine-tuning versus few-shot prompting, the roles of AI engineers, the importance of planning and evaluation, the challenges of deployment, the future of AI integration, and much more.
May 4, 2025

Guest
Maxime Labonne

Maxime Labonne is a Senior Staff Machine Learning Scientist at Liquid AI, serving as the head of post-training. He holds a Ph.D. in Machine Learning from the Polytechnic Institute of Paris and is recognized as a Google Developer Expert in AI/ML.

An active blogger, he has made significant contributions to the open-source community, including the LLM Course on GitHub, tools such as LLM AutoEval, and several state-of-the-art models like NeuralBeagle and Phixtral. He is the author of the best-selling book “Hands-On Graph Neural Networks Using Python,” published by Packt.


Guest
Paul-Emil Iusztin

Paul designs and implements modular, scalable, and production-ready ML systems for startups worldwide. He has extensive experience putting AI and generative AI into production. Previously, Paul was a Senior Machine Learning Engineer at Metaphysic.ai and a Machine Learning Lead at Core.ai. He is a co-author of The LLM Engineer's Handbook, a best seller in the GenAI space.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

A lot of people try to apply fine-tuning to problems where it might not be the best solution. But in a lot of situations, you can get away with few-shot prompting and providing a few examples to the model, or everything related to a RAG pipeline to retrieve the context and include it in the prompt.

In the future we'll be mostly focused on thinking and planning, solving problems, or other creative processes. We'll automate a lot of boring, mundane tasks and it will let us actually be more human, which is not that intuitive.

Key Takeaways

1

Supervised fine-tuning is essential for transforming base models into useful assistants by instilling specific knowledge and formatting, but preference alignment techniques like GRPO can further refine output quality.

2

Deploying large models requires careful consideration of compute resources and cost management; start with small-scale deployments and use monitoring to gather data for scaling decisions.

3

Despite advancements in LLM context windows, RAG remains valuable for efficiently retrieving relevant information and controlling costs, especially when dealing with large datasets.

Links From The Show

Maxime’s LLM Course on HuggingFace

Transcript

Richie Cotton: Hi, Paul, Maxime, welcome to the show.

Paul: To be here.

Maxime: Hi, Richie. Thanks for the invitation.

Richie Cotton: Wonderful. So, I'm curious as to what you think people misunderstand about AI application development.

Maxime: So I think that in my field, a common misconception is about fine-tuning: when you can apply it, and its effectiveness in general. I think that a lot of people try to apply fine-tuning to problems where it might not be the best solution. And it's understandable that you want to customize your own LLM because you have your specific use cases, you have your specific data, but in a lot of situations you can get away with few-shot prompting, by providing a few examples to the model.

Then there's everything related to a RAG pipeline to retrieve the context and include it in the prompt. All these techniques tend to be really good for a lot of the stuff that people want to do, but a lot of them try to go directly to fine-tuning, which is rarely the best option unless you really have to go through fine-tuning.

So yeah, to me, this is the main misconception on my side.
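
As a concrete illustration of the few-shot approach Maxime describes, here is a minimal sketch in Python. The sentiment task, the example labels, and the model name are illustrative assumptions, not something discussed in the episode; any OpenAI-compatible chat endpoint would work the same way.

# Minimal few-shot prompting sketch (task, labels, and model name are assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A handful of labeled examples stand in for fine-tuning.
few_shot_examples = [
    ("The delivery arrived two weeks late and the box was crushed.", "negative"),
    ("Support resolved my issue in five minutes, fantastic service.", "positive"),
]

def classify_sentiment(text: str) -> str:
    messages = [{"role": "system",
                 "content": "Classify the sentiment of customer feedback as 'positive' or 'negative'."}]
    # Each example is shown as a prior user/assistant exchange in the prompt.
    for example_text, label in few_shot_examples:
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": text})

    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return response.choices[0].message.content.strip()

print(classify_sentiment("The app keeps crashing whenever I open my invoices."))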

Richie Cotton: That feels like a very good news story, because there are some very simple things you can do. So you mentioned the idea of few-shot prompting — it's a relatively cheap technique. So you don't need to go all the way to fine-tuning in order to get customized model behavior, I guess.

And Paul, what are people getting wrong on your side of things?

Paul: Well, on my side of things, I guess the most common misconception is about using frameworks to build this, more on the inference side of the application. So we all know about LangChain, LlamaIndex, and all of that, and I've seen that mostly everyone starts with them and thinks that's usually what it takes to build the application.

But in most cases, you usually hit some threshold where you cannot move forward, because from what I see, these frameworks are mostly a very low-code solution for getting into RAG and agents and all of this. And it's similar to no-code solutions: they're quite rigid, and when you need custom stuff, you usually hit some ceiling that you cannot pass.

What I recommend is always, when you need to interact with the database — basically ingest data, embed data into your database, and retrieve it — just start writing it from scratch from day zero instead of starting with these frameworks. I believe these frameworks are only good for very quick proofs of concept, to see that it's worth it to start writing it from scratch, and that the idea and the data are manageable and workable.

Richie Cotton: So at the moment the frameworks are only good for very simple projects, and you want to go a bit lower level with your code when you need to write something a bit more sophisticated.

Paul: Yeah, exactly. Because most of the time you need very custom filters, very custom ways to store your data, custom ways to pre-process and post-process your data. And you'll realize that trying to customize LangChain or LlamaIndex or similar frameworks only takes longer in the long run than just writing it from scratch.

But I want to highlight that there's the other dimension to this: the more agentic, workflow-related frameworks like LangGraph and LlamaIndex Workflows, which address these issues because they're made by the same people. They're mostly used to orchestrate your logic, your steps — to orchestrate them, to monitor them, to deploy them.

And that's the way to go, because they don't limit you in how you process your data, how you manage your data, or how you integrate your workflows or agents with your current infrastructure. So these are two sides of the coin: things like LangGraph? Yes. Things like LangChain? No.

Richie Cotton: Okay. So maybe we can get into the tech stack in more detail, but it sounds like you really think about which pieces of technology you're using when you're building stuff. So I'd love to talk a bit more about what your roles are, 'cause you're both nominally AI engineers, but it feels like your jobs are very different.

So I'd love to hear about what exactly it is you do. Maybe Paul, do you wanna go first this time?

Paul: So my background is more from a software engineering, ML engineering side of things. I'm used to training models, but it's not the bread-and-butter part of my job, let's say. So my side of AI engineering — because it's a very broad field — is mostly related to taking models from the research or training team and integrating them into bigger systems. That can be more MLOps things, like putting them on the right infrastructure to scale well and to hit requirements like cost, latency, and throughput.

So this is one aspect of it, and another aspect is integrating them into the code itself, because most of the time you need to pre-process your data before putting it into the model. And the reality is that the way the data is pre-processed in the research part of things is, most of the time, not production-ready to be integrated into the code itself. So most of the time you need to rewrite it and, again, think about requirements. You also have to think about training-serving skew: how to manage your pipelines so you don't introduce differences, because if even small parts of your data are processed differently from how the model was trained, things most of the time won't go well once it's deployed.

That's why I like that the MLOps world and the ML engineering world mesh really well together, because both have to solve these issues: how you deploy your model, how you design your architecture to make your features reproducible, to make your features shareable, to make your features accessible at training and serving time.

That's a big part of my job. On the other side, another aspect is more related to workflow and agent development. So on this combination of ML engineering and software engineering, you get into frameworks like, as I said, LangGraph, where you need to orchestrate all sorts of agents, to glue them together with RAG pipelines and these other aspects of engineering. So, to conclude, I don't do that much fine-tuning.

Richie Cotton: It's interesting that a lot of the problems you mentioned are very familiar, coming from my data science background, where you think you're gonna be spending a lot of time messing about with models, but actually you're dealing with data quality issues and how to get from prototype to putting stuff into production.

So, very familiar issues there that you're dealing with.

Paul: Exactly. But I will also add that, because of the scale that AI engineering and LLMs introduced — usually here you need to work with bigger models at bigger scales — you also have the dimensions of data engineering and software engineering. So I think this role is a little bit tricky, because you need to really know a little or more of everything.

Richie Cotton: And Maxime, tell us about your role.

Maxime: Yeah, so I have quite a unique role, working on post-training for an LLM provider. In my job, what I do is fine-tune base models to create general-purpose models, a bit like ChatGPT — you can ask it pretty much anything, right? That's not really what people do in companies.

In my previous role I was more of a machine learning scientist, where the goal was to work with someone like Paul. I would train the models and I would give the baby to Paul so he could deploy it and improve the code. In general, as Paul said, the code that we tend to write to process the data is not production-ready. That's Paul being nice about it: it's a Jupyter notebook and he can't reuse it. But yeah, thank you, Paul.

I think that in terms of tech roles, you have these two complementary positions: the machine learning scientist, who is responsible for the model — the goal is really to get the data, make a good model, evaluate it, make sure that everything works — and then the machine learning engineer, who is responsible for a lot of breadth, actually. They have to cover a lot of things, from data processing to deployment to inference optimization. And usually it goes really well together, in my experience.

I think that with large language models we see exactly the same kind of dynamics. It's just that the knowledge and the tools have changed a bit. The scale also changed, because now you don't have to deploy hundreds of machine learning models every week, but you have to deploy one or two that are very big, and you need to make sure that they really work.

So the focus has shifted a bit in terms of what's important in these jobs, but fundamentally we still have the same dynamics.

Richie Cotton: So that's interesting, that your job feels pretty closely related to the more traditional — well, traditional for the last decade or so — machine learning scientist role. It's like: make sure the model is high quality, and then hand it off to someone in engineering to go and put it in production.

Okay. I think between you, you have the full flow of creating AI applications. Just to make sure the audience understands, what are the different steps, in order, at a high level, for creating an AI application? Like, where do you start and what's the end point?

Maxime: I think the main part is understanding the problem really well. As with any project, you need to really understand the outcome, you need to understand the inputs, the outputs, what we are really trying to do here. 'Cause sometimes people will tell you, okay, I want to fine-tune this model on this dataset, but actually this will not solve their problem.

So you can do it, but then they're going to tell you, no, actually... So understanding the problem and scoping it well is probably the most important step. And then you have different steps of data collection and data generation, if needed, to create a dataset. It doesn't have to be a fine-tuning dataset;

it can also be a dataset for RAG, it can also be a dataset for evaluation. Broadly speaking, dataset creation is probably the first step. If needed, you have a step of LLM training, where you're going to fine-tune the model on a specific dataset. After that, you probably want to have an evaluation framework.

It's often better to start with the evaluation framework even before you generate the data to do the fine-tuning, because then you know what you're optimizing for. You're probably going to iterate a lot of times over the evaluation datasets, so it's good to start early. Once you've evaluated the model — I think I can hand it over to Paul — then you're going into RAG, deployment, monitoring, and testing. Paul, do you want to continue?

Paul: I will not continue directly, but we're on the same page, let's say. What I like to do, again more on the engineering side, is to think carefully about the requirements. One part of the requirements is the data, which Maxime already covered, but another aspect of any software application, in the end, is requirements like latency, throughput, cost, and infrastructure.

So I usually like to start by laying down an overview of the infrastructure and thinking about the flow of the data: how it'll run through your infrastructure, where it'll be stored, how all the components will be connected to each other, so people can really understand how this will work in production.

Because it's one thing to work on your internal clusters, where everything is stitched together with your own internal scripts and all of that, and another thing to have automated workflows where everything is basically working automatically with DevOps pipelines or whatever other pipelines you implement.

So this is another important aspect of it. And one thing that I usually preach is to start really small in the beginning: to have an end-to-end workflow that works with your minimum features — basically an MVP, let's call it — without trying to be smart, without trying to do complicated things and all that. You see that everything is working end-to-end, and then you start using the evaluation dataset to evaluate how your application works on your benchmark.

Basically, we take the dataset that Maxime told us about and test the whole application on it. Sometimes this works, sometimes the dataset needs some adaptation, because it's one thing to evaluate your LLM in isolation and another thing to evaluate your whole application. The idea is that you want an end-to-end workflow that works.

You want your evaluation to really be in place, and based on the evaluation you have a feedback loop, and based on that you can start exploring new features, adding more complexity, and all that. You can think about it like integration or unit tests: in the end, you keep adding stuff on top of it, and on every new feature you test it.

If the tests pass, it's okay, let's push that feature. If not, then we keep iterating on it.
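
Paul's "treat every new feature like an integration test" idea can be sketched as a small evaluation gate. Everything below — the dataset layout, the run_app entry point, and the pass-rate threshold — is a hypothetical sketch, not code from the book.

# Hypothetical regression-style evaluation gate for an LLM application.
# Assumes an eval set of {"question", "expected_keywords"} JSONL records and an
# application entry point run_app(question) -> str that you provide.
import json

PASS_THRESHOLD = 0.8  # arbitrary gate for pushing a new feature

def passes(answer: str, expected_keywords: list[str]) -> bool:
    # Crude check: every expected keyword must appear in the answer.
    # In practice you might swap this for an LLM-as-judge call.
    return all(kw.lower() in answer.lower() for kw in expected_keywords)

def evaluate(run_app, eval_path: str = "eval_set.jsonl") -> float:
    records = [json.loads(line) for line in open(eval_path, encoding="utf-8")]
    scores = [passes(run_app(r["question"]), r["expected_keywords"]) for r in records]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    from my_app import run_app  # hypothetical import of the end-to-end MVP
    pass_rate = evaluate(run_app)
    print(f"pass rate: {pass_rate:.1%}")
    if pass_rate < PASS_THRESHOLD:
        raise SystemExit("Feature did not clear the evaluation gate; keep iterating.")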

Richie Cotton: That's very cool. It just seemed like a common theme between the two of you is that you need to like plan what you're doing beforehand. It's, I always find this very disappointing. I wanna like dive in and start writing code and things, but actually, yeah, working out what the problem is you're trying to solve beforehand.

A very useful idea. Okay. So, all of this workflow — I would say you've got it down really well in your LLM Engineer's Handbook. I'm amazed at how you got the full workflow into a relatively concise book. I'd love to go into some of the chapters in more detail, because there are a few parts of this workflow I wasn't particularly familiar with.

So you have a whole chapter on something called the RAG feature pipeline. So tell me, what is a RAG feature pipeline?

Paul: At a very high-level overview, you can split your machine learning system into four big types of pipelines: the feature pipeline, the training pipeline, the inference pipeline, and the observability pipeline. For me personally, the names are quite self-explanatory, but I will dive deeper into the feature pipeline.

So this one basically takes raw data, which usually comes out of your data pipelines — which can be implemented by your data engineering team or, for smaller projects, still by the AI/ML engineering team. The idea is that what comes out of these data pipelines is usually clean, standardized data, to some extent.

And the feature pipeline takes this information and transforms it into features, right? Because it's called a feature pipeline. And because you're implementing a RAG feature pipeline, these features are collections that are used for RAG. So what we do here is basically further clean the data, if we think it's necessary for our specific RAG use case, and we also chunk it, embed it, or do other advanced RAG-related things before indexing and loading it into a vector DB, usually. But this is the standard use case.

We can of course dig more into this, because RAG is not only about vector DBs. In the end, RAG is about storing your data in a proper database that can help you with your problem, and retrieving it as smoothly as possible, so you have access to the right context when you want to generate answers.

Richie Cotton: So I'm curious as to what a feature might look like. If you've got a text document, then what are the features you're pulling out of it? Can you give us a concrete example?

Paul: A very basic example: you have a text document, and within this text document you usually have multiple entities. The idea is that during generation — so when a user types a question into the chat and waits for an answer — you don't want to pass the whole document to the LLM, because most LLMs are biased: when it hits context that is not relevant, for example, it can get confused and start giving answers based on context that is not, as I said, relevant to your question.

And in these use cases you usually have hallucinations happen, so we can dig into this later on. But as for what a feature looks like: you have a big document, which can be anything from a one-page document to hundreds of pages. The core idea is to split this document into entities that make sense on their own, and usually you want them to be as small as possible but still relevant.

So when you take this chunk, as we call it, it makes sense on its own, because in the end we will pass this chunk to the LLM to generate an answer. If it's fragmented information, then it'll be useless, or the answer will not be complete. So the trickiest part in RAG is actually to take these documents and split them to create these packets of information that are really relevant on their own.

After we split the documents into these packets, these chunks, we embed them and store them in a vector DB. And why do we need to embed them? Because the most popular technique for RAG is semantic search, where you basically compute the cosine similarity between two vectors, which intuitively finds the semantic similarity between them.

So basically, using these embeddings, you embed your data in the vector database, which is indexed on these vectors, and you embed your query. Instead of doing classic searches in the database, you use this embedded query to query the vector index and find the most similar vectors in terms of semantics.

And you'll retrieve those packets of information and pass them to the LLM. So this is an even more high-level overview of how RAG works behind the scenes.
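
To make the chunk-embed-retrieve flow Paul describes concrete, here is a minimal in-memory sketch using sentence-transformers and cosine similarity. The chunking rule, the embedding model choice, and the placeholder documents are assumptions; a production feature pipeline would write to a real vector database rather than a NumPy array.

# Minimal RAG ingestion + retrieval sketch (in-memory; illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model, assumed choice

def chunk(text: str, max_words: int = 100) -> list[str]:
    # Naive fixed-size chunking; real pipelines split on structure (sections, paragraphs).
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

documents = ["...your raw documents go here..."]
chunks = [c for doc in documents for c in chunk(doc)]

# "Feature pipeline" step: embed the chunks and index them (here, just a matrix).
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    # Embed the query and rank chunks by cosine similarity
    # (dot product of normalized vectors).
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

context = "\n\n".join(retrieve("What does the refund policy say?"))
# `context` is then placed into the LLM prompt alongside the user question.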

Richie Cotton: Okay, yes, I like that. The idea is basically about retrieving relevant bits of information, so your model doesn't hallucinate — it's pulling from real data.

Paul: Yeah, exactly. You summarized it.

Richie Cotton: Nice. Yeah, so I guess it sounds like the tricky part then is still about how you divide your document up into smaller chunks that are meaningful in some sense, so that you can pull out the right facts rather than just entire paragraphs or fragments of sentences — you've got to get it at the right level.

Paul: Yeah, exactly. And it's a lot harder than it looks, because you usually work with unstructured data and every document is unique, every image is unique. That's where most of these AI problems start to fail:

people try to generalize how you do chunking, and most of the time that doesn't work at scale.

Richie Cotton: And so, another one of the chapters in your book is around fine-tuning. So I think, Maxime, this is your specialty here. So talk me through one of the chapters around supervised fine-tuning. What does supervised fine-tuning involve and why do you want to do it?

Maxime: Supervised fine-tuning allows you to train the model on a dataset that you created, and that's really good if you want to instill some knowledge, for example, in the model. Let's take an example: you work in a company and you would like the model to have some basic facts about the company when you chat with it — it can be interesting for a chatbot, for example. Supervised fine-tuning can help you instill this knowledge in the model.

It can also help if you want to change the formatting of the answer; if you want to have a specific structure, it can be a good solution. And it can be pretty good if you have a bigger model and you want to distill it into a smaller model: you can use the bigger model to create the dataset

and then train the smaller model on that. This is done all the time to create general-purpose fine-tuned models.
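
For reference, a supervised fine-tuning example is usually just a prompt-response pair in a chat format. The record below is a made-up illustration of the "instill company knowledge" case Maxime mentions; the company, the facts, and the field names follow the common messages convention rather than any specific library.

# One hypothetical SFT training record in chat-message format (JSONL-style).
sft_example = {
    "messages": [
        {"role": "system", "content": "You are the internal assistant for Acme Corp."},
        {"role": "user", "content": "When was Acme Corp founded and where is it headquartered?"},
        {"role": "assistant", "content": "Acme Corp was founded in 2012 and is headquartered in Berlin."},
    ]
}
# Thousands of such pairs teach the model both the facts and the expected answer format.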

Richie Cotton: Okay, so it's really about customizing what the output's gonna be. I'm curious as to how it relates to — I know there are a lot of other post-training techniques to change how the outputs are generated, or what the quality of the outputs is. Where does supervised fine-tuning fit in compared to all these other techniques?

Maxime: So usually you would start with the base model. The base model has been pretrained on a lot of data, but it's only able to complete text, because it's been trained to predict the next word or subword in a sequence. That's not really useful if you want to ask questions, because it's just going to complete your question instead of providing the answer.

This is why you want supervised fine-tuning in the first place. Because in supervised fine-tuning, we have a specific structure where you say: this is the question, this is the answer; this is the question, this is the answer. So, more broadly speaking, supervised fine-tuning is pretty much always used, at least to transform this base model into a useful assistant.

And then you have other techniques that go on top of fine-tuning. They're called preference alignment techniques in general — another name is reinforcement learning from human feedback, but not all of them actually use reinforcement learning, so I like to call them preference alignment. In this family of techniques, the goal is more to tune the outputs from the model so they sound better, so they sound closer to what you want.

This is useful if, for example, you have an instruct model and you see that there are some answers you don't like, and you are able to write better answers. In this case, you have a preference dataset, because all the answers that you don't like, you want to reject them, and all the answers that you rewrite to improve them are your chosen answers.

So you can train a model on that — rejecting the rejected answers and choosing the chosen answers. That's kind of the goal of DPO, direct preference optimization, which is probably the most popular preference alignment algorithm.
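
A DPO preference record, by contrast, pairs one prompt with a rejected answer and a chosen (rewritten) answer, as Maxime describes. The example below is hypothetical; the prompt/chosen/rejected column names mirror a common convention, not a requirement of any particular trainer.

# One hypothetical preference-alignment record for DPO-style training.
preference_example = {
    "prompt": "Summarize our refund policy for a customer.",
    "rejected": "Refunds are complicated, please read the 40-page policy document.",
    "chosen": "You can request a full refund within 30 days of purchase; after that, "
              "we offer store credit. Reply to this email and we'll start the process.",
}
# DPO trains the model to raise the likelihood of `chosen` and lower that of `rejected`.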

Richie Cotton: So this is after you've done the fine-tuning — you're then saying, well, okay, these are some possible responses, which one do you like best? And then, I guess, you're doing reinforcement learning or some other technique to say, okay, we want more outputs like these good answers.

I know DeepSeek made a bit of a buzz in recent months 'cause they introduced a new algorithm for doing this preference alignment. Can you talk me through what the innovation is there?

Maxime: This algorithm is called GRPO, and — I'm going to give a high-level overview of it — the idea is that you're going to compare different versions of an answer, and you take the advantage from this comparison to get a reward signal for the model to be trained on.

So in this case, you do not need to have chosen answers and rejected answers. You only need some prompts, and then the filtering happens because you can specify custom functions, for example to make sure that you have the proper formatting. To give an example of what I'm trying to say, DeepSeek used two different reward functions.

One of them was to teach the model to output thinking tokens. So when you talk to DeepSeek-R1, it'll always think a lot — really a lot, sometimes — and then give you the answer. Well, that behavior was created by design during this preference alignment process, by punishing the model when it did not output the thinking tokens and rewarding it when it did.

Another thing that they did, and this one is probably more interesting, is that they extracted the answer — when you ask a math question, for example, they extracted the final answer from the model and compared it to the ground-truth answer. And they used that to train the model. And that's why it's so good at math,

so good at these scientific questions in general.
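
The two reward signals Maxime attributes to DeepSeek — a formatting reward for thinking tokens and a correctness reward against a ground-truth answer — can be sketched as plain Python functions. The tag names, answer format, and scoring values below are assumptions for illustration; a GRPO trainer would call such functions on each sampled completion in a group to compute advantages.

# Sketch of two GRPO-style reward functions (tag format and scores are assumed).
import re

def format_reward(completion: str) -> float:
    # Reward completions that wrap their reasoning in <think>...</think> before answering.
    return 1.0 if re.search(r"<think>.+?</think>", completion, re.DOTALL) else -1.0

def correctness_reward(completion: str, ground_truth: str) -> float:
    # Extract a final answer marked as \boxed{...} and compare it to the reference.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 2.0 if match.group(1).strip() == ground_truth.strip() else -0.5

def total_reward(completion: str, ground_truth: str) -> float:
    # GRPO compares these scores across a group of sampled completions
    # for the same prompt to compute each completion's advantage.
    return format_reward(completion) + correctness_reward(completion, ground_truth)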

Richie Cotton: That does seem pretty useful. Well, I dunno whether it works in other fields, but certainly for mathematics and science, where you've got a real concrete answer that you can test — did it get the right answer or not — that seems very good. I suppose more generally, though, if you don't have this concrete answer, how do you know whether an LLM is giving good answers or not?

It feels like it's a bit more fuzzy trying to evaluate whether it works or not.

Maxime: Yeah, the usual answer to this question is asking another LLM if it agrees or not, because, yeah, this is pretty much the best thing that we have. This is used in all the preference alignment algorithms. Actually, one of these reward functions can just be calling another LLM and saying, hey, is this correct or not,

and getting that reward signal in your GRPO training. But what GRPO gives you compared to DPO is that you do not need a preference dataset. In a lot of cases it's difficult to create the preference dataset, so that can be really helpful. And then the question is, can you have multiple reward signals? As I said, you can specify different functions, so that can be really good if you have different ways to reward the models.

And finally, about computational resources: this is also more efficient than PPO, which is the third, and really heavy, preference alignment algorithm. So if you don't have the resources of OpenAI to train models, it can be handy.

Richie Cotton: Paul, did you have any thoughts on how you make sure an LLM is any good or not?

Paul: Actually, I have one question for Maxime. I've seen different answers to this and I'm really curious what he thinks about it. So you have the base model, and then you do this fine-tuning step to basically create the instruct model that answers your questions properly.

But if you want to only inject task-specific or domain-specific knowledge into the LLM, is it correct to take an instruct model as your base model and further fine-tune it on other instructions that only inject, as I said, more narrow task-specific or domain-specific knowledge? Or do you need to start from your base model that doesn't know how to answer questions at all?

Maxime: No, yeah, you're right, you can do it from an instruct model. That would work too, but sometimes it just doesn't work, so it's a bit difficult to recommend it. People often ask me, oh, do I have to take a base model? And the short answer is no, you don't have to. And in most scenarios, actually, it's going to be pretty much the same.

So you don't have to, but I wouldn't necessarily take the risk — or at least I would try both approaches and evaluate them to see which base model was better. But I don't fully trust these instruct models to be retrained. Sometimes it goes really wrong.

Paul: And what about your data samples? Do you need the same sample size, the same dataset size, for both scenarios — if you start from an instruct model or from a base model?

Maxime: In theory, you don't need to, right? Because the instruct model already knows the structure; it already has more knowledge. So in theory, you don't need to. In practice, if you want to have the same performance on your downstream task — because here we are talking about fine-tuning a model for a specific use case — it's probably better to have as many samples as you can.

Of course, if you have millions of samples, it's too much; you don't need that many, right? But usually I would say no: I would retrain with the entire dataset if I can. 'Cause creating the dataset is often the problem; it's often the main problem in the entire training pipeline. Training models is easy, evaluating them is becoming harder, and generating the data is really hard.

Paul: Well, that's where my question was coming from, because intuitively, to me, you need less data to further fine-tune an already instruction-tuned model than to start from the base model. And that's why I was curious, because most of the time you don't have thousands or tens of thousands of high-quality samples, and I was hoping maybe you can get away with a few hundred samples.

Maxime: I'm not sure that it will work as well as with the full dataset, but yeah, it can happen.

Richie Cotton: I have to say, I like this idea of guests asking other guests questions. It makes my life much easier — I can sort of sit back and have a drink.

Paul: I was really curious about this one, because you cannot find a lot of information on this, or if you can find it, it's very polarized and you doubt it a lot.

Maxime: I'll make a LinkedIn post about it before you do, Paul.

Paul: Yeah, for sure. It will blow up.

Richie Cotton: Wonderful. So, yeah, it sounds like the challenge there is about getting high-quality data, and whether you can use synthetic data from another LLM in place of a real dataset. Cool. Alright. So, related to this, it seems like deployment is very much a sticking point for a lot of organizations.

So, Paul, suppose you've just received a model from Maxime — what are the next steps to getting this into production?

Paul: I would first ask him if we can just do this with cloud or OpenAI models, so we don't have to deploy anything at all. No — jokes aside, I think the hard part is finding compute when it comes to deployment, because deploying these big models is very costly, right? I think there are two main ways to deploy them:

either quantize them very heavily and try to deploy them on a CPU — to be honest, I don't have a lot of experience with that, only when I deployed them on my local machine to play around with things — or what I did to deploy them in a more production-grade way, which was still using quantization, plus GPU machines.

In that case, most of the time you need, as I said, to find a compute provider and a serving engine which will basically serve your model. Some serving engines are tools like vLLM, which is very popular, and TGI from Hugging Face. These two are open source and I've tested them and really liked them, but you can also opt for more

vendor-locked options like SageMaker, which we showed how to do in the book, and which I think is easier because AWS provides everything that you need in one place: the compute, the tools, the frameworks, everything. That's why I think so many people adopt solutions like AWS and SageMaker — because they're, I think, the easiest way to get started.

Other options that I've seen, more community-based ones that are really cool: I recently tested Hugging Face's dedicated inference endpoints, where under the hood they implement TGI, because it's made by them — TGI is made by Hugging Face — and it's just insanely simple, to be honest, to deploy a model with it. You just do a few clicks: choose your compute, choose your quantization method, choose your model, choose your scaling strategy, and deploy. But I think this is nice and sweet for more toy projects; when you want to deploy at scale, you need to be really careful in thinking about what your costs should be.

As I said before, do the math really well, choose the autoscaling methods properly, choose your serving framework properly, and configure your framework for your requirements. Again, in the end it's quite simple, because these requirements most of the time boil down to the four aspects I keep mentioning on this podcast: cost, latency, throughput, and your data type. And another thing that I realized recently is that it'll make your life so much easier if you pick battle-tested tools. So instead of trying to deploy your models on the fanciest, latest LLM serving tool that's out there and raging on LinkedIn or whatever, just sticking to the AWS ecosystem will make your life so much easier —

or your cloud vendor, like AWS, Azure, GCP, and so on and so forth. For example, I'm a big fan of AWS; you have so many options to deploy your model. You can start with Bedrock for the easiest option, where you don't have a lot of control but it's good enough to start with. Then you have SageMaker, which gives you more control but is quite costly.

But I believe that's a really good way to quickly ramp up your training and your first pipelines, and then you can go more low-level and actually deploy your model on ECS and EKS, which are Kubernetes or Kubernetes-like clusters, where you basically have full control over whatever you want to do and you can start playing around with whatever you have in mind.
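
As one concrete path for the open-source serving route Paul mentions, vLLM can expose an OpenAI-compatible endpoint that you then query like any hosted model. The model name, port, and launch command details below are illustrative assumptions; check the vLLM documentation for the flags your version supports.

# Querying a model served with vLLM's OpenAI-compatible server (illustrative sketch).
# Launch the server first, e.g. (depending on your vLLM version):
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Give me one tip for controlling inference costs."}],
    max_tokens=128,
)
print(response.choices[0].message.content)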

Richie Cotton: I like the idea of keeping your technology stack simple. You mentioned that cost control can be a problem, and I can certainly see that with the rise of these reasoning models and the rise of agents, which just consume a lot of tokens and get very expensive, costs are going to be a problem.

Is it possible to predict how much you're likely to need to spend in advance? Because it feels like this is something you need to worry about very early on in your project.

Paul: Yes and no.

Richie Cotton: There's good news and bad news there.

Paul: Yeah. In the end, you can make speculations and predictions, similar to the stock market. So you can calculate how much compute you need for your models: you know how big your model will be, then you can see what type of machines your model can fit on, and then you know how much those machines will cost.

Then you can assume how much traffic you'll have. Again, when it comes to model sizes and what machines you can put the models on, these are more predictable calculations. But when it comes to pure traffic, that can go either way, and there can be outliers — for example, what if your users start to input very large documents, like 3,000-page books? Because your costs also depend on that.

So besides the infrastructure that your model will sit on, you also have autoscaling. Basically you will need only one machine to serve a few clients, or more machines to serve more clients with bigger inputs. For example, let's assume that one client inputs an outlier, a book with 3,000 pages. Then a machine will be locked processing that book for a long period of time, and it'll start to spin up other machines that will start to eat your money.

So basically you have a few variables that are more fixed, which you can predict, but your users' traffic and the type of inputs they will provide are harder to predict. You can only make assumptions.
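
The "predictable" half of Paul's cost math fits in a few lines: instance price per hour times average instance count, plus an assumption about how much traffic one replica can absorb. Every number below is a placeholder, not a quote from any provider.

# Back-of-the-envelope GPU serving cost estimate (all figures are placeholders).
def monthly_cost(instance_price_per_hour: float,
                 avg_instances: float,
                 hours_per_month: float = 730) -> float:
    return instance_price_per_hour * avg_instances * hours_per_month

def requests_per_instance_hour(avg_tokens_per_request: int,
                               tokens_per_second: float) -> float:
    # How many requests one replica can absorb per hour at a given generation speed.
    seconds_per_request = avg_tokens_per_request / tokens_per_second
    return 3600 / seconds_per_request

# Example with made-up figures: a $2.50/hour GPU instance generating 50 tokens/s,
# requests averaging 800 tokens, and traffic needing ~2 replicas on average.
capacity = requests_per_instance_hour(avg_tokens_per_request=800, tokens_per_second=50)
print(f"~{capacity:.0f} requests per instance-hour")
print(f"~${monthly_cost(2.50, avg_instances=2):,.0f} per month")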

Richie Cotton: Okay. So it sounds like users are kind of the problem — they do things that are unexpected. So I suppose the solution to that is you start small, work out what your users are using your product for, and then you're going to get some data that gives you a bit more information on what you need to do as you scale things up.

Paul: Yeah, exactly. And I like to say that these are the type of problems that you want to have, because if you have users, it means that your product and what you're doing is good, but you need some further steps to improve it. So as you said, it's better to start small and add some protection layers on top of it, so you don't wake up with a huge cloud bill. The easiest way to do this is just to specify a maximum number of replicas that you want to scale to. When you do this, the worst thing that can happen is that it'll hinder people's experience:

if there are no more replicas to scale up, it means that some people will not be able to access the models in real time, or the way they were designed to. And for the next protection layer — that's why you need strong monitoring and observability pipelines, because that's how you gather all this information.

You can gather actual, real insights: on average, how much it costs to run your model; on average, how many machines you have spun up at a single point in time; what the cost of your whole infrastructure is. So it's better to start early, monitor, and gather statistics the right way, and based on that take data-driven decisions to quickly adapt, rather than just assuming how much it will cost.
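
Capping the maximum number of replicas, as Paul suggests, is usually a one-line setting in whatever autoscaling layer you use. As a sketch, here is how it might look for a SageMaker endpoint via the Application Auto Scaling API — the endpoint and variant names are hypothetical, and the exact parameters should be checked against the AWS docs.

# Hedged sketch: cap autoscaling for a (hypothetical) SageMaker endpoint variant.
import boto3

autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-llm-endpoint/variant/AllTraffic",  # hypothetical names
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,  # the hard ceiling that protects the bill
)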

Richie Cotton: That's actually quite reassuring — a lot of the secret to this is doing data analytics: collect data on what your users are doing and how much things are costing, do some analysis, and that's going to help keep things under control. So, Maxime, I'd like to talk about something I saw you posting about recently on your LinkedIn: you were talking about automatic abliteration.

So tell me, what is abliteration and why would you want to automate it?

Maxime: So abliteration is a technique that comes from mechanistic interpretability research in LLMs — specifically, a blog post on LessWrong called "Refusal in LLMs is mediated by a single direction" by Arditi et al. What they found is that when the LLM refuses a prompt, all this safety layer is actually very, very brittle.

It's actually very easy to modify the weights of an LLM so it cannot say no anymore; it cannot refuse any request. This technique has been refined by the open-source community, especially by someone named FailSpy, and this is now what we mean when we talk about abliteration: removing what's called the refusal direction from the model weights themselves.

To do that, we take prompts that the model refuses, we take prompts that the model accepts, and we take the difference in the intermediate calculations within the model. And we use this vector, this difference: we apply it to the weights directly, and it just works. So it's a really funny technique, because it should not work, really, but it does.

It has worked so, so well. That's my interest in it — this is the kind of stuff that should not be allowed, but we can still do it. And recently I applied it to Gemma 3, which is by Google, and it was very interesting to see that their model seems to be a lot more resilient to this type of attack than others.

So yeah, I tried to improve the technique a little bit by introducing different parameters that you can tweak to see what works and what doesn't work. So it was a fun little challenge with this new architecture, with these new models.
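
The core arithmetic Maxime describes — a "refusal direction" computed as the difference of mean activations on refused versus accepted prompts, then projected out of a weight matrix — fits in a few lines of NumPy. This is a conceptual sketch with random placeholder data, not his implementation; real abliteration operates on specific layers of an actual transformer model.

# Conceptual sketch of abliteration's weight edit (placeholder data, not a real model).
import numpy as np

hidden = 64
rng = np.random.default_rng(0)

# Hidden activations collected at one layer for refused vs. accepted prompts.
acts_refused = rng.normal(size=(100, hidden))
acts_accepted = rng.normal(size=(100, hidden))

# The "refusal direction": difference of the mean activations, normalized.
r = acts_refused.mean(axis=0) - acts_accepted.mean(axis=0)
r = r / np.linalg.norm(r)

# Remove that direction from a weight matrix that writes into the residual stream:
# W_new = W - r r^T W, so the layer can no longer produce the refusal component.
W = rng.normal(size=(hidden, hidden))
W_abliterated = W - np.outer(r, r) @ W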

Richie Cotton: Okay. This is fascinating because I think a lot of corporate use cases for LLMs, you really want to have quite tight control over what outputs are being produced and having like moderated output. Certainly if you've got like a customer chat bot, like you don't want it to start going off the rails and not talking about things relevant to the job.

So is abliteration then a problem for having these guardrails, or is this more of a theoretical attack on LLMs?

Maxime: I think that there's a big open-source community that doesn't really like these guardrails in general — the ones that are baked into the models — because they want to do stuff and the guardrails can get in the way. It doesn't always work very well, and they might get a refusal when it's not really meant to be refused.

So they really like these abliterated models in general; it's mostly the open-source community. And then this technique can also be used to do the opposite: you can create a model that refuses everything you say. You just ask, I don't know, what's the capital of France? And it's going to say, I'm sorry, but I cannot answer that.

So it can be used to actually raise the security level if you want. And it can be used in other creative ways. For example, people made a model that is absolutely depressed using this technique: everything that you say to this model, it'll answer with a very sad tone, you know, saying that it's very sad and the universe is bleak.

So you can do a lot of creative little things. What's very interesting to me with this technique is that you can change the behavior of the model, you can customize it, without training the model. So it's a lot cheaper to use, because there's no training involved, and despite that you can get pretty good results for some use cases.

Richie Cotton: Ah, interesting. So does that mean it might be a cheaper alternative to fine tuning then?

Maxime: Not yet, I have to say. I'm working on it a bit on the side. For example, a little project I made was to use it for classification. I thought, okay, I'll try to classify some text using this technique and do the same thing, where I have wrong classifications and correct classifications

and I take the difference. It kind of works: I go from 0% accuracy to 23% accuracy, but 23% accuracy is not very high. So I wouldn't say it's super useful; it just kind of works. Maybe with more work, by being a bit more surgical in the layers and the parameters that we use to apply this technique, we could raise the performance.

But I'm not sure we'll ever use it as a real contender, a competitor to fine-tuning. I'm not sure this will ever be the case, unfortunately.

Richie Cotton: Alright. So it sounds like it's more about moderation control then rather than customization of, of output content.

Maxime: for now, yes.

Richie Cotton: Okay.

Paul: But couldn't it even be used for preference alignment? Because you said you can introduce specific tones to the model.

Maxime: It's true, it's true. I think that for this kind of changing of the tone of the model, it can be really nice. But we would need to see a comparison between a model that has been properly preference-aligned and a model that has been abliterated, to see, okay, what do we lose?

Because abliteration is quite a barbaric process where you really destroy the weights of your model. And sometimes what you see is that the end model is a bit dumber: it's not going to perform as well on major benchmarks like MMLU. You lose a bit of intelligence because you were not subtle enough.

Paul: Okay, makes sense. And I actually have another curiosity: is this technique similar to merging? Because I know that there, also, for similar architectures, you directly do the combination between the weights of your models without training.

Maxime: No, you're right. Model merging is also a way to customize models without training, so it's related. It's just that the way you modify the weights is different, because with model merging you're just going to take the weights of other models and kind of apply them, combine them with what you have.

And here with abliteration, you're going to calculate this difference between two things and then apply this difference to the weights of your model. But I agree, this is very much related.

Richie Cotton: Nice. And actually, Paul, I was snooping on your LinkedIn as well, and I noticed that you were posting that people have been predicting the death of RAG for four years now, but it hasn't quite happened. So I'm curious as to why people think RAG is done for, and whether or not it's gonna happen.

Paul: The typical argument is that the whole purpose of RAG is that the context window of the LLM is limited. So in the beginning, when GPT and similar models came out, the context window was quite limited — I think 32,000 tokens or so, maybe I'm wrong. But anyway, it was quite small, and that's where RAG was born, because we wanted to throw in, say, a company's data, which is usually terabytes, petabytes, or even more in size.

And we want to give the model access to this data. But years later, research advanced and the context window got larger, and I think now Gemini has a 2-million-token context window or so — maybe, again, I'm wrong about the actual number, but this is the scale of context window that it can support.

So 2 million tokens is huge — I think you can throw many books in there with a single call and call it a day. If you see it like this, it doesn't make sense to do RAG anymore, because you just complicate things with chunking, embedding, finding these entities, storing, retrieving, and all that hassle.

But the issue is that it actually doesn't work well, because, as I said, if you throw everything into a model, it'll be really hard for it to focus on the right data. You then actually have a retrieval problem inside the prompt, because when you input a specific question, you need the model to be able to retrieve the information it needs for that question from the context itself, which usually doesn't perform that well, or is not that controllable.

Let's say it's more non-deterministic: maybe it'll find it, maybe it won't. But from what I've heard so far, usually it's more that it won't, or it won't find everything you need, because the models are usually biased toward the beginning and mostly the end of the context. So most of their answers are based on the end of the context,

I guess similar to humans, who usually remember the latest sentences from a conversation. So if what's relevant to your question is at the beginning or in the middle, it's most probably going to hallucinate or something like that. So this is one issue: you don't have control over the flow of information. And the second issue is, again, requirements and costs.

If every time you input millions or tens of thousands of tokens, the cost will skyrocket, and also the latency, because the model needs to compute all these tokens — so you add huge latency issues there. And again, now we hit compute requirements: your machines need more memory to hold this context, or not, and there are other engineering issues at hand, where you need to somehow store or cache that context. So there are other issues.

And actually RAG is still the simplest method, where you just throw all the data into a vector database and retrieve only the small chunks of data that you actually need. Like this, you put into the context only what you need for the specific question, and the model will focus only on what's useful.

And the prompt itself is smaller, so it's more cost-effective and latency-effective — it's actually a more predictable way to give the right answers.

Maxime: Something I would like to add is that LLMs are also not that good, actually, with long context. I worked a lot on evaluating the effective context length, and models that claim they can process more than 32K tokens are usually quite bad beyond that. I think that it's not as true with frontier models — frontier models can process pretty long context — but still, 2 million? I don't know, I wouldn't trust it, especially if the answer is somewhere in the middle; it's probably a very bad idea.

But it's true even with, like, 64K tokens, for example: I think the accuracy really drops quite fast.

So it's another reason why, even though a model is advertised as being able to process N tokens, it's probably not that effective in practice.

Paul: They use these numbers to advertise. It's similar to phones or machines: hey, my phone has another camera — just to make people buy it, but that doesn't mean it'll actually be more useful. And another aspect I would like to add is that when you want to further optimize your system, we actually move away from these two or three big models to smaller models that are more specialized on specific tasks.

And again, usually these smaller models have smaller context windows.

Richie Cotton: Alright, so it sounds like these promises of giant context windows in LLMs are a bit overhyped, and actually RAG is still a lot more useful: rather than stuffing absolutely everything into the prompt, you would just want to retrieve the useful bits and then augment your prompt with those bits.

Okay. So I'm curious, just to finish, what are you excited about in the world of AI?

Maxime: Yeah, I think there's a lot of focus on everything related to AGI and how it's supposed to solve all the problems in the world — with just AGI, you know, we don't really know why, but AGI is a magic word and it's going to solve everything. I think this is really cool, of course. I love seeing LLMs acing math problems that I don't even understand;

I think it's really beautiful when that happens. And this year I'm pretty sure that it's over for humans — we are going to get superhuman results in every international math competition. But what I'm mostly interested in is how we can embed this technology everywhere in a way that is not forced, because we can see companies trying to push users to use LLMs even though sometimes it's not what you want.

But I think that having the possibility of getting these offline assistants pretty much everywhere can be really game-changing in a lot of applications. In a lot of use cases it can be super useful for people to learn new skills. I know that I would have loved having this technology when I was younger and trying to learn German through Google Translate, which was not the best way of learning German.

So I'm mostly interested in that: how can we embed this technology in cars, in phones, in tablets, pretty much everywhere it can be useful, and not just push for it blindly.

Richie Cotton: Paul, what are you excited about?

Paul: I'm a little more grounded. I love productivity tools and productivity stuff, so I would say I'm a productivity geek in that sense, and most of what I work on, as you can see, is more or less around this. And I personally see AI, and mostly these LLMs — whether the current state of LLMs or the more task-specific or general-purpose LLMs, which we can slowly see as a new trend —

as the next industrial revolution. So I think it'll completely change how people interact with machines, how we will work, and how our lives will be, because how I personally see the future is that we'll be mostly focused on thinking and planning, solving problems, or other creative processes: okay, I want to talk about a specific topic, I want to write about a specific topic, I want to create a specific piece of art.

So I think that we'll automate a lot of these boring, mundane tasks, and it'll let us actually be more human, which is not that intuitive. But because we don't have to fill our whole day or lose so much time answering emails, or even writing blog posts or making and editing videos, we can just focus on what we want to express, on what we want to send to the world, on what we want to create.

And it'll boost a lot of our human side.

Richie Cotton: Brilliant. So, last question — I always ask for some follow recommendations. Tell me, whose work are you most excited about at the moment? Who should I be following?

Maxime: Personally, I really like the work done by the people at Hugging Face — everything, everywhere, everyone there. I think it's really cool to have them as a kind of center of excellence for all the open-source efforts. They've been brilliant: they started with the Transformers library and they slowly expanded — you know, they're everywhere. You cannot avoid them; they standardize everything, so you're doomed to use one of their tools. But they also do really cool scientific work and create really cool datasets.

I think we're really lucky to have them, and if you don't follow them already, this is a really cool place to learn about the latest news and the latest algorithms. They implement them, they talk about them, and they're really good at it.

Richie Cotton: Yeah, lots of great people at Hugging Face, so definitely worth looking into their work. And Paul?

Paul: Let me think. So usually, for example, when it comes to the more educational part of it, when it comes to teaching people, I really respect Chip Huyen's work. As an engineer and writer, or creator, or however you want to call yourself, she is a role model in that sense.

But at the same time, I also respect a lot of what Hugging Face is doing in the open-source space. And again, on the MLOps side of things, I respect a lot of what people are doing at companies such as ZenML or Databricks. There are so many people doing so much great work that it's hard to pick only one.

Richie Cotton: Absolutely, a lot of great people speaking on all these topics. Alright, wonderful. Thank you so much for your time, Maxime. Thank you so much for your time, Paul. It's been great having you on the show.

Maxime: Thank you.

Paul: Thank you, Richie. It was great being here. 
