
RAG 2.0 and The New Era of RAG Agents with Douwe Kiela, CEO at Contextual AI & Adjunct Professor at Stanford University

Richie and Douwe explore the misconceptions around the death of RAG, the evolution to RAG 2.0, its applications in high-stakes industries, metadata and entitlements in data governance, agentic systems in enterprise settings, and much more.
Jun 9, 2025

Guest
Douwe Kiela

Douwe Kiela is the CEO and co-founder of Contextual AI, a company at the forefront of next-generation language model development. He also serves as an Adjunct Professor in Symbolic Systems at Stanford University, where he contributes to advancing the theoretical and practical understanding of AI systems.

Before founding Contextual AI, Douwe was the Head of Research at Hugging Face, where he led groundbreaking efforts in natural language processing and machine learning. Prior to that, he was a Research Scientist and Research Lead at Meta’s FAIR (Fundamental AI Research) team, where he played a pivotal role in developing Retrieval-Augmented Generation (RAG)—a paradigm-shifting innovation in AI that combines retrieval systems with generative models for more grounded and contextually aware responses.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

A lot of previous RAG systems were what we called Frankenstein's RAG, where all of these parts are coupled together, but they're not really designed to work well together. Our RAG 2.0 approach is about making sure that all of these components are designed to be state of the art and they're also designed to work well together.

A particularly useful definition of an AI agent is just something that actively reasons. So something that thinks about what it's doing, formulates a plan, executes on the plan, and then can revise that plan based on the information that came in. So that's active reasoning. The really exciting technology that has enabled all of this is just test time reasoning and the insight that shifting compute from the training side to the test time inference side actually has very, very nice properties.

Key Takeaways

1

Embrace the complementary nature of Retrieval Augmented Generation (RAG) and long context windows to reduce hallucinations in AI models by retrieving and embedding relevant information into prompts.

2

Consider adopting RAG 2.0, which integrates state-of-the-art components designed to work harmoniously, especially for high-stakes use cases in regulated industries requiring high accuracy and data sensitivity.

3

Utilize natural language unit testing to define precise characteristics of acceptable AI responses, ensuring compliance and improving the quality of outputs in regulated environments.

Links From The Show

Contextual AI

Transcript

Richie Cotton: Welcome to the show. So I've been hearing a lot in the last year or so about people predicting the death of retrieval augmented generation. Why do you think people are thinking this?

Douwe Kiela: That usually happens with good, simple ideas, where people are trying to sort of rewrite history, maybe from a marketing perspective. RAG is such a simple idea that I think it's a little bit silly to declare it dead. I guess everybody listening in here knows what RAG stands for, right?

Retrieval augmented generation. And so the G is just any gen AI model. And then you want to make that model work on your data, which means you need to augment it with your data. And the way to do that is through some form of retrieval paradigm. It doesn't really make sense to pronounce that dead, I think.

But from a marketing perspective, a lot of people keep saying, oh, you don't need RAG, you need fine tuning, or you don't need RAG, you need long context windows. And yeah, we can go into the specifics of each of those, but I think these are mostly marketing tricks.

Richie Cotton: So I suppose it kind of feels a bit like linear regression: it's very simple, it's been around for a century, and it's still in use because it's simple, even though there are more sophisticated models out there.

Douwe Kiela: Exactly. Yeah. We even bought a domain, "is RAG dead yet", where we can blog about whether RAG is dead or not. These are the types of things. But I guess I'm biased.

Richie Cotton: Are there any problems, though? Like, are there any limitations to the sort of standard RAG approach?

Douwe Kiela: Oh, absolutely. So I think if you look at agents, retrieval in general is just one of the tools in the toolbox of an agent. So I think it's very true that RAG should not be the only thing that these agents can do. But yeah, that feels a little obvious.

In terms of the current limitations of RAG, I think part of the solution there is actually long context. If a retrieval system is imperfect, then ideally you want to cast a wider net, so you have more information that you can put into the context of the language model in the hope that there's something useful in there.

And that's where you need a longer context. That's why I always talk about these dichotomies as being false, right? You need both. You're not gonna put the entire internet in the context of your language model, so you are going to do some search.

So RAG, basically. But then you ideally want to put lots of search results in the context of the language model so that it can do its job and answer the question correctly, right? So yeah, we need all of those things.

Richie Cotton: So, just to make sure I've got this: the point is you reduce hallucinations by retrieving relevant information, putting that inside your prompt, and then that's gonna help the large language model give you the correct answer. And if you've got a longer context window, you can put more information in there, so that's gonna assist.

So the ideas of long context windows and RAG sound like they're competing, but actually they're complementary. Is that right?

Douwe Kiela: Very, very complementary. Yes.
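
For readers who want to see the mechanics, here is a minimal sketch of the retrieve-then-generate loop being described. The `vector_search` and `call_llm` callables are hypothetical stand-ins for whatever vector store and LLM client you use; a longer context window simply lets you raise `k` and cast a wider net.

```python
# A minimal RAG sketch: retrieve relevant passages, put them in the prompt,
# and let the model answer. `vector_search` and `call_llm` are hypothetical
# stand-ins for your own vector store and LLM client.
from typing import Callable, List

def rag_answer(
    question: str,
    vector_search: Callable[[str, int], List[str]],  # (query, k) -> passages
    call_llm: Callable[[str], str],                  # prompt -> completion
    k: int = 20,  # a longer context window lets you raise k (wider net)
) -> str:
    passages = vector_search(question, k)
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```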

Richie Cotton: Okay, wonderful. Now, I know at your company, Contextual, on your blog, you've introduced the idea of RAG 2.0. Can you tell me what RAG 2.0 is?

Douwe Kiela: So, the way we do it is different, I guess, from what everybody else is doing. We started the company when the world was very excited, and we were very frustrated because the technology wasn't quite ready, and that was especially true for enterprise use cases. So we knew that RAG would be part of the answer.

Obviously, given what we knew about RAG, we also knew that RAG was just the first idea there, and even in the RAG paper we talk about the vision of these components being designed to work together. A lot of RAG systems, like two years ago, were what we called Frankenstein's RAG, where all of these parts are kind of coupled together, but they're not really designed to work well together.

So our RAG 2.0 approach is about making sure that all of the components of a modern RAG pipeline, and that's not just the model, are designed to be state of the art, and they're also designed to work well together. And you can do that through training on the same data distribution, essentially, so that all the parts are designed to work well together.

They're literally trained to work well together. And so that combination of having very good components and having a very good compound system makes our system much better at RAG than anybody else's.

Richie Cotton: In theory that sounds like a very useful idea: the data in your vector database and the data in the large language model kind of harmonize in some way, so all the components are working together. Do you have a sense of when this might be useful and what sort of benefits you might get from it?

Douwe Kiela: They tend to be high-stakes use cases with a low tolerance for mistakes and high accuracy requirements, where you wanna have very accurate attributions. Very often this is in regulated industries, and where you're also sensitive about your data.

If you have those characteristics, then we are much better than anything else out there.

Richie Cotton: Yeah, I can see that if you've got a high requirement that the answer you're giving is correct in some way, you're gonna need to put more effort into doing this. Do you have any examples, maybe from one of your customers, of things that have been built using this approach?

Douwe Kiela: Yeah, so one use case I'm very proud of is the work we've been doing with Qualcomm. Their customer engineering department is using us as their only gen AI deployment at scale, as far as I know. These are thousands of engineers who are using us on a daily basis to answer very complicated questions.

So these are not the simple kinds of questions you'd ask internal search, like who is our 401k provider, or how many vacation days do I get, right? That's not really where the ROI of AI is going to come from.

You want to focus on the much more expert, specialist knowledge worker use cases. And so that's what we did with Qualcomm, and we can answer very complicated questions where I don't understand what the question means, and I definitely don't understand what the answer means. But the system does a really good job of explaining the information and giving the right answer.

Richie Cotton: So if you've just got a simple true-or-false, gimme-a-number, retrieve-one-fact kind of situation, then RAG 2.0 is maybe overkill. Is that right? But if you've got a more complicated question, that's where it's gonna shine.

Douwe Kiela: It's not really overkill. It's more that RAG 2.0 happens during training, right? Not during inference. During inference, it's still a normal RAG system, just a really, really good one. So it doesn't make things more complicated for you. If anything, it makes everything easier, because we offer it in one platform where you can build these agents in, really, 10 seconds.

I can build a state-of-the-art RAG agent in 10 seconds, which is, I think, pretty powerful.

Richie Cotton: Ten seconds is about as short a development time as you can get. So talk me through: what do you do to build something in 10 seconds?

Douwe Kiela: So, we have these two concepts of agents and data stores in our platform. What you would do is create a data store and then tell the agent that that is the data store it has to work on top of. And you put some files or a database into that data store.

And now you can talk to that data and do RAG on top of that data with a state-of-the-art pipeline.

Richie Cotton: So you just point it at your files or database and say, okay, go look at them, tell me something about what's in there. Alright, nice. And if you're going for one of these very strict use cases that you mentioned, like the regulated examples, do you need to do some tuning, like some kind of optimization, to make sure it works properly?

Douwe Kiela: Yeah, so you can do that. Our focus is performance.

Yeah, if you work in finance, for example, then you really want to get maximum performance, and that means that you need to really specialize for the use case, and you can do that through our platform. We allow you to tune not just the language model or just the retriever; we allow you to tune the entire RAG pipeline for your specific problem, and often that leads to pretty substantial performance improvements on top of already state-of-the-art out-of-the-box performance.

Richie Cotton: Alright. So it sounds like a lot of the secret to this is around the fine-tuning steps, and these other steps to make sure the model quality is high.

Douwe Kiela: So, once you have a grounded language model that is designed for RAG and nothing else, really specialized for RAG, then you can use that. If you want to use a reranker that is state of the art and can follow instructions, you can also do that. And it's the same for our retrieval pipeline and for our document intelligence pipeline.

So all of the components that make up our state-of-the-art RAG system are available on their own as well, because we just like seeing what people build with them.

Richie Cotton: Alright, so it seems like there's a lot of value in just starting with a model that is designed specifically for retrieving information, rather than a general-purpose one.

Douwe Kiela: Exactly. Yeah. A general-purpose model has to be good at everything, so as a result it's not as good at any one thing. That's why we specialize all these components.

Richie Cotton: I'm curious as to how you measure what the benefits are, how good the responses are.

Douwe Kiela: Evaluation is such an important area, and we do it in a variety of different ways. Obviously we look at benchmarks, but that doesn't necessarily translate to real-world performance. So we look much more closely at our actual customer data sets and the process of UAT, user acceptance testing. Do people actually consider this answer to be correct and useful? That's ultimately what you care about, right?

So that's one thing. The other thing worth mentioning here is that we have this framework around natural language unit testing. You're familiar with unit testing, right? It's what you do with code: step by step, measuring small things about your code, small units, and then making sure that they are correct. It's the same idea for language model responses, where you can delineate very precisely what the characteristics are of a good answer. So you could say, okay, it needs to mention this thing first, it needs to be in this particular style, it can absolutely not talk about this thing. Whatever the characteristics of the answer are, you can write specific unit tests for them.

Doing that gives you much, much richer signal than just looking at what everybody else is doing now with LLM-as-a-judge models, which is just: is this generated response equivalent to my ground truth response? Maybe that makes sense if you're generating a few sentences, but if you're generating a long answer, then that's just not gonna cut it.
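
As a concrete illustration, here is a minimal sketch of natural language unit testing, assuming a generic `call_llm` judge callable (a hypothetical stand-in for any LLM client, not Contextual AI's actual API). Each criterion is graded separately, which is what yields the richer signal compared with a single equivalence judgment.

```python
# A minimal sketch of natural language unit testing. Each "unit test" is a
# natural-language criterion; a judge model answers PASS or FAIL per
# criterion for a given generated response. `call_llm` is hypothetical.
from typing import Callable, List, Tuple

def run_nl_unit_tests(
    response: str,
    tests: List[str],
    call_llm: Callable[[str], str],
) -> List[Tuple[str, bool]]:
    """Check a generated response against each natural-language criterion."""
    results = []
    for test in tests:
        prompt = (
            "You are grading an AI-generated answer against one criterion.\n"
            f"Criterion: {test}\n"
            f"Answer:\n{response}\n"
            "Reply with exactly PASS or FAIL."
        )
        verdict = call_llm(prompt).strip().upper()
        results.append((test, verdict.startswith("PASS")))
    return results

# Example criteria, in the spirit of the ones mentioned in the episode:
tests = [
    "The answer must mention the effective policy date first.",
    "The answer must be written in a formal tone.",
    "The answer must not mention unreleased products.",
]
```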

Richie Cotton: So you want to be able to ask specific questions about the response and just check: does it match all of these different criteria? Does it pass?

Douwe Kiela: And that's really great too, right? Because if you think about a regulated industry, you'll have people in, say, the model risk management department of the bank, and they need to write out what a good answer looks like and what cannot be a part of any answer. So if they can just write those unit tests in natural language, and the system can test against that for every generation, that's something that will make regulators much happier, basically.

Richie Cotton: Yeah, certainly from my experience of creating software, when you write the tests, it really helps clarify what you actually want the software to do. And I'm sure it's the same situation with your large language model. You think about, well, what do you want the answer to be?

Douwe Kiela: Test-driven development.

Richie Cotton: Yeah, absolutely.

Cool. So I guess related to this is the idea of maintenance. Your data is changing fairly rapidly, so how do you go about maintaining your model to make sure it's continually giving the right results?

Douwe Kiela: What RAG is useful for is making sure that things keep working on your data even as the data changes, right? That's part of the point, so you kind of get that automatically. But one of the things that we're always thinking about is how we can make sure that we have the best components, and continuously keep updating those components and making sure that they're integrated into the overall pipeline.

And that actually takes a lot of effort, right? Just staying at the frontier of AI, which is moving so quickly, is a lot of work. So yeah, that's exciting too, that we get to do things like that as part of the broader research community.

Richie Cotton: So in theory you shouldn't have too much maintenance, as long as the pipeline keeps up with your data.

Douwe Kiela: Yeah. I was at this Gartner conference, the data and analytics one. And everybody at the Gartner conference, this is really not my crowd, to be honest; these are not, you know, AI people, they're real enterprise people. And it was really eye-opening for me to just be there and talk to all these amazing people and get perspectives different from what I've been hearing everywhere.

And everybody there was talking about making your data ready for AI, and I just felt like that's a massive cop-out. You should not have to make your data ready for AI. You should make your AI ready for your data. So in an ideal world, and I think where we're headed, you don't have to do anything to your data to make it work.

You just have very good AI that works on top of that data, and we just have to accept that data is noisy. That's why we have this multi-stage retrieval pipeline. That's why we have this powerful reranker that can follow instructions. Exactly because you need to make sense of noisy data, filter out the things that you don't want, and make sure you get the things that you do want into the language model.

So yeah, in an ideal world, in the long run, I can't promise it yet, but in the long run, you want to make sure that AI just works on your data, and you don't have to make your data AI-ready.

Richie Cotton: I think everyone listening in who works in data governance just gasped all at once.

Douwe Kiela: Why waste time preparing your data for AI?

Richie Cotton: Okay, so that's interesting. Do you have any advice for people who work in data governance, then? If your company is making AI applications that involve search or retrieval, what do data governance people need to do?

Douwe Kiela: Be very careful about your metadata. I think that's really one of the crucial parts of state-of-the-art modern RAG pipelines: making sure that you have high-quality metadata, sort of annotations for your documents, and database schemas, things like that. So that is definitely still important.

The other part is thinking carefully through your entitlements model, especially when you have multiple data sources that might be very disparate. We talk to a lot of companies that have SharePoint and Confluence and Jira and Slack and Google Drive and all kinds of different things, and then they want to work on top of all of those.

But getting that to actually work with a consistent entitlements setup is not trivial at all. So that's one of the things that we are good at. But if you have a more centralized way of dealing with that from a data governance perspective, then you can save yourself a big headache.

Richie Cotton: Yeah, I can certainly see how lineage is incredibly important. Do you wanna go into a bit more detail on entitlements? It sounds like if different people have different permissions on data, it makes retrieval a bit complicated.

Douwe Kiela: So if you think about real-world companies with, say, a hundred thousand employees, not everybody has access to all the data. Some companies, from a regulatory perspective, actually have hard fencing between departments. So you need to make sure that you capture that and that you don't make any mistakes. Setting things up the right way for handling that at the scale of a hundred thousand people in a large company, that's a very interesting problem.

Richie Cotton: Yeah. So in that case, I guess you want different responses depending on who is asking. Different people are gonna see different results.

Douwe Kiela: Yeah. As I always enjoy saying, everything is contextual, so you wanna get that context into the model. And so for every user, and every user interaction even, the context should be different.

Richie Cotton: Handling permissions and things has been around forever in relational databases. I don't know what the status is with vector databases. Is it easy to manage who gets access to what data? Is it as sophisticated as a traditional relational database?

Douwe Kiela: No, it's not nearly as sophisticated as relational databases, and even there it's not easy. You can do it in a single database, but doing it across databases is also not easy.

Richie Cotton: Is there a solution to this then? 

Douwe Kiela: Yeah. So the way we do it is we synchronize the data with our data store concept, which we talked about, right? And then when we do the retrieval step, we validate using an entitlements API that calls all the upstream APIs to make sure that you still have access.

And that obviously cannot always happen in real time, depending on your latency constraints, but there's some synchronization step that happens in between. So the model is basically: you ingest, you do the retrieval over that index, kind of a complicated index, and then when you find the results, you validate that the results you found are actually accessible by the user, using an entitlements API.
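
A minimal sketch of that post-retrieval entitlements check, assuming hypothetical `search_index` and `check_upstream_access` callables standing in for the vector index and the upstream permission APIs (SharePoint, Confluence, and so on); this is an illustration, not Contextual AI's actual implementation.

```python
# Post-retrieval entitlements filtering: retrieve first, then validate each
# hit against the source system's permissions before passing it to the
# generator. All callables here are hypothetical stand-ins.
from typing import Callable, List

def entitled_retrieval(
    query: str,
    user_id: str,
    search_index: Callable[[str, int], List[dict]],
    check_upstream_access: Callable[[str, str], bool],
    k: int = 10,
) -> List[dict]:
    # Over-retrieve so that filtering still leaves enough results.
    candidates = search_index(query, 3 * k)
    # Validate each candidate document against upstream permissions.
    allowed = [
        doc for doc in candidates
        if check_upstream_access(user_id, doc["source_uri"])
    ]
    return allowed[:k]
```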

Richie Cotton: So this is one of those problems where it sounds like it's gonna be much better if someone else solves it for you. It doesn't sound like the sort of thing you want to play around with yourself.

Douwe Kiela: That's what I've been trying to tell people. Stop trying to DIY complex RAG systems. It's not worth your time. Just like you wouldn't build your own database or your own language model, you shouldn't build your own RAG platform. You should be building RAG agents and applications on top of that platform, to solve important business problems for your business.

Richie Cotton: That seems incredibly important: try and figure out where you can add the most value in solving problems for your business, because that's where you should be spending your time. Do you have any examples?

Douwe Kiela: Yeah, so differentiated value is what companies are all about, right? Your company wants to be better than your competition; that's sort of the goal of every company. So the more you can focus on that, thinking about, okay, how can we use our data, and how can we automate some of our processes or improve some of the things that we do, the better.

The more time you can spend on that, and not on the optimal chunking strategy, or how to make sure that your vLLM doesn't continuously go down, or the other basic problems that you have to deal with on the RAG plumbing side of things, the better. The less you have to worry about the plumbing, the more you can worry about actually solving real problems that add business value.

Richie Cotton: That makes sense. Really think about how I can actually make my business better, and then try and get as close as possible to creating stuff at that level, rather than worrying too much about the lower-level infrastructure. Now, something you've mentioned a few times is the idea of a reranker. It sounds quite technical, but I feel like it's one of these underloved components of AI systems that isn't talked about much. So do you wanna explain: what is a reranker, and why would I need one?

Douwe Kiela: The idea is actually very simple. When you do retrieval, you need to do it over large amounts of data, so you cannot have a very big model doing that. With embedding models, you cache the embeddings, right? It's relatively cheap compute, and you get a bunch of candidate results back.

But you will have made mistakes, because you had to do it quickly enough, right? You couldn't have a smarter model take a look at it. So that's what the reranker is. It's a smarter model that does a second pass over your initial retrieval results and says, actually, I can see why this one seemed relevant, but it's really not, or, this one is actually super important for getting the right answer.

And so our reranker is state of the art by a large margin, so it's much better than other rerankers out there, and it's the only one that can follow instructions, which is really important because now you can tell it what your data hierarchy is. So, again, at the level of the user or even of the individual query, you can say: this is the priority of my retrieval results based on my preferences.

So, most recent first. If it's a PDF, then it's more true than if it's a Slack message. If it comes from our internal wiki, or if the boss wrote it, then it's more true than if it's an intern document with draft in the title, right? These types of rules for how to prioritize data and how to break conflicts in pipelines: that's why you need an instruction-following reranker, and that's one of the things we've built.
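
To illustrate the kind of data hierarchy being described, here is a minimal rule-based sketch. A real instruction-following reranker takes these rules as natural language instructions and applies them learned-end-to-end; here they are hard-coded for clarity, and the document fields (`score`, `source`, `modified`, `title`) are hypothetical.

```python
# A rule-based sketch of reranking rules like "most recent first, PDFs over
# Slack, demote drafts". Field names are hypothetical.
from datetime import datetime, timezone

def priority_boost(doc: dict) -> float:
    """Score adjustments encoding the data-hierarchy rules above."""
    boost = 0.0
    age_days = (datetime.now(timezone.utc) - doc["modified"]).days
    boost += max(0.0, 1.0 - age_days / 365)       # fresher is better
    boost += {"pdf": 0.5, "wiki": 0.4, "slack": 0.0}.get(doc["source"], 0.2)
    if "draft" in doc["title"].lower():
        boost -= 0.5                              # demote intern drafts
    return boost

def rerank(candidates: list[dict]) -> list[dict]:
    # Combine the first-stage relevance score with the metadata boost.
    return sorted(candidates,
                  key=lambda d: d["score"] + priority_boost(d),
                  reverse=True)
```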

Richie Cotton: Okay. Yeah, certainly the standard approach for retrieving information from a vector database is just simple linear algebra, cosine similarity or a dot product or whatever, which is, well, 19th-century math. I'm curious about the rules you mentioned for setting precedence. Is that something you write manually?

Douwe Kiela: We usually have pretty good default rules in place, so you don't have to come up with them yourself if you don't want to. You can also ask a language model, based on your data, what it thinks the rules should be. Or you can do it yourself, or you can just go with the defaults.

Richie Cotton: It sounds like something where you could get really, really into the weeds trying to figure out what the optimal strategy is. And it feels like, again, that could be a rabbit hole.

Douwe Kiela: So if you have a very specific ranking problem, and there's a very specific thing that needs to happen just because of the problem you're trying to solve, then you could try to solve it through the system prompt. But ideally you solve it by prompting your retrieval pipeline, right? You obviously can't do that in your first-stage retriever, because it doesn't have the ability to follow any instructions, but you can do it in your second stage. So in your reranker.

Richie Cotton: Suppose you've got information that conflicts. So you've got a prompt that's trying to give a specific answer, you pull some information from your knowledge base or whatever, and you've got two different results that give you different answers. Talk me through: how do you go about resolving this?

Douwe Kiela: It depends on the type of conflict, but you can tell it through instructions how to deal with that conflict, how to break the tie, basically. So you could say most recent first: if you find two documents and one is more recent, then you rank that one much higher. Or you could say, across different data sources, I prefer this data source over that one. That's why it's so important that you can prompt these components: there are different strategies for dealing with conflicts. One is you go for the most recent; another is you report both results, but you say which one is more recent. So it really depends, and that's why it's so important that those components can be prompted.

Richie Cotton: It's good that there is some way of resolving these conflicts. Is there any way of feeding the fact that you have these conflicts back into your data governance?

Douwe Kiela: So one of our very special capabilities is our ability to say, I don't know, which is an underrated feature. It's much better to say I don't know than to make up a wrong answer. So that's a really, really good thing.

But then when you have an I-don't-know answer, ideally you want the ability to annotate that answer, so that the next time you get the question, you do have relevant information in your knowledge base.

And the other thing is that we collect feedback through our UI or through our APIs, and then you can export that feedback and actually train on it. That's how you can specialize the system for the use case over time: just keep making it better and better by incorporating the feedback that you get.

Richie Cotton: So this sounds pretty useful. It's like going, okay, we've got wrong answers, we're going back to the people who actually curate or maintain the data, and then that's gonna feed into a better system.

Douwe Kiela: So over time, at least, that's how you can capture most of the distribution. Obviously the tail of the distribution is going to be hard, but that's fine.

Richie Cotton: We touched on agents a bit before. What actually counts as an agent?

Douwe Kiela: Yes. I think an agent is a very general concept. It comes from reinforcement learning, or it's maybe even older than reinforcement learning itself, where it's really just about a policy that takes actions in an environment. That policy has some sort of state, so it can manipulate the environment, but it doesn't necessarily have to manipulate the environment in order to be considered an agent.

Another way to put that, maybe a bit closer to home for folks listening to this podcast: if you think about an agent that can do SQL queries, some people will say, oh, but it's only an agent if it changes the environment, which would mean that it is only an agent if it generates insert queries or update queries.

And if it just does select queries, then suddenly it's no longer an agent. When you explain it like that, it sounds a bit silly, right? Obviously that's still an agent. Deep research doesn't do any insert queries or update queries, but it does do a lot of useful stuff. So I think a much more useful definition of an agent is just something that actively reasons.

So something that thinks about what it's doing, formulates a plan, executes on the plan, and then can revise that plan based on the information that comes in. That's active reasoning. And the really exciting technology that has enabled all of this is test-time reasoning, and the insight that shifting compute from the training side to the test-time inference side actually has very, very nice properties.
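
A minimal sketch of that plan-execute-revise loop, with `call_llm` and `run_tool` as hypothetical stand-ins for a reasoning model client and a tool layer; real agent frameworks add structure around tool schemas and termination, but the active-reasoning shape is the same.

```python
# The "actively reasoning" loop: formulate a plan, execute step by step,
# and revise the plan based on what comes back. Callables are hypothetical.
from typing import Callable

def agent_loop(
    task: str,
    call_llm: Callable[[str], str],
    run_tool: Callable[[str], str],
    max_steps: int = 8,
) -> str:
    plan = call_llm(f"Formulate a step-by-step plan for: {task}")
    observations: list[str] = []
    for _ in range(max_steps):
        action = call_llm(
            f"Task: {task}\nPlan: {plan}\nObservations so far: {observations}\n"
            "Reply with the next tool call, or FINISH: <answer> if done."
        )
        if action.startswith("FINISH:"):
            return action.removeprefix("FINISH:").strip()
        observations.append(run_tool(action))
        # Revise the plan in light of the new information (active reasoning).
        plan = call_llm(f"Revise this plan given {observations[-1]!r}: {plan}")
    return call_llm(f"Give the best final answer for {task} given {observations}")
```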

Richie Cotton: Okay, yeah, I like that. Your first explanation was introducing all the important pieces of jargon from reinforcement learning, the idea of policies and environments. But I like the idea of it just doing reasoning at the time you ask, where it can do things step by step, form a plan, and go and execute it.

Alright, so the second part of this: if you've got some retrieval augmented generation involved, do you need to do things differently with an agent as opposed to just a simple chatbot?

Douwe Kiela: I think the simple chatbot is a subset of the overall concept of an agent. So you can have a chatbot agent, and for most chatbot use cases it probably doesn't have to do a lot of thinking. The same agent can also power that use case; it is just more powerful.

It depends on how much you want to budget for test-time compute. If you wanna minimize that, then you probably just have something like a standard chatbot. But the boundary is very blurry, because when I have a RAG application, which maybe is not really an agent, but as part of my RAG step I do query decomposition, and I formulate sort of a plan, and then I do some filtering on top, that's sort of an agent, right? That is what agents would do; an agent just maybe determines more dynamically, on the fly, what it wants to do. But yeah, the boundary is not as well defined, I think, as a lot of people like to pretend.

So overall, with agents, there's a lot of excitement. People are sort of blown away by the potential, but in practice I haven't seen any real agent deployments that have material impact on a company's business yet. I'm sure it's gonna come, but it's gonna take some time. But these agents, obviously, they need to work on your data too, right? Just like what we were doing before with RAG. Retrieval is one of the tools these agents rely on.

Richie Cotton: This is interesting. I think the big difference is, with the sort of standard RAG approach, you've got some software saying, okay, let's shove all this relevant information from the database into the prompt, so it's kind of being pushed to the LLM. Whereas if you're doing things with inference-time reasoning, the LLM has to ask for information, kind of pulling it in.

Douwe Kiela: Yeah, that's right. But again, it's not really one or the other; it's really a spectrum. I think most modern RAG systems probably have some kind of classifier that says, should I retrieve or not? And based on that, you decide: if I say hello, then you don't have to retrieve in your RAG chatbot.

You just say hello back. So that's already kind of active retrieval, whereas the old RAG setup was really passive retrieval: you get a query, you always search for that query in your vector database, and you always give the results to the language model.

That's prehistoric at this point. Modern pipelines are much more complicated. There's active decision making involved, there's a lot of filtering, there's the reranker, which has a huge impact, and there's this active retrieval component.

There's query decomposition, which is almost like formulating a retrieval plan. That's all very agentic. But when you have an agent, you can do that much more dynamically, and that's why it's so powerful.
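
Here is a minimal sketch of that should-I-retrieve decision, again with hypothetical `call_llm` and `search` callables; production classifiers are often smaller dedicated models, but the control flow is the point.

```python
# Active retrieval: classify whether the query needs retrieval at all
# before searching. Callables are hypothetical stand-ins.
from typing import Callable, List

def answer(
    query: str,
    call_llm: Callable[[str], str],
    search: Callable[[str], List[str]],
) -> str:
    decision = call_llm(
        f"Does answering this require looking up documents? {query!r}\n"
        "Reply YES or NO."
    )
    if decision.strip().upper().startswith("YES"):
        docs = search(query)
        return call_llm(f"Answer {query!r} using only this context: {docs}")
    # e.g. "hello" needs no retrieval; just respond directly
    return call_llm(query)
```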

Richie Cotton: Wonderful. Yeah, so you've got a lot more flexibility in the behavior, compared to a more rigid pipeline.

Douwe Kiela: Yeah, exactly. And it can also recover from its mistakes, which is very important, right? If it retrieves something and it thinks, oh, actually this is not what I wanted, let me try a different query, and then it gets the result, that's very powerful. And that's something you can do with these agentic abilities.

And the other thing is more around multi-hop questions, or multi-step reasoning: first I need to know this, then I need to know that, then I need to compare those things, and then maybe, based on the result, do something else, right? That type of multi-hop problem is, I think, very interesting in terms of business value as well.

Richie Cotton: So you mentioned that agent deployments haven't really wowed you so far.

Douwe Kiela: They've wowed me in terms of seeing the potential, but they haven't wowed me in the sense of, oh, this company has saved $10 million this year because they had an agent doing something using test-time reasoning, and not something that was branded an agent but that we were already capable of doing before reasoning models.

Richie Cotton: What do you think the sticking points are? Like where are we falling short?

Douwe Kiela: I think generalization; actually having it work in real-world settings, not in a toy domain. Getting things to work in a toy domain or in a nice demo, that used to be my story around RAG, and it's still very true: people think that RAG is easy because you can build a nice RAG demo on a single document very quickly now, and it will be pretty nice.

But getting this to actually work at scale on real-world data, where you have enterprise constraints, is a very different problem. And it's the same with agents, where it's like, oh, I can make something do this one particular thing when I prompt it, and basically just make everything look good for exactly the one thing I wanted it to do.

But then when you actually have to make this work in production, everything just breaks down very quickly still. So that's gonna get better over time, obviously, despite the hype, but it's gonna take some time for these systems to be enterprise-grade enough for anybody to deploy them.

Richie Cotton: Since a lot of companies are just thinking, okay, we've gotta get in on agents, we've gotta build something, what can you build that is likely to work and add value?

Douwe Kiela: So you can build basic solutions for relatively boring problems. One thing you can do is go for the basic problems, where you ask basic questions, like internal search. But that doesn't really get you value. It's much more about trying to find workflows that exist in your company.

Workflows that are a little bit boring, but where it's important that you get it right and where it requires some expertise. If you can solve those problems, then you can make your company much better. These could be very simple things, from checking for compliance against your set of policies to doing basic research.

We have a very nice demo where we fill out Excel spreadsheets with unstructured data on the fly, so that you don't have to manually go and copy and paste. You just call a macro directly in Excel and fill out the spreadsheet with unstructured data from different data sources.

Doing things like customer support. There's a lot of low-hanging fruit on the code-gen side of things. A lot of it is happening across the board, right? It's just that doing this the right way takes time. So there's a big gap between where the hype cycle is and where reality is in enterprises.

But I mean, it's coming. It just takes time.

Richie Cotton: So it sounds like the best approach is to go for slightly more narrow use cases, where less flexibility and less generalization are needed, and almost build disposable agents: something you can build quickly to just solve your problem, and then you're done.

Douwe Kiela: I like that idea. You can build disposable agents on our platform. I like that branding.

Richie Cotton: Nice. Cool. Okay, so just to wrap up: what are you most excited about in the world of AI at the moment?

Douwe Kiela: Yeah. So I'm obviously very excited about all the agentic things. I think for me personally, where I see a lot of very interesting problems is at the intersection of structured and unstructured data. So you have a bunch of documents, but you also have your traditional structured relational databases.

Your Snowflake or your BigQuery or whatever you use. And now you wanna cross-reference that information using agentic RAG. If you can do that, which you can now start doing because of these agentic abilities, that unlocks so much interesting potential. So I think that's really exciting.

The other thing is multimodality, which is obviously still very unexplored. Every time there's an image generation feature getting shipped, it kind of goes viral, but I think image understanding is actually much more valuable from an enterprise perspective.

That's the unlock that is coming soon. Chart understanding, for example, and things like that. Understanding, you know, a McKinsey slide deck that has lots of different diagrams and charts in there; right now, that's not really within the capabilities of these systems, but it's coming very quickly.

Richie Cotton: I've got a follow-up on that. So talk me through it: you basically want the ability to understand a presentation, to just throw in a PowerPoint and have it explain what's in there?

Douwe Kiela: That's the simple case. The hard case is: I have a hundred million PowerPoints that my company has made in its history, and now I want to answer questions based on all of that information.

Richie Cotton: Okay.

Douwe Kiela: So it's not just one single PowerPoint, because that you can kind of start to do, even though it's not very accurate. You need to do it at scale, over lots of presentations. If you can do that, then you can do very interesting kinds of synthesis on top of it, right? How did our perspective on a particular thing change over the years? And then you can just find the relevant presentation decks that covered that particular thing.

Look at the charts, reason about the trends in the charts, and then combine that into a new insight. That's all starting to become possible now.

Richie Cotton: That would be very interesting. I have to say, I've definitely had a few colleagues in the past where, even with them talking over the PowerPoint, I couldn't make sense of it.

Douwe Kiela: Everybody's slides are garbage.

Richie Cotton: Absolutely, absolutely. Wonderful. Alright, so since you've been involved in RAG since the beginning, I mean, you were there when RAG was created, part of that team, do you think RAG has lived up to your expectations? Has it panned out as you expected?

Douwe Kiela: That's a nice question. I think the original vision was always that we would have a kind of decoupling between the knowledge and the reasoning, where the reasoning is really just taking whatever the relevant knowledge is and giving the right answer on top of it, without having any of the knowledge in the model itself. So that didn't really pan out, and that's part of the reason why these systems hallucinate. It's a longer story. But for RAG, I think when the paper came out it was very focused on open-domain question answering, which is the domain that you evaluate these systems on.

So it was well received. But at the time, gen AI, and things like vector databases, basically didn't exist; those became a thing after the paper. Large language models didn't really exist either. We had BART and T5, but there wasn't really a concept of an autoregressive generative model in the way there is now.

So I think the reason RAG became such a popular paradigm and concept, and why it's called that, is because of the G. It's really just because gen AI became a thing that RAG became a thing. And there are lots of papers from around that same time. There's an amazing paper from folks at Google called REALM, but that didn't become the name of the paradigm because it didn't have a G in it. It was a masked language model. So yeah, hindsight is sort of 20/20.

Richie Cotton: It's amazing how a small change in the name makes a big difference to your success or not.

Douwe Kiela: I mean, it's not just the name, right? It's that we were interested in trying to see if you could generate the answer. The alternative is just predicting the answer, which was much more normal to do at the time. So I guess we were ahead of our time in the right way there.

Richie Cotton: Wonderful. And certainly it's taken over in so many ways. Alright, so just finally: I always ask for recommendations on whose ideas people should follow. Where are you finding interesting work?

Douwe Kiela: There's interesting work happening in this new test-time compute paradigm. I still have my part-time Stanford adjunct professor gig, which is great for me to stay at least a little bit up to date on the latest research trends, and I think there's just a lot of interesting research happening around test-time reasoning and what you can do there.

We've only scratched the surface. And what happened with DeepSeek and things like that has been very encouraging from that perspective: it's actually not that hard for non-frontier-lab folks to do interesting things in this space and have impact. So yeah, I follow a lot of just smart academics.

Richie Cotton: Smart academics are always worth following, I think. It's a good genre of people to get ideas from.

Related

blog

Agentic RAG: How It Works, Use Cases, Comparison With RAG

Learn about Agentic RAG, an AI paradigm combining agentic AI and RAG for autonomous information access and generation.

Bhavishya Pandit

6 min

blog

Advanced RAG Techniques

Learn advanced RAG methods like dense retrieval, reranking, or multi-step reasoning to tackle issues like hallucination or ambiguity.

Stanislav Karzhev

12 min

blog

Top 30 RAG Interview Questions and Answers for 2025

Get ready for your AI interview with 30 key RAG interview questions that cover foundational to advanced concepts.

Ryan Ong

15 min

blog

What is Retrieval Augmented Generation (RAG)?

Learn how Retrieval Augmented Generation (RAG) enhances large language models by integrating external data sources.

Natassha Selvaraj

6 min

podcast

Creating High Quality AI Applications with Theresa Parker & Sudhi Balan, Rocket Software

Richie, Theresa, and Sudhi explore RAG, its applications in customer support and loan processing, the role of testing and guardrails in AI, cost management strategies, the potential of AI to transform customer experiences, and much more.

Sudhi Balan

50 min

Tutorial

Agentic RAG: Step-by-Step Tutorial With Demo Project

Learn how to build an Agentic RAG pipeline from scratch, integrating local data sources and web scraping to generate context-aware responses to user queries.

Bhavishya Pandit

12 min
