
Will World Models Bring us AGI? with Eric Xing, President & Professor at MBZUAI

Richie and Eric explore world models as simulators for action, the jump from book intelligence to physical and social skills, why long-horizon planning is still hard, architectures, robots, data generation, open K2 Think LLMs, virtual-cell biology, and much more.
Mar 16, 2026

Guest
Professor Eric Xing

Professor Eric Xing is President of Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and a world-leading computer scientist whose work spans statistical machine learning, distributed systems, computational biology, and healthcare AI. A fellow of AAAI, IEEE, and the American Statistical Association, he has authored over 400 research papers cited more than 44,000 times. Before MBZUAI, Eric was a Professor of Computer Science at Carnegie Mellon University, where he also founded the Center for Machine Learning and Health. He is the founder and chief scientist of Petuum Inc., recognized as a World Economic Forum Technology Pioneer, and has held visiting roles at Stanford and Facebook. He holds PhDs in both Molecular Biology and Computer Science.


Host
Richie Cotton

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.


Key Quotes

The reason we need a world model is that it's actually a simulator of the real world, of the physical and social world, where agents can learn autonomously and perpetually, just like you send a pilot into a flight simulator. It is not going to deliver on everything in the world, but it will give you the experiences that an agent can learn from.

Most of the so-called world models out there only generate videos, and short videos at that, maybe 60 seconds or two minutes. It's not because of a memory boundary or compute boundary. The real problem is the technical limitation of long-horizon consistency and long-horizon planning.

Key Takeaways

1

World models are positioned as a bridge from “book intelligence” to “physical intelligence,” enabling agents to act, plan, and learn from simulated experience—not just generate plausible text or video.

2

A key unsolved technical challenge is long-horizon consistency and planning, which shows up when you try to steer a generated world over minutes/hours rather than seconds.

3

Openness is positioned as a competitive advantage for trust and scientific rigor: publishing weights, data, recipes, logs, and checkpoints helps address contamination concerns and improves reproducibility.

Links From The Show

MBZUAI

Transcript

Richie Cotton: Hi, Eric, welcome to the show.

Eric Xing: Hi, Richie. Nice to meet you. Thank you for having me. 

Richie Cotton: Yeah, great to have you here. Now I wanna start off by talking about world models. There've been a few prominent experts, like Yann LeCun, who have suggested that these are the next big thing after LLMs. Are world models gonna be the thing that takes us to AGI?

Eric Xing: First of all, AGI is a very ambiguous word that many people use to refer to different things.

So I'm not sure I know the answer to what you ask, that world models alone are going to take us to AGI. In a recent conversation, I broke intelligence down into multiple levels, and you can see where world models sit in the middle.

What the large language models currently deliver, which we see to be amazing, like solving very fancy math problems and engaging in very intense and clever conversations, I call book intelligence. It basically draws knowledge and insights from written materials produced in history, and also delivers that kind of outcome in text.

But on the other hand, if you want to do real-world things, like pour coffee into this cup, navigate a real environment, or plan for complex missions, you already go beyond book knowledge, because you need to put those plans and that reasoning into action. That requires what we call physical intelligence, right?

And obviously, the world model is right there to deliver that type of intelligence, because a world model is trained on more than textual data; it also uses image data and other sensory data. So that level of intelligence is stronger, a superset of book knowledge. But then, beyond that, why do we need a world model?

It's actually a simulator of the real world, of the physical and the social world, where agents can learn autonomously and perpetually, just like you send a pilot into a flight simulator. So the world model is basically providing that. It is not going to deliver everything you need, but it gives you the experiences that an agent can learn from.

The next level of intelligence is a different one, which I call social intelligence, because now agents not only need to know skills, but also need to know the boundaries of their skills. Who am I? Who are you? How can we collaborate? How can we divide a job among different people so that together we can jointly finish something bigger than any individual could?

That social intelligence is really behind a new generation of agent models, different from what we now call agents, which are autonomous scripts accomplishing fixed tasks. After you have that agent intelligence, there is yet another level, which I call philosophical intelligence, which means that the agents start to develop their own curiosity and ask their own questions.

They develop the drive, and maybe even the reward, of solving those problems, like we do as scientists in discovery, and so forth. And that requires all the levels of intelligence I just described. Maybe after philosophical intelligence, I can see what AGI would look like. Do world models lead to AGI?

I believe they are an essential step, but still not enough.

Richie Cotton: That's a really great hierarchy. I'd say it's something I've not heard before, but certainly the idea of being book smart is, I guess, a common plot line. In Hollywood movies you have someone who's read a lot of books, and then you throw 'em into an adventure and they can't cope. But,

Eric Xing: Exactly. Yeah.

Richie Cotton: So that's interesting. The next level up is physical intelligence. I presume this is mostly robots we're talking about, or are there other use cases here?

Eric Xing: Yeah, robots are a major platform, but you can imagine even developing personalized games.

Metaverse experiences also require that level of physical intelligence, because you need to simulate the physical world very faithfully and creatively, and with infinite complexity, so that without going into the physical world, you can already start experiencing all the possibilities.

I call this an environment for thought experiments. In fact, when people do reasoning, very often it is not through optimization but through thought experiments. We just imagine different scenarios. We have our mental world model that can give us the outcome of a complex sequence of actions that we would like to perform in the physical world, but we don't need to go there.

That's how you train the robots which are supposed to land on Mars, without going to Mars, right? So the world model basically gives you that possibility. It is a simulator of all possibilities, conditioned on our actions in all situations. It could be in a digital, virtual, or cyber environment.

It could also be in a physical environment. I think for a robot to really develop true general-purpose physical intelligence, it needs a world model. But even people, when they want to train themselves for a complex environment, would use a world model without going to the real environment.

Richie Cotton: Okay, that's interesting. So you've got the real-world physical environment, that's the robots, and then you've got the virtual environments, that's games, that's the metaverse, things like that. So what does the cutting edge look like? What's the fanciest world model capable of at the moment?

Eric Xing: Yeah, you mentioned Yann LeCun's model, and there are a couple of others out there. The model behind Genie is also, in my opinion, a world model. All these world models give you the impression that they can generate scenarios in the physical world, some through pixel manipulation.

And you don't actually need to understand the world; you can still generate pretty videos that give you this feeling of having physical knowledge. But if you want to put actions in the middle and start to steer this physical world, you will tell the difference. Some world models are simply not able to take in arbitrary actions in a steerable way that leads you to the outcome caused by your action. Some are able to. I will say right now there are still a lot of disagreements even on what a world model is. Most of the so-called world models out there only generate videos, and short videos at that, maybe 60 seconds or two minutes and so forth.

It's not because of a memory boundary or a compute boundary. The real problem is the technical limitation of long-horizon consistency and long-horizon planning. For example, you can play with the existing video generation models. You say: I look straight ahead, and in five minutes I turn around; show me what I see. You may see a different thing already from what you started with, because maintaining consistency over five minutes while you turn around is actually an architectural problem. In your architecture, you need to have representations which carry that information consistently, without disruption, without distortion, and which also ingest any perturbations in the inputs.
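To make that intuition concrete, here is a toy numerical sketch (purely illustrative, not PAN's or any real model's mechanism): an autoregressive rollout whose latent update is almost perfect still drifts as the horizon grows, which is why generated worlds stay coherent for seconds but fall apart over minutes.

```python
import numpy as np

# Toy illustration: a recurrent state updated with a tiny per-step error
# still drifts from the truth as the horizon grows (roughly like sqrt(t)).
rng = np.random.default_rng(0)
target = np.ones(8)                # the state a consistent world should hold
state = target.copy()
for t in range(1, 1801):           # ~30 minutes at one step per second
    state = state + rng.normal(scale=0.01, size=8)   # 1% per-step error
    if t in (10, 60, 600, 1800):
        print(f"step {t:5d}  drift {np.linalg.norm(state - target):.3f}")
```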

The right architectures are still being developed right now. You saw, for example, one proposal by Yann LeCun called JEPA, the joint embedding predictive architecture. I recently had a debate with him, because I have my reservations about that architecture.

That architecture basically says that you need to do reasoning only in the latent space, and you also verify your reasoning in the latent space, meaning that you only do thought experiments without ever emerging into the real world, which to me is a very problematic architecture.

The real world gives you the opportunity to correct, to calibrate. But why do they argue for latent-only reasoning? Because coming back to the real world has a cost; it increases the computational cost whenever you reconstruct the real world. There has even been an argument that it is not necessarily possible to reconstruct every detail of the real world, and therefore joint embedding is a cheaper and more cost-effective way of doing it.

But on the other hand, there is a different way of doing a closed loop, where you take information from the real world and go back, but without necessarily all the pixel-level details. That basically leads to a representational question: how do you represent real-world knowledge using proper encodings?

The architecture that we propose, called the generative latent prediction architecture, uses a generative framework which allows different levels of abstraction and granularity to be reconstructed and compared to the real world. I'm not saying this is the right model, but I want to say that the world model debate and development are far from settled.
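As a rough sketch of the distinction being drawn here, consider the following minimal loop; every name in it (encoder, predictor, decoder, env) is hypothetical rather than the actual GLP or PAN API. Reasoning stays cheap in latent space, but the loop occasionally decodes back to observation space and compares against reality to re-ground itself, which a purely latent rollout never does.

```python
import numpy as np

def glp_style_rollout(encoder, predictor, decoder, env, horizon, ground_every=20):
    """Sketch of a generative-latent-prediction loop (hypothetical API).

    Thought experiments run in latent space, but predictions are
    periodically decoded and compared with real observations, so
    drift can be detected and corrected instead of compounding.
    """
    z = encoder(env.observe())            # abstract latent state
    states = [z]
    for t in range(1, horizon + 1):
        z = predictor(z)                  # cheap latent-space reasoning step
        if t % ground_every == 0:         # occasionally pay the decoding cost
            predicted = decoder(z)        # back to observation space
            actual = env.observe()
            if np.linalg.norm(predicted - actual) > 1.0:  # arbitrary threshold
                z = encoder(actual)       # re-calibrate against reality
        states.append(z)
    return states
```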

It's just the start of this new movement, which is already going beyond large language models. One thing all of us who work on world models agree on is that, yes, large language models are not going to be sufficient to get to that level of intelligence. But then what happens next?

We'll see, I think, in this year. Many teams are already in the game, pushing in parallel for basically different architectures. We recently released the PAN model, which actually uses a mixed-backbone architecture that combines the symbolic inference capability of a large language model, which allows it to do very long-term inference.

Because once you go symbolic, maintaining long-horizon consistency is cheap, but on the other hand it doesn't give you pixel-level resolution and fidelity. On top of that, we have a diffusion-based backbone that does the more reflexive, higher-resolution, short-range consistency inferences alongside the symbolic inferences.

So this is a joint architecture that is very different from what we see in other papers, which of course makes the computation a little bit more difficult. But on the other hand, there are additional approaches to reduce the cost, through approximation or through infrastructure innovations and things like that.

Richie Cotton: The fundamental problem that you're describing, how do you represent the world, how do you represent knowledge, is something I feel philosophers have been arguing about from Plato onwards. So it's interesting that how best to encode this stuff is still an unresolved debate.

Eric Xing: I think this is a question that needs to be asked every time you face a new problem, because every time you face a new problem, that means you are solving it in a new space of data and possibility, and therefore the information naturally poses you the question of how to represent it, right?

For example, in the JEPA framework, the proposal is to use only continuous representations in the latent space, which basically rules out the symbolic knowledge that large language models already handle with high efficiency. That is for a good reason, because continuous information allows you to play with gradient-based calculation, which is efficient.

Basically, the assumption is that maybe our brain is only doing gradient descent for learning and for action. But on the other hand, I think we do not only do gradient-based computation; we may do nearest-neighbor search or comparison, or indexing-based retrieval. These are typical operations on symbolic, discrete knowledge.

So once you have that representation, you open up one more modality of computational inference. The interesting part of the research, which gets very technical, is that different specialists and innovators are really playing with this very, very big space of possible innovations, to find an inference engine that is high-fidelity and high-quality but also computationally affordable and efficient.

There's an obvious trade-off there.

Richie Cotton: Okay. So there are essentially lots of different ways of doing it, maybe different representations for different use cases, and I guess a lot of research to be done. You mentioned that you have your own world model, PAN. Maybe let's take a step back: why did you decide to build it? What were the goals of the project?

Eric Xing: The goal of the project is to really try out this new idea, and also to build a more general-purpose world model that serves our needs, because in our definition, a world model is way more than video generation.

If you look at driving, for example, a visually impaired person can still drive quite well, even though they don't have high-resolution pictorial knowledge of the world, right? So you actually need to model the world at the necessary resolution to do reasoning. That's basically a guiding principle in our world model architecture design.

So we designed a world model that allows you to do thought experiments, to do simulation, and we found that there have been no such instances out there trying these ideas. With the PAN model, we want to prove, first of all, that this is the necessary pathway toward more general intelligence and inference, which allows you to do what people call system one and system two reasoning. One is reflexive and intuitive; the other is more deliberate, purposeful, symbolic reasoning. These two are both needed. One is maybe called muscle memory: you don't need to think and you can still act. The other requires you to do very abstract thinking so that you can do long-term planning.

I think PAN is the only architecture out there that facilitates both types of reasoning. Our PAN model also benefits from our own development of large language models. So, different from Yann's architecture, we don't necessarily dismiss large language models. We think a large language model is not enough, but it is still useful, because we use it as a building block inside the world model to cover part of the reasoning task, the long-term symbolic reasoning. Therefore, we actually have our own backbone large language model, trained from scratch in other projects. So we have a very good ecosystem, unlike many other developers, where we can be self-sufficient in providing all these different building blocks.

And when we find a deficiency or flaw, we can go back to this model and retrain or augment it. This is very different from taking, for example, a Qwen model or some other open-weight model, plugging it in, and just using it as it is. In our case, we have a more systematic and holistic approach and the opportunity to get everything right.

Richie Cotton: Okay. I like what you said about there being a difference between muscle memory and having to think abstractly and reason about things. I suppose that's the way humans do this as well: sometimes you can do a task and you don't need to think about it, it's just something you've learned, and straight away you're solving it. Okay, so do you wanna talk me through what goes into this model then? You said you've got a large language model in there. What sort of data goes in to train a world model?

Eric Xing: So it is not trained, at this point, monolithically in one shot. The model divides into different modules. You have a visual language model as an encoder to basically ingest the real-world signals, be they visual signals or text prompts, and turn these signals into latent representations, discrete or symbolic, that can enter the reasoning backbone.

The reasoning backbone itself is a large language model, but augmented with a bigger vocabulary, because you actually get a lot of new vocabulary by discretizing the visual signals. Therefore, you are really inferencing not only over the traditional text dictionary, but also over more abstract representations of physical phenomena and behaviors.

And then the last stage is a diffusion-based decoder, which allows you, in a sense, to produce the outcome of the action and the reasoning in a way that can be compared with what you actually see in the real world. These three components are trained separately, based on either pure text data, pure visual data, or a combination of visual and text data. In fact, we also have placeholders for future sensory data, audio data, and other modalities. And once these building blocks are separately produced to some satisfactory quality, you put them together to do the joint tuning and fitting, right?

It's like building an aircraft: the parts are built separately, but then there's an assembly phase where you need to do a lot of tests, holistically, on everything.
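A minimal sketch of that three-stage composition might look as follows, with every class and method name hypothetical (this is not the released PAN code): an encoder discretizes sensory input into extra tokens, an LLM backbone reasons over the enlarged vocabulary, and a diffusion decoder renders predicted latents back into video for inspection.

```python
# Hypothetical sketch of the module composition described above.
class WorldModelSketch:
    def __init__(self, vlm_encoder, llm_backbone, diffusion_decoder):
        self.encode = vlm_encoder        # pixels/text -> discrete latent tokens
        self.reason = llm_backbone       # LLM whose vocabulary includes "visual words"
        self.render = diffusion_decoder  # latent tokens -> video frames

    def simulate(self, observation, instruction, steps):
        # 1) Ingest real-world signals as tokens the backbone can reason over.
        tokens = self.encode(observation, instruction)
        # 2) Long-horizon symbolic rollout: cheap to keep consistent.
        for _ in range(steps):
            tokens = self.reason(tokens)
        # 3) Decode to video so the outcome can be inspected by humans.
        return tokens, self.render(tokens)
```

In this reading, each stage is pretrained on its own data (text-only, vision-only, or paired) before a joint tuning pass, in line with the aircraft-assembly analogy.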

Richie Cotton: Okay. So it sounds like the end state is that you've got lots of different models working together in harmony, and maybe some sort of top-level control saying which models need to work when. And then the output is some sort of diffusion model. So is video generation always the output?

Eric Xing: The output importantly includes video generation, but it is more than that. Video generation is mostly, how should I say, a pragmatic need, because I need to convince people that I'm doing something reasonable.

Therefore, the best way to convince people is to let 'em see it, right? But our internal evaluation is actually not only based on the video data. Video quality is an important but inadequate metric for measuring the quality of the model, because it only measures pixel continuity, visual consistency, action generation, and so forth.

Our goal for the world model is to solve long-term reasoning problems, and therefore we measure the ability to solve actual inference problems. For example, can you put objects into the right bin? Do you also have long-term consistency on the order of minutes and hours after you take the first signal? These types of performance measurements are not done in the video space, because you cannot generate hours of video.

In fact, it's useless to generate that many. We actually directly take the latent representations and decode them in the proper form to measure the quality.

Richie Cotton: Okay, I see. So you mentioned the idea of putting objects in bins. It's a common factory task: you've got some robot arm that has to put an object somewhere. Is that the most common use case? Do you think it's gonna be used in factories, or do you have anyone using it for real tasks at the moment?

Eric Xing: We don't have anyone using it yet, but that's actually the set of benchmarks we're developing right now. Bin object sorting is one simple task that people can understand, because you can basically visually see: I have this path to put the object in this way, and conditioned on the intermediate results, my next move will be such-and-such, and so on, or along the other branches. There is branching simulation that can happen in parallel so that your thought experiment can happen. But again, video in this case is an auxiliary window, a tool to peep into the activity rather than the outcome of the simulation.

You talk about utility; we imagine two possible utilities already in incorporation. One is indeed a robotic platform. We want to equip a robotic platform with this kind of world model so that it can autonomously accomplish a certain mission without me programming or training the robot to do exactly that. It should be able to do its own thought experiments using the world model and figure out what to do. The other is actually games. We have a number of potential parties reaching out, and I can name two scenarios, right? One is sports games: how can you capture all the moments and maneuvers and tactics when you play soccer or other fancy games?

Using the world model, you can actually have infinite possibilities simulated from all these scenarios, so gamers would be given a lot more possibilities for playing, different experiences. The other use case we have already been approached with is autonomous driving. Again, when you want to train very safe autonomous driving agents, you don't necessarily have all the videos you need, especially the videos with accidents, because in the real world such things happen very rarely. You just don't have enough videos of those. But with your world model, you can simulate all imaginable accidents the way you know they could happen, and then turn them into videos. These videos are then given to the driving agents to learn policies offline. By the way, this is another difference in how our model uses the world model.

Many of the current systems are using an algorithm called MPC, model predictive control I think it's called, I forget the exact name. Basically, it's an optimization problem: you simulate all the scenarios and conditions, then you solve for a sequence of actions that optimizes a certain score.

In our case, we also allow reinforcement learning to be used to learn policies offline. When you are not doing any task, you can play with the world model, like a chess player playing against himself based on his world model of chess, in this case the world model of the world, to learn policies for different conditions.

Then, when you are in real action, you first scan and look for those policies. Maybe they already solve the problem you face, and then you use them directly, very quickly. If not, then you run the real-time, online inference to solve the problem. So the world model really gives you multiple different operational modes for system one and system two inference.
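A compact sketch of those two operating modes might look like this, with every function name hypothetical: a policy cache learned offline (for instance by RL self-play inside the world model) serves as the fast muscle-memory path, and MPC-style search over world-model rollouts is the slow fallback.

```python
def plan_with_world_model(world_model, state, candidate_plans, score):
    """Slow path (MPC-style): simulate each action sequence, keep the best."""
    return max(candidate_plans,
               key=lambda plan: score(world_model.rollout(state, plan)))

def act(state, policy_cache, world_model, candidate_plans, score):
    """Fast path first, slow path as fallback (state assumed hashable)."""
    if state in policy_cache:                 # system-one "muscle memory"
        return policy_cache[state]
    best_plan = plan_with_world_model(world_model, state, candidate_plans, score)
    policy_cache[state] = best_plan[0]        # amortize the expensive search
    return best_plan[0]
```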

Richie Cotton: Okay, that's very interesting. You talked about factory examples, computer game examples, and self-driving car examples. On that last one particularly, I like the idea that you can simulate crashes without having to go and crash a real car somewhere. It seems like a much cheaper, or at least much safer, approach.

Eric Xing: If you go to PAN, we have a website called panworld.ai, where you will actually see some demos of how those accidents are simulated.

Very soon we will also be opening up a playable site, where people can prompt the system with their own scenarios and steer the model in real time to generate different scenarios.

Richie Cotton: Okay. And for some of these other use cases, we have factories that kind of work already, and computer games already exist. What do you expect the impact to be? How do you measure the success of PAN or these world models? Is it about making things cheaper, or about exploring things you couldn't do before? What's the impact you're striving for?

Eric Xing: Multiple impacts, right?

The game market, first of all, is already a very lucrative and promising consumer space for generating revenue, because you can now generate games a lot faster and cheaper, and make them interactive and personalizable for the user. So this is a space that we definitely will look into.

But another low-hanging fruit for the value of the world model is data generation. Data right now is a commodity; you can actually sell data. But with the world model, you don't have to send a crew to collect the data from the real world. You still do that, but not necessarily for everything.

Your world model can actually generate data on demand, even for more complex scenarios. Again, how the world model operates is fascinating, right? It's not like you need to know all the physics, all the theory, before you create what you actually see as an outcome of that theory. Look at, for example, how people generate a Go scenario. The rules of Go are very simple. You have a few rules for playing Go or playing chess, but their combination in simulation gives you infinite complexity. So you can get huge complexity out of a very simple set of principles.

That's basically what a world model is. The world model itself doesn't have to be infinitely complex, but the way world models are used can generate infinitely complex scenarios. And that's actually where I see a lot of potential, because without, for example, going to Mars, going to the moon, or going underwater to see everything you could possibly see, you can already extrapolate from the training data to generate those experiences.

Richie Cotton: That's very cool. So do you think there can be some scientific breakthroughs aided by world models then? 

Eric Xing: Actually, that's a very good question. I talked about the four levels of intelligence; now, the world model is a basis for you to create a simulation of real-world possibilities.

One of these possibilities, which we are also exploring in another project, is the possibility of life. Imagine how you would design a medicine, and why people currently require drug trials of multiple phases: because you don't have a simulator of life, you have to basically test your drug in a wet lab, maybe on animal models.

But imagine that you have a virtual cell or a virtual organism, which is actually a world model, which will simulate back to you the potential outcomes and responses to a drug or a genetic perturbation. Then you could actually shortcut a lot of the wet-lab and physical trials and experiments for drug validation.

So this is where I see an immediate possibility of a breakthrough in science, because you dramatically reduce the risk and the cost of scientific experimentation.

Richie Cotton: Okay, that seems very cool. Certainly there are so many different simulation-type problems where the computational space is just incredibly huge, and with these world models, if you can reduce the time or the cost to run these things, that's definitely gonna help with breakthroughs.

Now, you mentioned AlphaGo a few minutes ago. I remember AlphaGo was the first champion Go-playing model, but then there's also AlphaZero, which was similar, except it wasn't trained on human games; it learned to play by itself. And I'm curious: with world models, do you have to feed the rules of physics into them, or do they learn the rules of physics themselves? So is it like AlphaGo or AlphaZero?

Eric Xing: I think the world model, as of now, is at the AlphaGo phase, which purely draws connectivity and learns embeddings and manifolds from data, so that those manifolds can be used to extrapolate from visible data and simulate possibilities beyond the seen data.

But I think that's the first step toward AlphaZero, because once you have this data, this capability of simulation, you can always go backward from what was simulated, start to trace how those outcomes were produced, and start to examine the thinking path in those models, some of which may actually reflect a physical rule.

So I think there is potential to reverse-engineer some of the simulation results within a world model, and then hopefully help you discover the real physical laws from them. Again, I'm not sure, I'm speculating on that, but I think this is a natural move people will make next.

Richie Cotton: Okay, alright. An evolving field, then, and we're gradually moving towards learning the universe from first principles; that's the long-term thing. Alright, so I'd like to talk about your LLM as well. You've got an LLM called, let me get this right, K2 Think V2,

Eric Xing: K2, yes.

Richie Cotton: not to be confused with Kimi K2 Thinking. The names are very close, so don't get those two confused. Anyway, tell me about K2, what the goals were around it, and how it's different from other large language models.

Eric Xing: First of all, we started our K2 series even before Kimi started their K series, and they are already going beyond K2 and calling it K3 or whatever. The K stands for something different, right?

In our case, K2 refers to that mountain, that very difficult mountain in the Karakoram. Originally, the goal was to build a very open implementation of large language models in the academic way, so that the community could actually reproduce our results and study them, and to push the field of large language models to be more transparent and also more rigorous.

And maybe safer, because once you expose all these issues, there are more opportunities for people to look into them, close loopholes, and discover any problems. But as we did more and more of this, we also realized that on the one hand you need to be scientifically rigorous, transparent, and willing to share, but on the other hand, you need to first and foremost be performant. If you have a weak model that is not as good as the industrial front-runners, nobody is going to use it, even though it is academically an interesting project to play with.

That basically leads to continuous versioning of the K2 model. We took two parallel approaches. One is, of course, to keep building the base model, which needs to be done from scratch, and that is a long-term, expensive, and not necessarily high-reward project, because beating those frontier models, which are orders of magnitude bigger, is very difficult. We don't have that much resource; all we can do is a mid-size model that is adequate for that level of task. On the other hand, there is special interest in the UAE community, and also in my university, in advanced reasoning, such as mathematics, IQ games, and so forth, where you need to use a strong base model but also do a lot of post-training: long chain-of-thought, reinforcement learning, inference-time optimization, and many other techniques specially designed on strong data.

All of that enhances the post-training capabilities. So we want to establish our track record on both tracks, and also to test some of our new ideas in both pre-training and post-training. These two things go in parallel. Our first K2 Think release last year was actually based on a commodity open-weight model, the Qwen model. We used it and did post-training on it, which bumped the score up by quite a few points and got us into the first tier, comparable to the DeepSeek models and a few other frontier models. Then, later last year, when we finished our own K2 base model training, we replaced the Qwen base with our own base model.

And now K2 Think V2 is a 100% sovereign, in-house-built model from head to tail, with our own data and our own capability. I think this is very important, because it is the first time you'll see a frontier model that is open-sourced not only in model weights but also in the data, the training recipe, and so forth, so that people can really take it away and reproduce it.

I think this is important, because otherwise we don't even know; we would actually be accused by someone, you know, of data contamination, which is a very, how should I say, serious issue, because with the base model that we took, we don't actually know what data was used in its training.

And it is not impossible that there is contamination. Now that we have replaced all of that with our own, self-made model, trained on the data we collected, we actually have a stronger guarantee on the performance, and also on the right implementation, so that the scores are more reliable and more reproducible.

I have to say that many of the strong scores you see on AA can be very difficult to reproduce, because the test data is all over the place. You could actually achieve a high score just by contaminating your training data with that testing data, and people cannot tell, because you don't publish your data.

In our case, we publish the data out there. People can help us find out whether there is contamination or other issues, which could really, tangibly lead to meaningful improvements in the training strategy, because it is a transparent game that everybody can play by themselves.
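As an illustration of the kind of audit that open data makes possible, here is a minimal n-gram overlap check of the sort commonly used for decontamination; this is a generic sketch, not MBZUAI's actual pipeline, and the 8-gram window is an arbitrary choice.

```python
def ngrams(text, n=8):
    """All word n-grams in a document, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(train_docs, test_docs, n=8):
    """Return indices of training documents that share a long n-gram with
    any test item. With published training data, anyone can run this
    audit, not just the model's own developers."""
    test_grams = set()
    for doc in test_docs:
        test_grams |= ngrams(doc, n)
    return [i for i, doc in enumerate(train_docs) if ngrams(doc, n) & test_grams]
```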

Richie Cotton: That last issue is fascinating, because it's been a well-established problem in machine learning: you have data leakage, information from your test set gets into your training set, and you think your model's amazing. You put it in production, and then it performs terribly. I hadn't really thought about what happens in the large language model case, once you're training these huge models. How do you even prevent that from happening?

Eric Xing: Yeah. I don't think it is really an ill-intended kind of mistake. People sometimes just unintentionally get unfortunate, bad data, or maybe they just didn't pay enough attention to screening their data for potential overlap with the testing dataset. It requires additional effort, and not many people are doing that. I'm not saying people are playing games to get a better score; it's probably just an inevitable phenomenon that is hard to avoid.

But in our case, because we are opening the data to the public, we have the opportunity to invite the public to be the gatekeeper. If they know what we don't know, say that a particular dataset is actually present in the testing dataset, we can be informed, and then we can take that data out. I think this is a better way of doing science. Again, academic research really champions not just performance but also rigor and reproducibility, which is very different from corporate practice, where performance and confidentiality are the key, right?

I think both are interesting, and both are necessary to make sure that the results and the progress we make are authentic and also sustainable.

Richie Cotton: Absolutely. This really strong push for openness is pretty unique. Do you wanna talk me through who might want a fully open model here? Who's it important for?

Eric Xing: I already hear complaints in the Valley and in the US ecosystem about the concerns over those open-weight models from China. There are geopolitical concerns for sure, but also safety concerns, because you don't really know what training data has been used in producing those.

We are hoping the K2 series provides a very solid alternative to those open-weight models, so that the ecosystem, the startups and so forth who are building their business and operations on those free, open models, can now have additional options that could be safer and also more reproducible.

So I think there is this kind of angle and opportunity for us to really stake out maybe a third corner of the landscape, one promoting 100% openness. And on the other hand, academia to us is a very important user. They don't pay us, and we are not expecting any pay.

But having academic adoption of this model obviously gives us the right feedback, which allows us to do our science even better and to move the field further. I have to say that there is a problem right now with academic AI globally, in terms of competitiveness and relevance.

Somehow, because of resource reasons and many other economic and political reasons, academic research in frontier AI is marginalized. We want to use our effort and results to maybe promote or stimulate a comeback, so that we can be at the forefront of AI research in terms of innovation and performance, not just constantly chasing after the big pack to do very small incremental work, right? But on the other hand, is our way of doing this just spreading money for no return? Of course not. We build those models, and our students and faculty are given the opportunity to take them further into their own startups and spinoffs.

In fact, at MBZUAI we have an ecosystem incubating startups, which can be built directly on the apps and models that we developed. Being the builder of those models, you have the firsthand knowledge to take the next step. For example, a company may be willing to share its proprietary data for you to develop agents and other utilities using the language model as a base; then our engineers and developers are in a good position to actually make the best use of those open models for commercial applications. Those applications, of course, may or may not be open-sourced, for good reasons, but the financial reward from those activities hopefully will be substantial.

Richie Cotton: Yeah, it's very interesting. In a lot of other technical spaces, open source has been very dominant. When I think about the operating systems space or the database space, all the most popular products are completely open source. But in the AI space, I guess the cost of building these things has inverted the economics somehow.

Eric Xing: Yeah. I think this is maybe because of a cost issue and also a culture issue, because building large language models, or world models and the future generation of foundation models, is not the classical, traditional practice of lone wolves and solo heroes. It's a big-science problem.

You need to have a large team and large resources, so there has to be the right culture to allow such activities to take place. But I still think open source is important. I talked about science, I talked about safety and other things, but there is also an open question about who sets the standard for the next generation of AI infrastructure.

And I don't think the standard should be set by an adversary that is not willing to open up, or by a corporation that has its own financial interests built in. Having a truly open environment allows the community and its dynamics to be part of this standard-making practice, which will obviously lead to a better and more robust standard.

Richie Cotton: Absolutely. And I suppose one of the most common use cases for doing things with your own AI is fine-tuning and adding in your own datasets, adapting existing AI models. It feels like, with an open-source model, that's gonna be a lot easier to tune to your own needs, whether they're commercial or academic.

Eric Xing: Yeah, absolutely. That's also why we, in a way, intentionally started from mid- to small-size models, which are a lot easier for people to adapt and take further into the next phase of post-training and fine-tuning. That said, we are actually in the process of rolling out more powerful, bigger models that are on par with some of the frontier models we see out there, just because the science of making those models is even less understood.

Nobody actually sees how those things are made. In our case, we see the opportunity and the need to, first of all, learn this experience through our own exercise and production, but also to share this knowledge, to invite more activity and study into this topic.

Richie Cotton: Okay. And how competitive are you with these closed-source frontier models? How close can you get to those top performers?

Eric Xing: I'm very optimistic that whichever model we release will be SOTA, or close to SOTA, in its size band. For example, the K2 V2 model that we released in December is one of the best models at its parameter count, and sometimes maybe even better than slightly larger models. The next model we're gonna release is likely to be a bigger one, and I think it will be among the best in that size band. Again, the recipe for training our models is, in a sense, known to many people already. What is not known are the secret tricks and the engineering practices that you need to put in to make it happen. I think this is something that requires more rigorous and holistic study, and that's actually what our releases hopefully can expose and draw the community's attention to.

Richie Cotton: Absolutely. So I was going to ask you what your secret sauce is for getting that high performance, but it's not very secret, 'cause it's all open source. What tricks have you used to get that high performance out of your models?

Eric Xing: Oh, we actually published a very detailed technical report for each of our releases already.

There's no single silver bullet you can use to make the performance better. It's the way you pick the hyperparameters, the way you deal with bad data points and remove them, how you program the curriculum in the right order to promote the right kind of convergence curve, and how much you give, in your post-training and fine-tuning, in the reinforcement setting for specific task-oriented objectives.

There are a lot of little tricks that require both experience and a good feeling for the dynamics of how the curves evolve. That's exactly why my team alone isn't enough to extract the rigorous science out of it. All we can do is publish all these logs and checkpoints and intermediate results in public, so that the whole community can jump on them and study them.

Richie Cotton: That's fascinating. It sounds like there are still maybe more optimizations you can make to your existing model; it just needs people to dive in and make changes. So for anyone who's interested in working on this, what can they do?

Eric Xing: LLM production right now is still like old-school workshop violin making: different masters have their own tricks. It's not an open, transparent science yet. I think we still have a lot of room to understand how the scaling laws behave, when they stop working, when to put in more data, and when to change the architecture. This kind of study right now isn't active enough, or rewarding enough, to produce guidelines and principled recipes for training.

So I think in the next few years, as the spearhead moves toward world models and fancier agent models, the understanding of LLMs won't stop. It will actually produce more fundamental, principled knowledge.

Richie Cotton: Okay. It sounds like there's lots of work to be done.

If people are interested in contributing to this, is there a way they can get involved?

Eric Xing: I think by making the model open, the environment has already been made easier. If they have a reasonable amount of resources, they can definitely just download those models and checkpoints and study the data, or maybe build their own applications on top of them.

As for more proactive stimulation or promotion of involvement, we are still looking into the opportunities. I know that multiple universities and institutions, including us, are in discussions about maybe creating a consortium, and maybe seeking sponsorship and resources to create some kind of computing resource that allows researchers and students to bring their own contributions and proposals to this type of study.

But again, I don't think there is a systematic framework for doing that at this point. I think the awareness needs to come first, so that people know there are these kinds of open models that can be studied, and that they are also good and easy to play with.

I think this year we'll see multiple efforts toward promoting this kind of awareness and also more general practice. And we'll see whether there are additional donors or government efforts put into it.

Richie Cotton: Okay, wonderful. So it sounds like at the moment it's just a case of trying it out, seeing what you find, seeing if you get good answers, and maybe giving some feedback to you or your team. Alright, now before we wrap up, I'd love to talk a bit about your third strand of research, which is around creating digital organisms. Can you tell me what that involves?

Eric Xing: I briefly mentioned that just now.

The digital organism can be viewed as a special type of world model, built on biological data, with the goal of simulating biological possibilities rather than physical-world possibilities. Of course, the biological world is part of the physical world, but it is very specialized. Our approach is to come up with a new type of architecture that is more appropriate for dealing with biological data, with all its special properties.

Right now, if you look at the foundation models for biology, they are still borrowing ideas from large language models and treating biological information as a special form of natural language, and then hoping for the best: say, having a DNA model, having a structure model, which actually works to some degree.

If your goal is simple, say predicting the structure of a protein from a sequence, that is a reasonable task built on the linear sequence information in the data, and then your techniques drawn from large language models and transformers may be adequate. But if your goal is to simulate, for example, cellular behavior or network behavior as a result of gene knockouts, mutations, and small- or big-molecule medicines, these are by definition not sequential data. They are higher-dimensional, spatial and temporal data that go beyond what the current transformer-type architecture is able to offer, right? So our AI-driven digital organism is exploring new architectures which allow us to integrate these different data modalities in a holistic way, so that representations of different entities, be they genes, proteins, cells, or cellular environments, can actually converge on a shared information space where they can talk to each other and create responses.

In essence, we are trying to do something very different from what you see out there, be it the perturbation-prediction models of the virtual cell, or drug development models and so forth, which are built on the old-school machine learning mentality that what I need to do is function approximation.

I build a big model that allows you to predict from A to B. That model used to be a linear regression and now becomes a fancy transformer, but the spirit is still to make a prediction, right? In our case, we're not building a predictor. A world model is not a predictor; a world model is a simulator. Therefore, what it does is take the input as a prompt and then create a distribution of all the possibilities coming out of that prompt. So the idea is by definition stochastic and also open, right? It is not an A-to-B mapping but an open-space simulation. Why is that important? I give people this example: when you design complex solutions or outcomes, say you design an aircraft or a rocket, your mental model is not that of an A/B test, where you have two hypotheses, one working and one not. You don't A/B test your way to aircraft design, right? You actually simulate; because it's so complex, you simulate all the possibilities until satisfaction, and then you build the real thing and launch it on the pad.

In medicine, strangely enough, people are still using this A/B-test, cohort-study approach for medical practice. But a drug, or a tissue, or an organ, or an individual is something that is not resolvable by two hypotheses, A and B. That's why our approach is to build this digital organism using the world-model principle, so that it takes data from all the different modalities, sequence, structure, network, and image, to give you a holistic representation of how a cell or a tissue should truly behave, and then, based on that, simulates outcomes with whatever prompt people come up with.
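The predictor-versus-simulator distinction fits in one short sketch; all names here are hypothetical, not the GenBio AI interface. The old paradigm maps one input to one output, while the world-model paradigm treats a perturbation as a prompt and samples a whole distribution of stochastic outcomes.

```python
import random

def predict_response(model, cell_state, perturbation):
    """Old paradigm (sketch): one deterministic A-to-B mapping."""
    return model(cell_state, perturbation)

def simulate_responses(virtual_cell, cell_state, perturbation, n_samples=1000):
    """World-model paradigm (sketch): the perturbation (gene knockout,
    compound, heat) is a prompt; the virtual cell returns a distribution
    of stochastic trajectories rather than a single answer."""
    return [virtual_cell.rollout(cell_state, perturbation,
                                 seed=random.randrange(2**32))
            for _ in range(n_samples)]
```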

Richie Cotton: Okay, that's really interesting. I suppose when you think about drug development, it's like either the drug works or it doesn't, and it gets approved or not. So it's still an A/B-testing mentality. But actually, if you're doing drug development, you probably wanna start thinking about simulating possibilities.

Eric Xing: Yeah. I used to tell people a funny story about what we see as the future of drug development versus what is happening now, right? People are now using AlphaFold and others to develop protein drugs. But then what happens to the design? They enter it back into the old pipeline of trials and so forth, right?

This whole multi-stage trial process is actually a very ancient practice, developed decades ago, before we even knew DNA, before we even knew AI. So in a sense, imagine a horse wagon, which is an old, antique transportation tool, and now you replace the horse with a Boston Dynamics robot.

You'll have a fancy thing in the middle, which is the AlphaFold, but the whole thing is still a horse wagon, right? What I think is the right thing to do is to ask whether this whole framework, this whole process, is even correct. Because if you have the ability to simulate a digital organism, or a cell, then your trials shouldn't even happen in a wet lab or in a test tube. Your trials should primarily happen in a computer, with a simulator. That's how people develop, for example, nuclear power plants, and silicon, and chips, and so forth. But in biology, that level of simulation is not happening yet.

That's where we see the virtual cell, digital organism project that we are pursuing at GenBio AI as truly disruptive. It's very different from any of the existing drug design companies, which put their Boston Dynamics robot in the middle of the horse wagon to replace one slice of the process.

We're now trying to reimagine the whole process.

Richie Cotton: Wow, that's a hugely ambitious goal. I guess the question is, how close are we to having all this work? What sorts of simulations work today, and what's still to be done?

Eric Xing: It seems complicated, but you can start from something modest, for example simulating a single cell, right?

That is something you have to do anyway, because with all this cell knowledge, it is the best way to put it together and basically demonstrate what is possible. In fact, we are in the middle of developing a first version of the virtual cell, which hopefully will happen in the next couple of months. I call it the Wright brothers' aircraft.

The Wright brothers' aircraft was really primitive. It used very cheap materials. It only took off and flew for, I don't know, half a minute and a hundred meters. But the moment people saw it, they knew it was not a car.

It was not a horse wagon; it was something flying in the air. Then, only a few years later, people were already using aircraft to do battle in World War I. So once you see the prototype, the next iterations will be extremely fast. This year we're trying to roll out the first prototype, which is a virtual cell that is able to give you a limited simulation of the outcome based on, let's say, prompts that you come up with: knock out a gene, put in a little compound, maybe heat things up, and just ask what happens to the cell. And we'll also keep the cell very simple: how about just a liver cancer cell, or a brain cell? A very limited definition, but that's already different from predicting what the N-minus-one genes look like when you knock out one gene; that's the old paradigm people have been working on. We work on a new paradigm, which is primitive but different. And then, we believe, once you have one cell, you can put two cells together, and multiple cells together, to make a tissue.

Once you have the tissue, you can go all the way up to an organ, and so forth. So there is a clear roadmap, and it needs to start from a single virtual cell.

Richie Cotton: That's amazing. I love the idea of simulating things at a very small scale. You're simulating a single cell, which I guess even in itself is an incredibly complicated thing to do. Once you do that, you gradually scale things up, and one day we get our digital organism, and that's gonna help revolutionize healthcare. I love it. Okay, so just to finish up, I always want to meet more people to learn from. Talk me through: whose work are you most excited about right now?

Eric Xing: It's a very good question, and I can be very candid. Among all the people I talk to, the thing that excited me the most is a conversation I had, I would say two or three months ago, with Demis, the head of DeepMind, because I was very happy but surprised to find that we have almost perfect alignment on what a world model is and also on what a virtual cell is, which is very different from what we hear from the public in terms of the fanfare and the hype.

He had a very grounded but sophisticated technical view about how they should be built and also how to test them. I actually feel very excited about his vision, and also quite intimidated by it, because we need to move fast too.

Otherwise we are not going to be able to make our impact if we cannot deliver our results fast enough.

Richie Cotton: Absolutely. It's interesting that you're aligned with Demis Hassabis from DeepMind about the state of biology. Of course, he won the Nobel Prize for his work on AlphaFold and protein folding. So his vision,

Eric Xing: and

Richie Cotton: his team,

Eric Xing: is ahead of its time, compared to many of the others. They don't say much, and I believe, again, in my personal opinion, that what they show, what people see, is already quite a few months, if not years, behind what they actually have in their chest.

So I wouldn't be surprised if they have something fancier and disruptive coming out in the next few months.

Richie Cotton: Okay. Alright. If you're trying to compete with that, then it's a big challenge. 

Eric Xing: I wouldn't put myself in that position. I think of it almost as a confirmation of the direction we are pushing in.

It's good to have multiple teams on very similar topics, so we can compare notes and also cross-check each other. That's very exciting.

Richie Cotton: Absolutely. It's certainly such an exciting time at the moment, with all these new developments. It's been a real pleasure to talk to you, Eric.

Eric Xing: Thank you. It's a pleasure.
