Can AI Agents Outperform a Data Scientist? with James Zou, Professor at Stanford University

Richie and James explore how AI scientist agents are already outperforming human experts, the Virtual Lab framework for teams of agents, a new training paradigm called "learning to discover", scaling agentic systems, and much more.

June 29, 2026

Guest

James Zou

James Zou is an Associate Professor of Biomedical Data Science, and by courtesy of Computer Science and Electrical Engineering, at Stanford University. He leads the Stanford AI for Science Lab and is affiliated with Together AI. His research focuses on building AI agents for scientific discovery and data science, making AI more reliable and statistically rigorous. He has received a Sloan Fellowship, NSF CAREER Award, two Chan-Zuckerberg Investigator Awards, and faculty awards from Google, Amazon, and Adobe.

Host

Richie Cotton

Key Quotes

In customer support, you have to be very high fidelity. In self-driving cars, or other kinds of automation, your agents have to work 99.9% of the time. But in science, it's actually the opposite. It's okay if the ideas some scientists have do not work most of the time. For human researchers, most of our ideas do not work. That's part of scientific progress. It's great if just one of 10 of our ideas actually works — that's already amazing, can lead to new breakthroughs — that's already a really good success rate. So we're in this opposite regime of science where it's actually okay to make mistakes for AI and for humans, but we really want to encourage innovation and creativity, which is the opposite of more standard automation or customer service applications of AI agents.

For the past 500 years, the way that humans represent scientific knowledge is in the form of passive papers — passive artifacts of knowledge. If somebody spent years doing research, and you're a reader, just by reading the static words on a page, it's often not clear what the true insights behind that research are. Now I think we have this opportunity to basically agentify all the scientific knowledge. We're building this platform called Paper to Agent, which is to basically convert all these passive artifacts of knowledge — from PDFs, from papers — into dynamic, interactive agent authors. The paper agent becomes the virtual corresponding author of that paper. It knows how to reproduce the results from that paper. It also knows how to apply the data or the methods from that paper to new problems.

Key Takeaways

AI agents are already outperforming human experts in specific computational science tasks. In protein design, small teams of AI scientist agents produced 92 novel candidates in a matter of days, two of which bound to COVID variants better than anything previously designed by human researchers.

Training AI to be creative, not just accurate, requires changing the optimization objective. Rewarding the single breakthrough result out of 100 tries — rather than average performance — produces more innovative agents, at the cost of more failures. In science, that trade-off is worth it.

Effective human oversight of agentic systems looks nothing like micromanagement. In the Virtual Lab, humans participate in only 1% of agent discussions — their role is setting high-level goals, defining constraints, and reviewing outcomes, not directing every step.

Links From The Show

Virtual Lab (Nature paper)

AlphaFold 2 / Nobel Prize 2024

Transcript

Richie Cotton: Hi, James. Welcome to the show.

James Zou: Hi, Richie. Yeah, thank you for having me. Really excited to participate.

Richie Cotton: Yeah, great to have you here. Now, one of the big trends over the last few years has been having AI agents for replacing customer service people and business development people, and a few other employees.

So how close are we to having AI scientists?

James Zou: A lot of the existing AI agents are meant more for automating relatively routine and simple workflows. And I think something interesting about science is that good science is never routine, right? Because the nature is that you want to make new discoveries, so you want to push the frontiers of knowledge, which is what makes science really exciting.

And a big part of my work is on building AI scientist agents that can help to really push those frontiers and make new discoveries. And that actually, I think, also requires rethinking how we build and train the AI agents, right? Because a lot of the existing AI agents are more meant to imitate humans, right?

They're taught by essentially trying to follow existing workflows. Even the language model is trained by imitating human writings, right? But to do good science and make new discoveries, you don't want to just imitate, you also want to innovate, right? And try to explore novel ideas. So I think there's still a lot of work that we need to do to really teach models how to be more creative, the agents to really be making novel discoveries, but I think we're making good progress.

Richie Cotton: Okay, yeah, you're right that it's just a very different type of occupation to try and do with AI. So yeah, science necessarily has to be pushing the frontiers. So I'm curious as to where we're up to at the moment. What can AI do towards helping scientists at the moment?

James Zou: I think AI is making a huge amount of progress in science, right? And I think that's actually gonna be one of the most transformative areas of AI: really, ultimately, to accelerate the speed with which we can make scientific discoveries. I think that's actually gonna be one of the most impactful things that AI can do for humanity.

And the current kinds of scientific problems that AI is particularly good at are more computational problems, right? So this could be, for example, mathematical problems or problems that involve AI research itself, or problems, let's say, in computational drug discovery or computational protein design, computational biology.

So essentially things that can leverage the very strong coding abilities of AI models and also the ability of these models to self-refine and self-evolve.

Richie Cotton: Okay, yeah. Was it 2024? There was the Nobel Prize awarded for AlphaFold 2, because it's around protein structures.

So there obviously are some advances being made with AI. But I guess in the more general sense, we're not at the point where AI is as good as a human scientist. Is that about right?

James Zou: Yeah. It's fairly uneven, I would say. There are certain tasks where AI is already very good.

So I'll give you one example, which is we created what we call the Virtual Lab, which is a team of AI scientist agents that sort of mirrors a standard human research lab. So there's the AI professor agent, a bunch of different AI student agents, and one of the first tasks that we asked the agents to do was basically to help us design new proteins that can bind to the recent SARS-CoV-2 variants, right?

Which can then serve as potential therapeutics or vaccine candidates. And within a few days, through a series of group meetings between these agents, they actually came up with new designs of new proteins that people haven't seen before. And we made these experimentally. Like, we actually synthesized these proteins and tested them in the wet lab, and they actually turned out to be better than even previously human-expert-designed proteins in terms of binding to these different COVID variants.

So that's one example of a computational-flavored task where AI can already operate at the level of human experts and even better than humans. But there are also a lot of other kinds of problems in science that go beyond doing computational modeling, right? That involve actually synthesizing knowledge from different domains and also coming up with new experimental evidence to support the discoveries.

And that's where human experts are still necessary and needed.

Richie Cotton: Okay. That is a very cool use case. You mentioned the COVID idea, and obviously very impactful. It seems a long time ago now, but yeah, of course, the COVID pandemic was a world-changing event, and having AI contribute and help solve the problem is pretty amazing.

So you're saying computational stuff, I guess that's the easy part. We've had powerful computers for a while, and having slightly smarter approaches, I suppose that's a natural area for AI to be good at. But then coming up with novel hypotheses about what's going on, that's more a matter of human creativity, would you say?

James Zou: I think that's a good way of putting it, yeah. And we and other people have seen that, for example, if you ask AI to come up with hypotheses or ideas for a new scientific problem, AI often ends up having this mode collapse behavior, by which we mean the model will come up with maybe one or two good ideas.

But if you ask it to come up with new and different ideas, it ends up returning to the first one or two ideas or slight variants of them. So it's not able to come up with a lot of very diverse ideas, which is often needed to make new scientific progress. Now, I think there are techniques that we've been developing that can try to increase the creativity and diversity of these AI agents in generating hypotheses, right?

So for example, one thing that we found to be very useful is to explicitly teach these models to look for analogies, right? Because often a lot of the best ideas in data science and science in general come from taking ideas from adjacent domains, right? Maybe from physics and from telecommunications, and seeing, oh, maybe a problem in telecommunications is actually very similar to a problem in cell communications in biology, right?

Then taking these analogies and borrowing ideas to generate new hypotheses. And that's something that we found to actually be quite useful as a way to increase the creativity of AI agents: teaching them to look for these analogies from very diverse domains.

Richie Cotton: I suppose a very interesting difference between AI and humans is that these large language models have basically read every book.

So being able to understand similarities between domains is something that, I guess, no human can match, since no one can possibly read across all these different domains. Okay. If you're a scientist and you wanna make more use of AI, then how do you change your workflows to accommodate this?

James Zou: I mentioned this idea of the Virtual Lab, right? Teams of AI agents.

And our vision here is that behind every human scientist, there should be a virtual lab of AI agents that can help them to do a lot of things, from summarizing literature, to generating hypotheses, to designing experiments and analyzing data, even providing critiques and feedback on the human's ideas.

So I think really across the entire workflow of scientific research, all the way from coming up with research questions, to designing experiments, to analyzing data, even writing reports and generating reproducible code, I think all of that can be greatly accelerated and assisted by AI agents.

Richie Cotton: That's pretty cool. I love this idea of a virtual lab. Suppose you want one of these. How do you even get started setting one up?

James Zou: Good question. So we actually published a paper in Nature a few months ago that introduced the platform of the Virtual Lab.

It's open source and people can just look for Virtual Lab or under my name, look for... And then they will have actually the open source platforms for building this Virtual Lab of AI agents and for using these agents across very diverse scientific discovery tasks.

Richie Cotton: Okay, so this is just like a lab in a box, just go install the software and then you've got one there?

Or do you need to customize it? What do you need to do to make it work?

James Zou: So it's all open source. And I think if you, for example, just point Claude Code at it, or Codex, or our favorite coding agents, then they should be able to just implement it. It should be quite straightforward.

Richie Cotton: Okay. And you mentioned that you've got a whole team of different agents in there. So I'm thinking about a real-life laboratory where you've got maybe a biologist or a chemist who has designed the experiments. You've got lab technicians to do the hands-on work, and then maybe you've got some data analysts in there to analyze the results.

Do you have different types of agents within this Virtual Lab then?

James Zou: Yes. Yeah. So it's actually very much a team of different specialist agents. So then there's the professor agent that sort of manages the lab. And then working with the professor agent, we have agents with quite diverse expertise.

So there's a data science agent, like a machine learning agent. We could have a biology agent, or in the case of COVID, we had a protein design agent. And the nice thing is that for different projects, we can give the project description to the professor, the manager agent, and then the manager agent will actually try to think about, for a given project, what are the different experts they would want to have on the team, right?

So maybe for one project it says, "Oh, it's useful to have, say, a clinician on the team," right? Then we'll actually go out and train and create a sub-agent with expertise in pathology or cardiology, right? Maybe for a different project, the professor agent will say it's useful to have a chemist, right?

So then it'll actually create a chemist agent. So for different projects, there's actually a lot of flexibility. The manager agent can actually create different teams of experts that are best customized and well suited for that project. And once they are created automatically, then the agents can start to have these group meetings like we do, right?

They will meet together and discuss, come up with research plans. They can also have one-on-one meetings, right? Where one of the agents will meet with the manager agent to review some intermediate sub-task, so they can start to make progress similar to how we would do.
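The team structure described here (a manager agent that picks specialists for a project, then group and one-on-one meetings) can be sketched roughly as follows. This is only an illustrative toy, not the Virtual Lab codebase: the class names, the `EXPERTISE_BY_KEYWORD` lookup, and the `speak` placeholder are all assumptions, and a real system would call a language model wherever this sketch formats a string.

```python
from dataclasses import dataclass, field

# Hypothetical keyword-to-specialist map. A real manager agent would ask a
# language model which experts the project needs instead of matching keywords.
EXPERTISE_BY_KEYWORD = {
    "protein": "protein design",
    "patient": "clinician",
    "molecule": "chemist",
    "dataset": "data science",
}


@dataclass
class Agent:
    role: str

    def speak(self, agenda: str) -> str:
        # Placeholder for a language model call.
        return f"[{self.role}] thoughts on: {agenda}"


@dataclass
class VirtualLab:
    project: str
    manager: Agent = field(default_factory=lambda: Agent("professor"))
    team: list = field(default_factory=list)

    def assemble_team(self) -> None:
        """The manager picks specialists based on the project description."""
        text = self.project.lower()
        self.team = [
            Agent(role)
            for keyword, role in EXPERTISE_BY_KEYWORD.items()
            if keyword in text
        ]

    def group_meeting(self, agenda: str) -> list:
        """Every specialist speaks once, then the manager summarizes."""
        notes = [agent.speak(agenda) for agent in self.team]
        notes.append(self.manager.speak(f"summary of {len(self.team)} contributions"))
        return notes

    def one_on_one(self, role: str, subtask: str) -> list:
        """A single specialist reviews an intermediate subtask with the manager."""
        agent = next(a for a in self.team if a.role == role)
        return [agent.speak(subtask), self.manager.speak(f"review of {subtask}")]


lab = VirtualLab("Design a protein binder and analyze the binding dataset")
lab.assemble_team()
roles = [agent.role for agent in lab.team]
meeting = lab.group_meeting("candidate selection")
review = lab.one_on_one("data science", "binding score analysis")
```

The point of the sketch is the division of labor: the human supplies only the project description, and the manager decides which colleagues to create.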

Richie Cotton: Okay. That's absolutely fascinating, the idea of agents having meetings.

And so where do humans get involved in this process? Do you just leave the agents to it and see what they come up with, or do you need to intervene at certain points?

James Zou: Yeah. It's a good question. So we can provide oversight and we can participate in these Virtual Lab meetings anytime we want, right?

So for example, I could also contribute my ideas in one of these Virtual Lab meetings. And we actually did some tracking to see how often the humans speak in these Virtual Lab meetings, and also how often different agents speak. So it turns out that we humans don't talk very much. It might be only about 1% of the time that we actually participate and speak in these Virtual Lab discussions.

I think the role that we tend to see with human researchers is more at the higher level. Like for example, we tell the agents some of the general projects we're interested in. Maybe we say, "Oh, here's a particular kind of data set we're interested in analyzing." And we also tell the agents some of the constraints we have, such as how much time we want to spend on this or how much budget we have to do certain experiments.

And then, so these are more like high level guidance and constraints and feedback to the agents. But otherwise, we don't want to micromanage the AI scientists too much, right? So we want to give them some flexibility to come up with their own plans and to implement and execute on those.

Richie Cotton: Okay. And what's the success rate of these virtual labs like?

Do they come up with really good ideas then? Or do you find that one in twenty is a good idea? How does it work?

James Zou: Yeah. So just to give a concrete example, we talked about the SARS-CoV-2 application, right? So in that case, the Virtual Lab agents came back to us with a list of ninety-two new candidate proteins that the agents designed, saying, "Oh, these are good candidates that you can test as binders" to the new COVID variants.

So we actually tested all of these ninety-two candidates. And from these ninety-two, I would say about three or four showed quite promising results, right? And two in particular worked better than previous nanobodies designed by human experts. So you might say, "Oh, that means that's a success rate that's maybe 5% or 10%," right?

But in science, I think that's actually a very good success rate, right? Because usually in drug discovery, where people try to design these proteins, often they have to design thousands, sometimes millions of candidates to find one that works, right? So if you can actually get one that works out of 10, that's actually really good and can save a huge amount of time and effort and cost.
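For readers who want the arithmetic behind these figures, here is a small worked calculation using only the counts quoted in the conversation (92 candidates, three or four promising, two better than prior human designs). The one-in-a-thousand baseline is an assumption chosen to illustrate "thousands, sometimes millions"; it is not a number from the episode.

```python
def rate(hits: int, total: int) -> float:
    """Success rate as a percentage."""
    return 100 * hits / total


CANDIDATES = 92

promising_low = rate(3, CANDIDATES)    # "three or four showed quite promising results"
promising_high = rate(4, CANDIDATES)
beat_human = rate(2, CANDIDATES)       # two beat previous human-designed nanobodies

# Assumed baseline for comparison: one working design per thousand candidates.
baseline = rate(1, 1000)

print(f"promising: {promising_low:.1f}% to {promising_high:.1f}%")
print(f"better than prior designs: {beat_human:.1f}%")
print(f"improvement over assumed baseline: {beat_human / baseline:.0f}x")
```

On the quoted counts the hit rate lands in the low single digits, a little under the rounded 5% to 10% mentioned in conversation, and still more than an order of magnitude above the assumed baseline.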

Richie Cotton: Yeah, whenever you're doing research tasks, like the failure rate is incredibly high. Like most science doesn't work. It's novel and you have to spend a lot of time thinking until you get something meaningful.

James Zou: And I think that's also why it's quite different from some of these other applications that you mentioned at the beginning, let's say, customer support and things like that, right?

Because in customer support, you have to be very high fidelity, right? Or in self-driving cars, right? Or other kinds of automation. Your agents have to work, like, 99.9% of the time. But in science it's actually the opposite. It's okay if the ideas some scientists have do not work most of the time.

In fact, for human researchers, like most of our ideas do not work. That's part of scientific progress. So it's great if just one of 10 of our ideas actually works, and that's already amazing, can lead to some new breakthroughs, then that's already a really good success rate. So we're in this opposite regime of science where it's actually okay to make mistakes for AI and for humans, but we really want to encourage innovation and creativity, which is the opposite of things like more standard automation or customer service and other applications of AI agents.

Richie Cotton: Absolutely. You mentioned self-driving cars. A 10% success rate for self-driving cars, that's not very good if it crashes 90% of the time.

James Zou: Nobody would take those. Yes.

Richie Cotton: Okay, you mentioned creativity is important, and before you were saying how it's changing your approach to creating these models in the first place.

So how do the models need to be different then?

James Zou: Yeah. So the standard way of training, let's say, language models, which are the brain behind most of these agents, is that we're training these models essentially to imitate, right? So if you think about it, the pre-training technique of next token prediction is basically taking an existing corpus of text, right?

And then saying, can the models imitate and actually reproduce what people have done before? And that's basically the optimization signal, the objective we use to train all these models and agents. And that's essentially encouraging, incentivizing these models to imitate human behaviors.

But as we mentioned, in science you don't want to just imitate, you want to innovate, or you want to come up with new ideas that people haven't thought of before. So one thing that we found to be quite useful there is to explicitly change the optimization objective to encourage a lot more exploration.

So for example, instead of having these kinds of imitation learning objectives where you really want the models to reproduce the next tokens and do well on average, we say, "Okay, let's just say it's okay for the models to actually make a lot of mistakes," but as long as, let's say, one out of the 100 tries that it takes actually leads to a new solution that's much better than before, then we want to encourage the model and reward it for that.

So instead of rewarding it for average performance, we reward it for in some sense, like the best performance, right? And that's actually, quite a different way of training these models, but it does lead to more let's say, innovative behaviors.
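The difference between rewarding average performance and rewarding the best of many tries can be made concrete with a toy comparison. This sketch only contrasts the two scoring rules on hand-picked numbers; it is not the training procedure discussed in the episode, and the two "policies" and their scores are invented for illustration.

```python
import statistics


def average_objective(scores: list) -> float:
    """Reward the typical attempt (the imitation-style 'do well on average')."""
    return statistics.mean(scores)


def best_of_n_objective(scores: list) -> float:
    """Reward only the single best attempt out of N tries."""
    return max(scores)


# Hypothetical solution-quality scores for 100 tries each (higher is better).
cautious = [0.60] * 100              # reliably decent, never surprising
explorer = [0.10] * 99 + [0.95]      # fails almost always, one breakthrough

avg_winner = max(
    ("cautious", cautious), ("explorer", explorer),
    key=lambda item: average_objective(item[1]),
)[0]
best_winner = max(
    ("cautious", cautious), ("explorer", explorer),
    key=lambda item: best_of_n_objective(item[1]),
)[0]

print("average objective prefers:", avg_winner)
print("best-of-100 objective prefers:", best_winner)
```

Under the average rule the cautious policy wins, and under the best-of-100 rule the explorer wins despite 99 failures, which is the trade-off described above.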

Richie Cotton: Okay. So I imagine this is like the equivalent of the crazy professor with the wild hair, I guess that's the stereotype.

Do you find that you get a lot weirder responses then, if you're going for creativity? Do you also get a lot of nonsense alongside that?

James Zou: Yeah, I think it is a trade-off. So we came up with this paradigm we call learning to discover, right?

Where we are explicitly changing the training objective to encourage the models to do these more, let's say, exploratory or risk-seeking behaviors that can lead to more new ideas. But as a result of that, it also means that maybe it has more failures, right? By nature, a lot of the crazy ideas do not work out, right?

But I think that goes back to our previous discussion that, maybe that's not desirable when you're talking about self-driving cars, right? You don't want these cars to take crazy routes. But in science, it's actually a desirable outcome that you do want AI and humans to try crazy ideas, still within the safe confines of science, but try new ideas, right?

And it's okay if many of them fail, but one of them works.

Richie Cotton: Related to science, there's obviously agents for data science as well. Is there a similar approach then in terms of having agents for data science? Do you want, like, a virtual data science lab?

James Zou: Yeah. So actually, with a bunch of collaborators and colleagues at Together AI, we created in some sense a data science lab that we call DS Gym. So it's basically a virtual environment, a gym for data science agents, to train those agents, to improve them, and also to evaluate them.

So in this DS Gym, we create this fully self-contained virtual environment where we have quite diverse kinds of data science tasks. Everything from analyzing data to derive hypotheses, to statistical hypothesis testing, all the way to training predictive models, more like Kaggle-style data science challenges that we curated to be quite high quality.

And we also have created the evaluation harness, so we can really accurately provide feedback to the data science agents on how well they're doing on these different tasks, as well as additional resources like synthetic data pipelines that enable the DS Gym to actually generate a lot of interesting traces from synthetic data that can then be used to train the data science agents.

So that actually creates, I think, a nice virtual environment for the agents to self-improve, to become much better at doing these kinds of common data science tasks.

Richie Cotton: That's fascinating. I think humans, yeah, you spend too much time in the gym trying to figure out getting stronger.

I hadn't really thought about agents needing an equivalent system. In this case, you said they can self-improve? Does that mean there's no human interaction needed to make the agents better? Just talk me through how it works.

James Zou: Yeah. So in the recent few months, there's been a lot of interest in developing agents that can self-improve, also called recursive agents or meta agents. These are all different names for essentially the same idea: can agents, with relatively minimal human hand-holding, improve their own capabilities?

And how that typically works is that you need to have some sort of harness where there's some sort of reward signal or feedback signal that automatically goes back to the agent, and then the agent can use that reward or feedback signal to figure out how to improve, either improving their own prompts, instructions, and metadata, or improving their parameters.

In the case of the DS Gym, we basically created that environment where the agent can automatically receive these feedback signals from all these different tasks that we designed, and also from these leaderboards that we have created. And the agents can use those signals either as a way to update their model parameters through supervised learning or reinforcement learning, or they can use them to update their own harnesses, which include their skills and prompts, through approaches like TextGrad as a way to improve these agents.
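A minimal sketch of the feedback loop described here, covering only the prompt-update path: a harness scores the agent automatically, and the agent keeps an edit to its own instructions only when the score improves. Everything in it is hypothetical (the instruction strings, `harness_score`, and `self_improve`); it does not reproduce DS Gym, and the greedy search below is a simple stand-in rather than the text-gradient method mentioned in the conversation.

```python
# Hypothetical instruction pool. In a real harness the score would come from
# running the agent on data science tasks, not from a lookup.
USEFUL = {"check missing values", "plot distributions", "use cross-validation"}
CANDIDATE_EDITS = sorted(USEFUL | {"skip validation", "write a poem"})


def harness_score(prompt: list) -> int:
    """Automatic feedback signal: +1 per useful instruction, -1 otherwise."""
    return sum(1 if step in USEFUL else -1 for step in prompt)


def self_improve(prompt: list, rounds: int) -> tuple:
    """Greedily accept prompt edits that raise the harness score."""
    best = list(prompt)
    best_score = harness_score(best)
    for _ in range(rounds):
        for edit in CANDIDATE_EDITS:
            if edit in best:
                continue
            proposal = best + [edit]
            score = harness_score(proposal)
            if score > best_score:      # no human in the loop: the signal decides
                best, best_score = proposal, score
    return best, best_score


final_prompt, final_score = self_improve([], rounds=2)
print(final_prompt, final_score)
```

The same loop shape applies to the parameter-update path, with the accepted edit replaced by a supervised or reinforcement learning step on the model weights.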

Richie Cotton: Okay. That's very cool that you've got this feedback loop, and then you're getting better agents out of this with, yeah, minimal human interaction. Does this work for all kinds of agents then? Are there ways for any agent to improve in this way?

James Zou: Yeah. So we tried this on quite a large number of agents.

And in particular, one of our interests is in whether we can really create open source data science agents that are very efficient to use, much cheaper to use, and also more transparent for practitioners, right? So in this case, we actually showed that you can take some of these quite small models, right?

With eight billion or four billion parameters, and then by sending those smaller models to the data science gym, they get better and stronger through this self-improvement mechanism, and they end up actually performing at the level of, like, Claude 4 Sonnet across many of these data science tasks.

Richie Cotton: That's very cool. So you mentioned the sort of eight billion parameter thing. This is like the amount of stuff that you can run on a single graphics card, right? So it's available to individual labs or individual researchers.

James Zou: That's right, yeah. So things that could be run, for example, on your laptop.

Richie Cotton: I particularly like the idea that these are gonna be cheaper agents to run. I think a hot topic at the moment is whether you can do AI agents cheaply. So okay. If you've got all these agents that are open source and they're self-improving, talk me through what you can do with these things then.

James Zou: Now it becomes really interesting, right? Because I think one of the real benefits of the open source community is that different practitioners and researchers and companies can start to train a lot of their own models and agents, right? They can use the same platform that we built with DS Gym.

They can use that same platform and customize it to their own use cases to train a lot of their own models, right? And now that there's this larger ecosystem of different agents and different models, this also opens the door for these models to start to collaborate, right?

These more massively multi-agent collaborations, which is another topic that we think is super interesting.

Richie Cotton: Nice stuff. I love that you can start having your own custom models as well. I guess this is part of the beauty of open source, that you can then train things on your own, I guess, corporate data or your own personal data and have your own version.

Okay, so actually we got slightly sidetracked. I was gonna talk about data science agents. Yeah, talk me through where we are up to with data science agents. What's possible? What isn't possible? What does the frontier look like at the moment?

James Zou: With DS Gym, I think we're able to actually train quite good data science agents.

I would say there are two main kinds of tasks that we try to optimize the agents to do in DS Gym. One is more, let's say, exploratory data analysis kinds of tasks, right? So given a complex data set, can you, in a more open-ended way, figure out interesting patterns in that data set that lead to a hypothesis, and can you validate those and do rigorous statistics and data science from that?

The second kind of task is more predictive modeling. Maybe given data sets, can we, let's say, use them to predict housing prices, right? Or predict stock prices, or predict different infectious diseases. These are more like the Kaggle-style prediction tasks.

So I think in both of those cases, the models are now quite good, especially if we provide the agents with the relevant tools and resources, right? So the tools here, for example, could include things like other kinds of relevant data analysis packages. For example, if you want the agents to analyze complex biomedical data sets, then it's very useful for the agents to be able to access a lot of the more specialized MCPs and tools that people have developed for those specific domains.

Richie Cotton: Okay. That's interesting. You mentioned you've got exploratory data analysis agents, and you've got machine learning agents. Is that how specialized you want your agent to be? I know there's a sort of trade-off between a single agent that does everything in data science versus a very narrow agent that does one specific task.

Do you need to go into "Oh, I've got a feature engineering agent," or something even more niche? Like how general or specialized should they be?

James Zou: It's a good question, and we haven't seen a huge amount of benefit in having super specialized agents, like agents that focus on maybe just one particular type of data set or particular type of features, right?

Partly because I think data science itself often does benefit from more interdisciplinary knowledge, right? It has some flavor of statistics and understanding good statistical principles, but also understanding machine learning and also understanding some of the domain knowledge, right?

And I think that's actually one benefit of having agents that are a little bit broader, or at least having a team of agents where they have broader expertise so they can actually start to bring in ideas from different domains.

Richie Cotton: Okay. All right, having them slightly broader. I was thinking about when we talk about what skills data scientists need: you always need domain-specific knowledge as well to understand what's the business or science problem you're trying to solve, and you also need communication skills and things like that.

So do you need other agents for those or are those kinds of skills built into the existing data science agents?

James Zou: Yeah. I think this is one area where having multi-agent collaborations is actually quite natural, right? Because in those kinds of projects maybe you want to have some agents with relevant domain expertise who can query the literature or who have a lot of expertise or experience about what the relevant tools and data sets in that domain are, right?

And you really want to pair that agent with other agents that are good at doing more general purpose tasks, like coding or data analysis or training predictive models, right? So that combination, a team of agents with appropriate coordination, I think often works quite well.

Richie Cotton: Okay, so we're back to the idea of the virtual laboratory you mentioned, where it was a whole team of scientists before, and now it's like a whole team of data scientists.

James Zou: Yeah.

Richie Cotton: Yeah. Can you scale this up? Can you have an entire corporation worth of agents working together on different problems?

James Zou: Yeah, I think that's definitely the next frontier. We have done some projects. For example, we have one project we call the Virtual Biotech that actually has tens of thousands of AI agents, AI scientist agents, that sort of simulate all the different functions of a pharma company, right?

They're all agents that are coming up with potential drug targets, evaluating drug targets, designing therapeutic strategies, designing clinical trials, across the whole spectrum, right? And that involves many different functions and many different kinds of expertise. That's why we have potentially so many agents, right?

And I think that's a really exciting next frontier: seeing whether we can have these fully agent-native organizations, right? Taking some complex organization, maybe one doing complex R&D, and asking, can we really have agents, maybe thousands or even millions of agents, that simulate and emulate the different cross-functional roles of that organization? And then bringing in human supervision and human experts at the relevant places to supervise and mentor the agents.

Richie Cotton: That sounds pretty amazing. That is very science fiction, having these whole grand teams of people working on...

Or not people, whole teams of agents working on the big problems. But how do you go about scaling this? Because I think a lot of people start with, "Oh, I'm gonna automate something with an agent," and then how do you get from one agent to, like you mentioned, thousands?

James Zou: Yeah.

And I think that's where a recursive process comes in, where it's not us manually building one agent and then another agent, right? Instead we have, let's say, a meta-agent or a supervisor agent whose job is to create sub-agents by itself, right? That can lead to a much more scalable setup, where the agents are able to spawn their own sub-agents, to create their own colleagues as needed for specific projects.

So that makes scaling much easier technically. And the other dimension that's really important: as we scale the number of agents onto these more complex tasks, I think it's even more important to have good evaluations, right? Good ways to really make sure that the agents are doing reasonable analysis, right?

To catch mistakes, right? And to provide this kind of scalable oversight to these large organizations of agents. And this is where you want combinations: the more deterministic hooks on top of agents, in addition to good verifiable rewards, and also language model judges with good rubrics, right?

Combinations of all of that become really important as we're scaling up the complexity of these agentic teams.
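The recursive setup described here, where a supervisor creates its own sub-agents on demand, can be sketched in a few lines. This is only an illustration under stated assumptions: the `Agent` and `Supervisor` classes, the role names, and the string-returning handler (a stand-in for a real language model call) are all hypothetical and are not taken from the Virtual Biotech project.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Agent:
    role: str
    handler: Callable[[str], str]  # hypothetical stand-in for an LLM call

    def run(self, task: str) -> str:
        return self.handler(task)


@dataclass
class Supervisor:
    team: Dict[str, Agent] = field(default_factory=dict)

    def spawn(self, role: str) -> Agent:
        # Create a sub-agent only when a role is first needed; reuse it afterwards.
        if role not in self.team:
            self.team[role] = Agent(role, lambda task, r=role: f"[{r}] handled: {task}")
        return self.team[role]

    def delegate(self, plan: List[Tuple[str, str]]) -> List[str]:
        # Each plan step is (role, task); missing roles are spawned on demand.
        return [self.spawn(role).run(task) for role, task in plan]


supervisor = Supervisor()
results = supervisor.delegate([
    ("literature", "find prior work on target X"),
    ("modeling", "train a baseline predictor"),
    ("literature", "summarize relevant data sets"),
])
```

Three tasks are handled by only two spawned agents, because the supervisor reuses the existing `literature` agent for the third task. In a real system the plan itself would presumably come from the supervisor's own model rather than being hard-coded.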

Richie Cotton: Oh, man. Yeah. Certainly having some way of measuring how good these things are seems completely essential. You mentioned a few different things there. You mentioned deterministic hooks.

That is, I presume, a specific test or measure of whether the agent's any good. And then you also mentioned using LLMs to judge the work of other LLMs. Do you wanna talk me through all these different approaches and when you might want to use each one?

James Zou: For a lot of the scientific discovery or data science discovery tasks that we're looking at, I think our starting point is often that we need to have some way to evaluate the quality of the agent's solutions, right?

And ideally that's in the form of verifiable rewards, right? A deterministic reward that does not involve an LLM, I think that's the best option. It's not always possible, but when it is possible, I think it's often the most reliable way to prevent reward hacking.

In other cases, in more open-ended areas, we can also design customized rubrics and have LLM judges and critics use those rubrics to evaluate the performance of the submitted solutions. That also often requires some human supervision, right?

So we also have human domain experts provide feedback. For example, in the case of the COVID design, there are some computational evaluations we can do, but ultimately the feedback comes from actually physically making these proteins, right? We test them in the real world to see how well they work, and that becomes the feedback that goes back to the agent.
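The ordering of preferences described above (use a deterministic verifier when one exists, otherwise fall back to a rubric-guided judge) might look like the following minimal sketch. The `evaluate` function, the rubric weights, and `stub_judge` are assumptions made for illustration; in practice the judge would be a call to a language model, and human or experimental feedback would sit on top of both.

```python
from typing import Any, Callable, Dict, Optional


def evaluate(
    solution: Any,
    verifier: Optional[Callable[[Any], bool]] = None,
    judge: Optional[Callable[[Any, str], float]] = None,
    rubric: Optional[Dict[str, float]] = None,
) -> Dict[str, Any]:
    # Preferred path: a deterministic check that involves no LLM at all.
    if verifier is not None:
        return {"source": "verifier", "score": 1.0 if verifier(solution) else 0.0}
    # Fallback for open-ended tasks: weighted average of per-criterion judge scores.
    if judge is not None and rubric:
        per_criterion = {name: judge(solution, name) for name in rubric}
        total_weight = sum(rubric.values())
        score = sum(rubric[name] * per_criterion[name] for name in rubric) / total_weight
        return {"source": "judge", "score": score, "details": per_criterion}
    raise ValueError("need either a verifier or a judge with a rubric")


def stub_judge(solution: Any, criterion: str) -> float:
    # Hypothetical placeholder for an LLM judge returning a score in [0, 1].
    return {"clarity": 0.5, "rigor": 1.0}[criterion]


checked = evaluate([1, 2, 3, 4], verifier=lambda s: sum(s) == 10)
judged = evaluate("a written analysis", judge=stub_judge, rubric={"clarity": 1.0, "rigor": 3.0})
```

Keeping the `source` field in the result makes it easy to audit later which scores came from a verifier and which came from a judge, since only the judged scores are exposed to reward hacking.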

Richie Cotton: Okay. So I like the idea that you've got some sort of real-world measure of whether this is good or not, and that providing feedback to the agent tells it, "Okay, did you do the right thing or not?" Okay. So suppose you're sold on this dream of having teams or departments of agents solving problems for you.

How do you go about adopting this, particularly in organizations? How do you make sure this happens?

James Zou: Yeah, I think this is also where it's quite different in different use cases. What we found is that many scientists are actually quite open-minded, and they're quite excited to try these agents to help accelerate discovery.

For example, in recent weeks, we're seeing a lot of discussion about AI in math in particular. And that's largely driven by the ability now of these frontier models and agents to solve quite complex open problems, right? And for a lot of mathematicians, the proof is in the pudding, right?

Once they see the results, then even if they were skeptical before, now they're actually very excited about using AI to help make mathematical discoveries. And we've seen that firsthand ourselves, right? We created this platform called Einstein Arena, which is one of the first agent-native platforms for AI scientist agents in the wild to come and cooperate to solve open research problems, right?

We curated a bunch of these open problems, which include many of the math problems that people are interested in. And our main criterion is that for each of those problems, we have a deterministic verifier to evaluate how good a solution is. So we know that these solutions are really correct.

And we opened up this platform so that any agent from anywhere in the world can participate, and it's free. And within just a few weeks, the agents on Einstein Arena actually discovered new best solutions to, I think, 12 well-known problems, right? Which means that these agents, just by collaborating and interacting with each other on the platform, came up with better solutions than anything that humans or AI had previously discovered.

And some of these, I think, were actually pretty impressive breakthroughs in certain areas of optimization and certain areas of math, right? That led to a lot of attention, and it also led to, I think, quite a lot of adoption of these kinds of AI agents in those domains.

Richie Cotton: Yeah, Einstein Arena sounds pretty amazing. I love the idea of having a competition platform for agents. I presume any prizes are for the human teams behind the agents rather than the agents themselves, but can you win things from this, or is it just for glory?

James Zou: Right now it's just for the glory of discovery. And what's really interesting here is that we set up the platform to be really agent-native, right? To participate, you have to prove that you're an AI and not a human in order to enter the arena. And as part of the arena, there's also a platform where the agents can talk to each other, ask questions, or ask for help from their colleagues, from other AI agents.

There's also a leaderboard, so you can see the solutions generated by other AI agents. It's sort of like a competition, but also like a collaboration platform. And the other part of this is that we don't know who the humans behind the agents are, right? Because it's all designed purely for the agent interface.

And I think that's maybe the future of what a lot of research will look like, right? It will be a lot of somewhat anonymous AI agents making discoveries and generating new results. And I think one of the challenges we're gonna have to figure out is how to trace that back to the human teams behind them.

Richie Cotton: Absolutely. The idea of having to prove that you're AI to join a website kind of boggles my mind a bit. I think about all those like CAPTCHA things where you gotta select pictures of bicycles or whatever. What's the equivalent test to prove you're AI?

James Zou: Yeah, it's like a reverse CAPTCHA.

Basically, to interact with the platform, each time the agent has to solve a numerical puzzle, which is very easy for an AI to do but very tedious for a human to do.
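The conversation does not say what the numerical puzzle on Einstein Arena actually looks like, so the following is purely a hypothetical sketch of the reverse-CAPTCHA idea: a challenge that takes a program microseconds and a person with pen and paper a long time. The choice of modular exponentiation and the function names `make_challenge`, `solve`, and `verify` are assumptions.

```python
import secrets
from typing import List, Tuple

Challenge = List[Tuple[int, int, int]]


def make_challenge(n_terms: int = 20) -> Challenge:
    # Each term is (base, exponent, modulus); every value is at least 2.
    return [
        (secrets.randbelow(10**6) + 2, secrets.randbelow(10**6) + 2, secrets.randbelow(10**6) + 2)
        for _ in range(n_terms)
    ]


def solve(challenge: Challenge) -> int:
    # Trivial for code thanks to fast modular exponentiation, tedious by hand.
    return sum(pow(base, exponent, modulus) for base, exponent, modulus in challenge)


def verify(challenge: Challenge, answer: int) -> bool:
    # The platform recomputes the answer itself, so the check is deterministic.
    return answer == solve(challenge)


challenge = make_challenge()
agent_answer = solve(challenge)
accepted = verify(challenge, agent_answer)
```

A scheme like this only shows that the caller can run code. A determined person could script it too, which fits the description of the puzzle as tedious for humans rather than impossible.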

Richie Cotton: All right, that makes sense. You wanna get them to show off their calculation skills. So suppose you've got all these agents making discoveries. What happens to the research then?

Do you publish it as a paper, and then does the agent get cited? How does that work?

James Zou: Yeah, I think this is also where we have an opportunity now to reimagine what papers and, more broadly, what scientific knowledge itself should look like, right? Basically, for the past five hundred years, the way that humans represent scientific knowledge is in the form of these passive papers, right?

They are really passive artifacts of knowledge, because if somebody spent years doing research, and you're a reader, just by reading the static words on a page, it's often not clear what the true insights behind that research are, right? But now I think we have this opportunity to basically, what I call, agentify all the scientific knowledge, right?

We're building this platform called Paper to Agent, which is to basically convert all these passive artifacts of knowledge, from PDFs, from papers, into dynamic, interactive agent authors. Essentially, the paper agent becomes like the virtual corresponding author of that paper.

So it knows how to reproduce the results from that paper. It also knows how to apply the data or the methods from that paper to new problems. So it becomes, in some sense, like a living embodiment of the paper, right? One that can help to disseminate knowledge and also facilitate new collaborations.

Richie Cotton: That does sound amazing. It depends a bit on your field, but I think with a lot of papers, even if you've read the paper and you're an expert in the field, it can still be quite difficult to reproduce what the original authors did. And of course, papers go out of date very quickly, 'cause from doing the experiments to writing them up to getting it published can often take more than a year.

So I love the idea of being able to speed up that process and having more reproducible science. Okay. So what does a paper just for agents look like? Would you want the output of science to be different for agents versus for humans?

James Zou: I think this work is very interesting because the current conception of a paper is really something that's designed for human consumption, right?

You have this sort of HTML or PDF with the figures and captions and tables. But imagine the near future, where many of the entities consuming this research are gonna be AI agents rather than humans.

I think there are ways to make the research artifact itself much more agent-friendly and more agent-native. As an example, instead of having this static PDF, what we do in Paper to Agent is create essentially a customized MCP that captures that research project.

So the MCP itself contains the different tools, the insights, and the know-how from that research project, and the MCP is optimized in a way that enables the agent to reproduce results from that particular research project. So that MCP, I think, becomes much more of an agent-native artifact of research than a static PDF.
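To make the idea of a paper exposed as a set of callable tools more concrete, here is a toy sketch. It deliberately does not use the real Model Context Protocol SDK or the Paper to Agent codebase: `PaperToolServer`, the tool names, and the made-up measurements are all hypothetical, and only the general shape (named tools with descriptions that an agent can list and call) reflects what is described above.

```python
from typing import Any, Callable, Dict, List


class PaperToolServer:
    """Toy registry of callable tools for one paper (not the real MCP SDK)."""

    def __init__(self, paper_title: str) -> None:
        self.paper_title = paper_title
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._descriptions: Dict[str, str] = {}

    def tool(self, description: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        # Decorator that registers a function under its own name.
        def register(func: Callable[..., Any]) -> Callable[..., Any]:
            self._tools[func.__name__] = func
            self._descriptions[func.__name__] = description
            return func
        return register

    def list_tools(self) -> Dict[str, str]:
        # What an agent would read first to discover what the paper can do.
        return dict(self._descriptions)

    def call(self, name: str, **kwargs: Any) -> Any:
        return self._tools[name](**kwargs)


server = PaperToolServer("A hypothetical paper")


@server.tool("Return the (made-up) measurements used in the paper.")
def load_measurements() -> List[float]:
    return [1.0, 2.0, 3.0, 4.0]


@server.tool("Recompute the paper's headline statistic from the raw data.")
def reproduce_headline_result() -> float:
    data = load_measurements()
    return sum(data) / len(data)


@server.tool("Apply the paper's method to new data supplied by the caller.")
def apply_method(values: List[float]) -> float:
    return sum(values) / len(values)
```

An agent would typically call `server.list_tools()` to see what is available, then `server.call("reproduce_headline_result")` to reproduce a result, or `server.call("apply_method", values=[10, 20])` to apply the method to a new problem.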

Richie Cotton: That sounds very cool. I love that idea that you've got the MCP interface and then the agent can just pull all the information from there. In terms of creating these things, I presume you're gonna have AI create the MCP interface, so it's not gonna be like humans have to try and figure out programming stuff for the-

James Zou: Yeah, that's exactly right.

Paper to Agent is a platform. It's also open source, so the platform we released essentially automatically converts research projects and papers into this customized paper MCP.

Richie Cotton: Okay. So I love this idea of how science is changing. Suppose this all works: what does the agentic science future look like in 5 or 10 years' time?

James Zou: First, I would say that I think we're still in the quite early, emerging stages of this agentic science. A lot of progress has been made in the last 6 to 12 months, but we're still in the early stages, right? And one of the big open questions, I think, is what happens if we go outside of these computational domains that we have been discussing as the sweet spot for AI, right?

If we go into these other domains that involve more experimentation and longer-horizon work, it's still not clear. I think we need to do more work on how to really get AI agents to make meaningful progress in those domains. And I think the big open challenge for the next five years is how we get AI to really work well in these domains where we don't have very clean, verifiable rewards, and where progress is natively much longer horizon and might take multiple years.

Richie Cotton: Interesting, yeah. Some huge challenges there. I think anything in the social sciences, where it involves people and things like that, is much more of a challenge. Okay. I don't know whether you have any ideas on what might happen here?

James Zou: I think in the social sciences, one particularly interesting challenge, as you mentioned, is that it's very difficult to do experiments, right?

Anything that involves actually doing experiments in real society becomes very costly. And that's also why it's very hard to get causal signals, which the agents often want in order to design better algorithms and so on. I think one interesting approach there would be using other AI agents as a way to simulate human societies, right?

Say I want to understand the effect of a particular policy, right? It's hard to test that policy in the real world. But if I can have a high-fidelity simulation of the real world, using AI agents that simulate the different human populations, then I can use that as a much faster, more cost-effective, and safer virtual world, right?

As a way to estimate the causal impact of these different policies.

Richie Cotton: That does seem incredibly important for anyone involved in lawmaking, governments, economics, where if you're able to simulate what's gonna happen, you can make a prediction about the impact of all these policies you're gonna create.

That seems like a pretty good win for most of humanity there.

James Zou: I think so, yeah. Because, for example, there are a lot of uncertainties, right? For any of these policies, what's gonna be the impact on gas prices, right? Or how is it gonna affect people's behavior if you increase tariffs, right?

Or if you do this or that. In the world that we have, we only have one timeline, right? It's not like a multiverse where you can try a bunch of things. But if we actually have a good AI simulator, this virtual society, then that does provide something like a multiverse where we can try a bunch of these things first, right?

We can try them in these virtual worlds and see what works well. And then maybe use that to inform how we design the optimal social policies.
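The "multiverse" idea can be illustrated with a toy simulation in which the same seeded population is run once with a policy and once without it, so the difference isolates the policy's effect. Everything here is an assumption made for illustration: a simple numeric demand rule stands in for AI agents simulating people, and the sensitivity and demand ranges are invented.

```python
import random
from statistics import mean
from typing import Iterable


def simulate(price_increase: float, seed: int, n_agents: int = 1000) -> float:
    # One simulated "world": average demand across a seeded population.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_agents):
        sensitivity = rng.uniform(0.2, 1.0)        # hypothetical price sensitivity
        baseline_demand = rng.uniform(50.0, 150.0)  # hypothetical demand with no policy
        total += baseline_demand * max(0.0, 1.0 - sensitivity * price_increase)
    return total / n_agents


def estimated_effect(price_increase: float, seeds: Iterable[int]) -> float:
    # Same seed means the same population in both worlds, so only the policy differs.
    return mean(simulate(price_increase, s) - simulate(0.0, s) for s in seeds)


effect = estimated_effect(0.10, range(20))
```

Because each pair of runs shares a seed, the comparison is a counterfactual on identical populations, which is exactly what a single real-world timeline cannot provide. How far the estimate can be trusted depends entirely on how faithful the simulated population is.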

Richie Cotton: Absolutely. Yeah. I do love the idea of lawmakers being able to think about what they're introducing before they introduce it. And certainly the gas price thing is very topical at the moment.

Okay, I'd like to talk a bit more about the differences in research environments. 'Cause you're a researcher at Stanford University, but you also work in industry at Together AI. Is there a difference in the approach to AI research between the two?

James Zou: I think so.

First, there are also a lot of commonalities, because one of the really nice things about Together AI is the emphasis on research, on publications, and on building open source models. Which means that the work the team at Together does is very much integrated with and disseminated among the academic communities, and vice versa, right?

They take a lot of ideas from academic communities, and they also have a lot of collaborations with professors and students. So I think that's really exciting. One big difference is the emphasis on scale and on efficiency in industry compared to academia, right?

For example, at Together AI, one part of the team is really optimizing the inference engines, right? Optimizing inference engines so they can actually serve trillions of tokens, right? And when you talk about things on that scale, efficiency becomes super important, right?

Optimizing infrastructure, optimizing the kernels, optimizing every component of how to do faster inference, right? That becomes really important and really very useful. And that's very different from the kinds of considerations people in academia often work on, because what we're working on tends to be smaller-scale problems, and because it's not directly facing customers, things like efficiency are often less of a direct consideration.

Richie Cotton: Okay. Yeah they do feel very complementary there. I like the idea that if you don't have to worry about customers so much, you can explore more the novel side of things in academia, but industry's generally better at making things scale and do things

James Zou: Efficiently. Yeah. I think it's super complementary.

And I think it's actually very nice that there are a lot of interesting research problems that arise when we think about this industry-level scale, right? For example, a lot of what we've talked about: how do we start to manage thousands and millions of AI agents, and do that effectively, right?

That's something that we're starting to run into on the industry side, right? With partners and customers, and I think that becomes really interesting on the research side too. And similarly with these questions we've talked about: how do we ensure alignment and safety and guardrails when we have all these agents running around, right?

How do we do scalable oversight? That's something that's really important as we think about adoption and deployment. But that also, I think, requires a lot of fundamental research to be able to really do it well.

Richie Cotton: Absolutely. Yes. You need to keep feeding the progress with new ideas.

I like that. Okay. All right, to wrap up: I always want more people to learn from, so whose work are you most excited about right now?

James Zou: Ooh, a lot of people. When we talk about these kinds of simulation setups, right? When we were talking about simulated societies, I think some of my colleagues at Stanford, Michael Bernstein and Percy Liang, are doing very interesting work using AI teams to simulate societies in different settings.

I think that's really interesting. Then, more on the AI for science side, right? Creating these AI scientist agents to make discoveries, right? Even just in the last week, I think there were some really nice publications from the Google DeepMind team on these AI co-scientist agents, and I think there are also a lot of academic research groups that are building these AI scientist agents.

Richie Cotton: Tons of exciting work. Oh, God, it is difficult to narrow it down, isn't it? But yeah, that's good to know about all of your colleagues' research and some of the interesting stuff coming out of Google. Okay, nice. Thank you so much for your time, James.

James Zou: Thanks for having me. Really enjoyed the conversation.

Topics

AI Agents

Artificial Intelligence

Related

podcasts

Enterprise AI Agents with Jun Qian, VP of Generative AI Services at Oracle

Richie and Jun explore the evolution of AI agents, the unique features of ChatGPT, advancements in chatbot technology, the importance of data management and security in AI, the future of AI in computing and robotics, and much more.

podcasts

Data Science Trends from 2 Kaggle Grandmasters with Jean-Francois Puget, Distinguished Engineer at NVIDIA & Chris Deotte, Senior Data Scientist at NVIDIA

Richie, Jean-Francois, and Chris explore the role of AI agents in data science, the impact of GPU acceleration, the evolution of competitive data science techniques, model evaluation, communication skills, the future of data science roles, and much more.

podcasts

How AI Agents Will Work While You Sleep with Ruslan Salakhutdinov, Professor at Carnegie Mellon

Richie and Russ explore the most exciting use cases of AI agents today, long horizon tasks, the credit assignment problem, multi-agent systems, reliable human-in-the-loop workflows, agent safety and guardrails, and much more.

podcasts

AI Agents at Work: What Actually Breaks (and How to Fix It) with Danielle Crop, EVP Digital Strategy & Alliances at WNS

Richie and Danielle explore AI agents at work, experimentation with guardrails, data privacy, access, OpenClaw automation wins and failures, token costs, tying AI plans to P&L strategy, how data teams handle unstructured data governance, and much more.

podcasts

Building Trust in AI Agents with Shane Murray, Senior Vice President of Digital Platform Analytics at Versant Media

Richie and Shane explore AI disasters and success stories, the concept of being AI-ready, essential roles and skills for AI projects, data quality's impact on AI, and much more.

podcasts

Building & Managing Human+Agent Hybrid Teams with Karen Ng, Head of Product at HubSpot

Richie and Karen explore the evolving role of AI agents in sales, marketing, and support, the distinction between chatbots, co-pilots, and autonomous agents, the importance of data quality and context, the concept of hybrid teams, the future of AI-driven business processes, and much more.
