Jure Leskovec is a Professor of Computer Science at Stanford University, where he is affiliated with the Stanford AI Lab, the Machine Learning Group, and the Center for Research on Foundation Models.
Previously, he served as Chief Scientist at Pinterest and held a research role at the Chan Zuckerberg Biohub. He is also a co-founder of Kumo.AI, a machine learning startup. Leskovec has contributed significantly to the development of Graph Neural Networks and co-authored PyG, a widely-used library in the field. Research from his lab has supported public health efforts during the COVID-19 pandemic and informed product development at companies including Facebook, Pinterest, Uber, YouTube, and Amazon.

Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.
Key Quotes
The best way to make decisions is to make them via predictions, via forecasts. You want a forecast, and then based on that forecast, you make a decision. The demand for predictions, for forecasts, for decisions is only going to increase.
The power of data, private data, is becoming more important. If everyone is using the same pre-trained model, then the only way we make differences among ourselves is by the data we have.
Key Takeaways
Develop a foundation model specifically for enterprise data to handle structured and semi-structured data, enabling more effective decision-making through predictions and forecasts.
Utilize graph transformers to learn directly from raw data without manual feature engineering, allowing for faster and more accurate model development.
Implement relational foundation models to achieve human-level accuracy in predictions without the need for extensive training, enabling rapid ad hoc predictions.
Transcript
Richie Cotton: Hi Jure, welcome to the show.
Jure Leskovec: Yeah, hi. Glad to be here.
Richie Cotton: There are a lot of great foundation models already, so can you just tell me why are you building another one?
Jure Leskovec: That's a great question. Definitely there are foundation models, like the large language models. We also have protein language models and DNA language models on the biology side, and a bunch of things in computer vision.
But what we are fundamentally missing is a foundation model for enterprise data, for business data, that type of data. There are two differences when you think about how a foundation model for enterprise data would look. The first is that enterprise data is usually organized in this structured, semi-structured way, with different data in different tables.
And then we have pointers, primary-foreign key relations, between these different tables. So that's, I would say, the first big difference: this is not a sequence of tokens anymore, it's a much more complex structure. And we'll get there during the podcast, I assume.
And then I think the second difference is about what kind of tasks you want to be performing on top of this type of data. In enterprise data, the key task to perform is decision making, right? We are collecting this data to be able to make decisions. Now, if you ask, what is a decision?
A decision is something...
So this is different. LLMs cannot do these things.
Richie Cotton: Okay. So I think predictions are going to be very familiar to anyone who's done machine learning before. There are lots of different kinds of models for machine learning, including neural networks, but also support vector machines and random forests and all these kinds of things.
Talk me through your new model type and how it's different from these traditional machine learning approaches.
Jure Leskovec: That's a great question. And maybe another way to say it, and I'll get to the answer to your question, is the following, right? The AI revolution has taken us by storm, right?
Computer vision is never going to be the same because of convolutional neural networks like learning directly on top of pixels, right? Large language models, natural language processing will never be the same because now we learn directly from tokens and all kinds of tasks and abilities emerge, right?
But if you think about predictive modeling, about machine learning, that hasn't really been touched by the AI revolution. It's still stuck 30 years back, right? Like decision trees, I think, were invented in the 1980s or even earlier, before I was born. Perceptrons even earlier, and so on. So machine learning, the predictive part, hasn't really been touched by the latest wave of AI. And what I'm excited about is to bring the AI revolution to machine learning. That's why we talked about why we need another foundation model: because we need a foundation model for machine learning.
That's the exciting thing. And then, how is the foundation model for machine learning different from, let's say, the classical ML algorithms that you talked about? There are two separate differences. The first difference is that it can learn directly on your raw data, so you don't have to do manual feature engineering.
And the second difference is that it comes pre-trained. So you can just point it to the data, specify the task, and you get a prediction. So you don't even have to train the model for a specific task.
Richie Cotton: Okay. So no training, no feature engineering. This all sounds like a dream come true for machine learning scientists.
So I'd love to get into the technical details of how the workflow changes and things like that later. But you talked about some of the enterprise and business use cases. Just to make it concrete, can you give some examples of what these new graph transformer models are suitable for?
Jure Leskovec: I would say any kind of predictive task.
And then you can put a different loss function on top of them when you train. But this would be anything from classic forecasting models, to fraud models, to customer behavior models about lifetime value and churn, to any kind of risk models around, let's say in the financial or insurance industry, what kind of risks are going to happen.
Sales lead scoring, next best action, personalization, recommender systems; we see them used in healthcare to predict risk of readmission. So all these kinds of classical, let's call them, machine learning predictive problems: predicting a binary value, which is classification, multi-class, multi-label, things like that.
Any kind of regression, time series, temporal-type problems. And then any kind of set prediction and ranking problems for personalization, search result ranking, and so on and so forth.
Richie Cotton: Okay, so this really is like the full suite of anything you can do with traditional machine learning.
It's like you can then do that with these new models.
Jure Leskovec: Pretty much, I would say, at the first order of resolution, in some sense. Definitely at the second order there might be some nuances.
Richie Cotton: Okay. Alright, nice. Can you explain to me what the business benefits are that you might be expecting? Is it just about being able to create models quicker?
'Cause you don't have to do feature engineering? Or is it about getting better accuracy? What should we be looking for to see what the improvements are?
Jure Leskovec: Yes, good question, and I will just acknowledge, right, that here I'm promising a lot, and the audience should be almost like, hey, what's going on?
This guy sounds crazy, right? So if people are listening carefully, they should be skeptical, right? What are the benefits we see? I would say there are maybe two nuances to separate out. One is the benefit of the pre-trained model, the model that doesn't even have to be pre-trained on, let's say, customer-specific data or user-specific data: you are able to get a prediction for any kind of predictive problem in less than one second. So this means that you don't have to spend time on feature engineering, training data sets, and all that, right? So you go from months or weeks of work to one second's worth of work: connect to the data, make a prediction.
And in this case, with the pre-trained model, the accuracy is comparable to what senior human engineers, data scientists, are able to build manually. So you will get a human-level-accuracy prediction with a pre-trained model. Now, if you want to push that further, you can then fine-tune the model on your data for your task.
And if you do that, then you get another 10, 20, 30% more accurate, and that means you basically get into the superhuman regime of accuracy. And that's very valuable. Before, we were asking about use cases, right? If you're doing fraud detection, direct business impact. If you're doing personalization and recommendation, direct business impact. If you're doing ranking of products, direct impact. If you're doing any kind of ad matching, like user-ad matching.
We see great results where engineers are able to get years' worth of improvement in a couple of weeks.
Richie Cotton: Yeah, certainly those were some bold claims, getting superhuman levels of performance and being able to answer questions within a couple of seconds.
Do you have any examples from real customers who've been solving real problems, just to back this up?
Jure Leskovec: So, for example, where we've seen great results is on ad matching, user-ad matching, so predicting which ad will be relevant for which user. We have been working with Reddit, which is a humongous social media platform.
And I would say that in a matter of a couple of months, they were able to make about three years' worth of modeling progress. What do I mean by that? They have an amazing team. They've been building this ad click prediction model for several years. It's their flagship type of model.
And every year they're able to increase the performance of that model by some delta, a couple of percentage points. By basically taking this technology and retraining the models using it, they were able to push forward by about three years' worth of progress in a matter of weeks.
We've seen similarly amazing results on fraud detection use cases and on, let's say, sales lead scoring use cases. This one is public as well: in collaboration with the Databricks sales team, they are trying to predict which of the sales leads is actually going to convert.
I can actually say their definition of conversion is, will I be able to meet with that potential customer, right? Will I be able to schedule a meeting and meet in the next one month? And if you think about it as a data scientist, it's like, what features predict whether you and I will be able to agree on a time to meet, and whether you and I are going to show up?
It's unclear, right? As a human data scientist, you're like, what features should I generate? So the benefit of this approach is that it can go all the way down to the data and basically discover those signals that turn out to be very predictive for whether you and I are going to be able to meet in the next 30 days.
So that was a really good use case, and yeah, I can probably think of more as we go along.
Richie Cotton: That's impressive stuff, especially working with Databricks, because all the people I've worked with at Databricks are very smart. So these are top-level data scientists, and if those humans can't figure out the features, then having the machines do it better is a pretty impressive benchmark, I guess.
Jure Leskovec: I can explain.
I would say, when we go and discuss a bit more about the technology, it'll be clear how this is different. You asked me about how it changes the workflow. The data scientist is needed, and data science jobs are not going away. I think the difference is that the goal of a data scientist is to really model the data and have the business impact, and you don't have to be bogged down in data cleaning, feature engineering, and creating training data sets to get to that impact, right? With these new graph-transformer-based technologies and platforms, you as a data scientist can get to that goal of business impact, of having a performant model, much faster.
And you can get there faster with better accuracy. But the human, the data scientist, modeling the data, thinking about how to take a business problem and translate it into a predictive problem, what data is needed, how the problem is formulated, I think is crucial. And today data scientists are maybe spending 5% of their time doing that, and the rest is grunt work.
So that, I think, is the way the world changes: being able to learn over more data with higher fidelity, much faster, so that models can be built in a matter of hours or days rather than weeks and months.
Richie Cotton: Okay. Certainly having that speed-up seems pretty incredible.
Yeah, you mentioned that it would be helpful to understand the technology a bit, so can I get you to explain what a graph transformer is? Maybe we'll build it up: let's start with just what graphs and networks are and why they're useful.
Jure Leskovec: Yes, great point. So generally, when we think of a graph or a network, the first association we get is a social network, right? Like we have people and they are connected by a certain type of edge. You and I have sent email to each other, you and I are now talking, maybe we are connected on LinkedIn, and things like that. So you have this set of nodes and then the relationships between them.
And this graph learning technology has been super helpful and useful to model this kind of interconnected data. One type of interconnected data that we all have, but generally don't think of that way, is a database, right? Enterprise data, as it resides in multiple tables, is also a graph.
I could have my customers table, my products table, and then, let's say, a transactions table, right? So a user, through transactions, buys a product. That's the simple three-table schema. But the schema could also have another table, let's say website clicks, so a user can click on a product.
A user can purchase a product, right? And now I have this schema of four tables. That is a graph. So I can take that and represent it as what we call a temporal heterogeneous graph, right? Every user is a node, every product is a node, every click is a node, and every purchase is a node. So now a user is connected to the click, which is connected to the product, and of course the click has a timestamp.
And every user is also connected to the purchase, which also has a timestamp and a bunch of other attributes, back to the product, right? So now you have this heterogeneous graph, because you have different node types: the user node type, the click node type, the purchase node type, and the product node type, right?
And these things are interconnected. So now, what is the benefit? In traditional machine learning, what we are doing is joining these tables. I would take the user table and the transactions table, I join them, and then I need to aggregate over the user. So I say, how many transactions did they have?
How many in the morning? How many in the afternoon? How many over the last five days, six days, seven days, on holidays, on odd-numbered days, on even-numbered days? It can go on forever. But the idea is that rather than needing to join tables manually, they're now interconnected through this schema, through these primary-foreign key relations.
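To make the database-as-a-graph idea concrete, here is a minimal sketch of that four-table schema represented as a temporal heterogeneous graph in PyTorch Geometric (the open-source library mentioned later in the conversation). The table sizes, feature dimensions, and edge names are made-up placeholders, and the random edges are not consistent the way real foreign keys would be; it only illustrates the node-type, edge-type, and timestamp structure Jure describes.
```python
import torch
from torch_geometric.data import HeteroData

# Toy sizes for the four tables in the example schema (placeholders only).
num_users, num_products, num_clicks, num_purchases = 1000, 200, 5000, 3000

data = HeteroData()

# Each table becomes a node type; each row becomes a node with a feature vector.
data["user"].x = torch.randn(num_users, 8)            # encoded user attributes
data["product"].x = torch.randn(num_products, 16)     # encoded product attributes
data["click"].x = torch.randn(num_clicks, 4)
data["click"].time = torch.randint(0, 10_000, (num_clicks,))        # event timestamps
data["purchase"].x = torch.randn(num_purchases, 6)
data["purchase"].time = torch.randint(0, 10_000, (num_purchases,))

# Primary-foreign key relations become typed edges: user -> click -> product, etc.
data["user", "made", "click"].edge_index = torch.stack([
    torch.randint(0, num_users, (num_clicks,)),        # which user made each click
    torch.arange(num_clicks),
])
data["click", "on", "product"].edge_index = torch.stack([
    torch.arange(num_clicks),
    torch.randint(0, num_products, (num_clicks,)),     # which product was clicked
])
data["user", "made", "purchase"].edge_index = torch.stack([
    torch.randint(0, num_users, (num_purchases,)),
    torch.arange(num_purchases),
])
data["purchase", "of", "product"].edge_index = torch.stack([
    torch.arange(num_purchases),
    torch.randint(0, num_products, (num_purchases,)),
])

print(data)  # a temporal heterogeneous graph: 4 node types, 4 edge types
```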
But now, when you train or when you build a graph transformer on top of this: a graph transformer is the transformer architecture that we all know. We say the transformer architecture learns over a sequence of tokens, but that's false. What it actually learns over is a multiset of tokens, right?
So basically it's tokens, and it's a multiset, meaning each token can appear multiple times inside that context window. Alright? So it's learning over a set, but then how do we know what the sequence is? Because we have positional encodings, and for every token we describe, in that set, what its position in the sequence is, right?
So now, what is a graph transformer? A graph transformer is a generalization of that, where you can learn over an interconnected graph, and the way you learn over it is you say, okay, I now have a set or a multiset of nodes, but my positional encodings are much more complex, because they tell me where in the graph each node is.
And of course, building the infrastructure that allows you to do this is quite hard, because text is linear and it's easy to pump it through, but graphs you cannot easily shard; they're harder to work with. So that's what's called a graph transformer. So what did we say so far? We said the database is a graph, and now that I have a graph representation of the database, I can apply a graph transformer on top of it. And then you could say, so why should I care?
Why does it matter? There are two reasons why this matters. The first is that now you can basically apply, in some sense, deep learning. You can apply deep neural networks to learn directly on your raw database representation, right? The same way a convolutional neural network learns over the pixels,
now this graph transformer, or even a graph neural network, which is a bit different type of model, but for people who are familiar with those, they can learn directly on top of your raw data. So you don't have to do feature engineering, you don't have to be joining tables manually anymore, because it can now learn over a set of interconnected tables.
So the process of building the model or tuning the model is much faster. And then the second benefit is that the fidelity of this attention mechanism over the graph is so much more fine-grained than what we humans are able to come up with through our joins and aggregations that it generally leads to higher accuracy.
Because it's attention: rather than saying, let's take the customer table and the transactions table and join them, you are now basically saying, let's attend over all the customer's transactions. And that attention mechanism can figure out early transactions and late transactions, or morning transactions, or whatever the pattern is.
It can extract it from the raw transactions. So rather than joining and creating one signal, you are now attending over all the events, and you can not only do this at one level, you can do this recursively. So you go from the user to transactions to products, back to transactions, to other people, to transactions, to products.
So you can learn all the collaborative filtering effects and so on and so forth, right? The model has that capability. Sorry if I'm talking a lot, but maybe I have one more important thing to say. In the old days we heard a lot about AutoML, right? And I think data scientists have been burned historically by AutoML, but this is fundamentally different from AutoML.
AutoML was about, let's just join everything with everything, aggregate, generate a gazillion features, throw them against the wall, see what sticks. It doesn't scale, it's brute force, and all that, right? The approach I'm advocating for is fundamentally different. It basically follows this modern AI trend: have a powerful neural network learning directly on the raw data.
The neural network is so powerful and capable that it can extract and detect new types of patterns that we as humans are maybe not able to. That's maybe a short, long introduction to this.
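As a rough illustration of the "multiset of tokens plus positional encodings" framing, here is a minimal sketch in PyTorch: node feature vectors are treated as tokens, a graph positional encoding (Laplacian eigenvectors, one common choice) tells the model where each node sits, and a standard transformer encoder attends over the node set. This is a generic graph-transformer pattern for illustration, not the specific architecture behind Kumo's models.
```python
import torch
import torch.nn as nn

def laplacian_positional_encoding(edge_index: torch.Tensor, num_nodes: int, k: int) -> torch.Tensor:
    """k smallest non-trivial eigenvectors of the graph Laplacian: one common way
    to tell the transformer *where* in the graph each node sits."""
    adj = torch.zeros(num_nodes, num_nodes)
    adj[edge_index[0], edge_index[1]] = 1.0
    adj = torch.maximum(adj, adj.t())              # symmetrize
    lap = torch.diag(adj.sum(dim=1)) - adj         # unnormalized Laplacian
    _, eigvecs = torch.linalg.eigh(lap)            # eigenvectors, ascending eigenvalues
    return eigvecs[:, 1:k + 1]                     # drop the trivial constant eigenvector

class TinyGraphTransformer(nn.Module):
    def __init__(self, in_dim: int, pe_dim: int, d_model: int = 64):
        super().__init__()
        self.proj = nn.Linear(in_dim + pe_dim, d_model)   # token = node features + positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x, pos_enc):
        tokens = self.proj(torch.cat([x, pos_enc], dim=-1)).unsqueeze(0)  # (1, num_nodes, d_model)
        return self.encoder(tokens).squeeze(0)            # contextualized node embeddings

# Toy usage: 6 nodes in a ring, 5-dim node features, 3-dim positional encoding.
edge_index = torch.tensor([[0, 1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 0]])
x = torch.randn(6, 5)
pe = laplacian_positional_encoding(edge_index, num_nodes=6, k=3)
out = TinyGraphTransformer(in_dim=5, pe_dim=3)(x, pe)
print(out.shape)  # torch.Size([6, 64])
```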
Richie Cotton: Okay, that is a nice explanation, actually, I have to say. The thing that surprised me most is the idea that any standard SQL relational database can be thought of as a graph.
And I guess, yeah, I've spent many hours just staring at those schema diagrams and not really thought, oh, yeah, okay, this is a graph here. But yeah, it really is. The question here is around scaling, then. So suppose you've got a network of your whole company database, with lots of different connections.
Do graph transformers scale, then? It sounds like it's going to be very computationally expensive to think about this whole network in one go.
Jure Leskovec: That's a great point. And I would say nothing is magic, nothing comes for free. So saying, oh, now you can bring your thousand-table schema with a bazillion records and learn over that, that's a bit too optimistic.
So when I say modeling, I really mean we've been able to scale these to, let's say, 50, 60 tables, tens of billions of records, right? Think about the scale of Reddit. Other large-scale use cases, for example, are food recommendation with DoorDash-type ordering data, which is also massive, or fraud detection on the entire Bitcoin blockchain,
which is a project we are working on with Coinbase. But even there, right, this is not all the data. As a data scientist, as a data modeler, you select what tables you want to use and what date ranges you want to use when you build this model. So there is some iteration to really discover what data is useful and what is not useful.
Of course, if you throw everything in, in principle we can learn over all that, but it'll be very computationally expensive. The inference time will also be longer because a lot of data will be used, so maybe it's not the most optimal thing to do. By quickly iterating, you or a data scientist can build much more accurate models than before.
Richie Cotton: Okay. So really, if you're trying to solve a sales problem, you just put the sales tables in. If you're trying to solve a logistics problem, you just put those operational tables in.
Jure Leskovec: Exactly, for example. Or, many times we see people start with purchasing data, then they say, oh, let's also add in website behavior data.
Let's add the returns data. And then they see how the model performance climbs as they integrate more data, and they can make an informed trade-off decision to see what the sweet spot is for them in terms of data size, data volume, model accuracy, cost of training, and things like that.
Richie Cotton: Okay, nice. So I'd love to talk a bit about how you get started with this. Suppose you decide, okay, I want to try one of these graph transformers. Is there anything you need to prepare? Do you need to do anything with your data in order to make it ready for use with these models?
Jure Leskovec: Yeah, so there are a few options that people have here. One is to go the open source route. The approach here would be to use PyG, PyTorch Geometric; pyg.org is the library. It has, let's say, a medium level of scalability. I wouldn't necessarily say that it's production ready, and it requires quite a lot of expertise, because the graphs have to be manually created, the training data sets and the labels have to be manually created, and then the models are built.
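For a feel of what that hand-rolled open-source route looks like, here is a rough PyG sketch over a made-up stand-in graph: you build a small heterogeneous graph yourself, define a GNN, convert it with to_hetero, and supply your own labels. Real pipelines would also need neighbor sampling, proper train/validation splits, and an optimizer loop, which are omitted here.
```python
import torch
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData
from torch_geometric.nn import SAGEConv, to_hetero

# A tiny stand-in heterogeneous graph (in practice, built from your own tables).
data = HeteroData()
data["user"].x = torch.randn(100, 8)
data["product"].x = torch.randn(50, 16)
data["user", "purchased", "product"].edge_index = torch.stack([
    torch.randint(0, 100, (300,)), torch.randint(0, 50, (300,)),
])
data = T.ToUndirected()(data)  # add reverse edges so messages flow both ways

class GNN(torch.nn.Module):
    def __init__(self, hidden_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden_channels)  # lazy input sizes per node type
        self.conv2 = SAGEConv((-1, -1), out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

model = to_hetero(GNN(32, 2), data.metadata(), aggr="sum")
out = model(data.x_dict, data.edge_index_dict)   # dict of per-node-type outputs

# The labels (e.g. which users churned) have to be created manually in this route.
churn_labels = torch.randint(0, 2, (100,))
loss = torch.nn.functional.cross_entropy(out["user"], churn_labels)
loss.backward()
```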
Another possibility here is to use a platform called Kumo. It's industry scale, easy to use, robust, where basically all you have to do is connect to your database. Essentially, you just register the tables you want to be using, and you specify the predictive problem in a SQL-like language. So we say, oh, I want to predict this label for this entity. And then basically, from that, the system can generate the labels. You can tweak that as a data scientist as much as you want, and then kick off the model training.
These models, compared to large language models, are actually not humongous, and they are able to train in a matter of hours, let's say less than a day.
Richie Cotton: Okay, alright. The way you just casually mentioned Kumo as an example, of course. Since you're building the platform, yeah, I expect it'd be the one you'd recommend.
So it seems the main idea is that you've got to translate your relational database into the graph format, and you can either use the open source Python tools or your commercial platform for this.
Jure Leskovec: Yeah. To be able to do this, you have to take the data and translate it into the graph, and if you have to build the graph yourself manually, that's a lot of work.
The second thing is that scaling up graph learning is very hard. Commercial graph databases and things like that have been built for a different use case, because when you are doing graph learning, for a given node you need to sample a local neighborhood, like a little circle around it, and you need to be able to do that very fast: add the node attributes, the features, and send that to the GPU, so fast that the GPU doesn't start getting bored. That's quite hard. And the hard part is that it's very hard to shard graphs, so you cannot scale throughput by sharding.
So in my career, I was Chief Scientist at Pinterest for six years, and I built two generations of graph learning platforms there. It took us like three years to build before it was really performant, but it had a material impact on Pinterest's business across a large number of different use cases.
So here at Kumo, we took that time and effort to build a platform, to put in all the innovation, all the latest architectures, to really allow people to train predictive models directly on the raw data. And what is nice is you can either get predictions out of the models or you can actually get embeddings,
which you can then use for any kind of downstream analysis or tasks, or even as features in more traditional models, and things like that.
Richie Cotton: Do you need to have one model for each different use case, then? You mentioned the idea that you probably don't want all of your corporate data in one single model because of the scale.
So have you gotta build lots of these things for each use case?
Jure Leskovec: Great question. So maybe at the beginning we glossed over that a bit, right? I would say, historically, in machine learning we've been building one model per task. Kumo comes in and disrupts this space by saying, you can still build one model per task, but just point me to the raw data and specify the task.
I will build a neural network that learns directly from your data. I will save you the time of feature engineering and reduce the time that you need for data preparation. There is still some data preparation, but it's cut down by 95%, and I'll give you a 20, 30% more accurate model,
again, depending on the data, depending on the task, and things like that. So that's the first part; that's what we've been discussing. Now we have just announced something new, which I would say is a true scientific breakthrough, which is a pre-trained foundation model.
The idea there is that you don't even have to train the model. You just point it to your data, you specify the task in some kind of declarative language, and one second later you get an accurate prediction. And if you think about it, this now allows you to do ad hoc predictions, right? Not, oh, you want to estimate churn probability, so let me go away. Before, I would go away for a month or two months to build a production-ready churn model.
Now with Kumo, maybe I say I will go away for two weeks, or a few days, to train a model. Now with the foundation model, you can directly get a prediction by just asking the foundation model, and that's really, I think, super cool, and especially important in the AI agent world, where you need agents to make decisions.
Non-hallucinated decisions over private data. And that's where this starts to shine and shows its complementarity to the human-level-reasoning large language models that we have today.
Richie Cotton: We've gone from the traditional machine learning thing of one model per task, to, I guess, the standard graph transformer, which is like one model per dataset, and with the pre-trained foundation model, you've basically just got one model for everything.
Jure Leskovec: That's correct. And that was very surprising to us, right? When we saw this working in practice, it blew us away. It wasn't clear that you could basically build a dataset-agnostic model that is able to make predictions. And you could even ask, how is that possible? Because it's not a single table; it's a model that works across multiple tables, and tables can be anything from very narrow to very wide, with different numbers of rows and columns and different information in those rows and columns.
And that could be users and purchases, or it could be posts and clicks, or it could be drugs and phone calls and patients, or whatever it is. So that was the impressive thing. And what we have built is called KumoRFM, for relational foundation model.
It's available, and people can go and try it for free at kumorfm.ai. The exciting thing is that basically you connect to your data, and in a SQL-like language you specify what target you want to predict for what entity. So I could say, predict the sum of transactions.price between today and 30 days in the future,
and give me this prediction for user ID 27, and it's going to predict the sum of transaction prices, so a lifetime value of a customer over the next 30-day window. Now somebody else can come and say, oh, I don't like 30, I want 27. We can do 27. Somebody can say, oh, I don't want the transaction price, I want just a count. So you can say count, and now we are predicting the number of transactions over 27 days. So that's a beautiful thing.
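As a hedged illustration, the two spoken examples might look something like the following as declarative predictive queries; the strings below are made up to mirror what Jure describes and are not guaranteed to match KumoRFM's exact query syntax.
```python
# Illustrative only -- these strings mirror the spoken examples and may not
# match Kumo's actual predictive query syntax.
ltv_query = (
    "PREDICT SUM(transactions.price, 0, 30) "   # sum of prices over the next 30 days
    "FOR users.user_id = 27"                    # for one specific customer
)
count_query = (
    "PREDICT COUNT(transactions.*, 0, 27) "     # swap the aggregate and the horizon
    "FOR users.user_id = 27"                    # now: number of transactions in 27 days
)
```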
What happens in the backend is that basically the private data is represented as a graph. This predict statement generates a number of in-context examples, and those historical in-context examples, plus the unlabeled example, are then sent through the foundation model, which basically, in a single forward pass, in its neurons, essentially builds a machine learning model
based on those in-context examples and gives you a prediction, right? So the amazing thing is that the transformer, in a single forward pass, gets a little training data set plus a set of unlabeled examples, and is able to, in its brain, or however else to say it, build a predictive model that gives you an accurate prediction for those unlabeled examples.
So it means you can throw in any kind of examples, any kind of labels, and a set of unlabeled examples, and in less than a second it figures it out. So there is no training, no gradient updates, anything like that, right? Sometimes I would demo this and people would be like, wow, you really sped up training.
It's, no, there is no training. The model is frozen. What it gets in is a set of in-context examples and then the unlabeled examples, and it generalizes from the in-context examples to the unlabeled ones.
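To give a feel for that single-forward-pass, in-context mechanic, here is a deliberately tiny, untrained PyTorch sketch: labeled context rows and unlabeled query rows go in together, and predictions for the query rows come out with no gradient updates. It only illustrates the interface; KumoRFM itself works over relational graphs rather than flat feature rows, and its actual architecture is not shown here.
```python
import torch
import torch.nn as nn

class InContextPredictor(nn.Module):
    """Toy in-context regressor: labeled context rows plus unlabeled query rows go
    in together; predictions for the query rows come out of one forward pass."""

    def __init__(self, n_features: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(n_features + 1, d_model)   # a feature row + its (possibly masked) label
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    @torch.no_grad()  # the pre-trained model stays frozen: no gradient updates at prediction time
    def forward(self, ctx_x, ctx_y, query_x):
        ctx = torch.cat([ctx_x, ctx_y.unsqueeze(-1)], dim=-1)              # labels visible
        qry = torch.cat([query_x, torch.zeros(len(query_x), 1)], dim=-1)   # labels masked out
        tokens = self.embed(torch.cat([ctx, qry], dim=0)).unsqueeze(0)     # one "prompt" of rows
        out = self.encoder(tokens).squeeze(0)
        return self.head(out[len(ctx):]).squeeze(-1)                       # predictions for query rows

# Usage: 50 historical (features, label) examples as context, 5 rows to predict.
model = InContextPredictor(n_features=10)
ctx_x, ctx_y, query_x = torch.randn(50, 10), torch.randn(50), torch.randn(5, 10)
print(model(ctx_x, ctx_y, query_x).shape)  # torch.Size([5])
```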
Richie Cotton: Okay, it sounds like powerful stuff, and it also sounds easier to do than a lot of traditional machine learning.
So does this mean that, for example, commercial team members could use it? If you've got a sales problem and you want to do some sales predictions on your pipeline, is that something that an account executive could do themselves?
Jure Leskovec: Exactly. I think with this technology, we are truly democratizing access to machine learning, right?
First, it's much faster, much less costly, much cheaper, in the sense that we don't have to build humongous teams and wait a long time to get a model. Today, I would say, high-end Silicon Valley-type technology companies are able to find the talent and build these predictive models, to huge benefit to them.
I was building them at Pinterest before, and so on, to the huge benefit of those organizations. But not everyone is so lucky or so fortunate to have access to that talent and to be able to invest so much in this. Now, with these pre-trained foundation models, we are making this much more broadly accessible, even to people who don't necessarily have a PhD in AI or machine learning, and we also allow for faster experimentation.
So it means that organizations can really iterate on their product, on solving business problems, rather than saying, oh, we are now stuck trying to figure out how to make accurate recommendations. The foundation model can do that, and even a business person can then use this to actually have the impact.
Richie Cotton: Okay, nice. That seems very cool, the fact that basically anyone can then ask questions about the data if they've got questions that require predictions, and go beyond just, oh, I'm looking at a dashboard or doing some sort of simple analytics. If you want to get started with this, what's an example of a good, simple first project to try?
Jure Leskovec: Yeah. So, for example, a nice, very simple data set is actually a Kaggle data set from H&M, the clothes retailer. It has this nice schema of customers, transactions, and products, with a number of columns, right? Descriptions, images, all that.
And it's very easy to build all kinds of models over that very simple schema. You can be predicting whether customers are going to churn, you can be estimating their lifetime value, you can do product sales forecasting, you can do recommendations, and you can do any kind of attribute inference, for example trying to predict the gender or other attributes of customers, and things like that.
Richie Cotton: Okay, that sounds cool. Yeah, and certainly I think retail data is fairly simple, or at least it's something that a lot of people know about; everyone's done some sort of shopping. So try something at a small scale where you've got a reasonable intuition for what the answer should be.
That sounds good. I'm curious about where things can go wrong. So you mentioned before that large language models have this big problem with hallucinations; they make stuff up. Are there any equivalent problems that these relational foundation models have?
Jure Leskovec: Good question.
So maybe there are a few ways to respond, right? The first thing is that this technology is very complementary to large language models. Large language models are amazing at human-like reasoning, but they are actually not good at predictions, or at making decisions based on predictions.
So machine learning as such, and predictive modeling as such, is here to stay and is very important. With Kumo, it's getting even easier, even cheaper, and much more accessible, right? More accurate and so on. So now, what are some caveats? One thing, for example, that is important for these models is that many times in business data you basically have time streams of events.
And it is very important that your timestamps are accurate. A classical problem in machine learning, the old way, when you do feature engineering, is that your feature is stale, right? Maybe you update the feature once a month, on the first day of the month, but then you are predicting something on the 25th of the month.
Now the feature is three weeks old. It's out of date, it's not informative anymore, and it'll work terribly, right? So this out-of-date problem, or staleness, is a problem in classical machine learning. Or you update the feature on the last day of the month, but you're actually making a prediction on the 25th,
so then you're predicting the future from the future, which is a much easier task. So maybe one thing to say is: in this relational foundation model approach, you need accurate timestamps, because the model is learning from time, and the model is very aware of what the time of "now" is.
And because it sets what we call the anchor time, the time of now, it knows what is in the past, what it can look at, and what it cannot look at. Now, if your timestamps are wrong, the model is going to look into the future, not by its own fault, but because of the data. So that, for example, is one problem.
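A small pandas sketch of that anchor-time discipline: everything the model is allowed to see must be strictly before "now", and the label is computed only from the window after it. The column names here are assumptions for illustration, not a real Kumo interface.
```python
import pandas as pd

def split_at_anchor(transactions: pd.DataFrame, anchor_time: pd.Timestamp, horizon_days: int = 30):
    """Context = events strictly before the anchor time ('now');
    target window = events in [anchor, anchor + horizon). If timestamps are wrong,
    'future' rows silently leak into the context -- the failure mode described above."""
    ts = transactions["timestamp"]
    context = transactions[ts < anchor_time]
    target_window = transactions[(ts >= anchor_time) & (ts < anchor_time + pd.Timedelta(days=horizon_days))]
    return context, target_window

# Example: build a 30-day spend label per user from the target window only.
# context, future = split_at_anchor(tx, pd.Timestamp("2024-06-01"))
# label = future.groupby("user_id")["price"].sum()
```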
The way we are solving this is that the explainability capabilities of these models are really good, because they allow us to go all the way down to the raw data. So we can ask the model, what are you looking at to make this prediction? And the model can say, I'm looking at this table, I'm looking at this column.
And as a data scientist, it's important to understand your data. If the model says, hey, this column here is amazingly predictive for this label, then the data scientist can say, oh, something is wrong here, this shouldn't be so predictive; let's go and investigate any kind of information leakage.
So that, I would say, is one example of a caveat. The other example of a caveat is that these models really shine when your data is complex, right? If you have all your data crammed into one single table, then we already know how to do ML over a single table, so the delta won't necessarily be that large. But if you have data that's nicely spread across multiple tables, that is when this technology truly starts to shine.
Richie Cotton: Okay. That seems important to know, that timestamps are incredibly important in terms of data quality. Data quality is a huge thing in most areas of machine learning, but are there any other data quality issues that people need to be aware of? Is it different to traditional machine learning?
Jure Leskovec: Good question. I would say, because these models learn over the entire relational context, they are actually very robust to issues: errors, null values, missing values, and, let's say, shaky data. So the amount of robustness we see is actually better than in classical machine learning.
I think the other thing that happens when we do classical machine learning is that we spend so much time feature engineering that, through that process, we may discover quite a lot of other unintended things that we are then able to fix in terms of data quality issues. But I would say, in this approach, what is really important is to know that a user ID in one column is the same user ID in another table, in another column.
So having that relational information mapped out is the key. Data can be incomplete; that is not so much the issue. But having a good relational structure, proper primary-foreign key relations that are not corrupted and that are present, that is really what makes the approach shine.
Richie Cotton: Okay. So if you screw up the schema, if you've not got sensible primary keys or secondary keys or whatever, that's going to break the graph, and therefore that's going to reduce the model quality.
Jure Leskovec: Exactly. Then you basically break the flow of information across the tables, and if you break the connection from one table to another, then the information from that other table cannot flow and it's all for nothing.
Richie Cotton: And related to this, do you need to do data governance differently than if you were working on, I guess, traditional machine learning problems or analytics problems? Does it affect your approach to this?
Jure Leskovec: That's a great question. I would say it is a bit different, and it is different in the sense that you really need to understand the raw tables.
Rather than doing governance almost at the level of features, you now need to be doing governance at the level of raw data: having annotated the correspondences between tables, knowing which columns point to which other columns, and so on.
What is maybe interesting is that you don't need semantic information, right? The model doesn't care. This approach doesn't care about column names and column descriptions and things like that; you don't need that. All you need is to know the semantic types of the columns and to have reliable timestamps and primary-foreign key relations. That's about it. So if you can do data governance at that level, life is good.
Richie Cotton: Okay, alright. It sounds relatively straightforward. So one of the things I'm curious about is how you go about interpreting these models.
For example, something like linear regression or a decision tree is really easy to interpret, but the fancier you get with most model types, the harder it is to interpret what's happening underneath them. How do you go about interpreting any predictions made by this foundation model?
Jure Leskovec: There are, let's say two types of interpretability, right?
You can ask these kinds of global-level model interpretability or model explainability questions, and you can also ask about individual predictions. And the way we have innovated on these explainability capabilities, I would actually argue that these models are more explainable than the traditional feature-based ones, right?
Because in the feature-based ones, all you get is some stack ranking, some SHAP values or whatever, of those features, right? So if you missed an important signal, you will never know it exists; it'll never be part of your explanation. Instead, what we do is basically this kind of back-propagated gradients, where we can trace through the neurons of the model all the way down to raw columns and raw tables, and then we can ask the model, for this type of prediction,
what data are you looking at? What tables, what columns? And also for individual-level predictions, we can ask the same thing, and here we can trace this all the way down to individual rows and individual columns. So the idea is that the model tells you, this is what I'm looking at, right? It's very similar to computer vision.
For example, how do you explain a computer vision model that, let's say, is doing human detection, right? If the model says, I'm looking here, then it's doing the right thing. If the model is saying, oh, I'm looking just above your head because I'm expecting a light to be turned on on the ceiling, the model is doing something wrong, right?
These types of models are at the same level: they point down to the raw data that was used, that was influential in making that prediction. And from this you can really see, and then again, as a data scientist, you can ask yourself, are these actually the right signals?
Does this match my intuition? What if I remove this, can the model recover? And things like that, right? So again, I think the data scientist is very important. The knowledge of the data, the knowledge of the domain, and being able to model it is not going away. We can just do that now faster, easier, and better.
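As a loose analogue of the back-propagated-gradient explanations described above, here is a generic gradient-times-input attribution sketch in PyTorch. It scores how much each input column influenced the predictions of any differentiable model; Kumo's actual explainer traces attributions through the relational graph to whole tables and columns, which this flat-feature toy does not attempt.
```python
import torch

def column_attribution(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Gradient x input saliency: a simple 'what was the model looking at?' score
    per input column, averaged over the rows in x."""
    x = x.clone().detach().requires_grad_(True)
    model(x).sum().backward()                        # back-propagate from the predictions
    return (x.grad * x).abs().mean(dim=0).detach()   # one importance score per column

# Usage with any differentiable model mapping (n_rows, n_columns) -> predictions:
model = torch.nn.Sequential(torch.nn.Linear(6, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
scores = column_attribution(model, torch.randn(32, 6))
print(scores)  # higher = that column influenced the predictions more
```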
Richie Cotton: Okay, that sounds pretty good, that you do have some level of explanation. And I'm curious, do you have to do things like cross-validation on your models, just to check that they really are working?
Jure Leskovec: The foundation model is pre-trained, so you don't need to do cross validation.
You can give it some unlabeled examples and it'll give you back the predictions. If you now fine-tune or build custom models per task, then yeah, the training data set gets generated automatically, some part of it is used for model training, another part is for validation, and then there is a holdout.
So that is still fundamental, right? The goal of ML is to do out-of-sample generalization, and how do you check generalization? Through cross-validation. So it doesn't go away, is the short answer.
Richie Cotton: Okay, nice. No cross-validation needed for the pre-trained model; that sounds great. Okay. And at the risk of this all getting too nerdy, maybe we'll just bring it back: do you have any final advice for people who want to try out these models?
Jure Leskovec: Yeah, I would say, really, I think it's a very exciting time, right? At the beginning of the conversation I said it seems like AI has skipped machine learning, right? And people go, oh, what happens now? Is machine learning still needed? What will happen? And I think what we know now is that machine learning is not going away.
It's going to be even more important, especially with, let's say, agentic workflows, and with our lives and industries getting even more digitized and collecting even more data; being able to make business-critical decisions on that data is of vital importance. What people have been doing in the past is running some SQL queries that analyze the past, and then somebody tries to make a common-sense decision about how the future might look
based on the past. What we have seen is that the best way to make decisions is to make them via predictions, via forecasts, right? You want a forecast, and then based on that forecast, you make a decision. So I think, going forward, the demand for predictions, for forecasts, for decisions is only going to increase.
The second thing I think is going to happen is that the power of data, private data, will become even more important, right? Because if everyone is using the same pre-trained model, then the only way we make differences among ourselves is by the data we have. So I think that's another important thing: how do we make decisions on our private data?
Not on the data that the LLM has read on the internet, but on something that is truly unique to me, to my organization, to my business, to my customers, because every organization is different. So I think the importance of data is only going to grow. And then I think the third thing is that now, with this technology that we talked about, relational foundation models and relational deep learning, as well as, let's say, Kumo as an implementation of some of these paradigms,
I think the power of AI is finally coming to machine learning, predictive modeling, and predictive decision making as well. And it's truly exciting, and I hope we embrace this new world soon and give data scientists the impact that they deserve, right?
No data scientist, I would say, dreams, oh, I wish I could do more data cleaning, or wakes up in the morning and says, oh, let's do some data cleaning today. Right? They don't wake up like that. They wake up and they say, I want to have some business impact.
I want to solve some hard problems, I want to build some cool models. With this technology, you can focus on building models and starting to have business impact, rather than saying, oh, I have to go do feature engineering.
Richie Cotton: It does sound pretty amazing, the idea that you can just build models a lot faster.
Just, yeah, make more cool models. It's a bit of a machine learning dream. Nice. To wrap up, I always want recommendations for people to follow. So can you just tell me whose work you are most interested in at the moment?
Jure Leskovec: I would say I'm very excited about the work that is generally happening in the field of graph learning.
Organizations that have been at the forefront include people at DeepMind, as well as other organizations. There is a very cool Learning on Graphs conference that I think is very exciting. The PyG community is very vibrant, and there are a lot of super cool contributions, and I would say a lot of support from NVIDIA and other organizations.
So that's another place that I follow, because I'm truly excited about this new generation of AI, where we have this kind of human-like decision intelligence, and then we have this private enterprise data predictive intelligence, foundation models, and how the two can join hands and really have a positive impact on humanity going forward.
Richie Cotton: Wonderful. Yeah, lots of exciting things happening, and some great recommendations there. Alright, wonderful. Thank you for your time, Jure. Great having you on the show.
Jure Leskovec: Yeah, thank you Richie. I enjoyed it a lot.