
Data Science Trends from 2 Kaggle Grandmasters with Jean-Francois Puget, Distinguished Engineer at NVIDIA & Chris Deotte, Senior Data Scientist at NVIDIA

Richie, Jean-Francois, and Chris explore the role of AI agents in data science, the impact of GPU acceleration, the evolution of competitive data science techniques, model evaluation, communication skills, the future of data science roles, and much more.
Feb 24, 2025

Guest
Jean-Francois Puget PhD

Jean-Francois got a PhD in machine learning in the previous millennium. Given the AI winter at the time, he worked for a while on mathematical optimization software, as dev manager for CPLEX in a startup. He came back to machine learning when IBM acquired the startup. He then discovered Kaggle and became one of the best Kagglers in the world. He joined NVIDIA 3 years ago and leads the NVIDIA Kaggle Grandmaster team there. He's speaking at the NVIDIA GTC Conference.


Guest
Chris Deotte

Chris Deotte is a senior data scientist at NVIDIA. Chris has a Ph.D. in computational science and mathematics with a thesis on optimizing parallel processing. Chris is a Kaggle 4x grandmaster.


Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.

Key Quotes

I've worked at NVIDIA six years, and I think one of the biggest things I've seen in these years is accelerating all these algorithms on GPU. The significance of that, besides just the obvious speed, is that it can allow us to build new hybrid models.

I almost see myself being, you know, a manager of a little office, and I'm sending out my agents on little jobs to do. And then they come back and they give me some code or some EDA or this or that. And that's kind of what I see happening.

Key Takeaways

1. Embrace the integration of AI assistants in your data science workflow to enhance productivity and streamline coding tasks, as these tools are becoming increasingly reliable and efficient.

2. Utilize GPU acceleration to significantly speed up data processing and model training, enabling faster experimentation and iteration in your data projects.

3. Consider the use of hybrid models that combine deep learning with traditional machine learning techniques, as GPU advancements make these approaches more feasible.

Links From The Show

NVIDIA

Transcript

Richie Cotton: Hi there, Chris, and welcome back, Jean-Francois. Welcome to the show.

Jean-Francois: Thank you.

Chris: Thank you. Thanks for having us, Richie.

Richie Cotton: Brilliant. So to begin with, let's start with the big one. What do you think is going to have the biggest impact on data scientists this year? Chris, do you want to go first?

Chris: So I think we've got an exciting year ahead of us, seeing all these chatbots and copilots and AI assistants. I think this is going to continue, and we're going to see more use cases of these LLMs assisting the data science workflow.

Richie Cotton: Okay, yeah, that certainly seems to be a very big topic at the moment, and I'd love to get into it in more depth. But that seems like a solid choice for what's going to happen this year. Jean-Francois, do you have any predictions on what's going to have the biggest impact?

Jean-Francois: I kind of agree. I believe there is a second impact, which is happening already: moving away, in many cases, from supervised learning, where you aggregate a data set, have humans create the answer to the question you want for every row, and then train a model. Few-shot learning is becoming a thing. It lowers the need for gathering data: you only need a few examples. Especially for text classification. Say you want a spam classifier. Before, you would have to collect thousands of spam examples and train a model to detect them. Now you can say to a big LLM:

here are a few spam examples, now tell me if that new one is spam or not. For this kind of text classification, it works quite well. So it's minor compared to what Chris says, but replacing supervised learning with few-shot learning works as well.
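
As a concrete illustration of that few-shot idea, here is a minimal sketch, assuming an OpenAI-style chat API; the client library, model name, and example messages are all illustrative assumptions, not anything named in the episode.

```python
# Few-shot spam classification with an LLM instead of a supervised pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A handful of labelled examples replaces a training set of thousands of rows.
FEW_SHOT_EXAMPLES = [
    ("WIN a FREE iPhone!!! Click here now!!!", "spam"),
    ("Hi, are we still meeting at 3pm tomorrow?", "not spam"),
    ("URGENT: your account is suspended, verify your password", "spam"),
]

def classify(message: str) -> str:
    # Build the prompt from the few labelled examples.
    shots = "\n".join(f"Message: {m}\nLabel: {l}" for m, l in FEW_SHOT_EXAMPLES)
    prompt = (
        "Classify each message as 'spam' or 'not spam'.\n\n"
        f"{shots}\n\nMessage: {message}\nLabel:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(classify("Congratulations, you have been selected for a cash prize"))
```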

Richie Cotton: Okay, yeah, certainly for data scientists that's kind of huge, because supervised learning has been a mainstay for decades and decades. So the fact that you need less data now because the models are more powerful, that's a very interesting twist. On that note, Chris, your example was around generative AI, and it does feel like in the last year or two, generative AI stole a lot of the thunder from data science.

Like, it's just captured all the hype. So, do you think innovations are still happening in data science? Jean-Francois, do you want to go first this time?

Jean-Francois: So, no surprise from an NVIDIA employee: one revolution is bringing GPU acceleration everywhere. In particular, in data science, in tabular data, there are two or three packages that everyone uses: pandas, polars, and scikit-learn. And we shipped, over the last year, GPU accelerators for pandas and polars.

So you can take pandas code or Polars code and, with one line, move it to GPU. I think this is a game changer, because with your code running faster, you can experiment faster. For scikit-learn, we have an equivalent called cuML. And overall, all this GPU-accelerated machine learning and data science is called RAPIDS.

So, easing the move from CPU to GPU code, as I said, with a one-liner, will, I believe, make more impact than what we did before. That's definitely an innovation. On the algorithms side, another innovation, and it requires GPU, is the use of deep learning models in data science, on tabular data.

Every other week, there is someone claiming a new SOTA model for forecasting or tabular data. Maybe they are not SOTA, but they are interesting.

Richie Cotton: Certainly the first one, about making it easier to use GPUs. I'm sure you've got a slight vested interest in talking about that, but I definitely want to get into it more. We've known for a long time that GPUs are much better at doing the computations for a lot of data science.

It's been a bit of a hassle in terms of getting them set up, so making them easy to use does seem like a pretty important innovation. And yeah, I guess LLMs, or at least neural networks, are sort of eating every problem. So it's interesting that they're even invading the tabular data space as well.

All right. Chris, are there any other innovations that you wanted to talk about?

Chris: I'm observing the same ones that JFP said, but I've worked at NVIDIA six years, and I think one of the biggest things I've seen in these years is accelerating all these algorithms on GPU. The significance of that, besides just the obvious speed, is that it can allow us to build new hybrid models.

So I've actually recently, in some competitions, seen people fusing deep learning models together with a machine learning model. An example is maybe you're doing an image regression problem, but you use an image model as a backbone to extract embeddings, and then you regress the embeddings using a support vector machine.

Previously, a support vector machine running on CPU wasn't fast enough to keep up with the iterative cycle of deep learning. Because of the speed now, people can also incorporate KNN, you know, KNN to compare embeddings, to find similar images or similar texts. So you're seeing a lot more uses of machine learning.

It's exciting that there are these new ways to approach problems.
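
To make that hybrid pattern concrete, here is a minimal sketch, assuming the timm library and a resnet50 backbone (both illustrative choices, not named in the episode); scikit-learn's SVR stands in for the support vector machine, and cuML's SVR would be a GPU drop-in.

```python
# Hybrid model: pretrained image backbone extracts embeddings, a classical
# SVR does the regression on top of them.
import numpy as np
import timm
import torch
from sklearn.svm import SVR  # cuML's SVR is a drop-in replacement on GPU

# num_classes=0 makes the model output pooled embeddings instead of logits.
backbone = timm.create_model("resnet50", pretrained=True, num_classes=0)
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> np.ndarray:
    # images: (N, 3, 224, 224) tensor, already normalised
    return backbone(images).cpu().numpy()

# Hypothetical data: 100 training images with continuous targets.
train_images = torch.randn(100, 3, 224, 224)
train_targets = np.random.rand(100)

svr = SVR()
svr.fit(embed(train_images), train_targets)  # regress the embeddings
preds = svr.predict(embed(torch.randn(10, 3, 224, 224)))
```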

Richie Cotton: All right. Yeah, I like that, mixing models together. Okay.

Chris: I can even add a second thing. You're also seeing more automation, in feature engineering or AutoML. Once again, because you're speeding up data frame processing on GPU, instead of selectively picking a few features to try out, with the increased speed you can just have a for loop.

It just iteratively tries out, literally overnight, hundreds of thousands of features, and then it evaluates them and finds which ones work. So the speed is allowing us to do faster experimentation and just automate some of these search procedures.

Richie Cotton: Okay, that's really interesting, talking about feature engineering, just being able to, I guess, brute force it now by trying lots of different combinations. One thing we've been teaching for years on DataCamp is that when you're doing feature engineering, you need to think about what the problem is you're solving, but it sounds like that's just going away then.

You just need to try more things. Is that about right?

Chris: Intuition does still have value. I mean, I use a lot of intuition in my work when I pick features, and a lot of times that's how you improve the accuracy of a tabular data model. It comes down to finding what we sometimes call the magical feature. But more and more I am using automation.

I recently took first place in a tabular data Playground competition on Kaggle. And what I had done was, instead of thinking about the columns, which ones I should combine and then target encode (these were categorical columns), I just ran the code overnight on GPU, on an A100.

It was so fast, it tried out tens of thousands of combinations. I woke up in the morning, it gave me a list of 100, and they had boosted the model accuracy a lot and just jumped me to first place. So I used both. Intuition counts for a lot, and that's how I found a lot previously, but you also have to use the brute force method to search things that you wouldn't have thought of.
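
A schematic of that overnight search might look like the following; this is a sketch under assumptions (a hypothetical train.csv with categorical columns and a binary target), not Chris's actual code, and on GPU (cuDF plus GPU XGBoost) the same loop runs far faster.

```python
# Brute-force search over pairs of categorical columns: combine each pair,
# score the candidate feature set with cross-validation, keep the winners.
from itertools import combinations

import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

df = pd.read_csv("train.csv")           # hypothetical dataset
target = df.pop("target")
cat_cols = [c for c in df.columns if df[c].dtype == "object"]

def score(features: pd.DataFrame) -> float:
    model = XGBClassifier(n_estimators=200, enable_categorical=True)
    return cross_val_score(model, features.astype("category"), target,
                           cv=5, scoring="roc_auc").mean()

baseline = score(df[cat_cols])
results = []
for a, b in combinations(cat_cols, 2):
    candidate = df[cat_cols].copy()
    # New categorical feature: the pairwise combination of two columns.
    candidate[f"{a}_{b}"] = df[a].astype(str) + "_" + df[b].astype(str)
    results.append((f"{a}_{b}", score(candidate) - baseline))

# Keep the combinations that improved the CV score the most.
for name, gain in sorted(results, key=lambda t: -t[1])[:100]:
    print(name, f"{gain:+.5f}")
```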

Richie Cotton: So, you mentioned the brute force method for feature engineering. Jean-Francois, you were talking about combining different models together as well, like combining LLMs with more traditional machine learning, and you're both talking about this sort of stuff.

So, is it just an increase in computational power that's allowing this? You make use of the GPUs and you can crunch more things together. What's happening here?

Jean-Francois: Using deep learning on something other than GPU isn't a great idea in general, so definitely, using GPU enables using transformers and LLMs. For time series, for instance, there are a lot of transformer-style models that appear to be quite powerful. So, that's one. On traditional models, I just concur with Chris. I see machine learning and data science as an experimental science.

You have a baseline model you want to improve, so you design an experiment, like adding some features, or different parameters, or ensembling with another model. You run your experiment. You evaluate the results, so you need to have a reliable way to evaluate your model, and that's what we learned on Kaggle, I would say. If your experiment is successful, your new model becomes the new baseline. If not, you look at what can explain why it's worse, you learn from it, and you design the next experiment. That's how I work. The faster you iterate, the better, because you try more things. That's why GPU acceleration is useful, even if your problem may not be that big and training a model takes only half an hour.

Well, if you replace the half hour by a minute, in one hour you can try 60 things instead of two. So even if it's not computationally demanding, if you accelerate, you will benefit. It's really about anything that can make better use of the data scientist, I would say. That's the scarce resource.

That's the thing you can't compress, the time you have. So the more you do in a given timeframe, the better.

Richie Cotton: So I'd like to go back to something you mentioned earlier on, Chris: AI agents being one of the big trends of this year. What I've seen so far is that a lot of these things are not really data science focused; they're all focused on particular industries or just, you know, more general business use cases.

Do you think AI agents are going to take off for data science this year?

Chris: I think so. I mean, I would say that they already are helping. In my daily workflow, I'm actually constantly consulting ChatGPT as a copilot to help me with my coding. You know, if I have a function to write, which is just a basic function, I might just say, hey, write a function to sort a list of numbers, and it spits it out.

Or if I have an error in my code, I'll just copy and paste the error. So it's already, in a lot of ways, an assistant to me. I'm working together with it, and it's saving me so much time. So that's already happening. And I've just noticed that they're getting better and better.

I've noticed that when they're providing code, they're making fewer and fewer mistakes. I think now it sort of has some chain of thought, maybe, or it's checking its work. It can even have access to tools to verify. So it's continually getting better, and therefore I'm relying on it more, and it's being more useful.

And I see this trend just continuing to grow to the point where I almost see myself being, you know, a manager of a little office, and I'm sending out my agents on little jobs to do, and then they come back and they give me some code or some EDA or this or that. And that's kind of what I see happening.

Richie Cotton: I do like that idea, and certainly the idea of a copilot, something to help you write code, is very important already. Do you see yourself being taken out of the loop, where you can just leave the AI to go and solve some data science task, and then you come back and maybe just review it at the end?

Jean-Francois: So my take is quite optimistic. I'm old enough to have seen, well, I never used punch cards, but when I started computer science we were using compiled languages mostly, and before that people were using assembly. Then people moved to interpreted languages, and then to frameworks. Each time there is a new level of abstraction, and I see the use of LLMs as just yet another level of abstraction. You still need to tell the LLM what you want. So it's just programming at a higher level. Designing the software system still stays. It's just that instead of having to write lots of code, you automate the writing, but you still have to design the system. And the other bit, and it's more software engineering than data science, is debugging and maintenance. I see people amazed by all this app code being generated, sometimes including some machine learning and deep learning models, automatically. Sure, but next, if there is a bug, what do you do? I'm not convinced yet that coding assistants are good at debugging and maintenance. People who think they no longer need to understand coding, well, I don't want to be there when their application has a bug.

Richie Cotton: It's kind of interesting. We often talk about the idea that you want AI to automate the boring things that humans don't want to do. But if you're getting AI to write the code in the first place, which is kind of the fun bit, and then the human has to debug it, that's just a worse situation for everyone, I think.

Jean-Francois: It automates the most fun part because it's the easiest. At the same time, it's normal: automation starts where it's easier.

Richie Cotton: Okay. So, in terms of automation, are there any things that you think really shouldn't be automated by AI, that always ought to be done by a human?

Jean-Francois: Model evaluation, for me. Defining how a model should be evaluated. That's tricky. And I said it once already, but I believe what distinguishes people like Chris and me, Kaggle competition grandmasters, from most data scientists is that we evaluate models reliably. We have to, to win competitions. If we don't pick the right model, we don't win.

So we had to learn, and sometimes we are wrong and we drop in the ranking. So I don't think this should be automated. It's dangerous.

Richie Cotton: So, talk me through what's going through your head when you're evaluating a model. What's the human process that you think is important?

Chris: So I think the most important thing is just to do it, you know, without bias, without leaks. So you have a local validation. Ultimately, you want your model to make predictions on new unseen test data. That's what you're trying to do: it's going to be deployed in production somewhere and see new data.

So you need to set up a robust local validation, where you're training on one set of data and then validating on some unseen set of data. Typically, we use k-fold cross-validation. So I think it's just important to set it up correctly and make sure that there are no leaks: you're not using any information from your validation folds.

And then you compute that local validation score. If you've set things up correctly, so it mimics the unseen data the model is going to see in the future, then oftentimes you'll see a very strong correlation between the validation score you're computing and how it will perform on unseen data.

And after you have that set up, that's the only number you look at. You don't fudge things. You don't say, oh, let's just try it on the test data; oh, let's change that, it's helping the test score. You have your validation score, and you rely on it. Basically, you do an experiment.

If it increases that score, you keep it. I will point out there's always a level of variance in validation scores. If you change the seed, it might change plus or minus a little bit, so you have to kind of determine that. But once you do an experiment, if it shows a statistically significant improvement in validation score, you keep it, and if it doesn't, you don't keep it.

You follow that rule over and over, and the expression on Kaggle is: trust your CV. If you trust your CV, you will consistently do better and place better on the second, unseen leaderboard. People who are too concerned with watching their score on the first leaderboard, only keeping changes or doing tricks that increase that first leaderboard score, will see a huge shakeup and their score will drop. So trust your CV, use that local validation score, make decisions based on it, and then you'll do good.
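
In code, that local validation setup can be as small as the following sketch; the data and model here are placeholders, and the point is simply that the folds are fixed, so every experiment is scored on identical splits.

```python
# One k-fold local validation score, computed the same way for every
# experiment: the "trust your CV" workflow in miniature.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=2000, random_state=0)  # placeholder data

# Fixed folds: every experiment is scored on exactly the same splits,
# so scores are directly comparable across experiments.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

model = GradientBoostingClassifier()
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
# Keep a change only if the improvement clearly exceeds this fold-to-fold
# variance; otherwise it may just be noise from the seed.
```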

Richie Cotton: Are there any other skills that you think are important or becoming more important at the moment?

Jean-Francois: So, both Chris and I are fans of Kaggle, but we have to admit it does not exercise all the skills needed to be successful. It is probably the best way to learn how to build a model and evaluate it, no doubt. But there is what happens before, data curation at large: what do you train on? What is the target?

And what happens next, how you deploy, is important too. But I believe the most elusive skill is how you convince non machine learning people that your model is useful.

Richie Cotton: Yeah, the art of persuasion.

Jean-Francois: Yeah, because saying, oh, my AUC score is two points more than before, well, okay. If you don't know what an AUC score is, that tells you nothing. What I see is that people are impressed by numbers, so it's easy to create the wrong expectation by showing a number that is not validated properly.

I remember I had to argue, not at NVIDIA, but before, that no, the score on the training data is not relevant for measuring the model, all this kind of stuff. Another senior data scientist had better numbers than me; he was evaluating on the training data. And it was almost impossible to convince management.

So that's what consultants normally do well. But I believe data scientists have to learn how to speak in terms of business KPIs or whatever, not just machine learning.

Richie Cotton: That does seem incredibly important, that communication skill, not just with other technical people, but with people who have more of a business interest or a less technical focus. Actually, do you have any tips for that? Like, how did you learn to convince your manager of things?

Chris: I would say what's helpful is if your model can exhibit some explainability. I've worked on these projects and been up against just what JFP describes, where I have a model, I have a very good validation score, and I've worked with so many models that I know I can trust it.

If you set up the validation correctly, it's representative of the unseen test data, and then you can confidently say: this is how it's going to perform. It's going to do well, we should use it. But to a business person, it may look like a black box. Why does it make that prediction? What's it thinking? Is it going to make a mistake tomorrow? I have a natural trust; I know that it'll do what it says it does. But what helps in those situations is if your model can give some level of explainability.

Maybe it's making some forecasts. Why is it predicting increased sales at this one store but not this other store? If you can show SHAP values, or feature importances, or just show how things affect the model, anything that explains a little bit why it's doing what it's doing, that helps. Or if you show a picture of a decision tree, a manager can look at it and say: oh, I see, if the store has a drive-through window, we will sell more stuff at 9am.

Oh, that makes sense. That's why this store will sell more in the morning. So if you can explain why it's making some of its decisions, then they say: okay, I see, I understand that. Anything that helps their understanding builds their trust, and then they're more likely to use the model and rely on it.

Jean-Francois: I would add: sharing predictions on their data. I worked on predictive maintenance, predicting problems before they occur in some compute environment. So we just compared to reality: using actual data, playing it back with the model pretending it's new data, and comparing the predictions to the actuals. If the prediction quality is good, this is extremely convincing.

Richie Cotton: So in general, you need these better communication skills, and some of it is just about building trust by explaining stuff. It sounds like, in general, it's about providing more evidence for your argument. Okay, I like that.

And so, beyond the communication skills, I guess the flip side is: are there any skills that you think are becoming less important now?

Chris: Yeah, so I've been asked this question many times by my friends, you know, will my job be replaced, or which skills are more or less important? And I guess I have a simple picture in my mind. I think that data scientists and employees will all become more like managers, and we'll essentially be having a team of AIs.

So anything that incorporates some of the roles of managers. With regard to data science, I would think individual tasks, maybe manually making features, or EDA, or iteratively training models and finding hyperparameters, can easily be automated.

But the overall things, like the purpose of the project, the goal of the project, the choice of metric, how we deploy it, sort of what a project leader or a manager would do, I think those skills will be more important, because that's how I see the relationship between the worker and the AI. You can send out a whole bunch of roles and tasks, but then you have multiple agents: this one did EDA, did some of this, and then you have to synthesize it together into the bigger picture.

So in my view, you have to be more of a multitasker, a generalist, a manager, a project leader, and know how to put it together, assign roles, assign work. Those skills, I think, will be beneficial.

Richie Cotton: That's absolutely fascinating, the idea that you need to become a manager of your AI. I just imagine having a performance review meeting with your AI: oh yeah, good bot. Okay, that's kind of cool. And so, do you think the data scientist role is still going to be around in 10 years' time?

Jean-Francois: I believe so, but it will evolve, as Chris said. Look at what data scientists did 10 years ago: if you're still doing it now as you did 10 years ago, you need to update. Maybe small data problems will still require a lot of human intervention; when you have lots of data, people will just throw generic deep learning and agents at it. But in edge cases, maybe in medical and healthcare, people will resist automation quite a bit, so maybe in regulated industries. There are other areas, e-commerce probably, where it's already quite automated: recommender systems, et cetera. So there are probably many industries where, you know, data scientists won't be needed, but I still believe there are pockets where humans will still be quite important.

Richie Cotton: So, we talked before a little bit about how GPUs are making data scientists able to run their code faster and therefore be more productive. I know that NVIDIA is involved in the RAPIDS project for data science tooling. Do you want to tell me a bit more about what RAPIDS is and how it's changing this year?

Chris: Yeah, so RAPIDS is a whole suite of libraries, and each one does different things, the central theme being that they're all accelerated code. The ones I work with most often are cuDF and cuML. cuDF is similar to pandas; it's the same API as pandas, and it can handle all your data frame needs.

The API is similar, but everything happens on GPU. So all of your operations, if you have a big data frame with columns and you're calculating means and group-by aggregations, happen in the blink of an eye. That's cuDF. And recently they released some zero-code-change versions, cuDF pandas and cuDF Polars, which means if you already have your code written in pandas, instead of rewriting it in cuDF, you just turn on a flag at the beginning of your code, and then all of the pandas code will, behind the scenes, use cuDF and be accelerated on GPU.

That's cuDF; that's awesome. Another one I use often is cuML. That's all your machine learning models. The API is the same as scikit-learn, and it basically accelerates all your machine learning models: KNNs, support vector machines, linear regressions, dimensionality reduction like UMAP or t-SNE.

All of that stuff is accelerated on GPU, and once again, if you run the same model CPU versus GPU, you can see hundreds of times speedup. That's exciting because it's allowing me to use new models, as I mentioned earlier, that I didn't previously use in competitions or on projects because they were too slow to train.

I can combine them with deep learning, and I can also iterate experiments really fast, and that allows me to tune them and make them highly accurate.
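
The zero-code-change flag Chris mentions looks roughly like this in practice; a minimal sketch, assuming a RAPIDS installation and a supported NVIDIA GPU, with a toy DataFrame standing in for real data.

```python
# Enable the cudf.pandas accelerator before importing pandas. In a Jupyter
# notebook, the equivalent is the %load_ext cudf.pandas magic.
import cudf.pandas
cudf.pandas.install()   # from here on, pandas runs on GPU where possible

import pandas as pd

df = pd.DataFrame({"store": ["a", "b", "a", "b"], "sales": [10, 20, 30, 40]})
print(df.groupby("store")["sales"].mean())  # executed by cuDF on the GPU

# cuML mirrors the scikit-learn API in the same spirit, e.g. (illustrative):
# from cuml.svm import SVR            # instead of sklearn.svm.SVR
# from cuml.manifold import TSNE
# from cuml import UMAP
```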

Jean-Francois: There is one more, not developed as part of RAPIDS: XGBoost. NVIDIA developed the GPU version of XGBoost. XGBoost is what we call gradient boosting, an almost one-size-fits-all algorithm for tabular data. I say almost because linear models, SVMs, and linear regression are still useful alongside XGBoost. And now there are also CatBoost and LightGBM.

These are really also interesting to have on GPU.
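
For illustration, moving XGBoost training to the GPU is a parameter change; a minimal sketch assuming XGBoost 2.x, with placeholder data.

```python
# GPU-accelerated gradient boosting with XGBoost.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

model = XGBClassifier(
    tree_method="hist",   # histogram-based tree construction
    device="cuda",        # XGBoost 2.x syntax; older versions used
                          # tree_method="gpu_hist" instead
    n_estimators=500,
)
model.fit(X, y)
```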

Richie Cotton: Okay, so these are, I guess, some of the most popular Python packages for data science. You mentioned pandas, you mentioned scikit-learn and XGBoost, all incredibly popular things. I like the idea that anyone who's a Python data scientist already knows how to use these packages, and therefore it's just a question of setting a flag and then your code runs on GPU, which is nice.

I like learning new things, but I don't necessarily want to learn almost the same code again and have to tweak things if I'm switching to GPU. So, I guess related to this, you've both been doing Kaggle for a long time; you're both grandmasters. I'm curious as to how your process for competitive data science has changed recently.

Chris: That's a great question. So I've been competing on Kaggle now for six years, the last five with NVIDIA. And have things changed? I think the competition has gotten much more competitive, much more difficult to get gold medals and to win at the top. And I think that's because previously, if you just knew how to do something correctly, by the book, you could do well.

For instance, when LoRA parameter-efficient tuning of LLMs was first released, if you could just get a LoRA adapter, train an LLM, process the data, set up a correct validation, and get your training schedule right, if you could just pull it all off, you could typically jump into the top 50, top 20. Just knowing how to use the different models and machines and how to validate.

But nowadays that's just a prerequisite. In these comps, you have to know the basics, and then you have to add an innovative thing. More and more, I think, because the user base is increasing. There are now 20 million people on Kaggle, so pretty much everybody knows about it: researchers, practitioners at companies. So, you know, it's the best of the best. And I think what's changing is that you have to know more than just the basics.

Specifically, you're going to have to use all the advanced tricks that you learn over time on Kaggle, like maybe pseudo-labeling or stacking. And then you also have to add innovation: read the latest research, find a new trick, maybe modify an architecture or a training technique.

So it's really pushing you to do more and more, I think, each new year we continue to compete.

Richie Cotton: Okay. That's very cool that the baseline level of competitive ability in machine learning has been increasing. I'd love to hear more about some of these advanced tricks, as you mentioned, pseudo-labeling and stacking. Do you want to talk us through what those are and why you might care about them?

Jean-Francois: When I started Kaggle, stacking was a big thing. If you knew how to do it, you were getting gold medals. So stacking, the idea is, it's a way to ensemble models. Ensembling means you have several models and you combine their predictions somehow, and this works better than each model alone.

The simplest way is to average, or if it's a binary prediction, you vote: you take the most frequent answer if it's classification. Stacking is the idea that you train a new model on the output of your first set of models. So you train several models, and to find the best way to combine their results, you train another model; it could be XGBoost, or linear regression, or a deep learning model. And to do it correctly, you have to be careful, because the predictions of your models depend on the training data. If you train again on the output of a model, you are using the target of the training data in your training. So you have to be careful and use what we call out-of-fold predictions.

It's a bit complex, so there was a barrier to entry, but now it's common knowledge on Kaggle, so everybody does it. Another related technique is called target encoding, which is very powerful. Chris uses it; maybe he will say something after me. It's the same idea, using the target to create features, but you have to do it carefully.

Once you master it, you have yet another advantage over other people. So there are more and more tricks like this, and now they are understood by more and more people, so to distinguish yourself you need to find something new. There is also a shift in the nature of competitions. Nowadays, no surprise, a lot are LLM and GenAI stuff; before, there was a lot of image and image segmentation work. So the nature of competitions evolves as well, and we need to evolve.
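
The out-of-fold mechanics Jean-Francois is careful about look roughly like this; a minimal sketch with placeholder data and arbitrary base models, not a recipe from the episode.

```python
# Stacking with out-of-fold predictions: each base model predicts every
# training row from a fold it was NOT trained on, so the meta-model never
# sees predictions that leaked their own targets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, random_state=0)

base_models = [RandomForestClassifier(random_state=0), XGBClassifier()]

# Out-of-fold probability predictions from each base model become the
# features of the second-level (meta) model.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

meta = LogisticRegression().fit(oof, y)
print("meta-model coefficients:", meta.coef_)
```

For inference you would also refit the base models on the full training data; the out-of-fold step only ensures the meta-model is trained on honest predictions.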

Richie Cotton: Okay. That's very cool. It sounds like stacking gets quite complicated, because you've got a lot of different models and then you're using another model to see which of the original models was any good. But it sounds like you've got all these techniques to take you to that advanced level of model performance.

Are these just used for competitive data science, or can they be used more generally? Like, if you've got a business problem, would you ever make use of this sort of thing?

Chris: Yes and no. They clearly work, I mean, they clearly make models more accurate, but they obviously make the model more complicated. So the downside is you increase latency. Instead of inferring one model, you're inferring a bunch of models and combining them in some fashion. And often in production you have latency concerns: things need to be quick.

Also, in a competition, you want that fourth decimal place, right? You want an extra 0.00001, but maybe in your particular application that's not important. If you're dealing with something in healthcare, maybe it's very important to diagnose something very accurately.

But maybe in some other industry, you're building widgets, and it doesn't matter if predictions are a little bit off. So I think you have to look at your use case and balance how important it is for you to get a more accurate model, if it's only a tiny, tiny bit better, against, well, one, the complexity of maintaining the code.

You need employees to constantly train and maintain the code base, that's two, and then three, there's the latency of making the predictions at test time or inference time. You have to weigh all those factors. And I would say that I have used some of these, but never to the level I do on Kaggle.

I mean, I've had some Kaggle solutions I'm almost embarrassed to describe. So there's one kind of competition, a code competition, which means you submit your code and it has to run within constraints that include time and resources. There's another kind where you just upload the predictions, so you can do as much computation as you want locally and then upload your predictions in a CSV file.

In that second type of competition, I have literally sometimes ensembled hundreds or even thousands of models, because you can. And it's not that I'm staying up all night, but you work for the whole month, and every day you might add a new model and save the model weights.

Every day you're making new models, so every day you're logging another dozen hours of compute. And then on the final day, you average everything you did for the last month, and the result is the average of, I kid you not, thousands of models. And that's given me the edge to win.

But you would obviously never, ever use a model in production that literally takes a month of training time, right? So you see the trade-off: accuracy versus the other resources you're managing.

Richie Cotton: Yeah, absolutely. If you're running something in production, you don't want it to take a month to give you a response about what's happening. But that's very interesting, that there is this difference in terms of exactly how precise or accurate you need your predictions to be.

Some of those techniques are still going to be useful, maybe just to a lesser degree, in a business context.

Chris: I'll add one quick comment. There are some techniques, like pseudo-labeling, which is an advanced technique used during the training process, where afterwards you still only have one model. With ensembling and stacking, you might be making lots of models and making things more complicated for inference. But there are other techniques, we could list dozens more, which just help you make a more accurate model.

At the end of the day, you still have one model. So these you definitely use, why not, right? So I will mention there are a lot of other techniques you learn on Kaggle which help model accuracy and don't make the end result more complicated or increase inference time.

I don't want to make it sound like you're learning all this weird stuff that you never use. A lot of it you use, but for some things you do have to balance the tradeoff.
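
As a rough illustration of the pseudo-labeling idea Chris refers to, here is a minimal sketch on synthetic data; the confidence threshold and models are arbitrary choices, and at the end there is still a single model to deploy.

```python
# Pseudo-labeling: train on labelled data, label the unlabelled data, keep
# only the confident predictions, and retrain on the enlarged set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=3000, random_state=0)
X_train, y_train, X_unlabelled = X[:1000], y[:1000], X[1000:]

model = GradientBoostingClassifier().fit(X_train, y_train)

# Keep only the unlabelled rows the model is confident about (>95%).
proba = model.predict_proba(X_unlabelled)
confident = proba.max(axis=1) > 0.95
X_pseudo = X_unlabelled[confident]
y_pseudo = proba[confident].argmax(axis=1)

# Retrain on the union of real and pseudo-labelled data: one final model.
model = GradientBoostingClassifier().fit(
    np.vstack([X_train, X_pseudo]), np.concatenate([y_train, y_pseudo])
)
```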

Jean-Francois: I would add one that is quite recent: test-time training. For instance, in forecasting competitions, you submit code, but it's evaluated every month, and every month you have one more month of data. So in your inference code, you can update the model on the new month. There was also a competition, the ARC Prize, the ARC-AGI prize, where I participated, and the key was to have a model that could be trained on the test data in addition to pre-training.

So, as Chris said, we, the community, are pushing the envelope, and it's harder and harder.
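
A schematic of that forecasting-style setup might look like the following sketch; the file name, columns, and model are hypothetical, and the point is simply that the model is refit on the expanding window inside the inference code, rather than being frozen at submission time.

```python
# Test-time training for forecasting: each month, refit on all data seen so
# far before predicting the new month.
import pandas as pd
from xgboost import XGBRegressor

history = pd.read_csv("sales.csv")       # hypothetical monthly data
# Assumes numeric feature columns plus "month" and "target" columns.
feature_cols = [c for c in history.columns if c not in ("month", "target")]

for month in sorted(history["month"].unique())[1:]:
    past = history[history["month"] < month]       # everything seen so far
    current = history[history["month"] == month]

    # Refit on the expanding window: the model is "trained at test time"
    # each period instead of staying fixed after deployment.
    model = XGBRegressor(n_estimators=300)
    model.fit(past[feature_cols], past["target"])
    preds = model.predict(current[feature_cols])
```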

Richie Cotton: I do like the idea that there's still progress being made in what seems like a mature area; people have been doing these competitions for a while now, but there are still new things going on. So it sounds like, for listeners who are interested in learning some new techniques to get to advanced machine learning, pseudo-labeling, test-time training, and model stacking are things people should look into. Are there any other things I've missed there?

Chris: Yeah, I would add target encoding. This is a technique used to generate features, you know, make new columns in the tabular dataset, for the purpose of making a model more accurate. That is probably the most powerful and effective technique: target encoding.

And I'll even say a little plug: we're actually teaching that at the upcoming GTC conference NVIDIA is hosting in March. A bunch of KGMoN, my co-workers, and some other NVIDIANs are giving a workshop, and we're teaching exactly how to do this target encoding technique and some other techniques.

We'll even show how you incorporate it into a model and demonstrate how it makes models more accurate. It's a hands-on workshop; you follow along with the code. We invite anybody who wants to join. I think you can join it both virtually or in person in San Jose, California.
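
For a taste of the technique before the workshop, here is a minimal sketch of out-of-fold target encoding on a toy DataFrame; the column names are hypothetical, and the fold loop is the careful part Jean-Francois warned about, since it keeps each row's own label out of its encoded value.

```python
# Out-of-fold target encoding: replace a category with the target mean
# computed from OTHER folds, so the new feature never leaks a row's label.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": ["a", "b", "a", "c", "b", "a", "c", "b"],   # toy categorical
    "target": [1, 0, 1, 0, 1, 0, 0, 1],
})

df["city_te"] = np.nan
for train_idx, val_idx in KFold(n_splits=4, shuffle=True,
                                random_state=0).split(df):
    # Per-category target means computed only from the other folds.
    fold_means = df.iloc[train_idx].groupby("city")["target"].mean()
    df.loc[df.index[val_idx], "city_te"] = (
        df.iloc[val_idx]["city"].map(fold_means).to_numpy()
    )

# Categories unseen in a fold fall back to the global mean.
df["city_te"] = df["city_te"].fillna(df["target"].mean())
print(df)
```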

Richie Cotton: Yeah, that sounds like a lot of fun. I have to say, the last two years, the GTC conference has clashed with DataCamp's RADAR conference. This time we've made sure to have them on different weeks so people can attend both; RADAR is the week after GTC this year. Brilliant. Okay, so, I'm curious now as to whether there are any things that you think have just recently become possible, or are just becoming possible, in data science that previously were very difficult.

Jean-Francois: One thing very fresh is reinforcement learning becoming cost-effective. The DeepSeek R1 model that just came out is a good example of a technique that was published by the same DeepSeek team a year ago. The model itself is interesting, but the recipe has been used by many people to improve models in all sorts of situations.

I have yet to see it applied to tabular data, but maybe this will happen. So we see reinforcement learning applied effectively, for the first time, in a number of places. And that's really recent; it's like last week.

Richie Cotton: That's interesting, because I always feel like reinforcement learning is the unloved cousin of supervised learning and unsupervised learning. It's very, very useful, but it's the third branch of machine learning that gets a lot less attention.

So it's interesting that DeepSeek is bringing it into a bit more of the public eye. All right. And so, just to wrap up, what are you most excited about in the world of data science?

Chris: I'm really enjoying all of these LLMs and chatbots and stuff, and I'm excited to just continually see them getting better. In particular, I'm enjoying multimodal. Nowadays, let's say you're asking ChatGPT, or some other model, a question: instead of just typing in text, you can upload images, or upload a PDF, or upload documents, and it can ingest those documents. Furthermore, it has the ability to use tools to process things. And when it gives its answer, it doesn't need to be just text; with generative AI, it can make images or plots or graphs.

And obviously you can incorporate voice, you know, audio input or audio output. So I'm really excited to see all these different modalities on the inputs and the outputs; it's just allowing the models to do much more. And I guess it's so exciting for me because the reason I'm in data science is that, as a small kid, I've always seen so many patterns when I look around the world, right?

As a small kid, even before I could talk, you'd see a lot of people entering a room and I'd be like, oh, I know those people are going to leave the room. You just sense things, right? The body language and all that. Life is all about recognizing patterns; that's what it's all about.

And it's exciting. I just want to see how powerful these AI models will get. Will they be able to watch my face and say, oh, Chris is going to get angry in the next 10 minutes, because they saw a twinkle in my eye? I'm really looking forward to seeing how many patterns they will see in the visual world, in the audio world, in this world.

And I think that they're going to teach humans a lot, because they're going to think and see and process patterns differently than we do. And I'm really looking forward to seeing the insights, the connections, everything that they share with us.

Richie Cotton: That sounds very cool, yeah. Just being able to learn a bit more about yourself by looking at it through a different lens than a human one. And Jean-Francois, what are you excited about?

Jean-Francois: Coming back to your question of whether we will need data scientists anymore: when Deep Blue won at chess against the best chess player, Kasparov, which was quite a long time ago, people could think, okay, chess is done. No more chess. But what happened is the opposite: professional players started using computers, so it was AI, and it still is AI, to train, and human players became much stronger than before.

And professional chess competitions are more interesting, so it's the opposite. And I expect the same to happen in data science and in software engineering and so on. The quality of the output will improve, and we will still have humans in the loop, I'm really convinced. So right now there is a focus on coding and not as much focus on data science, but I'm looking forward to seeing data science assistants coming. And maybe coming from us, who knows.

Richie Cotton: Okay. Yeah, certainly the idea of having assistants to help you out with the job, everyone's a manager now, that does seem like a promising future, at least if you like managing things.

Jean-Francois: A manager? I'm, well...

Richie Cotton: You just need to make sure the AI doesn't have the opportunity to complain when you're harshly managing it. Okay, cool. All right, final question: how do I get my hands on one of those fancy new 50-series GPUs you've made?

Chris: I wish.

Jean-Francois: asked me the same question.

Richie Cotton: Nice. All right, cool. Yeah, thank you so much for your time, Chris and Jean-Francois.

Jean-Francois: Thank you.

Chris: Awesome. Thanks a lot, Richie.
