Becoming a Kaggle Grandmaster
Jean-Francois got a PhD in machine learning in the previous millennium. Given the AI winter at the time, he worked for a while on mathematical optimization software as dev manager for CPLEX in a startup. He came back to machine learning when IBM acquired the startup. Since then he discovered Kaggle and became one of the best Kagglers in the world. He joined NVIDIA 3 years ago and leads the NVIDIA Kaggle Grandmaster team there. He's speaking at the NVIDIA GTC Conference.
Adel is a Data Science educator, speaker, and Evangelist at DataCamp where he has released various courses and live training on data analysis, machine learning, and data engineering. He is passionate about spreading data skills and data literacy throughout organizations and the intersection of technology and society. He has an MSc in Data Science and Business Analytics. In his free time, you can find him hanging out with his cat Louis.
I strongly believe ML is science. It's an experimental science, like parts of physics. So I approach a Kaggle competition with a scientific approach, and everything is an experiment. So for instance, if I have an idea for a new data processing process, a new feature or new model architecture, I must have a baseline, something I know I can trust. From there I can run an experiment with the new data processing or new model architecture change.
I run my experiment, and then I look at the results. Is it better or worse? Sometimes I dig a bit in the output to figure out where it is better, where it is worse, and from there I move. I accept the change and it becomes the new baseline etc.
For this you need to have what we call a good cross validation, so the bread and butter of practical ML is cross validation. The point is, you split your training data, you keep some of it to validate your model, you train on the rest, and when it is trained, you predict on the validation, and you compare with the ground truth, with the target.
You should never evaluate your model on the training data, that’s one of the most common errors I see. It’s surprisingly common.
The real point is, evaluate your assumptions, evaluate your modifications. You have to be consistent with running experiments.
Someone I know did a customer churn problem for a subscription company. The model was predicting which customers were most likely to not renew their subscription. So what did they do? They said, okay, let's run the model on the customer base, and we'll have the support team call the people most at risk to propose them a rebate. The problem is many calls went like this:
“Hello Customer. I'm from company X”
“Great, I wanted to cancel my subscription.”
So in fact, they accelerated the subscription churn because they targeted the right people, but not with the right answer. So this is an extreme, but it really highlights why focussing on how you use your model is as important as getting accurate results. Assume your model is good. How do you use it effectively to improve the business?
Approach machine learning problems with scientific rigor, understand your baselines and ground truths, and test your hypotheses one by one to clearly understand what effect each experiment has on your result.
Kaggle competitions are useful for approaching problems in a way that might be different from how you would approach them in a work setting. That freedom breeds innovation.
Always keep the business problem top of mind when creating and iterating on your machine learning work. Ask yourself, “If my predictions are 100% accurate, what effect will they have on the business problem?”
Welcome to Data Framed. This is Richie. Today's guest is Jean-Francois Puget, PhD, a distinguished engineer at NVIDIA. In this episode, we've got two topics to cover. That means we'll be talking about doing data science with the NVIDIA stack, meaning computing with GPUs, but we'll also get into the software that acc.
Jean-Francois' claim to fame is that he's been in the top 10 of the Kaggle machine learning competition leaderboard for the last few years. So we'll delve into the world of competitive machine learning to see how to become a Grandmaster.
Hi, Jean Francois. Thank you for joining me today. Just to begin with, can you tell the audience a little bit about what you do and what it means to be a distinguished engineer?
People believe that to get more influence, better compensation, whatever, you need to move to management. But some tech companies, including NVIDIA, and it was true at IBM, my previous employer, let people grow as individual contributors.
So distinguished engineer means I'm a good individual contributor. I manage a small team of individual contributors, but it does not take much of my time. So what I do is mostly build machine learning models, be it as part of machine learning competitions on Kaggle, for instance, to showcase the NVIDIA stack and maybe NVIDIA as a brand, but also for internal projects or NVIDIA partners.
But my job is to build good, possibly good machine learning models.
And can you give some examples of different projects...
So for instance, on Kaggle, the last few competitions I did. I can discuss Kaggle openly because it's public. As for the internal projects, I'm working on some very exciting ones, but people will have to come to GTC soon to learn about those projects. So a recent machine learning competition, for instance, was to predict some protein property from the protein sequence.
And for this, we used models like AlphaFold and other models that were quite hyped recently. So it's a breakthrough in computational biology. Previously it was natural language processing, not the GPT kind, it was more text classification on a specific topic, for medical examinations. So it can vary a lot.
I also worked on diagnosing from medical images, from radiology images or microscopic images. It could also be time series forecasting, sales forecasting, or for instance there is a competition where we have to predict basically the number of business websites or students by US county.
So that's time series forecasting. So you see, it varies a lot. It's cross-industry. So the only common piece, and that's a bit surprising, is that the same mathematical technology, machine learning, can be used across a lot of use cases and industries.
Okay. So it seems like you're doing a pretty broad range of machine learning projects, just all over the place in terms of the different technologies and different industries that you're working with. So I'd like to get into a bit of depth about that top rank in competitive machine learning on, I believe, Kaggle. So, can you tell me how you got started using Kaggle?
I started my professional life when I was a student. I was fascinated by AI, in particular machine learning. So I did a PhD in machine learning. It was a long time ago, so before the deep learning wave, et cetera. I was always interested in it.
After my PhD, I moved to a startup doing something else, mathematical optimization and like in 2008 or so, IBM acquired the company I was at, and they noticed I had a background in machine learning, wanted to invest in AI and data science.
So I was asked to do more than mathematical optimization. And I was looking for a way to refresh my knowledge of machine learning. So I started rereading papers, academic papers, but especially at that time, it's less true now, academia was a bit remote from practical use. So I looked and I found Kaggle, a site where people could compete and people were using whatever tools they could to get the best possible result. And there was no preconception. As long as something worked, it was used. So I started reading Kaggle top solutions and using them for my job at IBM, and looking at what tools people were using.
So I saw the emergence of Keras, for instance, and of XGBoost, which is now very popular, but it started on Kaggle. I witnessed the deep learning frenzy there, so it was useful for my job. But after a while, I said maybe I should try myself. So I remember entering my first competition. I said, you will see what you will see.
Guys, I have a PhD in machine learning. I will crush you all. So I was doing quite well. And then came the results on the hidden test set, the private dataset on Kaggle, and I dropped from top 100 to like 2000. So I said, okay, I need to learn. My knowledge is not really practical. So I started learning.
I enjoyed it, and after a while I became one of the top 10 on Kaggle and a Kaggle Grandmaster. So I keep doing it even today.
That's a pretty impressive achievement being top 10 in the world out of, I'm not sure how many, is it? Hundreds of thousands of people who participate in these
Jean-Francois Puget: Kaggle has 15 million users. Not everybody enters competitions. The people who got a rank in competitions, I think they are in the tens of thousands, which means many more entered but got no results. So yeah, it's quite a lot of people.
But that's a very impressive achievement. And so, I'd like to hear a little bit about the secret to your success. So how have you managed to get to that high ranking position?
I would say it's a combination. It's to have a scientific approach. I was trained as a scientist. I was good at physics and math. In France, I achieved the best possible math result. That's good. I even got to the Physics Olympiad representing France. So I have a good scientific background.
The scientific method, I could say in a nutshell, is that you always check your assumptions. So you have assumptions. You design experiments such that the result will tell you if your assumption is right or wrong. And I strongly believe machine learning is a science. It's an experimental science like physics, or like parts of physics.
And so I approach a competition with a scientific approach. And everything is an experiment. So for instance, if I have an idea for a new data processing step, a new feature, or a new model architecture, whatever, I must have a baseline, something I know I trust, and then I run an experiment with the new data processing, with the new model architecture change, whatever.
I run my experiment and then I look at the result. Is it better or worse? Sometimes I dig a bit into the output to understand where it's better and where it's worse. And from there, either I accept the change and it becomes the new baseline, or I discard it, et cetera. And for this, you need to have what we call a good cross-validation setting.
So the bread and butter of practical machine learning is cross validation. It can be K-fold cross validation. If it's time series, it's time-based cross validation. But the point is you split your training data, you keep some of it to validate your model. You train on the rest.
And when it is trained, you predict on the validation data and you compare with the ground truth, with the target on the validation data. And K-fold means you do this K times with K different splits. So there are variants, but really, you don't evaluate your model on the training data. That's the most common error I see. It's surprisingly common. And that's something Kaggle teaches you: how to evaluate your model. So the real point is evaluate your assumptions. Evaluate whatever modification you make to your code, and make one modification at a time. That's also something I've seen: people modify three things.
Oh, it's better, but maybe one of them is detrimental. Or they modify three things and the result is worse. They will discard the three things. Maybe one of them was good, but it's offset by the others. So you have to be consistent in running experiments. And if you run experiments correctly, record the settings.
You can reproduce what you do, that's also important.
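To make the K-fold idea concrete, here is a minimal sketch using only the Python standard library. The toy data, the two models (`fit_mean` as the trusted baseline, `fit_line` as the single change being tested), and all function names are illustrative assumptions, not anything from the interview. Both experiments are scored on the same fixed splits, and the seed is recorded so the run can be reproduced.

```python
import random
from statistics import mean

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices once and cut them into k disjoint validation folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed: the splits are reproducible
    return [indices[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate change: least-squares line fitted on the training rows only."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return lambda x: my + slope * (x - mx)

def cross_val_mse(fit, xs, ys, folds):
    """Train on k-1 folds, score on the held-out fold, average over the folds."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # The model is never scored on rows it was trained on.
        scores.append(mean((model(xs[i]) - ys[i]) ** 2 for i in fold))
    return mean(scores)

# Hypothetical toy data: a noisy straight line.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

folds = k_fold_indices(len(xs), k=5)  # same splits for every experiment
baseline_mse = cross_val_mse(fit_mean, xs, ys, folds)
candidate_mse = cross_val_mse(fit_line, xs, ys, folds)
print(f"baseline MSE:  {baseline_mse:.3f}")
print(f"candidate MSE: {candidate_mse:.3f}")
```

If the candidate scores better on the held-out folds, it becomes the new baseline and the next single change is tested against it.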
I really like this idea that you should treat machine learning as an experimental science, because I think quite often you find that people learn about things like testing in a statistics class, and then the machine learning class is separate, and they don't apply the idea that actually I should be doing experiments when I'm doing machine learning.
So I really like that idea. And you mentioned that things like cross-validation are really.
So there was a course I was recommending a lot, on Coursera, but it's outdated now. It's with MATLAB. But still, he was teaching how to evaluate models, and I saw people just forget what he says because they were using a different kind of model, say a deep learning model. I see you've taken this course.
Yes. Why don't you use cross validation? Oh, does it apply to deep learning? Yeah. It's a methodology. It does not depend on the type of model.
That is interesting. Once you switch from just regular machine learning to deep learning, people forget all the stuff that they learned in the original machine learning models. Do you have any tips for how you go about winning? Like what are the techniques or things that you use in order to get better predictions?
So if you have a good cross-validation setting, you can rely on what you do, and the next thing to avoid is overfitting. Even if you use cross validation, if you use it a lot, the same splits over and over again, you end up overfitting to it. So you need to be conservative. Make sure you don't select something just because it was lucky.
So there is a tendency of people to rely on the public leaderboard. It's a train/test split fixed once across the competition. And people rely too much on this fixed train/test split, so they will overfit to the public test. So I almost never use the public test on Kaggle. Almost never. Or let's say I use it as an additional fold. If I use a five-fold cross validation on my training data,
the public test is a sixth fold, but no more. And the second one is to have no preconception and to quickly create a baseline: a linear model, usually, if it's tabular data; a simple CNN if it's computer vision; just running a pre-trained transformer if it's NLP. Very quickly have a complete pipeline where you can submit and create a solution, and then you improve gradually, and you have no preconception.
Always question what you have set. Oh, I have this parameter fixed. Why not try to vary it? Why not? Don't be shy. I see people ask in the forum, do you think this could work? Don't ask. Just try it and see what happens. You will always learn something. So really, a good performance just comes from a solid use of the scientific method. Sometimes people have a great idea that nobody else has. It happens, it happens to me as well from time to time, but that's less and less frequent because the level of people is increasing, and the knowledge too. For instance, I did well in an NLP competition because I started using prompts. There were some papers coming out on prompt engineering. There was an NLP competition at the same time. So I just did some prompt engineering before it became really known. So that was a good advantage. But I would say the key is to perform the right experiments. And what does that mean? It depends on the competition, and that's the knowledge, the training we get from Kaggle, I would say. So practice, don't be shy, test your hypotheses, and be conservative.
That's really interesting, the idea that it's very difficult to predict which theory's gonna give you the best result. So the only thing you can do is just try lots of things and see what the data shows, whether the results are actually any good or not. Because that seems very different from a lot of sciences.
But I'd like to talk a bit about the Kaggle Grandmasters of NVIDIA. So, this is your team of competitors, I presume. So it is not just you, it's a group of several of you who are competing.
Yeah, we're called KGMoN. And it sounds like Pokemon, so it's not by chance. So Kaggle Grandmasters of NVIDIA, there are eight of us plus me. So it's nine. So it's not a big team. There are now about 150, I believe, competition Grandmasters. So there are not many companies, there are a few companies, having a Grandmaster team as well. It depends on what you want to achieve, but I do believe in small teams of very good people. And they all do the same as me. It's like people having a PhD. So I will come back to this, but Kaggle Grandmasters know how to work effectively, otherwise they would not be Grandmasters.
So having good work habits is key, which means they don't need much management. So I don't see myself really as a manager, a coordinator maybe, but most of my job is individual contribution. And they all do the same as me: a mix of competitions and internal projects.
I want to come back to PhDs. I believe the one thing people learn during a PhD is autonomy. A good PhD student does not need to be told what to do every day, and they know how to complete a complex project in the end. And Kaggle competitions are the same. They are complex projects, time-boxed, usually three months. And to do well, you need to complete your project on time. So that's also something that is good about Kaggle Grandmasters. They work fast and they meet deadlines.
So I think a lot of people listening are gonna be thinking that sounds like a cushy job, being able to just participate in Kaggle competitions while I'm at work, and they're gonna want to know how do I get this for myself? So, can you talk about how you persuaded management to let you do this in your career?
When I started Kaggling at IBM, maybe I was spending on average one or two hours of work every day on it, which is already good. But most of my Kaggle time was evenings, weekends, holidays. It was a hobby, a passion. So it's like people going to the casino for gambling.
I believe it's the same. It's a legal drug, except you don't lose money here. And to become a Grandmaster you need to spend a lot of time. It's fierce competition. A lot of people became Grandmasters because they were students, PhD students in machine learning usually. And once they are Grandmasters, they get a job and they stop Kaggling because they don't have enough time. When I was hired at NVIDIA, remember, I was already on Kaggle. So I became good at Kaggle before it became my work. So I would say just invest time if you can. If you have the skills and the motivation, become a Grandmaster.
And then you will find jobs like mine at NVIDIA or at a few other companies. So obviously I answered the NVIDIA job ad with Kaggle Grandmaster as a prerequisite, but I see it on LinkedIn, for instance. So we see from time to time companies asking for it.
And what does NVIDIA get out of you being a grand master? Like what's the benefit to the rest of the company?
In a competition, someone shared a notebook that accelerated a pandas pipeline using Polars and this and that. And I looked at it, and I said, let's see what we can do on GPU.
So I used cuDF, the NVIDIA data frames part of RAPIDS. I used cuML. So I used NVIDIA, all NVIDIA. I recoded the other notebook and got, I believe, a 17x speed up. So as a result, people know that if they use GPU, they will get better performance. If I had not done it, people would say, oh, if pandas is too slow, just use Polars. Which is interesting advice. And yes, Polars is more efficient in general than pandas, but cuDF is even faster. Another thing we did, there is another competition in medical imaging. So DICOM images, it's a standard format in medical imaging, and in the competition people were not using GPU to decode the images, but NVIDIA had a toolkit.
So some people on my team, they tried the NVIDIA toolkit, saw that it did not handle some of the formats. They worked with NVIDIA product team and last month they released a notebook with an early access of the new toolkit. And as a result, images can be decoded on GPU in Kaggle. And same, the speed up is at least 10 x I believe.
It's fast. It's more than that. So we showcase NVIDIA tooling.
That's really interesting. And actually I'd love to talk more about the NVIDIA tooling. So, of course GPUs are perhaps your biggest line of business. So can you just tell me a bit about what sort of machine learning problems are particularly suitable for a GPU?
I would say if deep learning is the way to go, think of GPU. That's my first advice. So if you are in computer vision, so image classification or object location, video tagging, whatever. NLP, since the BERT paper, since transformers took over, it's again deep learning. And with some preprocessing called fast Fourier transform, other data becomes amenable to computer vision models. So for these three classes of data, which people usually call unstructured data, using an accelerator, and especially GPU, is the way to go compared to CPU. For tabular data, say you have sales, you have past sales, and you need to predict the future.
I'd say it depends on the size of the data. Sometimes people have a hundred locations and five years of monthly data. So it's 60 data points times 100 locations, 6,000 data points. If you use XGBoost, for instance, you may not need GPU. That's fine. So for small data, use whatever you want. But for large data,
for instance a recent competition, it was a recommender system. We had 18 million users, 1.8 million products, and a hundred million interactions. So doing data processing and modeling with XGBoost or something else using RAPIDS on GPU, the speed up is enormous. It's 50x or more. So again, if we go back to what I was saying, the key is to perform experiments quickly and effectively.
So as soon as you can accelerate with GPU, you will run more experiments. So within a day you will test more hypotheses and you will make progress much more effectively.
So it seems like most of these examples you gave where the GPU is going to be faster, these are examples where the code can be easily parallelized. So you're doing multiple things independently. Is that correct?
Well, the frameworks do it. For instance, let's say even for data processing, if you use pandas and you want to compute, I don't know, one column by, you do a group by. For instance, you want the mean of the spending by user. In pandas you would group by user and compute the mean, but it will iterate through the users one by one.
In contrast, with cuDF on GPU, it will be run in parallel for you. So you don't need to write parallel code. The code is parallelized. So this way you can get a hundred times speed up just because it will process hundreds of users at a time. So that's how you get the speed up. For deep learning, the bulk of computation is matrix multiplication, tensor multiplication. And GPUs are designed to do exactly this: they access the memory and multiply two parts of the matrices in one cycle. CPUs do have some parallelism, but GPUs are massively parallel.
So when you can use massive parallelism, GPU is a great idea. Most of the computations can be parallelized on GPU, so you just select GPU. It's one parameter in XGBoost, and your code runs on GPU. But you don't need to change your code. That's the key.
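As an illustration of the group-by example, here is a small sketch of the mean spending per user written with pandas. The column names and values are made up. The assumption, based on what is said in the interview, is that the same lines would run on GPU by swapping the import for cuDF; that swap is not exercised here.

```python
import pandas as pd  # assumption: on a RAPIDS setup this line would become `import cudf as pd`

# Hypothetical spending log: one row per purchase.
spending = pd.DataFrame({
    "user": ["ann", "bob", "ann", "cid", "bob", "ann"],
    "amount": [10.0, 20.0, 30.0, 5.0, 40.0, 20.0],
})

# Mean spending per user. The code states what to compute;
# how the work is spread over cores or GPU threads is left to the library.
mean_by_user = spending.groupby("user")["amount"].mean()
print(mean_by_user)
```

Nothing in the code mentions threads or devices, which is the point being made: the parallelism lives in the library, not in the user's pipeline.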
That sounds like a really useful thing, not having to write completely different code when you're switching to GPUs. So the NVIDIA software stack for doing all this data science on GPUs, that's RAPIDS. Can you just tell us a little bit about what you can do with RAPIDS and who it is for?
So I'd say RAPIDS is fairly comprehensive. The motivation was to get a GPU-accelerated version of pandas and scikit-learn. So you have a package called cuDF, so CUDA DataFrames, which is similar to pandas except the data processing is done on GPU. But the API is really similar to pandas, to the point that now,
when I have pandas code that is too slow, the first thing I do is import cudf as pd, and then I run my code and most of it runs as is. And we are working with the RAPIDS team to reduce the cases where behavior is different. And for scikit-learn, the RAPIDS equivalent is called cuML, so machine learning. Not every algorithm is implemented yet, but a lot are.
And the API is really similar, to the point that the cuML documentation refers to the scikit-learn documentation. So the goal is really that it's easy to translate pipelines. And then recently, many other packages have been added to RAPIDS, like cuSignal and cuGraph. I have less experience with these. They are a bit more specialized, but they are tooling. So it's always the same idea, to see what is needed to move a pipeline from CPU to GPU. For deep learning,
there is no package, no framework from NVIDIA, because TensorFlow, PyTorch, and others did the work correctly. So we support these frameworks. There is a backend called cuDNN that these frameworks use, but users don't need to worry about it. So personally, I use PyTorch. I know it uses cuDNN under the hood, but I just use PyTorch. So for that reason, given the deep learning frameworks were already on GPU, RAPIDS itself is not dealing with deep learning. But we know, and it's part of the feedback we Grandmasters gave, that often it is useful to combine deep learning with other machine learning models. So work has been done, and recently the cuDF team has released a way to share memory between cuDF and PyTorch.
So you can prepare your data with cuDF, and when it's ready, it's used by PyTorch without memory copy, and it's all on GPU. So the full pipeline is on GPU.
Alright, so it seems like cuDF is perhaps the most interesting part of RAPIDS for data scientists and machine learning scientists. So it's a high-performance pandas alternative, but there are about a dozen of these different high-performance pandas alternatives around. So how does cuDF compare to things like Vaex and Modin and Koalas and all of them?
JF Puget: So they get speed up by distributing computation. Because, for people who listen to us, especially data scientists, some may not know yet, though they learn it as soon as they use Python, that Python is mono-threaded because of something called the global interpreter lock, the GIL. So Python is mono-threaded, which means that if you want to use parallelism in Python, either you call, say, a C or C++ library that does it for you, like CUDA, or you implement multiprocessing, or you distribute across machines. So there are now some distributed data frames, and you could have mentioned Spark as well. Our preferred way at NVIDIA is called Dask. So Dask is a distributed computing system, a bit similar to Spark, but it's more Python friendly, I would say. And there is a Dask-cuDF for those who want to distribute. And one reason to distribute is when your data is too big to fit in one machine's memory. GPU memory is increasing, but it's still limited.
So Dask-cuDF is a way to distribute processing across multiple GPUs. And then when it comes to benchmarking, as I said, each time I tried cuDF it was faster than anything. It's because GPUs are really so powerful, so massively parallel, that it's really hard to compete. The only thing that would limit its application is the fact that GPU memory is limited.
So for that, look at Dask. But I would say if you can fit in the GPU memory, in the memory of your GPUs, it's hard to beat.
So it seems like cuDF is a pretty high performance thing and maybe worth checking out if your pandas code is running too slow. But I wanna circle back now to talking about your Kaggle competitions and how it relates to more standard machine learning work in a business context.
So, do you find a difference between competitive data science and machine learning at work?
Yeah. So, it addresses some valid criticism of Kaggle, which is that at work, maybe not just a data scientist, but the company, the organization using machine learning must cover a full life cycle that starts with framing a problem as machine learning and gathering data for it. Since most of practical machine learning is supervised learning, you need to annotate this data to get training data.
Then you have data curation, modeling, model evaluation, and once models are evaluated properly, you put them in some production system or behind a dashboard or whatever. You connect it to an e-commerce site for recommendation, whatever your use case is, and then you need to monitor the model in production and detect if performance is going down, which may mean you need to retrain because something has changed in your environment. There is a full life cycle and Kaggle does not cover all of it. When you are in a Kaggle competition, you have curated data, you have annotated data, you have a metric. So the problem is already defined for you. You only have to train the model, then you submit predictions to Kaggle or you give your prediction code, and it's applied to test data and that's it. So you don't deploy, you don't need to worry about downstream. So Kaggle is only part of the machine learning pipeline, but for this part, it teaches you the right methodology,
which is what I explained before, experiment based, et cetera. So I would say Kaggle is great to learn about modeling and model evaluation, but it's good only for this. Someone who never worked on real-life problems and only on Kaggle is not a full-fledged data scientist. People need to get experience in, okay, 'How do I even apply machine learning to this business problem?
Where do I get the data, working with people? How do I annotate it? How do I get labels? And downstream as well.' So downstream is better understood, I would say. There is this ML engineer profession that has emerged that can operationalize the model. So we find more and more ML engineers, but I would say the upstream part, framing the problem as machine learning, getting data reliably, curating it, et cetera, is still a kind of art and maybe underestimated at this point.
So that is interesting, the idea that the competition only focuses on the sort of middle part of the machine learning life cycle around making predictions. But you don't get the start bit about framing the problem and collecting the data, and the end bit about how do I deploy this? Or how do I actually use the model?
So it seems like a big part of this is about not having to align your model with some kind of business goal. So do you have any advice for people on how to do that? Like how do you make sure that the machine learning work you're doing is gonna align with some kind of business goal?
That's a great question. And actually, when I'm asked to help on a machine learning project, if I'm not there at the start, I ask people to imagine: assume that your model is perfect. It makes perfect predictions. How do you use it? And for instance, if it's forecasting, you can play back.
So assume you had perfect predictions. How would this have impacted your business? Do you know how to use a perfect prediction? You predict exactly the target, what would happen? And, not surprisingly for me, most of the time people have no clue. So I say you need to design your business process, tooling, whatever, so that it can consume the output of your model.
It's straightforward. I've seen this once. I used to be active on Twitter, and I remember someone once saying they worked at a pharmaceutical company, they didn't say which one, and they worked, based on feedback about one medication produced by that company, to predict when the medicine was most effective. And they did a good job. So with their machine learning model, based on a patient's features, the model could predict if it was worth using the medicine or not. So it could be a good help for medical doctors. When they presented the result to their management, the project was shut down. So I guess it's because of the pressure to sell the medicine, even when it's not effective. So I'm not going to discuss the pharmaceutical industry's incentives, but I want to point out that the people working on the machine learning project should have asked the stakeholders: what if we succeed, how would we use the model?
Is it worth doing? And maybe someone would've said, no, we have no interest in doing this. Instead, a team spent one year on it. So just check that you are doing something useful when you start, not when you're done.
So that leads to an interesting point about like how do you measure the success of a machine learning project? So I think like the Kaggle ideal is machine learning works best when you have the best predictions, but in real life that's not always the case. So can you talk about what constitutes success for machine learning?
Yeah. So in Kaggle, most of the time what matters is how good a metric can become on the test data, and this sometimes leads to complex solutions with lots of models being ensembled, several stages, and stacking, and it's too complex to be used. Kaggle is trying to limit the complexity, but in short, you want to balance the quality of the predictions against the cost of maintenance and the cost of implementation.
So you prefer to have one model that is a bit less performant than the complex ensemble you could get on Kaggle, but which is simple to implement and simple to retrain; you can maybe automate everything, et cetera. So the complexity of the model and the complexity of training the model are key factors.
The other point is that the metric is a proxy for the business problem. Getting a good metric does not mean you improve your business. Let me give you an anecdote that I read. I don't know if it's true or not, maybe it's too good to be true, but someone I know claimed to have worked in a support organization on a customer churn problem for a subscription company, like a telco or a TV or cable company or whatever.
So his model was predicting which customers were most likely not to renew their subscription. This is a classical example you see in many machine learning textbooks. So then they said, okay, let's run it on the customer base, and we'll have the call center, the support team, call the people most at risk to offer them an incentive, a rebate, or what have you.
The problem is many calls went like this.
'So, hello, Mr. Customer. I'm from company X.'
'Uh, great. I wanted to cancel my subscription. Let's do it.'
So, in fact, they accelerated the cancellations, because they targeted the right people, but not with the right answer. This is an extreme case, but it really highlights what I see. Assume your model is good. How do you use it effectively to improve the business? The other thing is to measure. If you don't know up front, you have to do A/B testing, as you cited before. Say you run the previous process for half of whatever you apply it to: your users, your machines, whatever.
For the other half, you use the process with machine learning, and you monitor to see whether there is a difference and in which direction. Hopefully the part with machine learning works better, so you will use it more. But always keep a small fraction without machine learning, so that you can detect if at some point the machine learning system no longer works well. This can happen if the underlying conditions are changing. So monitor what happens. I saw a presentation at an industry forum by someone right before me, and they were describing, I believe, a recommender system. The results were not good, but they only discovered it about nine months after deployment because they were not monitoring.
As soon as they started monitoring, they noticed that the sales of promoted items were not increasing. In fact, they had not included promotions in the training data, so the model was insensitive to promotions. It predicted that some products were popular for the wrong reason. They were popular because they were promoted, but the system did not have that data.
So when fitting, it invented a reason that would explain why the products were popular. Once they noticed it, they retrained the model with past promotion data, and all of a sudden sales started to increase. But they had to monitor to see it in practice.
So it's the same as I said: always check your assumptions. You assume you have a good model, and you have good reasons; you have done cross-validation and all of that. Check that it is really good in practice.
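The measurement setup described above (run the ML process for most users, keep a small fraction on the old process, and keep monitoring) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `assign_arm`, `two_proportion_z`, and `monitor`, the 10% holdout fraction, and the alert threshold are all hypothetical, and a two-proportion z statistic is one simple way among many to compare the two groups.

```python
import hashlib
import math


def assign_arm(user_id: str, holdout_fraction: float = 0.1) -> str:
    """Deterministically assign a user to 'control' (no ML) or 'ml'.

    Hashing the id keeps each user in the same arm across sessions.
    The 10% default holdout is an illustrative assumption.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "control" if bucket < holdout_fraction else "ml"


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z statistic for (rate_b - rate_a), using the pooled standard error."""
    if n_a == 0 or n_b == 0:
        return 0.0
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (success_b / n_b - success_a / n_a) / se


def monitor(outcomes, alert_z: float = -2.0) -> dict:
    """Summarize (arm, converted) pairs and flag when ML underperforms.

    'alert' is True when the ML arm is clearly worse than the holdout,
    which is the signal that the model may no longer work well.
    """
    counts = {"control": [0, 0], "ml": [0, 0]}  # [successes, total]
    for arm, converted in outcomes:
        counts[arm][0] += int(converted)
        counts[arm][1] += 1
    (s_c, n_c), (s_m, n_m) = counts["control"], counts["ml"]
    z = two_proportion_z(s_c, n_c, s_m, n_m)
    return {
        "control_rate": s_c / n_c if n_c else None,
        "ml_rate": s_m / n_m if n_m else None,
        "z": z,
        "alert": z < alert_z,
    }
```

Running `monitor` on a schedule over recent outcomes, rather than once at launch, is what would have surfaced a problem like the promotions one within days instead of months.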
Okay. Those are really great stories of machine learning disasters. The one about churn in particular is really terrifying to me. At DataCamp, we're primarily a subscription business, so customer churn is something we live in fear of. So the idea that you could do a machine learning project and then make it worse is absolutely horrifying.
So, I'd like to talk about productivity a little. It seems like, particularly with your competitive machine learning background, you've gotten good at building models very quickly. So do you have any productivity tips for how to do machine learning faster?
The key is to have a modular pipeline that is easy to maintain, to modify, and to log. Now I'm used to logging things, to having something modular controlled by a configuration file, et cetera. I would say it's standard software development practice. But data scientists are not developers, that's something I really believe is true, and those who have no experience in software development have to learn it. Unfortunately, it's a bit the hard way. There are no programming courses for data scientists, not really. So people need to be able to version their code and to have a clear distinction between configuration and a baseline script.
All sorts of things. But the goal is to automate as much as possible. And then there are tools like Weights & Biases or Neptune.ai that help you track experiments, for instance. So there is more and more tooling coming, but really the goal is to automate most things and focus on your idea.
You have a new idea, you should just write a bit of code or change some configuration, run it, and get the results logged, easy to compare with other experiments. So the key is really to remove the need to do manual work, manual meaning typing to get a new result. I start with notebooks, they're very good, but as soon as possible I move to Python scripts with configuration files, because it's faster to iterate.
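A minimal sketch of that workflow, using only the Python standard library, might look like the following. Everything named here is a hypothetical stand-in rather than the speaker's actual tooling: `load_config`, `run_experiment`, `best_run`, the default config keys, and the `experiments.jsonl` log file are assumptions, and `train_fn` represents whatever training script returns a validation score. Dedicated trackers such as the ones mentioned above do far more; this only shows the idea of "change a config value, run, get the result logged and comparable".

```python
import json
import time
from pathlib import Path

# Hypothetical defaults; a real project would keep these in a versioned file.
DEFAULT_CONFIG = {"name": "baseline", "model": "linear", "n_folds": 5, "seed": 0}


def load_config(path=None, **overrides):
    """Merge defaults, an optional JSON config file, and keyword overrides."""
    config = dict(DEFAULT_CONFIG)
    if path is not None:
        config.update(json.loads(Path(path).read_text(encoding="utf-8")))
    config.update(overrides)
    return config


def run_experiment(config, train_fn, log_path="experiments.jsonl"):
    """Run train_fn(config) and append config, score, and duration to a log."""
    start = time.time()
    score = train_fn(config)
    record = {
        "config": config,
        "score": score,
        "seconds": round(time.time() - start, 3),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_path="experiments.jsonl", higher_is_better=True):
    """Return the logged record with the best score."""
    with open(log_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    pick = max if higher_is_better else min
    return pick(records, key=lambda r: r["score"])
```

With this in place, trying a new idea is a one-line change such as `run_experiment(load_config(name="more_folds", n_folds=10), train_fn)`, and the log file becomes the record of what was tried and how it compared to the baseline.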
Okay, so for any kind of manual task, try to automate as much as possible. That seems like a great productivity tip. And do you have any advice for when you have to do machine learning projects in a very short time? How do you do very fast projects that take just a couple of days, or maybe a week or two?
So if it's only a few days, well, if you are clear, it's amazing what you can do in a few hours.
Really, if you only have two hours, I would use an automated tool. For instance, I've tried one called AutoGluon, I think it's from Amazon. It's quite impressive, but it's not the only one. So, AutoML: if you have a few hours, I would go with an AutoML tool. If you have weeks, then you can beat AutoML, because you can include additional data, for instance, that is relevant.
You can include domain knowledge that an AutoML system cannot devise. So you work more on the data, et cetera, and you can beat AutoML. But if you only have a few hours, I would run an AutoML tool, and if it's tabular data, I would start even simpler. I would run a linear regression or a logistic regression. If I have a bit more time, I would run XGBoost.
If you have a few hours, you can't do something complex, so use a simple model.
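As an illustration of "start simple and validate it properly", here is a small, self-contained sketch that cross-validates a one-feature linear regression against a predict-the-mean baseline. It is written in plain Python so it runs anywhere; in practice one would more likely reach for a library implementation. The helper names `fit_line`, `k_fold_indices`, and `cv_rmse` are hypothetical, and the single-feature setup is a simplifying assumption.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_rmse(xs, ys, k=5, use_feature=True, seed=0):
    """Out-of-fold RMSE: every point is predicted by a model that never saw it.

    use_feature=False gives the trivial baseline that predicts the training mean.
    """
    squared_errors = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        if use_feature:
            slope, intercept = fit_line(
                [xs[i] for i in train], [ys[i] for i in train]
            )
        else:
            slope, intercept = 0.0, mean(ys[i] for i in train)
        squared_errors.extend(
            (ys[i] - (slope * xs[i] + intercept)) ** 2 for i in fold
        )
    return mean(squared_errors) ** 0.5
```

Comparing `cv_rmse(xs, ys)` with `cv_rmse(xs, ys, use_feature=False)` shows whether even the simplest model beats the trivial baseline, and that number becomes the reference that any more complex model has to improve on.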
So start simple, and if you have more time, you can always make things more complex later on. I'd like to talk a bit about collaboration, since that's a big part of productivity. Do you have any tips on doing machine learning as a team?
There are two cases. One is when we have to deliver common code, and the other is when we have to deliver predictions. The second case is where you don't care about productionizing your code, where you only need to ship predictions. There, we collaborate by exchanging data: datasets, and predictions on these datasets. If it's common code, we have to use something like GitHub or GitLab and use software engineering techniques. For communication, I often use Slack because of the time difference. With my team and with Kaggle teams, I have always worked with remote teams. My team has one person in Japan, one in Germany, two in France, one in Brazil, and three in the US. I hope I'm not forgetting someone. I will be trashed. But you see, given the time difference, it's hard to have everybody on a web conference. We do it, but not often. So we rely on asynchronous communication: commit, upload and download in a common directory, and Slack. People can use other tools than Slack, but the point is that it's asynchronous.
So we write down our ideas and our results, and the other person comes later and responds. It's like a remote development team, like an open source project. It's quite different from some dev organizations where everybody is in the same office, as was usual before the pandemic.
After the pandemic, more people are working remotely. But that's what I've been doing, and that may be why I did not need to relocate to Silicon Valley: I was able to work remotely.
Yeah. It does seem that communication is just a huge part of productivity, and having this idea of asynchronous communication where things are written down is incredibly important, particularly when you are in a remote team in different time zones.
Alright. Before we wrap up I'd like to talk about conferences.
So we've got a bit of a clash, since both DataCamp and NVIDIA have data science conferences going on with partially overlapping dates. The DataCamp conference is called Radar, and that's on the 22nd and 23rd of March. And NVIDIA has a rival conference, GTC, with a few overlapping dates.
So, can you tell us a little bit about what's going on at GTC?
Yeah, so GTC runs once in spring and once in fall. That's the NVIDIA conference. We have keynotes by our CEO, where he usually announces new products and services. Then you have tons of sessions by industry, by use case, more or less technical, ranging from research to very applied.
And we do have a couple of sessions from us, the Kaggle Grandmaster team. So if you're using GPUs, consider attending GTC. And to avoid a clash with concurrent events, sessions are always available in replay. Well, first, GTC, the main conference, is free, and you can register and watch when you're ready. So, for instance, being in France, I can't watch everything.
I just use replays for what I'm interested in, but it's really how to get the latest news. And it's not just about data science. As people know, NVIDIA is great for gaming and other uses of GPUs. So whatever your interests, if you use GPUs, that's the conference to attend.
That actually seems like a good, practical, diplomatic approach if you're stuck trying to decide which conference to go to. Since they're both virtual, you can register for both and then watch whichever sessions appeal to you on the recordings later on. So, just to finish, what are you most excited about in data science and machine learning?
It's evolving so fast, you have to learn all the time. So I like it. I don't know what will be hot next year. I don't know what is doable. There is a frenzy right now, et cetera, so I'm listening to that. Not sure why. What excites me is the progress. I have spoken a lot about RAPIDS, but this year I used it more than before, and the speedups are incredible. Incredible. So that's one thing. The other is that, when I started, there was a clear divide between statisticians, machine learning, and deep learning. And now the barriers are being removed, maybe because everybody is moving to Python.
So it's great when we see an unexpected use of one technique in a place where it was not used before. For instance, a colleague of mine won an image classification competition without training any deep learning model, by running SVM regression, support vector machine regression. He ran support vector regression on the predictions of deep learning models, without training any deep learning model himself.
That's surprising. So I know I will have surprises all the time. That's what I love.
That's a great answer. And I do think that having people pushed into different situations, like people moving to Python from a different language or coming from statistics to machine learning, does open up lots of interesting opportunities and innovation. Alright, super.
Thank you for your time. Lots of really great insights, and yeah, thank you for coming on the show, Jean-Francois.
Thank you for inviting me. I enjoyed it.