Demonstrate your understanding of AI Engineering!
📖 Background
Calling AI Engineers!
DataCamp's AI Engineer for Data Scientists certification is a great way to showcase your data skills. Demonstrate to employers that you can implement modelling approaches, create prototype GenAI systems, and more!
Take a look at our get started page to understand how this certification can help you with your goals.
🎓 Step 1: Get Certified!
Not yet certified? Get started now with the AI Engineer for Data Scientists Associate certification.
You need to complete this certification before the competition closes.
📝 Step 2: Explain the importance of splitting a dataset
⭐️ A beginner-friendly explanation of training, validation, and test sets in ML.
Why Do We Split Data in Machine Learning?
Imagine you're teaching a child how to add and subtract. You show them examples like 2 + 3 = 5 and 7 - 4 = 3, and they practice these over and over.
Now, to check if they really understand, would you test them with the same questions they practiced? Probably not—because they might have just memorized the answers.
Instead, you'd give them new problems, like 5 + 6 or 9 - 2, to see if they’ve truly learned how to add and subtract—not just remembered examples.
That’s exactly why, in machine learning, we split our data into three parts:
- Training set: Like practice questions, it teaches the model.
- Validation set: Like warm-up quizzes, it is used to tune the model's settings and compare candidate models before the final exam.
- Test set: Like the final exam, it contains completely new questions the model has never seen.
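The three-way split above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `split_dataset`, the 70/15/15 proportions, and the seed are choices made for this example, not a rule from the text.

```python
import random

def split_dataset(rows, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle the rows, then cut them into train, validation, and test lists.

    The fractions and the seed are illustrative defaults (assumptions).
    """
    shuffled = list(rows)                  # copy, so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)  # shuffle so each part is a random sample

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]               # the "final exam": set aside and left alone
    val = shuffled[n_test:n_test + n_val]  # the "warm-up quizzes": used for tuning
    train = shuffled[n_test + n_val:]      # the "practice questions": used for learning
    return train, val, test

examples = list(range(100))                # stand-in for 100 data points
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))     # 70 15 15
```

Every data point ends up in exactly one of the three parts, so nothing in the "final exam" was ever a "practice question". In practice, libraries such as scikit-learn offer a ready-made `train_test_split` function for this job.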
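We perform this split before training. Basic cleaning can come first, but any preparation step that learns from the data, such as scaling values or filling gaps with averages, should be fitted on the training set only, so that nothing from the test set leaks into the model. Splitting is an important step to make sure the model can handle new, real-world data. If we skip it or do it poorly, the model might just "memorize" the training data, and we would have no way of noticing.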
Splitting the data ensures we build models that learn, not just remember—just like a child who truly understands math, not just specific problems.
For example, imagine building a model to predict house prices based on location and size. If we train and test it on the same data, it might score almost perfectly, because it already “knows” the answers. But when used on new houses, it performs poorly, because it never really learned the patterns and only memorized the data.
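The house-price scenario can be tried out with a small sketch. Everything in it is an assumption made for illustration: the prices are synthetic, only size is used (location is left out to keep it short), and the two "models" are deliberately simple stand-ins, one that memorizes and one that fits a straight line.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic houses: price depends on size plus random noise (made-up numbers).
houses = []
for _ in range(200):
    size = rng.uniform(40, 250)  # square metres
    price = 50_000 + 1_500 * size + rng.gauss(0, 20_000)
    houses.append((size, price))

train, test = houses[:150], houses[150:]  # the test houses are never trained on

def mean_abs_error(predict, rows):
    """Average absolute gap between predicted and actual price."""
    return sum(abs(predict(size) - price) for size, price in rows) / len(rows)

# Model A: pure memorization. It stores every training answer and falls
# back to the average training price for any house it has not seen.
lookup = dict(train)
avg_price = sum(price for _, price in train) / len(train)

def memorizer(size):
    return lookup.get(size, avg_price)

# Model B: a straight line fitted by least squares, which learns the pattern.
mean_size = sum(size for size, _ in train) / len(train)
slope = (sum((s - mean_size) * (p - avg_price) for s, p in train)
         / sum((s - mean_size) ** 2 for s, _ in train))
intercept = avg_price - slope * mean_size

def line(size):
    return slope * size + intercept

memorizer_train_mae = mean_abs_error(memorizer, train)
memorizer_test_mae = mean_abs_error(memorizer, test)
line_train_mae = mean_abs_error(line, train)
line_test_mae = mean_abs_error(line, test)

print(f"Memorizer  train error: {memorizer_train_mae:>10,.0f}  test error: {memorizer_test_mae:>10,.0f}")
print(f"Line model train error: {line_train_mae:>10,.0f}  test error: {line_test_mae:>10,.0f}")
```

The memorizer has zero error on the houses it trained on, which looks perfect, yet it does far worse than the line model on the held-out houses. Only the test set reveals that difference, which is the point of keeping it separate.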
That’s why proper data splitting is not just a technical step—it’s essential for real learning and trustworthy results.