Demonstrate your understanding of AI Engineering!
📖 Background
Calling AI Engineers!
DataCamp's AI Engineer for Data Scientists certification is a great way to showcase your data skills. Demonstrate to employers that you can implement modelling approaches, create prototype GenAI systems, and more!
Take a look at our get started page to understand how this certification can help you with your goals.
🎓 Step 1: Get Certified!
Not yet certified? Get started now with the AI Engineer for Data Scientists Associate certification.
You need to complete this certification before the competition closes.
Paste the link to your newly earned AI Engineer for Data Scientists certification here.
📝 Step 2: Explain the importance of splitting a dataset
One of your colleagues is new to AI engineering and wants to understand why splitting a dataset is important for machine learning.
In fewer than 300 words, explain:
- Why it’s important to split data into training, validation, and testing sets.
- When in the machine learning workflow you should perform the split.
- What could happen if you don't split your data properly, with an example from your own experience (or a hypothetical one).
Keep to the word count! Submissions over 300 words will not be reviewed.
Why Splitting a Dataset Is Important in Machine Learning
When building a machine learning model, it’s essential to split your dataset into a training set, validation set, and test set. This helps ensure the model performs well not just on data it has seen, but also on new, unseen data—just like it would in the real world.
- The training set teaches the model. It learns patterns, trends, and relationships from this data.
- The validation set helps tune the model. It’s used to compare different settings and detect overfitting, where a model fits the training data too closely and performs poorly on new data.
- The test set is like a final exam. It shows how well the trained and tuned model works on entirely new data.
You should split your data early in the machine learning process, right after exploring it and before fitting any preprocessing such as scaling. This prevents information from “leaking” between sets and keeps your results honest.
If you don’t split properly, your model might look great but fail in the real world. Imagine training and testing a spam email detector on the same messages. It could hit 99% accuracy yet struggle in production, because it memorized those messages rather than learning to spot new spam.
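A minimal, standard-library sketch of the spam scenario follows. The `three_way_split` helper, the 70/15/15 ratio, and the randomly labeled messages are assumptions made for illustration, and the memorizing "model" is deliberately naive.

```python
import random

def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a copy of rows, then cut it into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_frac)
    n_test = round(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Hypothetical data: (message_id, is_spam) pairs with random labels,
# so there is no real pattern to learn.
rng = random.Random(0)
messages = [(i, rng.random() < 0.5) for i in range(2000)]

# Split first, before any fitting.
train, val, test = three_way_split(messages)

# A deliberately bad "model" that only memorizes the training labels.
memory = dict(train)

def accuracy(rows):
    # Messages the model has never seen default to "not spam".
    hits = sum(memory.get(msg_id, False) == label for msg_id, label in rows)
    return hits / len(rows)

train_acc = accuracy(train)  # scored on data the model has already seen
test_acc = accuracy(test)    # scored on held-out data
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

On its own training data the memorizer scores perfectly, while its held-out test score should sit near chance.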
In short, proper data splitting helps you build models you can trust—not just in theory, but in practice.
✅ Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and addresses the task you were given.