Demonstrate your understanding of AI Engineering!
Paste the link to your newly earned AI Engineer for Data Scientists Certification here
📝 Step 2: Explain the importance of splitting a dataset
One of your colleagues is new to AI engineering and wants to understand why splitting a dataset is important for machine learning.
In less than 300 words, explain:
- Why it’s important to split data into training, validation, and testing sets.
- When in the machine learning workflow you should perform the split.
- What could happen if you don't split your data properly, with an example from your own experience (or a hypothetical one).
Keep to the word count! Submissions over 300 words will not be reviewed.
- Splitting data into training, validation, and testing sets helps a model perform well on data it has not seen, much like a student aiming for the best grade in a real exam:
- Training is like learning the rules from books to prepare for an exam.
- Validation is like solving sample exams.
- Testing is like facing a new, real exam.
- The data must be split into subsets (training, validation, and testing) before training the model.
- Assume you have a dataset of 1,000 emails. A common split that supports good performance:
- 70% training
- 20% validation
- 10% testing
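The 70/20/10 split above can be sketched in a few lines. This is only an illustration using the Python standard library: the helper name `split_indices`, the seed, and the use of index lists in place of real emails are assumptions, not part of the original workflow.

```python
import random

def split_indices(n_samples, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains (10% here)
    return train, val, test

# 1000 stands in for the 1,000 emails in the example
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 200 100
```

Shuffling before cutting keeps each subset representative of the whole dataset, and the fixed seed makes the split repeatable.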
Example of a bad data split scenario:
- 95% training
- 5% validation and testing
One of my colleagues once built a spam detection model to classify emails as spam or legitimate using the bad split above. He was glad that the accuracy on the training data was 98%. However, when tested on new, unseen emails, the model classified only 60% of them correctly. In machine learning this situation is called overfitting, and a poor split can contribute to it: with so little data held out for validation, the problem is hard to notice before testing. In this case the model memorized the training examples instead of learning the general concept needed to classify emails, which would be a disaster if deployed in a real-world scenario.
Proper data splitting is crucial for a model to be ready for a real-world application.
✅ Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and addresses the task you were given.