Demonstrate your understanding of AI Engineering!

Paste the link to your newly earned AI Engineer for Data Scientists Certification here

https://www.datacamp.com/certificate/AEDS0010345614522

📝 Step 2: Explain the importance of splitting a dataset

One of your colleagues is new to AI engineering and wants to understand why splitting a dataset is important for machine learning.

In less than 300 words, explain:

  • Why it’s important to split data into training, validation, and testing sets.
  • When in the machine learning workflow you should perform the split.
  • What could happen if you don't split your data properly, with an example from your own experience (or a hypothetical one).

Keep to the word count! Submissions over 300 words will not be reviewed.

  1. Splitting data into training, validation, and testing sets gives an honest estimate of how well a model will perform on unseen data, much like a student aiming for the "best grade in a real exam".
  • Training is like learning the rules from books to prepare for the exam.
  • Validation is like solving sample exams to check and adjust your preparation.
  • Testing is like sitting the real exam with questions you have never seen.
  2. Split the data into subsets (training, validation, and testing) before training the model, so that the validation and test data never influence what the model learns.

  3. Assume you have a dataset of 1000 emails. A common split that supports good performance and a reliable evaluation:

  • 70% training
  • 20% validation
  • 10% testing
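
The 70/20/10 split above could be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `split_dataset`, its parameters, the seed, and the placeholder email IDs are assumptions made for the example, not part of any particular library.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.2, seed=42):
    """Shuffle items and cut them into train, validation, and test lists."""
    if not 0 < train_frac < 1 or not 0 < val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(items)                 # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is the test set
    return train, val, test


# Hypothetical stand-in for the 1000 emails: just their IDs.
emails = [f"email-{i}" for i in range(1000)]
train, val, test = split_dataset(emails)
print(len(train), len(val), len(test))  # prints: 700 200 100
```

Shuffling before cutting matters when the emails are stored in some order (for example, all spam first); without it, the three subsets may not resemble each other.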

Example of a bad data split scenario:

  • 95% training
  • 5% validation and testing

One of my colleagues once built a spam detection model to classify emails as spam or legitimate using the split above. He was glad that the accuracy on the training data was 98%. However, when the model was tested on new, unseen emails, it failed to classify them correctly (60% accuracy). In machine learning this situation is called overfitting, and a poor split is one reason it can go unnoticed: with only 5% of the data held out, the validation and test sets were too small to reveal the problem early. The model memorized the training examples instead of learning the general patterns that separate spam from legitimate email, which would be a disaster if deployed in a real-world scenario.
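
The gap between training accuracy and accuracy on unseen emails can be illustrated with a deliberately extreme toy model. The sketch below is hypothetical: `MemorizingClassifier`, the synthetic email IDs, and the labels are invented for illustration and do not reproduce the colleague's model or its numbers. It only shows why a score measured on training data says little about new data.

```python
import random


class MemorizingClassifier:
    """Toy model that stores every training email verbatim (pure memorization)."""

    def fit(self, emails, labels):
        self.lookup = dict(zip(emails, labels))
        return self

    def predict(self, email):
        # Unseen emails fall back to a fixed guess, because nothing general was learned.
        return self.lookup.get(email, "legitimate")


def accuracy(model, emails, labels):
    """Fraction of emails whose predicted label matches the true label."""
    correct = sum(model.predict(e) == y for e, y in zip(emails, labels))
    return correct / len(emails)


# Hypothetical data: 1000 unique emails, every second one labeled spam.
data = [(f"email-{i}", "spam" if i % 2 == 0 else "legitimate") for i in range(1000)]
random.Random(0).shuffle(data)
train_data, test_data = data[:700], data[700:]

train_x, train_y = zip(*train_data)
test_x, test_y = zip(*test_data)

model = MemorizingClassifier().fit(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)
test_acc = accuracy(model, test_x, test_y)
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The memorizer is perfect on the emails it has seen and close to guessing on the held-out ones, which is exactly the kind of gap that only a properly sized held-out set can expose.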

Proper data splitting is crucial for a model to be ready for a real-world application.

✅ Checklist before publishing into the competition

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
  • Make sure the workbook reads well and addresses the task you were given.