Machine learning is one of the most useful skills in data science. With machine learning, data practitioners are able to make predictions about key datasets, automate workflows, and extract insights. What does the machine learning workflow look like? This infographic presents a simplified view of the machine learning workflow.
Project Setup
Understand business goals
Speak with your stakeholders and deeply understand the business goal behind the model being proposed. A deep understanding of your business goals will help you scope the necessary technical solution, data sources to be collected, how to evaluate model performance, and more.
Choose the solution to your problem
Once you have a deep understanding of your problem, focus on the category of models that will drive the highest impact. See this Machine Learning Cheat Sheet for more information.
Data Preparation
Data collection
Collect all the data you need for your models, whether from your own organization or from public or paid sources.
Data cleaning
Turn the messy raw data into clean, tidy data ready for analysis. Check out this data cleaning checklist for a primer on data cleaning.
Feature engineering
Manipulate the datasets to create variables (features) that improve your model’s prediction accuracy. Create the same features in both the training set and the testing set.
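One way to guarantee that both sets receive identical features is to put the feature logic in a single function and apply it to each set. The sketch below assumes hypothetical housing records; the field names (`price`, `area_sqm`, `listed_on`) and the derived features are illustrative only and do not come from any particular dataset.

```python
from datetime import date

def add_features(record):
    """Return a copy of the record with derived features added."""
    enriched = dict(record)
    # Ratio feature: price per square metre (hypothetical fields).
    enriched["price_per_sqm"] = record["price"] / record["area_sqm"]
    # Date-part feature: month in which the listing appeared.
    enriched["listed_month"] = record["listed_on"].month
    return enriched

# Hypothetical training and testing records.
train_records = [
    {"price": 300000.0, "area_sqm": 100.0, "listed_on": date(2023, 3, 14)},
    {"price": 450000.0, "area_sqm": 120.0, "listed_on": date(2023, 7, 2)},
]
test_records = [
    {"price": 200000.0, "area_sqm": 80.0, "listed_on": date(2023, 11, 30)},
]

# The same function runs on both sets, so their columns always match.
train_features = [add_features(r) for r in train_records]
test_features = [add_features(r) for r in test_records]
```

Because the logic lives in one place, a later change to a feature definition reaches the training and testing sets at the same time.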
Split the data
Randomly divide the records in the dataset into a training set and a testing set. For a more reliable assessment of model performance, generate multiple training and testing sets using cross-validation.
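The two splitting strategies can be sketched with the standard library alone. The helper names `split_train_test` and `k_fold_splits` are hypothetical, hand-written for illustration; libraries such as scikit-learn ship their own equivalents, which are usually the better choice in practice.

```python
import random

def split_train_test(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_splits(records, k=5, seed=0):
    """Yield k (train, test) pairs; each record lands in exactly one test fold."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled records into k folds, round-robin.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

records = list(range(20))  # stand-in for real rows
train, test = split_train_test(records, test_fraction=0.25, seed=42)
folds = list(k_fold_splits(records, k=5, seed=42))
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.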
Modeling
Hyperparameter tuning
For each model, use hyperparameter tuning techniques to improve model performance.
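A simple tuning technique is grid search: try each candidate value and keep the one that scores best. The sketch below tunes `k` for a toy one-dimensional k-nearest-neighbours classifier. The data, the candidate grid, and the use of a separate validation set (rather than the testing set) are all assumptions made for illustration.

```python
def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def accuracy(train, validation, k):
    """Fraction of validation points that k-NN labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in validation)
    return correct / len(validation)

# Toy (feature, label) pairs; (6.8, 0) is a deliberately noisy point.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.2, 1),
         (6.0, 1), (6.5, 1), (7.0, 1), (6.8, 0)]
validation = [(1.2, 0), (1.9, 0), (6.2, 1), (6.85, 1)]

# Grid search: score every candidate k and keep the best one.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

Scoring candidates on data held out from training, and keeping the testing set untouched until the final assessment, avoids an optimistic estimate of performance.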
Train your models
Fit each model to the training set.
Make predictions
Make predictions on the testing set.
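The fit-then-predict pattern can be illustrated with a toy nearest-centroid classifier. The class, its method names, and the data are hypothetical; the `fit`/`predict` naming mirrors a common library convention but is not tied to any specific package.

```python
class NearestCentroid:
    """Toy classifier: predicts the class whose mean feature value is closest."""

    def fit(self, X, y):
        # Training: compute the mean feature value of each class.
        sums, counts = {}, {}
        for x, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + x
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, X):
        # Prediction: pick the class with the nearest centroid for each input.
        return [
            min(self.centroids_, key=lambda c: abs(self.centroids_[c] - x))
            for x in X
        ]

X_train = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
y_train = ["low", "low", "low", "high", "high", "high"]
X_test = [2.5, 7.5, 4.0]

model = NearestCentroid().fit(X_train, y_train)  # learns from the training set only
predictions = model.predict(X_test)              # applied to unseen testing data
```

The model never sees the testing labels during `fit`, so its predictions on `X_test` reflect how it handles new data.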
Assess model performance
For each model, calculate performance metrics such as accuracy, precision, and recall on the testing set.
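For a binary classifier, these three metrics follow directly from counts of true positives, false positives, and false negatives. The sketch below uses made-up labels and a hypothetical helper, `classification_metrics`. Returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Of everything predicted positive, how much was truly positive?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of everything truly positive, how much did the model find?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up testing labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Accuracy alone can mislead on imbalanced data, which is why precision and recall are usually reported alongside it.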
Deployment
Deploy the model
Embed the model you chose in dashboards, applications, or wherever you need it.
Monitor model performance
Regularly test the performance of your model as your data changes so that you catch model drift early.
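A basic monitoring check compares accuracy on recently labelled predictions against the accuracy measured at deployment time. The function name, the 0.05 tolerance, and the sample outcomes below are assumptions for illustration. Real monitoring setups often track input distributions as well.

```python
def detect_performance_drop(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Return (recent_accuracy, flag); flag is True when accuracy has dropped
    more than `tolerance` below the baseline."""
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy, recent_accuracy < baseline_accuracy - tolerance

# Hypothetical outcomes: True where a recent prediction turned out correct.
recent = [True, True, False, True, False, False, True, True, False, True]

# Baseline of 0.80 is an assumed accuracy recorded at deployment time.
recent_accuracy, needs_attention = detect_performance_drop(0.80, recent)
```

Running a check like this on a schedule turns monitoring into a routine signal for when retraining is worth considering.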
Improve your model
Continuously iterate on and improve your model after deployment. Replace your model with an updated version to improve performance.