
Supervised Machine Learning Cheat Sheet

In this cheat sheet, you'll find a guide to the top supervised machine learning algorithms, their advantages and disadvantages, and their use cases.
Dec 2022  · 5 min read

When working with machine learning models, it's easy to try them all out without understanding what each model does and when to use them. In this cheat sheet, you'll find a handy guide describing the most widely used supervised machine learning models, their advantages, disadvantages, and some key use cases.


Supervised Learning

Supervised learning models map inputs to outputs and attempt to apply patterns learned from past data to unseen data. They can be either regression models, which predict a continuous variable such as a stock price, or classification models, which predict a binary or multi-class variable, such as whether a customer will churn. The sections below cover three groups of popular supervised learning models: regression-only models, models for both regression and classification, and classification-only models.

Regression Only Models

Each algorithm below is listed with its description and application, advantages, and disadvantages.
Linear Regression

Description and application: Linear Regression models a linear relationship between input variables and a continuous numerical output variable. The default loss function is the mean squared error (MSE).

Advantages:
  1. Fast training because there are few parameters.
  2. Interpretable and explainable through its output coefficients.
Disadvantages:
  1. Assumes a linear relationship between input and output variables.
  2. Sensitive to outliers.
  3. Typically generalizes worse than ridge or lasso regression.
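As a minimal sketch of the description above, the snippet below fits a line by ordinary least squares and reports the MSE. It assumes NumPy is available; the toy data and variable names are illustrative and not part of the original cheat sheet.

```python
import numpy as np

# Toy data generated from y = 2x + 1 (noise-free for clarity).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 1.0

# Append a column of ones so the intercept is learned as a coefficient.
X_design = np.hstack([X, np.ones((X.shape[0], 1))])

# Ordinary least squares: the coefficients that minimize the mean squared error.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
slope, intercept = coef

predictions = X_design @ coef
mse = np.mean((y - predictions) ** 2)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, mse={mse:.4f}")
```

The fitted coefficients can be read directly as "change in output per unit change in input", which is where the interpretability advantage comes from.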
Polynomial Regression

Description and application: Polynomial Regression models nonlinear relationships between the dependent and independent variables as an n-th degree polynomial.

Advantages:
  1. Provides a good approximation of the relationship between the dependent and independent variables.
  2. Capable of fitting a wide range of curvature.
Disadvantages:
  1. Poor interpretability of the coefficients, since the underlying variables can be highly correlated.
  2. The fitted curve is nonlinear in the inputs, but the regression function is still linear in its coefficients.
  3. Prone to overfitting.
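The sketch below illustrates the point that a polynomial fit is a linear regression on expanded features. It assumes NumPy; the quadratic toy data and the helper name `fit_polynomial` are hypothetical.

```python
import numpy as np

# Toy data from a quadratic curve: y = 0.5x^2 - x + 2.
x = np.linspace(-3.0, 3.0, 13)
y = 0.5 * x**2 - x + 2.0

def fit_polynomial(x, y, degree):
    """Fit a degree-n polynomial by least squares on expanded features."""
    # Columns are x^degree, ..., x^1, x^0: the model is linear in these columns.
    features = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coef, features @ coef

linear_coef, linear_pred = fit_polynomial(x, y, degree=1)
quad_coef, quad_pred = fit_polynomial(x, y, degree=2)

linear_mse = np.mean((y - linear_pred) ** 2)
quad_mse = np.mean((y - quad_pred) ** 2)
print(f"degree 1 MSE: {linear_mse:.4f}")
print(f"degree 2 MSE: {quad_mse:.4f}")
```

Raising `degree` well beyond what the data supports is the usual route to the overfitting listed among the disadvantages.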
Support Vector Regression

Description and application: Support Vector Regression (SVR) uses the same principle as SVMs but optimizes the cost function to fit the flattest possible line (or hyperplane) that keeps the data points within a margin of tolerance. With the kernel trick, it can efficiently perform non-linear regression by implicitly mapping its inputs into high-dimensional feature spaces.

Advantages:
  1. Robust against outliers.
  2. Effective learning and strong generalization performance.
  3. Different kernel functions can be specified for the decision function.
Disadvantages:
  1. Does not perform well with large datasets.
  2. Tends to underfit in cases where the number of variables is much smaller than the number of observations.
Gaussian Process Regression

Description and application: Gaussian Process Regression (GPR) uses a Bayesian approach that infers a probability distribution over the possible functions that fit the data. The Gaussian process is a prior that is specified as a multivariate Gaussian distribution.

Advantages:
  1. Provides uncertainty measures on the predictions.
  2. A flexible non-linear model that fits many datasets well.
  3. Performs well on small datasets, as the GP kernel allows a prior to be specified on the function space.
Disadvantages:
  1. Poor choice of kernel can make convergence slow.
  2. Choosing and specifying kernels requires deep mathematical understanding.
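To make the "uncertainty measures" advantage concrete, here is a minimal sketch of the GP posterior for a zero-mean prior with a squared-exponential kernel. It assumes NumPy; the kernel choice, length scale, jitter value, and sample points are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

# A handful of training points sampled from sin(x).
x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([0.5, 5.0])  # one point near the data, one far away

noise = 1e-6  # small jitter on the diagonal for numerical stability
K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)

# Posterior mean and covariance of a zero-mean GP conditioned on the data.
post_mean = K_s.T @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

for x, m, s in zip(x_test, post_mean, post_std):
    print(f"x={x:.1f}: mean={m:.3f}, std={s:.3f}")
```

The predicted standard deviation is small near the training points and grows toward the prior value far from them, which is the behavior the kernel encodes.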
Robust Regression

Description and application: Robust Regression is an alternative to least squares regression when data is contaminated with outliers. The term “robust” refers to the statistical capability to provide useful information even in the face of outliers.

Advantages:
  1. Designed to overcome some limitations of traditional parametric and non-parametric methods.
  2. Provides better regression coefficient estimates than classical regression methods when outliers are present.
Disadvantages:
  1. More computationally intensive than classical regression methods.
  2. Not a cure-all for every data problem, such as imbalanced or poor-quality data.
  3. If no outliers are present in the data, it may not provide better results than classical regression methods.
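The cheat sheet does not name a specific robust method, so the sketch below uses Huber-style iteratively reweighted least squares as one possible example. It assumes NumPy; the toy data, the `delta` value, and the function names are hypothetical.

```python
import numpy as np

# Points on the line y = 3x, with one grossly corrupted observation.
x = np.arange(10, dtype=float)
y = 3.0 * x
y[9] = -50.0  # outlier

X = np.column_stack([x, np.ones_like(x)])  # slope column, intercept column

def ols(X, y):
    """Ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def huber_irls(X, y, delta=1.0, n_iter=50):
    """Huber regression via iteratively reweighted least squares."""
    coef = ols(X, y)
    for _ in range(n_iter):
        residuals = y - X @ coef
        abs_res = np.maximum(np.abs(residuals), 1e-12)
        # Full weight for small residuals, down-weighted beyond delta.
        weights = np.where(abs_res <= delta, 1.0, delta / abs_res)
        sqrt_w = np.sqrt(weights)
        coef, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef

ols_coef = ols(X, y)
robust_coef = huber_irls(X, y)
print(f"OLS slope:    {ols_coef[0]:.2f}")
print(f"robust slope: {robust_coef[0]:.2f}")
```

The repeated refits are also why robust methods cost more compute than a single least squares solve.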

Both Regression and Classification Models

Each algorithm below is listed with its description and application, advantages, and disadvantages.
Decision Trees

Description and application: Decision Tree models learn from the data by building decision rules on the variables that separate the classes in a flowchart-like tree structure. They can be used for both regression and classification.

Advantages:
  1. Explainable and interpretable.
  2. Can handle missing values.
Disadvantages:
  1. Prone to overfitting.
  2. Can be unstable with minor data drift.
  3. Sensitive to outliers.
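A single decision rule of the kind described above can be sketched as a search for the threshold that best separates the classes. This assumes NumPy and uses Gini impurity as the split criterion, which is one common choice; the feature, labels, and function names are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def best_split(feature, labels):
    """Return the threshold on one feature that minimizes weighted Gini impurity."""
    best_threshold, best_impurity = None, np.inf
    values = np.unique(feature)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left, right = labels[feature <= threshold], labels[feature > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

ages = np.array([22, 25, 31, 35, 48, 52, 60, 64], dtype=float)  # hypothetical feature
churned = np.array([1, 1, 1, 1, 0, 0, 0, 0])

threshold, impurity = best_split(ages, churned)
print(f"rule: age <= {threshold} -> churn (weighted Gini {impurity:.2f})")
```

A full tree applies this search recursively to each resulting subset, which is what makes deep trees prone to memorizing the training data.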
Random Forest

Description and application: Random Forest models learn using an ensemble of decision trees. For classification, the output of the random forest is based on a majority vote of the different decision trees; for regression, it is the average of their predictions.

Advantages:
  1. Effective learning and better generalization performance.
  2. Can handle moderately large datasets.
  3. Less prone to overfitting than decision trees.
Disadvantages:
  1. A large number of trees can slow down performance.
  2. Predictions are sensitive to outliers.
  3. Hyperparameter tuning can be complex.
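The aggregation step can be sketched in isolation. The snippet below assumes NumPy and uses made-up per-tree outputs rather than trained trees; it shows a majority vote for class labels and, as the usual regression counterpart, an average of the trees' numeric outputs.

```python
import numpy as np

# Hypothetical outputs of five trees for three samples (rows = trees).
tree_class_votes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
])
# Hypothetical outputs of three regression trees for two samples.
tree_regression_outputs = np.array([
    [10.0, 20.0],
    [12.0, 18.0],
    [11.0, 22.0],
])

def majority_vote(votes):
    """Most frequent class per column (one column per sample)."""
    return np.array([np.bincount(column).argmax() for column in votes.T])

forest_classes = majority_vote(tree_class_votes)
forest_values = tree_regression_outputs.mean(axis=0)
print(forest_classes, forest_values)
```

Because each prediction requires every tree to be evaluated, prediction time grows with the number of trees.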
Gradient Boosting

Description and application: An ensemble learning method where weak predictive learners are combined to improve accuracy. Popular implementations include XGBoost and LightGBM.

Advantages:
  1. Handles multicollinearity.
  2. Handles non-linear relationships.
  3. Effective learning and strong generalization performance.
  4. XGBoost is fast and is often used as a benchmark algorithm.
Disadvantages:
  1. Sensitive to outliers, which can lead to overfitting.
  2. High complexity due to hyperparameter tuning.
  3. Computationally expensive.
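As a rough illustration of combining weak learners, the sketch below boosts one-split regression trees on the residuals of a squared-error loss. It assumes NumPy and is not how XGBoost or LightGBM are implemented; the learning rate, number of rounds, and function names are illustrative.

```python
import numpy as np

def fit_stump(x, residuals):
    """Weak learner: a one-split regression tree fitted to the residuals."""
    best = None
    values = np.unique(x)
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left_mask = x <= threshold
        left_value = residuals[left_mask].mean()
        right_value = residuals[~left_mask].mean()
        pred = np.where(left_mask, left_value, right_value)
        error = np.sum((residuals - pred) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

x = np.linspace(0.0, 6.0, 40)
y = np.sin(x)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction  # negative gradient of the squared error
    stump = fit_stump(x, residuals)
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after 50 rounds: {mse_history[-1]:.4f}")
```

The learning rate and the number of rounds are two of the hyperparameters that make tuning these models involved.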
Ridge Regression

Description and application: Ridge Regression penalizes variables with low predictive value by shrinking their coefficients towards, but not exactly to, zero (an L2 penalty). It can be used for classification and regression.

Advantages:
  1. Less prone to overfitting.
  2. Best suited when data suffers from multicollinearity.
  3. Explainable & Interpretable.
Disadvantages:
  1. All the predictors are kept in the final model.
  2. Doesn't perform feature selection.
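The shrinkage effect can be seen with the closed-form ridge solution on two nearly identical features. This is a minimal sketch assuming NumPy, no intercept, and synthetic data; `alpha` is the penalty strength and `ridge_fit` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # nearly a copy of x1: multicollinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in [0.0, 1.0, 100.0]:
    coef = ridge_fit(X, y, alpha)
    print(f"alpha={alpha:>5}: coef={np.round(coef, 3)}, L2 norm={np.linalg.norm(coef):.3f}")
```

Larger penalties give smaller coefficients, but both features stay in the model, which matches the two disadvantages listed above.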
Lasso Regression

Description and application: Lasso Regression penalizes features that have low predictive value by shrinking their coefficients to exactly zero (an L1 penalty). It can be used for classification and regression.

Advantages:
  1. Good generalization performance.
  2. Good at handling datasets where the number of variables is much larger than the number of observations.
  3. No need for a separate feature selection step.
Disadvantages:
  1. Poor interpretability/explainability, as it can keep only a single variable from a set of highly correlated variables.
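To show how coefficients reach exactly zero, the sketch below solves one common formulation of the lasso objective with coordinate descent and soft-thresholding. It assumes NumPy, no intercept, and synthetic data; the `alpha` value and function names are illustrative.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink toward zero and clip to exactly zero inside [-threshold, threshold]."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ coef + X[:, j] * coef[j]
            rho = X[:, j] @ partial_residual / n_samples
            scale = X[:, j] @ X[:, j] / n_samples
            coef[j] = soft_threshold(rho, alpha) / scale
    return coef

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features carry signal; the other three are noise.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

coef = lasso_coordinate_descent(X, y, alpha=0.5)
print(np.round(coef, 3))
```

The surviving coefficients are also shrunk relative to the true values of 4 and -2, which is the price paid for the built-in selection.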
AdaBoost

Description and application: Adaptive Boosting (AdaBoost) uses an ensemble of weak learners whose outputs are combined into a weighted sum that represents the final output of the boosted classifier.

Advantages:
  1. Explainable & Interpretable.
  2. Less need for tweaking parameters.
  3. Usually outperforms Random Forest.
  4. Less prone to overfitting, as the input variables are not jointly optimized.
Disadvantages:
  1. Sensitive to noisy data and outliers.

Classification Only Models

Each algorithm below is listed with its description and application, advantages, and disadvantages.
SVM

Description and application: In its simplest form, a support vector machine (SVM) is a linear classifier. With the kernel trick, however, it can efficiently perform non-linear classification by implicitly mapping its inputs into high-dimensional feature spaces. This flexibility makes SVM one of the strongest general-purpose prediction methods.

Advantages:
  1. Effective in cases with a high number of variables.
  2. Number of variables can be larger than the number of samples.
  3. Different kernel functions can be specified for the decision function.
Disadvantages:
  1. Sensitive to overfitting; regularization is crucial.
  2. Choosing a “good” kernel function can be difficult.
  3. Computationally expensive for big data due to high training complexity.
  4. Performs poorly if the data is noisy (target classes overlap).
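The kernel trick itself can be checked numerically: a kernel evaluated in the original space gives the same value as a dot product in the mapped feature space. The sketch assumes NumPy and uses a homogeneous degree-2 polynomial kernel on 2-D inputs as an illustrative choice; it does not train a classifier.

```python
import numpy as np

def explicit_feature_map(v):
    """Degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def polynomial_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel, computed in the original space."""
    return (a @ b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = explicit_feature_map(a) @ explicit_feature_map(b)
implicit = polynomial_kernel(a, b)
print(explicit, implicit)
```

The kernel never builds the mapped vectors, which is what keeps non-linear classification tractable when the implied feature space is very large.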
Nearest Neighbors

Description and application: Nearest Neighbors predicts the label based on a predefined number of samples closest in distance to the new point.

Advantages:
  1. Successful in situations where the decision boundary is irregular.
  2. Non-parametric approach, as it does not make any assumptions about the underlying data.
Disadvantages:
  1. Sensitive to noisy and missing data.
  2. Computationally expensive because the entire set of n points is required for every prediction.
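A minimal sketch of the prediction rule, assuming NumPy, Euclidean distance, and integer class labels; the training points and the function name `knn_predict` are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the majority label among the k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # scans every training point
    nearest = np.argsort(distances)[:k]
    return np.bincount(y_train[nearest]).argmax()

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.8, 1.5])))
print(knn_predict(X_train, y_train, np.array([6.2, 6.4])))
```

There is no training step here: all the work, including the distance computation over the full training set, happens at prediction time.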
Logistic Regression (and its extensions)

Description and application: Logistic regression models a linear relationship between the input variables and the log-odds of the response variable. It models the output as the probability of a binary outcome (0 or 1) rather than a continuous numeric value.

Advantages:
  1. Explainable & Interpretable.
  2. Less prone to overfitting when regularization is used.
  3. Applicable for multi-class predictions.
Disadvantages:
  1. Makes a strong assumption about the relationship between input and response variables.
  2. Multicollinearity can cause the model to easily overfit without regularization.
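The sketch below fits a binary logistic regression by plain gradient descent on the log loss, with an optional L2 penalty. It assumes NumPy; the data, learning rate, iteration count, and function names are illustrative, and production libraries typically use more sophisticated solvers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, n_iter=2000, l2=0.0):
    """Gradient descent on the mean log loss with an optional L2 penalty."""
    X_design = np.hstack([X, np.ones((X.shape[0], 1))])  # last column = intercept
    weights = np.zeros(X_design.shape[1])
    for _ in range(n_iter):
        probabilities = sigmoid(X_design @ weights)
        gradient = X_design.T @ (probabilities - y) / len(y)
        gradient[:-1] += l2 * weights[:-1]  # do not penalize the intercept
        weights -= learning_rate * gradient
    return weights

def predict_proba(X, weights):
    X_design = np.hstack([X, np.ones((X.shape[0], 1))])
    return sigmoid(X_design @ weights)

# Hypothetical data: hours of product usage vs. whether the customer churned.
hours = np.array([[0.5], [1.0], [1.5], [2.0], [4.0], [4.5], [5.0], [5.5]])
churned = np.array([1, 1, 1, 1, 0, 0, 0, 0])

weights = fit_logistic(hours, churned, l2=0.01)
probabilities = predict_proba(hours, weights)
predicted_labels = (probabilities >= 0.5).astype(int)
print(np.round(probabilities, 2), predicted_labels)
```

The sign of the fitted weight gives the direction of the effect: a negative weight on `hours` means more usage lowers the predicted churn probability.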
Linear Discriminant Analysis

Description and application: Linear Discriminant Analysis (LDA) finds a linear combination of features whose linear decision boundary maximizes the separability between the classes.

Advantages:
  1. Explainable & Interpretable.
  2. Applicable for multi-class predictions.
Disadvantages:
  1. Multicollinearity can cause the model to overfit.
  2. Assumes that all classes share the same covariance matrix.
  3. Sensitive to outliers.
  4. Doesn't work well with small class sizes.
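For two classes, the discriminant direction can be sketched directly from the class means and a pooled covariance estimate. This assumes NumPy, equal class priors, and synthetic Gaussian data that satisfies the shared-covariance assumption; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
shared_cov = np.array([[1.0, 0.6], [0.6, 1.0]])  # both classes share this covariance
class0 = rng.multivariate_normal([0.0, 0.0], shared_cov, size=100)
class1 = rng.multivariate_normal([3.0, 3.0], shared_cov, size=100)

mean0, mean1 = class0.mean(axis=0), class1.mean(axis=0)
# Pooled within-class covariance estimate.
centered = np.vstack([class0 - mean0, class1 - mean1])
pooled_cov = centered.T @ centered / (len(centered) - 2)

# Direction that maximizes between-class separation relative to within-class spread.
direction = np.linalg.solve(pooled_cov, mean1 - mean0)
# With equal class priors, the boundary passes through the midpoint of the means.
threshold = direction @ (mean0 + mean1) / 2.0

def predict(points):
    return (points @ direction > threshold).astype(int)

accuracy = np.mean(np.concatenate([predict(class0) == 0, predict(class1) == 1]))
print(f"training accuracy: {accuracy:.2f}")
```

The covariance estimate is where small class sizes and collinear features cause trouble, since the pooled matrix becomes poorly conditioned or unreliable.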
