
Overview of Encoding Methodologies

In this tutorial, you will get a glimpse of encoding techniques along with some advanced references that will help you tackle categorical variables.
Dec 2018  · 5 min read

"In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values". (Practical Business Python) Since machine learning is based on mathematical equations, it would cause a problem when we keep categorical variables as is. Many algorithms support categorical values without further manipulation, but in those cases, it’s still a topic of discussion on whether to encode the variables or not. The algorithms that do not support categorical values, in that case, are left with encoding methodologies. Let’s discuss some methods here.

Encoding Methodologies

Label encoding

In label encoding, we map each category to a number or a label. The labels are assigned arbitrarily and have no relationship to one another, so categories that are similar or close to each other lose that information after encoding.
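A minimal sketch with scikit-learn (the column name and data are made up for illustration; note that scikit-learn's LabelEncoder is intended for target labels, and OrdinalEncoder is the usual choice for feature columns):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

le = LabelEncoder()
df["color_label"] = le.fit_transform(df["color"])
print(df)
# blue -> 0, green -> 1, red -> 2: the numeric order is alphabetical
# and carries no real-world meaning
```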

Frequency encoding

Frequency encoding uses the frequency of each category as its label. In cases where the frequency is somewhat related to the target variable, it helps the model assign weight in direct or inverse proportion, depending on the nature of the data.
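A minimal sketch using pandas (the "city" column is a made-up example); the relative frequency of each category on the training data becomes its encoded value:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "SF", "NY", "LA", "NY", "SF"]})

# Relative frequency of each category, computed on the training data
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)
print(df)
# NY -> 0.5, SF -> 0.333..., LA -> 0.166...
```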

One-hot encoding

In this method, we map each category to a vector of 1s and 0s that denotes the presence or absence of that category. The number of vectors depends on how many categories we want to keep. For high-cardinality features, this method produces a lot of columns, which slows down learning significantly. There is some debate over one-hot encoding versus dummy encoding and when to use which. They are much alike, except that one-hot encoding produces as many columns as there are categories, while dummy encoding produces one fewer. The choice should ultimately be handled by the modeler during the validation process.
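A minimal sketch with pandas (made-up data) showing both variants; `drop_first=True` gives the dummy encoding with one fewer column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green"]})

# One-hot: one column per category (3 categories -> 3 columns)
print(pd.get_dummies(df, columns=["color"]))

# Dummy: first category dropped (3 categories -> 2 columns)
print(pd.get_dummies(df, columns=["color"], drop_first=True))
```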

Dummy variable trap

Suppose a categorical feature has two values, say "red" and "blue". After one-hot encoding, there will be two columns representing the feature, but looking closely, one vector is enough to represent both values: a row that is not "red" must be "blue", so the second column carries no new information and the two columns always sum to a constant. This redundancy is the dummy variable trap. Dropping one vector leaves less data for the algorithm to train on, often helps the model perform better, and makes the model easier to explain. In real-life scenarios, we have to handle many categorical features, sometimes with prior knowledge of how they relate to the outcome. Encoding lets you get these features into the model, but it can also make the preprocessing and learning a heavy task.
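A quick demonstration of the redundancy (made-up data): with two categories, the two one-hot columns always sum to one, so one of them is perfectly predictable from the other:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "blue", "red"]})
ohe = pd.get_dummies(df["color"], dtype=int)

# The columns are perfectly collinear: blue + red == 1 on every row
print((ohe["blue"] + ohe["red"]).unique())  # [1]
```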

Target or Impact or Likelihood encoding

Target encoding is similar to label encoding, except that here the labels are derived directly from the target. For example, in mean target encoding, the label for each category in the feature is the mean value of the target variable on the training data for that category. This encoding method brings out the relation between similar categories, but the relations are bounded within the categories and the target itself. The advantages of mean target encoding are that it does not increase the volume of the data and it helps the model learn faster; the disadvantage is that it is harder to validate, because it can leak target information into the features. Regularization is required when implementing this encoding methodology. Target-encoder implementations are available in both Python and R.
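A minimal sketch of mean target encoding with smoothing as a simple form of regularization (the data and the smoothing strength `m` are made up; production implementations typically also use cross-fold estimation to limit target leakage):

```python
import pandas as pd

train = pd.DataFrame({
    "city":   ["NY", "NY", "SF", "SF", "LA"],
    "target": [1, 0, 1, 1, 0],
})

global_mean = train["target"].mean()
stats = train.groupby("city")["target"].agg(["mean", "count"])

# Shrink rare categories toward the global mean; m controls the strength
m = 2
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train["city_te"] = train["city"].map(smoothed)
print(train)
```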

Encoding Statistics

Encoding variability describes how much the encoded values vary for the individual members of a category. For one-hot encoding, the variability depends on a choice made at implementation time: how many categories to keep, that is, which categories have a sufficient impact on the target. Other encoding methodologies show significant variability that only becomes apparent during validation.

To measure encoding distinguishability, we take the labels assigned to the categories and compute their standard deviation. This shows how distinguishable the encoded categories are: whether they sit far apart from each other or are clustered in one area.
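One way to read this as code (a sketch; the frequency-style labels below are just an example of numeric labels whose spread can be measured):

```python
import pandas as pd

labels = pd.Series({"NY": 0.50, "SF": 0.33, "LA": 0.17})  # e.g. frequency labels

# A large standard deviation means the encoded categories sit far apart;
# a small one means they are clustered and hard to distinguish
print(labels.std())
```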

Keep in mind that if you go for a simple encoding style when building a model, the model tends to be biased (for a better understanding, please visit this discussion). You also have to match the encoding methodology to the model: for regression you can use mean target encoding, while for classification you can go for relative frequency encoding.

Lastly, I'd like to share a library of scikit-learn-style transformers that helps in encoding categorical variables using different techniques.

category_encoders

This scikit-learn-compatible library currently implements the following techniques:

  • Ordinal
  • One-Hot
  • Binary
  • Helmert Contrast
  • Sum Contrast
  • Polynomial Contrast
  • Backward Difference Contrast
  • Hashing
  • BaseN
  • LeaveOneOut
  • Target Encoding

It accepts pandas DataFrames as input and can transform the data. It also supports scikit-learn pipelines; for details, visit the documentation.
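A minimal usage sketch (made-up data; BinaryEncoder is one of the techniques listed above):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
y = pd.Series([1, 0, 1, 0])

# Binary encoding: category -> ordinal integer -> base-2 digit columns
encoder = ce.BinaryEncoder(cols=["color"])
encoded = encoder.fit_transform(df, y)
print(encoded)
```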

If you would like to learn more on scikit-learn, take DataCamp's Supervised Learning with scikit-learn course.
