
Overview of Encoding Methodologies

In this tutorial, you will get a glimpse of encoding techniques, along with some advanced references, to help you tackle categorical variables.
Dec 2018  · 5 min read

"In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values". (Practical Business Python) Since machine learning is based on mathematical equations, it would cause a problem when we keep categorical variables as is. Many algorithms support categorical values without further manipulation, but in those cases, it’s still a topic of discussion on whether to encode the variables or not. The algorithms that do not support categorical values, in that case, are left with encoding methodologies. Let’s discuss some methods here.

Encoding Methodologies

Label encoding

In label encoding, we map each category to a number or a label. The labels chosen for the categories have no relationship to one another, so categories that are related or close to each other lose that information after encoding.
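
As a minimal sketch (the column and values are made up for illustration), label encoding with scikit-learn's LabelEncoder might look like this:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

    # LabelEncoder assigns an arbitrary integer per category
    # (alphabetical here: blue=0, green=1, red=2); the numbers
    # imply no order or distance between categories.
    encoder = LabelEncoder()
    df["color_label"] = encoder.fit_transform(df["color"])
    print(df)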

Frequency encoding

Frequency encoding uses the frequency of each category as its label. In cases where the frequency is somewhat related to the target variable, it helps the model weight categories in direct or inverse proportion to their frequency, depending on the nature of the data.
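
A minimal sketch with pandas (data invented for the example): map each category to its relative frequency computed on the training set.

    import pandas as pd

    df = pd.DataFrame({"city": ["NY", "NY", "NY", "SF", "SF", "LA"]})

    # Relative frequency of each category in the training data.
    freq = df["city"].value_counts(normalize=True)
    df["city_freq"] = df["city"].map(freq)
    print(df)  # NY -> 0.5, SF -> 0.333..., LA -> 0.166...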

One-hot encoding

In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of vectors depends on the number of categories we want to keep. For high-cardinality features, this method produces a large number of columns, which slows down learning significantly. There is some debate about one-hot encoding versus dummy encoding and when to use each. They are much alike, except that one-hot encoding produces a number of columns equal to the number of categories, while dummy encoding produces one fewer. The choice should ultimately be handled by the modeler in the validation process.
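
A minimal sketch with pandas (example data invented), showing the one-hot form:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "green"]})

    # One 0/1 indicator column per category.
    one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
    print(one_hot)
    #    color_blue  color_green  color_red
    # 0           0            0          1
    # 1           1            0          0
    # 2           0            1          0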

Dummy variable trap

Suppose a categorical feature has two values, say "red" and "blue". After one-hot encoding, two columns represent the feature, but on closer inspection one vector is enough to represent both values: one column denotes "red" and the other "blue", yet whenever one is 1 the other is 0, so the second column carries no new information. We can therefore drop one vector, which means less data for the algorithm to train on. This helps the model perform better and makes its workings easier to explain. In real-life scenarios, we have to cater to many categorical features, sometimes with prior knowledge of how they relate to the results. Encoding practices let you tackle such inputs, but they can also make preprocessing and learning a massive task.
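
To avoid the trap, pandas can drop one redundant indicator column directly; a minimal sketch:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "blue", "red", "blue"]})

    # drop_first=True drops the first category's column ("blue" here),
    # which is fully implied by the remaining one.
    dummies = pd.get_dummies(df["color"], prefix="color",
                             drop_first=True, dtype=int)
    print(dummies)
    #    color_red
    # 0          1
    # 1          0
    # 2          1
    # 3          0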

Target or Impact or Likelihood encoding

Target encoding is similar to label encoding, except here the labels are correlated directly with the target. For example, in mean target encoding, the label for each category is the mean value of the target variable over the training data for that category. This encoding method brings out the relation between similar categories, but those relations are bounded by the categories and the target itself. The advantages of mean target encoding are that it does not increase the volume of the data and helps the model learn faster; the disadvantage is that it is harder to validate. Regularization is required when implementing this encoding methodology. Visit the target encoder in Python and R.
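
A minimal sketch of mean target encoding with pandas (toy data; a real implementation would add smoothing or cross-fold regularization to limit target leakage):

    import pandas as pd

    df = pd.DataFrame({
        "city":   ["NY", "NY", "SF", "SF", "LA", "LA"],
        "target": [1, 0, 1, 1, 0, 0],
    })

    # Each category is replaced by the mean of the target over that
    # category, computed on the training data only.
    means = df.groupby("city")["target"].mean()
    df["city_te"] = df["city"].map(means)
    print(df)  # NY -> 0.5, SF -> 1.0, LA -> 0.0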

Encoding Statistics

Encoding variability describes how the encoded values vary across the individual categories. For one-hot encoding, the variability depends on the implementation-time decision of how many categories to keep, namely those that have sufficient impact on the target. Other encoding methodologies show significant variability that is identified at validation time.

To calculate encoding distinguishability, we take the labels assigned to the categories and compute their standard deviation. This shows how distinguishable the encoded categories are: whether they are spread far apart or clustered in one area.
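
One way to read this, sketched with hypothetical encoded labels:

    import pandas as pd

    # Hypothetical labels per category, e.g. from mean target encoding.
    encoded_labels = pd.Series({"NY": 0.5, "SF": 1.0, "LA": 0.0})

    # A large standard deviation suggests well-separated categories;
    # a small one suggests they are clustered together.
    print(encoded_labels.std())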

Keep in mind that if you go for a simple encoding style when producing a model, the model tends to be biased (for a better understanding, please visit this discussion). You also have to match your encoding methodology to the model: for regression, you can use mean target encoding, while for classification you can go for relative frequency encoding.

Lastly, I'd like to share a library of scikit-learn-style transformers that help in encoding categorical variables using different techniques.

category_encoders

This scikit-learn-compatible library currently implements the following techniques:

  • Ordinal
  • One-Hot
  • Binary
  • Helmert Contrast
  • Sum Contrast
  • Polynomial Contrast
  • Backward Difference Contrast
  • Hashing
  • BaseN
  • LeaveOneOut
  • Target Encoding

It supports pandas DataFrames as input and can transform the data directly. It also supports scikit-learn pipelines; for details, visit here.
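
A minimal usage sketch (toy data), both standalone and inside a scikit-learn pipeline:

    import pandas as pd
    import category_encoders as ce
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    X = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
    y = pd.Series([1, 0, 1, 0])

    # Standalone: fit_transform returns an encoded DataFrame.
    encoded = ce.TargetEncoder(cols=["color"]).fit_transform(X, y)
    print(encoded)

    # As the first step of a scikit-learn pipeline.
    pipe = Pipeline([
        ("encode", ce.BinaryEncoder(cols=["color"])),
        ("model", LogisticRegression()),
    ])
    pipe.fit(X, y)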

If you would like to learn more on scikit-learn, take DataCamp's Supervised Learning with scikit-learn course.
