Overview of Encoding Methodologies

In this tutorial, you will get a glimpse of encoding techniques along with some advanced references that will help you tackle categorical variables.

Dec 21, 2018 · 5 min read

"In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values". (Practical Business Python) Since machine learning is based on mathematical equations, it would cause a problem when we keep categorical variables as is. Many algorithms support categorical values without further manipulation, but in those cases, it’s still a topic of discussion on whether to encode the variables or not. The algorithms that do not support categorical values, in that case, are left with encoding methodologies. Let’s discuss some methods here.

Encoding Methodologies

Label encoding

In label encoding, we map each category to a number or a label. The numbers chosen carry no relationship to the categories themselves, so categories that are related or close to each other lose that information after encoding. The integers can also suggest an ordering that does not exist in the data.
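As a rough illustration of the mapping described above, here is a minimal sketch in plain Python. The function name `fit_label_encoding`, the sample data, and the choice to number categories in sorted order are assumptions made for this example, not part of any particular library.

```python
def fit_label_encoding(values):
    """Map each distinct category to an integer, numbered in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


colors = ["red", "blue", "green", "blue", "red"]
mapping = fit_label_encoding(colors)
encoded = [mapping[color] for color in colors]

print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # [2, 0, 1, 0, 2]
```

Note that "green" ends up between "blue" and "red" only because of alphabetical order, which is exactly the kind of arbitrary relationship a model may mistake for signal.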

Frequency encoding

Frequency encoding uses the frequency of each category as its label. When the frequency is related to the target variable, this helps the model understand the feature and assign weight in direct or inverse proportion, depending on the nature of the data.
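A minimal sketch of frequency encoding follows, assuming relative frequencies (counts divided by the number of rows) are used as labels; raw counts would work the same way. The helper name `fit_frequency_encoding` and the sample data are hypothetical.

```python
from collections import Counter


def fit_frequency_encoding(values):
    """Map each category to its relative frequency in the data."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


cities = ["paris", "rome", "paris", "oslo", "paris", "rome"]
frequencies = fit_frequency_encoding(cities)
encoded_cities = [frequencies[city] for city in cities]

print(frequencies["paris"])  # 0.5
```

One caveat of this approach is that two categories with the same frequency receive the same label and become indistinguishable to the model.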

One-hot encoding

In this method, we map each category to a vector of 1s and 0s, where a 1 denotes the presence of that category and a 0 its absence. The number of vectors depends on the number of categories we want to keep. For high-cardinality features, this method produces a lot of columns, which slows down learning significantly. There is much debate about one-hot encoding versus dummy encoding and when to use each. They are much alike, except that one-hot encoding produces as many columns as there are categories, while dummy encoding produces one fewer. The choice should ultimately be made by the modeler during the validation process.
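The difference in column count between the two schemes can be sketched as follows. This is an illustrative plain-Python helper (the name `one_hot` and its `drop_first` flag are assumptions for this example); dropping the first category in sorted order is one arbitrary choice of which column to omit.

```python
def one_hot(values, drop_first=False):
    """Return (column names, rows of 0/1 indicators) for a categorical feature.

    With drop_first=True the first category (in sorted order) gets no column,
    which corresponds to dummy encoding.
    """
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return kept, rows


sizes = ["small", "large", "medium", "small"]

columns, rows = one_hot(sizes)
print(columns)  # ['large', 'medium', 'small']
print(rows)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

dummy_columns, dummy_rows = one_hot(sizes, drop_first=True)
print(dummy_columns)  # ['medium', 'small']
print(dummy_rows)     # [[0, 1], [0, 0], [1, 0], [0, 1]]
```

In the dummy-encoded output, the dropped category "large" is represented by a row of all zeros.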

Dummy variable trap:

Suppose a categorical feature has two values, say “red” and “blue”. After one-hot encoding, there will be two columns representing the feature: one denoting “red” and the other “blue”. Look closely, though, and a single column is enough to represent both values, because the second column is fully determined by the first (it is always one minus the other). So we can drop one column, which leaves less data for the algorithm to train on. Removing this redundancy avoids perfect collinearity between the columns, which matters especially for linear models, and it makes the model easier to explain.

In real-life scenarios, we often have to handle many categorical features, sometimes with prior knowledge of how they relate to the results. Encoding practices let you handle the input, but they can also make preprocessing and learning a heavy task.
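The redundancy in the red/blue example can be checked directly. The snippet below is a small sketch with made-up data; it only shows that one indicator column can be reconstructed from the other.

```python
colors = ["red", "blue", "blue", "red"]

is_red = [1 if color == "red" else 0 for color in colors]
is_blue = [1 if color == "blue" else 0 for color in colors]

# Every row has exactly one active column, so the two columns always sum to 1.
row_sums = [red + blue for red, blue in zip(is_red, is_blue)]

# The "blue" column can be rebuilt from the "red" column alone.
rebuilt_blue = [1 - red for red in is_red]

print(row_sums)                 # [1, 1, 1, 1]
print(rebuilt_blue == is_blue)  # True
```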

Target or Impact or Likelihood encoding

Target encoding is similar to label encoding, except that here the labels are derived directly from the target. For example, in mean target encoding, the label for each category in the feature is the mean value of the target variable over the training data. This encoding method brings out the relation between similar categories, but the relations are bounded within the categories and the target itself. The advantages of mean target encoding are that it does not affect the volume of the data and helps in faster learning; the disadvantage is that it is harder to validate. Because the labels are computed from the target, the encoding can overfit, so regularization is required when implementing it. See the target encoder implementations in Python and R.
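Below is a minimal sketch of mean target encoding. The text does not say which form of regularization to use, so this example assumes additive smoothing, which pulls each category mean toward the global mean; the function name, the `smoothing` parameter, and the data are all hypothetical.

```python
from collections import defaultdict


def fit_mean_target_encoding(categories, targets, smoothing=0.0):
    """Return ({category: encoded value}, global mean of the target).

    Each category's target mean is shrunk toward the global mean as if
    `smoothing` extra observations at the global mean had been seen.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + smoothing * global_mean) / (counts[category] + smoothing)
        for category in counts
    }
    return encoding, global_mean


train_categories = ["a", "a", "b", "b", "b", "c"]
train_targets = [1, 0, 1, 1, 1, 0]

plain, global_mean = fit_mean_target_encoding(train_categories, train_targets)
print(plain)  # {'a': 0.5, 'b': 1.0, 'c': 0.0}

smoothed, _ = fit_mean_target_encoding(train_categories, train_targets, smoothing=2.0)

# Categories not seen during fitting can fall back to the global mean.
unseen_value = smoothed.get("z", global_mean)
```

With smoothing, the rare category "c" (a single row) moves much closer to the global mean than the better-supported category "b". The mapping should be fitted on training data only and then applied to validation data, otherwise the target leaks into the features.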

Encoding Statistics

Encoding variability describes how much the encoding varies within an individual category. For one-hot encoding, the variability is determined at implementation time, when you decide how many categories to keep, namely those with sufficient impact on the target. Other encoding methodologies can show significant variability, which is identified at validation time.

To calculate encoding distinguishability, we take the labels assigned to the categories and compute their standard deviation. This shows whether the encoded categories are spread far apart or clustered in one area.
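One possible reading of this calculation is sketched below: take the encoded value of each category and compute the standard deviation across categories. The use of the population standard deviation, the function name `distinguishability`, and the two example encodings are assumptions for illustration.

```python
from statistics import pstdev


def distinguishability(encoding):
    """Population standard deviation of the encoded values, one per category."""
    return pstdev(list(encoding.values()))


# Two hypothetical encodings of the same three categories.
spread_out = {"a": 0.1, "b": 0.5, "c": 0.9}
clustered = {"a": 0.48, "b": 0.50, "c": 0.52}

spread_score = distinguishability(spread_out)
clustered_score = distinguishability(clustered)

print(spread_score > clustered_score)  # True
```

A larger value suggests the categories are easier to tell apart after encoding, while a value near zero suggests they are clustered together.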

Keep in mind that if you choose a simple encoding style when building a model, the model tends to be biased; for a better understanding, please visit this discussion. You also have to choose your encoding methodology to suit the model: for regression, you can use mean target encoding, and for classification, you can use relative frequency encoding.

Lastly, I'd like to share a scikit-learn-style transformer that helps in encoding categorical variables using different techniques.