Encoding Categorical Variables

    An important preprocessing step in machine learning is converting categorical variables into a numerical format through encoding. This template will cover how to handle binary and ordered categorical variables with label encoding, as well as one-hot encoding for unordered categorical data.

    To swap in your dataset in this template, the following is required:

    • There must be at least one column with a categorical variable that you want to encode.
    • There must be no NaN/NA values. You can use this template to impute missing values if needed.
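
    Both requirements can be checked before any encoding is attempted. The following is a minimal sketch that uses a small made-up DataFrame in place of your data; the column names and values are illustrative assumptions, not part of the bank dataset.

```python
import pandas as pd

# Hypothetical stand-in for your dataset (one missing value in "job")
sample = pd.DataFrame({
    "job": ["admin", "technician", None, "services"],
    "age": [34, 51, 29, 42],
})

# Non-numeric columns are the usual candidates for encoding
categorical_candidates = sample.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column and list the columns that need imputation
missing_counts = sample.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(categorical_candidates)
print(columns_with_missing)
```

    If the second list is not empty, impute or drop those rows before moving on.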

    The placeholder dataset in this template is bank marketing data with details such as job, education, and marital status. Each row represents a different customer. You can find more information on this dataset's source and dictionary here.

    # Import packages
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder
    
    # Load the dataset into a DataFrame
    df = pd.read_csv("bank.csv")  # Replace with the file you want to use
    
    # Preview the DataFrame
    df

    Label Encoding

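    Label encoding replaces each categorical value with an integer (i.e., 0, 1, 2, ...). It is appropriate for binary data and for ordinal data (i.e., categorical data that has an inherent order). To label encode categorical data, you can use the LabelEncoder() class from sklearn. Note that LabelEncoder() assigns integers according to the sorted (alphabetical) order of the values, so for ordinal data you should confirm that the resulting integers match the intended order.

    Note: scikit-learn documents LabelEncoder() as a tool for encoding target labels. OrdinalEncoder() performs a similar operation on one or more input features, and its categories argument lets you set the order explicitly.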

    # Create a copy of the original DataFrame
    df_encoded = df.copy()
    
    # Specify the column you wish to label encode
    label_column = "education"
    
    # Initialize the LabelEncoder
    le = LabelEncoder()
    
    # Create a new column using the fit_transform method of the LabelEncoder
    df_encoded[label_column + "_enc"] = le.fit_transform(df_encoded[label_column])
    
    # Preview the original and encoded column
    df_encoded[[label_column, label_column + "_enc"]]
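
    When the alphabetical order of the categories differs from their real order, the integers produced by LabelEncoder() can be misleading. The sketch below contrasts the two encoders on a made-up "level" column (a hypothetical example, not a column in the bank dataset) and passes the intended order to OrdinalEncoder() explicitly.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up ordinal column for illustration only
toy = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

# LabelEncoder sorts the categories alphabetically: high=0, low=1, medium=2
alphabetical = LabelEncoder().fit_transform(toy["level"])

# OrdinalEncoder takes the order explicitly, from lowest to highest
level_order = ["low", "medium", "high"]
oe = OrdinalEncoder(categories=[level_order])

# OrdinalEncoder expects 2D input and returns floats, so flatten and cast
toy["level_enc"] = oe.fit_transform(toy[["level"]]).ravel().astype(int)

print(alphabetical)
print(toy)
```

    Only the second encoding preserves low < medium < high.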

    One-Hot Encoding Using pandas

    One-hot encoding converts each value in a categorical column into a new column containing 0s and 1s. The simplest way to one-hot encode columns in a DataFrame is to use pandas' get_dummies() function, which can encode either every categorical column or only the columns you name.

    The only required argument is the DataFrame that you wish to use. This example also uses two optional arguments:

    • columns allows you to choose which columns you wish to be encoded. All columns with an object or category data type will be encoded if this is not specified. Listing the columns explicitly is useful when some categorical columns contain many distinct values, because each value becomes its own column.
    • drop_first allows you to return k-1 dummy variables if there are k categories (thus reducing the number of features you create).

    # Specify the columns you wish to one-hot encode
    categorical_columns = [
        "job",
        "marital"
    ]  
    
    # Perform the one-hot encoding
    df_encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
    
    # View the resulting DataFrame
    df_encoded
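
    To see what drop_first changes, it can help to run get_dummies() on a tiny example. This sketch uses made-up marital values (an assumption for illustration). It also passes dtype=int, because recent pandas versions return boolean dummy columns by default rather than 0s and 1s.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

# One column per category (k columns)
full = pd.get_dummies(toy, columns=["marital"], dtype=int)

# The first category in sorted order ("divorced") is dropped (k-1 columns)
reduced = pd.get_dummies(toy, columns=["marital"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

    In the reduced frame, a row of all zeros stands for the dropped category.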

    One-Hot Encoding Using sklearn

    You can also use sklearn's OneHotEncoder to one-hot encode categorical columns. While the process is not as simple as it is with pandas, there are key advantages for machine learning. Most importantly, a fitted OneHotEncoder() remembers the categories it saw, so it produces the same set of columns when it is later applied to new data. In this example, the encoder is initialized and fit to the selected categorical columns. The data is then transformed, the column names are retrieved, and the result is joined with the original data.

    While initializing the encoder, the following two arguments are used:

    • handle_unknown tells the encoder how to treat categories that appear during the transform but were not seen during the fit. If set to "error", the encoder raises an error when it encounters an unknown category. If set to "ignore", all of the one-hot columns for that feature are set to zero in the affected row.
    • sparse_output (named sparse before scikit-learn 1.2) specifies whether a sparse matrix or an array is returned. The code below only works with an array, so sparse_output is set to False.

    # Specify the columns you wish to one-hot encode
    categorical_columns = ["job", "marital"]
    
    # Filter the DataFrame for the categorical features
    cat_features = df[categorical_columns]
    
    # Initialize the OneHotEncoder and fit it to the categorical features
    # (on scikit-learn versions before 1.2, use sparse=False instead)
    enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    enc.fit(cat_features)
    
    # One-hot encode the categorical data and convert the result to a DataFrame,
    # reusing the original index so the join below aligns row by row
    enc_data = pd.DataFrame(
        enc.transform(cat_features),
        columns=enc.get_feature_names_out(categorical_columns),
        index=df.index
    )
    
    # Join with the rest of the data and preview the DataFrame
    df_encoded = df.join(enc_data)
    df_encoded

    Once you have encoded all the categorical variables you want to use, you can remove the original columns and feed the data into a model. If you would like to learn more about preprocessing techniques, be sure to check out the DataCamp course Preprocessing Data for Machine Learning in Python.
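
    As a rough illustration of that last step, the sketch below joins encoded columns to a made-up DataFrame, drops the original categorical column, and fits a model. The data, the "subscribed" target column, and the choice of LogisticRegression are all assumptions made for the example, not part of this template.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data: one categorical feature, one numeric feature, a binary target
toy = pd.DataFrame({
    "marital": ["married", "single", "single", "married", "divorced", "single"],
    "age": [45, 23, 31, 52, 39, 27],
    "subscribed": [0, 1, 1, 0, 0, 1],
})

# Encode the categorical column and join the result to the original data
encoded = pd.get_dummies(toy["marital"], prefix="marital", dtype=int)
toy_encoded = toy.join(encoded)

# Remove the original categorical column (and the target) from the features
X = toy_encoded.drop(columns=["marital", "subscribed"])
y = toy_encoded["subscribed"]

# Fit a simple classifier on the fully numeric features
model = LogisticRegression(max_iter=1000).fit(X, y)
predictions = model.predict(X)

print(X.columns.tolist())
```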