Preprocessing Machine Learning Data with Python for Beginners - Solution

    Welcome to your webinar workspace! You can follow along as we prepare some machine learning data and train a classifier!

    We will begin by importing some packages we will use during the codealong.

    # Data manipulation imports
    import numpy as np
    import pandas as pd
    
    # Visualization imports
    import matplotlib.pyplot as plt
    import plotly.express as px
    
    # Modeling imports
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

    Load and explore the data

    We have some data available as a CSV file. Let's use read_csv() to read it in as a DataFrame.

    # Read the data as a DataFrame
    df = pd.read_csv("bank.csv")
    
    # Preview the DataFrame
    df

    We can use the .info() method to get a summary of our DataFrame.

    # Return a summary of the DataFrame
    df.info()

    Let's take a closer look at missing values using .isna() and .sum().

    # Count missing values in each column, sorted from most to least
    df.isna().sum().sort_values(ascending=False)
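
Raw counts can be hard to judge without the total number of rows next to them. As a small sketch, the share of missing values per column can be computed with `.isna().mean()`. The toy DataFrame and its column names below are made up for illustration and are not the bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the bank dataset
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "job": ["admin", None, None, "technician"],
    "balance": [100.0, 250.0, 0.0, 75.0],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing values in each column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression should work on `df` directly; multiplying by 100 gives percentages.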

    We will also want to know the balance of our target variable: deposit.

    We can calculate this with the .value_counts() method.

    # Get the value counts of the deposit column
    df["deposit"].value_counts()
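
Class balance is often easier to read as proportions than as counts. A minimal sketch, using a made-up target column rather than the real `deposit` values:

```python
import pandas as pd

# Hypothetical target column, not the bank dataset
toy_target = pd.Series(["no", "no", "yes", "no"], name="deposit")

# normalize=True returns relative frequencies instead of counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```

On the real data, `df["deposit"].value_counts(normalize=True)` would return the equivalent proportions.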

    Categorical variables

    Let's loop through our categorical variables and get an understanding of the different values they can take.

    This also helps us spot variables with many rare categories that may need to be reduced before one-hot encoding.

    # Define our categorical variables
    categorical_variables = df.select_dtypes(include=["object"]).columns.tolist()
    categorical_variables.remove("deposit")
    
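    # Plot the value counts of each categorical column as a bar chart
    for var in categorical_variables:
        value_counts = df[var].value_counts(ascending=True)
        # Pass counts and categories as plain arrays so the plot does not
        # depend on how pandas names the value_counts() result
        fig = px.bar(x=value_counts.to_numpy(),
                     y=value_counts.index.to_numpy(),
                     labels={"x": "count", "y": var},
                     title=var
                    )
        
        fig.show()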
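
One possible way to reduce a variable before one-hot encoding is to group infrequent categories under a single label. The sketch below is only an illustration of that idea: the `job` values, the `min_count` threshold, and the `"other"` label are all assumptions, not choices made in this codealong.

```python
import pandas as pd

# Hypothetical categorical column, not the bank dataset
jobs = pd.Series(
    ["admin", "admin", "admin", "services", "services", "student", "retired"],
    name="job",
)

# Hypothetical threshold: categories seen fewer than 2 times count as rare
min_count = 2
counts = jobs.value_counts()
rare = counts[counts < min_count].index

# Keep non-rare values as they are and replace rare ones with "other"
jobs_reduced = jobs.where(~jobs.isin(rare), "other")
print(jobs_reduced.value_counts())
```

Fewer categories means fewer columns after one-hot encoding. In a train/test workflow, the set of rare categories would normally be determined from the training data only.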

    Numeric variables

    We should also inspect our numeric variables to understand both their distribution and their scale.

    # Define our numeric variables (float64 columns only; integer columns are not selected)
    numeric_variables = df.select_dtypes(include=["float64"]).columns.tolist()
    
    # Create a histogram for each variable
    for var in numeric_variables:
        fig = px.histogram(df,
                           x=var,
                           title=var
                          )
        fig.show()

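    Our numeric variables are on very different scales. Many models, such as the k-nearest neighbors classifier we imported, rely on distances between observations, so features on larger scales can dominate the distance calculation and bias the model. We will use StandardScaler to put the features on a comparable scale.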
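
To make the effect of scale on distance concrete, here is a minimal sketch with made-up numbers. The two features (an age in years and an account balance) and their values are assumptions for illustration, not rows from the bank data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: age in years and account balance in currency units
X = np.array([
    [25.0, 1000.0],
    [60.0, 1100.0],
    [27.0, 5000.0],
])

# Euclidean distances from the first row to the other two rows.
# Unscaled, the balance column dominates because its values are much larger.
raw_dist = np.linalg.norm(X[1:] - X[0], axis=0 + 1)

# StandardScaler rescales each column to mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[1:] - X_scaled[0], axis=1)

print("raw distances:   ", raw_dist)
print("scaled distances:", scaled_dist)
```

Before scaling, the second row looks far closer to the first than the third does, purely because of the balance column. After scaling, the large age gap and the large balance gap contribute on a similar footing.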

    Now that we know what we want to do with our data, let's split our data into train and test sets.