
Preprocessing Machine Learning Data with Python for Beginners

Welcome to your webinar workspace! You can follow along as we prepare some machine learning data and train a classifier!

We will begin by importing some packages we will use during the codealong.

# Data manipulation imports
import numpy as np
import pandas as pd

# Visualization imports
import matplotlib.pyplot as plt
import plotly.express as px

# Modeling imports
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

Load and explore the data

We have some data available as a CSV file. Let's use read_csv() to read it in as a DataFrame.

# Read the data as a DataFrame
df = pd.read_csv("bank.csv")

# Preview the DataFrame
df

We can use the .info() method to get a summary of our DataFrame.

# Return a summary of the DataFrame
df.info()

Let's take a closer look at missing values using .isna() and .sum().

# Count the missing values in each column and sort them
df.isna().sum().sort_values(ascending=False)

We will also want to know the balance of our target variable: deposit.

We can calculate this with the .value_counts() method.

# Get the value counts of the deposit column
df["deposit"].value_counts()

Categorical variables

Let's loop through our categorical variables and get an understanding of the different values they can take.

This can also help us understand whether any variables need to be reduced in preparation for one-hot encoding.

# Define our categorical variables
categorical_variables = df.select_dtypes(include=["object"]).columns.tolist()
categorical_variables.remove("deposit")

# Plot the value counts of each categorical column as a horizontal bar chart
for var in categorical_variables:
    value_counts = df[var].value_counts(ascending=True)
    fig = px.bar(x=value_counts.values,
                 y=value_counts.index,
                 orientation="h",
                 labels={"x": "count", "y": var},
                 title=var
                )

    fig.show()
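If any of these columns turn out to have many rare categories, one way to reduce them before one-hot encoding is to collapse infrequent values into a single "other" bucket. A minimal sketch, using a hypothetical 5% threshold and the job column purely for illustration:

# Sketch: collapse categories that appear in fewer than 5% of rows into "other"
# (the threshold and the choice of the "job" column are illustrative assumptions)
threshold = 0.05 * len(df)
counts = df["job"].value_counts()
rare_categories = counts[counts < threshold].index

# Build a reduced copy of the column without modifying the original DataFrame
job_reduced = df["job"].where(~df["job"].isin(rare_categories), "other")
job_reduced.value_counts()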

Numeric variables

We should also inspect our numeric variables to understand both their distribution and their scale.

# Define our numeric variables
numeric_variables = df.select_dtypes(include=["float64"]).columns.tolist()

# Create a histogram for each variable
for var in numeric_variables:
    fig = px.histogram(df,
                       x=var,
                       title=var
                      )
    fig.show()

Our numeric variables are on very different scales. Many models, including K-Nearest Neighbors, rely on distance calculations, so features on larger scales can dominate the outcome of the model. We will use a standard scaler to address this issue.
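As a quick preview of what the scaler does (a minimal sketch; in practice this kind of preprocessing is usually applied inside a pipeline rather than directly on the DataFrame), StandardScaler subtracts each column's mean and divides by its standard deviation so every feature ends up on a comparable scale:

# Minimal sketch: standardize the numeric columns so each has mean 0 and std 1
scaler = StandardScaler()
scaled_preview = scaler.fit_transform(df[numeric_variables])
scaled_preview[:5]  # first few rows, now on a comparable scale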

Now that we know what we want to do with our data, let's split it into train and test sets.
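Below is a sketch of the split, assuming deposit is the target; the 80/20 split and random_state are illustrative choices, and we stratify on the target so both sets keep its class balance:

# Separate the features from the target
X = df.drop(columns=["deposit"])
y = df["deposit"]

# Hold out 20% of rows for testing; stratify to preserve the deposit balance
# (test_size and random_state are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)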