Preprocessing Machine Learning Data with Python for Beginners - Solution

Welcome to your webinar workspace! You can follow along as we prepare some machine learning data and train a classifier!

We will begin by importing some packages we will use during the codealong.

# Data manipulation imports
import numpy as np
import pandas as pd

# Visualization imports
import matplotlib.pyplot as plt
import plotly.express as px

# Modeling imports
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # impute missing values
from sklearn.preprocessing import OneHotEncoder, StandardScaler  # encode categoricals, standardize numerics
from sklearn.compose import ColumnTransformer  # apply transformers per column group
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

Load and explore the data

We have some data available as a CSV file. Let's use read_csv() to read it in as a DataFrame.

# Read the data as a DataFrame
df = pd.read_csv("bank.csv")

# Preview the DataFrame
df

# Target variable: deposit

We can use the .info() method to get a summary of our DataFrame.

# Return a summary of the DataFrame
df.info()

Let's take a closer look at missing values using .isna() and .sum().

# Count missing values per column, largest first
df.isna().sum().sort_values(ascending=False)
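
Raw counts are easier to judge next to the number of rows, so it can help to look at the share of missing values per column as well. Here is a minimal sketch on a small made-up DataFrame (`toy` and its values are hypothetical, not taken from bank.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the bank data
toy = pd.DataFrame({
    "age": [25.0, np.nan, 41.0, 38.0],
    "job": ["admin", "technician", None, "admin"],
    "balance": [100.0, 250.0, 75.0, 310.0],
})

# isna() marks each missing cell as True; mean() turns that into a share per column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression should work on `df` directly, for example `df.isna().mean()`.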

We will also want to know the balance of our target variable: deposit.

We can calculate this with the .value_counts() method.

# Get the value counts of the deposit column
df["deposit"].value_counts()
# Target is balanced
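
Balance is often easier to read as proportions than as counts. As a small sketch, `value_counts(normalize=True)` returns the share of each class; the `target` Series below is made up for illustration:

```python
import pandas as pd

# Hypothetical target column with two classes
target = pd.Series(["yes", "no", "no", "yes", "no", "yes"], name="deposit")

# normalize=True returns the fraction of rows in each class instead of the count
class_share = target.value_counts(normalize=True)
print(class_share)
```

On the real data this would be `df["deposit"].value_counts(normalize=True)`.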

Categorical variables

Let's loop through our categorical variables and get an understanding of the different values they can take.

This can also help us spot variables with many rare values that should be reduced to fewer categories before one-hot encoding.

# Define our categorical variables
categorical_variables = df.select_dtypes(include=["object"]).columns.tolist()
categorical_variables.remove("deposit") #remove target variable
categorical_variables

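# Plot the value counts of each categorical column
for var in categorical_variables:
    value_counts = df[var].value_counts(ascending=True)
    # Pass counts and categories explicitly so the plot does not depend on
    # how pandas names the value_counts() result
    fig = px.bar(x=value_counts.values,
                 y=value_counts.index,
                 title=var,
                 labels={"x": "count", "y": var}
                )

    fig.show()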
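
If one of these plots shows a long tail of rare values, one possible way to reduce the variable before one-hot encoding is to merge the rare levels into a single "other" level. This is only a sketch: the `jobs` Series and the `min_count` threshold are hypothetical, and the right threshold depends on the data.

```python
import pandas as pd

# Hypothetical categorical column with two rare levels
jobs = pd.Series(
    ["admin", "admin", "admin", "technician", "technician", "student", "retired"],
    name="job",
)

# Hypothetical threshold: levels seen fewer than min_count times are treated as rare
min_count = 2

counts = jobs.value_counts()
rare_levels = counts[counts < min_count].index

# Keep frequent levels as they are and replace rare ones with "other"
jobs_reduced = jobs.where(~jobs.isin(rare_levels), "other")
print(jobs_reduced.value_counts())
```

Fewer levels means fewer one-hot columns, at the cost of losing the distinction between the merged values.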

Numeric variables

We should also inspect our numeric variables to understand both their distribution and their scale.

# Define our numeric variables (float64 columns only; integer columns, if any, are not selected)
numeric_variables = df.select_dtypes(include=["float64"]).columns.tolist()

# Create a histogram for each variable
for var in numeric_variables:
    fig = px.histogram(df,
                       x=var,
                       title=var
                      )
    fig.show()

Our numeric variables are on very different scales. Many models, such as the k-nearest neighbors classifier we imported, rely on distances between observations, so features on larger scales can dominate those distances and bias the model. We will use a standard scaler to address this issue.
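
To make the scale problem concrete, here is a small sketch with made-up numbers (the array `X` and the helper `euclidean` are hypothetical). It compares Euclidean distances between the same points before and after `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age in years, balance in currency units]
X = np.array([
    [25.0, 1000.0],
    [60.0, 1100.0],  # very different age, similar balance
    [26.0, 3000.0],  # similar age, very different balance
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.linalg.norm(a - b))

# On the raw data the balance column dominates the distance
raw_01 = euclidean(X[0], X[1])
raw_02 = euclidean(X[0], X[2])

# StandardScaler rescales each column to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = euclidean(X_scaled[0], X_scaled[1])
scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    {raw_01:.2f} vs {raw_02:.2f}")
print(f"scaled: {scaled_01:.2f} vs {scaled_02:.2f}")
```

Before scaling, the second pair looks far more distant only because balance is measured in larger units. After scaling, both features contribute on a comparable footing. In a real workflow the scaler would be fit on the training data only.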

Now that we know what we want to do with our data, let's split our data into train and test sets.