Preprocessing Machine Learning Data with Python for Beginners
Welcome to your webinar workspace! You can follow along as we prepare some machine learning data and train a classifier!
We will begin by importing some packages we will use during the codealong.
# Data manipulation imports
import numpy as np
import pandas as pd
# Visualization imports
import matplotlib.pyplot as plt
import plotly.express as px
# Modeling imports
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
Load and explore the data
We have some data available as a CSV file. Let's use read_csv() to read it in as a DataFrame.
# Read the data as a DataFrame
df = pd.read_csv("bank.csv")
# Preview the DataFrame
df.head()
We can use the .info() method to get a summary of our DataFrame.
# Return a summary of the DataFrame
df.info()
We should also check for missing values, which we can count with .isna().
# Count the missing values in each column
df.isna().sum()
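If this check turns up missing values, the SimpleImputer we imported earlier can fill them. Here is a minimal sketch, assuming we fill the numeric columns with their medians (the strategy is an illustrative choice, not something fixed by the data):
# Fill missing values in the numeric columns with each column's median
# ("median" is one of several strategies; "mean" and "most_frequent" also work)
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(df.select_dtypes(include="number"))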
We will also want to know the distribution of our target variable, deposit. We can calculate this with the .value_counts() method.
# Get the value counts of the deposit column
df.deposit.value_counts()
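Passing normalize=True returns the same distribution as proportions instead of raw counts, which makes any class imbalance easier to judge:
# View the target distribution as proportions
df.deposit.value_counts(normalize=True)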
Categorical variables
Let's loop through our categorical variables to get an understanding of the different values they can take.
This can also help us identify any variables whose categories need to be reduced in preparation for one-hot encoding.
# Define our categorical variables
# (assumes the categorical columns, including the target, are stored as object dtype)
cat_cols = df.select_dtypes(include="object").columns.drop("deposit")
# Print the value counts of the categorical columns
for col in cat_cols:
    print(df[col].value_counts())
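If any column turns out to have many infrequent categories, one way to reduce it is to collapse the rare values into a single bucket before encoding. A minimal sketch, assuming a 5% frequency cutoff (the threshold is an arbitrary choice):
# Collapse categories that appear in fewer than 5% of rows into an "other" bucket
for col in cat_cols:
    freqs = df[col].value_counts(normalize=True)
    rare = freqs[freqs < 0.05].index
    df[col] = df[col].where(~df[col].isin(rare), "other")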
Numeric variables
We should also inspect our numeric variables to understand both their distribution and their scale.
# Define our numeric variables
num_cols = df.select_dtypes(include="number").columns
# Create a histogram for each variable
df[num_cols].hist(figsize=(10, 8))
plt.show()
Our numeric variables are on largely different scales. Many models, including the k-nearest neighbors classifier we will train, rely on distance calculations, which means that features on larger scales can dominate those calculations and bias the outcome of the model. We will use a standard scaler to address this issue.
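As a rough illustration of what the scaler does (just a sketch; in the actual workflow the scaler will be fit on the training data only), StandardScaler transforms each feature to z = (x - mean) / std:
# Standardize each numeric feature to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(df[num_cols])
# Each column of the result now has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0).round(2))
print(scaled.std(axis=0).round(2))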
Now that we know what we want to do with our data, let's split our data into train and test sets.
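A minimal sketch of the split, assuming deposit is the target; we stratify on it so both sets keep the same class balance (the test size and random state are arbitrary choices):
# Separate the features from the target
X = df.drop(columns="deposit")
y = df["deposit"]
# Split into train and test sets, preserving the target's class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
All of our preprocessing (imputation, encoding, scaling) should then be fit on X_train only, which is exactly what the Pipeline and ColumnTransformer we imported make easy.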