Logistic Regression Binary Classification
Logistic regression is a fundamental machine learning method originally from the field of statistics. It's a great choice for generating a baseline for any binary classification problem (meaning there are only two outcomes). This template trains and evaluates a logistic regression model for a binary classification problem. If you would like to learn more about logistic regression, take a look at DataCamp's Linear Classifiers in Python course.
To swap in your dataset in this template, the following is required:
- There's at least one feature column and a column with a binary categorical target variable you would like to predict.
- The features have been cleaned and preprocessed, including categorical encoding.
- There are no NaN/NA values. You can use this template to impute missing values if needed.
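The requirements above can be checked programmatically before running the rest of the template. The following is a minimal sketch, not part of the original template: the helper name `check_template_requirements` and the toy DataFrame are hypothetical, and the check treats any non-numeric feature column as "not yet encoded", which is an assumption.

```python
import pandas as pd


def check_template_requirements(df, target):
    """Return a list of problems; an empty list means every check passed."""
    if target not in df.columns:
        return [f"target column {target!r} not found"]
    problems = []
    features = df.drop(columns=[target])
    if features.shape[1] < 1:
        problems.append("no feature columns")
    if df[target].nunique(dropna=True) != 2:
        problems.append("target is not binary")
    if df.isna().any().any():
        problems.append("missing values present")
    # Assumption: encoded features are numeric, so any other dtype is flagged
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric feature columns: {non_numeric}")
    return problems


# Hypothetical toy data with one missing value and one unencoded column
toy = pd.DataFrame(
    {"calls": [3, 5, None], "plan": ["a", "b", "a"], "Churn": [0, 1, 0]}
)
problems = check_template_requirements(toy, target="Churn")
print(problems)
```

An empty list suggests the dataset is ready to swap in; anything else points to the preprocessing step that is still missing.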
The placeholder dataset in this template consists of churn data from a telecom company. Each row represents a customer over a year and whether the customer churned (the target variable; 1 = yes, 0 = no). You can find more information on this dataset's source and dictionary here.
1. Loading packages and data
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    RocCurveDisplay,
)
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
# Load the data and replace with your CSV file path
df = pd.read_csv("data/customer_churn.csv")
df
# Check if there are any null values
print(df.isnull().sum())
# Check columns to make sure you have feature(s) and a target variable
df.info()
2. Splitting and standardizing the data
To split the data, we'll use the train_test_split() function. Then, we'll standardize the input data using StandardScaler() (note: the scaler should be fit after splitting, on the training set only, so that no information from the test set leaks into training). To learn more about standardizing data and preprocessing techniques, visit DataCamp's Preprocessing for Machine Learning in Python.
# Split the data into two DataFrames: X (features) and y (target variable)
X = df.iloc[:, 0:8] # Specify at least one column as a feature
y = df["Churn"] # Specify one column as the target variable
# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123
)
# Standardize X data based on X_train
sc = StandardScaler().fit(X_train)
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
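One way to make the "fit the scaler on the training set only" rule harder to break is to bundle the scaler and the classifier into a scikit-learn pipeline. The sketch below is an alternative to the manual steps above, not the template's own method. It uses synthetic data from `make_classification` as a stand-in for the churn data, and the `_demo` variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 8 numeric features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123
)

# fit() learns the scaling statistics from the training data only;
# predict() and score() reuse those statistics on new data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)

# The scaler inside the pipeline has only seen the training rows
scaler = pipe.named_steps["standardscaler"]
print("Scaler means match training means:", np.allclose(scaler.mean_, X_tr.mean(axis=0)))
print("Test accuracy:", test_accuracy)
```

A pipeline is also convenient with cross-validation or a parameter search, because the scaler is refit inside each fold rather than once on all of the data.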
3. Building a logistic regression classifier
The following code builds a scikit-learn logistic regression classifier (linear_model.LogisticRegression) using the most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's Linear Classifiers in Python course and scikit-learn's documentation.
from sklearn import preprocessing
# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    "penalty": "l2",  # Norm of the penalty: 'l1', 'l2', 'elasticnet', or None ('none' in scikit-learn < 1.2)
    "C": 1,  # Inverse of regularization strength, a positive float (smaller = stronger regularization)
    "random_state": 123,
}
# Create a logistic regression classifier object with the parameters above
clf = LogisticRegression(**params)
# Train the classifier on the train set
clf = clf.fit(X_train_scaled, y_train)
# Predict the outcomes on the test set
y_pred = clf.predict(X_test_scaled)
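Because C is the inverse of the regularization strength, smaller values shrink the learned coefficients more. The following sketch illustrates this on synthetic data. It is an illustration under assumptions rather than a tuning recipe: the data comes from `make_classification`, and the three C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; there is no test split here, so scaling all rows is fine
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=123)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit the same model with weak, moderate, and strong values of C
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000, random_state=123)
    model.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f"C={C:<6} coefficient L2 norm = {norm:.3f}")
```

The coefficient norm should grow as C increases. Very small C values tend toward underfitting and very large ones toward overfitting, which is why C is a common target for tuning.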
To evaluate this classifier, we can calculate the accuracy, precision, and recall scores. You'll have to decide which performance metric is best suited for your problem and goal.
# Calculate the accuracy, precision, and recall scores
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
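To see why the choice of metric matters, consider an imbalanced target such as churn, where most customers do not churn. The toy example below uses made-up labels, not the template's data. It scores a "model" that always predicts the majority class: accuracy looks strong while recall shows that no churners were found.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 90 customers stayed (0), 10 churned (1)
y_true = np.array([0] * 90 + [1] * 10)

# A trivial baseline that predicts "no churn" for everyone
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority)
# zero_division=0 avoids a warning when there are no positive predictions
prec = precision_score(y_true, y_majority, zero_division=0)

print("Accuracy:", acc)   # 0.9
print("Precision:", prec)  # 0.0
print("Recall:", rec)     # 0.0
```

If missing a churner is costly, recall is likely the more informative score; if acting on a false alarm is costly, precision may matter more.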
4. Other evaluation methods: confusion matrix and ROC curve
We can use a confusion matrix and a receiver operating characteristic (ROC) curve to get a fuller picture of the model's performance. These are available from sklearn's metrics module.
# Calculate confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
# Plot a labeled confusion matrix with Seaborn
sns.heatmap(cnf_matrix, annot=True, fmt="g")
plt.title("Confusion matrix")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
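For a binary problem, the four cells of the confusion matrix can also be unpacked directly, which makes the link to precision and recall explicit. This is a small sketch on made-up labels; the names `y_true` and `y_hat` are illustrative and are not variables from the template.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for ten customers
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are actual labels and columns are predicted labels, so for labels
# 0 and 1 the flattened order is: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted churners who actually churned
recall = tp / (tp + fn)     # share of actual churners the model caught

print("TN, FP, FN, TP:", tn, fp, fn, tp)  # 5 1 1 3
print("Precision:", precision)            # 0.75
print("Recall:", recall)                  # 0.75
```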