## Logistic Regression Binary Classification

Logistic regression is a fundamental machine learning method originally from the field of statistics. It's a great choice for generating a baseline for any binary classification problem (meaning there are only two outcomes). This template trains and evaluates a logistic regression model for a **binary classification** problem. If you would like to learn more about logistic regression, take a look at DataCamp's Linear Classifiers in Python course.

To swap in your dataset in this template, the following is required:

- There's at least one feature column and a column with a binary categorical target variable you would like to predict.
- The features have been cleaned and preprocessed, including categorical encoding.
- There are no NaN/NA values. If there are, impute the missing values before using this template.
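If your data does contain missing values, one possible way to fill them is scikit-learn's `SimpleImputer`. The sketch below is only an illustration: the `toy` DataFrame and its column names are hypothetical stand-ins (not the churn dataset), and the median strategy is an assumption that may not suit your data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values (not the churn dataset)
toy = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0, 31.0],
        "calls": [3.0, 5.0, np.nan, 4.0],
    }
)

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Confirm that no missing values remain
print(toy_imputed.isnull().sum())
```

In a full workflow, you would probably fit the imputer on the training subset only and then apply it to the test subset, for the same data leakage reason that applies to standardization.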

The placeholder dataset in this template consists of churn data from a telecom company. Each row represents a customer over a year and whether the customer churned (the target variable; `1` = yes, `0` = no). You can find more information on this dataset's source and dictionary here.

#### 1. Loading packages and data

```
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    RocCurveDisplay,
)
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
# Load the data and replace with your CSV file path
df = pd.read_csv("data/customer_churn.csv")
df
```

```
# Check if there are any null values
print(df.isnull().sum())
```

```
# Check columns to make sure you have feature(s) and a target variable
df.info()
```

#### 2. Splitting and standardizing the data

To split the data, we'll use the `train_test_split()` function. Then, we'll standardize the input data using `StandardScaler()` (note: this should be done after splitting the data to avoid data leakage). To learn more about standardizing data and preprocessing techniques, visit DataCamp's Preprocessing for Machine Learning in Python.

```
# Split the data into two DataFrames: X (features) and y (target variable)
X = df.iloc[:, 0:8] # Specify at least one column as a feature
y = df["Churn"] # Specify one column as the target variable
# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123
)
# Standardize X data based on X_train
sc = StandardScaler().fit(X_train)
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
```
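To see why the scaler is fit on the training subset only, the following sketch repeats the split-then-scale steps on synthetic data. The data from `make_classification` and the names `X_demo`, `X_tr`, and `X_te` are assumptions for illustration, not part of the churn dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 4 numeric features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123
)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Training features end up with mean 0 and standard deviation 1
print("Train means:", X_tr_s.mean(axis=0).round(3))
print("Train stds: ", X_tr_s.std(axis=0).round(3))
# Test features are only approximately standardized, which is expected
print("Test means: ", X_te_s.mean(axis=0).round(3))
```

The test means are usually close to, but not exactly, zero because the test rows are transformed with statistics learned from the training rows.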

#### 3. Building a logistic regression classifier

The following code builds a scikit-learn logistic regression classifier (`linear_model.LogisticRegression`) using the most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's Linear Classifiers in Python course and scikit-learn's documentation.

```
# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    "penalty": "l2",  # Norm of the penalty: 'l1', 'l2', 'elasticnet', or None (depends on the solver)
    "C": 1,  # Inverse of regularization strength, a positive float
    "random_state": 123,
}
# Create a logistic regression classifier object with the parameters above
clf = LogisticRegression(**params)
# Train the classifier on the train set
clf = clf.fit(X_train_scaled, y_train)
# Predict the outcomes on the test set
y_pred = clf.predict(X_test_scaled)
```
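As a rough illustration of what the fitted classifier does when it predicts, the sketch below rebuilds the predicted probabilities by hand. It applies the sigmoid function to a weighted sum of the features and then uses a 0.5 threshold. It runs on synthetic data from `make_classification`, and names such as `model` and `p_manual` are assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption), with classes labeled 0 and 1
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(C=1, random_state=123).fit(X_demo, y_demo)

# Linear score: weighted sum of the features plus the intercept
z = X_demo @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid maps the score to a probability of the positive class
p_manual = 1 / (1 + np.exp(-z))
p_sklearn = model.predict_proba(X_demo)[:, 1]
print("Probabilities match:", np.allclose(p_manual, p_sklearn))

# predict() corresponds to thresholding that probability at 0.5
labels_manual = (p_manual > 0.5).astype(int)
print("Labels match:", np.array_equal(labels_manual, model.predict(X_demo)))
```

If the default 0.5 threshold does not suit your problem, you could compare `predict_proba` output against a different cutoff instead of calling `predict()`.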

To evaluate this classifier, we can calculate the accuracy, precision, and recall scores. You'll have to decide which performance metric is best suited for your problem and goal.

```
# Calculate the accuracy, precision, and recall scores
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
```
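To make the three metrics concrete, this small sketch derives them from the four confusion matrix counts and compares the results with scikit-learn's functions. The label arrays `y_true_demo` and `y_pred_demo` are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels for illustration (1 = positive class, 0 = negative class)
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 1, 1, 1, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)  # share of predicted positives that are truly positive
recall = tp / (tp + fn)  # share of true positives that the model found

print("Accuracy:", accuracy, "vs", accuracy_score(y_true_demo, y_pred_demo))
print("Precision:", precision, "vs", precision_score(y_true_demo, y_pred_demo))
print("Recall:", recall, "vs", recall_score(y_true_demo, y_pred_demo))
```

Precision matters more when false positives are costly, while recall matters more when missing a positive case is costly.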

#### 4. Other evaluation methods: confusion matrix and ROC curve

We can use a confusion matrix and a receiver operating characteristic (ROC) curve to get a fuller picture of the model's performance. These are available from sklearn's metrics module.

```
# Calculate confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
# Plot a labeled confusion matrix with Seaborn
sns.heatmap(cnf_matrix, annot=True, fmt="g")
plt.title("Confusion matrix")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")
```