Perform logistic regression

Introduction

In this activity, I will show how to build an effective binomial logistic regression model. Doing so will help me understand the benefits of using logistic regression to predict a binary dependent variable from a single independent variable, and it will build my confidence with the technique. Because logistic regression is widely used across industries, becoming proficient in it expands my skill set in a broadly applicable way.

In this scenario, I am working as a consultant for an airline. The airline wants to know whether better in-flight entertainment leads to higher customer satisfaction. They have asked me to create and evaluate a model that predicts whether a future customer would be satisfied with their services, based on previous customers' feedback about their flight experience.

The data for this activity contains responses from 129,880 customers. It includes variables such as class, flight distance, and in-flight entertainment, among others. My goal is to use a binomial logistic regression model to help the airline understand and model this data.

Because this activity uses a real industry dataset, I need to perform basic exploratory data analysis (EDA), data cleaning, and other data manipulations to prepare the data for modeling.

I will practice the following skills:

Importing packages and loading data
Exploring and cleaning data
Building a binomial logistic regression model
Evaluating a binomial logistic regression model using a confusion matrix

Imports

Import packages

Import the relevant Python packages, including train_test_split, LogisticRegression, and sklearn.metrics, which are used to build, visualize, and evaluate the model.

# Standard operational package imports.
import numpy as np
import pandas as pd

# Important imports for preprocessing, modeling, and evaluation.
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics

# Visualization package imports.
import matplotlib.pyplot as plt
import seaborn as sns

Load the dataset

Load the Invistico_Airline.csv dataset. Save the resulting pandas DataFrame in a variable named df_original.

df_original = pd.read_csv("Invistico_Airline.csv")

# Output the first 10 rows of data.
df_original.head(n=10)

Data exploration, data cleaning, and model preparation

Prepare the data

After loading the dataset, I will prepare the data to be suitable for a logistic regression model. This includes:

Exploring the data
Checking for missing values
Encoding the data
Renaming a column
Creating the training and testing data
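
The encoding, renaming, and splitting steps listed above can be sketched on a small made-up table. This is only an illustration under stated assumptions, not the exact code used later: the toy rows, the new column name inflight_entertainment, the 25 percent test size, and the random_state value are all hypothetical choices.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# Toy stand-in for the airline data: 20 made-up rows, assumed column names.
df_toy = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied"] * 10,
    "Inflight entertainment": [5, 1, 4, 2, 5, 0, 3, 1, 4, 2] * 2,
})

# Encode the text label as 0/1. Categories are sorted alphabetically, so
# drop="first" drops "dissatisfied" and keeps a single "satisfied" indicator.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(df_toy[["satisfaction"]]).toarray().ravel()
df_toy["satisfaction"] = encoded

# Rename a column so it has no spaces (the new name is a hypothetical choice).
df_toy = df_toy.rename(columns={"Inflight entertainment": "inflight_entertainment"})

# Hold out 25 percent of the rows for testing (an assumed split ratio).
X = df_toy[["inflight_entertainment"]]
y = df_toy["satisfaction"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```

Keeping X as a one-column DataFrame (double brackets) rather than a Series matters because scikit-learn estimators expect a two-dimensional feature array.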

Explore the data

# Check the data type of each column.
df_original.dtypes

Check the number of satisfied customers in the dataset

To predict customer satisfaction, I will check how many customers in the dataset are satisfied before modeling.

# Count the values in the satisfaction column, including any missing values.
df_original['satisfaction'].value_counts(dropna=False)

There were 71,087 satisfied customers and 58,793 dissatisfied customers.

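About 54.7 percent (71,087/129,880) of customers were satisfied. Although this is a simple calculation, it provides a baseline: a model that always predicts "satisfied" would be correct about 54.7 percent of the time, so the logistic regression model's accuracy should be compared against this value.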
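
The satisfied share can be checked directly from the two class counts. The sketch below hard-codes the counts reported above so that it runs without the CSV file; the variable names are illustrative only.

```python
import pandas as pd

# Class counts reported above (hard-coded so the sketch runs without the CSV).
counts = pd.Series({"satisfied": 71087, "dissatisfied": 58793})

# Share of each class. The largest share equals the accuracy of a model
# that always predicts the majority class.
shares = counts / counts.sum()
baseline_accuracy = shares.max()

print(shares.round(3))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

On the full DataFrame, the same shares should be obtainable with df_original['satisfaction'].value_counts(normalize=True).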

Check for missing values

Logistic regression, as implemented in scikit-learn, cannot be fit on data that contains missing values. Check each column of the data for missing values before modeling.
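
A minimal sketch of the missing-value check, using a small made-up DataFrame in place of df_original. The column names and values here are assumptions for illustration, and dropping incomplete rows is only one possible way to handle them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_original; the column names and values are made up.
df_toy = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied", "satisfied"],
    "Flight Distance": [265, 2464, np.nan, 623],
    "Inflight entertainment": [4, 2, 0, 4],
})

# Count missing values in each column.
missing_per_column = df_toy.isnull().sum()
print(missing_per_column)

# Count rows that contain at least one missing value.
rows_with_missing = df_toy.isnull().any(axis=1).sum()
print(rows_with_missing)

# One simple option: drop incomplete rows and reset the index.
df_clean = df_toy.dropna(axis=0).reset_index(drop=True)
print(df_clean.shape)
```

Whether dropping rows is acceptable depends on how many rows are affected; when only a small fraction is missing, dropping is usually the simplest choice.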
