
Understanding Flight Delays with Logistic Regression

Executive Summary

This report presents the findings from an in-depth analysis of flight delays for a major airline. The goal of the analysis was to uncover delay patterns, identify operational inefficiencies, and build a predictive model for forecasting flight delays. The insights derived from this analysis aim to help the airline optimize its operations, improve on-time performance, and enhance passenger satisfaction.

  1. How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year?

    • The analysis of airline performance revealed that certain carriers, such as AirTran Airways (FL) and ExpressJet Airlines (EV), consistently experience more delays than others, while Alaska Airlines (AS) and Southwest Airlines (WN) show relatively better punctuality. The trends suggest that budget carriers tend to have higher delays, potentially due to operational challenges and fleet management.
    • Seasonal trends were observed, with summer months showing higher delay frequencies, likely due to increased travel volume. Delays also rose in December.
  2. Are there particular months/weeks/time of day where there is a general trend of greater delays across all carriers?

    • Peak travel times, such as holidays and weekends, showed higher delays across the board, likely due to higher passenger volumes, traffic congestion at airports, and resource constraints. Delays also peaked during afternoon and evening hours, suggesting that flight scheduling and air traffic congestion during these times contribute to increased delays.
  3. How do different airports compare when it comes to departure and arrival punctuality? Could location, traffic volume, or other factors play a role?

    • Airports like LGA (LaGuardia) and EWR (Newark) were identified as having higher delays. Factors such as location, airport traffic volume, and airport infrastructure (e.g., fewer runways or outdated systems) contribute to delays. LGA was notably affected by runway capacity issues and congestion. In contrast, airports like PDX (Portland) and BHM (Birmingham) showed significantly fewer delays, likely due to their smaller size and less air traffic.
  4. Predict whether a flight will have a delay of 15 minutes or more at departure.

    • A logistic regression model was built to predict the likelihood of a departure delay of 15 minutes or more from features such as carrier, hour of departure, origin airport, and recent delays. The model achieved an accuracy of 73%, with a recall of 60% and a precision of 38%. It therefore catches most delayed flights, but roughly six in ten of its delay predictions are false alarms, so precision is the main area for improvement.
  5. What underlying factors influence flight delays the most? Are some routes more prone to disruptions than others?

    • Time of day (hour) and origin airport (LGA) were found to be the most significant predictors of delay. Certain routes, such as JFK-SJC and LGA-BNA, were identified as having particularly high delays. These routes should be a focus for operational improvements.
    • Holiday travel and recent delays also contributed significantly to the likelihood of future delays, showing a pattern where delayed flights are more likely to remain delayed.
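
To make the accuracy, precision, and recall figures quoted in item 4 concrete, the sketch below recomputes them from a small set of counts. The counts are hypothetical: they were picked only because they reproduce the rounded percentages, and they are not the model's actual confusion matrix.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical counts, chosen only to be consistent with the rounded metrics
# in the summary; they are NOT the model's real confusion matrix.
tp, fn, fp, tn = 3, 2, 5, 16

# Rebuild label vectors from the counts (1 = delayed, 0 = on time).
y_true = [1] * tp + [1] * fn + [0] * fp + [0] * tn
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Precision depends on the false positives (flights predicted as delayed that left on time), which is why it can be low even when recall and accuracy look reasonable.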

Introduction

Flight delays are a major challenge in the airline industry, affecting millions of passengers each year. Delays lead to financial losses, disrupt schedules, and reduce customer satisfaction. They also cascade through the travel network: a late aircraft delays subsequent flights, strains airport operations, and causes passengers to miss connections. Understanding the factors that contribute to delays, and predicting when they will occur, is therefore essential for improving operational efficiency and the overall passenger experience.

This analysis aims to provide actionable insights into flight delays by examining historical flight data. By identifying delay patterns, uncovering operational inefficiencies, and predicting delays before they occur, the airline can make informed decisions that enhance scheduling, improve on-time performance, and increase passenger satisfaction. The insights gained from this analysis will empower the airline to take proactive measures in addressing these challenges, ultimately improving the quality of service provided to passengers.

Goals of the Report

The primary goal of this report is to answer several key questions:

  • How do different airlines compare in terms of on-time performance?
  • Are there certain times of day, months, or weeks when delays are more common?
  • How do airports compare in terms of delays, and what factors contribute to these delays?
  • Can we predict whether a flight will be delayed by 15 minutes or more at departure?
  • What underlying factors influence flight delays the most?

The findings from this analysis provide valuable insights for improving the airline’s operations. Understanding delay patterns and predicting disruptions before they occur will allow the airline to optimize flight schedules, address inefficiencies, and reduce delays. This approach will not only improve on-time performance but also enhance passenger satisfaction, fostering better customer loyalty and reducing operational costs.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import helpers.delay_helpers as dh
import helpers.plot_helpers as ph
import helpers.eval_helpers as eh
import helpers.ts_helpers as ts
import helpers.airport_helpers as ah
import helpers.import_helpers as ih
import helpers.model_helpers as mh

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.compose import ColumnTransformer

import warnings
warnings.filterwarnings('ignore')

# Plot styling; also silence matplotlib's "too many open figures" warning
sns.set(style="whitegrid")
plt.rcParams.update({'figure.max_open_warning': 0})

# Dataset
flight_data = ih.import_data()

# IPython magics: reload the helper modules automatically when they are edited
%load_ext autoreload
%autoreload 2

Data Preprocessing

The dataset used for this analysis contains historical flight data from January 1, 2023, to December 31, 2023. It includes a total of 336,776 rows and covers flight details such as departure and arrival times, delays, flight distances, and carrier information. The key columns in the dataset are:

  • id: Unique identifier for each flight
  • year: Year of the flight
  • month: Month of the flight
  • day: Day of the month
  • dep_time: Actual departure time (24-hour format)
  • sched_dep_time: Scheduled departure time
  • dep_delay: Delay in departure (in minutes)
  • arr_time: Actual arrival time (24-hour format)
  • sched_arr_time: Scheduled arrival time
  • arr_delay: Delay in arrival (in minutes)
  • carrier: Airline company code
  • flight: Flight number
  • tailnum: Aircraft identifier
  • origin: Origin airport (3-letter code)
  • dest: Destination airport (3-letter code)
  • air_time: Duration of the flight (in minutes)
  • distance: Flight distance (in miles)
  • hour: Hour component of the scheduled departure time
  • minute: Minute component of the scheduled departure time
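
As an illustration of how hour and minute relate to sched_dep_time, the sketch below derives both from a toy column. It assumes the scheduled departure time is stored as an HHMM integer (for example, 1545 for 3:45 pm); the report does not state the encoding, so treat this as an assumption rather than the dataset's documented layout.

```python
import pandas as pd

# Toy rows; assumes sched_dep_time is an HHMM integer (an assumption,
# not something the report states).
sample = pd.DataFrame({"sched_dep_time": [515, 1545, 2359]})

# Integer division and remainder split HHMM into its two components.
sample["hour"] = sample["sched_dep_time"] // 100
sample["minute"] = sample["sched_dep_time"] % 100

print(sample)
```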

Data Cleaning

In preparation for analysis, several preprocessing and cleaning steps were applied to the dataset:

  • Missing Values: The dataset contained missing values in the following columns:

    • air_time, arr_delay, arr_time, and recovery (an engineered feature, described under Feature Engineering) each had 9,430 missing values.
    • dep_delay had 8,255 missing values.
    • tailnum had 2,512 missing values.

    Missing dep_delay and arr_delay values were either imputed with the median or handled by removing the affected rows, depending on the analysis being conducted. Where delay values were central to the analysis, rows missing them were removed.

  • Data Types and Transformation:

    • Time-related columns like dep_time, arr_time, sched_dep_time, and sched_arr_time were converted into datetime format for easier manipulation.
    • The hour and minute components were extracted from the scheduled departure time and used in the analysis for time-of-day patterns.
  • Feature Engineering:

    • A new feature, is_delayed, was created for model building purposes. Any flight whose departure delay time exceeded 15 minutes was flagged as 1.
    • Flight recovery time, calculated as the difference between the scheduled and actual arrival times when available, was added as a feature to capture the impact of previous delays on arrival times.
  • Handling Outliers:

    • Extreme values in dep_delay, arr_delay, and air_time were identified using IQR-based filtering and were either capped or removed to ensure they did not unduly influence the results.
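
The cleaning steps above can be sketched roughly as follows. The actual logic lives in the helper modules and is not shown in this report, so the function names add_is_delayed and cap_iqr, the 1.5 × IQR multiplier, and the toy data are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def add_is_delayed(df, threshold=15):
    """Flag flights whose departure delay exceeds `threshold` minutes.

    Uses a strict `>` to match "exceeded 15 minutes"; use `>=` instead
    if "15 minutes or more" is the intended definition.
    """
    out = df.copy()
    out["is_delayed"] = (out["dep_delay"] > threshold).astype(int)
    return out


def cap_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]; NaNs pass through."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy data (hypothetical values) with missing delays and one extreme value.
toy = pd.DataFrame({
    "dep_delay": [-3.0, 0.0, 5.0, 16.0, 400.0, np.nan],
    "arr_delay": [-10.0, 2.0, np.nan, 20.0, 390.0, np.nan],
})

# Count missing values per column.
print(toy.isna().sum())

# Option 1: drop rows that lack the critical delay value.
dropped = toy.dropna(subset=["dep_delay"])

# Option 2: impute missing arrival delays with the column median.
imputed = toy.fillna({"arr_delay": toy["arr_delay"].median()})

# Build the target flag, then cap extreme departure delays.
flagged = add_is_delayed(dropped)
flagged["dep_delay_capped"] = cap_iqr(flagged["dep_delay"])
print(flagged)
```

Capping keeps the row while limiting its influence, whereas removal discards it; which is preferable depends on whether the extreme delays are data errors or genuine disruptions.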

Exploratory Data Analysis

Airline Performance

The histogram shows the distribution of arrival delays (in minutes) for each airline carrier. The x-axis represents the duration of the arrival delay, ranging from 0 to over 1200 minutes, and the height of the bars shows how often each delay duration occurs. Each carrier has its own row along the y-axis, and rows are ordered by median delay, with the carriers that have the highest median delay at the top of the plot.

  • Carrier Comparison: There is a noticeable variation in delay patterns between carriers. Some carriers, such as F9 and FL, have relatively lower delays (most delays are clustered around the 0–100 minutes range), while others, like HA and AS, have a larger spread, with delays occasionally extending to several hundred minutes.
  • High Frequency of Short Delays: Most airlines experience short delays (under 100 minutes), but the frequency of longer delays varies significantly across airlines.
  • Outliers: Certain carriers, such as AA and VX, have notable outliers, where delays extend to well over 500 minutes, suggesting that these airlines may face occasional operational disruptions that cause prolonged delays.
  • Insights for Recovery Time: This distribution could be tied to how well carriers recover from delays, indicating potential areas for operational improvement.

# Per-carrier distributions of departure delay, arrival delay, and recovery time
ph.plot_carrier_histogram(flight_data, 'dep_delay')
ph.plot_carrier_histogram(flight_data, 'arr_delay')
ph.plot_carrier_histogram(flight_data, 'recovery')

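
The helper ph.plot_carrier_histogram is not shown in this report. As a rough sketch of the median-based ordering described for the plot, the snippet below ranks carriers by median arrival delay with pandas. The carrier codes C1 to C3 and their delay values are made up for illustration.

```python
import pandas as pd

# Hypothetical carriers and delays, for illustration only.
toy = pd.DataFrame({
    "carrier": ["C1", "C1", "C1", "C2", "C2", "C2", "C3", "C3", "C3"],
    "arr_delay": [5.0, 30.0, 10.0, -5.0, 0.0, 2.0, 12.0, 50.0, 20.0],
})

# Median delay per carrier, highest first (missing values are skipped).
median_delay = (
    toy.groupby("carrier")["arr_delay"]
    .median()
    .sort_values(ascending=False)
)

# This list could be passed as the row or category order of a plot.
carrier_order = median_delay.index.tolist()
print(median_delay)
```

The median is less sensitive than the mean to the very long delays that appear as outliers, which makes it a more stable basis for ordering carriers.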
Median Departure/Arrival Delay