Understanding flight delays ✈️
📖 Background
You work for a major airline operating flights across the USA. Flight delays are a significant challenge for both the airline and passengers, causing disruptions, financial losses, and dissatisfaction. As part of the airline’s data analytics team, your goal is to analyze historical flight data to uncover delay patterns, identify operational inefficiencies, and predict delays before they occur. By identifying delay patterns, predicting delays, and uncovering the factors that contribute most to delays, you’ll be able to drive operational efficiency and enhance the overall passenger experience. Your insights will help the airline make data-driven decisions to optimize scheduling, improve on-time performance, and enhance passenger satisfaction.
Can you crack the code behind flight delays and revolutionize air travel?
💪 Challenge
Create a report summarizing your insights. Your report should explore the following questions:
- How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year? A well-structured visualization could help uncover patterns.
- Are there particular months/weeks/time of day where there is a general trend of greater delays in flights across all carriers? If so, what could be the reasons?
- Some airports seem to operate like clockwork, while others are notorious for disruptions. How do different airports compare when it comes to departure and arrival punctuality? Could location, traffic volume, or other factors play a role? Are there patterns that emerge when looking at delays across various airports?
- [Optional 1] Predict whether a flight will have a delay of 15 minutes or more at departure.
- [Optional 2] What underlying factors influence flight delays the most? Are some routes more prone to disruptions than others? Do external variables like time of day, distance, or carrier policies play a significant role? By analyzing the relationships between different features, you might discover unexpected insights.
To begin this analysis, we need to thoroughly preprocess the provided flight data and then engineer features that will enable deeper insights. Below is a breakdown of the steps and considerations for data preprocessing, followed by an analysis of the questions posed.
Step 1: Data Preprocessing
Load and Inspect Data
The first step is to load the dataset and inspect the structure, identifying any missing values or anomalies.
Data Cleaning
- Handle Missing Data: Address any missing data in key columns, like departure times, arrival times, or delay durations.
- Standardize Time Formats: Convert string representations of times to actual time types.
- Identify Erroneous Values: Look for negative delays or unreasonable times.
Feature Engineering
- Time-Based Features:
- Extract features like the month, day of the week, hour of the day, and season to analyze trends based on time.
- Flight Duration:
- Compute the actual flight duration based on departure and arrival times.
- Delay Binary Indicator:
- Create a binary indicator of delays (e.g., 1 for delay ≥15 minutes, 0 for on-time).
- Delay Categories:
- Separate the delay into different categories (e.g., weather, mechanical, late aircraft, etc.).
- Distance:
- Compute the distance between the origin and destination airports to identify if longer flights are more prone to delays.
Data Transformation
- Normalize Continuous Variables: Normalize variables like distance and duration for better model performance.
- Encode Categorical Variables: Convert categorical variables (like airline codes and airports) into a machine-readable format.
import pandas as pd
# Load the datasets
airlines_file_path = 'airlines_carrier_codes.csv'
flights_file_path = 'flights.csv'
airlines_df = pd.read_csv(airlines_file_path)
flights_df = pd.read_csv(flights_file_path)
# Display the first few rows of both datasets to understand their structure
airlines_df.head(), flights_df.head()
Airlines Dataset (airlines_df)
airlines_df)- Carrier Code: Unique airline identifier (e.g., UA for United Airlines).
- Airline Name: Full name of the airline (e.g., United Airlines Inc.).
Flights Dataset (flights_df)
flights_df)- id: Unique identifier for each flight record.
- year, month, day: Date of the flight.
- dep_time: Actual departure time.
- sched_dep_time: Scheduled departure time.
- dep_delay: Delay at departure (in minutes).
- arr_time: Actual arrival time.
- sched_arr_time: Scheduled arrival time.
- arr_delay: Delay at arrival (in minutes).
- carrier: Carrier code (e.g., UA for United Airlines).
- flight: Flight number.
- tailnum: Aircraft tail number.
- origin, dest: Departure and destination airports.
- air_time: Flight duration (in minutes).
- distance: Distance between origin and destination (in miles).
- hour, minute: Hour and minute of the flight's scheduled departure time.
Step 3: Data Preprocessing Plan
Before diving into answering the specific questions, I will clean the data and create necessary features:
1. Flight Delays
- Focus on flights with a departure delay of 15 minutes or more.
- Create a binary "Delayed" indicator for this.
2. Time Features
- Extract month, day of the week, and time of day (morning, afternoon, evening, etc.).
3. Carrier and Airport Mapping
- Use the carrier codes from the
airlines_dfdataset to map the airline names to the flight records.
4. Flight Duration
- Calculate actual flight durations by comparing scheduled and actual times.
5. Distance and Route
- Understand the impact of distance on delays by analyzing how long-haul flights may experience different delay patterns.
Once the data is cleaned and transformed, we'll begin answering the specific questions with visualizations and statistical analysis.
I'll start with the preprocessing steps and create necessary features.
Data Preprocessing and Feature Engineering
====================================
# Merging the airlines dataset with the flights dataset to get the full airline names
flights_df = flights_df.merge(airlines_df, how='left', left_on='carrier', right_on='Carrier Code')
# Convert dep_time and arr_time to datetime and calculate delay status (15+ minutes)
flights_df['dep_time'] = pd.to_datetime(flights_df['dep_time'], format='%H%M', errors='coerce')
flights_df['arr_time'] = pd.to_datetime(flights_df['arr_time'], format='%H%M', errors='coerce')
# Add delay status (1 for delay >= 15 mins, 0 for on-time)
flights_df['dep_delay_status'] = (flights_df['dep_delay'] >= 15).astype(int)
flights_df['arr_delay_status'] = (flights_df['arr_delay'] >= 15).astype(int)
# Extract month, day of week, and time of day features
flights_df['month'] = flights_df['month'].astype(str)
flights_df['day_of_week'] = pd.to_datetime(flights_df[['year', 'month', 'day']].astype(str).agg('-'.join, axis=1), errors='coerce').dt.day_name()
# Define the time of day based on the scheduled departure time
flights_df['time_of_day'] = pd.cut(flights_df['hour'], bins=[0, 6, 12, 18, 24], labels=["Night", "Morning", "Afternoon", "Evening"], right=False)
# Create a new column for actual flight duration (in minutes)
flights_df['actual_flight_duration'] = flights_df['arr_time'] - flights_df['dep_time']
flights_df['actual_flight_duration'] = flights_df['actual_flight_duration'].dt.total_seconds() / 60 # Convert to minutes
# Handle missing values by filling with zeros or means where applicable (for simplicity)
flights_df['arr_delay'].fillna(0, inplace=True)
flights_df['dep_delay'].fillna(0, inplace=True)
# Check for any remaining missing values
missing_values = flights_df.isnull().sum()
# Display the missing values to ensure data is clean before moving forward
missing_values
Handling Missing Values
There are several columns with missing values, notably:
- dep_time and arr_time: These are critical for calculating delays, and missing values here could affect our analysis.
- tailnum and air_time: These are related to specific flight details, and while important, they may be less critical for a broad analysis of delays.
- actual_flight_duration: This has missing values because it depends on the departure and arrival times.
Steps to Handle Missing Values
-
dep_time and arr_time:
- Drop rows where both are missing, as they are crucial for delay analysis.
-
air_time and tailnum:
- Fill with zeros or handle separately if needed.
-
actual_flight_duration:
- Calculate only where both departure and arrival times are available, so missing rows can be discarded.
I'll clean the data and continue with the analysis.
# Drop rows where both dep_time and arr_time are missing, since they are essential for the analysis
flights_df = flights_df.dropna(subset=['dep_time', 'arr_time'])
# For air_time and tailnum, we can fill missing values with zeros as these are less critical for now
flights_df['air_time'].fillna(0, inplace=True)
flights_df['tailnum'].fillna('Unknown', inplace=True)
# Recalculate the actual flight duration where both dep_time and arr_time are available
flights_df['actual_flight_duration'] = pd.to_datetime(flights_df['arr_time'], errors='coerce') - pd.to_datetime(flights_df['dep_time'], errors='coerce')
flights_df['actual_flight_duration'] = flights_df['actual_flight_duration'].dt.total_seconds() / 60 # Convert to minutes
# Check again for missing values after the cleaning
missing_values_after_cleaning = flights_df.isnull().sum()
# Display missing values count after cleaning
missing_values_after_cleaning
Data Analysis Overview
The missing values have been successfully handled, and the dataset is now clean and ready for analysis. All critical columns required for delay analysis are complete.
Next Steps
I will move on to exploring the specific questions:
-
Airline Comparison:
- How do different airlines compare in terms of their departure and arrival times?
- Are there noticeable trends in their on-time performance over the year?
-
Delay Trends:
- Are there particular months, weeks, or times of day where there is a general trend of greater delays across all carriers?
-
Airport Comparison:
- How do different airports compare when it comes to departure and arrival punctuality?
I will begin by visualizing the data to identify trends and patterns. Let's start with the comparison of airlines in terms of their on-time performance.
Step 2: Exploratory Data Analysis (EDA)
Analysis of Airline Delays
import matplotlib.pyplot as plt
import seaborn as sns
# Set a custom color palette for consistent branding
sns.set_palette("Set2")
# 1. How do different airlines compare in terms of their departure and arrival times?
# Plotting the average departure delay and arrival delay for each airline
average_delays = flights_df.groupby('Airline Name')[['dep_delay', 'arr_delay']].mean().sort_values(by='dep_delay', ascending=False)
# Create a bar plot for the average delays
plt.figure(figsize=(12, 6))
average_delays.plot(kind='bar', stacked=False)
plt.title('Average Departure and Arrival Delays by Airline')
plt.ylabel('Delay (minutes)')
plt.xlabel('Airline')
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()