Understanding flight delays ✈️
📖 Background
You work for a major airline operating flights across the USA. Flight delays are a significant challenge for both the airline and its passengers, causing disruptions, financial losses, and dissatisfaction. As part of the airline's data analytics team, your goal is to analyze historical flight data to uncover delay patterns, identify operational inefficiencies, predict delays before they occur, and determine which factors contribute most to them. Your insights will help the airline make data-driven decisions to optimize scheduling, improve on-time performance, and enhance passenger satisfaction.
Can you crack the code behind flight delays and revolutionize air travel?
import pandas as pd
flight_data = pd.read_csv('data/flights.csv')
airlines_codes = pd.read_csv('data/airlines_carrier_codes.csv')
flight_data.head()
airlines_codes.head()
💪 Challenge
Create a report summarizing your insights. Your report should explore the following questions:
- How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year? A well-structured visualization could help uncover patterns.
- Are there particular months, weeks, or times of day when delays tend to be greater across all carriers? If so, what could be the reasons?
- Some airports seem to operate like clockwork, while others are notorious for disruptions. How do different airports compare when it comes to departure and arrival punctuality? Could location, traffic volume, or other factors play a role? Are there patterns that emerge when looking at delays across various airports?
- [Optional 1] Predict whether a flight will have a delay of 15 minutes or more at departure.
- [Optional 2] What underlying factors influence flight delays the most? Are some routes more prone to disruptions than others? Do external variables like time of day, distance, or carrier policies play a significant role? By analyzing the relationships between different features, you might discover unexpected insights.
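For Optional 1, the prediction target has to be derived from the data first. The following is a minimal sketch of one way to do that, assuming the `dep_delay` column is measured in minutes (as the plots later in this report treat it). The helper `add_delay_label` and the column name `delayed_15` are hypothetical names introduced for illustration, and the sample rows are synthetic, not taken from `flights.csv`.

```python
import pandas as pd


def add_delay_label(flights: pd.DataFrame, threshold: float = 15.0) -> pd.DataFrame:
    """Return a copy with a binary `delayed_15` column (hypothetical name).

    Rows without a recorded departure delay are dropped, because no label
    can be derived for them.
    """
    out = flights.dropna(subset=["dep_delay"]).copy()
    # 1 if the flight left at least `threshold` minutes late, else 0
    out["delayed_15"] = (out["dep_delay"] >= threshold).astype(int)
    return out


# Tiny synthetic sample for illustration only
sample = pd.DataFrame({
    "carrier": ["AA", "AA", "DL", "DL", "UA"],
    "dep_delay": [-3.0, 15.0, 42.0, None, 14.0],
})

labeled = add_delay_label(sample)
print(labeled)
print("Share of flights delayed 15+ minutes:", labeled["delayed_15"].mean())
```

Checking the share of positive labels before modeling is useful because a strongly imbalanced target changes which evaluation metrics are meaningful.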
flight_data.head()
How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year? A well-structured visualization could help uncover patterns.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
# Step 1: Load datasets
flight_data = pd.read_csv('data/flights.csv')
airlines_codes = pd.read_csv('data/airlines_carrier_codes.csv')
flight_data.head()
# Step 2: Merge flight data with airline names
flight_data = flight_data.merge(airlines_codes, how='left', left_on='carrier', right_on='Carrier Code')
# Step 3: Clean and prepare the data
# Convert date columns to datetime format
flight_data['date'] = pd.to_datetime(flight_data[['year', 'month', 'day']])
# Create a 'month_label' for grouping by month
flight_data['month_label'] = flight_data['date'].dt.to_period('M').astype(str)
# Drop rows with missing delay data
flight_data = flight_data.dropna(subset=['dep_delay', 'arr_delay'])
# Step 4: Calculate Monthly Delay Metrics
monthly_delays = flight_data.groupby(['Airline Name', 'month_label'])[['dep_delay', 'arr_delay']].mean().reset_index()
# Custom vibrant color palette (ensure each airline has a unique color)
unique_airlines = monthly_delays['Airline Name'].unique()
vibrant_palette = sns.color_palette("dark", len(unique_airlines))
# Mapping airline names to specific colors
airline_colors = dict(zip(unique_airlines, vibrant_palette))
# Define distinct markers for each airline
markers = ['o', 's', 'D', '^', 'v', 'p', '*', 'H', '+', 'x'] # List of available markers
# Ensure there are enough markers for all airlines
while len(markers) < len(unique_airlines):
    markers.extend(markers)  # Duplicate the list to ensure enough markers
markers_dict = dict(zip(unique_airlines, markers[:len(unique_airlines)]))
# Step 5: Visualizations
# Faceted Line Plot – Separate plots for each airline with slanted x-axis labels
g = sns.FacetGrid(monthly_delays, col="Airline Name", col_wrap=4, height=4, sharex=False)
g.map(sns.lineplot, "month_label", "dep_delay", marker="o")
g.set_axis_labels("Month", "Average Departure Delay (min)")
g.set_titles("{col_name}")
# Rotate x-axis labels for each facet
for ax in g.axes.flatten():
    for label in ax.get_xticklabels():
        label.set_rotation(90)
        label.set_ha('right')  # optional: aligns label to the right for readability
g.tight_layout()
plt.show()
Summary of Airline Departure Delay Trends in 2023:
- Hawaiian Airlines Inc. showed a dramatic improvement, with departure delays falling consistently from January to December 2023. This likely reflects operational adjustments or fewer weather-related disruptions.
- Mesa Airlines Inc. was highly variable, with delays fluctuating throughout the year and no clear trend in either direction.
- SkyWest Airlines Inc. displayed a steady decline in delays from January to September 2023, which suggests a focus on performance improvement, although missing data for some months makes the trend difficult to assess fully.
- Other airlines, such as Delta, American, and United, fluctuated without a strong upward or downward trend, indicating mixed operational efficiency or varying external factors such as weather and traffic congestion.
These patterns highlight the need for better management of operational strategies and external disruptions to achieve more consistent performance across the board.
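The remark about missing months suggests checking how many flights stand behind each monthly average before reading a trend into it. The sketch below shows one way to do that with a flight count per airline and month. It runs on a small synthetic frame (`toy`) that only mimics the `Airline Name`, `month_label`, and `dep_delay` columns used above; the airline names "A" and "B" and all values are made up.

```python
import pandas as pd

# Synthetic stand-in for the merged flight_data (illustration only)
toy = pd.DataFrame({
    "Airline Name": ["A", "A", "A", "B", "B"],
    "month_label": ["2023-01", "2023-01", "2023-02", "2023-01", "2023-03"],
    "dep_delay": [10.0, 20.0, 5.0, 30.0, 0.0],
})

# Mean delay together with the number of flights behind each mean
monthly = (
    toy.groupby(["Airline Name", "month_label"])["dep_delay"]
    .agg(mean_delay="mean", n_flights="count")
    .reset_index()
)

# One row per airline, one column per month; 0 marks a month with no flights
counts = (
    monthly.pivot(index="Airline Name", columns="month_label", values="n_flights")
    .fillna(0)
    .astype(int)
)
print(counts)
```

A zero or very small count in this table means the corresponding point in the line plot is either absent or based on too few flights to support a conclusion about that month.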
# Overall average departure and arrival delay per airline
overall_delays = flight_data.groupby('Airline Name')[['dep_delay', 'arr_delay']].mean().reset_index()
# Reshape to long format so both delay types can be drawn as grouped bars
delay_long = overall_delays.melt(
    id_vars='Airline Name',
    value_vars=['dep_delay', 'arr_delay'],
    var_name='Delay Type',
    value_name='Average Delay',
)
# Rename delay types for readability
delay_long['Delay Type'] = delay_long['Delay Type'].map({
    'dep_delay': 'Departure Delay',
    'arr_delay': 'Arrival Delay',
})
# Plot grouped bar chart
plt.figure(figsize=(12, 6))
sns.barplot(data=delay_long, x='Airline Name', y='Average Delay', hue='Delay Type')
plt.title("Average Departure and Arrival Delays by Airline")
plt.ylabel("Average Delay (minutes)")
plt.xticks(rotation=90)
plt.legend(title='Delay Type')
plt.tight_layout()
plt.show()
Some airlines, including Frontier Airlines Inc., ExpressJet Airlines Inc., Mesa Airlines Inc., AirTran Airways Corporation, and Southwest Airlines Co., show high average delays in both departure and arrival, indicating consistent punctuality issues. These delays may be linked to challenges such as scheduling problems, maintenance issues, or high-traffic routes. On the other hand, Hawaiian Airlines Inc., US Airways Inc., and Alaska Airlines Inc. show notably lower average delays, suggesting strong operational efficiency and fewer external disruptions.
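Average delay is sensitive to a handful of very long delays, so a ranking by mean alone can differ from a ranking by on-time performance. As a hedged illustration, the sketch below compares the mean with the median and with the share of flights departing 15 or more minutes late (the threshold used in Optional 1). The data is synthetic and the column name `share_delayed_15` is a hypothetical choice.

```python
import pandas as pd

# Synthetic delays (minutes) for two made-up airlines
toy = pd.DataFrame({
    "Airline Name": ["A"] * 4 + ["B"] * 4,
    "dep_delay": [0.0, 5.0, 10.0, 185.0, 20.0, 25.0, 30.0, 25.0],
})

summary = toy.groupby("Airline Name")["dep_delay"].agg(
    mean_delay="mean",
    median_delay="median",
    # Fraction of flights that left at least 15 minutes late
    share_delayed_15=lambda s: (s >= 15).mean(),
)
print(summary)
```

In this toy data, airline A has the higher mean because of one extreme flight, while airline B is late far more often. Reporting both views side by side guards against that kind of misreading.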
# C. Heatmap – Delay by Airline and Month
heatmap_data = monthly_delays.pivot(index="Airline Name", columns="month_label", values="dep_delay")
plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, cmap="YlOrRd", annot=True, fmt=".1f", cbar_kws={'label': 'Average Departure Delay (min)'})
plt.title("Heatmap of Average Departure Delay by Airline and Month")
plt.xlabel("Month")
plt.ylabel("Airline")
plt.tight_layout()
plt.show()
In 2023, Hawaiian Airlines saw the most improvement, reducing delays from 54.4 minutes in January to near-zero by mid-year, suggesting effective operational changes. Alaska Airlines and US Airways consistently maintained low delays, reflecting strong efficiency. United Airlines had moderate delays throughout the year, showing steady performance.
In contrast, SkyWest Airlines faced high delays early in the year and improved after August, indicating operational challenges early on. Mesa Airlines and ExpressJet Airlines had erratic delays with no clear improvement, raising concerns about their ability to resolve operational issues. Frontier Airlines regularly experienced delays over 25 minutes, pointing to persistent problems.
Across carriers, delays peaked from May to July, highlighting the impact of seasonal demand, and then fell for most airlines from August to October, likely due to better management after the summer peak. These trends emphasize the need for better scheduling, resource management, and preparation for high-demand periods.
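The seasonal observation can be checked with a single cross-carrier series rather than by reading across heatmap rows. The sketch below contrasts two ways of building that series: averaging over all flights, and averaging the per-airline means shown in the heatmap. It uses synthetic data with made-up airlines "A" and "B"; only the column names match the analysis above.

```python
import pandas as pd

# Synthetic flights: airline A flies far more often than B in May
toy = pd.DataFrame({
    "Airline Name": ["A", "A", "A", "B", "A", "B"],
    "month_label": ["2023-05", "2023-05", "2023-05", "2023-05", "2023-06", "2023-06"],
    "dep_delay": [10.0, 10.0, 10.0, 50.0, 20.0, 40.0],
})

# Flight-weighted average: every flight counts once
per_flight = toy.groupby("month_label")["dep_delay"].mean()

# Average of airline averages: every airline counts once, regardless of size
per_airline = (
    toy.groupby(["month_label", "Airline Name"])["dep_delay"].mean()
    .groupby(level="month_label").mean()
)

print(pd.DataFrame({"per_flight": per_flight, "per_airline": per_airline}))
```

The two series differ whenever carriers have unequal flight volumes, so it helps to state which one a "general trend across all carriers" refers to. The flight-weighted version reflects what a typical passenger experiences.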
# D. Boxplot – Distribution of Departure Delays by Airline
plt.figure(figsize=(12, 6))
sns.boxplot(data=flight_data, x='Airline Name', y='dep_delay', palette=airline_colors)
plt.title("Departure Delay Distribution by Airline")
plt.ylabel("Delay (minutes)")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
SkyWest, Mesa, and Frontier Airlines are the least reliable, exhibiting high variability and frequent long delays. In contrast, Alaska, United, and US Airways are the most consistent, showing low median delays and fewer extreme cases. This distribution analysis confirms previous findings, indicating that airlines with higher average delays tend to be less consistent in their performance.
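The consistency comparison read off the boxplot can also be put into numbers. The sketch below is one possible summary, assuming the interquartile range (IQR) and the 95th percentile are acceptable measures of spread and of long delays. The helper `iqr`, the output column names, and the data are all illustrative.

```python
import pandas as pd

# Synthetic delays (minutes) for two made-up airlines
toy = pd.DataFrame({
    "Airline Name": ["A"] * 5 + ["B"] * 5,
    "dep_delay": [0.0, 1.0, 2.0, 3.0, 4.0, -10.0, 0.0, 10.0, 20.0, 100.0],
})


def iqr(s: pd.Series) -> float:
    """Interquartile range: width of the box in a boxplot."""
    return s.quantile(0.75) - s.quantile(0.25)


spread = toy.groupby("Airline Name")["dep_delay"].agg(
    median_delay="median",
    iqr_delay=iqr,
    # Delay exceeded by only the worst 5% of flights
    p95_delay=lambda s: s.quantile(0.95),
)
print(spread.sort_values("iqr_delay"))
```

Sorting by `iqr_delay` or `p95_delay` gives a ranking by consistency that can be compared directly with the ranking by average delay.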
Are there particular months, weeks, or times of day when delays tend to be greater across all carriers? If so, what could be the reasons?