Understanding flight delays ✈️
📖 Background
I work for a major airline that operates flights across the United States. Flight delays pose a major challenge, not just for the airline but also for our passengers, causing disruptions, financial losses, and dissatisfaction. In this project, I take the role of a Data Analyst on the airline’s data analytics team. My responsibility is to dive into historical flight data to uncover delay patterns, identify operational inefficiencies, and build predictive models to anticipate delays before they happen.
By analyzing these patterns and pinpointing the key factors contributing to delays, I aim to drive operational efficiency and significantly enhance the passenger experience. The insights I generate support data-driven decisions to optimize flight scheduling, improve on-time performance, and boost overall customer satisfaction.
Can I crack the code behind flight delays and help revolutionize air travel? That’s exactly the mission I’m on.
💾 The data
flights.csv
- id: ID number of the flight
- year: Year of flight
- month: Month of flight
- day: Day of month
- dep_time: Time of departure (24h format)
- sched_dep_time: Scheduled departure time
- dep_delay: Delay in departure (minutes)
- arr_time: Time of arrival (24h format)
- sched_arr_time: Scheduled arrival time
- arr_delay: Delay in arrival (minutes)
- carrier: Airline company code
- flight: Flight number
- tailnum: Aircraft identifier number
- origin: Origin airport (3-letter code)
- dest: Destination airport (3-letter code)
- air_time: Duration of the flight (minutes)
- distance: Flight distance (miles)
- hour: Hour component of scheduled departure time
- minute: Minute component of scheduled departure time
airlines_carrier_codes.csv
- Carrier Code: Airline company code
- Airline Name: Airline name
💪 Challenge
Create a report summarizing your insights. Your report should explore the following questions:
- How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year? A well-structured visualization could help uncover patterns.
- Are there particular months, weeks, or times of day when delays tend to be greater across all carriers? If so, what could be the reasons?
- Some airports seem to operate like clockwork, while others are notorious for disruptions. How do different airports compare when it comes to departure and arrival punctuality? Could location, traffic volume, or other factors play a role? Are there patterns that emerge when looking at delays across various airports?
- [Optional 1] Predict whether a flight will have a delay of 15 minutes or more at departure.
- [Optional 2] What underlying factors influence flight delays the most? Are some routes more prone to disruptions than others? Do external variables like time of day, distance, or carrier policies play a significant role? By analyzing the relationships between different features, you might discover unexpected insights.
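The first optional question can be framed as binary classification: the target is 1 when dep_delay is 15 minutes or more, and the inputs are things known before departure. The block below is a minimal sketch of that framing, not a tuned model. The feature lists (CATEGORICAL, NUMERIC), the helper names, and the choice of logistic regression from scikit-learn are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature choice: only columns that are known before the flight departs.
CATEGORICAL = ["carrier", "origin", "dest"]
NUMERIC = ["month", "hour", "distance"]


def make_delay_target(df: pd.DataFrame, threshold: float = 15) -> pd.Series:
    """Return 1 where dep_delay >= threshold minutes, else 0 (missing delays become 0)."""
    return (df["dep_delay"] >= threshold).astype(int)


def build_delay_classifier() -> Pipeline:
    """One-hot encode categorical columns, scale numeric ones, then logistic regression."""
    preprocess = ColumnTransformer(
        [
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
            ("num", StandardScaler(), NUMERIC),
        ]
    )
    return Pipeline(
        [("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))]
    )


def fit_delay_classifier(df: pd.DataFrame) -> Pipeline:
    """Fit the classifier on rows where the delay and all features are present."""
    # Rows without a recorded dep_delay are dropped so they are not labelled "on time".
    known = df.dropna(subset=["dep_delay"] + NUMERIC + CATEGORICAL)
    model = build_delay_classifier()
    model.fit(known[CATEGORICAL + NUMERIC], make_delay_target(known))
    return model
```

In practice the model would be fitted on a training split and scored on held-out flights, for example with sklearn.model_selection.train_test_split. Treating hour and month as plain numbers is a simplification; encoding them as categories may capture rush-hour or seasonal effects better.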
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
flight_data = pd.read_csv('data/flights.csv')
airlines_codes = pd.read_csv('data/airlines_carrier_codes.csv')
flight_data.head()
airlines_codes.rename(columns={'Carrier Code': 'carrier', 'Airline Name': 'Airline'}, inplace=True)
airlines_codes.head()
merged_df = pd.merge(flight_data, airlines_codes, on='carrier', how='inner')
merged_df
# arr_delay is already provided in minutes by flights.csv, so it is used as-is.
# Recomputing it as arr_time - sched_arr_time would subtract 24h clock times, which does not yield minutes.
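flights.csv already provides dep_delay and arr_delay in minutes, while the time columns are clock times in 24h format. Assuming those clock times are stored as HHMM numbers (1530 for 15:30; the data description does not confirm this encoding), their raw difference is not a duration: 1005 - 955 gives 50 although the two times are 10 minutes apart. The sketch below shows one way to convert such values before differencing. The helper names are hypothetical, and wrapping the result to within half a day is an assumption that breaks for delays longer than 12 hours.

```python
import pandas as pd

MINUTES_PER_DAY = 24 * 60


def hhmm_to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert HHMM-encoded clock times (e.g. 1530) to minutes after midnight."""
    return (hhmm // 100) * 60 + (hhmm % 100)


def clock_diff_minutes(actual: pd.Series, scheduled: pd.Series) -> pd.Series:
    """Signed difference actual - scheduled in minutes, wrapped to [-720, 720)."""
    diff = hhmm_to_minutes(actual) - hhmm_to_minutes(scheduled)
    half_day = MINUTES_PER_DAY // 2
    # The wrap handles flights scheduled before midnight that arrive after it.
    return (diff + half_day) % MINUTES_PER_DAY - half_day
```

A possible use is as a sanity check: comparing clock_diff_minutes(merged_df['arr_time'], merged_df['sched_arr_time']) against the provided arr_delay column shows whether the HHMM assumption holds for this dataset.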
# Group by Airline and Month for Departure and Arrival Delay
dep_delay_by_airline = merged_df.groupby(['Airline', 'month'])['dep_delay'].mean().reset_index()
arr_delay_by_airline = merged_df.groupby(['Airline', 'month'])['arr_delay'].mean().reset_index()
sns.set(style="whitegrid")
plt.figure(figsize=(14, 6))
sns.lineplot(data=dep_delay_by_airline, x='month', y='dep_delay', hue='Airline', marker='o')
plt.title('Average Departure Delay by Airline Over Months')
plt.xlabel('Month')
plt.ylabel('Average Departure Delay (minutes)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
plt.figure(figsize=(14, 6))
sns.lineplot(data=arr_delay_by_airline, x='month', y='arr_delay', hue='Airline', marker='o')
plt.title('Average Arrival Delay by Airline Over Months')
plt.xlabel('Month')
plt.ylabel('Average Arrival Delay (minutes)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
monthly_avg = merged_df.groupby('month')[['dep_delay', 'arr_delay']].mean().reset_index()
plt.figure(figsize=(12, 5))
sns.lineplot(data=monthly_avg, x='month', y='dep_delay', marker='o', label='Departure Delay')
sns.lineplot(data=monthly_avg, x='month', y='arr_delay', marker='o', label='Arrival Delay')
plt.title('Average Delays by Month (All Airlines)')
plt.xlabel('Month')
plt.ylabel('Average Delay (minutes)')
plt.legend()
plt.grid(True)
plt.show()
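The challenge also asks about weeks and time of day, which the monthly view does not cover. A minimal sketch of the same aggregation at those two granularities follows. It uses the hour column (hour of scheduled departure) and builds a date from year, month and day to get the ISO week number. The function names are hypothetical.

```python
import pandas as pd


def delay_by_hour(df: pd.DataFrame) -> pd.DataFrame:
    """Mean departure and arrival delay per scheduled departure hour."""
    return df.groupby("hour")[["dep_delay", "arr_delay"]].mean().reset_index()


def delay_by_week(df: pd.DataFrame) -> pd.DataFrame:
    """Mean departure and arrival delay per ISO week of the year."""
    # pd.to_datetime assembles a date from columns named year, month and day.
    dates = pd.to_datetime(df[["year", "month", "day"]])
    # ISO weeks can assign the first days of January to week 52 or 53.
    week = dates.dt.isocalendar().week
    return df.groupby(week)[["dep_delay", "arr_delay"]].mean().reset_index()
```

Either result can be passed to sns.lineplot in the same way as monthly_avg, with x='hour' or x='week'.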
# Group by origin airport for delay averages
airport_delays = merged_df.groupby('origin')[['dep_delay', 'arr_delay']].mean().reset_index()
# Sort by departure delay
top_airports = airport_delays.sort_values(by='dep_delay', ascending=False).head(15)
# Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=top_airports, x='dep_delay', y='origin', palette='Greens_r')
plt.title('Top 15 Airports with Highest Average Departure Delay')
plt.xlabel('Avg Departure Delay (min)')
plt.ylabel('Origin Airport')
plt.show()
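The airport question also mentions traffic volume, and a mean delay alone can be dominated by airports with very few flights. The sketch below is one possible per-airport summary that adds the flight count and the share of flights delayed by at least 15 minutes. The function name, the output column names and the min_flights filter are assumptions for illustration; the 15-minute threshold is borrowed from the optional prediction question.

```python
import pandas as pd


def airport_summary(df, by="origin", delay_col="dep_delay", threshold=15, min_flights=1):
    """Flight count, mean delay and share of delayed flights per airport."""
    delay = df[delay_col]
    # 1.0 / 0.0 flag for "delayed"; flights with no recorded delay stay NaN
    # so they are excluded from the share instead of counting as on time.
    is_delayed = (delay >= threshold).astype(float).where(delay.notna())
    summary = (
        df.assign(is_delayed=is_delayed)
        .groupby(by)
        .agg(
            flights=(delay_col, "size"),
            mean_delay=(delay_col, "mean"),
            share_delayed=("is_delayed", "mean"),
        )
        .reset_index()
    )
    # Drop airports with too few flights for the averages to be meaningful.
    summary = summary[summary["flights"] >= min_flights]
    return summary.sort_values("mean_delay", ascending=False).reset_index(drop=True)
```

For arrival punctuality, grouping by destination is likely the more natural view: airport_summary(merged_df, by='dest', delay_col='arr_delay'). Plotting flights against mean_delay as a scatter plot is one way to check whether busier airports tend to have longer delays.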