A prominent airline company in the Pacific Northwest has accumulated extensive data related to flights and weather patterns and needs to understand the factors influencing the departure delays and cancellations to benefit both airlines and passengers. As the data analyst on the team, you decide to embark on this analytical project.
The aviation industry is dynamic with various variables impacting flight operations. To ensure the relevance and applicability of your findings, you choose to focus solely on flights from the 'pnwflights2022' datasets available from the ModernDive team exported as CSV files. These datasets provide comprehensive information on flights departing in the first half of 2022 from both of the two major airports in this region: SEA (Seattle-Tacoma International Airport) and PDX (Portland International Airport):
flights2022.csvcontains information about about each flight including
| Variable | Description |
|---|---|
dep_time | Departure time (in the format hhmm) whereNA corresponds to a cancelled flight |
dep_delay | Departure delay, in minutes (negative for early) |
origin | Origin airport where flight starts (IATA code) |
airline | Carrier/airline name |
dest | Destination airport where flight lands (IATA code) |
flights_weather2022.csvcontains the same flight information as well as weather conditions such as
| Variable | Description |
|---|---|
visib | Visibility (in miles) |
wind_gust | Wind gust speed (in mph) |
import pandas as pd
# Load CSV files
flights = pd.read_csv('flights2022.csv')
flights_weather = pd.read_csv('flights_weather2022.csv')
# Create route using dash
flights['route'] = flights['origin'] + '-' + flights['dest']
# Cancelled flag
flights['cancelled'] = flights['dep_time'].isna()
# Departure hour
flights['dep_hour'] = flights['dep_time'].fillna(0).astype(int) // 100
# Quick checks
print(flights.head())
print(flights_weather.head())
routes_delays_cancels = (
flights
.groupby('route')
.agg(
mean_dep_delay = ('dep_delay', 'mean'),
total_cancellations = ('cancelled', 'sum')
)
.reset_index()
)
print(routes_delays_cancels.head())
airlines_delays_cancels = (
flights
.groupby('airline')
.agg(
mean_dep_delay = ('dep_delay', 'mean'),
total_cancellations = ('cancelled', 'sum')
)
.reset_index()
)
print(airlines_delays_cancels.head())
import matplotlib.pyplot as plt
# 1. Create top_routes_by_cancellations DataFrame (top 9 by total_cancellations)
top_routes_by_cancellations = routes_delays_cancels.nlargest(9, 'total_cancellations')
print(top_routes_by_cancellations)
# Plot 1: Top 9 routes by cancellations
plt.figure(figsize=(10,6))
plt.bar(top_routes_by_cancellations['route'], top_routes_by_cancellations['total_cancellations'], color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title('Top 9 Routes by Number of Cancellations')
plt.ylabel('Number of Cancellations')
plt.xlabel('Route')
plt.tight_layout()
top9_route_cancels_bar = plt.gcf()
# 2. Create top 9 airlines by mean departure delay
top_airlines_by_delay = airlines_delays_cancels.nlargest(9, 'mean_dep_delay')
print(top_airlines_by_delay)
# Plot 2: Top 9 airlines by average departure delay
plt.figure(figsize=(10,6))
plt.bar(top_airlines_by_delay['airline'], top_airlines_by_delay['mean_dep_delay'], color='salmon')
plt.xticks(rotation=45, ha='right')
plt.title('Top 9 Airlines by Mean Departure Delay')
plt.ylabel('Mean Delay (min)')
plt.xlabel('Airline')
plt.tight_layout()
top9_airline_delays_bar = plt.gcf()
# Wind-gust effect analysis
def windy_delay(origin_code):
sub = flights_weather[flights_weather['origin'] == origin_code]
high = sub[sub['wind_gust'] >= 10]['dep_delay'].mean()
low = sub[sub['wind_gust'] < 10]['dep_delay'].mean()
return high, low
sea_high, sea_low = windy_delay('SEA')
pdx_high, pdx_low = windy_delay('PDX')
wind_response = (sea_high > sea_low) and (pdx_high > pdx_low)
print(f"SEA: High-gust avg delay = {sea_high:.1f}, Low-gust avg = {sea_low:.1f}")
print(f"PDX: High-gust avg delay = {pdx_high:.1f}, Low-gust avg = {pdx_low:.1f}")
print("wind_response =", wind_response)