Skip to content
New Workbook
Sign up
Project: Analyzing Flight Delays and Cancellations

Python exercise: Flight Delays and Cancellations

Title goes here

A prominent airline company in the Pacific Northwest has accumulated extensive data related to flights and weather patterns and needs to understand the factors influencing the departure delays and cancellations to benefit both airlines and passengers. As the data analyst on the team, you decide to embark on this analytical project.

The aviation industry is dynamic with various variables impacting flight operations. To ensure the relevance and applicability of your findings, you choose to focus solely on flights from the 'pnwflights2022' datasets available from the ModernDive team exported as CSV files. These datasets provide comprehensive information on flights departing in the first half of 2022 from both of the two major airports in this region: SEA (Seattle-Tacoma International Airport) and PDX (Portland International Airport):

  • flights2022.csv contains information about about each flight including
VariableDescription
dep_timeDeparture time (in the format hhmm) whereNA corresponds to a cancelled flight
dep_delayDeparture delay, in minutes (negative for early)
originOrigin airport where flight starts (IATA code)
airlineCarrier/airline name
destDestination airport where flight lands (IATA code)
  • flights_weather2022.csv contains the same flight information as well as weather conditions such as
VariableDescription
visibVisibility (in miles)
wind_gustWind gust speed (in mph)

1. Loading and manipulating data

Load the flights2022 dataset and create the route column.

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# import flights
flights = pd.read_csv('flights2022.csv')

# preview df
flights.head()

# import weather
flights_weather = pd.read_csv('flights_weather2022.csv')

# create route and place it in front
flights['route'] = flights['origin'] + '-' + flights['dest']
columns = ['route'] + [col for col in flights.columns if col != 'route']
flights= flights[columns]


# create cancelled. error: float
#flights_weather['cancelled'] = np.where(flights_weather['dep_time'].isnull(),1,0)

# preview df
nulls = flights.isnull().sum().sort_values(ascending=False)
null_filtered = nulls[nulls > 0]
print('count of nulls by columns, if column contains nulls:')
print(null_filtered)

flights.head()
print(flights.shape)

flights.head(3)
print(flights_weather.shape)

flights_weather.head(3)
unique_counts = flights.nunique().sort_values(ascending=False)

# Filter columns with fewer than 10 unique values
columns_to_display = unique_counts[unique_counts < 10].index.tolist()

# Display unique values for columns with fewer than 10 unique values
for column in columns_to_display:
    print(f"Column: {column}")
    print(flights[column].unique())
    print()
    
print(unique_counts)

2. Finding the average of aggregated data

For each route, use .groupby() and .agg() to find the mean departure delay and number of canceled flights as routes_delays_cancels, and do a similar analysis for airlines in airlines_delays_cancels.

# create routes_delays_cancels
# routes_delays_cancels = flights.groupby('route').agg(
# 	mean_dep_delay=('dep_delay', 'mean'),
# 	total_cancellations=('dep_time', lambda x: x.isna().sum())
# ).reset_index().round(2)
routes_delays_cancels = flights.groupby('route').agg(
	mean_dep_delay=('dep_delay', 'mean'),
	total_cancellations=('dep_time', lambda x: x.isna().sum())
).reset_index()

routes_delays_cancels
# identify routes with the highest number of cancellations
top_routes_by_cancellations = routes_delays_cancels.sort_values('total_cancellations', ascending=False).head(9)
top_routes_by_cancellations

# import seaborn
import seaborn as sns
import matplotlib.pyplot as plt

# top9_route_cancels_bar, ax = plt.subplots()
# sns.set_style('darkgrid')
# sns.set_context('paper')
# ax.barh(top_routes_by_cancellations['route'], top_routes_by_cancellations['total_cancellations'], color='darkred' )
# ax.set_xlabel('total cancellations')
# ax.set_ylabel('route')
# ax.set_title('Top 9 Routes with Highest Number of Cancellations - Jan to Jun 2022')
# plt.gca().invert_yaxis() 
# plt.xlim(0,100)
# plt.show()
# identify routes with the highest mean dep delay
top_routes_by_delays = routes_delays_cancels.sort_values('mean_dep_delay', ascending=False).head(9)
top_routes_by_delays

# top9_route_delays_bar, ax = plt.subplots()
# sns.set_style('whitegrid')
# sns.set_context('paper')
# ax.barh(top_routes_by_delays['route'], top_routes_by_delays['mean_dep_delay'], color='darkblue' )
# ax.set_xlabel('mean departure delay, in minutes')
# ax.set_ylabel('route')
# ax.set_title('Top 9 Routes with Highest Mean Departure Delay - Jan to Jun 2022')
# plt.gca().invert_yaxis() 
# plt.xlim(0,60)
# plt.show()
# create airlines_delays_cancels
airlines_delays_cancels = flights_weather.groupby('airline').agg(
    mean_dep_delay=('dep_delay', 'mean'),
	total_cancellations=('dep_time', lambda x: x.isna().sum())
).reset_index()
# airlines_delays_cancels = flights_weather.groupby('airline').agg(
#     mean_dep_delay=('dep_delay', 'mean'),
# 	total_cancellations=('dep_time', lambda x: x.isna().sum())
# ).reset_index().round(2)
airlines_delays_cancels
# identify airlines with the highest number of cancellations
top_airlines_by_cancellations = airlines_delays_cancels.sort_values('total_cancellations', ascending=False).head(9)
top_airlines_by_cancellations

top9_route_cancels_bar, ax = plt.subplots()
ax.bar(top_routes_by_cancellations['route'], top_routes_by_cancellations['total_cancellations'])
ax.set_xlabel('route')
ax.set_ylabel('total cancellations')
ax.set_title('routes with highest number of cancellations')
ax.set_xticklabels(top_routes_by_cancellations['route'], rotation=90)
plt.show()

# top9_airline_cancels_bar, ax = plt.subplots()
# sns.set_style('darkgrid')
# sns.set_context('paper')
# ax.barh(top_airline_by_cancellations['airline'], top_airline_by_cancellations['total_cancellations'], color='dar' )
# ax.set_xlabel('total cancellations')
# ax.set_ylabel('airline')
# ax.set_title('Top 9 Airline with Highest Number of Cancellations - Jan to Jun 2022')
# plt.gca().invert_yaxis() 
# plt.xlim(0,1400)
# plt.show()