Understanding flight delays ✈️
📖 Background
You work for a major airline operating flights across the USA. Flight delays are a significant challenge for both the airline and passengers, causing disruptions, financial losses, and dissatisfaction. As part of the airline’s data analytics team, your goal is to analyze historical flight data to uncover delay patterns, identify operational inefficiencies, and predict delays before they occur. By identifying delay patterns, predicting delays, and uncovering the factors that contribute most to delays, you’ll be able to drive operational efficiency and enhance the overall passenger experience. Your insights will help the airline make data-driven decisions to optimize scheduling, improve on-time performance, and enhance passenger satisfaction.
Can you crack the code behind flight delays and revolutionize air travel?
💾 The data
Your team provided you with 2 files with the following information (source):
flights.csv
id- Id number of the flightyear- Year of Flightmonth- Month of Flightday- Day of Monthdep_time- Time of departure (24h format)sched_dep_time- Scheduled departure timedep_delay- Delay in departure (minutes)arr_time- Time of arrival (24h format)sched_arr_time- Scheduled arrival timearr_delay- Delay in arrival (minutes)carrier- Airline company codeflight- Flight numbertailnum- Aircraft identifier numberorigin- Origin Airport - 3 letter codedest- Destination Airport - 3 letter codeair_time- Duration of the flight (minutes)distance- Flight distance (miles)hour- Hour component of scheduled departure timeminute- Minute component of scheduled departure time
airlines_carrier_codes.csv
Carrier Code- Airline company codeAirline Name- Airline Name
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import hstack, csr_matrix
import lightgbm as lgb
from imblearn.over_sampling import SMOTENC
from sklearn.utils import resample# Loading datasets
flight_data = pd.read_csv('data/flights.csv')
airlines_codes = pd.read_csv('data/airlines_carrier_codes.csv')
flight_data.head()airlines_codes.head() full_df=flight_data.merge(airlines_codes, left_on='carrier', right_on='Carrier Code', how='left')
full_df.head()EDA
Checking the quality of the data by looking at:
- missing values
- duplicates
- datatype
- unexpected values
# Checking for duplicates
duplicated = full_df[full_df.duplicated()]
if duplicated.empty:
print('There are no duplicates in the dataset')
else:
print(f'There are duplicate rows in the dataset:\n{duplicated}')# Checking the datatypes for all columns
full_df=full_df.rename(columns={'Airline Name':'airline_name'})
full_df.info()full_df['sched_dep_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' +
full_df['month'].astype(str) + '-' +
full_df['day'].astype(str) + ' ' +
full_df['sched_dep_time'].astype(str).str.zfill(4),
format='%Y-%m-%d %H%M', errors='coerce')
full_df['dep_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' +
full_df['month'].astype(str) + '-' +
full_df['day'].astype(str) + ' ' +
full_df['dep_time'].astype(str).str.zfill(4),
format='%Y-%m-%d %H%M', errors='coerce')
full_df['dep_time'].fillna(full_df['sched_dep_time'] + pd.to_timedelta(full_df['dep_delay'], unit='m'), inplace=True)
full_df['sched_arr_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' +
full_df['month'].astype(str) + '-' +
full_df['day'].astype(str) + ' ' +
full_df['sched_arr_time'].astype(str).str.zfill(4),
format='%Y-%m-%d %H%M', errors='coerce')
full_df['arr_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' +
full_df['month'].astype(str) + '-' +
full_df['day'].astype(str) + ' ' +
full_df['arr_time'].astype(str).str.zfill(4),
format='%Y-%m-%d %H%M', errors='coerce')
full_df['arr_time'].fillna(full_df['dep_time'] + pd.to_timedelta(full_df['air_time'], unit='m'), inplace=True)
full_df['arr_delay'].fillna((full_df['arr_time'] - full_df['sched_arr_time']).dt.total_seconds() / 60, inplace=True)
columns_to_drop = ['Carrier Code']
full_df.drop(columns=columns_to_drop, inplace=True)
# Checking the % of missing values by column:
round(full_df.isna().sum()/len(full_df)*100,2)print(f'Length before dropping remaining nan: {len(full_df)}')
full_df=full_df.dropna()
print(f'Length after dropping remaining nan: {len(full_df)}')Exploring each column individually
for col in full_df.columns:
print(f'Column: {col}')
print(f'Datatype: {full_df[col].dtype}')
if full_df[col].dtype == 'object':
print(f'Unique values: {full_df[col].nunique()}')
print(f'Most common values:\n{full_df[col].value_counts().head()}')
plt.figure(figsize=(8, 4))
sns.countplot(y=full_df[col], order=full_df[col].value_counts().index[:10])
plt.title(f'Distribution of {col}')
plt.xlabel('Count')
plt.ylabel(col)
plt.show()
elif pd.api.types.is_numeric_dtype(full_df[col]):
print(f'Basic Stats:\n{full_df[col].describe()}')
plt.figure(figsize=(8, 4))
sns.histplot(full_df[col], bins=30, kde=True)
plt.title(f'Distribution of {col}')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.show()
elif pd.api.types.is_datetime64_any_dtype(full_df[col]):
print(f'Min: {full_df[col].min()}')
print(f'Max: {full_df[col].max()}')
plt.figure(figsize=(10, 5))
full_df[col].dt.hour.value_counts().sort_index().plot(kind='bar')
plt.title(f'Distribution of {col} (Hour-Based)')
plt.xlabel('Hour')
plt.ylabel('Count')
plt.show()
print('-' * 50)
How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year?