Competition - Understanding Flight Delays ✈️

Understanding flight delays ✈️

📖 Background

You work for a major airline operating flights across the USA. Flight delays are a significant challenge for both the airline and passengers, causing disruptions, financial losses, and dissatisfaction. As part of the airline’s data analytics team, your goal is to analyze historical flight data to uncover delay patterns, identify operational inefficiencies, and predict delays before they occur. By identifying delay patterns, predicting delays, and uncovering the factors that contribute most to delays, you’ll be able to drive operational efficiency and enhance the overall passenger experience. Your insights will help the airline make data-driven decisions to optimize scheduling, improve on-time performance, and enhance passenger satisfaction.

Can you crack the code behind flight delays and revolutionize air travel?

💾 The data

Your team provided you with 2 files with the following information (source):

flights.csv

id - Id number of the flight
year - Year of Flight
month - Month of Flight
day - Day of Month
dep_time - Time of departure (24h format)
sched_dep_time - Scheduled departure time
dep_delay - Delay in departure (minutes)
arr_time - Time of arrival (24h format)
sched_arr_time - Scheduled arrival time
arr_delay - Delay in arrival (minutes)
carrier - Airline company code
flight - Flight number
tailnum- Aircraft identifier number
origin - Origin Airport - 3 letter code
dest - Destination Airport - 3 letter code
air_time - Duration of the flight (minutes)
distance - Flight distance (miles)
hour - Hour component of scheduled departure time
minute - Minute component of scheduled departure time

airlines_carrier_codes.csv

Carrier Code - Airline company code
Airline Name - Airline Name

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import hstack, csr_matrix
import lightgbm as lgb
from imblearn.over_sampling import SMOTENC
from sklearn.utils import resample

# Loading datasets

flight_data = pd.read_csv('data/flights.csv')
airlines_codes = pd.read_csv('data/airlines_carrier_codes.csv')
flight_data.head()

airlines_codes.head()

full_df=flight_data.merge(airlines_codes, left_on='carrier', right_on='Carrier Code', how='left')
full_df.head()

EDA

Checking the quality of the data by looking at:

missing values
duplicates
datatype
unexpected values

# Checking for duplicates
duplicated = full_df[full_df.duplicated()]

if duplicated.empty:
    print('There are no duplicates in the dataset')
else:
    print(f'There are duplicate rows in the dataset:\n{duplicated}')

# Checking the datatypes for all columns

full_df=full_df.rename(columns={'Airline Name':'airline_name'})
full_df.info()

full_df['sched_dep_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' + 
                                           full_df['month'].astype(str) + '-' + 
                                           full_df['day'].astype(str) + ' ' + 
                                           full_df['sched_dep_time'].astype(str).str.zfill(4), 
                                           format='%Y-%m-%d %H%M', errors='coerce')

full_df['dep_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' + 
                                     full_df['month'].astype(str) + '-' + 
                                     full_df['day'].astype(str) + ' ' + 
                                     full_df['dep_time'].astype(str).str.zfill(4), 
                                     format='%Y-%m-%d %H%M', errors='coerce')

full_df['dep_time'].fillna(full_df['sched_dep_time'] + pd.to_timedelta(full_df['dep_delay'], unit='m'), inplace=True)


full_df['sched_arr_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' + 
                                           full_df['month'].astype(str) + '-' + 
                                           full_df['day'].astype(str) + ' ' + 
                                           full_df['sched_arr_time'].astype(str).str.zfill(4), 
                                           format='%Y-%m-%d %H%M', errors='coerce')

full_df['arr_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' + 
                                     full_df['month'].astype(str) + '-' + 
                                     full_df['day'].astype(str) + ' ' + 
                                     full_df['arr_time'].astype(str).str.zfill(4), 
                                     format='%Y-%m-%d %H%M', errors='coerce')

full_df['arr_time'].fillna(full_df['dep_time'] + pd.to_timedelta(full_df['air_time'], unit='m'), inplace=True)

full_df['arr_delay'].fillna((full_df['arr_time'] - full_df['sched_arr_time']).dt.total_seconds() / 60, inplace=True)

columns_to_drop = ['Carrier Code']
full_df.drop(columns=columns_to_drop, inplace=True)

# Checking the % of missing values by column:
round(full_df.isna().sum()/len(full_df)*100,2)

print(f'Length before dropping remaining nan: {len(full_df)}')
full_df=full_df.dropna()
print(f'Length after dropping remaining nan: {len(full_df)}')

Exploring each column individually

for col in full_df.columns:
    print(f'Column: {col}')
    print(f'Datatype: {full_df[col].dtype}')
    
    if full_df[col].dtype == 'object':
        print(f'Unique values: {full_df[col].nunique()}')
        print(f'Most common values:\n{full_df[col].value_counts().head()}')
        plt.figure(figsize=(8, 4))
        sns.countplot(y=full_df[col], order=full_df[col].value_counts().index[:10])
        plt.title(f'Distribution of {col}')
        plt.xlabel('Count')
        plt.ylabel(col)
        plt.show()
    
    elif pd.api.types.is_numeric_dtype(full_df[col]):
        print(f'Basic Stats:\n{full_df[col].describe()}')
        plt.figure(figsize=(8, 4))
        sns.histplot(full_df[col], bins=30, kde=True)
        plt.title(f'Distribution of {col}')
        plt.xlabel(col)
        plt.ylabel('Frequency')
        plt.show()
    
    elif pd.api.types.is_datetime64_any_dtype(full_df[col]):
        print(f'Min: {full_df[col].min()}')
        print(f'Max: {full_df[col].max()}')
        plt.figure(figsize=(10, 5))
        full_df[col].dt.hour.value_counts().sort_index().plot(kind='bar')
        plt.title(f'Distribution of {col} (Hour-Based)')
        plt.xlabel('Hour')
        plt.ylabel('Count')
        plt.show()

    print('-' * 50)

How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year?

‌
‌
‌

Competition - Understanding Flight Delays ✈️

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Understanding flight delays ✈️

📖 Background

💾 The data

EDA

How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year?

Understanding flight delays ✈️