Skip to content
0

Understanding flight delays ✈️

📖 Background

You work for a major airline operating flights across the USA. Flight delays are a significant challenge for both the airline and passengers, causing disruptions, financial losses, and dissatisfaction. As part of the airline’s data analytics team, your goal is to analyze historical flight data to uncover delay patterns, identify operational inefficiencies, and predict delays before they occur. By identifying delay patterns, predicting delays, and uncovering the factors that contribute most to delays, you’ll be able to drive operational efficiency and enhance the overall passenger experience. Your insights will help the airline make data-driven decisions to optimize scheduling, improve on-time performance, and enhance passenger satisfaction.

Can you crack the code behind flight delays and revolutionize air travel?

💾 The data

Your team provided you with 2 files with the following information (source):

flights.csv

  • id - Id number of the flight
  • year - Year of Flight
  • month - Month of Flight
  • day - Day of Month
  • dep_time - Time of departure (24h format)
  • sched_dep_time - Scheduled departure time
  • dep_delay - Delay in departure (minutes)
  • arr_time - Time of arrival (24h format)
  • sched_arr_time - Scheduled arrival time
  • arr_delay - Delay in arrival (minutes)
  • carrier - Airline company code
  • flight - Flight number
  • tailnum- Aircraft identifier number
  • origin - Origin Airport - 3 letter code
  • dest - Destination Airport - 3 letter code
  • air_time - Duration of the flight (minutes)
  • distance - Flight distance (miles)
  • hour - Hour component of scheduled departure time
  • minute - Minute component of scheduled departure time

airlines_carrier_codes.csv

  • Carrier Code - Airline company code
  • Airline Name - Airline Name
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import hstack, csr_matrix
import lightgbm as lgb
from imblearn.over_sampling import SMOTENC
from sklearn.utils import resample
# Loading datasets

flight_data = pd.read_csv('data/flights.csv')
airlines_codes = pd.read_csv('data/airlines_carrier_codes.csv')
flight_data.head()
airlines_codes.head() 
full_df=flight_data.merge(airlines_codes, left_on='carrier', right_on='Carrier Code', how='left')
full_df.head()

EDA

Checking the quality of the data by looking at:

  • missing values
  • duplicates
  • datatype
  • unexpected values
# Checking for duplicates
duplicated = full_df[full_df.duplicated()]

if duplicated.empty:
    print('There are no duplicates in the dataset')
else:
    print(f'There are duplicate rows in the dataset:\n{duplicated}')
# Checking the datatypes for all columns

full_df=full_df.rename(columns={'Airline Name':'airline_name'})
full_df.info()
full_df['sched_dep_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' + 
                                           full_df['month'].astype(str) + '-' + 
                                           full_df['day'].astype(str) + ' ' + 
                                           full_df['sched_dep_time'].astype(str).str.zfill(4), 
                                           format='%Y-%m-%d %H%M', errors='coerce')

full_df['dep_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' + 
                                     full_df['month'].astype(str) + '-' + 
                                     full_df['day'].astype(str) + ' ' + 
                                     full_df['dep_time'].astype(str).str.zfill(4), 
                                     format='%Y-%m-%d %H%M', errors='coerce')

full_df['dep_time'].fillna(full_df['sched_dep_time'] + pd.to_timedelta(full_df['dep_delay'], unit='m'), inplace=True)


full_df['sched_arr_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' + 
                                           full_df['month'].astype(str) + '-' + 
                                           full_df['day'].astype(str) + ' ' + 
                                           full_df['sched_arr_time'].astype(str).str.zfill(4), 
                                           format='%Y-%m-%d %H%M', errors='coerce')

full_df['arr_time'] = pd.to_datetime(full_df['year'].astype(str) + '-' + 
                                     full_df['month'].astype(str) + '-' + 
                                     full_df['day'].astype(str) + ' ' + 
                                     full_df['arr_time'].astype(str).str.zfill(4), 
                                     format='%Y-%m-%d %H%M', errors='coerce')

full_df['arr_time'].fillna(full_df['dep_time'] + pd.to_timedelta(full_df['air_time'], unit='m'), inplace=True)

full_df['arr_delay'].fillna((full_df['arr_time'] - full_df['sched_arr_time']).dt.total_seconds() / 60, inplace=True)

columns_to_drop = ['Carrier Code']
full_df.drop(columns=columns_to_drop, inplace=True)
# Checking the % of missing values by column:
round(full_df.isna().sum()/len(full_df)*100,2)
print(f'Length before dropping remaining nan: {len(full_df)}')
full_df=full_df.dropna()
print(f'Length after dropping remaining nan: {len(full_df)}')

Exploring each column individually

for col in full_df.columns:
    print(f'Column: {col}')
    print(f'Datatype: {full_df[col].dtype}')
    
    if full_df[col].dtype == 'object':
        print(f'Unique values: {full_df[col].nunique()}')
        print(f'Most common values:\n{full_df[col].value_counts().head()}')
        plt.figure(figsize=(8, 4))
        sns.countplot(y=full_df[col], order=full_df[col].value_counts().index[:10])
        plt.title(f'Distribution of {col}')
        plt.xlabel('Count')
        plt.ylabel(col)
        plt.show()
    
    elif pd.api.types.is_numeric_dtype(full_df[col]):
        print(f'Basic Stats:\n{full_df[col].describe()}')
        plt.figure(figsize=(8, 4))
        sns.histplot(full_df[col], bins=30, kde=True)
        plt.title(f'Distribution of {col}')
        plt.xlabel(col)
        plt.ylabel('Frequency')
        plt.show()
    
    elif pd.api.types.is_datetime64_any_dtype(full_df[col]):
        print(f'Min: {full_df[col].min()}')
        print(f'Max: {full_df[col].max()}')
        plt.figure(figsize=(10, 5))
        full_df[col].dt.hour.value_counts().sort_index().plot(kind='bar')
        plt.title(f'Distribution of {col} (Hour-Based)')
        plt.xlabel('Hour')
        plt.ylabel('Count')
        plt.show()

    print('-' * 50)
How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year?