Skip to content
0

Understanding flight delays ✈️

📖 Background

You work for a major airline operating flights across the USA. Flight delays are a significant challenge for both the airline and passengers, causing disruptions, financial losses, and dissatisfaction. As part of the airline’s data analytics team, your goal is to analyze historical flight data to uncover delay patterns, identify operational inefficiencies, and predict delays before they occur. By identifying delay patterns, predicting delays, and uncovering the factors that contribute most to delays, you’ll be able to drive operational efficiency and enhance the overall passenger experience. Your insights will help the airline make data-driven decisions to optimize scheduling, improve on-time performance, and enhance passenger satisfaction.

Can you crack the code behind flight delays and revolutionize air travel?

💾 The data

Your team provided you with 2 files with the following information (source):

flights.csv

  • id - Id number of the flight
  • year - Year of Flight
  • month - Month of Flight
  • day - Day of Month
  • dep_time - Time of departure (24h format)
  • sched_dep_time - Scheduled departure time
  • dep_delay - Delay in departure (minutes)
  • arr_time - Time of arrival (24h format)
  • sched_arr_time - Scheduled arrival time
  • arr_delay - Delay in arrival (minutes)
  • carrier - Airline company code
  • flight - Flight number
  • tailnum- Aircraft identifier number
  • origin - Origin Airport - 3 letter code
  • dest - Destination Airport - 3 letter code
  • air_time - Duration of the flight (minutes)
  • distance - Flight distance (miles)
  • hour - Hour component of scheduled departure time
  • minute - Minute component of scheduled departure time

airlines_carrier_codes.csv

  • Carrier Code - Airline company code
  • Airline Name - Airline Name
import pandas as pd
flight_data = pd.read_csv('data/flights.csv')
airlines_codes = pd.read_csv('data/airlines_carrier_codes.csv')
flight_data.head()
airlines_codes.head() 

💪 Challenge

Create a report summarizing your insights. Your report should explore the following questions:

  1. How do different airlines compare in terms of their departure and arrival times? Are there noticeable trends in their on-time performance over the year? A well-structured visualization could help uncover patterns.
  2. Are there particular months/weeks/time of day where there is a general trend of greater delays in flights across all carriers? If so, what could be the reasons?
  3. Some airports seem to operate like clockwork, while others are notorious for disruptions. How do different airports compare when it comes to departure and arrival punctuality? Could location, traffic volume, or other factors play a role? Are there patterns that emerge when looking at delays across various airports?
  4. [Optional 1] Predict whether a flight will have a delay of 15 minutes or more at departure.
  5. [Optional 2] What underlying factors influence flight delays the most? Are some routes more prone to disruptions than others? Do external variables like time of day, distance, or carrier policies play a significant role? By analyzing the relationships between different features, you might discover unexpected insights.

🧑‍⚖️ Judging criteria: your vote, your winners!

This is a community-driven competition, your votes decide the winners! Once the competition ends, you'll get to explore submissions, celebrate the best insights, and vote for your favorites. The top 5 most upvoted entries will win exclusive DataCamp merchandise - so bring your A-game, impress your peers, and claim your spot at the top!

✅ Checklist before publishing

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the introduction to data science notebooks, so the workbook is focused on your story.
  • Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

Data Exploration

flight_data.info()
flight_data.describe()
airlines_codes.info()

🧩 Data Preparation

flight_data.isnull().sum().sort_values(ascending=False)
# Drop rows with critical missing time/delay data
critical_cols = ['arr_delay', 'air_time', 'arr_time', 'dep_time', 'dep_delay']
flight_data_clean = flight_data.dropna(subset=critical_cols)

# Optional: Fill in tail numbers with a placeholder
flight_data_clean['tailnum'].fillna('UNKNOWN', inplace=True)