Introduction
This data analysis project identifies trends in flight durations, using data from the 'nycflights2022' collection produced by the ModernDive team.
The datasets contain records of flights that departed during the second half of 2022 from the three major New York City airports:
- JFK (John F. Kennedy International Airport)
- LGA (LaGuardia Airport), and
- EWR (Newark Liberty International Airport)
Together they offer a broad view of flight operations, covering departure and arrival times, flight paths, and airline details.
Only the columns of interest for this project are listed below.
flights2022-h2.csv contains information about each flight, including:

| Variable | Description |
|---|---|
| carrier | Airline carrier code |
| origin | Origin airport (IATA code) |
| dest | Destination airport (IATA code) |
| air_time | Duration of the flight in air, in minutes |

airlines.csv contains information about each airline:

| Variable | Description |
|---|---|
| carrier | Airline carrier code |
| name | Full name of the airline |

airports.csv provides details of airports:

| Variable | Description |
|---|---|
| faa | FAA code of the airport |
| name | Full name of the airport |
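
The shared columns in these tables suggest how they fit together: carrier links flights to airlines, and dest (or origin) plausibly matches faa in airports. Below is a minimal sketch of those lookups in base R. The `*_toy` data frames, their rows, and the `airline_name` and `dest_name` column names are made up for illustration and are not part of the project data; the match between dest and faa is an assumption.

```r
# Toy stand-ins for the three tables; all rows are invented for illustration.
flights_toy <- data.frame(
  carrier  = c("DL", "UA", "DL"),
  origin   = c("JFK", "EWR", "LGA"),
  dest     = c("ATL", "ORD", "ATL"),
  air_time = c(110, 120, 105)
)
airlines_toy <- data.frame(
  carrier = c("DL", "UA"),
  name    = c("Delta Air Lines Inc.", "United Air Lines Inc.")
)
airports_toy <- data.frame(
  faa  = c("ATL", "ORD"),
  name = c("Hartsfield Jackson Atlanta Intl", "Chicago Ohare Intl")
)

# Attach the airline name through the shared carrier code (left join).
joined <- merge(flights_toy, airlines_toy, by = "carrier", all.x = TRUE)
names(joined)[names(joined) == "name"] <- "airline_name"

# Attach the destination airport name, assuming dest corresponds to faa.
joined <- merge(joined, airports_toy, by.x = "dest", by.y = "faa", all.x = TRUE)
names(joined)[names(joined) == "name"] <- "dest_name"

print(joined)
```

Both tables have a column called name, so each one is renamed right after its join to keep the airline and airport names apart. With dplyr loaded, `left_join()` would be an alternative to `merge()` here.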
# Importing required packages
library(dplyr)
library(readr)
library(lubridate)
# Loading the data
flights <- read_csv("flights2022-h2.csv")
airlines <- read_csv("airlines.csv")
airports <- read_csv("airports.csv")
The dataset is limited to the second half (H2) of 2022 and covers flights from only three NYC airports.
head(flights)
nrow(flights)
The flights dataset has 218802 rows, and the time data is stored in numeric format. On inspection, sched_dep_time is in 24-hour notation, as confirmed by the hour, minute, and time_hour columns at the end. Let's check whether other time columns such as dep_time, arr_delay, and air_time follow a similar format.
summary(flights)
The numeric column statistics show that the time data follows the 24-hour format but is stored as a numeric data type, which needs to change for analysis. This is true for most time-related entries in this dataset.
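
To make the numeric 24-hour encoding concrete, here is a small sketch of how a clock time stored as a number can be split into hours and minutes. The values are hypothetical examples, not rows from the dataset, and the sketch assumes the usual HHMM packing (517 means 05:17).

```r
# Hypothetical clock times stored as numbers in HHMM form,
# e.g. 517 means 05:17 and 1430 means 14:30.
sched_dep_time <- c(517, 1430, 5, NA)

hour   <- sched_dep_time %/% 100   # integer division keeps the hour part
minute <- sched_dep_time %% 100    # remainder keeps the minute part

# Rebuild a zero-padded "HH:MM" label, leaving missing values as NA.
clock <- ifelse(
  is.na(sched_dep_time), NA,
  sprintf("%02d:%02d", as.integer(hour), as.integer(minute))
)
print(clock)
```

This decoding only makes sense for columns that hold a time of day. Columns that hold a count of minutes, such as a delay, need no unpacking.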
head(airports)
head(airlines)
Data Wrangling
I'll fix the air_time column in the flights dataset for the current analysis:
# converting numeric values to zero-padded 4-character strings, keeping NAs
air_time_char <- ifelse(is.na(flights$air_time), NA, sprintf("%04d", flights$air_time))
head(air_time_char)
# adding a colon between the first two and last two characters so the strings parse properly
air_time_colon <- ifelse(is.na(air_time_char), NA,
                         paste0(substr(air_time_char, 1, 2), ":", substr(air_time_char, 3, 4)))
head(air_time_colon)
# parsing the "HH:MM" strings into lubridate periods
flights$air_time <- hm(air_time_colon)
# flights dataset after changes
glimpse(flights)
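
One point is worth checking against the variable table, which describes air_time as a duration in minutes. If that reading holds, a value such as 150 means 2 hours 30 minutes, whereas zero-padding it to "0150" and parsing it as "01:50" gives 1 hour 50 minutes. The sketch below shows the minutes-based interpretation in base R. The input values and the `air_time_label` name are hypothetical, and this is an alternative under that assumption rather than the conversion used above.

```r
# Hypothetical air_time values, assumed to be counts of minutes.
air_time_min <- c(150, 45, NA, 312)

hours   <- air_time_min %/% 60   # whole hours contained in the duration
minutes <- air_time_min %% 60    # minutes left over after the whole hours

# Human-readable label, leaving missing values as NA.
air_time_label <- ifelse(
  is.na(air_time_min), NA,
  sprintf("%dH %dM", as.integer(hours), as.integer(minutes))
)
print(air_time_label)
```

With lubridate loaded, `minutes(air_time_min)` or `dminutes(air_time_min)` would give period or duration objects under the same assumption.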
Analysis