EDA of Indian Flights 2019

Indian Flights 2019

1) Executive Summary

The dataset for this project was sourced from publicly available information and provided by DataCamp instructors for analysis.

The project involved a comprehensive analysis of a flight dataset from 2019 using Python. The dataset encompassed key attributes such as airline, date, origin, destination, route, departure/arrival times, duration, stops, and price.

Rigorous data cleaning and pre-processing were undertaken to address missing values, inconsistencies, and outliers, while generating new features for enhanced insights. Exploratory Data Analysis (EDA) revealed that the data spanned four months with a consistent sample size of ten days per month and was limited to five routes.

Univariate, bivariate, and multivariate analyses were conducted on various features, leading to the formulation of five hypotheses. Subsequent hypothesis testing found statistically significant results for three of these hypotheses.

With this project, I intend to showcase my proficiency in data manipulation, analysis, and interpretation, demonstrating my capabilities as a data analyst.

2) The Data

We have data on domestic Indian flights for the year 2019. It comprises the following fields:

Airline : Name of the carrier airline
Date_of_Journey : Date of the flight
Source : Departure city
Destination : Arrival city
Route : Complete journey path (e.g., DEL -> BOM -> HYD)
Dep_Time : Departure time
Arrival_Time : Arrival time
Duration : Total flight duration
Total_Stops : Number of intermediate stops
Price : Ticket price of each flight

# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

# importing data

flights = pd.read_csv('datasets/planes.csv')
flights.head()

flights.shape

flights.info()

There are 10660 entries and 11 columns in the dataset. Several columns contain null values, and every column except Price has been imported as an object. We may need to fix the data types of some columns for accurate analysis.
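As a rough illustration of fixing a data type, the sketch below parses a text date column into a proper datetime. It runs on a small toy frame rather than on `flights`; the sample values and the `%d/%m/%Y` date format are assumptions made for the example, not facts taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for `flights`; the values and the date format
# are illustrative assumptions, not taken from the real data.
toy = pd.DataFrame({
    "Date_of_Journey": ["09/06/2019", "12/05/2019", None],
    "Price": [13882.0, 6218.0, 13302.0],
})

# Parse the text column into datetimes. errors="coerce" turns entries
# that cannot be parsed (and missing ones) into NaT instead of raising.
toy["Date_of_Journey"] = pd.to_datetime(
    toy["Date_of_Journey"], format="%d/%m/%Y", errors="coerce"
)

print(toy.dtypes)
```

Once a column is a datetime, parts such as the month are available through the `.dt` accessor, which is convenient for generating new features later on.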

2.1) Handling Missing Data

flights.isna().sum().plot(kind='bar')
plt.show()

Let's focus our attention on the null values that make up over 5% of the total entries in the dataset. Since we have no additional data with which to fill in the missing entries, the columns with fewer than 5% nulls can be cleaned with the .dropna() method, under the assumption that those values are missing completely at random (MCAR).
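The 5% rule described above can be sketched as follows. This is a minimal example on a toy frame, not the notebook's actual cell; the names `toy`, `threshold`, and `cols_to_drop` are hypothetical, and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `flights` (100 rows, made-up values).
toy = pd.DataFrame({
    "Airline": ["A", "B", None] + ["C"] * 97,  # 1% missing
    "Price": [np.nan] * 10 + [5000.0] * 90,    # 10% missing
})

# 5% of the total number of rows.
threshold = len(toy) * 0.05

# Count nulls per column and keep the columns that have some nulls,
# but fewer than the threshold.
null_counts = toy.isna().sum()
cols_to_drop = null_counts[
    (null_counts > 0) & (null_counts < threshold)
].index.tolist()

# Drop only the rows that are null in those low-missingness columns.
cleaned = toy.dropna(subset=cols_to_drop)

print(cols_to_drop)
print(cleaned.isna().sum())
```

Columns above the threshold (here `Price`) are left untouched by this step, so they can be handled separately, for example by filling in a value.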

2.1.1) Fill In Value - 'Unknown'