Indian Flights 2019
1) Executive Summary
The dataset for this project was sourced from publicly available information and provided by DataCamp instructors for analysis.
The project involved a comprehensive analysis of a flight dataset from 2019 using Python. The dataset encompassed key attributes such as airline, date, origin, destination, route, departure/arrival times, duration, stops, and price.
Rigorous data cleaning and pre-processing were undertaken to address missing values, inconsistencies, and outliers, while generating new features for enhanced insights. Exploratory Data Analysis (EDA) revealed that the data spanned four months with a consistent sample size of ten days per month and was limited to five routes.
Univariate, bivariate, and multivariate analyses were conducted on various features, leading to the formulation of five hypotheses. Subsequent hypothesis testing found statistically significant support for three of them.
With this project, I intend to showcase my proficiency in data manipulation, analysis, and interpretation, demonstrating my capabilities as a data analyst.
2) The Data
We have data on domestic Indian flights for the year 2019. It comprises the following fields:
Airline: Name of the carrier airline
Date_of_Journey: Date of the flight
Source: Departure city
Destination: Arrival city
Route: Complete journey path (e.g., DEL -> BOM -> HYD)
Dep_Time: Departure time
Arrival_Time: Arrival time
Duration: Total flight duration
Total_Stops: Number of intermediate stops
Price: Ticket price of each flight
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# importing data
flights = pd.read_csv('datasets/planes.csv')
flights.head()
flights.shape
flights.info()
There are 10660 entries and 11 columns in the dataset. Several columns contain null values, and every column except Price has been imported as an object. We may need to fix the data types of some columns for accurate analysis.
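To illustrate what fixing the data types could look like, here is a minimal sketch on a small made-up frame. The column names are borrowed from the field list above, but the values and the date format are assumptions for illustration only; the real file may store dates differently and need a `format=` or `dayfirst=` argument.

```python
import pandas as pd

# Toy stand-in for the flights data; values and the date format are assumed.
toy = pd.DataFrame({
    "Date_of_Journey": ["2019-06-09", "2019-05-12", None],
    "Airline": ["Jet Airways", "IndiGo", "IndiGo"],
    "Price": [13882.0, 6218.0, 13302.0],
})

# Parse the date strings; errors="coerce" turns unparseable or missing
# entries into NaT instead of raising an exception.
toy["Date_of_Journey"] = pd.to_datetime(toy["Date_of_Journey"], errors="coerce")

# A low-cardinality text column can be stored as a category.
toy["Airline"] = toy["Airline"].astype("category")

print(toy.dtypes)
```

Converting early means that later steps such as grouping by month or sorting by date operate on real dates rather than on strings.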
2.1) Handling Missing Data
flights.isna().sum().plot(kind='bar')
plt.show()
Let's focus our attention on the columns whose null values make up more than 5% of the total entries in the dataset. Because there is no additional data for filling in the missing entries, the columns with fewer than 5% nulls can be cleaned by dropping the affected rows with the .dropna() method, on the assumption that these values are missing completely at random (MCAR).
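The 5% rule described above can be sketched as follows. This is a minimal illustration on a made-up frame, not the notebook's actual cell; the names `toy`, `threshold`, and `cols_to_drop_rows` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the flights data (values are made up).
toy = pd.DataFrame({
    "Airline": ["A"] * 39 + [np.nan],        # 2.5% missing
    "Price": [100.0] * 30 + [np.nan] * 10,   # 25% missing
})

threshold = 0.05  # the 5% cut-off used in the text

# Share of missing entries per column.
missing_share = toy.isna().mean()

# Columns that have some, but less than 5%, missing values.
below = (missing_share > 0) & (missing_share < threshold)
cols_to_drop_rows = missing_share[below].index.tolist()

# Drop only the rows that are null in those columns; columns above the
# threshold are left alone so they can be handled separately.
cleaned = toy.dropna(subset=cols_to_drop_rows)

print(missing_share)
print(cleaned.shape)
```

Restricting `dropna` with `subset` keeps the rows whose only gaps are in the heavily missing columns, which would otherwise be lost before any fill-in strategy is applied.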
2.1.1) Fill In Value - 'Unknown'