Predicting Hotel Cancellations
🏨 Introduction
Online hotel reservation channels have dramatically changed booking possibilities and customers’ behavior. A significant number of hotel reservations are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels.
This analysis aims to identify which factors contribute to cancellations and whether cancellations can be predicted. We will also try to give some recommendations to help hotels decrease their cancellation rates.
Let's begin by importing the pandas library and loading the dataset into a data frame.
import pandas as pd
hotels_raw_df = pd.read_csv("data/hotel_bookings.csv")
hotels_raw_df
Data Preparation & Cleaning
Data preparation is the process of preparing the data by cleaning and transforming raw data before processing and analysis. It is an important step before processing and often involves reformatting data, making corrections to data, and combining data sets to enrich data. Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or null values within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. We’ll use the name hotels_raw_df for the data frame to indicate this is unprocessed data that we might clean, filter, and modify to prepare a data frame ready for analysis.
hotels_raw_df.info()
The dataset contains 36275 rows and 19 columns. Most columns have the data type float. It appears that a few columns contain some empty values, since the Non-Null count for those columns is lower than the total number of rows (36275). We’ll need to deal with null values and manually adjust the data type for each column on a case-by-case basis. But first, let’s check the total number and the percentage of null values in each column of our data frame.
# total number of null values in each column
hotels_raw_df.isna().sum()
# Percentage of null values by each column
round((hotels_raw_df.isna().sum().sort_values(ascending = False) * 100) / len(hotels_raw_df), 2)
The percentage of null values in the dataset is very small, so in this analysis I am going to drop the rows containing null values and store the result in a new data frame, hotel_new.
hotel_new=hotels_raw_df.dropna().reset_index(drop=True)
hotel_new.shape
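Because dropna() removes a row if any of its columns is missing, small per-column null percentages can still add up to a larger share of rows. The following is a minimal sketch on a toy data frame that compares the two figures; the column names and values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; each column has exactly one missing value.
toy_df = pd.DataFrame({
    "lead_time": [10.0, np.nan, 35.0, 7.0, 120.0],
    "avg_price": [80.0, 95.5, np.nan, 60.0, 110.0],
    "n_nights": [2.0, 3.0, 1.0, np.nan, 4.0],
})

# Share of missing values per column, in percent.
null_pct_by_column = (toy_df.isna().mean() * 100).round(2)

# Share of rows that dropna() would remove (rows with at least one null).
rows_lost_pct = round(toy_df.isna().any(axis=1).mean() * 100, 2)

print(null_pct_by_column)
print(rows_lost_pct)
```

Running the same two expressions on hotels_raw_df before calling dropna() would show how many rows the drop is going to cost.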
After dropping null values, we have 27511 rows and 19 columns, down from 36275 rows (about 24% of the rows were removed). Let's check for duplicate values.
# number of repeated Booking_ID values (0 means no duplicates)
hotel_new.Booking_ID.duplicated().sum()
We can see that no duplicates are present in the dataset.
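For reference, here is a small sketch of how a duplicate check on an identifier column behaves when repeats do exist. Booking_ID is the column used above; the toy values and the guests column are invented for illustration only.

```python
import pandas as pd

# Toy frame; the Booking_ID values and the "guests" column are made up.
toy_bookings = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00002", "INN00003"],
    "guests": [2, 1, 1, 3],
})

# duplicated() flags every repeat of an ID after its first occurrence.
n_duplicate_ids = toy_bookings["Booking_ID"].duplicated().sum()

# Keep only the first row for each Booking_ID.
deduped = toy_bookings.drop_duplicates(subset="Booking_ID").reset_index(drop=True)

print(n_duplicate_ids)
print(len(deduped))
```

Note that duplicated must be called with parentheses; without them, pandas returns the method object instead of a result.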
Now let's check for rows with no adults, since a reservation cannot be made without an adult. We will find those rows and delete them.
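One way to do this is with a boolean mask, sketched below on a toy data frame. The column name no_of_adults is an assumption about how the dataset labels the adult count, and the values are illustrative; adjust the name to match the actual column.

```python
import pandas as pd

# Toy frame; "no_of_adults" is an assumed column name, values are illustrative.
toy_bookings = pd.DataFrame({
    "Booking_ID": ["A1", "A2", "A3", "A4"],
    "no_of_adults": [2, 0, 1, 0],
})

# Rows where no adult is listed on the reservation.
no_adult_mask = toy_bookings["no_of_adults"] == 0
n_no_adults = int(no_adult_mask.sum())

# Keep only reservations with at least one adult.
with_adults = toy_bookings[~no_adult_mask].reset_index(drop=True)

print(n_no_adults)
print(with_adults)
```

Counting the masked rows first makes it easy to confirm how many reservations are about to be removed.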