Competition - predict hotel cancellation

Predicting Hotel Cancellations

🏨 Background

You are supporting a hotel with a project aimed at increasing revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

Column	Description
`Booking_ID`	Unique identifier of the booking.
`no_of_adults`	The number of adults.
`no_of_children`	The number of children.
`no_of_weekend_nights`	Number of weekend nights (Saturday or Sunday).
`no_of_week_nights`	Number of week nights (Monday to Friday).
`type_of_meal_plan`	Type of meal plan included in the booking.
`required_car_parking_space`	Whether a car parking space is required.
`room_type_reserved`	The type of room reserved.
`lead_time`	Number of days before the arrival date the booking was made.
`arrival_year`	Year of arrival.
`arrival_month`	Month of arrival.
`arrival_date`	Date of the month for arrival.
`market_segment_type`	How the booking was made.
`repeated_guest`	Whether the guest has previously stayed at the hotel.
`no_of_previous_cancellations`	Number of previous cancellations.
`no_of_previous_bookings_not_canceled`	Number of previous bookings that were not canceled.
`avg_price_per_room`	Average price per day of the booking.
`no_of_special_requests`	Count of special requests made as part of the booking.
`booking_status`	Whether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

#import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#load in the dataset
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels.head()

Data Assessment

Here, the data is assessed to get acquainted with it and to identify any quality and tidiness issues, so that they are easy to fix during cleaning. The findings of this assessment are documented for reference.

#check the shape of the dataset
hotels.shape

The dataset contains 36275 observations and 19 variables.

#check the information about the dataset
hotels.info()

#check the description of the dataset
hotels.describe().transpose()

#check for duplicate rows
hotels.duplicated().sum()

#check for missing values
hotels.isna().sum()
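
The raw counts from `isna().sum()` are easier to judge next to the share of rows they represent. A minimal sketch, assuming a hypothetical helper `missing_summary` and a small made-up frame standing in for `hotels` (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `hotels`; the values are made up
toy = pd.DataFrame({
    "Booking_ID": ["B1", "B2", "B3", "B4"],
    "lead_time": [10.0, np.nan, 45.0, np.nan],
    "avg_price_per_room": [65.0, 80.5, np.nan, 99.0],
})

summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be `missing_summary(hotels)`.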

Assessment Documentation

Many columns have overly long names
Many columns contain null values
Many columns have float datatypes
The arrival date is split across three columns (arrival_year, arrival_month, arrival_date), each stored as float
The repeated_guest column is encoded as zeros and ones
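
One way to follow up on the float datatypes noted above is to check which float columns hold only whole numbers. These are often counts that became float only because of the nulls. The sketch below is an assumption about how that check might look, not the notebook's own method. The helper `whole_number_float_columns` is hypothetical and the toy values are made up.

```python
import numpy as np
import pandas as pd


def whole_number_float_columns(frame: pd.DataFrame) -> list:
    """List float columns whose non-missing values are all whole numbers."""
    result = []
    for col in frame.select_dtypes(include="float").columns:
        values = frame[col].dropna()
        if len(values) > 0 and (values % 1 == 0).all():
            result.append(col)
    return result


# Toy frame standing in for `hotels`; the values are made up
toy = pd.DataFrame({
    "no_of_adults": [2.0, 1.0, np.nan],
    "arrival_month": [10.0, np.nan, 2.0],
    "avg_price_per_room": [65.0, 80.5, 99.9],
})

int_like = whole_number_float_columns(toy)

# The nullable "Int64" dtype keeps missing entries as <NA> instead of forcing float
converted = toy.astype({col: "Int64" for col in int_like})
print(int_like)
print(converted.dtypes)
```

Whether to convert before or after handling the nulls is a design choice. The nullable integer dtype allows the conversion first without losing the missing markers.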

Data Cleaning

Here, a copy of the dataset is created first. Cleaning then fixes the issues identified during the assessment. Each issue is handled separately using the Define, Code and Test steps.

#make a copy of the dataset
df = hotels.copy()
#check 5 samples
df.sample(5)

Many columns have overly long names.
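
A minimal sketch of the Define, Code and Test steps for this issue, run on a toy frame rather than `df`. The shorter names in `RENAME_MAP` are hypothetical; the notebook does not state which names it settles on.

```python
import pandas as pd

# Define: shorten verbose column names.
# Hypothetical mapping, chosen only for illustration.
RENAME_MAP = {
    "no_of_adults": "adults",
    "no_of_children": "children",
    "no_of_weekend_nights": "weekend_nights",
    "no_of_week_nights": "week_nights",
    "no_of_previous_bookings_not_canceled": "prev_not_canceled",
}

# Toy frame standing in for `df`; the values are made up
toy = pd.DataFrame({
    "Booking_ID": ["B1", "B2"],
    "no_of_adults": [2, 1],
    "no_of_children": [0, 2],
    "no_of_weekend_nights": [1, 0],
    "no_of_week_nights": [2, 3],
    "no_of_previous_bookings_not_canceled": [0, 4],
    "lead_time": [224, 5],
})

# Code: rename returns a new frame and leaves unmapped columns untouched
renamed = toy.rename(columns=RENAME_MAP)

# Test: none of the old names remain and no column was lost
leftover = set(RENAME_MAP) & set(renamed.columns)
assert not leftover
assert renamed.shape == toy.shape
print(list(renamed.columns))
```

Keeping the mapping in one dictionary makes the rename easy to review and to reverse if a later step expects the original names.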
