Predicting Hotel Cancellations
🏨 Background
You are supporting a hotel with a project aimed at increasing revenue from their room bookings. They believe that they can use data science to help them reduce the number of cancellations. This is where you come in!
They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.
The Data
They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:
| Column | Description |
|---|---|
| Booking_ID | Unique identifier of the booking. |
| no_of_adults | The number of adults. |
| no_of_children | The number of children. |
| no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
| no_of_week_nights | Number of week nights (Monday to Friday). |
| type_of_meal_plan | Type of meal plan included in the booking. |
| required_car_parking_space | Whether a car parking space is required. |
| room_type_reserved | The type of room reserved. |
| lead_time | Number of days before the arrival date the booking was made. |
| arrival_year | Year of arrival. |
| arrival_month | Month of arrival. |
| arrival_date | Date of the month for arrival. |
| market_segment_type | How the booking was made. |
| repeated_guest | Whether the guest has previously stayed at the hotel. |
| no_of_previous_cancellations | Number of previous cancellations. |
| no_of_previous_bookings_not_canceled | Number of previous bookings that were not canceled. |
| avg_price_per_room | Average price per day of the booking. |
| no_of_special_requests | Count of special requests made as part of the booking. |
| booking_status | Whether the booking was cancelled or not. |
Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels
Objectives
Hi All! With this hotel reservation data, I'll try to determine whether there are any patterns or plausible causes that help explain cancellations. Here are some potential causes and relationships that I'll be investigating in this project:
- Does the time between booking and arrival date influence the likelihood of cancellation?
- When looking at arrival dates for these reservations, does the time of year have an effect on cancellations?
- Is there any correlation between party size and cancellation rates? (e.g. individuals vs couples vs families vs groups of friends)
- Are reservations for weeknight stays or weekend stays more likely to be canceled?
- Does the average price of the room affect cancellations?
- Are first-time bookers (non-repeat guests) more or less likely to cancel?
- How many repeat guests have a high frequency of canceling?
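As a rough sketch of how the first question could be approached, the cell below groups bookings into lead-time buckets and computes the share canceled in each. It runs on a tiny made-up DataFrame rather than the real file, and it assumes that canceled bookings are labelled `"Canceled"` in `booking_status`; the bucket edges and the `is_canceled` and `lead_time_bucket` names are my own placeholders.

```python
import pandas as pd

# Toy stand-in for hotel_bookings.csv; these values are made up for illustration only.
toy = pd.DataFrame({
    "lead_time": [5, 12, 45, 80, 150, 200, 10, 95],
    "booking_status": [
        "Not_Canceled", "Not_Canceled", "Canceled", "Not_Canceled",
        "Canceled", "Canceled", "Not_Canceled", "Canceled",
    ],
})

# Assumption: canceled bookings are labelled "Canceled" in booking_status.
toy["is_canceled"] = (toy["booking_status"] == "Canceled").astype(int)

# Hypothetical bucket edges in days; adjust to the real lead_time distribution.
bins = [0, 30, 90, 365]
labels = ["0-30", "31-90", "91-365"]
toy["lead_time_bucket"] = pd.cut(
    toy["lead_time"], bins=bins, labels=labels, include_lowest=True
)

# The mean of a 0/1 flag is the cancellation rate within each bucket.
rate_by_lead_time = toy.groupby("lead_time_bucket", observed=True)["is_canceled"].mean()
print(rate_by_lead_time)
```

The same group-and-average pattern should carry over to the other questions by swapping `lead_time_bucket` for, say, `arrival_month` or `repeated_guest`.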
Cleaning Data
Above, the data was loaded into a pandas DataFrame and displayed to show the columns we'll be analyzing. Next, I'll clean the data by filling in missing values and creating separate variables for key data.
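Before filling anything, it can help to see which columns actually have gaps and how large they are. This is a minimal sketch on a made-up DataFrame (the meal plan labels and counts are invented for illustration); on the real data the same calls would be applied to `hotels`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings data; these values are made up for illustration only.
toy = pd.DataFrame({
    "type_of_meal_plan": ["Meal Plan 1", None, "Meal Plan 2", None],
    "no_of_adults": [2, np.nan, 1, 2],
    "no_of_children": [0, 0, np.nan, 1],
})

# Count of missing values per column, and the same as a share of all rows.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()

# Show only the columns that have at least one missing value.
summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary[summary["missing"] > 0])
```

Knowing the share of missing values per column makes it easier to judge whether a fixed fill value is reasonable or whether it would distort the analysis.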
# Treat a missing meal plan as 'Not Selected'
hotels['type_of_meal_plan'] = hotels['type_of_meal_plan'].fillna('Not Selected')
# Fill missing arrival fields with fixed placeholder values (1 January 2017)
hotels['arrival_year'] = hotels['arrival_year'].fillna(2017)
hotels['arrival_month'] = hotels['arrival_month'].fillna(1)
hotels['arrival_date'] = hotels['arrival_date'].fillna(1)
# Assume one adult and no children when party size is missing
hotels['no_of_adults'] = hotels['no_of_adults'].fillna(1)
hotels['no_of_children'] = hotels['no_of_children'].fillna(0)
hotels
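For the "separate variables" mentioned above, one possible approach is to derive a few helper columns from the cleaned fields. This is only a sketch on made-up rows; `total_guests`, `total_nights`, and `has_weekend_night` are hypothetical names of my own, not columns in the provided file.

```python
import pandas as pd

# Toy stand-in for the cleaned bookings data; these values are made up for illustration only.
toy = pd.DataFrame({
    "no_of_adults": [2, 1, 2],
    "no_of_children": [0, 0, 2],
    "no_of_weekend_nights": [1, 0, 2],
    "no_of_week_nights": [2, 3, 5],
})

# Party size: adults plus children (hypothetical helper column).
toy["total_guests"] = toy["no_of_adults"] + toy["no_of_children"]

# Length of stay: weekend nights plus week nights (hypothetical helper column).
toy["total_nights"] = toy["no_of_weekend_nights"] + toy["no_of_week_nights"]

# Flag stays that include at least one weekend night (hypothetical helper column).
toy["has_weekend_night"] = toy["no_of_weekend_nights"] > 0

print(toy)
```

Columns like these map directly onto the party-size and weekday-versus-weekend questions listed in the objectives.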