Competition - predict hotel cancellation

Predicting Hotel Cancellations

🏨 Background

You are supporting a hotel with a project aimed at increasing revenue from its room bookings. The hotel believes it can use data science to reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

Column	Description
`Booking_ID`	Unique identifier of the booking.
`no_of_adults`	The number of adults.
`no_of_children`	The number of children.
`no_of_weekend_nights`	Number of weekend nights (Saturday or Sunday).
`no_of_week_nights`	Number of week nights (Monday to Friday).
`type_of_meal_plan`	Type of meal plan included in the booking.
`required_car_parking_space`	Whether a car parking space is required.
`room_type_reserved`	The type of room reserved.
`lead_time`	Number of days before the arrival date the booking was made.
`arrival_year`	Year of arrival.
`arrival_month`	Month of arrival.
`arrival_date`	Date of the month for arrival.
`market_segment_type`	How the booking was made.
`repeated_guest`	Whether the guest has previously stayed at the hotel.
`no_of_previous_cancellations`	Number of previous cancellations.
`no_of_previous_bookings_not_canceled`	Number of previous bookings that were not canceled.
`avg_price_per_room`	Average price per day of the booking.
`no_of_special_requests`	Count of special requests made as part of the booking.
`booking_status`	Whether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

import pandas as pd
import matplotlib.pyplot as plt
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels

Exploring Data

hotels.info()

# Group columns by data type
int_cols = hotels.select_dtypes(include=['int64'])
float_cols = hotels.select_dtypes(include=['float64'])
categorical_cols = hotels.select_dtypes(include=['object'])

# Print column names in each group
print("Integer columns:", list(int_cols.columns))
print("Float columns:", list(float_cols.columns))
print("Categorical columns:", list(categorical_cols.columns))

# Describe the hotels DataFrame
description = hotels.describe()

# Print the description
print(description)

# Describe all columns in the hotels DataFrame
description = hotels.describe(include='all')

# Print the description
print(description)

# Get the total number of null values in each column
null_counts = hotels.isnull().sum()
total_rows = hotels.shape[0]
# Calculate the percentage of null values in each column
null_percentages = (null_counts / total_rows) * 100

# Print the results
print(null_percentages.sort_values(ascending=False))

Validate Columns

hotels.head()

Col 1 Booking_ID:

Unique identifier for each booking. It carries no predictive information, so it can be dropped.
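Because `Booking_ID` is unique for every row, it cannot help separate cancelled from fulfilled bookings. A minimal sketch of removing it is shown here on a small made-up frame standing in for `hotels`; the toy values and the `features` name are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the hotels DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003"],
    "no_of_adults": [2, 1, 2],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled"],
})

# Confirm the identifier really is unique per row before discarding it
assert toy["Booking_ID"].is_unique

# drop() returns a new frame, so the original stays untouched
features = toy.drop(columns=["Booking_ID"])
print(list(features.columns))
```

Keeping the original frame intact makes it possible to join predictions back to a booking later if that turns out to be needed.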

Col 2 no_of_adults:

The most common value is 2, which suggests that most bookings are made by couples.
Null values are filled with the mode of the column, which is 2.

# Count null values
hotels.no_of_adults.isnull().sum()
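The mode-based fill described for `no_of_adults` could look roughly like the sketch below. It runs on a small made-up column rather than the real data, and the `adults_mode` name is an assumption introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy column with missing entries (values are made up for illustration)
toy = pd.DataFrame({"no_of_adults": [2.0, np.nan, 1.0, 2.0, np.nan, 3.0]})

# mode() ignores NaN by default and returns a Series (ties are possible),
# so take the first entry
adults_mode = toy["no_of_adults"].mode()[0]

# Assign the result back instead of relying on inplace=True
toy["no_of_adults"] = toy["no_of_adults"].fillna(adults_mode)

print(adults_mode)
print(toy["no_of_adults"].isnull().sum())
```

On the real data the same two lines would be applied to `hotels["no_of_adults"]`. Computing the mode from the data, rather than hard-coding 2, keeps the step valid if the file changes.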

# Loop through all columns except 'Booking_ID'
for col in hotels.columns[1:]:
    print(f"Column: {col}")
    print(f"Count of values: {hotels[col].count()}")
    print(f"Number of null values: {hotels[col].isnull().sum()}")
    print(f"Percentage of null values: {hotels[col].isnull().sum()/len(hotels)*100:.2f}%")
    print(f"Unique values: {hotels[col].unique()}")
    print(f"Value counts: {hotels[col].value_counts()}")
    print(f"Data type: {hotels[col].dtype}")
    hotels[col].value_counts().plot(kind='bar')
    plt.show()
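A bar chart of `value_counts()` draws one bar per distinct value, which can become hard to read for columns such as `lead_time` or `avg_price_per_room`. One possible refinement, sketched under assumptions, is to pick the plot type per column. The helper `choose_plot_kind` and the threshold of 20 distinct values are hypothetical choices, not part of the original notebook.

```python
import pandas as pd

def choose_plot_kind(series: pd.Series, max_categories: int = 20) -> str:
    """Return 'hist' for numeric columns with many distinct values, else 'bar'."""
    is_numeric = pd.api.types.is_numeric_dtype(series)
    if is_numeric and series.nunique() > max_categories:
        return "hist"
    return "bar"

# Toy stand-in for two hotels columns (values are made up for illustration)
toy = pd.DataFrame({
    "lead_time": range(100),
    "type_of_meal_plan": ["Meal Plan 1", "Not Selected"] * 50,
})

kinds = {col: choose_plot_kind(toy[col]) for col in toy.columns}
print(kinds)
```

Inside the loop, a column mapped to `"hist"` could be drawn with `hotels[col].plot(kind="hist")`, while the others keep `hotels[col].value_counts().plot(kind="bar")`.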
