Hotel Booking Demand

This dataset consists of booking data from a city hotel and a resort hotel. It includes many details about the bookings, including room specifications, the length of stay, the time between the booking and the stay, whether the booking was canceled, and how the booking was made. The data was gathered between July 2015 and August 2017.

Not sure where to begin? Scroll to the bottom to find challenges!

import pandas as pd
bookings = pd.read_csv("hotel_bookings_clean.csv")
print(bookings.shape)
bookings.head(100)

Data Dictionary

Note: For binary variables: 1 = true and 0 = false.

Column	Explanation
is_canceled	Binary variable indicating whether a booking was canceled
lead_time	Number of days between booking date and arrival date
arrival_date_week_number, arrival_date_day_of_month, arrival_date_month	Week number, day date, and month number of arrival date
stays_in_weekend_nights, stays_in_week_nights	Number of weekend nights (Saturday and Sunday) and weeknights (Monday to Friday) the customer booked
adults, children, babies	Number of adults, children, babies booked for the stay
is_repeated_guest	Binary variable indicating whether the customer was a repeat guest
previous_cancellations	Number of prior bookings that were canceled by the customer
previous_bookings_not_canceled	Number of prior bookings that were not canceled by the customer
required_car_parking_spaces	Number of parking spaces requested by the customer
total_of_special_requests	Number of special requests made by the customer
avg_daily_rate	Average daily rate, as defined by dividing the sum of all lodging transactions by the total number of staying nights
booked_by_company	Binary variable indicating whether a company booked the booking
booked_by_agent	Binary variable indicating whether an agent booked the booking
hotel_City	Binary variable indicating whether the booked hotel is a "City Hotel"
hotel_Resort	Binary variable indicating whether the booked hotel is a "Resort Hotel"
meal_BB	Binary variable indicating whether a bed & breakfast meal was booked
meal_HB	Binary variable indicating whether a half board meal was booked
meal_FB	Binary variable indicating whether a full board meal was booked
meal_No_meal	Binary variable indicating whether there was no meal package booked
market_segment_Aviation, market_segment_Complementary, market_segment_Corporate, market_segment_Direct, market_segment_Groups, market_segment_Offline_TA_TO, market_segment_Online_TA, market_segment_Undefined	Indicates market segment designation with a value of 1. "TA"= travel agent, "TO"= tour operators
distribution_channel_Corporate, distribution_channel_Direct, distribution_channel_GDS, distribution_channel_TA_TO, distribution_channel_Undefined	Indicates booking distribution channel with a value of 1. "TA"= travel agent, "TO"= tour operators, "GDS" = Global Distribution System
reserved_room_type_A, reserved_room_type_B, reserved_room_type_C, reserved_room_type_D, reserved_room_type_E, reserved_room_type_F, reserved_room_type_G, reserved_room_type_H, reserved_room_type_L	Indicates code of room type reserved with a value of 1. Code is presented instead of designation for anonymity reasons
deposit_type_No_Deposit	Binary variable indicating whether a deposit was made
deposit_type_Non_Refund	Binary variable indicating whether a deposit was made in the value of the total stay cost
deposit_type_Refundable	Binary variable indicating whether a deposit was made with a value under the total stay cost
customer_type_Contract	Binary variable indicating whether the booking has an allotment or other type of contract associated to it
customer_type_Group	Binary variable indicating whether the booking is associated to a group
customer_type_Transient	Binary variable indicating whether the booking is not part of a group or contract, and is not associated to other transient booking
customer_type_Transient-Party	Binary variable indicating whether the booking is transient, but is associated to at least another transient booking

Source and license of data.

Citation: The data is originally from an article called Hotel booking demand datasets by Nuno Antonio, Ana de Almeida, and Luis Nunes. It was cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020.

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

🗺️ Explore: Which family sizes are associated with the highest cancellation rate?
📊 Visualize: Create a plot that visualizes the cancellation rates of different times of the year.
🔎 Analyze: Are bookings with longer lead times more likely to result in a cancellation?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

A chain of hotels has just hired you as a data analyst. They have noticed that the cancellation rate has risen in the past few years. This often leads to rooms that are left unrented for multiple days at a time.

Hotel management is interested in developing a model to predict the likelihood that a customer will cancel their reservation. If successful, this could be used to optimize their booking service and anticipate when cancellations will occur.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.

list(bookings.columns)

bookings['Family_size']= bookings['adults']+ bookings['children']+bookings['babies']
cancellation_rate = bookings.groupby('Family_size')['is_canceled'].mean().sort_values(ascending= False)
print(cancellation_rate)

import matplotlib.pyplot as plt
import seaborn as sns
month_order = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
]
bookings['arrival_date_month']= pd.Categorical(bookings['arrival_date_month'], categories= month_order, ordered = True)


# Calculate cancellation rates by month
Monthly_cancellation = bookings.groupby('arrival_date_month')['is_canceled'].mean().reset_index()

Monthly_cancellation.head()

# First, verify your data
print(Monthly_cancellation.head())
print("\nDataFrame shape:", Monthly_cancellation.shape)
print("Null values in 'is_canceled':", Monthly_cancellation['is_canceled'].isna().sum())

# Robust version of the plotting code
plt.figure(figsize=(12, 6))

# Check if we have valid data to plot
if len(Monthly_cancellation) > 0 and not Monthly_cancellation['is_canceled'].isna().all():
    sns.lineplot(
        data=Monthly_cancellation,
        x='arrival_date_month',
        y='is_canceled',
        marker='o',
        color='royalblue',
        linewidth=2.5
    )
    
    plt.title('Monthly Hotel Booking Cancellation Rates', fontsize=16, pad=20)
    plt.xlabel('Month', fontsize=14)
    plt.ylabel('Cancellation Rate', fontsize=14)
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.4)
    plt.gca().yaxis.set_major_formatter(plt.matplotlib.ticker.PercentFormatter(1.0))
    
    # Only add peak line if we have valid data
    if not Monthly_cancellation['is_canceled'].isna().any():
        peak_month = Monthly_cancellation.loc[Monthly_cancellation['is_canceled'].idxmax()]
        plt.axvline(
            x=peak_month['arrival_date_month'],
            color='crimson',
            linestyle='--',
            linewidth=1.5,
            label=f'Peak: {peak_month["arrival_date_month"]} ({peak_month["is_canceled"]:.1%})'
        )
        
        # Add value labels
        for index, row in Monthly_cancellation.iterrows():
            plt.text(
                x=row['arrival_date_month'],
                y=row['is_canceled'] + 0.01,
                s=f'{row["is_canceled"]:.1%}',
                ha='center',
                fontsize=10
            )
        
        plt.legend()
    
    plt.tight_layout()
    plt.show()
else:
    print("No valid cancellation data available to plot.")

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure data quality
bookings = bookings.dropna(subset=['arrival_date_day_of_month', 'is_canceled'])

# Calculate cancellation rates by day
daily_cancel = (bookings.groupby('arrival_date_day_of_month')['is_canceled']
                .mean()
                .reset_index())

# Create the plot
plt.figure(figsize=(14, 6))

# Line plot with markers
ax = sns.lineplot(
    data=daily_cancel,
    x='arrival_date_day_of_month',
    y='is_canceled',
    marker='o',
    color='teal',
    linewidth=2.5
)

# Customize the plot
plt.title('Daily Cancellation Rates (Day of Month)', fontsize=16, pad=20)
plt.xlabel('Day of Month', fontsize=14)
plt.ylabel('Cancellation Rate', fontsize=14)
plt.xticks(range(1, 32))  # Show all days 1-31
plt.grid(axis='y', linestyle='--', alpha=0.4)

# Format as percentages
ax.yaxis.set_major_formatter(plt.matplotlib.ticker.PercentFormatter(1.0))

# Highlight peak day if data exists
if not daily_cancel.empty:
    peak_day = daily_cancel.loc[daily_cancel['is_canceled'].idxmax()]
    plt.axvline(
        x=peak_day['arrival_date_day_of_month'],
        color='coral',
        linestyle='--',
        label=f'Peak: Day {peak_day["arrival_date_day_of_month"]} ({peak_day["is_canceled"]:.1%})'
    )
    
    # Add data labels
    for _, row in daily_cancel.iterrows():
        ax.text(
            x=row['arrival_date_day_of_month'],
            y=row['is_canceled'] + 0.01,
            s=f'{row["is_canceled"]:.1%}',
            ha='center',
            fontsize=9
        )

plt.legend()
plt.tight_layout()
plt.show()

correlation = bookings['lead_time'].corr(bookings['is_canceled'])
print(f"Correlation between lead time and cancellations: {correlation:.2f}")


plt.figure(figsize=(10, 6))

# Convert is_canceled to strings for the palette mapping
sns.boxplot(
    data=bookings,
    x='is_canceled',
    y='lead_time',
    palette={'0': 'green', '1': 'red'}  # Keys as strings!
)

plt.title('Lead Time Distribution: Canceled vs. Not Canceled')
plt.xlabel('Canceled (1 = Yes, 0 = No)')
plt.ylabel('Lead Time (Days)')
plt.show()

# 2. Cancellation rate by lead time bins
bookings['lead_time_bin'] = pd.cut(
    bookings['lead_time'],
    bins=[0, 30, 60, 90, 180, 365, float('inf')],
    labels=['0-30', '30-60', '60-90', '90-180', '180-365', '365+']
)

bin_rates = bookings.groupby('lead_time_bin')['is_canceled'].mean().reset_index()

plt.figure(figsize=(12, 6))
sns.barplot(
    data=bin_rates,
    x='lead_time_bin',
    y='is_canceled',
    palette='viridis'
)
plt.title('Cancellation Rate by Lead Time')
plt.xlabel('Lead Time (Days)')
plt.ylabel('Cancellation Rate')
plt.gca().yaxis.set_major_formatter(plt.matplotlib.ticker.PercentFormatter(1.0))
plt.show()

Hotel Booking Demand

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Hotel Booking Demand

Data Dictionary

Don't know where to start?

Hotel Booking Demand