    Machine Learning with Python

    In this notebook, you will build a machine learning model to predict whether or not a customer cancelled a hotel booking. You will be introduced to the scikit-learn framework to do machine learning in Python.

    We will use a dataset on hotel bookings from the article "Hotel booking demand datasets", published in the Elsevier journal Data in Brief. The abstract of the article states:

    This data article describes two datasets with hotel demand data. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.

    For convenience, the two datasets have been combined into a single CSV file, data/hotel_bookings.csv. Let us start by importing all the functions needed to import, visualize and model the data.

    # Data imports
    import pandas as pd
    import numpy as np
    # Visualization imports
    import matplotlib.pyplot as plt
    import plotly.express as px
    plt.rcParams['figure.figsize'] = [8, 4]
    # ML Imports
    from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    0. Get the data

    The first step in any machine learning workflow is to get the data and explore it.

    hotel_bookings = pd.read_csv('data/hotel_bookings.csv')

    Let us look at the number of bookings by month.

    bookings_by_month = hotel_bookings.groupby('arrival_date_month', as_index=False)[['hotel']].count().rename(columns={"hotel": "nb_bookings"})
    months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] 
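    # The plotting call is missing from the source; a bar chart of the monthly counts is assumed here.
    fig = px.bar(
        bookings_by_month,
        x='arrival_date_month',
        y='nb_bookings',
        title='Hotel Bookings by Month',
        category_orders={"arrival_date_month": months}
    )
    fig.show(config={"displayModeBar": False})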
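
To make the aggregation step concrete, here is a minimal sketch of the same groupby pattern on a tiny hand-made table. The `toy_bookings` frame and its four rows are invented for illustration and are not taken from the hotel dataset.

```python
import pandas as pd

# Hypothetical four-row table standing in for the real bookings data.
toy_bookings = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "arrival_date_month": ["July", "August", "July", "July"],
})

# Count the non-null 'hotel' entries per month, then give the count column a clearer name.
counts = (
    toy_bookings.groupby("arrival_date_month", as_index=False)[["hotel"]]
    .count()
    .rename(columns={"hotel": "nb_bookings"})
)
print(counts)
```

Note that `groupby` sorts the months alphabetically, which is why the plot needs an explicit month order.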

    Our objective is to build a model that predicts whether or not a customer cancelled a hotel booking.

    1. Split the data into training and test sets.

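    Let us start by defining a split to divide the data into training and test sets. The basic idea is to train the model on one portion of the data and test its performance on another portion that the model has not seen. Evaluating on held-out data does not prevent overfitting by itself, but it exposes it: a model that has merely memorized the training data will score poorly on the test portion. We will use four-fold cross-validation with shuffling, so each booking is used for testing exactly once and for training in the other three folds.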

    split = KFold(n_splits=4, shuffle=True, random_state=1234)
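
To see what such a split produces, the sketch below applies the same `KFold` settings to eight toy samples and prints the row indices assigned to each fold. The names `X_demo` and `split_demo` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Eight toy samples stand in for the bookings (illustrative data only).
X_demo = np.arange(8).reshape(-1, 1)

split_demo = KFold(n_splits=4, shuffle=True, random_state=1234)

test_indices_seen = []
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(split_demo.split(X_demo), start=1):
    # Each fold holds out a different quarter of the rows for testing.
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_indices_seen.extend(test_idx.tolist())
    fold_sizes.append((len(train_idx), len(test_idx)))
```

Across the four folds, every row index appears in a test set exactly once.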

    2. Choose a class of models, and hyperparameters.

    The next step is to choose a class of models and specify hyperparameters. For now we keep the default hyperparameters; we will see later how to specify a range of values and tune the model for optimal performance. We will pick the simple yet very effective Decision Tree and Random Forest models, and use scikit-learn to fit them and evaluate their performance.

    from IPython.display import Image
    Image("", width=750)
    models = [
        ("Decision Tree", DecisionTreeClassifier(random_state=1234)),
        ("Random Forest", RandomForestClassifier(random_state=1234, n_jobs=-1)),
    ]
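
As a small illustration of what "specifying hyperparameters" means in scikit-learn, the sketch below contrasts a tree with default settings against one with an explicit depth and leaf-size limit. The values `max_depth=5` and `min_samples_leaf=20` are arbitrary choices for the example, not recommendations for this dataset.

```python
from sklearn.tree import DecisionTreeClassifier

# Default hyperparameters: no depth limit, so the tree can keep splitting.
default_tree = DecisionTreeClassifier(random_state=1234)

# Hypothetical alternative: cap the depth and leaf size to restrain overfitting.
shallow_tree = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, random_state=1234)

# get_params() returns the hyperparameters an estimator was constructed with.
print(default_tree.get_params()["max_depth"])
print(shallow_tree.get_params()["max_depth"])
```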

    3. Preprocess the data

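    The next step is to set up a pipeline to preprocess the features. We will impute missing values with a constant (0 for numerical features, which is the default fill value of SimpleImputer for numeric data, and "Unknown" for categorical features) and then one-hot encode all categorical features.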

    # Preprocess numerical features:
    features_num = [
        "lead_time", "arrival_date_week_number", "arrival_date_day_of_month", "stays_in_weekend_nights",
        "stays_in_week_nights", "adults", "children", "babies", "is_repeated_guest",
        "previous_cancellations", "previous_bookings_not_canceled", "agent", "company",
        "required_car_parking_spaces", "total_of_special_requests", "adr"
    ]
    transformer_num = SimpleImputer(strategy="constant")
    # Preprocess categorical features:
    features_cat = [
        "hotel", "arrival_date_month", "meal", "market_segment", "distribution_channel",
        "reserved_room_type", "deposit_type", "customer_type"
    ]
    transformer_cat = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown='ignore'))
    ])
    # Create a preprocessing pipeline
    preprocessor = ColumnTransformer(transformers=[
        ("num", transformer_num, features_num),
        ("cat", transformer_cat, features_cat)
    ])
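
To check what this kind of preprocessing does to the data, the sketch below runs the same transformer layout on a three-row toy table with one numerical and one categorical column. The frame `toy` and the name `toy_preprocessor` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "lead_time": [10.0, np.nan, 30.0],
    "meal": ["BB", np.nan, "HB"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Missing numbers are replaced by the default constant, 0.
    ("num", SimpleImputer(strategy="constant"), ["lead_time"]),
    # Missing categories become "Unknown", then every category gets its own 0/1 column.
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["meal"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; convert for printing.
dense = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)
print(dense)
```

The output has four columns: the imputed `lead_time`, followed by one indicator column each for `BB`, `HB` and `Unknown`.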

    4. Fit the models and evaluate performance

    Finally, we will fit the Decision Tree and Random Forest models on the training data and use 4-fold cross-validation to evaluate their performance.

    features = features_num + features_cat
    X = hotel_bookings[features]
    y = hotel_bookings["is_canceled"]
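
One way the evaluation step could look is sketched below: each model is chained behind the preprocessor in a `Pipeline` and scored with `cross_val_score` on the four-fold split. So that it runs on its own, the sketch uses a small synthetic table with a made-up cancellation rule instead of the real bookings. All names ending in `_demo` are hypothetical, and the scores it prints say nothing about performance on the hotel data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bookings table (not the real dataset).
rng = np.random.default_rng(1234)
n_rows = 400
X_demo = pd.DataFrame({
    "lead_time": rng.integers(0, 365, size=n_rows).astype(float),
    "deposit_type": rng.choice(["No Deposit", "Non Refund"], size=n_rows),
})
# Made-up labelling rule, used only so the models have something to learn.
y_demo = ((X_demo["lead_time"] > 180) | (X_demo["deposit_type"] == "Non Refund")).astype(int)

preprocessor_demo = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["lead_time"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["deposit_type"]),
])

models_demo = [
    ("Decision Tree", DecisionTreeClassifier(random_state=1234)),
    ("Random Forest", RandomForestClassifier(random_state=1234)),
]
split_demo = KFold(n_splits=4, shuffle=True, random_state=1234)

mean_scores = {}
for name, model in models_demo:
    # Preprocessing is refit inside every fold, so no test-fold information leaks into training.
    pipeline = Pipeline(steps=[("preprocessor", preprocessor_demo), ("model", model)])
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=split_demo, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Putting the preprocessor inside the pipeline, rather than transforming the full table up front, keeps each test fold unseen during fitting.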