Feature Engineering for Predicting Hotel Bookings with tidymodels - solution
    Required packages

    library(tidyverse)
    library(tidymodels)
    library(lubridate)
    library(devtools)
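    # Install naniar only if it is not already available
    if (!requireNamespace("naniar", quietly = TRUE)) install.packages("naniar")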
    library(naniar)
    options(warn = -1)

    A silly but informative exercise

    The height data set contains observations of the height of an object at several points in time. Let's build a simple linear model to predict height.

    • Build a linear model using the base R lm function.
    • Bind a prediction column to the height data frame.
    • Graph the data and predictions in one chart.
    height <- read_csv("height.csv")
    
    # Build a linear model using the base R lm function
    lm_model <- lm(height ~ time, data = height)
    
    # Bind a prediction column to the height data frame
    height_lm_aug <- height %>%
      bind_cols(lm_pred = predict(lm_model))
    
    # Graph the data and the predictions in one chart
    height_lm_aug %>%
      ggplot(aes(x = time, y = height)) +
      geom_point() +
      geom_line(aes(y = lm_pred), color = "blue", lwd = .75)

    The straight-line fit is clearly poor. The shape of the data, however, suggests quadratic behavior, so let's test that idea by adding a squared-time term to the model as a new feature.

    • Build a model using the lm() function to predict height in terms of time and time^2.
    • Bind a prediction column to the height data frame.
    • Graph the data and predictions in one chart.
    # Build a model using the lm() function to predict height in terms of time and time^2  
    lm_height_2 <- lm(height ~ time + I(time^2), data = height)
    
    # Bind a prediction column to the height data frame
    height_2_aug <- height %>%
      bind_cols(lm_height_2_pred = predict(lm_height_2))
    
    # Graph the data and predictions in one chart
    height_2_aug %>%
      ggplot(aes(x = time, y = height)) +
      geom_point() +
      geom_line(aes(y = lm_height_2_pred), color = "blue", lwd = .75)
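
To put a number on the improvement, one option is to compare the R² of the two fits. The sketch below uses simulated data as a stand-in, since height.csv is not reproduced here; `toy_height`, its coefficients, and the noise level are assumptions made purely for illustration. It also shows why the formula wraps the squared term in I(): inside a formula, a bare `time^2` is read as a formula operator and collapses back to `time`.

```r
# Simulated stand-in for height.csv (assumed shape: a noisy parabola)
set.seed(42)
toy_height <- data.frame(time = seq(0, 4, by = 0.1))
toy_height$height <- 50 + 20 * toy_height$time - 4.9 * toy_height$time^2 +
  rnorm(nrow(toy_height), sd = 1)

# Same two model specifications as in the exercise
fit_linear    <- lm(height ~ time, data = toy_height)
fit_quadratic <- lm(height ~ time + I(time^2), data = toy_height)

# Without I(), time^2 is treated as formula syntax and reduces to time,
# so this model has only an intercept and a slope
fit_no_I <- lm(height ~ time + time^2, data = toy_height)

# Share of variance explained by each fit
r2 <- c(
  linear    = summary(fit_linear)$r.squared,
  quadratic = summary(fit_quadratic)$r.squared
)
r2
names(coef(fit_no_I))
```

On data shaped like this, the quadratic model should explain far more of the variance than the straight line, which matches what the two charts show visually.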

    The tidymodels framework

    cancelations <- read_csv("cancelations_live.csv")
    cancelations

    Let's take a look at our features

    names(cancelations)

    StaysInWeekNights, for example, is an informative feature because it separates weeknight stays from weekend stays. arrival_date is less so: on its own, it means little to the model beyond a sequence of values. We can make it informative by deriving new features from it, such as the day of the week and the month.

    This way, the model can distinguish a day like "Friday" or a month like "December," which might be meaningful to our modeling problem.
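
As a minimal sketch of that idea, the day of the week and the month can be derived from a date column with lubridate. The `toy_bookings` tibble and the new column names `arrival_dow` and `arrival_month` are hypothetical, and the sketch assumes `arrival_date` is stored as a Date; the real column in cancelations may need parsing first.

```r
library(dplyr)
library(lubridate)

# Hypothetical two-row example; 2017-12-01 fell on a Friday
toy_bookings <- tibble(
  arrival_date = as.Date(c("2017-12-01", "2017-07-04"))
)

# Derive labelled day-of-week and month features from the date
toy_features <- toy_bookings %>%
  mutate(
    arrival_dow   = wday(arrival_date, label = TRUE, abbr = FALSE),
    arrival_month = month(arrival_date, label = TRUE, abbr = FALSE)
  )

toy_features
```

Both new columns are ordered factors, so a model can treat "Friday" or "December" as categories rather than as positions in a sequence. The printed labels follow the session's locale.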

    For today's exploration, we will stick to the tidymodels framework, a collection of modeling and machine learning packages built on tidyverse principles, with an emphasis on feature engineering.
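
Within tidymodels, feature engineering is usually declared as a recipe, so the same transformations can be learned on the training data and reapplied to new data. The sketch below is one possible way to express the date features described earlier, not necessarily the approach this notebook takes; `toy` is a hypothetical four-row data set, and `arrival_date` is assumed to be a Date column.

```r
library(tidymodels)

# Hypothetical miniature data set with a target and a date column
toy <- tibble(
  IsCanceled   = factor(c(0, 1, 0, 1)),
  arrival_date = as.Date(c("2017-12-01", "2017-12-02",
                           "2017-07-03", "2017-07-04"))
)

# Declare the preprocessing: derive day-of-week and month from the date
rec <- recipe(IsCanceled ~ arrival_date, data = toy) %>%
  step_date(arrival_date, features = c("dow", "month"))

# Estimate the recipe, then return the processed training data
baked <- rec %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

step_date() names the derived columns by appending the feature name to the original column name, here `arrival_date_dow` and `arrival_date_month`.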

    Setting up our data for analysis

    • Transform strings into factors.
    • Split into train and test sets stratifying by our target variable.
    • Verify that both sets have similar proportions of the target variable.
    # Transform strings into factors
    cancelations <-  
      cancelations %>% 
      mutate(across(where(is.character), as.factor)) %>%
      mutate(IsCanceled = as.factor(IsCanceled))
    
    # Split into train and test sets stratifying by our target variable
    set.seed(123)
    split <- cancelations %>% initial_split(strata = "IsCanceled")
    train <- training(split)
    test <- testing(split)
    
    # Verify that both sets have similar proportions of the target
    train %>% 
      select(IsCanceled) %>% 
      table() %>% 
      prop.table()
    
    test %>% 
      select(IsCanceled) %>% 
      table() %>% 
      prop.table()
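
Because cancelations_live.csv is not reproduced here, the following self-contained sketch shows what the stratified split is expected to guarantee, using synthetic data. The `toy` tibble, its 70/30 class balance, the `LeadTime` column, and the helper `class_share()` are all assumptions made for illustration.

```r
library(tidymodels)

# Synthetic stand-in: 1,000 bookings, 30% of them canceled
set.seed(123)
toy <- tibble(
  IsCanceled = factor(rep(c(0, 1), times = c(700, 300))),
  LeadTime   = rpois(1000, lambda = 50)  # hypothetical predictor
)

# Stratified split: sampling is done within each level of IsCanceled
toy_split <- initial_split(toy, prop = 0.75, strata = IsCanceled)

# Hypothetical helper: class proportions as a tidy table
class_share <- function(df) {
  df %>%
    count(IsCanceled) %>%
    mutate(prop = n / sum(n))
}

train_share <- class_share(training(toy_split))
test_share  <- class_share(testing(toy_split))

train_share
test_share
```

If stratification worked, the `prop` column should be nearly identical in both tables, which is the same check the prop.table() calls perform on the real data.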