Required packages

library(tidyverse)
library(tidymodels)
library(lubridate)
library(devtools)
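# Install naniar only if it is not already available
if (!requireNamespace("naniar", quietly = TRUE)) install.packages("naniar")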
library(naniar)
options(warn = -1)

A silly but informative exercise

The height data set contains observations of the height of an object at several points in time. Let's build a simple linear model to predict height.

  • Build a linear model using the base R lm function.
  • Bind a prediction column to the height data frame.
  • Graph the data and predictions in one chart.
height <- read_csv("height.csv")

# Build a linear model using the base R lm function
lm_model <- lm(height ~ time, data = height)

# Bind a prediction column to the height data frame
height_lm_aug <- height %>% bind_cols(lm_pred = predict(lm_model))

# Graph the data and the predictions in one chart
height_lm_aug %>% 
	ggplot(aes(x = time, y = height)) +
	geom_point() +
	geom_line(aes(y = lm_pred), color = "blue", lwd = .75)

The straight line clearly misses the pattern in the data. The shape of the points, however, suggests quadratic behavior, so let's test that idea by adding a squared-time term to the model.

  • Build a model using the lm() function to predict height in terms of time and time^2.
  • Bind a prediction column to the height data frame.
  • Graph the data and predictions in one chart.
# Build a linear model with a squared time term using the base R lm function
lm_model <- lm(height ~ time + I(time^2), data = height)

# Bind a prediction column to the height data frame
height_lm_aug <- height %>% bind_cols(lm_pred = predict(lm_model))

# Graph the data and the predictions in one chart
height_lm_aug %>% 
	ggplot(aes(x = time, y = height)) +
	geom_point() +
	geom_line(aes(y = lm_pred), color = "blue", lwd = .75)
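
The charts show the improvement visually; a quick numeric check can back it up. The sketch below is only an illustration: since height.csv is not reproduced here, it simulates a stand-in data set (the coefficients and noise level are assumptions, not the real data) and compares the two model forms used above by R-squared and a nested-model F-test.

```r
# Simulated stand-in for height.csv (hypothetical values, not the real data):
# an object thrown upward, measured with a little noise.
set.seed(42)
sim <- data.frame(time = seq(0, 10, by = 0.25))
sim$height <- 50 * sim$time - 4.9 * sim$time^2 + rnorm(nrow(sim), sd = 3)

# Fit the straight-line model and the model with the squared term
fit_linear    <- lm(height ~ time, data = sim)
fit_quadratic <- lm(height ~ time + I(time^2), data = sim)

# Compare the share of variance explained by each model
r2 <- c(
  linear    = summary(fit_linear)$r.squared,
  quadratic = summary(fit_quadratic)$r.squared
)
print(round(r2, 3))

# F-test for the nested models: does the squared term improve the fit?
print(anova(fit_linear, fit_quadratic))
```

The same two calls, summary() and anova(), should work unchanged on the models fitted to the real height data.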

The tidymodels framework

cancelations <- read_csv("cancelations_live.csv")
cancelations

Let's take a look at our features

names(cancelations)

StaysInWeekNights, for example, is an informative feature because it separates week nights from weekend nights. arrival_date is less so: to the model it is just a series of values. We can make it informative by deriving new features from it, such as the day of the week or the month.
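
A minimal sketch of such derived features, using lubridate on a small made-up table. Only the arrival_date column name comes from the data set; the sample dates and the new column names (arrival_wday, arrival_month, arrival_weekend) are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical bookings; only the arrival_date column mirrors the text
bookings <- tibble(
  arrival_date = as.Date(c("2023-12-22", "2024-01-01", "2024-07-06"))
)

# Derive calendar features the model can treat as categories
bookings_features <- bookings %>%
  mutate(
    # Day of the week and month as labeled, ordered factors
    arrival_wday    = wday(arrival_date, label = TRUE, abbr = FALSE),
    arrival_month   = month(arrival_date, label = TRUE, abbr = FALSE),
    # With week_start = 1, Monday is 1, so 6 and 7 are Saturday and Sunday
    arrival_weekend = wday(arrival_date, week_start = 1) >= 6
  )

print(bookings_features)
```

Note that the day and month labels follow the session's locale, so they may not print in English on every machine.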

This way, the model can distinguish a day like "Friday" or a month like "December," which might be meaningful to our modeling problem.

For today's exploration, we will stick to the tidymodels framework, a collection of modeling and machine learning packages built on tidyverse principles with an emphasis on feature engineering.

Setting up our data for analysis

  • Transform strings into factors.
  • Split into train and test sets stratifying by our target variable.
  • Verify that both sets have similar proportions of the target variable.
# Transform strings into factors
## First, find all columns with characters and make them factor
## Second, change the column IsCanceled to factor

cancelations <- cancelations %>% 
					mutate(across(where(is_character), as.factor)) %>%
					mutate(IsCanceled = as.factor(IsCanceled))

# Split into train and test sets stratifying by our target variable
## First, set seed
## Second, use initial_split function to split the df with stratification on IsCanceled column
## Third, assign train and test dfs to train and test variables

set.seed(123)
split <- cancelations %>% initial_split(strata = "IsCanceled")
train <- training(split)
test <- testing(split)

# Verify that both sets have similar proportions of the target
train %>% 
	select(IsCanceled) %>%
	table() %>%
	prop.table()

test %>%
	select(IsCanceled) %>%
	table() %>%
	prop.table()
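
The two tables above can also be combined into one tidy summary, which makes the comparison easier to read. This is a sketch on made-up data: the toy data frame, its LeadTime column, and the 30% cancelation rate are assumptions, and only IsCanceled mirrors the real data set.

```r
library(dplyr)
library(rsample)

# Hypothetical data with roughly 30% positives, standing in for cancelations
set.seed(123)
toy <- tibble(
  IsCanceled = factor(rbinom(1000, size = 1, prob = 0.3)),
  LeadTime   = rpois(1000, lambda = 40)
)

# Stratified split, as in the cell above (default: 3/4 of rows for training)
toy_split <- initial_split(toy, strata = IsCanceled)

# Stack both sets with a label, then compute class shares per set
split_props <- bind_rows(
  train = training(toy_split),
  test  = testing(toy_split),
  .id = "set"
) %>%
  count(set, IsCanceled) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

print(split_props)
```

With stratification, the prop values for each class should be nearly identical across the train and test rows.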