Required packages
library(tidyverse)
library(tidymodels)
library(lubridate)
library(devtools)
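# Install naniar only if it is not already available
if (!requireNamespace("naniar", quietly = TRUE)) install.packages("naniar")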
library(naniar)
options(warn = -1)
A silly but informative exercise
The height data set contains observations of the height of an object at several points in time. Let's build a simple linear model to predict height.
- Build a linear model using the base R lm function.
- Bind a prediction column to the height data frame.
- Graph the data and predictions in one chart.
height <- read_csv("height.csv")
# Build a linear model using the base R lm function
lm_model <- lm(height ~ time, data = height)
# Bind a prediction column to the height data frame
height_lm_aug <- height %>%
  bind_cols(lm_pred = predict(lm_model))
# Graph the data and the predictions in one chart
height_lm_aug %>%
  ggplot(aes(x = time, y = height)) +
  geom_point() +
  geom_line(aes(y = lm_pred), color = "blue", lwd = .75)
This result is definitely bad. However, the shape of the data suggests quadratic behavior, so let's take a shot at this idea and add a time^2 term to the model.
- Build a model using the lm() function to predict height in terms of time and time^2.
- Bind a prediction column to the height data frame.
- Graph the data and predictions in one chart.
# Build a model using the lm() function to predict height in terms of time and time^2
lm_height_2 <- lm(height ~ time + I(time^2), data = height)
# Bind a prediction column to the height data frame
height_2_aug <-
  height %>%
  bind_cols(lm_height_2_pred = predict(lm_height_2))
# Graph the data and predictions in one chart
height_2_aug %>%
  ggplot(aes(x = time, y = height)) +
  geom_point() +
  geom_line(aes(y = lm_height_2_pred), color = "blue", lwd = .75)
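To back up the visual impression with numbers, the two fits can be compared on R-squared and with an F-test for nested models. The sketch below is illustrative only: it does not read height.csv but builds a simulated data set (`sim`, with made-up coefficients and noise), so the printed values say nothing about the real data.

```r
# Simulated stand-in for the height data (hypothetical values, not height.csv)
set.seed(42)
sim <- data.frame(time = seq(0, 4, by = 0.1))
sim$height <- 20 * sim$time - 4.9 * sim$time^2 + rnorm(nrow(sim), sd = 0.5)

# Fit the straight-line model and the model with the extra time^2 term
fit_linear    <- lm(height ~ time, data = sim)
fit_quadratic <- lm(height ~ time + I(time^2), data = sim)

# Share of variance explained by each model
r2_linear    <- summary(fit_linear)$r.squared
r2_quadratic <- summary(fit_quadratic)$r.squared
print(c(linear = r2_linear, quadratic = r2_quadratic))

# F-test for the nested models: a small p-value favours keeping time^2
comparison <- anova(fit_linear, fit_quadratic)
print(comparison)
```

The same two calls, `summary()$r.squared` and `anova()`, should work unchanged on `lm_model` and `lm_height_2` from above.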
The tidymodels framework
cancelations <- read_csv("cancelations_live.csv")
cancelations
Let's take a look at our features:
names(cancelations)
StaysInWeekNights, for example, is an informative feature, as it distinguishes weeknight stays from weekend stays. arrival_date is less so: to the model it is little more than a series of values. We can make it informative by deriving new features from it, such as the day of the week and the month.
This way, the model can distinguish a day like "Friday" or a month like "December," which might be meaningful to our modeling problem.
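As a minimal sketch of that idea, the block below derives a weekday and a month feature from a date column using base R only. The `bookings` data frame and its dates are made up for illustration, and the fixed English lookup vector is a choice made here to keep the labels independent of the system locale. With lubridate loaded, `wday(arrival_date, label = TRUE)` and `month(arrival_date, label = TRUE)` would be the more usual route.

```r
# Hypothetical bookings with an arrival date (not the real cancelations data)
bookings <- data.frame(
  arrival_date = as.Date(c("2016-12-23", "2016-12-25", "2017-03-06"))
)

# Fixed English labels so the result does not depend on the system locale
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# POSIXlt stores the weekday as 0-6 (Sunday = 0) and the month as 0-11
parts <- as.POSIXlt(bookings$arrival_date)
bookings$arrival_wday  <- factor(day_names[parts$wday + 1], levels = day_names)
bookings$arrival_month <- factor(month.name[parts$mon + 1], levels = month.name)

print(bookings)
```

Storing the new columns as factors with explicit levels keeps the weekday and month order stable, which matters later when the features are turned into dummy variables.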
For today's exploration, we will stick to the tidymodels framework, a collection of modeling and machine learning packages that follow tidyverse principles and emphasize feature engineering.
Setting up our data for analysis
- Transform strings into factors.
- Split into train and test sets stratifying by our target variable.
- Verify that both sets have similar proportions of the target variable.
# Transform strings into factors
cancelations <-
  cancelations %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(IsCanceled = as.factor(IsCanceled))
# Split into train and test sets stratifying by our target variable
set.seed(123)
split <- cancelations %>% initial_split(strata = "IsCanceled")
train <- training(split)
test <- testing(split)
# Verify that both sets have similar proportions of the target
train %>%
  select(IsCanceled) %>%
  table() %>%
  prop.table()
test %>%
  select(IsCanceled) %>%
  table() %>%
  prop.table()
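To see what stratification buys us, here is a base R sketch of the idea behind a stratified split: sample the same fraction of rows within each class, so both sets keep roughly the original class balance. This is not how rsample implements `initial_split()`. The `toy` data and its 30% cancelation rate are assumptions for illustration, and the 75% training share is chosen to match the default `prop = 3/4`.

```r
# Toy data with an imbalanced target (hypothetical, 30% canceled)
set.seed(123)
toy <- data.frame(
  id = 1:1000,
  IsCanceled = factor(rep(c(0, 1), times = c(700, 300)))
)

# Within each class, draw 75% of the row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$IsCanceled)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = floor(0.75 * length(rows)))]
}), use.names = FALSE)

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Class proportions should be nearly identical in both sets
train_props <- prop.table(table(toy_train$IsCanceled))
test_props  <- prop.table(table(toy_test$IsCanceled))
print(rbind(train = train_props, test = test_props))
```

A plain random split would match these proportions only on average; sampling within each class makes the match hold for every split, which is most useful when one class is rare.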