## Required packages

```
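library(tidyverse)   # data wrangling and ggplot2
library(tidymodels)  # modeling framework (rsample, recipes, parsnip, ...)
library(lubridate)   # date handling
library(devtools)
# Install naniar only if it is not already available
if (!requireNamespace("naniar", quietly = TRUE)) install.packages("naniar")
library(naniar)      # tools for exploring missing data
options(warn = -1)   # suppress warnings for cleaner output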
```

## A silly but informative exercise

The `height` data set contains observations of the height of an object at several points in time. Let's build a simple linear model to predict height.

- Build a linear model using the base R `lm` function.
- Bind a prediction column to the height data frame.
- Graph the data and predictions in one chart.

```
height <- read_csv("height.csv")
# Build a linear model using the base R lm function
lm_model <- lm(height ~ time, data = height)
# Bind a prediction column to the height data frame
height_lm_aug <- height %>%
  bind_cols(lm_pred = predict(lm_model))
# Graph the data and the predictions in one chart
height_lm_aug %>%
  ggplot(aes(x = time, y = height)) +
  geom_point() +
  geom_line(aes(y = lm_pred), color = "blue", lwd = 0.75)
```
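One caveat with calling `predict()` without `newdata`: it returns fitted values only for the rows actually used in the fit. The sketch below uses a small made-up data frame (`toy`, not the `height.csv` data) to show how a missing response changes the length of the result, and how passing `newdata` keeps one prediction per row. Whether `height.csv` contains missing values is not stated here, so treat this as a precaution rather than a required fix.

```r
# Hypothetical toy data with one missing response (not the real height data)
toy <- data.frame(
  time   = 0:5,
  height = c(0, 15, 20, NA, 2, -24)
)

fit <- lm(height ~ time, data = toy)

# Fitted values only for the rows used in the fit
# (row 4 is dropped under R's default na.action, na.omit)
fitted_only <- predict(fit)

# One prediction per row of `toy`, because `time` has no missing values
pred_all <- predict(fit, newdata = toy)

length(fitted_only)
length(pred_all)
```

If the lengths differ, `bind_cols()` would refuse to attach the shorter vector, so predicting with `newdata` is the safer habit when missing values are possible.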

The straight line is clearly a poor fit. However, the shape of the data suggests quadratic behavior, so let's take a shot at this idea and add a quadratic term, `time^2`, to the model.

- Build a model using the `lm()` function to predict height in terms of `time` and `time^2`.
- Bind a prediction column to the height data frame.
- Graph the data and predictions in one chart.

```
# Build a model using the lm() function to predict height in terms of time and time^2
lm_height_2 <- lm(height ~ time + I(time^2), data = height)
# Bind a prediction column to the height data frame
height_2_aug <-
  height %>%
  bind_cols(lm_height_2_pred = predict(lm_height_2))
# Graph the data and predictions in one chart
height_2_aug %>%
  ggplot(aes(x = time, y = height)) +
  geom_point() +
  geom_line(aes(y = lm_height_2_pred), color = "blue", lwd = 0.75)
```
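The visual comparison can also be checked numerically. The sketch below does not use `height.csv`; it simulates a hypothetical falling-object trajectory (the coefficients and the noise level are assumptions chosen for illustration) and compares the R² of the two model forms used in this exercise.

```r
# Simulated data standing in for height.csv (all values are assumptions)
set.seed(42)
sim <- data.frame(time = seq(0, 4, by = 0.25))
sim$height <- 20 * sim$time - 4.9 * sim$time^2 + rnorm(nrow(sim), sd = 1)

# The same two model forms as in the exercise
fit_linear    <- lm(height ~ time, data = sim)
fit_quadratic <- lm(height ~ time + I(time^2), data = sim)

# Share of variance explained by each model
r2_linear    <- summary(fit_linear)$r.squared
r2_quadratic <- summary(fit_quadratic)$r.squared

c(linear = r2_linear, quadratic = r2_quadratic)
```

Because the quadratic model nests the linear one, its in-sample R² can never be lower; the size of the gap is what indicates whether the extra term is worthwhile.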

## The tidymodels framework

```
cancelations <- read_csv("cancelations_live.csv")
cancelations
```

Let's take a look at our features:

`names(cancelations)`

`StaysInWeekNights`, for example, is an informative feature because it separates weeknight stays from weekend stays, while `arrival_date` is less so: on its own, it is little more than a series of values to the model. We can make it informative by deriving new features from it, such as the day of the week and the month.
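A minimal sketch of how such features might be derived with `lubridate`, using a made-up three-row table rather than the real data. It assumes `arrival_date` is (or can be parsed as) a `Date`; the table name `bookings` and the new column names `arrival_wday` and `arrival_month` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the cancelations data
bookings <- tibble(
  arrival_date = as.Date(c("2017-12-01", "2017-12-02", "2018-01-15"))
)

bookings_features <- bookings %>%
  mutate(
    # Day of the week as an ordered factor (label text depends on the locale)
    arrival_wday  = wday(arrival_date, label = TRUE),
    # Month as an ordered factor
    arrival_month = month(arrival_date, label = TRUE)
  )

bookings_features
```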

This way, the model can distinguish a day like "Friday" or a month like "December," which might be meaningful to our modeling problem.

For today's exploration, we will stick to the `tidymodels` framework, a collection of modeling and machine learning packages that follow tidyverse principles and emphasize feature engineering.

### Setting up our data for analysis

- Transform strings into factors.
- Split into train and test sets stratifying by our target variable.
- Verify that both sets have similar proportions of the target variable.

```
# Transform strings into factors
cancelations <-
  cancelations %>%
  mutate(across(where(is.character), as.factor)) %>%
  mutate(IsCanceled = as.factor(IsCanceled))
# Split into train and test sets stratifying by our target variable
set.seed(123)
split <- cancelations %>% initial_split(strata = "IsCanceled")
train <- training(split)
test <- testing(split)
# Verify that both sets have similar proportions of the target
train %>%
  select(IsCanceled) %>%
  table() %>%
  prop.table()
test %>%
  select(IsCanceled) %>%
  table() %>%
  prop.table()
```
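To see what `strata` does, here is a sketch on simulated data (the 30% cancelation rate and the `toy_*` object names are assumptions for illustration). With stratification, `initial_split()` samples within each class, so the class proportions in the two sets should stay close to each other.

```r
library(dplyr)
library(rsample)

# Simulated target with 30% cancelations (hypothetical data)
set.seed(123)
toy_bookings <- tibble(
  IsCanceled = factor(rep(c("0", "1"), times = c(700, 300)))
)

# Stratified split with rsample's default proportion (3/4 for training)
toy_split <- initial_split(toy_bookings, strata = IsCanceled)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of cancelations in each set
prop_train <- mean(toy_train$IsCanceled == "1")
prop_test  <- mean(toy_test$IsCanceled == "1")

c(train = prop_train, test = prop_test)
```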