Machine Learning in R
In this notebook, you will build a simple machine learning model that predicts the price of a house from its location. Along the way, you will be introduced to the tidymodels framework for machine learning in R.
library(tidyverse, warn.conflicts = FALSE)
library(tidymodels, warn.conflicts = FALSE)
0. Get the data
The first step is to get the data and explore it. We will use the ames dataset, which ships with the modeldata package and is loaded as part of tidymodels. The documentation of the dataset states:
A data set from De Cock (2011) has 82 fields were recorded for 2,930 properties in Ames IA. This version is copies from the AmesHousing package but does not include a few quality columns that appear to be outcomes rather than predictors.
data(ames)
# Convert column names to snake_case (e.g. Sale_Price becomes sale_price)
ames_data <- ames %>%
  janitor::clean_names()
head(ames_data)
Our objective is to build a model that predicts sale_price. While this dataset is rich in potential features, we will build the simplest possible model, including only latitude and longitude as features, in order to keep the emphasis on the process expounded by tidymodels.
1. Split the data into training and test sets
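Let us start by splitting the data into training and test sets. The basic idea is to train the model on one portion of the data and test its performance on the other portion, which the model has not seen. This gives an honest estimate of how well the model generalizes and lets us detect overfitting. By default, initial_split() keeps three quarters of the rows for training and holds out the rest for testing.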
set.seed(1234)
ames_splits <- initial_split(ames_data)
ames_train <- training(ames_splits)
ames_test <- testing(ames_splits)
ames_splits
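As a quick sanity check on what a split produces, the sketch below runs initial_split() on a small made-up data frame (toy_data and its columns are hypothetical) and passes prop explicitly to change the share of rows kept for training. It then confirms that every row lands in exactly one of the two sets.

```r
library(rsample)  # part of tidymodels

# Made-up data purely for illustration
toy_data <- data.frame(id = 1:100, value = seq(10, 1000, by = 10))

set.seed(1234)
toy_split <- initial_split(toy_data, prop = 0.8)  # keep about 80% for training
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# The two sets together cover the data, and share no rows
nrow(toy_train) + nrow(toy_test) == nrow(toy_data)
length(intersect(toy_train$id, toy_test$id)) == 0
```

The same checks can be run on ames_train and ames_test from the split above.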
2. Choose a class of model and hyperparameters
The next step is to choose a class of models and specify its hyperparameters. For now we fix the hyperparameters by hand; we will see later how to specify a range of values and tune the model for optimal performance! We will pick the simple, yet very effective, K Nearest Neighbours model with 5 neighbours.
ames_knn_mod <- nearest_neighbor() %>%
  set_mode('regression') %>%
  set_engine('kknn') %>%
  update(neighbors = 5)
ames_knn_mod
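To make the idea behind K Nearest Neighbours concrete, here is a minimal base R sketch of the plain, unweighted version: to predict for a new point, find the k closest training points and average their responses. The helper knn_predict and the toy coordinates are hypothetical and exist only for illustration. The kknn engine is more refined (as far as I know it also rescales predictors and weights neighbours by distance), so its predictions will generally differ from this sketch.

```r
# Hypothetical helper: plain, unweighted k-nearest-neighbours regression
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  apply(new_x, 1, function(point) {
    # Euclidean distance from this point to every training row
    dists <- sqrt(rowSums(sweep(train_x, 2, point)^2))
    # Average the response over the k closest training rows
    mean(train_y[order(dists)[seq_len(k)]])
  })
}

# Made-up coordinates and prices: four nearby houses and one far away
train_x <- data.frame(latitude = c(0, 0, 1, 1, 5), longitude = c(0, 1, 0, 1, 5))
train_y <- c(10, 12, 14, 16, 100)
new_x <- data.frame(latitude = 0.5, longitude = 0.5)

# With k = 4 the distant house is ignored, so this averages 10, 12, 14 and 16
knn_predict(train_x, train_y, new_x, k = 4)
```

Raising k to 5 would pull the distant, expensive house into the average, which shows how the choice of neighbors trades local detail for smoothness.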
3. Fit the model to the training data
It is time to fit the model on the training data. We will fit the simplest possible model, which uses the location coordinates of the house as predictors of sale price. Because the raw sale price is strongly right-skewed, we use its base-10 logarithm as the response variable.
ames_knn_fit <- ames_knn_mod %>%
  fit(log10(sale_price) ~ latitude + longitude, data = ames_train)
ames_knn_fit
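The skew that motivates the log transform can be checked directly. In a right-skewed variable the mean is pulled above the median by a few very expensive houses. The sketch below loads the data straight from modeldata, so the column is still called Sale_Price rather than the cleaned sale_price used above.

```r
# Load the original dataset (column names not yet cleaned)
data(ames, package = "modeldata")

raw_price <- ames$Sale_Price
log_price <- log10(raw_price)

# On the raw scale the mean sits well above the median (right skew)
c(mean = mean(raw_price), median = median(raw_price))

# On the log10 scale the two are much closer together
c(mean = mean(log_price), median = median(log_price))
```

A histogram of each variable would show the same thing visually.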
4. Use the fitted model to predict on test data
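Now that we have fitted the model using training data, we can use it to predict prices for houses in the test dataset. We do this using the predict method. Because the model was fitted on log10(sale_price), the predictions in the .pred column are on the log10 scale as well.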
ames_knn_pred <- ames_knn_fit %>%
  predict(new_data = ames_test)
ames_knn_pred
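Since the model was fitted on log10(sale_price), its predictions are log10 prices. A small sketch of converting them back to dollars is shown below. The values in toy_pred are made up and the column name .pred_price is a hypothetical choice; the same mutate() call could be applied to ames_knn_pred.

```r
library(dplyr)

# Made-up predictions on the log10 scale, standing in for ames_knn_pred
toy_pred <- tibble(.pred = c(5.0, 5.25, 5.5))

# Undo the log10 transform: 10^5 is 100,000 dollars, and so on
toy_pred_dollars <- toy_pred %>%
  mutate(.pred_price = 10^.pred)
toy_pred_dollars
```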
5. Evaluate performance
Finally, let us evaluate the performance of the model. A good performance measure to use here is the RMSE (Root Mean Squared Error). Note that it is computed on the log10 scale, since that is the scale the model predicts on. We will also plot the actual prices against the predicted prices to get a sense of how close they are.
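# Attach the predictions to the test data and add the true log10 price
ames_test_pred <- ames_knn_pred %>%
  bind_cols(ames_test) %>%
  mutate(truth = log10(sale_price))
# Plot actual vs. predicted log10 prices with a linear trend line
ames_test_pred %>%
  ggplot(aes(x = truth, y = .pred)) +
  geom_point() +
  geom_smooth(method = 'lm')
# RMSE between actual and predicted log10 prices
ames_knn_metrics <- ames_test_pred %>%
  rmse(truth = truth, estimate = .pred)
ames_knn_metrics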
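To see what the RMSE measures, it can be computed by hand: take the prediction errors, square them, average them, and take the square root. The sketch below does this in base R on made-up log10 values (toy_truth and toy_pred are hypothetical). Applying the same formula to the truth and .pred columns of ames_test_pred should reproduce the value reported by rmse().

```r
# Made-up actual and predicted values on the log10 scale
toy_truth <- c(5.0, 5.2, 5.4)
toy_pred <- c(5.1, 5.2, 5.1)

# Root of the mean of the squared errors
errors <- toy_truth - toy_pred
rmse_manual <- sqrt(mean(errors^2))
rmse_manual

# On the log10 scale, an error of r means being off by a factor of 10^r in dollars
10^rmse_manual
```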