## Machine Learning in R

In this notebook, you will build a simple machine learning model that predicts house prices from their location. Along the way, you will be introduced to the `tidymodels` framework for machine learning in R.

```
library(tidyverse, warn.conflicts = FALSE)
library(tidymodels, warn.conflicts = FALSE)
```

### 0. Get the data

The first step is to get the data and explore it. We will use the `ames` dataset that ships with `tidymodels` (it lives in the `modeldata` package, which `tidymodels` loads). The documentation of the dataset states:

> A data set from De Cock (2011) has 82 fields were recorded for 2,930 properties in Ames IA. This version is copies from the AmesHousing package but does not include a few quality columns that appear to be outcomes rather than predictors.

```
# Load the Ames housing data and convert the column names to snake_case
data(ames)
ames_data <- ames %>%
  janitor::clean_names()
head(ames_data)
```

Our objective is to build a model that predicts `sale_price`. While this dataset is rich in potential **features**, we will build the simplest possible model, using only `latitude` and `longitude` as features, in order to keep the emphasis on the process that `tidymodels` lays out.

### 1. Split the data into training and test sets

Let us start by splitting the data into training and test sets. The basic idea is to train the model on one portion of the data and test its performance on the remaining portion, which the model has not seen. This guards against being misled by **overfitting**: a model that has merely memorised the training data will look good there but perform poorly on the held-out test set.

```
set.seed(1234)
ames_splits <- initial_split(ames_data)
ames_train <- training(ames_splits)
ames_test <- testing(ames_splits)
ames_splits
```
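To make the idea concrete, here is a minimal base R sketch of a random split on a toy data frame. It is an illustration only, not how `initial_split()` is implemented; the `toy` data and the 75% proportion are assumptions (three quarters is, to my understanding, the default proportion that `initial_split()` keeps for training).

```r
# Toy data frame standing in for ames_data (hypothetical values).
set.seed(1234)
toy <- data.frame(id = 1:100, price = runif(100, 50000, 500000))

# Draw 75% of the row indices at random for the training set.
n_train   <- floor(0.75 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

# Rows that were drawn go to training; all remaining rows go to test.
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Every row lands in exactly one of the two sets.
c(train = nrow(toy_train), test = nrow(toy_test))
```

Setting the seed first, as the notebook does, makes the random draw reproducible.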

### 2. Choose a class of model and hyperparameters

The next step is to choose a class of models and specify its hyperparameters. The values chosen here are just a starting point; we will see later how to specify a range of values for the hyperparameters and tune the model for optimal performance. We will pick the simple, yet very effective, K Nearest Neighbours model, with the number of neighbours set to 5!

```
# KNN regression specification using the kknn engine (the kknn package must be installed)
ames_knn_mod <- nearest_neighbor(neighbors = 5) %>%
  set_mode('regression') %>%
  set_engine('kknn')
ames_knn_mod
```
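For intuition about what a K Nearest Neighbours regression does, here is a minimal base R sketch for a single query point. It assumes plain Euclidean distance and an unweighted average of the neighbours' outcomes; the function name `knn_predict` and the toy data are hypothetical. The `kknn` engine may additionally weight neighbours by distance, so its predictions need not match this sketch.

```r
# Unweighted k-nearest-neighbours regression for a single query point.
# train_x: numeric matrix of predictors (one row per training example)
# train_y: numeric vector of outcomes
# query:   numeric vector with one value per predictor column
knn_predict <- function(train_x, train_y, query, k = 5) {
  # Euclidean distance from the query to every training row.
  diffs <- sweep(train_x, 2, query)
  dists <- sqrt(rowSums(diffs^2))
  # Average the outcomes of the k closest training rows.
  nearest <- order(dists)[seq_len(k)]
  mean(train_y[nearest])
}

# Toy example: six points on a line, outcome equal to the coordinate.
train_x <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
train_y <- c(0, 1, 2, 10, 11, 12)

# The three nearest neighbours of 1 are 0, 1 and 2, so the prediction is their mean.
knn_predict(train_x, train_y, query = 1, k = 3)
```

In the notebook the two predictor columns would be `latitude` and `longitude`, so "nearest" simply means geographically close houses.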

### 3. Fit the model to the training data

It is time to fit the model on the training data. We will fit the simplest possible model, which uses the location coordinates of the house as predictors of sale price. We use the base-10 logarithm of the sale price as the response variable, since the distribution of sale prices is highly skewed.

```
ames_knn_fit <- ames_knn_mod %>%
  fit(log10(sale_price) ~ latitude + longitude, data = ames_train)
ames_knn_fit
```
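The effect of the log transform on a skewed variable can be seen on a small made-up example. The prices below are hypothetical, and `skew_gap` is an ad hoc helper (the gap between mean and median in units of standard deviation), not a standard skewness statistic.

```r
# Hypothetical sale prices: most are moderate, one is very expensive.
prices <- c(90000, 120000, 150000, 180000, 210000, 900000)

# Ad hoc skew indicator: how far the mean sits above the median,
# measured in standard deviations.
skew_gap <- function(x) (mean(x) - median(x)) / sd(x)

# On the raw scale the expensive house drags the mean well above the median.
skew_gap(prices)

# On the log10 scale the mean and median sit closer together.
log_prices <- log10(prices)
skew_gap(log_prices)
```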

### 4. Use the fitted model to predict on test data

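Now that we have fitted the model using training data, we can use the fitted model to predict prices for houses in the test dataset. We do this using the `predict` method. Because the model was fitted to `log10(sale_price)`, the predictions in `.pred` are on the log10 scale.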

```
ames_knn_pred <- ames_knn_fit %>%
  predict(new_data = ames_test)
ames_knn_pred
```
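Since the response was `log10(sale_price)`, a prediction can be turned back into a price by raising 10 to that power. A minimal sketch, using hypothetical prediction values rather than real model output:

```r
# Hypothetical predictions on the log10 scale, like those in the `.pred` column.
pred_log10 <- c(5.0, 5.3, 5.5)

# Undo the log10 transform to get prices in the original units.
pred_price <- 10^pred_log10
pred_price
```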

### 5. Evaluate performance

Finally, let us evaluate the performance of the model. A good performance measure to use here is the RMSE (Root Mean Squared Error), computed here on the log10 scale; lower values are better. We will also plot the actual prices against the predicted prices to get a sense of how close they are.

```
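# Attach the test data to the predictions and put the truth on the log10 scale
ames_test_pred <- ames_knn_pred %>%
  bind_cols(ames_test) %>%
  mutate(truth = log10(sale_price))

# Plot actual vs. predicted log10 prices with a linear trend line
ames_test_pred %>%
  ggplot(aes(x = truth, y = .pred)) +
  geom_point() +
  geom_smooth(method = 'lm')

# Compute the RMSE on the log10 scale
ames_knn_metrics <- ames_test_pred %>%
  rmse(truth = truth, estimate = .pred)
ames_knn_metrics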
```
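For reference, the RMSE is the square root of the mean of the squared differences between truth and estimate. The base R sketch below computes it by hand on made-up values; `rmse_by_hand` is a hypothetical helper for illustration, not part of `yardstick`.

```r
# Root mean squared error: sqrt of the average squared prediction error.
rmse_by_hand <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Hypothetical log10 prices and predictions.
truth    <- c(5.0, 5.2, 5.4)
estimate <- c(5.1, 5.2, 5.2)

rmse_by_hand(truth, estimate)
```

Because the errors are measured on the log10 scale, they are multiplicative on the price scale: an error of 0.1 corresponds to being off by a factor of about 1.26 (that is, 10^0.1).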