Feature selection
# Load the packages used throughout these notes
library(tidyverse)   # dplyr, tidyr, ggplot2
library(tidymodels)  # recipes, rsample, parsnip, broom
library(corrr)       # correlate(), shave(), rplot()
Unsupervised feature selection methods
- drop features with many missing values
- drop features with low variance
- drop features that are highly correlated with other features
When training a machine learning model, you want a sample that includes each combination of feature values several times, so that every combination appears at least once in both the training and the testing set. In this example, healthcare_cat_df had eight categorical dimensions and needed a bare minimum of 6,480 observations: the product of the number of unique values in each column.
# Calculate the minimum number of value combinations:
# the product of the number of unique values in each column
healthcare_cat_df %>%
  summarise(across(everything(), ~ length(unique(.)))) %>%
  prod()
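Where does a figure like 6,480 come from? It is just the product of each feature's number of unique levels. A minimal sketch, with hypothetical level counts chosen only to reproduce the arithmetic (the actual healthcare_cat_df counts may differ):
# Hypothetical level counts for eight categorical features
levels_per_feature <- c(2, 2, 3, 3, 3, 3, 4, 5)
prod(levels_per_feature)  # 2*2*3*3*3*3*4*5 = 6480 combinations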
# Create a zero-variance filter: the names of features whose variance is zero
zero_var_filter <- house_sales_df %>%
  summarise(across(everything(), ~ var(., na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "variance") %>%
  filter(variance == 0) %>%
  pull(feature)
# Create a missing-values filter: the names of features that are more than 80% NA
n <- nrow(house_sales_df)
na_filter <- house_sales_df %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "NA_count") %>%
  filter(NA_count / n > 0.8) %>%
  pull(feature)
# Combine the two filters
low_info_filter <- c(zero_var_filter, na_filter)

# Apply the filter
house_sales_filtered_df <- house_sales_df %>%
  select(-all_of(low_info_filter))
Tidymodels approach
# Create a recipe that drops predictors with more than 50% missing values
missing_vals_recipe <-
  recipe(price ~ ., data = house_sales_df) %>%
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  prep()

# Apply the recipe to the data
filtered_house_sales_df <-
  bake(missing_vals_recipe, new_data = NULL)
# Prepare a recipe that removes zero-variance and near-zero-variance predictors
low_variance_recipe <- recipe(price ~ ., data = house_sales_df) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  prep()

# Apply the recipe
filtered_house_sales_df <- bake(low_variance_recipe, new_data = NULL)
Selecting based on correlation with other features
# Create a correlation plot of the numeric features
credit_df %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  shave() %>%  # keep only the lower triangle of the correlation matrix
  rplot(print_cor = TRUE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Create a recipe using step_corr() to remove numeric predictors correlated above 0.7
corr_recipe <-
  recipe(price ~ ., data = house_sales_df) %>%
  step_corr(all_numeric_predictors(), threshold = 0.7) %>%
  prep()

# Apply the recipe to the data
filtered_house_sales_df <-
  corr_recipe %>%
  bake(new_data = NULL)

# Identify the features that were removed
tidy(corr_recipe, number = 1)
Notice that step_corr() removes the minimal number of features needed to bring every remaining pairwise correlation below the threshold; it does not drop every feature that is correlated above the threshold.
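A toy sketch to make that concrete (my own construction, not from the course): x1, x2, and x3 below are all pairwise correlated well above 0.7, yet step_corr() keeps one of them instead of dropping all three.
set.seed(42)
toy_df <- tibble(
  x1 = rnorm(100),
  x2 = x1 + rnorm(100, sd = 0.1),  # strongly correlated with x1
  x3 = x1 + rnorm(100, sd = 0.1),  # strongly correlated with x1 and x2
  y  = rnorm(100)
)

toy_recipe <- recipe(y ~ ., data = toy_df) %>%
  step_corr(all_numeric_predictors(), threshold = 0.7) %>%
  prep()

tidy(toy_recipe, number = 1)  # lists the removed predictors: two of the three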
Supervised feature selection
- Entropy (information gain)
- Recursive feature elimination
- Lasso regression (see the sketch after this list)
- Random forest models (an importance-based sketch follows the train/test split below)
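A minimal sketch of the lasso route, assuming the glmnet engine is installed; the penalty value of 0.01 is illustrative, and in practice you would fit on the training split created below rather than on the full data.
# Lasso (L1-penalized) logistic regression: the penalty shrinks the
# coefficients of uninformative features to exactly zero
lasso_fit <- logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  fit(Attrition ~ ., data = attrition_df)

# The selected features are the terms whose coefficients survived the penalty
lasso_fit %>%
  tidy() %>%
  filter(estimate != 0, term != "(Intercept)") %>%
  pull(term)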
# Initialize the train/test split, stratified on the Attrition outcome
split <- initial_split(attrition_df, prop = 0.8, strata = Attrition)

# Extract training set
train <- split %>% training()

# Extract testing set
test <- split %>% testing()
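Continuing the random-forest route from the list above: a hedged sketch that ranks features by impurity importance, assuming the ranger and vip packages are installed. The importance = "impurity" engine argument, vip::vi(), and the top-10 cutoff are my additions, not from the course.
# Fit a random forest on the training set with impurity-based importance
rf_fit <- rand_forest(mode = "classification", trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  fit(Attrition ~ ., data = train)

# Rank features by importance and keep the top 10 as a candidate subset
library(vip)
vi(rf_fit) %>%
  slice_max(Importance, n = 10) %>%
  pull(Variable)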