Course Notes: Dimensionality Reduction in R

    Feature selection

    # Import any packages you want to use here
    library(tidyverse)   # dplyr, tidyr, ggplot2
    library(tidymodels)  # recipes, rsample, parsnip, workflows
    library(corrr)       # correlate(), shave(), rplot(), stretch()
    

    Unsupervised feature selection methods

    • drop features with many missing values
    • drop features with low variance
    • drop highly correlated features


    When training a machine learning model, you want a sample that includes each combination of feature values several times, so that every combination appears at least once in both the training and the testing set. The bare minimum is therefore the product of the number of unique values in each column. In this example, healthcare_cat_df had eight categorical dimensions and needed a bare minimum of 6,480 observations.

    # Calculate the minimum number of value combinations
    healthcare_cat_df %>% 
      summarise(across(everything(), ~ length(unique(.)))) %>% 
      prod()

    # Create zero-variance filter
    zero_var_filter <- house_sales_df %>% 
      summarise(across(everything(), ~ var(., na.rm = TRUE))) %>% 
      pivot_longer(everything(), names_to = "feature", values_to = "variance") %>% 
      filter(variance == 0) %>% 
      pull(feature)
    
    
    # Create a missing values filter
    n <- nrow(house_sales_df)
    na_filter <- house_sales_df %>% 
      summarize(across(everything(), ~ sum(is.na(.)))) %>% 
      pivot_longer(everything(), names_to = "feature", values_to = "NA_count") %>% 
      filter(NA_count/n > 0.8) %>% 
      pull(feature)
    
    # Combine the two filters
    low_info_filter <- c(zero_var_filter, na_filter)
    
    # Apply the filter
    house_sales_filtered_df <- house_sales_df %>% 
      select(-all_of(low_info_filter))

    # Tidymodels approach
    # Create missing values recipe
    missing_vals_recipe <- 
      recipe(price ~ ., data = house_sales_df) %>% 
      step_filter_missing(all_predictors(), threshold = .5) %>% 
      prep()
      
    # Apply recipe to data
    filtered_house_sales_df <- 
      bake(missing_vals_recipe, new_data = NULL)
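
    To check which features the step dropped, the prepped recipe can be inspected with tidy(), the same pattern used for the correlation recipe further down; this sketch assumes step_filter_missing() is the recipe's first step:

    # List the features removed for excessive missingness
    tidy(missing_vals_recipe, number = 1)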
    
    
    # Prepare recipe
    low_variance_recipe <- recipe(price ~ ., data = house_sales_df) %>% 
      step_zv(all_predictors()) %>% 
      step_scale(all_numeric_predictors()) %>%  
      step_nzv(all_predictors()) %>%
      prep()
    
    # Apply recipe
    filtered_house_sales_df <- bake(low_variance_recipe, new_data = NULL)

    Selecting based on correlation with other features

    # Create a correlation plot
    credit_df %>% 
      select(where(is.numeric)) %>% 
      correlate() %>% 
      shave() %>% 
      rplot(print_cor = TRUE) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))
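
    The plot is good for scanning, but the pairs behind it can also be listed as a table. A minimal sketch using corrr's stretch() on the same credit_df; the 0.7 cutoff is just an illustration chosen to match the recipe below:

    # List feature pairs with absolute correlation above 0.7
    credit_df %>% 
      select(where(is.numeric)) %>% 
      correlate() %>% 
      shave() %>%                   # keep one triangle so each pair appears once
      stretch(na.rm = TRUE) %>%     # long format: columns x, y, r
      filter(abs(r) > 0.7) %>% 
      arrange(desc(abs(r)))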
    
    # Create a recipe using step_corr to remove numeric predictors correlated > 0.7
    corr_recipe <-  
      recipe(price ~ ., data = house_sales_df) %>% 
      step_corr(all_numeric_predictors(), threshold = 0.7) %>% 
      prep() 
    
    # Apply the recipe to the data
    filtered_house_sales_df <- 
      corr_recipe %>% 
      bake(new_data = NULL)
    
    # Identify the features that were removed
    tidy(corr_recipe, number = 1)

    Notice that step_corr() removes the minimum number of features needed to bring all remaining pairwise correlations below the threshold, not every feature involved in a high correlation.

    Supervised feature selection

    • Entropy (information gain)
    • Recursive feature elimination
    • Lasso regression
    • Random forest models (lasso and random-forest sketches follow the split code below)

    # Initialize the split
    split <- initial_split(attrition_df, prop = 0.8, strata = Attrition)
    
    # Extract training set
    train <- split %>% training()
    
    # Extract testing set
    test <- split %>% testing()
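
    The split above feeds the supervised methods. Two of the listed approaches, lasso regression and random forest importance, can be sketched with tidymodels. These are minimal sketches, not the course's reference solution: they assume the glmnet and ranger engines are installed, and the penalty of 0.01 is an arbitrary illustration that would normally be tuned.

    # Lasso: predictors whose coefficients shrink to exactly zero can be dropped
    lasso_fit <- workflow() %>% 
      add_recipe(
        recipe(Attrition ~ ., data = train) %>% 
          step_dummy(all_nominal_predictors()) %>% 
          step_normalize(all_numeric_predictors())
      ) %>% 
      add_model(
        logistic_reg(penalty = 0.01, mixture = 1) %>%   # mixture = 1 is pure lasso
          set_engine("glmnet")
      ) %>% 
      fit(data = train)
    
    # Keep only the predictors with non-zero coefficients
    lasso_fit %>% 
      extract_fit_parsnip() %>% 
      tidy() %>% 
      filter(estimate != 0)
    
    # Random forest: rank features by impurity-based importance
    rf_fit <- rand_forest(mode = "classification") %>% 
      set_engine("ranger", importance = "impurity") %>% 
      fit(Attrition ~ ., data = train)
    
    rf_fit %>% 
      extract_fit_engine() %>%      # the underlying ranger object
      ranger::importance() %>% 
      sort(decreasing = TRUE)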