Course Notes: Dimensionality Reduction in R
  • Feature selection

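    # Load the packages used in these notes
    library(tidyverse)   # dplyr, tidyr, ggplot2
    library(tidymodels)  # recipes, rsample
    library(corrr)       # correlate(), shave(), rplot()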

    Unsupervised feature selection methods

    • drop features with many missing values
    • drop features with low variance
    • drop features that are too highly correlated with other features

    When training a machine learning model, you want a sample that includes each combination of feature values several times, so that every combination appears at least once in both the training and the testing set. In this example, healthcare_cat_df had eight dimensions and needed a bare minimum of 6,480 observations, one for each possible combination of values.
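
    To make the arithmetic concrete, here is a minimal sketch on a made-up data frame (toy_df and its columns are hypothetical and unrelated to healthcare_cat_df). The number of value combinations is the product of the number of distinct values in each column.

```r
# Hypothetical toy data: three categorical features
toy_df <- data.frame(
  sex    = c("F", "M", "F", "M", "F", "M"),
  smoker = c("yes", "no", "no", "yes", "no", "no"),
  region = c("north", "south", "east", "north", "south", "east")
)

# Number of distinct values per feature
levels_per_feature <- sapply(toy_df, function(x) length(unique(x)))

# Minimum number of observations needed to see every combination once
min_combinations <- prod(levels_per_feature)

print(levels_per_feature)  # sex = 2, smoker = 2, region = 3
print(min_combinations)    # 12
```

    With only 6 rows, toy_df cannot cover all 12 combinations even once, which is the situation the bare minimum is meant to flag.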

    # Calculate the minimum number of value combinations
    # (product of the number of unique values in each column)
    healthcare_cat_df %>%
      summarise(across(everything(), ~ length(unique(.)))) %>%
      prod()
    # Create zero-variance filter (assumes all columns are numeric)
    zero_var_filter <- house_sales_df %>%
      summarise(across(everything(), ~ var(., na.rm = TRUE))) %>%
      pivot_longer(everything(), names_to = "feature", values_to = "variance") %>%
      filter(variance == 0) %>%
      pull(feature)
    # Create a missing values filter (features with more than 80% missing values)
    n <- nrow(house_sales_df)
    na_filter <- house_sales_df %>%
      summarise(across(everything(), ~ sum(is.na(.)))) %>%
      pivot_longer(everything(), names_to = "feature", values_to = "NA_count") %>%
      filter(NA_count / n > 0.8) %>%
      pull(feature)
    # Combine the two filters
    low_info_filter <- c(zero_var_filter, na_filter)
    # Apply the filter
    house_sales_filtered_df <- house_sales_df %>%
      select(-all_of(low_info_filter))
    # Tidymodels approach
    # Create missing values recipe (drops predictors with more than 50% missing values)
    missing_vals_recipe <-
      recipe(price ~ ., data = house_sales_df) %>%
      step_filter_missing(all_predictors(), threshold = .5) %>%
      prep()
    # Apply recipe to data
    filtered_house_sales_df <- 
      bake(missing_vals_recipe, new_data = NULL)
    # Prepare recipe (drops predictors with zero variance)
    low_variance_recipe <- recipe(price ~ ., data = house_sales_df) %>%
      step_zv(all_predictors()) %>%
      prep()
    # Apply recipe
    filtered_house_sales_df <- bake(low_variance_recipe, new_data = NULL)

    Selecting based on correlation with other features

    # Create a correlation plot
    credit_df %>% 
      select(where(is.numeric)) %>% 
      correlate() %>% 
      shave() %>% 
      rplot(print_cor = TRUE) +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))
    # Create a recipe using step_corr to remove numeric predictors correlated > 0.7
    corr_recipe <-
      recipe(price ~ ., data = house_sales_df) %>%
      step_corr(all_numeric_predictors(), threshold = 0.7) %>%
      prep()
    # Apply the recipe to the data
    filtered_house_sales_df <- 
      corr_recipe %>% 
      bake(new_data = NULL)
    # Identify the features that were removed
    tidy(corr_recipe, number = 1)

    Note that step_corr() tries to remove as few features as it needs to keep every remaining pairwise correlation below the threshold. It does not remove every feature that is correlated above the threshold.
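
    A minimal sketch of this behaviour on simulated data (toy_df and its columns are hypothetical, and the example assumes the recipes package is installed). x1 and x2 are almost perfectly correlated and x3 is independent, so both x1 and x2 are correlated above the threshold, yet only one of them should be dropped.

```r
library(recipes)

set.seed(42)
n <- 200
x1 <- rnorm(n)

# Hypothetical toy data: x2 is a near-duplicate of x1, x3 is unrelated
toy_df <- data.frame(
  y  = rnorm(n),
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.05),
  x3 = rnorm(n)
)

# Same step and threshold as in the notes, applied to the toy data
toy_recipe <- recipe(y ~ ., data = toy_df) %>%
  step_corr(all_numeric_predictors(), threshold = 0.7) %>%
  prep()

# Features removed by the step, and columns that survive it
removed <- tidy(toy_recipe, number = 1)$terms
kept <- names(bake(toy_recipe, new_data = NULL))

print(removed)
print(kept)
```

    Dropping one member of the correlated pair is enough to bring the largest remaining correlation below 0.7, so the other member and x3 are kept.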

    Supervised feature selection

    • Entropy (information gain)
    • Recursive feature elimination
    • Lasso regression
    • Random forest models

    # Initialize the split
    split <- initial_split(attrition_df, prop = 0.8, strata = Attrition)
    # Extract training set
    train <- split %>% training()
    # Extract testing set
    test <- split %>% testing()
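
    As an illustration of the first method in the list above, information gain can be computed by hand. This is a minimal base-R sketch on made-up vectors (entropy(), info_gain() and the toy data are hypothetical helpers written for this example, not functions from a package). The gain is the entropy of the target minus the weighted average entropy of the target within each level of the feature.

```r
# Shannon entropy (in bits) of a categorical vector
entropy <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  p <- p[p > 0]  # drop empty levels so log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain of a categorical feature with respect to a target
info_gain <- function(feature, target) {
  # Share of observations in each feature level
  weights <- as.numeric(table(feature)) / length(feature)
  # Entropy of the target within each feature level (same level order)
  within <- sapply(split(target, feature), entropy)
  entropy(target) - sum(weights * within)
}

# Hypothetical toy data
target    <- c("yes", "yes", "no", "no")
f_perfect <- c("a", "a", "b", "b")  # separates the classes completely
f_useless <- c("a", "b", "a", "b")  # tells us nothing about the target

gain_perfect <- info_gain(f_perfect, target)
gain_useless <- info_gain(f_useless, target)

print(gain_perfect)  # 1 bit
print(gain_useless)  # 0 bits
```

    Ranking features by this score and keeping the highest ones is one way to use entropy for supervised selection. In practice the score would be computed on the training set only.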