Feature selection vs. feature extraction

Dimensionality Reduction in R

Matt Pickard

Owner, Pickard Predictives, LLC

Approaches to dimensionality reduction

Vegetable garden

  • Feature selection is like pulling weeds
  • Feature extraction is like making a salad
1 Image Source: Daderot, CC0, via Wikimedia Commons

Feature selection

A set of six features that are color coded

Feature selection

A set of six features, with the low-information features being filtered out

Feature selection

A filtered set of four features

Example credit data

credit_df %>% head(n=5)
  annual_income num_bank_accounts num_credit_card outstanding_debt credit_history_months
          <dbl>             <dbl>           <dbl>            <dbl>                 <dbl>
1        87630.                 2               5             526.                   286
2        16574.                 2               5              NA                    122
3        24931.                 2               5              NA                    351
4       136680.                 2               5              NA                    216
5        76850.                 2               5            1112.                   272

Create a zero-variance filter

low_var_filter <- credit_df %>%
  summarize(across(everything(), ~ var(., na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "variance") %>%
  filter(variance == 0) %>%
  pull(feature)
low_var_filter
"num_bank_accounts" "num_credit_card"

Create a missing-values filter

na_filter <- credit_df %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "num_missing_values") %>%
  filter(num_missing_values > 0) %>%
  pull(feature)
na_filter
"outstanding_debt"

Applying the combined filter

combined_filter <- 
  c(low_var_filter, na_filter)

credit_df %>% 
  select(-all_of(combined_filter)) %>% 
  head(3)
  annual_income credit_history_months
          <dbl>                 <dbl>
1        87630.                   286
2        16574.                   122
3        24931.                   351
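
The slides assume that credit_df already exists. As a minimal, self-contained sketch of the same two filters, the example below builds a toy tibble and applies them end to end. The name toy_credit_df is hypothetical, and its values are rounded from the five rows printed earlier, so this illustrates the approach rather than reproducing the course data.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for credit_df (values rounded from the printed rows)
toy_credit_df <- tibble(
  annual_income         = c(87630, 16574, 24931, 136680, 76850),
  num_bank_accounts     = c(2, 2, 2, 2, 2),
  num_credit_card       = c(5, 5, 5, 5, 5),
  outstanding_debt      = c(526, NA, NA, NA, 1112),
  credit_history_months = c(286, 122, 351, 216, 272)
)

# Features whose variance is exactly zero (constant columns)
low_var_filter <- toy_credit_df %>%
  summarize(across(everything(), ~ var(.x, na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "variance") %>%
  filter(variance == 0) %>%
  pull(feature)

# Features with at least one missing value
na_filter <- toy_credit_df %>%
  summarize(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "num_missing_values") %>%
  filter(num_missing_values > 0) %>%
  pull(feature)

# Drop every flagged feature in one step
reduced_df <- toy_credit_df %>%
  select(-all_of(c(low_var_filter, na_filter)))

names(reduced_df)
```

Comparing the variance to exactly zero only catches strictly constant columns. Whether a small tolerance or a missing-value threshold above zero would suit a given dataset is a judgment call that the slides do not address.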

Feature extraction

A set of six features that are color coded

Feature extraction

Some features combined to make four features

Feature extraction and mutual information

Venn diagram with intersection
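
One rough way to picture the intersection in the Venn diagram is the correlation between two features. The sketch below uses simulated, hypothetical height and weight values (not course data) and treats Pearson correlation as a simple linear proxy for shared information. It is not a formal mutual information estimate.

```r
set.seed(1)
n <- 500

# Hypothetical simulated features: weight depends partly on height
height_cm <- rnorm(n, mean = 170, sd = 10)
weight_kg <- 0.9 * (height_cm - 100) + rnorm(n, mean = 0, sd = 8)

# Pearson correlation: a rough linear proxy for how much the features overlap
shared <- cor(height_cm, weight_kg)

# Share of weight's variance that height accounts for linearly (R squared)
r_squared <- shared^2

shared
r_squared
```

A correlation near zero would suggest little linear overlap, while a value near one would suggest the two features carry largely the same information.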

Feature extraction: Combining mutually exclusive info

Combined features including mutual information and mutually exclusive information

Feature extraction: Combining mutually exclusive info

Combined features with mutual information removed

Advantages and disadvantages of feature extraction

Advantages
  • can combine information into new features
Disadvantages
  • implementation is more complicated
  • new features are difficult to interpret

Principal component analysis of body mass index, height and weight
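
As a hedged illustration of the figure's idea, the sketch below runs principal component analysis on simulated height, weight, and body mass index values. The data and the name body_df are hypothetical, and prcomp() from base R's stats package is assumed as the PCA implementation; the slides do not say which function produced the figure.

```r
set.seed(42)
n <- 200

# Hypothetical simulated body measurements (not course data)
height_m  <- rnorm(n, mean = 1.70, sd = 0.10)
weight_kg <- 22 * height_m^2 + rnorm(n, mean = 0, sd = 6)
body_df <- data.frame(
  height_m  = height_m,
  weight_kg = weight_kg,
  bmi       = weight_kg / height_m^2
)

# PCA on centered, scaled features
pca <- prcomp(body_df, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the first two components as the extracted features
extracted <- as.data.frame(pca$x[, 1:2])

var_explained
pca$rotation
```

The loadings in pca$rotation show how each component mixes the original features, which is where the interpretability cost listed above comes from: PC1 and PC2 are weighted blends, not height or weight themselves.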

Let's practice!
