Handling low-information predictors

Machine Learning with caret in R

Zach Mayer

Data Scientist at DataRobot and co-author of caret

No (or low) variance variables

Some variables contain very little information
- Constant (no variance)
- Nearly constant (low variance)
A nearly constant column can easily become constant within a single CV fold
- This can cause problems for your models
Usually, remove variables with extremely low variance
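
The point about CV folds can be sketched in base R. The toy column `x` and the choice of `train_rows` are hypothetical and picked by hand so the effect is easy to see; a real resampling split would choose rows at random.

```r
# Hypothetical toy column: 30 zeros and 2 ones, so it is nearly constant
x <- c(rep(0, 30), 1, 1)

# Over all rows the variance is small but not zero
full_variance <- var(x)

# Hypothetical fold: suppose the training rows happen to be the first 24,
# which excludes both of the ones
train_rows <- 1:24
fold_variance <- var(x[train_rows])

# Within this fold the column is constant, so its variance is zero
print(full_variance > 0)
print(fold_variance == 0)
```

The fewer distinct values a column has, the more likely it is that some fold sees only one of them.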

Example: constant column in mtcars

# Reproduce dataset from last video
data(mtcars)
set.seed(42)
mtcars[sample(1:nrow(mtcars), 10), "hp"] <- NA
Y <- mtcars$mpg
X <- mtcars[, 2:4]

# Add constant-valued column to mtcars
X$bad <- 1
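
A quick way to confirm that the new column really is constant is to compute the variance of every predictor. This is a minimal base R sketch that rebuilds `X` without the missing values; the names `col_variance` and `constant_cols` are introduced here for illustration only.

```r
# Rebuild the predictors and add the constant column
data(mtcars)
X <- mtcars[, 2:4]
X$bad <- 1

# Per-column variance; a value of 0 marks a constant column
col_variance <- sapply(X, var, na.rm = TRUE)

# Names of the columns whose variance is zero
constant_cols <- names(col_variance)[col_variance == 0]
print(constant_cols)
```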

Example: constant column in mtcars

# Try to fit a model with PCA + glm
library(caret)
model <- train(
  X, Y, method = "glm",
  preProcess = c("center", "scale", "medianImpute", "pca"))

Warning in preProcess.default(thresh = 0.95, k = 5, method = c("medianImpute",  :
  These variables have zero variances: bad
Something is wrong; all the RMSE metric values are missing:
      RMSE        Rsquared  
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :1     NA's   :1
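
One likely reason a zero-variance column is troublesome is that scaling divides by the column's standard deviation, which is 0 for a constant column. The sketch below shows that arithmetic with base R's `scale()`. It is an illustration of the general issue, not a trace of what caret does internally.

```r
# A constant column like `bad`
bad <- rep(1, 5)

# Its standard deviation is zero
bad_sd <- sd(bad)

# Centering gives 0 everywhere, and dividing 0 by an sd of 0 gives NaN
scaled <- scale(bad)
print(bad_sd)
print(all(is.nan(scaled)))
```

Once a column is all NaN, later steps such as PCA have nothing valid to work with.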

caret to the rescue (again)

"zv" removes constant columns
"nzv" removes nearly constant columns

# Have caret remove those columns during modeling
set.seed(42)
model <- train(
  X, Y, method = "glm", 
  preProcess = c("zv", "center", "scale", "medianImpute", "pca")
)
min(model$results$RMSE)

3.402557
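
To see which columns would be flagged before fitting anything, caret also provides the standalone function `nearZeroVar()`. The sketch below rebuilds `X` without the missing values and asks for the per-column metrics. How closely these flags match the "zv" and "nzv" preprocessing steps is an assumption here, so treat it as a diagnostic aid.

```r
library(caret)

# Rebuild the predictors and add the constant column
data(mtcars)
X <- mtcars[, 2:4]
X$bad <- 1

# saveMetrics = TRUE returns one row per column, including the
# logical flags zeroVar and nzv
metrics <- nearZeroVar(X, saveMetrics = TRUE)
print(metrics[, c("zeroVar", "nzv")])
```

Columns flagged `zeroVar` are constant. Columns flagged `nzv` are constant or nearly constant under the function's default cutoffs.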

Let's practice!
