Categorical inputs

Supervised Learning in R: Regression

Nina Zumel and John Mount

Win-Vector, LLC

Example: Effect of Diet on Weight Loss

WtLoss24 ~ Diet + Age + BMI
Diet      Age  BMI    WtLoss24
Med       59   30.67  -6.7
Low-Carb  48   29.59   8.4
Low-Fat   52   32.9    6.3
Med       53   28.92   8.3
Low-Fat   47   30.20   6.3

model.matrix()

model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)
  • All numerical values
  • Converts categorical variable with N levels into N - 1 indicator variables
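
As a minimal sketch of the two points above, the snippet below rebuilds the five example rows as a data frame and passes them to model.matrix(). The data frame name diet matches the call on the slide; treating Diet as a factor with default (alphabetical) level order is an assumption.

```r
# Toy data frame holding only the five example rows (the real data set is larger)
diet <- data.frame(
  Diet     = factor(c("Med", "Low-Carb", "Low-Fat", "Med", "Low-Fat")),
  Age      = c(59, 48, 52, 53, 47),
  BMI      = c(30.67, 29.59, 32.90, 28.92, 30.20),
  WtLoss24 = c(-6.7, 8.4, 6.3, 8.3, 6.3)
)

# Build the all-numeric design matrix used by the model
mm <- model.matrix(WtLoss24 ~ Diet + Age + BMI, data = diet)

# Diet has 3 levels, so it contributes 3 - 1 = 2 indicator columns
print(colnames(mm))
print(mm)
```

The outcome column WtLoss24 does not appear in the matrix; only the intercept and the inputs do.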

Indicator Variables to Represent Categories

Original Data

Diet      Age  ...
Med       59   ...
Low-Carb  48   ...
Low-Fat   52   ...
Med       53   ...
Low-Fat   47   ...

Model Matrix

(Int)  DietLow-Fat  DietMed  ...
1      0            1        ...
1      0            0        ...
1      1            0        ...
1      0            1        ...
1      1            0        ...
  • reference level: "Low-Carb"
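
The reference level is the factor's first level, which R orders alphabetically by default; that is why "Low-Carb" gets no indicator column. A small sketch of how to inspect and change it follows; the variable names diet_type and diet_med_ref are hypothetical.

```r
# The Diet values from the example rows, as a factor
diet_type <- factor(c("Med", "Low-Carb", "Low-Fat", "Med", "Low-Fat"))

# The first level listed is the reference level
print(levels(diet_type))

# relevel() moves a chosen level to the front, making it the new reference
diet_med_ref <- relevel(diet_type, ref = "Med")
print(levels(diet_med_ref))

# The indicator columns now describe Low-Carb and Low-Fat relative to Med
mm_med <- model.matrix(~ diet_med_ref)
print(colnames(mm_med))
```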

Interpreting the Indicator Variables

Linear Model:

$$ WtLoss24 = \beta_{0} + \beta_{DietLow\text{-}Fat} x_{DietLow\text{-}Fat} + \beta_{DietMed} x_{DietMed} + \beta_{Age} x_{Age} + \beta_{BMI} x_{BMI} $$

lm(WtLoss24 ~ Diet + Age + BMI, data = diet)
Coefficients:
      (Intercept)        DietLow-Fat     DietMed  
         -1.37149           -2.32130    -0.97883  
              Age                BMI  
          0.12648            0.01262
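
One way to read these coefficients: each diet coefficient is the predicted difference in WtLoss24 relative to the reference level (Low-Carb), with Age and BMI held fixed. The sketch below plugs the printed coefficients into the linear model by hand; the helper predict_wtloss() and the example inputs (age 50, BMI 30) are hypothetical.

```r
# Coefficients copied from the lm() output above
coefs <- c("(Intercept)" = -1.37149,
           "DietLow-Fat" = -2.32130,
           "DietMed"     = -0.97883,
           "Age"         =  0.12648,
           "BMI"         =  0.01262)

# Hypothetical helper: evaluate the linear model for one person
predict_wtloss <- function(diet, age, bmi) {
  x <- c(1,                             # intercept
         as.numeric(diet == "Low-Fat"), # indicator x_DietLow-Fat
         as.numeric(diet == "Med"),     # indicator x_DietMed
         age,
         bmi)
  sum(coefs * x)
}

low_carb <- predict_wtloss("Low-Carb", age = 50, bmi = 30)
low_fat  <- predict_wtloss("Low-Fat",  age = 50, bmi = 30)

# The gap between the two predictions is the DietLow-Fat coefficient
print(low_fat - low_carb)
```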

Issues with one-hot encoding

  • Too many levels can be a problem
    • Example: ZIP code (about 40,000 codes)
  • Don't hash levels into numbers when using geometric methods (such as linear regression)!
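
To illustrate how the number of columns grows with the number of levels, the sketch below builds a model matrix for a synthetic ZIP-like variable. The codes, the count of 500 levels, and the names zip, zip_data, and mm_zip are all made up for illustration.

```r
set.seed(1)

# Synthetic data: 500 distinct five-digit codes, one row per code
n_levels <- 500
zip <- factor(sprintf("%05d", sample(10000:99999, n_levels)))
zip_data <- data.frame(zip = zip, y = rnorm(n_levels))

# One intercept column plus N - 1 indicator columns
mm_zip <- model.matrix(y ~ zip, data = zip_data)
print(dim(mm_zip))
```

With one row per level, the matrix has as many columns as rows, so a linear model fit to it would have no data left over to estimate error. A variable with about 40,000 levels makes the same problem far larger.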

Let's practice!
