Model Building and Evaluation with tidymodels

Dimensionality Reduction in R

Matt Pickard

Owner, Pickard Predictives, LLC

Model fitting process

The first step of model fitting is splitting the data.

The second step of model fitting is preparing the data.

The third step of model fitting is fitting the model.

The fourth step of model fitting is evaluating the model.

Model fitting with tidymodels

tidymodels has functions to split the data into training and test sets.

tidymodels recipes provide step functions for preprocessing the data.

tidymodels has functions to fit a variety of models within a workflow.

Splitting out train and test sets

split <- initial_split(credit_df, prop = 0.8, strata = credit_score)

train <- split %>% training()
test <- split %>% testing()
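As a quick check on what a stratified split produces, the sketch below repeats the same calls on the built-in iris data. The dataset, the seed, and the use of Species as the stratification column are assumptions made for illustration, since credit_df is not available here.

```r
library(tidymodels)

# Assumption: iris stands in for credit_df, Species for credit_score
set.seed(42)  # makes the random split reproducible
iris_split <- initial_split(iris, prop = 0.8, strata = Species)

iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

# Row counts of the two sets
nrow(iris_train)
nrow(iris_test)

# Class proportions should be similar in both sets because of strata
prop.table(table(iris_train$Species))
prop.table(table(iris_test$Species))
```

Stratifying on the outcome keeps the class balance roughly the same in the training and test sets, which matters most when some classes are rare.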

Creating a recipe and a model

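feature_selection_recipe <-
  recipe(credit_score ~ ., data = train) %>%
  # Drop predictors with more than 50% missing values
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  # Scale numeric predictors to unit standard deviation
  step_scale(all_numeric_predictors()) %>%
  # Drop near-zero variance predictors
  step_nzv(all_predictors()) %>%
  prep()

lr_model <- logistic_reg() %>%
  set_engine("glm")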
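To see what these recipe steps do to the data, here is a minimal sketch on a made-up data frame. The toy_df columns (income, debt, mostly_na) are hypothetical and exist only to trigger the missing-value filter; they are not part of the credit data.

```r
library(tidymodels)

# Hypothetical data: mostly_na is 80% missing, the other columns are complete
set.seed(1)
toy_df <- data.frame(
  y         = factor(rep(c("good", "poor"), each = 10)),
  income    = rnorm(20, mean = 50, sd = 10),
  debt      = runif(20, min = 0, max = 100),
  mostly_na = c(rep(NA_real_, 16), 1:4)
)

toy_recipe <- recipe(y ~ ., data = toy_df) %>%
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  prep()

# Apply the trained recipe to the data it was estimated on
baked <- bake(toy_recipe, new_data = NULL)

names(baked)       # mostly_na has been filtered out
sd(baked$income)   # scaled columns have unit standard deviation
```

Note that step_scale() only divides by the standard deviation; it does not center the columns. step_normalize() does both, if centering is also wanted.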

Create and fit the workflow

credit_wflow <- workflow() %>%
  add_recipe(feature_selection_recipe) %>%
  add_model(lr_model)

credit_fit <-
  credit_wflow %>% fit(data = train)
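The following sketch runs the same workflow pattern end to end on a two-class subset of iris, which is an assumption standing in for the credit data. The recipe is deliberately left unprepped here: fit() estimates the recipe on the training data, and, depending on the installed workflows version, add_recipe() may reject a recipe that has already been trained. Treat that as something to verify against your own setup.

```r
library(tidymodels)

# Assumption: two iris species stand in for a binary credit outcome
two_class <- droplevels(subset(iris, Species != "setosa"))

set.seed(123)
tc_split <- initial_split(two_class, prop = 0.8, strata = Species)
tc_train <- training(tc_split)
tc_test  <- testing(tc_split)

# Unprepped recipe: the workflow trains it during fit()
tc_recipe <- recipe(Species ~ Sepal.Length + Sepal.Width, data = tc_train) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors())

tc_wflow <- workflow() %>%
  add_recipe(tc_recipe) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

tc_fit <- fit(tc_wflow, data = tc_train)

# Pull the trained recipe and the underlying glm object back out
trained_recipe <- extract_recipe(tc_fit)
glm_object     <- extract_fit_engine(tc_fit)
```

extract_recipe() returns the recipe as estimated inside the workflow, so it can be passed to tidy() in the same way as a recipe prepped by hand.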

Evaluate the model

# Predict test data
credit_pred_df <- predict(credit_fit, test) %>%
  bind_cols(test %>% select(credit_score))

# Evaluate F score
f_meas(credit_pred_df, credit_score, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 f_meas  macro          0.519
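The "macro" estimator in the output means the F score is computed for each class separately and then averaged with equal weights. The sketch below checks this by hand on a small set of made-up labels; the toy_pred values are hypothetical and unrelated to the credit results above.

```r
library(tidymodels)

# Hypothetical three-class labels, chosen only to illustrate the calculation
lvls <- c("good", "poor", "standard")
toy_pred <- tibble(
  credit_score = factor(
    c("good", "good", "good", "poor", "poor", "poor",
      "standard", "standard", "standard"),
    levels = lvls
  ),
  .pred_class = factor(
    c("good", "good", "good", "poor", "standard", "standard",
      "standard", "standard", "good"),
    levels = lvls
  )
)

# Per-class F1: harmonic mean of precision and recall for each class
f1_by_class <- sapply(lvls, function(cls) {
  tp        <- sum(toy_pred$.pred_class == cls & toy_pred$credit_score == cls)
  precision <- tp / sum(toy_pred$.pred_class == cls)
  recall    <- tp / sum(toy_pred$credit_score == cls)
  2 * precision * recall / (precision + recall)
})

# Macro average: unweighted mean of the per-class scores
macro_by_hand <- mean(f1_by_class)

# yardstick picks the macro estimator by default for multiclass outcomes
macro_yardstick <- f_meas(toy_pred, credit_score, .pred_class)
```

Because every class counts equally, a class the model predicts poorly pulls the macro score down even if that class is rare.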

Explore the recipe with tidy()

tidy(feature_selection_recipe, number = 1)

# A tibble: 2 × 2
  terms            id                  
  <chr>            <chr>               
1 age              filter_missing_gVVfc
2 outstanding_debt filter_missing_gVVfc

Explore the model with tidy()

# Display model estimates
tidy(credit_fit)

# A tibble: 44 × 5
   term                estimate std.error statistic p.value
   <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)           2.88       0.918    3.13   0.00173
 2 monthAugust          -0.449      0.236   -1.91   0.0565 
 3 monthFebruary        17.7      677.       0.0262 0.979  
 4 monthJanuary         17.7      661.       0.0268 0.979  
 ...                    ...       ...        ...    ...
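Logistic regression estimates are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below shows this on a two-class subset of iris, an assumption standing in for the credit model; the odds_ratio column name is introduced here for illustration.

```r
library(tidymodels)

# Assumption: two iris species stand in for a binary credit outcome
two_class <- droplevels(subset(iris, Species != "setosa"))

demo_fit <- workflow() %>%
  add_recipe(recipe(Species ~ Sepal.Length + Sepal.Width, data = two_class)) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = two_class)

# Coefficients on the log-odds scale, plus their exponentiated form
coef_df <- tidy(demo_fit) %>%
  mutate(odds_ratio = exp(estimate))

coef_df
```

When reading output like the table above, a large estimate paired with a very large standard error (as for monthFebruary and monthJanuary) is often a sign that the term is poorly identified in the training data, so such coefficients deserve caution.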

Let's practice!
