Foundations of Tidy Machine Learning

Machine Learning in the Tidyverse

Dmitriy (Dima) Gorenshteyn

Lead Data Scientist, Memorial Sloan Kettering Cancer Center

The Core of Tidy Machine Learning

List Column Workflow

The Gapminder Dataset

  • dslabs package
  • Observations: 77 countries for 52 years per country (1960-2011)
  • Features:
    • year
    • infant_mortality
    • life_expectancy
    • fertility
    • population
    • gdpPercap

Step 1: Make a List Column - Nest Your Data

Nesting By Country

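library(tidyverse)

# Assumes `gapminder` (the dataset described above) is already loaded
# Group by country, then collapse each country's rows into a `data` list column
nested <- gapminder %>%
  group_by(country) %>%
  nest()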
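
To make the effect of nest() concrete, here is a minimal sketch on a small made-up tibble. The name toy and its values are hypothetical and are not part of the gapminder data. It assumes only dplyr and tidyr, both of which are attached by library(tidyverse).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two countries, three years each (values invented)
toy <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  year = c(1960, 1961, 1962, 1960, 1961, 1962),
  life_expectancy = c(50, 51, 52, 60, 61, 62)
)

# One row per country; all other columns move into the `data` list column
toy_nested <- toy %>%
  group_by(country) %>%
  nest()

nrow(toy_nested)             # one row per country
is.list(toy_nested$data)     # the new column is a list of tibbles
names(toy_nested$data[[1]])  # columns kept inside each nested tibble
```

The grouping column stays at the top level, so each row of the nested tibble pairs a country with its own small data frame.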

Viewing a Nested Tibble

> nested$data[[4]]
# A tibble: 52 x 6
    year infant_mortality life_expectancy fertility population gdpPercap
   <int>            <dbl>           <dbl>     <dbl>      <dbl>     <int>
 1  1960             37.3            68.8      2.70    7065525      7415
 2  1961             35.0            69.7      2.79    7105654      7781
 3  1962             32.9            69.5      2.80    7151077      7937
 4  1963             31.2            69.6      2.82    7199962      8209
 5  1964             29.7            70.1      2.80    7249855      8652
 6  1965             28.3            69.9      2.70    7298794      8893
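
Indexing by position, as in nested$data[[4]], depends on row order. As a hedged alternative, the sketch below looks up a nested tibble by country name instead. The toy data and the names toy_nested and b_data are hypothetical, chosen only for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, nested by country (values invented)
toy_nested <- tibble(
  country = c("A", "A", "B", "B"),
  year = c(1960, 1961, 1960, 1961),
  life_expectancy = c(50, 51, 60, 61)
) %>%
  group_by(country) %>%
  nest()

# By position, as on the slide
first_country <- toy_nested$data[[1]]

# By name: locate the row for country "B", then extract its tibble
b_data <- toy_nested$data[[which(toy_nested$country == "B")]]
b_data
```

Double brackets return the tibble itself, whereas single brackets would return a list of length one containing it.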

Step 3: Simplify List Columns - unnest()

nested %>% 
  unnest(data)

# A tibble: 4,004 x 7
   country  year infant_mortality life_expectancy fertility population   ...
   <fct>   <int>            <dbl>           <dbl>     <dbl>      <dbl>   ...
 1 Algeria  1960              148            47.5      7.65   11124892   ...
 2 Algeria  1961              148            48.0      7.65   11404859   ...
 3 Algeria  1962              148            48.6      7.65   11690152   ...
 4 Algeria  1963              148            49.1      7.65   11985130   ...
 5 Algeria  1964              149            49.6      7.65   12295973   ...
 6 Algeria  1965              149            50.1      7.66   12626953   ...
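
Since unnest() reverses nest(), a round trip should give back the original rows. The following sketch checks this on a small made-up tibble. The names toy and round_trip and the values are hypothetical, and the final ungroup() is an added step so that the result can be compared with the ungrouped input.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (values invented)
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year = c(1960, 1961, 1960, 1961),
  life_expectancy = c(50, 51, 60, 61)
)

# Nest by country, then expand the list column back into regular columns
round_trip <- toy %>%
  group_by(country) %>%
  nest() %>%
  unnest(data) %>%
  ungroup()

# Compare the round-tripped rows with the original
all.equal(as.data.frame(round_trip), as.data.frame(toy))
```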

Let's Get Started!
