Foundations of tidy machine learning

Machine Learning in the tidyverse

Dmitriy (Dima) Gorenshteyn

Lead Data Scientist, Memorial Sloan Kettering Cancer Center

The core of tidy machine learning

List column workflow

The gapminder dataset

  • dslabs package
  • Observations: 77 countries, 52 years per country (1960-2011)
  • Features:
    • year
    • infant_mortality
    • life_expectancy
    • fertility
    • population
    • gdpPercap
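
A quick way to check a panel layout like the one listed above is to count the distinct countries and the rows per country. This is a minimal sketch on a hypothetical stand-in tibble named `toy` (3 countries, 4 years each), not the course data. On the real dataset the same calls would be expected to report 77 countries and 52 rows per country.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in with the same layout: one row per country and year
toy <- tibble(
  country = rep(c("A", "B", "C"), each = 4),
  year = rep(1960:1963, times = 3),
  life_expectancy = seq(50, 61, length.out = 12)
)

# Number of distinct countries in the data
n_countries <- n_distinct(toy$country)

# One row per country; column n holds the number of years observed
years_per_country <- count(toy, country)

# First and last year covered
year_range <- range(toy$year)
```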

Step 1: Make a list column - nest your data

Nesting by country

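library(tidyverse)

# Assumes `gapminder` (the 77-country data described above) is already loaded.
# Group by country, then collapse each country's rows into a `data` list column.
nested <- gapminder %>%
  group_by(country) %>%
  nest()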
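
To see what nesting produces without the full dataset, here is a minimal sketch on a hypothetical toy tibble named `toy` (two countries, three years each). The names `toy` and `toy_nested` are illustrative and not part of the course material.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data, not the course dataset
toy <- tibble(
  country = rep(c("A", "B"), each = 3),
  year = rep(1960:1962, times = 2),
  life_expectancy = c(68.8, 69.7, 69.5, 47.5, 48.0, 48.6)
)

# Same pattern as above: group, then nest the remaining columns
toy_nested <- toy %>%
  group_by(country) %>%
  nest()

nrow(toy_nested)         # 2, one row per country
names(toy_nested)        # "country" "data"
is.list(toy_nested$data) # TRUE, each element is a tibble for one country
```

Each cell of `data` holds every column except the grouping column, so one row can stand for a whole country's time series.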

Viewing a nested tibble

> nested$data[[4]]
# A tibble: 52 x 6
    year infant_mortality life_expectancy fertility population gdpPercap
   <int>            <dbl>           <dbl>     <dbl>      <dbl>     <int>
 1  1960             37.3            68.8      2.70    7065525      7415
 2  1961             35.0            69.7      2.79    7105654      7781
 3  1962             32.9            69.5      2.80    7151077      7937
 4  1963             31.2            69.6      2.82    7199962      8209
 5  1964             29.7            70.1      2.80    7249855      8652
 6  1965             28.3            69.9      2.70    7298794      8893
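
The `[[4]]` above picks the tibble in the fourth row of the list column. As a sketch of the same idea on hypothetical toy data (`toy_nested` is an illustrative name), an element can be pulled out by position or looked up by country name:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data, not the course dataset
toy_nested <- tibble(
  country = rep(c("A", "B"), each = 3),
  year = rep(1960:1962, times = 2),
  life_expectancy = c(68.8, 69.7, 69.5, 47.5, 48.0, 48.6)
) %>%
  group_by(country) %>%
  nest()

# [[ ]] extracts the tibble stored in one cell of the list column
first_tbl <- toy_nested$data[[1]]

# Look a country up by name instead of by row position
b_tbl <- toy_nested$data[[which(toy_nested$country == "B")]]

# The grouping column stays outside the nested tibbles
names(first_tbl) # "year" "life_expectancy"
```

Looking up by name is usually safer than a hard-coded index, since the row order of a nested tibble depends on how the groups were formed.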

Step 3: Simplify list columns - unnest()

nested %>% 
  unnest(data)

# A tibble: 4,004 x 7
   country  year infant_mortality life_expectancy fertility population   ...
   <fct>   <int>            <dbl>           <dbl>     <dbl>      <dbl>   ...
 1 Algeria  1960              148            47.5      7.65   11124892   ...
 2 Algeria  1961              148            48.0      7.65   11404859   ...
 3 Algeria  1962              148            48.6      7.65   11690152   ...
 4 Algeria  1963              148            49.1      7.65   11985130   ...
 5 Algeria  1964              149            49.6      7.65   12295973   ...
 6 Algeria  1965              149            50.1      7.66   12626953   ...
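
The 4,004 rows in this output match 77 countries times 52 years, so unnesting brings back one row per country and year. A minimal sketch of that round trip on hypothetical toy data (`toy`, `toy_nested`, and `toy_flat` are illustrative names):

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data, not the course dataset
toy <- tibble(
  country = rep(c("A", "B"), each = 3),
  year = rep(1960:1962, times = 2),
  life_expectancy = c(68.8, 69.7, 69.5, 47.5, 48.0, 48.6)
)

# Nest per country, then expand the list column back into regular rows
toy_nested <- toy %>%
  group_by(country) %>%
  nest()

toy_flat <- toy_nested %>%
  unnest(data)

nrow(toy_flat)  # 6, the same as nrow(toy)
names(toy_flat) # "country" "year" "life_expectancy"
```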

Let's get started!
