Deriving variables from complex column headers

Rimodellare i dati con tidyr

Jeroen Boeye

Head of Machine Learning, Faktion

Separating column headers into variables

 

two variables in headers

 

two variables in headers fixed

Rimodellare i dati con tidyr

Multiple variable combinations in column headers

who_df
# A tibble: 181 x 5
   country             female_pct.obese male_pct.obese female_life.exp male_life.exp
   <chr>                          <dbl>          <dbl>           <dbl>         <dbl>
 1 Afghanistan                      7.6            3.2            64.5          61  
 2 Albania                         21.8           21.6            78.6          74.3
 3 Algeria                         34.9           19.9            77.4          75.4
 4 Angola                          12.1            4              64.9          60.3
 5 Antigua and Barbuda             25.9           11.6            77.5          72.5
 6 Argentina                       29             27.3            80.3          73.5
 7 Armenia                         23             17.1            78.1          71.2
 8 Australia                       28.4           29.6            84.8          81
Rimodellare i dati con tidyr

Multiple variable combinations in column headers

 

mixed variable headers

 

mixed variable headers fixed

Rimodellare i dati con tidyr

The special .value name

who_df %>% 
  # Example input column name = male_obesity.pct
  pivot_longer(-country,
               names_to = c("sex", ".value"),
               names_sep = "_")
# A tibble: 362 x 4
   country             sex    pct.obese life.exp
   <chr>               <chr>      <dbl>    <dbl>
 1 Afghanistan         female       7.6     64.5
 2 Afghanistan         male         3.2     61  
 3 Albania             female      21.8     78.6
Rimodellare i dati con tidyr

pivot_longer() recap

Variables as headers

two variables in headers

mixed variable headers

Variables as headers tidy

two variables in headers fixed

mixed variable headers fixed

Rimodellare i dati con tidyr

Uncounting data

nuke_df
# A tibble: 8 x 2
  country          n_bombs
  <chr>              <int>
1 Pakistan               2
2 India                  6
3 North Korea            6
4 United Kingdom        21
5 China                 45
6 France               200
7 Russian Federation   726
8 United States       1150
Rimodellare i dati con tidyr

The uncount() function

nuke_df %>% 
  uncount(n_bombs)
# A tibble: 2,156 x 1
   country    
   <chr>      
 1 Pakistan   
 2 Pakistan   
 3 India      
 4 India      
 5 India      
 6 India      
# ... with 2,150 more rows
Rimodellare i dati con tidyr

The uncount() function

nuke_df %>% 
  uncount(2)
# A tibble: 16 x 2
   country            n_bombs
   <chr>                <int>
 1 Pakistan                 2
 2 Pakistan                 2
 3 India                    6
 4 India                    6
 5 North Korea              6
 6 North Korea              6
# ... with 10 more rows
Rimodellare i dati con tidyr

The uncount() function

nuke_df %>% 
  uncount(n_bombs, .id = "bomb_id")
# A tibble: 2,156 x 2
   country     bomb_id
   <chr>         <int>
 1 Pakistan          1
 2 Pakistan          2
 3 India             1
 4 India             2
 5 India             3
 6 India             4
# ... with 2,150 more rows
Rimodellare i dati con tidyr

Let's practice!

Rimodellare i dati con tidyr

Preparing Video For Download...