Deriving variables from complex column headers

Reshaping Data with tidyr

Jeroen Boeye

Head of Machine Learning, Faktion

Separating column headers into variables

 

two variables in headers

 

two variables in headers fixed

Reshaping Data with tidyr

Multiple variable combinations in column headers

who_df
# A tibble: 181 x 5
   country             female_pct.obese male_pct.obese female_life.exp male_life.exp
   <chr>                          <dbl>          <dbl>           <dbl>         <dbl>
 1 Afghanistan                      7.6            3.2            64.5          61  
 2 Albania                         21.8           21.6            78.6          74.3
 3 Algeria                         34.9           19.9            77.4          75.4
 4 Angola                          12.1            4              64.9          60.3
 5 Antigua and Barbuda             25.9           11.6            77.5          72.5
 6 Argentina                       29             27.3            80.3          73.5
 7 Armenia                         23             17.1            78.1          71.2
 8 Australia                       28.4           29.6            84.8          81
Reshaping Data with tidyr

Multiple variable combinations in column headers

 

mixed variable headers

 

mixed variable headers fixed

Reshaping Data with tidyr

The special .value name

who_df %>% 
  # Example input column name = male_obesity.pct
  pivot_longer(-country,
               names_to = c("sex", ".value"),
               names_sep = "_")
# A tibble: 362 x 4
   country             sex    pct.obese life.exp
   <chr>               <chr>      <dbl>    <dbl>
 1 Afghanistan         female       7.6     64.5
 2 Afghanistan         male         3.2     61  
 3 Albania             female      21.8     78.6
Reshaping Data with tidyr

pivot_longer() recap

Variables as headers

two variables in headers

mixed variable headers

Variables as headers tidy

two variables in headers fixed

mixed variable headers fixed

Reshaping Data with tidyr

Uncounting data

nuke_df
# A tibble: 8 x 2
  country          n_bombs
  <chr>              <int>
1 Pakistan               2
2 India                  6
3 North Korea            6
4 United Kingdom        21
5 China                 45
6 France               200
7 Russian Federation   726
8 United States       1150
Reshaping Data with tidyr

The uncount() function

nuke_df %>% 
  uncount(n_bombs)
# A tibble: 2,156 x 1
   country    
   <chr>      
 1 Pakistan   
 2 Pakistan   
 3 India      
 4 India      
 5 India      
 6 India      
# ... with 2,150 more rows
Reshaping Data with tidyr

The uncount() function

nuke_df %>% 
  uncount(2)
# A tibble: 16 x 2
   country            n_bombs
   <chr>                <int>
 1 Pakistan                 2
 2 Pakistan                 2
 3 India                    6
 4 India                    6
 5 North Korea              6
 6 North Korea              6
# ... with 10 more rows
Reshaping Data with tidyr

The uncount() function

nuke_df %>% 
  uncount(n_bombs, .id = "bomb_id")
# A tibble: 2,156 x 2
   country     bomb_id
   <chr>         <int>
 1 Pakistan          1
 2 Pakistan          2
 3 India             1
 4 India             2
 5 India             3
 6 India             4
# ... with 2,150 more rows
Reshaping Data with tidyr

Let's practice!

Reshaping Data with tidyr

Preparing Video For Download...