Deriving variables from complex column headers

Reshaping Data with tidyr

Jeroen Boeye

Head of Machine Learning, Faktion

Separating column headers into variables

two variables in headers

two variables in headers fixed

Multiple variable combinations in column headers

who_df

# A tibble: 181 x 5
   country             female_pct.obese male_pct.obese female_life.exp male_life.exp
   <chr>                          <dbl>          <dbl>           <dbl>         <dbl>
 1 Afghanistan                      7.6            3.2            64.5          61  
 2 Albania                         21.8           21.6            78.6          74.3
 3 Algeria                         34.9           19.9            77.4          75.4
 4 Angola                          12.1            4              64.9          60.3
 5 Antigua and Barbuda             25.9           11.6            77.5          72.5
 6 Argentina                       29             27.3            80.3          73.5
 7 Armenia                         23             17.1            78.1          71.2
 8 Australia                       28.4           29.6            84.8          81

Multiple variable combinations in column headers

mixed variable headers

mixed variable headers fixed

The special .value name

who_df %>% 
  # Example input column name = male_obesity.pct
  pivot_longer(-country,
               names_to = c("sex", ".value"),
               names_sep = "_")

# A tibble: 362 x 4
   country             sex    pct.obese life.exp
   <chr>               <chr>      <dbl>    <dbl>
 1 Afghanistan         female       7.6     64.5
 2 Afghanistan         male         3.2     61  
 3 Albania             female      21.8     78.6

pivot_longer() recap

Variables as headers

two variables in headers

mixed variable headers

Variables as headers tidy

two variables in headers fixed

mixed variable headers fixed

Uncounting data

nuke_df

# A tibble: 8 x 2
  country          n_bombs
  <chr>              <int>
1 Pakistan               2
2 India                  6
3 North Korea            6
4 United Kingdom        21
5 China                 45
6 France               200
7 Russian Federation   726
8 United States       1150

The uncount() function

nuke_df %>% 
  uncount(n_bombs)

# A tibble: 2,156 x 1
   country    
   <chr>      
 1 Pakistan   
 2 Pakistan   
 3 India      
 4 India      
 5 India      
 6 India      
# ... with 2,150 more rows

The uncount() function

nuke_df %>% 
  uncount(2)

# A tibble: 16 x 2
   country            n_bombs
   <chr>                <int>
 1 Pakistan                 2
 2 Pakistan                 2
 3 India                    6
 4 India                    6
 5 North Korea              6
 6 North Korea              6
# ... with 10 more rows

The uncount() function

nuke_df %>% 
  uncount(n_bombs, .id = "bomb_id")

# A tibble: 2,156 x 2
   country     bomb_id
   <chr>         <int>
 1 Pakistan          1
 2 Pakistan          2
 3 India             1
 4 India             2
 5 India             3
 6 India             4
# ... with 2,150 more rows

Let's practice!

Reshaping Data with tidyr