Columns with multiple values

Reshaping Data with tidyr

Jeroen Boeye

Head of Machine Learning, Faktion

Two variables in a single column

netflix_df
# A tibble: 637 x 3
   title                  type    duration
   <chr>                  <chr>   <chr>   
 1 Article 15             Movie   125 min 
 2 Kill Me If You Dare    Movie   100 min 
 3 The Spy                TV Show 1 Seasons
 4 The World We Make      Movie   108 min 
 5 Watchman               Movie   93 min
Reshaping Data with tidyr

Converting separated columns' data types

netflix_df %>% 
  separate(duration, into = c("value", "unit"), convert = TRUE)
# A tibble: 5 x 4
  title               type    value unit  
  <chr>               <chr>   <int> <chr> 
1 Article 15          Movie     125 min   
2 Kill Me If You Dare Movie     100 min   
3 The Spy             TV Show     1 Seasons
4 The World We Make   Movie     108 min   
5 Watchman            Movie      93 min
Reshaping Data with tidyr

dplyr aggregation recap

netflix_df %>% 
  separate(duration, into = c("value", "unit"), convert = TRUE) %>%
  group_by(type, unit) %>% 
  summarize(mean_duration = mean(value))
# A tibble: 2 x 3
# Groups:   type [2]
  type    unit    mean_duration
  <chr>   <chr>           <dbl>
1 Movie   min             98.6 
2 TV Show Seasons          1.85
Reshaping Data with tidyr

Separating variables over columns

Column with multiple variables

  One column per variable

Reshaping Data with tidyr

Combining multiple columns into one

star_wars_df
# A tibble: 4 x 2
  given_name family_name
  <chr>      <chr>      
1 Luke       Skywalker  
2 Han        Solo       
3 Leia       Organa     
4 R2         D2
Reshaping Data with tidyr

Combining multiple columns into one

star_wars_df %>%
  unite("name", given_name, family_name)
# A tibble: 4 x 1
  name          
  <chr>         
1 Luke_Skywalker
2 Han_Solo      
3 Leia_Organa   
4 R2_D2
Reshaping Data with tidyr

Combining multiple columns into one

star_wars_df %>%
  unite("name", given_name, family_name, sep = " ")
# A tibble: 4 x 1
  name          
  <chr>         
1 Luke Skywalker
2 Han Solo      
3 Leia Organa   
4 R2 D2
Reshaping Data with tidyr

Multiple values in a single cell

drink_df
# A tibble: 2 x 2
  drink          ingredients           
  <chr>          <chr>                 
1 Chocolate milk milk, chocolate, sugar
2 Orange juice   oranges, sugar
Reshaping Data with tidyr

Multiple values in a single cell

Netflix data

Column with multiple variables

Drinks data

Cell with multiple values

Reshaping Data with tidyr

Multiple values in a single cell

Netflix data

Column with multiple variables

Drinks data

Cell with multiple values

Values to variables

One column per variable

Reshaping Data with tidyr

Multiple values in a single cell

Netflix data

Column with multiple variables

Drinks data

Cell with multiple values

Values to variables

One column per variable

Values to observations

One value per cell

Reshaping Data with tidyr

Separating values over rows

drink_df %>% 
  separate_rows(ingredients, sep = ", ")
# A tibble: 5 x 2
  drink          ingredients
  <chr>          <chr>      
1 Chocolate milk milk       
2 Chocolate milk chocolate  
3 Chocolate milk sugar      
4 Orange juice   oranges    
5 Orange juice   sugar
Reshaping Data with tidyr

Counting ingredients

drink_df %>% 
  separate_rows(ingredients, sep = ", ") %>% 
  count(drink)
# A tibble: 2 x 2
  drink              n
  <chr>          <int>
1 Chocolate milk     3
2 Orange juice       2
drink_df %>% 
  separate_rows(ingredients, sep = ", ") %>% 
  count(ingredients)
# A tibble: 4 x 2
  ingredients     n
  <chr>       <int>
1 chocolate       1
2 milk            1
3 oranges         1
4 sugar           2
Reshaping Data with tidyr

Visualizing ingredients

drink_df %>% 
  separate_rows(ingredients, sep = ", ") %>% 
  ggplot(aes(x=drink, fill=ingredients)) +
  geom_bar()

Ingredient barchart

Reshaping Data with tidyr

Let's practice!

Reshaping Data with tidyr

Preparing Video For Download...