What is tidy data?

Reshaping Data with tidyr

Jeroen Boeye

Head of Machine Learning, Faktion

 

 

Happy families are all alike, but every unhappy family is unhappy in its own way.


Leo Tolstoy

 

Tidy datasets are all alike, but every messy dataset is messy in its own way.


Hadley Wickham


Rectangular data

 

Structure

  • Columns
  • Rows
  • Cells

 

Tidy sample


Tidy data, variables

 

Structure

  • Columns hold variables
  • Rows
  • Cells

 

Tidy sample variables


Tidy data, observations

 

Structure

  • Columns hold variables
  • Rows hold observations
  • Cells

 

Tidy sample observations


Tidy data, values

 

Structure

  • Columns hold variables
  • Rows hold observations
  • Cells hold values

 

Tidy sample values
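
The three rules above can be checked on a small tibble. The following is a minimal sketch, not the sample shown on the slide: the object name tidy_sample is hypothetical, and the values are copied from the rounded printout in the population example later in this deck.

```r
library(tibble)

# Hypothetical tidy sample; values taken from the rounded printout
# of the population example, so they are approximations.
tidy_sample <- tribble(
  ~country, ~continent,      ~population,
  "Brazil", "South America", 210,
  "Nepal",  "Asia",          28.1
)

names(tidy_sample)         # the variables: one per column
tidy_sample[2, ]           # one observation: a single row
tidy_sample$population[2]  # one value: a single cell
```

Each column has a single type because it holds a single variable, which is what lets the dplyr verbs in the recap operate on whole columns.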


dplyr recap

character_df
# A tibble: 4 x 3
  name           homeworld species
  <chr>          <chr>     <chr>  
1 Luke Skywalker Tatooine  Human  
2 R2-D2          Naboo     Droid  
3 Darth Vader    Tatooine  Human  
4 Obi-Wan Kenobi Stewjon   Human
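
To follow along with the recap, character_df has to exist first. The slides do not show how it was created, so this is one possible way to rebuild it from the printout above, assuming the tibble package is installed.

```r
library(tibble)

# Rebuild character_df row by row from the printed tibble.
# tribble() is just one convenient constructor; the course may have
# loaded the data differently.
character_df <- tribble(
  ~name,            ~homeworld, ~species,
  "Luke Skywalker", "Tatooine", "Human",
  "R2-D2",          "Naboo",    "Droid",
  "Darth Vader",    "Tatooine", "Human",
  "Obi-Wan Kenobi", "Stewjon",  "Human"
)

character_df
```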

dplyr recap: select()

character_df %>% 
  select(name, homeworld)
# A tibble: 4 x 2
  name           homeworld
  <chr>          <chr>    
1 Luke Skywalker Tatooine 
2 R2-D2          Naboo    
3 Darth Vader    Tatooine 
4 Obi-Wan Kenobi Stewjon

dplyr recap: filter()

character_df %>% 
  filter(homeworld == "Tatooine")
# A tibble: 2 x 3
  name           homeworld species
  <chr>          <chr>     <chr>  
1 Luke Skywalker Tatooine  Human  
2 Darth Vader    Tatooine  Human

dplyr recap: mutate()

character_df %>% 
  mutate(is_human = species == "Human")
# A tibble: 4 x 4
  name           homeworld species is_human
  <chr>          <chr>     <chr>   <lgl>   
1 Luke Skywalker Tatooine  Human   TRUE    
2 R2-D2          Naboo     Droid   FALSE   
3 Darth Vader    Tatooine  Human   TRUE    
4 Obi-Wan Kenobi Stewjon   Human   TRUE

dplyr recap: group_by() and summarize()

character_df %>% 
  group_by(homeworld) %>% 
  summarize(n = n())
# A tibble: 3 x 2
  homeworld     n
  <chr>     <int>
1 Naboo         1
2 Stewjon       1
3 Tatooine      2
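
Counting rows per group is common enough that dplyr also offers count() as a shorthand. The slides do not use it, so treat this as an optional aside: a small sketch that compares both forms on a hypothetical one-column tibble (homeworlds) holding the same homeworld values as character_df.

```r
library(dplyr)

# Hypothetical helper tibble: only the homeworld column is needed here.
homeworlds <- tibble(
  homeworld = c("Tatooine", "Naboo", "Tatooine", "Stewjon")
)

# The form used on the slide
by_summarize <- homeworlds %>%
  group_by(homeworld) %>%
  summarize(n = n())

# The shorthand: group, count rows, and ungroup in one step
by_count <- homeworlds %>%
  count(homeworld)

# Compare the two results
all.equal(by_summarize, by_count)
```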

magrittr logo

1 magrittr.tidyverse.org

 

dplyr logo

 

tidyr logo

1 www.tidyverse.org

Multiple variables in a single column

population_df
# A tibble: 4 x 2
  country                 population
  <chr>                        <dbl>
1 Brazil, South America        210. 
2 Nepal, Asia                   28.1
3 Senegal, Africa               15.8
4 Australia, Oceania            25.0

Separating variables over two columns

population_df %>% 
  separate(country, into = c("country", "continent"), sep = ", ")
# A tibble: 4 x 3
  country   continent      population
  <chr>     <chr>               <dbl>
1 Brazil    South America       210. 
2 Nepal     Asia                 28.1
3 Senegal   Africa               15.8
4 Australia Oceania              25.0
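
A self-contained version of this step, for anyone who wants to run it: population_df is rebuilt here from the rounded printout, so the population numbers are approximations, and the name tidy_population is hypothetical. The final filter() is added only to illustrate why the split is useful.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Rebuild the messy input: country and continent share one column.
population_df <- tribble(
  ~country,                ~population,
  "Brazil, South America", 210,
  "Nepal, Asia",           28.1,
  "Senegal, Africa",       15.8,
  "Australia, Oceania",    25.0
)

# Split on comma-plus-space so no leading blank ends up in continent.
tidy_population <- population_df %>%
  separate(country, into = c("country", "continent"), sep = ", ")

# With continent in its own column, dplyr verbs can use it directly.
tidy_population %>%
  filter(continent == "Asia")
```

Using sep = ", " rather than "," matters: splitting on the comma alone would leave a leading space in every continent value.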

Let's practice!
