What is tidy data?

Reshaping Data with tidyr

Jeroen Boeye

Head of Machine Learning, Faktion

 

 

Happy families are all alike, but every unhappy family is unhappy in its own way.


Leo Tolstoy

 

Tidy datasets are all alike, but every messy dataset is messy in its own way.


Hadley Wickham


Rectangular data

 

Structure

  • Columns
  • Rows
  • Cells

 

Tidy sample


Tidy data, variables

 

Structure

  • Columns hold variables
  • Rows
  • Cells

 

Tidy sample variables


Tidy data, observations

 

Structure

  • Columns hold variables
  • Rows hold observations
  • Cells

 

Tidy sample observations


Tidy data, values

 

Structure

  • Columns hold variables
  • Rows hold observations
  • Cells hold values

 

Tidy sample values
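
The three rules above can be checked on a small table. The sketch below uses a hypothetical `tidy_df` with made-up values (neither the name nor the numbers come from the slides) and only assumes the tibble package is installed.

```r
library(tibble)

# Hypothetical illustration data (values invented for this sketch).
tidy_df <- tibble(
  plant     = c("A", "B", "C"),    # variable 1
  treatment = c("x", "y", "x"),    # variable 2
  height_cm = c(10.5, 12.0, 9.8)   # variable 3
)

variables   <- names(tidy_df)        # columns hold variables
observation <- tidy_df[2, ]          # a row holds one observation
value       <- tidy_df$height_cm[2]  # a cell holds a single value
```

Each accessor maps onto one rule: column names list the variables, a row index returns one observation, and a column plus a row index returns exactly one value.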


dplyr recap

character_df
# A tibble: 4 x 3
  name           homeworld species
  <chr>          <chr>     <chr>  
1 Luke Skywalker Tatooine  Human  
2 R2-D2          Naboo     Droid  
3 Darth Vader    Tatooine  Human  
4 Obi-Wan Kenobi Stewjon   Human
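
The slides print `character_df` but do not show how it was created. To follow along, one possible way to rebuild it is sketched below, assuming the tibble package; the values are copied from the printed output above.

```r
library(tibble)

# Rebuild the table shown on the slide, one vector per column.
character_df <- tibble(
  name      = c("Luke Skywalker", "R2-D2", "Darth Vader", "Obi-Wan Kenobi"),
  homeworld = c("Tatooine", "Naboo", "Tatooine", "Stewjon"),
  species   = c("Human", "Droid", "Human", "Human")
)

character_df
```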

dplyr recap: select()

character_df %>% 
  select(name, homeworld)
# A tibble: 4 x 2
  name           homeworld
  <chr>          <chr>    
1 Luke Skywalker Tatooine 
2 R2-D2          Naboo    
3 Darth Vader    Tatooine 
4 Obi-Wan Kenobi Stewjon

dplyr recap: filter()

character_df %>% 
  filter(homeworld == "Tatooine")
# A tibble: 2 x 3
  name           homeworld species
  <chr>          <chr>     <chr>  
1 Luke Skywalker Tatooine  Human  
2 Darth Vader    Tatooine  Human

dplyr recap: mutate()

character_df %>% 
  mutate(is_human = species == "Human")
# A tibble: 4 x 4
  name           homeworld species is_human
  <chr>          <chr>     <chr>   <lgl>   
1 Luke Skywalker Tatooine  Human   TRUE    
2 R2-D2          Naboo     Droid   FALSE   
3 Darth Vader    Tatooine  Human   TRUE    
4 Obi-Wan Kenobi Stewjon   Human   TRUE

dplyr recap: group_by() and summarize()

character_df %>% 
  group_by(homeworld) %>% 
  summarize(n = n())
# A tibble: 3 x 2
  homeworld     n
  <chr>     <int>
1 Naboo         1
2 Stewjon       1
3 Tatooine      2
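
As a side note not covered on the slide, dplyr also offers `count()`, which should give the same per-group tally as the `group_by()` plus `summarize(n = n())` pattern. The sketch below rebuilds `character_df` from the printed values so that it runs on its own.

```r
library(dplyr)

# Same data as printed on the earlier slide.
character_df <- tibble(
  name      = c("Luke Skywalker", "R2-D2", "Darth Vader", "Obi-Wan Kenobi"),
  homeworld = c("Tatooine", "Naboo", "Tatooine", "Stewjon"),
  species   = c("Human", "Droid", "Human", "Human")
)

# Pattern from the slide: group, then count the rows in each group.
by_group <- character_df %>%
  group_by(homeworld) %>%
  summarize(n = n())

# Shorthand: count() groups and tallies in one call.
by_count <- character_df %>%
  count(homeworld)
```

The explicit pattern generalizes to other summaries such as means or maxima, while `count()` is convenient when only the group sizes are needed.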

magrittr logo

1 magrittr.tidyverse.org

 

dplyr logo

 

tidyr logo

1 www.tidyverse.org

Multiple variables in a single column

population_df
# A tibble: 4 x 2
  country                 population
  <chr>                        <dbl>
1 Brazil, South America        210. 
2 Nepal, Asia                   28.1
3 Senegal, Africa               15.8
4 Australia, Oceania            25.0

Separating variables over two columns

population_df %>% 
  separate(country, into = c("country", "continent"), sep = ", ")
# A tibble: 4 x 3
  country   continent      population
  <chr>     <chr>               <dbl>
1 Brazil    South America       210. 
2 Nepal     Asia                 28.1
3 Senegal   Africa               15.8
4 Australia Oceania              25.0
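
For a version that runs end to end, the sketch below rebuilds `population_df` and applies the same `separate()` call. The population values are the rounded figures as printed on the slide, so the underlying data may differ slightly. It also shows `separate_wider_delim()`, which is not used on the slides; this assumes tidyr 1.3.0 or later, where `separate()` is marked as superseded.

```r
library(dplyr)
library(tidyr)

# Values as printed on the slide (rounded); the original data may be more precise.
population_df <- tibble(
  country    = c("Brazil, South America", "Nepal, Asia",
                 "Senegal, Africa", "Australia, Oceania"),
  population = c(210, 28.1, 15.8, 25.0)
)

# Call from the slide: split on ", " into two new columns.
separated <- population_df %>%
  separate(country, into = c("country", "continent"), sep = ", ")

# Alternative, assuming tidyr >= 1.3.0 is installed.
separated_wider <- population_df %>%
  separate_wider_delim(country, delim = ", ",
                       names = c("country", "continent"))
```

Including the space in the separator (`", "` rather than `","`) keeps the continent names free of leading whitespace.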

Let's practice!
