Tidy data

Case Study: Exploratory Data Analysis in R

Dave Robinson

Chief Data Scientist, DataCamp

United Kingdom

Tidy data: topic is a variable

Topic is spread across six columns

  • Each topic has its own column, so combine the six columns into a single variable: topic
votes_joined %>%
  select(rcid, session, vote, country, me:ec)
# A tibble: 353,547 × 10
    rcid session  vote            country    me    nu    di    hr    co    ec
   <dbl>   <dbl> <dbl>              <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     46       2     1      United States     0     0     0     0     0     0
2     46       2     1             Canada     0     0     0     0     0     0
3     46       2     1               Cuba     0     0     0     0     0     0
4     46       2     1              Haiti     0     0     0     0     0     0
5     46       2     1 Dominican Republic     0     0     0     0     0     0
6     46       2     1             Mexico     0     0     0     0     0     0
7     46       2     1          Guatemala     0     0     0     0     0     0
8     46       2     1           Honduras     0     0     0     0     0     0
9     46       2     1        El Salvador     0     0     0     0     0     0
10    46       2     1          Nicaragua     0     0     0     0     0     0
# ... with 353,537 more rows

Use gather() to bring columns into two variables

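library(tidyr)
# Gather the six topic columns (me through ec) into a key column (topic)
# and a value column (has_topic)
votes_joined %>%
  gather(topic, has_topic, me:ec)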
# A tibble: 2,121,282 × 10
    rcid session  vote ccode  year            country       date   unres topic has_topic
   <dbl>   <dbl> <dbl> <int> <dbl>              <chr>     <dttm>   <chr> <chr>     <dbl>
1     46       2     1     2  1947      United States 1947-09-04 R/2/299    me         0
2     46       2     1    20  1947             Canada 1947-09-04 R/2/299    me         0
3     46       2     1    40  1947               Cuba 1947-09-04 R/2/299    me         0
4     46       2     1    41  1947              Haiti 1947-09-04 R/2/299    me         0
5     46       2     1    42  1947 Dominican Republic 1947-09-04 R/2/299    me         0
6     46       2     1    70  1947             Mexico 1947-09-04 R/2/299    me         0
7     46       2     1    90  1947          Guatemala 1947-09-04 R/2/299    me         0
8     46       2     1    91  1947           Honduras 1947-09-04 R/2/299    me         0
9     46       2     1    92  1947        El Salvador 1947-09-04 R/2/299    me         0
10    46       2     1    93  1947          Nicaragua 1947-09-04 R/2/299    me         0
# ... with 2,121,272 more rows
  • “topic” is now a variable
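
A minimal sketch of what gather() does to the shape of the data, using a small invented table in place of votes_joined. The toy_votes and toy_long names and all values in them are assumptions made for illustration; only the column names mirror the slides.

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical toy stand-in for votes_joined (values invented for illustration):
# one row per country-vote, one 0/1 indicator column per topic
toy_votes <- tribble(
  ~rcid, ~country,        ~vote, ~me, ~nu, ~di, ~hr, ~co, ~ec,
  1,     "United States", 1,     1,   0,   0,   0,   0,   0,
  1,     "Canada",        1,     1,   0,   0,   0,   0,   0,
  2,     "United States", 3,     0,   1,   1,   0,   0,   0,
  2,     "Canada",        1,     0,   1,   1,   0,   0,   0,
  3,     "United States", 1,     0,   0,   0,   0,   0,   0
)

# Stack the six topic columns into a key column (topic)
# and a value column (has_topic)
toy_long <- toy_votes %>%
  gather(topic, has_topic, me:ec)

nrow(toy_votes)  # 5 rows in the wide table
nrow(toy_long)   # 5 * 6 = 30 rows: every original row repeated once per topic
```

The same arithmetic accounts for the row counts on the slides: 353,547 rows times six topic columns gives 2,121,282 rows after gathering.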

Use gather() to bring columns into two variables

library(dplyr)
library(tidyr)
# Keep only the rows where the vote actually concerns the topic
votes_joined %>%
  gather(topic, has_topic, me:ec) %>%
  filter(has_topic == 1)
# A tibble: 350,032 × 10
    rcid session  vote ccode  year            country       date    unres topic has_topic
   <dbl>   <dbl> <dbl> <int> <dbl>              <chr>     <dttm>    <chr> <chr>     <dbl>
1     77       2     1     2  1947      United States 1947-11-06 R/2/1424    me         1
2     77       2     1    20  1947             Canada 1947-11-06 R/2/1424    me         1
3     77       2     3    40  1947               Cuba 1947-11-06 R/2/1424    me         1
4     77       2     1    41  1947              Haiti 1947-11-06 R/2/1424    me         1
5     77       2     1    42  1947 Dominican Republic 1947-11-06 R/2/1424    me         1
6     77       2     2    70  1947             Mexico 1947-11-06 R/2/1424    me         1
7     77       2     1    90  1947          Guatemala 1947-11-06 R/2/1424    me         1
8     77       2     2    91  1947           Honduras 1947-11-06 R/2/1424    me         1
9     77       2     2    92  1947        El Salvador 1947-11-06 R/2/1424    me         1
10    77       2     1    93  1947          Nicaragua 1947-11-06 R/2/1424    me         1
# ... with 350,022 more rows
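
A small sketch of how the filter step changes the rows, again on an invented table (toy_votes and its values are assumptions, not course data). It also shows pivot_longer(), which the tidyr documentation recommends over the superseded gather(); the slides themselves use gather().

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical toy stand-in for votes_joined (values invented for illustration)
toy_votes <- tribble(
  ~rcid, ~country,        ~vote, ~me, ~nu, ~di, ~hr, ~co, ~ec,
  1,     "United States", 1,     1,   0,   0,   0,   0,   0,
  1,     "Canada",        1,     1,   0,   0,   0,   0,   0,
  2,     "United States", 3,     0,   1,   1,   0,   0,   0,
  2,     "Canada",        1,     0,   1,   1,   0,   0,   0,
  3,     "United States", 1,     0,   0,   0,   0,   0,   0
)

# Approach from the slides: gather, then keep rows where the topic applies
with_topic_gather <- toy_votes %>%
  gather(topic, has_topic, me:ec) %>%
  filter(has_topic == 1)

# Same reshaping with pivot_longer(); row order differs from gather(),
# but the set of rows is the same
with_topic_pivot <- toy_votes %>%
  pivot_longer(cols = me:ec, names_to = "topic", values_to = "has_topic") %>%
  filter(has_topic == 1)

# rcid 3 has no topic, so it drops out entirely;
# rcid 2 has two topics (nu and di), so each of its rows appears twice
with_topic_pivot %>% count(rcid)
```

Because a vote can carry zero, one, or several topics, the filtered table need not have the same number of rows as the original, which is consistent with the slides going from 353,547 rows to 350,032.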

Let's practice!
