Stack Overflow questions

Joining Data with dplyr

Chris Cardillo

Data Scientist

The joining verbs

join-review

Joining Data with dplyr

stack-overflow-question

Joining Data with dplyr

The questions table

questions
# A tibble: 294,735 x 3
         id creation_date score
      <int> <date>        <int>
 1 22557677 2014-03-21        1
 2 22557707 2014-03-21        2
 3 22558084 2014-03-21        2
 4 22558395 2014-03-21        2
 5 22558613 2014-03-21        0
 6 22558677 2014-03-21        2
 7 22558887 2014-03-21        8
 8 22559180 2014-03-21        1
 9 22559312 2014-03-21        0
10 22559322 2014-03-21        2
# … with 294,725 more rows
Joining Data with dplyr

The question_tags and tags tables

question_tags
# A tibble: 497,153 x 2
   question_id tag_id
         <int>  <int>
 1    22557677     18
 2    22557677    139
 3    22557677  16088
 4    22557677   1672
 5    22558084   6419
 6    22558084  92764
 7    22558395   5569
 8    22558395    134
 9    22558395   9412
10    22558395  18621
# … with 497,143 more rows
tags
# A tibble: 48,299 x 2
       id tag_name                 
    <dbl> <chr>                    
 1 124399 laravel-dusk             
 2 124402 spring-cloud-vault-config
 3 124404 spring-vault             
 4 124405 apache-bahir             
 5 124407 astc                     
 6 124408 simulacrum               
 7 124410 angulartics2             
 8 124411 django-rest-viewsets     
 9 124414 react-native-lightbox    
10 124417 java-module              
# … with 48,289 more rows
Joining Data with dplyr

Joining question_tags with questions

questions %>%
  inner_join(question_tags, by = c("id" = "question_id"))
Joining Data with dplyr

Joining tags

questions_with_tags <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))

questions_with_tags
# A tibble: 497,153 x 5
         id creation_date score tag_id tag_name           
      <int> <date>        <int>  <dbl> <chr>              
 1 22557677 2014-03-21        1     18 regex              
 2 22557677 2014-03-21        1    139 string             
 3 22557677 2014-03-21        1  16088 time-complexity    
 4 22557677 2014-03-21        1   1672 backreference      
 5 22558084 2014-03-21        2   6419 time-series        
 6 22558084 2014-03-21        2  92764 panel-data         
 7 22558395 2014-03-21        2   5569 function           
 8 22558395 2014-03-21        2    134 sorting            
 9 22558395 2014-03-21        2   9412 vectorization      
10 22558395 2014-03-21        2  18621 operator-precedence
# … with 497,143 more rows
Joining Data with dplyr

Most common tags

questions_with_tags %>%
  count(tag_name, sort = TRUE)
# A tibble: 7,840 x 2
   tag_name       n
   <chr>      <int>
 1 ggplot2    28228
 2 dataframe  18874
 3 shiny      14219
 4 dplyr      14039
 5 plot       11315
 6 data.table  8809
 7 matrix      6205
 8 loops       5149
 9 regex       4912
10 function    4892
# … with 7,830 more rows
Joining Data with dplyr

Let's practice!

Joining Data with dplyr

Preparing Video For Download...