Joining Data with dplyr
Chris Cardillo
Data Scientist
questions
# A tibble: 294,735 x 3
id creation_date score
<int> <date> <int>
1 22557677 2014-03-21 1
2 22557707 2014-03-21 2
3 22558084 2014-03-21 2
4 22558395 2014-03-21 2
5 22558613 2014-03-21 0
6 22558677 2014-03-21 2
7 22558887 2014-03-21 8
8 22559180 2014-03-21 1
9 22559312 2014-03-21 0
10 22559322 2014-03-21 2
# … with 294,725 more rows
question_tags
# A tibble: 497,153 x 2
question_id tag_id
<int> <int>
1 22557677 18
2 22557677 139
3 22557677 16088
4 22557677 1672
5 22558084 6419
6 22558084 92764
7 22558395 5569
8 22558395 134
9 22558395 9412
10 22558395 18621
# … with 497,143 more rows
tags
# A tibble: 48,299 x 2
id tag_name
<dbl> <chr>
1 124399 laravel-dusk
2 124402 spring-cloud-vault-config
3 124404 spring-vault
4 124405 apache-bahir
5 124407 astc
6 124408 simulacrum
7 124410 angulartics2
8 124411 django-rest-viewsets
9 124414 react-native-lightbox
10 124417 java-module
# … with 48,289 more rows
questions %>%
inner_join(question_tags, by = c("id" = "question_id"))
questions_with_tags <- questions %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))
questions_with_tags
# A tibble: 497,153 x 5
id creation_date score tag_id tag_name
<int> <date> <int> <dbl> <chr>
1 22557677 2014-03-21 1 18 regex
2 22557677 2014-03-21 1 139 string
3 22557677 2014-03-21 1 16088 time-complexity
4 22557677 2014-03-21 1 1672 backreference
5 22558084 2014-03-21 2 6419 time-series
6 22558084 2014-03-21 2 92764 panel-data
7 22558395 2014-03-21 2 5569 function
8 22558395 2014-03-21 2 134 sorting
9 22558395 2014-03-21 2 9412 vectorization
10 22558395 2014-03-21 2 18621 operator-precedence
# … with 497,143 more rows
questions_with_tags %>%
count(tag_name, sort = TRUE)
# A tibble: 7,840 x 2
tag_name n
<chr> <int>
1 ggplot2 28228
2 dataframe 18874
3 shiny 14219
4 dplyr 14039
5 plot 11315
6 data.table 8809
7 matrix 6205
8 loops 5149
9 regex 4912
10 function 4892
# … with 7,830 more rows
Joining Data with dplyr