Perguntas do Stack Overflow

Combinando dados com dplyr

Chris Cardillo

Data Scientist

Os verbos de join

revisão de joins

Combinando dados com dplyr

pergunta no Stack Overflow

Combinando dados com dplyr

A tabela questions

questions
# A tibble: 294,735 x 3
         id creation_date score
      <int> <date>        <int>
 1 22557677 2014-03-21        1
 2 22557707 2014-03-21        2
 3 22558084 2014-03-21        2
 4 22558395 2014-03-21        2
 5 22558613 2014-03-21        0
 6 22558677 2014-03-21        2
 7 22558887 2014-03-21        8
 8 22559180 2014-03-21        1
 9 22559312 2014-03-21        0
10 22559322 2014-03-21        2
# … with 294,725 more rows
Combinando dados com dplyr

As tabelas question_tags e tags

question_tags
# A tibble: 497,153 x 2
   question_id tag_id
         <int>  <int>
 1    22557677     18
 2    22557677    139
 3    22557677  16088
 4    22557677   1672
 5    22558084   6419
 6    22558084  92764
 7    22558395   5569
 8    22558395    134
 9    22558395   9412
10    22558395  18621
# … with 497,143 more rows
tags
# A tibble: 48,299 x 2
       id tag_name                 
    <dbl> <chr>                    
 1 124399 laravel-dusk             
 2 124402 spring-cloud-vault-config
 3 124404 spring-vault             
 4 124405 apache-bahir             
 5 124407 astc                     
 6 124408 simulacrum               
 7 124410 angulartics2             
 8 124411 django-rest-viewsets     
 9 124414 react-native-lightbox    
10 124417 java-module              
# … with 48,289 more rows
Combinando dados com dplyr

Juntando question_tags com questions

questions %>%
  inner_join(question_tags, by = c("id" = "question_id"))
Combinando dados com dplyr

Juntando tags

questions_with_tags <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))

questions_with_tags
# A tibble: 497,153 x 5
         id creation_date score tag_id tag_name           
      <int> <date>        <int>  <dbl> <chr>              
 1 22557677 2014-03-21        1     18 regex              
 2 22557677 2014-03-21        1    139 string             
 3 22557677 2014-03-21        1  16088 time-complexity    
 4 22557677 2014-03-21        1   1672 backreference      
 5 22558084 2014-03-21        2   6419 time-series        
 6 22558084 2014-03-21        2  92764 panel-data         
 7 22558395 2014-03-21        2   5569 function           
 8 22558395 2014-03-21        2    134 sorting            
 9 22558395 2014-03-21        2   9412 vectorization      
10 22558395 2014-03-21        2  18621 operator-precedence
# … with 497,143 more rows
Combinando dados com dplyr

Tags mais comuns

questions_with_tags %>%
  count(tag_name, sort = TRUE)
# A tibble: 7,840 x 2
   tag_name       n
   <chr>      <int>
 1 ggplot2    28228
 2 dataframe  18874
 3 shiny      14219
 4 dplyr      14039
 5 plot       11315
 6 data.table  8809
 7 matrix      6205
 8 loops       5149
 9 regex       4912
10 function    4892
# … with 7,830 more rows
Combinando dados com dplyr

Vamos praticar!

Combinando dados com dplyr

Preparing Video For Download...