Preguntas de Stack Overflow

Uniones de datos con dplyr

Chris Cardillo

Data Scientist

Los verbos de join

repaso de joins

Uniones de datos con dplyr

pregunta de Stack Overflow

Uniones de datos con dplyr

La tabla questions

questions
# A tibble: 294,735 x 3
         id creation_date score
      <int> <date>        <int>
 1 22557677 2014-03-21        1
 2 22557707 2014-03-21        2
 3 22558084 2014-03-21        2
 4 22558395 2014-03-21        2
 5 22558613 2014-03-21        0
 6 22558677 2014-03-21        2
 7 22558887 2014-03-21        8
 8 22559180 2014-03-21        1
 9 22559312 2014-03-21        0
10 22559322 2014-03-21        2
# … with 294,725 more rows
Uniones de datos con dplyr

Las tablas question_tags y tags

question_tags
# A tibble: 497,153 x 2
   question_id tag_id
         <int>  <int>
 1    22557677     18
 2    22557677    139
 3    22557677  16088
 4    22557677   1672
 5    22558084   6419
 6    22558084  92764
 7    22558395   5569
 8    22558395    134
 9    22558395   9412
10    22558395  18621
# … with 497,143 more rows
tags
# A tibble: 48,299 x 2
       id tag_name                 
    <dbl> <chr>                    
 1 124399 laravel-dusk             
 2 124402 spring-cloud-vault-config
 3 124404 spring-vault             
 4 124405 apache-bahir             
 5 124407 astc                     
 6 124408 simulacrum               
 7 124410 angulartics2             
 8 124411 django-rest-viewsets     
 9 124414 react-native-lightbox    
10 124417 java-module              
# … with 48,289 more rows
Uniones de datos con dplyr

Unir question_tags con questions

questions %>%
  inner_join(question_tags, by = c("id" = "question_id"))
Uniones de datos con dplyr

Uniendo tags

questions_with_tags <- questions %>%
  inner_join(question_tags, by = c("id" = "question_id")) %>%
  inner_join(tags, by = c("tag_id" = "id"))

questions_with_tags
# A tibble: 497,153 x 5
         id creation_date score tag_id tag_name           
      <int> <date>        <int>  <dbl> <chr>              
 1 22557677 2014-03-21        1     18 regex              
 2 22557677 2014-03-21        1    139 string             
 3 22557677 2014-03-21        1  16088 time-complexity    
 4 22557677 2014-03-21        1   1672 backreference      
 5 22558084 2014-03-21        2   6419 time-series        
 6 22558084 2014-03-21        2  92764 panel-data         
 7 22558395 2014-03-21        2   5569 function           
 8 22558395 2014-03-21        2    134 sorting            
 9 22558395 2014-03-21        2   9412 vectorization      
10 22558395 2014-03-21        2  18621 operator-precedence
# … with 497,143 more rows
Uniones de datos con dplyr

Etiquetas más comunes

questions_with_tags %>%
  count(tag_name, sort = TRUE)
# A tibble: 7,840 x 2
   tag_name       n
   <chr>      <int>
 1 ggplot2    28228
 2 dataframe  18874
 3 shiny      14219
 4 dplyr      14039
 5 plot       11315
 6 data.table  8809
 7 matrix      6205
 8 loops       5149
 9 regex       4912
10 function    4892
# … with 7,830 more rows
Uniones de datos con dplyr

¡Vamos a practicar!

Uniones de datos con dplyr

Preparing Video For Download...