Generating and comparing pairs

Pulizia dei dati in R

Maggie Matsui

Content Developer @ DataCamp

When joins won't work

Left table has event and time columns. Games shown are Houston Rockets vs Chicago Bulls at 19:00. Miami Heat vs Los Angeles Lakers at 19:00. Brooklyn Nets vs Orlando Magic at 20:00. Denver Nuggets vs. Miami Heat at 21:00. San antonio spurs vs Atlanta Hawks at 21:00. Table on the right has gaves: NBA: Nets vs Magic at 8pm. NBA: Bulls vs Rockets at 9pm. NBA: Heat vs Lakers at 7pm. NBA: Grizzlies vs Heat at 10pm. NBA: Heat vs Cavaliers at 9pm.

Pulizia dei dati in R

When joins won't work

Pulizia dei dati in R

What is record linkage?

record_linkage_1.png

Pulizia dei dati in R

What is record linkage?

On the left, two databases with brooms labeled Data A and Data B

Pulizia dei dati in R

What is record linkage?

Arrows point from Data A and Data B to three pairs of people, labeled Generate pairs

Pulizia dei dati in R

What is record linkage?

Arrow points from generate pairs to two columns of circles with arrows pointing at each other in various directions. Description is compare pairs.

Pulizia dei dati in R

What is record linkage?

Arrow points from compare pairs to a figure of a person holding up a sign that says .93. Labeled score pairs.

Pulizia dei dati in R

What is record linkage?

Arrow from score pairs to a chain, labeled link data.

Pulizia dei dati in R

What is record linkage?

Same diagram with a blue box around the generate pairs step

Pulizia dei dati in R

Pairs of records

Two tables, df_A and df_B each containing names of people, their zip codes, and state. One row in df_A is highlighted for Keaton Z Snyder, zip 15020, state PA. One row in df_B is highlighted for Keaton Snyder, zip 15020, state PA.

Pulizia dei dati in R

Generating pairs

Same tables with lines going from every row in df_A to every row in df_B to show every combination.

Pulizia dei dati in R

Generating pairs in R

library(reclin)
pair_blocking(df_A, df_B)
Simple blocking
  No blocking used.
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 25 pairs

ldat with 25 rows and 2 columns x y 1 1 1 2 2 1 3 3 1 ...
Pulizia dei dati in R

Too many pairs

Same tables extended downwards to have more rows, with even more lines connecting each pair.

Pulizia dei dati in R

Blocking

Same tables, but only rows that have the same state are connected by lines.

Only consider pairs when they agree on the blocking variable (State)

Pulizia dei dati in R

Pair blocking in R

pair_blocking(df_A, df_B, blocking_var = "state")
Simple blocking                                 ldat with 8 rows and 2 columns
  Blocking variable(s): state                     x y
  First data set:  5 records                    1 1 1
  Second data set: 5 records                    2 1 4
  Total number of pairs: 8 pairs                3 2 3
                                                4 2 5
                                                5 3 2
                                                6 4 2
                                                7 5 1
                                                8 5 4
Pulizia dei dati in R

Comparing pairs

Record linkage steps diagram with compare pairs step highlighted.

Pulizia dei dati in R

Comparing pairs

pair_blocking(df_A, df_B, blocking_var = "state") %>%

compare_pairs(by = "name", default_comparator = lcs())
Compare                                     ldat with 8 rows and 3 columns            
  By: name                                      x y      name
                                              1 1 1 0.3529412
Simple blocking                               2 1 4 0.3030303
  Blocking variable(s): state                 3 2 3 0.9285714
  First data set:  5 records                  4 2 5 0.2962963    
  Second data set: 5 records                  ...
  Total number of pairs: 8 pairs              8 5 4 0.3333333
Pulizia dei dati in R

Comparing multiple columns

pair_blocking(df_A, df_B, blocking_var = "state") %>%
  compare_pairs(by = c("name", "zip"), default_comparator = lcs())
Compare                                    ldat with 8 rows and 4 columns
  By: name, zip                              x y      name zip
                                           1 1 1 0.3529412 0.4
Simple blocking                            2 1 4 0.3030303 0.2
  Blocking variable(s): state              3 2 3 0.9285714 1.0
  First data set:  5 records               4 2 5 0.2962963 0.2
  Second data set: 5 records               ...
  Total number of pairs: 8 pairs           8 5 4 0.3333333 0.2
Pulizia dei dati in R

Different comparators

  • default_comparator = lcs()
  • default_comparator = jaccard()
  • default_comparator = jaro_winkler()
Pulizia dei dati in R

Let's practice!

Pulizia dei dati in R

Preparing Video For Download...