Generating and comparing pairs

Nettoyer des données avec R

Maggie Matsui

Content Developer @ DataCamp

When joins won't work

Left table has event and time columns. Games shown are Houston Rockets vs Chicago Bulls at 19:00. Miami Heat vs Los Angeles Lakers at 19:00. Brooklyn Nets vs Orlando Magic at 20:00. Denver Nuggets vs. Miami Heat at 21:00. San antonio spurs vs Atlanta Hawks at 21:00. Table on the right has gaves: NBA: Nets vs Magic at 8pm. NBA: Bulls vs Rockets at 9pm. NBA: Heat vs Lakers at 7pm. NBA: Grizzlies vs Heat at 10pm. NBA: Heat vs Cavaliers at 9pm.

Nettoyer des données avec R

When joins won't work

Nettoyer des données avec R

What is record linkage?

record_linkage_1.png

Nettoyer des données avec R

What is record linkage?

On the left, two databases with brooms labeled Data A and Data B

Nettoyer des données avec R

What is record linkage?

Arrows point from Data A and Data B to three pairs of people, labeled Generate pairs

Nettoyer des données avec R

What is record linkage?

Arrow points from generate pairs to two columns of circles with arrows pointing at each other in various directions. Description is compare pairs.

Nettoyer des données avec R

What is record linkage?

Arrow points from compare pairs to a figure of a person holding up a sign that says .93. Labeled score pairs.

Nettoyer des données avec R

What is record linkage?

Arrow from score pairs to a chain, labeled link data.

Nettoyer des données avec R

What is record linkage?

Same diagram with a blue box around the generate pairs step

Nettoyer des données avec R

Pairs of records

Two tables, df_A and df_B each containing names of people, their zip codes, and state. One row in df_A is highlighted for Keaton Z Snyder, zip 15020, state PA. One row in df_B is highlighted for Keaton Snyder, zip 15020, state PA.

Nettoyer des données avec R

Generating pairs

Same tables with lines going from every row in df_A to every row in df_B to show every combination.

Nettoyer des données avec R

Generating pairs in R

library(reclin)
pair_blocking(df_A, df_B)
Simple blocking
  No blocking used.
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 25 pairs

ldat with 25 rows and 2 columns x y 1 1 1 2 2 1 3 3 1 ...
Nettoyer des données avec R

Too many pairs

Same tables extended downwards to have more rows, with even more lines connecting each pair.

Nettoyer des données avec R

Blocking

Same tables, but only rows that have the same state are connected by lines.

Only consider pairs when they agree on the blocking variable (State)

Nettoyer des données avec R

Pair blocking in R

pair_blocking(df_A, df_B, blocking_var = "state")
Simple blocking                                 ldat with 8 rows and 2 columns
  Blocking variable(s): state                     x y
  First data set:  5 records                    1 1 1
  Second data set: 5 records                    2 1 4
  Total number of pairs: 8 pairs                3 2 3
                                                4 2 5
                                                5 3 2
                                                6 4 2
                                                7 5 1
                                                8 5 4
Nettoyer des données avec R

Comparing pairs

Record linkage steps diagram with compare pairs step highlighted.

Nettoyer des données avec R

Comparing pairs

pair_blocking(df_A, df_B, blocking_var = "state") %>%

compare_pairs(by = "name", default_comparator = lcs())
Compare                                     ldat with 8 rows and 3 columns            
  By: name                                      x y      name
                                              1 1 1 0.3529412
Simple blocking                               2 1 4 0.3030303
  Blocking variable(s): state                 3 2 3 0.9285714
  First data set:  5 records                  4 2 5 0.2962963    
  Second data set: 5 records                  ...
  Total number of pairs: 8 pairs              8 5 4 0.3333333
Nettoyer des données avec R

Comparing multiple columns

pair_blocking(df_A, df_B, blocking_var = "state") %>%
  compare_pairs(by = c("name", "zip"), default_comparator = lcs())
Compare                                    ldat with 8 rows and 4 columns
  By: name, zip                              x y      name zip
                                           1 1 1 0.3529412 0.4
Simple blocking                            2 1 4 0.3030303 0.2
  Blocking variable(s): state              3 2 3 0.9285714 1.0
  First data set:  5 records               4 2 5 0.2962963 0.2
  Second data set: 5 records               ...
  Total number of pairs: 8 pairs           8 5 4 0.3333333 0.2
Nettoyer des données avec R

Different comparators

  • default_comparator = lcs()
  • default_comparator = jaccard()
  • default_comparator = jaro_winkler()
Nettoyer des données avec R

Let's practice!

Nettoyer des données avec R

Preparing Video For Download...