Generating and comparing pairs

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

When joins won't work

What is record linkage?

On the left, two databases with brooms labeled Data A and Data B

Arrows point from Data A and Data B to three pairs of people, labeled Generate pairs

Arrow points from generate pairs to two columns of circles with arrows pointing at each other in various directions, labeled compare pairs.

Arrow points from compare pairs to a figure of a person holding up a sign that says .93. Labeled score pairs.

Arrow from score pairs to a chain, labeled link data.

Same diagram with a blue box around the generate pairs step

Pairs of records

Two tables, df_A and df_B each containing names of people, their zip codes, and state. One row in df_A is highlighted for Keaton Z Snyder, zip 15020, state PA. One row in df_B is highlighted for Keaton Snyder, zip 15020, state PA.

Generating pairs

Same tables with lines going from every row in df_A to every row in df_B to show every combination.

Generating pairs in R

library(reclin)
pair_blocking(df_A, df_B)

Simple blocking
  No blocking used.
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 25 pairs

ldat with 25 rows and 2 columns
   x y
1  1 1
2  2 1
3  3 1
...
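
To see where the 25 pairs come from, here is a minimal base R sketch that pairs every row index of one table with every row index of the other. It is not how reclin is implemented; the names n_A, n_B, and pairs are made up for illustration, and only the row counts (5 and 5) come from the output above.

```r
# Row counts taken from the printed summary: 5 records in each data set
n_A <- 5
n_B <- 5

# expand.grid() builds every combination; x varies fastest, as in the output above
pairs <- expand.grid(x = seq_len(n_A), y = seq_len(n_B))

nrow(pairs)    # number of candidate pairs (n_A * n_B)
head(pairs, 3) # first few pairs
```

The number of pairs grows as the product of the two table sizes, which is why comparing every pair quickly becomes impractical for larger tables.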

Too many pairs

Same tables extended downwards to have more rows, with even more lines connecting each pair.

Blocking

Same tables, but only rows that have the same state are connected by lines.

Only consider pairs of records that agree on the blocking variable (state)

Pair blocking in R

pair_blocking(df_A, df_B, blocking_var = "state")

Simple blocking
  Blocking variable(s): state
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 8 pairs

ldat with 8 rows and 2 columns
  x y
1 1 1
2 1 4
3 2 3
4 2 5
5 3 2
6 4 2
7 5 1
8 5 4
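
Blocking can be pictured as a join on the blocking variable that keeps only the row indices. The base R sketch below is an illustration, not reclin's implementation. The state values are hypothetical, chosen only so that the resulting pairs match the eight pairs printed above; the real values in df_A and df_B are not shown on the slides.

```r
# Hypothetical state columns (assumption): picked to reproduce the 8 pairs above
A <- data.frame(x = 1:5, state = c("PA", "NY", "OH", "OH", "PA"))
B <- data.frame(y = 1:5, state = c("PA", "OH", "NY", "PA", "NY"))

# Joining on state keeps only pairs whose records agree on the blocking variable
blocked <- merge(A, B, by = "state")[, c("x", "y")]

# Sort by row index so the pairs are listed in the same order as the output above
blocked <- blocked[order(blocked$x, blocked$y), ]
rownames(blocked) <- NULL

nrow(blocked)
blocked
```

With these made-up states, 8 of the 25 possible pairs survive, so fewer comparisons are needed in the next step.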

Comparing pairs

Record linkage steps diagram with compare pairs step highlighted.

pair_blocking(df_A, df_B, blocking_var = "state") %>%
  compare_pairs(by = "name", default_comparator = lcs())

Compare
  By: name

Simple blocking
  Blocking variable(s): state
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 8 pairs

ldat with 8 rows and 3 columns
  x y      name
1 1 1 0.3529412
2 1 4 0.3030303
3 2 3 0.9285714
4 2 5 0.2962963
...
8 5 4 0.3333333
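
To give a feel for what a longest-common-subsequence (LCS) score measures, here is a small base R sketch. The helper names lcs_length and lcs_similarity are hypothetical, and the formula 2 * LCS / (length of a + length of b) is an assumption about the scaling. It does reproduce the 0.9285714 shown for pair (2, 3) when applied to the two highlighted names, "Keaton Z Snyder" and "Keaton Snyder".

```r
# Length of the longest common subsequence of two strings (dynamic programming)
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # dp[i + 1, j + 1] is the LCS length of the first i chars of a and first j chars of b
  dp <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      dp[i + 1, j + 1] <- if (x[i] == y[j]) {
        dp[i, j] + 1L
      } else {
        max(dp[i, j + 1], dp[i + 1, j])
      }
    }
  }
  dp[length(x) + 1, length(y) + 1]
}

# Assumed scaling: share of characters in both strings covered by the LCS
lcs_similarity <- function(a, b) {
  2 * lcs_length(a, b) / (nchar(a) + nchar(b))
}

lcs_similarity("Keaton Z Snyder", "Keaton Snyder")
```

Scores near 1 mean the two strings are almost identical; scores near 0 mean they share few characters in the same order.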

Comparing multiple columns

pair_blocking(df_A, df_B, blocking_var = "state") %>%
  compare_pairs(by = c("name", "zip"), default_comparator = lcs())

Compare
  By: name, zip

Simple blocking
  Blocking variable(s): state
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 8 pairs

ldat with 8 rows and 4 columns
  x y      name zip
1 1 1 0.3529412 0.4
2 1 4 0.3030303 0.2
3 2 3 0.9285714 1.0
4 2 5 0.2962963 0.2
...
8 5 4 0.3333333 0.2

Different comparators

default_comparator = lcs()
default_comparator = jaccard()
default_comparator = jaro_winkler()
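
Each comparator defines similarity differently. As one illustration, the sketch below computes a Jaccard similarity in base R: the size of the intersection of two sets divided by the size of their union. Treating each string as a set of single characters is an assumption made here for simplicity, and the helper name jaccard_similarity is hypothetical, so scores from reclin's jaccard() may differ.

```r
# Jaccard similarity over the sets of distinct characters in each string
jaccard_similarity <- function(a, b) {
  x <- unique(strsplit(a, "")[[1]])
  y <- unique(strsplit(b, "")[[1]])
  length(intersect(x, y)) / length(union(x, y))
}

jaccard_similarity("15020", "15020") # identical strings
jaccard_similarity("abc", "abd")     # share a and b out of a, b, c, d
```

Unlike LCS, this set-based score ignores character order, so which comparator fits best depends on the kinds of differences expected in the data.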

Let's practice!
