Generating and comparing pairs

Cleaning Data in R

Maggie Matsui

Content Developer @ DataCamp

When joins won't work

Left table has event and time columns. Games shown are Houston Rockets vs Chicago Bulls at 19:00. Miami Heat vs Los Angeles Lakers at 19:00. Brooklyn Nets vs Orlando Magic at 20:00. Denver Nuggets vs. Miami Heat at 21:00. San antonio spurs vs Atlanta Hawks at 21:00. Table on the right has gaves: NBA: Nets vs Magic at 8pm. NBA: Bulls vs Rockets at 9pm. NBA: Heat vs Lakers at 7pm. NBA: Grizzlies vs Heat at 10pm. NBA: Heat vs Cavaliers at 9pm.

Cleaning Data in R

When joins won't work

Cleaning Data in R

What is record linkage?

record_linkage_1.png

Cleaning Data in R

What is record linkage?

On the left, two databases with brooms labeled Data A and Data B

Cleaning Data in R

What is record linkage?

Arrows point from Data A and Data B to three pairs of people, labeled Generate pairs

Cleaning Data in R

What is record linkage?

Arrow points from generate pairs to two columns of circles with arrows pointing at each other in various directions. Description is compare pairs.

Cleaning Data in R

What is record linkage?

Arrow points from compare pairs to a figure of a person holding up a sign that says .93. Labeled score pairs.

Cleaning Data in R

What is record linkage?

Arrow from score pairs to a chain, labeled link data.

Cleaning Data in R

What is record linkage?

Same diagram with a blue box around the generate pairs step

Cleaning Data in R

Pairs of records

Two tables, df_A and df_B each containing names of people, their zip codes, and state. One row in df_A is highlighted for Keaton Z Snyder, zip 15020, state PA. One row in df_B is highlighted for Keaton Snyder, zip 15020, state PA.

Cleaning Data in R

Generating pairs

Same tables with lines going from every row in df_A to every row in df_B to show every combination.

Cleaning Data in R

Generating pairs in R

library(reclin)
pair_blocking(df_A, df_B)
Simple blocking
  No blocking used.
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 25 pairs

ldat with 25 rows and 2 columns x y 1 1 1 2 2 1 3 3 1 ...
Cleaning Data in R

Too many pairs

Same tables extended downwards to have more rows, with even more lines connecting each pair.

Cleaning Data in R

Blocking

Same tables, but only rows that have the same state are connected by lines.

Only consider pairs when they agree on the blocking variable (State)

Cleaning Data in R

Pair blocking in R

pair_blocking(df_A, df_B, blocking_var = "state")
Simple blocking                                 ldat with 8 rows and 2 columns
  Blocking variable(s): state                     x y
  First data set:  5 records                    1 1 1
  Second data set: 5 records                    2 1 4
  Total number of pairs: 8 pairs                3 2 3
                                                4 2 5
                                                5 3 2
                                                6 4 2
                                                7 5 1
                                                8 5 4
Cleaning Data in R

Comparing pairs

Record linkage steps diagram with compare pairs step highlighted.

Cleaning Data in R

Comparing pairs

pair_blocking(df_A, df_B, blocking_var = "state") %>%

compare_pairs(by = "name", default_comparator = lcs())
Compare                                     ldat with 8 rows and 3 columns            
  By: name                                      x y      name
                                              1 1 1 0.3529412
Simple blocking                               2 1 4 0.3030303
  Blocking variable(s): state                 3 2 3 0.9285714
  First data set:  5 records                  4 2 5 0.2962963    
  Second data set: 5 records                  ...
  Total number of pairs: 8 pairs              8 5 4 0.3333333
Cleaning Data in R

Comparing multiple columns

pair_blocking(df_A, df_B, blocking_var = "state") %>%
  compare_pairs(by = c("name", "zip"), default_comparator = lcs())
Compare                                    ldat with 8 rows and 4 columns
  By: name, zip                              x y      name zip
                                           1 1 1 0.3529412 0.4
Simple blocking                            2 1 4 0.3030303 0.2
  Blocking variable(s): state              3 2 3 0.9285714 1.0
  First data set:  5 records               4 2 5 0.2962963 0.2
  Second data set: 5 records               ...
  Total number of pairs: 8 pairs           8 5 4 0.3333333 0.2
Cleaning Data in R

Different comparators

  • default_comparator = lcs()
  • default_comparator = jaccard()
  • default_comparator = jaro_winkler()
Cleaning Data in R

Let's practice!

Cleaning Data in R

Preparing Video For Download...