Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp











library(reclin)
pair_blocking(df_A, df_B)
Simple blocking No blocking used. First data set: 5 records Second data set: 5 records Total number of pairs: 25 pairsldat with 25 rows and 2 columns x y 1 1 1 2 2 1 3 3 1 ...


Only consider pairs when they agree on the blocking variable (State)
pair_blocking(df_A, df_B, blocking_var = "state")
Simple blocking                                 ldat with 8 rows and 2 columns
  Blocking variable(s): state                     x y
  First data set:  5 records                    1 1 1
  Second data set: 5 records                    2 1 4
  Total number of pairs: 8 pairs                3 2 3
                                                4 2 5
                                                5 3 2
                                                6 4 2
                                                7 5 1
                                                8 5 4

pair_blocking(df_A, df_B, blocking_var = "state") %>%compare_pairs(by = "name", default_comparator = lcs())
Compare                                     ldat with 8 rows and 3 columns            
  By: name                                      x y      name
                                              1 1 1 0.3529412
Simple blocking                               2 1 4 0.3030303
  Blocking variable(s): state                 3 2 3 0.9285714
  First data set:  5 records                  4 2 5 0.2962963    
  Second data set: 5 records                  ...
  Total number of pairs: 8 pairs              8 5 4 0.3333333
pair_blocking(df_A, df_B, blocking_var = "state") %>%
  compare_pairs(by = c("name", "zip"), default_comparator = lcs())
Compare                                    ldat with 8 rows and 4 columns
  By: name, zip                              x y      name zip
                                           1 1 1 0.3529412 0.4
Simple blocking                            2 1 4 0.3030303 0.2
  Blocking variable(s): state              3 2 3 0.9285714 1.0
  First data set:  5 records               4 2 5 0.2962963 0.2
  Second data set: 5 records               ...
  Total number of pairs: 8 pairs           8 5 4 0.3333333 0.2
default_comparator = lcs()default_comparator = jaccard()default_comparator = jaro_winkler()Cleaning Data in R