Cleaning Data in R
Maggie Matsui
Content Developer @ DataCamp
library(reclin)
pair_blocking(df_A, df_B)
Simple blocking
  No blocking used.
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 25 pairs

ldat with 25 rows and 2 columns
   x y
1  1 1
2  2 1
3  3 1
...
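The 25 pairs are every combination of the 5 records in each data set (5 × 5). A minimal base R sketch of the same enumeration, assuming only the row counts printed in the summary; `n_A`, `n_B`, and `all_pairs` are illustrative names and not part of reclin:

```r
# Row counts taken from the printed summary; the names are illustrative.
n_A <- 5
n_B <- 5

# Pair every record in A with every record in B.
# expand.grid() varies its first argument fastest, like the x column in the output.
all_pairs <- expand.grid(x = seq_len(n_A), y = seq_len(n_B))

nrow(all_pairs)
head(all_pairs, 3)
```

The number of pairs grows with the product of the two row counts, which is why comparing every pair quickly becomes expensive on larger data sets.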
Only consider pairs of records that agree on the blocking variable (state)
pair_blocking(df_A, df_B, blocking_var = "state")
Simple blocking
  Blocking variable(s): state
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 8 pairs

ldat with 8 rows and 2 columns
  x y
1 1 1
2 1 4
3 2 3
4 2 5
5 3 2
6 4 2
7 5 1
8 5 4
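Conceptually, blocking keeps only the pairs whose blocking values are equal, much like an inner join. The sketch below shows this in base R. The `state` values are invented, chosen only so that the resulting pairs line up with the output above; the real contents of `df_A` and `df_B` are not shown in the slides, and `A`, `B`, and `blocked` are hypothetical names.

```r
# Invented state values, chosen only so the pairs match the printed output.
A <- data.frame(x = 1:5, state = c("IL", "CA", "NY", "NY", "IL"))
B <- data.frame(y = 1:5, state = c("IL", "NY", "CA", "IL", "CA"))

# Keep only pairs whose state values are equal (an inner join on state).
blocked <- merge(A, B, by = "state")
blocked <- blocked[order(blocked$x, blocked$y), c("x", "y", "state")]

nrow(blocked)  # number of candidate pairs left after blocking
blocked
```

With this toy data, 8 of the 25 possible pairs survive, so only those 8 need to be compared in later steps.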
pair_blocking(df_A, df_B, blocking_var = "state") %>%
  compare_pairs(by = "name", default_comparator = lcs())
Compare
  By: name

Simple blocking
  Blocking variable(s): state
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 8 pairs

ldat with 8 rows and 3 columns
  x y      name
1 1 1 0.3529412
2 1 4 0.3030303
3 2 3 0.9285714
4 2 5 0.2962963
...
8 5 4 0.3333333
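Each value in the `name` column is a similarity score between 0 and 1, where 1 means the two strings are identical. As a rough illustration of how a longest-common-subsequence (LCS) score can be computed, here is a base R sketch. It assumes the score is normalized as twice the LCS length divided by the combined string length. The printed scores are consistent with that normalization (for example, 0.9285714 is 26/28), but whether reclin computes it exactly this way is an assumption. `lcs_length()` and `lcs_similarity()` are hypothetical helpers, not reclin functions.

```r
# Length of the longest common subsequence of two strings (dynamic programming).
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # dp[i + 1, j + 1] is the LCS length of the first i chars of a and first j of b.
  dp <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      dp[i + 1, j + 1] <- if (x[i] == y[j]) {
        dp[i, j] + 1L
      } else {
        max(dp[i, j + 1], dp[i + 1, j])
      }
    }
  }
  dp[length(x) + 1, length(y) + 1]
}

# Assumed normalization: 2 * LCS length / total number of characters.
lcs_similarity <- function(a, b) {
  2 * lcs_length(a, b) / (nchar(a) + nchar(b))
}

lcs_similarity("kitten", "sitting")  # shares the subsequence "ittn"
lcs_similarity("kitten", "kitten")   # identical strings score 1
```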
pair_blocking(df_A, df_B, blocking_var = "state") %>%
  compare_pairs(by = c("name", "zip"), default_comparator = lcs())
Compare
  By: name, zip

Simple blocking
  Blocking variable(s): state
  First data set:  5 records
  Second data set: 5 records
  Total number of pairs: 8 pairs

ldat with 8 rows and 4 columns
  x y      name zip
1 1 1 0.3529412 0.4
2 1 4 0.3030303 0.2
3 2 3 0.9285714 1.0
4 2 5 0.2962963 0.2
...
8 5 4 0.3333333 0.2
default_comparator = lcs()
default_comparator = jaccard()
default_comparator = jaro_winkler()
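To get a feel for how these three measures differ, the sketch below scores one pair of strings with each of them. It calls the stringdist package directly as a stand-in and makes no claim about how reclin's comparators are implemented. The example strings, the choice of `q = 2` for Jaccard, and `p = 0.1` for Jaro-Winkler are illustrative assumptions.

```r
library(stringdist)  # assumed to be installed

# Made-up example strings.
a <- "Jon Smith"
b <- "John Smyth"

# Each call returns a similarity between 0 and 1.
sims <- c(
  lcs          = stringsim(a, b, method = "lcs"),
  jaccard      = stringsim(a, b, method = "jaccard", q = 2),  # bigram sets
  jaro_winkler = stringsim(a, b, method = "jw", p = 0.1)      # prefix bonus
)
sims
```

Scores from different measures are not on a shared scale, so a threshold that works for one comparator may not carry over to another.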