Cleaning Data in Python
Adel Nehme
Content Developer @ DataCamp
The recordlinkage
package
census_A
given_name surname date_of_birth suburb state address_1
rec_id
rec-1070-org michaela neumann 19151111 winston hills cal stanley street
rec-1016-org courtney painter 19161214 richlands txs pinkerton circuit
...
census_B
given_name surname date_of_birth suburb state address_1
rec_id
rec-561-dup-0 elton NaN 19651013 windermere ny light setreet
rec-2642-dup-0 mitchell maxon 19390212 north ryde cal edkins street
...
# Import recordlinkage import recordlinkage
# Create indexing object indexer = recordlinkage.Index()
# Generate pairs blocked on state indexer.block('state') pairs = indexer.index(census_A, census_B)
print(pairs)
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org',
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org',
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org',
'rec-1280-org',...........
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61,
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])
# Generate the pairs pairs = indexer.index(census_A, census_B)
# Create a Compare object compare_cl = recordlinkage.Compare()
# Find exact matches for pairs of date_of_birth and state compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth') compare_cl.exact('state', 'state', label='state')
# Find similar matches for pairs of surname and address_1 using string similarity compare_cl.string('surname', 'surname', threshold=0.85, label='surname') compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
# Find matches potential_matches = compare_cl.compute(pairs, census_A, census_B)
print(potential_matches)
date_of_birth state surname address_1
rec_id_1 rec_id_2
rec-1070-org rec-561-dup-0 0 1 0.0 0.0
rec-2642-dup-0 0 1 0.0 0.0
rec-608-dup-0 0 1 0.0 0.0
...
rec-1631-org rec-4070-dup-0 0 1 0.0 0.0
rec-4862-dup-0 0 1 0.0 0.0
rec-629-dup-0 0 1 0.0 0.0
...
potential_matches[potential_matches.sum(axis = 1) => 2]
date_of_birth state surname address_1
rec_id_1 rec_id_2
rec-4878-org rec-4878-dup-0 1 1 1.0 0.0
rec-417-org rec-2867-dup-0 0 1 0.0 1.0
rec-3964-org rec-394-dup-0 0 1 1.0 0.0
rec-1373-org rec-4051-dup-0 0 1 1.0 0.0
rec-802-dup-0 0 1 1.0 0.0
rec-3540-org rec-470-dup-0 0 1 1.0 0.0
Cleaning Data in Python