Generating pairs

Pulizia dei dati in Python

Adel Nehme

VP of AI Curriculum, DataCamp

Motivation

NBA games tables

Pulizia dei dati in Python

When joins won't work

record linkage

Pulizia dei dati in Python

Record linkage

The recordlinkage package

Pulizia dei dati in Python

Our DataFrames

census_A

             given_name  surname date_of_birth         suburb state  address_1
rec_id                                                                
rec-1070-org   michaela  neumann      19151111  winston hills   cal  stanley street 
rec-1016-org   courtney  painter      19161214      richlands   txs  pinkerton circuit 
...

census_B

               given_name  surname date_of_birth             suburb state  address_1
rec_id                                                                      
rec-561-dup-0       elton      NaN      19651013         windermere   ny   light setreet 
rec-2642-dup-0   mitchell    maxon      19390212         north ryde   cal  edkins street 
...
Pulizia dei dati in Python

Generating pairs

Pulizia dei dati in Python

Generating pairs

Pulizia dei dati in Python

Blocking

Pulizia dei dati in Python

Generating pairs

# Import recordlinkage
import recordlinkage

# Create indexing object indexer = recordlinkage.Index()
# Generate pairs blocked on state indexer.block('state') pairs = indexer.index(census_A, census_B)
Pulizia dei dati in Python

Generating pairs

print(pairs)
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org', 
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org', 
'rec-1280-org',...........  
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61, 
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])
Pulizia dei dati in Python

Comparing the DataFrames

# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object compare_cl = recordlinkage.Compare()
# Find exact matches for pairs of date_of_birth and state compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth') compare_cl.exact('state', 'state', label='state')
# Find similar matches for pairs of surname and address_1 using string similarity compare_cl.string('surname', 'surname', threshold=0.85, label='surname') compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
# Find matches potential_matches = compare_cl.compute(pairs, census_A, census_B)
Pulizia dei dati in Python

Finding matching pairs

print(potential_matches)
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...
Pulizia dei dati in Python

Finding the only pairs we want

potential_matches[potential_matches.sum(axis = 1) => 2]
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3964-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
Pulizia dei dati in Python

Let's practice!

Pulizia dei dati in Python

Preparing Video For Download...