Generare coppie

Pulizia dei dati in Python

Adel Nehme

VP of AI Curriculum, DataCamp

Motivazione

Tabelle partite NBA

Pulizia dei dati in Python

Quando le join non bastano

record linkage

Pulizia dei dati in Python

Record linkage

Il pacchetto recordlinkage

Pulizia dei dati in Python

I nostri DataFrame

census_A

             given_name  surname date_of_birth         suburb state  address_1
rec_id                                                                
rec-1070-org   michaela  neumann      19151111  winston hills   cal  stanley street 
rec-1016-org   courtney  painter      19161214      richlands   txs  pinkerton circuit 
...

census_B

               given_name  surname date_of_birth             suburb state  address_1
rec_id                                                                      
rec-561-dup-0       elton      NaN      19651013         windermere   ny   light setreet 
rec-2642-dup-0   mitchell    maxon      19390212         north ryde   cal  edkins street 
...
Pulizia dei dati in Python

Generare coppie

Pulizia dei dati in Python

Generare coppie

Pulizia dei dati in Python

Blocking

Pulizia dei dati in Python

Generare coppie

# Import recordlinkage
import recordlinkage

# Crea oggetto di indicizzazione indexer = recordlinkage.Index()
# Genera coppie bloccate su state indexer.block('state') pairs = indexer.index(census_A, census_B)
Pulizia dei dati in Python

Generare coppie

print(pairs)
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org', 
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org', 
'rec-1280-org',...........  
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61, 
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])
Pulizia dei dati in Python

Confrontare i DataFrame

# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object compare_cl = recordlinkage.Compare()
# Trova corrispondenze esatte per date_of_birth e state compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth') compare_cl.exact('state', 'state', label='state')
# Trova corrispondenze simili per surname e address_1 con similarità di stringa compare_cl.string('surname', 'surname', threshold=0.85, label='surname') compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
# Trova corrispondenze potential_matches = compare_cl.compute(pairs, census_A, census_B)
Pulizia dei dati in Python

Trovare coppie corrispondenti

print(potential_matches)
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...
Pulizia dei dati in Python

Filtrare solo le coppie desiderate

potential_matches[potential_matches.sum(axis = 1) => 2]
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3964-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
Pulizia dei dati in Python

Passons à la pratique !

Pulizia dei dati in Python

Preparing Video For Download...