Paren genereren

Data opschonen in Python

Adel Nehme

VP of AI Curriculum, DataCamp

Motivatie

NBA-wedstrijdtabels

Data opschonen in Python

Als joins niet werken

recordlinkage

Data opschonen in Python

Record linkage

Het pakket recordlinkage

Data opschonen in Python

Onze DataFrames

census_A

             given_name  surname date_of_birth         suburb state  address_1
rec_id                                                                
rec-1070-org   michaela  neumann      19151111  winston hills   cal  stanley street 
rec-1016-org   courtney  painter      19161214      richlands   txs  pinkerton circuit 
...

census_B

               given_name  surname date_of_birth             suburb state  address_1
rec_id                                                                      
rec-561-dup-0       elton      NaN      19651013         windermere   ny   light setreet 
rec-2642-dup-0   mitchell    maxon      19390212         north ryde   cal  edkins street 
...
Data opschonen in Python

Paren genereren

Data opschonen in Python

Paren genereren

Data opschonen in Python

Blokkeren

Data opschonen in Python

Paren genereren

# Import recordlinkage
import recordlinkage

# Maak een indexing-object indexer = recordlinkage.Index()
# Genereer paren geblokkeerd op state indexer.block('state') pairs = indexer.index(census_A, census_B)
Data opschonen in Python

Paren genereren

print(pairs)
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org', 
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org', 
'rec-1280-org',...........  
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61, 
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])
Data opschonen in Python

De DataFrames vergelijken

# Genereer de paren
pairs = indexer.index(census_A, census_B)

# Maak een Compare-object compare_cl = recordlinkage.Compare()
# Zoek exacte matches voor paren van date_of_birth en state compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth') compare_cl.exact('state', 'state', label='state')
# Zoek soortgelijke matches voor paren van surname en address_1 met stringsimilariteit compare_cl.string('surname', 'surname', threshold=0.85, label='surname') compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
# Vind matches potential_matches = compare_cl.compute(pairs, census_A, census_B)
Data opschonen in Python

Overeenkomende paren vinden

print(potential_matches)
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...
Data opschonen in Python

Alleen de gewenste paren vinden

potential_matches[potential_matches.sum(axis = 1) => 2]
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3964-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
Data opschonen in Python

Laten we oefenen!

Data opschonen in Python

Preparing Video For Download...