Datenpaare generieren

Datenbereinigung in Python

Adel Nehme

VP of AI Curriculum, DataCamp

Motivation

NBA-Spieltage-Tabellen

Datenbereinigung in Python

Wenn Joins nicht funktionieren

record linkage

Datenbereinigung in Python

Record linkage – Datensatzverknüpfung

Das Paket recordlinkage

Datenbereinigung in Python

Unsere DataFrames

census_A

             given_name  surname date_of_birth         suburb state  address_1
rec_id                                                                
rec-1070-org   michaela  neumann      19151111  winston hills   cal  stanley street 
rec-1016-org   courtney  painter      19161214      richlands   txs  pinkerton circuit 
...

census_B

               given_name  surname date_of_birth             suburb state  address_1
rec_id                                                                      
rec-561-dup-0       elton      NaN      19651013         windermere   ny   light setreet 
rec-2642-dup-0   mitchell    maxon      19390212         north ryde   cal  edkins street 
...
Datenbereinigung in Python

Datenpaare generieren

Datenbereinigung in Python

Datenpaare generieren

Datenbereinigung in Python

Blocking

Datenbereinigung in Python

Datenpaare generieren

# Import recordlinkage
import recordlinkage

# Create indexing object indexer = recordlinkage.Index()
# Generate pairs blocked on state indexer.block('state') pairs = indexer.index(census_A, census_B)
Datenbereinigung in Python

Datenpaare generieren

print(pairs)
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org', 
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org', 
'rec-1280-org',...........  
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61, 
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])
Datenbereinigung in Python

Vergleich der DataFrames

# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object compare_cl = recordlinkage.Compare()
# Find exact matches for pairs of date_of_birth and state compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth') compare_cl.exact('state', 'state', label='state')
# Find similar matches for pairs of surname and address_1 using string similarity compare_cl.string('surname', 'surname', threshold=0.85, label='surname') compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
# Find matches potential_matches = compare_cl.compute(pairs, census_A, census_B)
Datenbereinigung in Python

Passende Paare finden

print(potential_matches)
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...
Datenbereinigung in Python

Nur die Paare finden, die wir wollen

potential_matches[potential_matches.sum(axis = 1) => 2]
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3964-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
Datenbereinigung in Python

Lass uns üben!

Datenbereinigung in Python

Preparing Video For Download...