Membuat pasangan

Membersihkan Data di Python

Adel Nehme

VP of AI Curriculum, DataCamp

Motivasi

Tabel pertandingan NBA

Membersihkan Data di Python

Saat join tidak berhasil

pencocokan rekaman

Membersihkan Data di Python

Record linkage

Paket recordlinkage

Membersihkan Data di Python

DataFrame kita

census_A

             given_name  surname date_of_birth         suburb state  address_1
rec_id                                                                
rec-1070-org   michaela  neumann      19151111  winston hills   cal  stanley street 
rec-1016-org   courtney  painter      19161214      richlands   txs  pinkerton circuit 
...

census_B

               given_name  surname date_of_birth             suburb state  address_1
rec_id                                                                      
rec-561-dup-0       elton      NaN      19651013         windermere   ny   light setreet 
rec-2642-dup-0   mitchell    maxon      19390212         north ryde   cal  edkins street 
...
Membersihkan Data di Python

Membuat pasangan

Membersihkan Data di Python

Membuat pasangan

Membersihkan Data di Python

Pemblokiran

Membersihkan Data di Python

Membuat pasangan

# Import recordlinkage
import recordlinkage

# Buat objek indexing indexer = recordlinkage.Index()
# Hasilkan pasangan diblok berdasarkan state indexer.block('state') pairs = indexer.index(census_A, census_B)
Membersihkan Data di Python

Membuat pasangan

print(pairs)
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org', 
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org', 
'rec-1280-org',...........  
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61, 
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])
Membersihkan Data di Python

Membandingkan DataFrame

# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object compare_cl = recordlinkage.Compare()
# Cari kecocokan tepat untuk pasangan date_of_birth dan state compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth') compare_cl.exact('state', 'state', label='state')
# Cari kecocokan mirip untuk pasangan surname dan address_1 dengan kemiripan string compare_cl.string('surname', 'surname', threshold=0.85, label='surname') compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
# Temukan kecocokan potential_matches = compare_cl.compute(pairs, census_A, census_B)
Membersihkan Data di Python

Menemukan pasangan cocok

print(potential_matches)
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...
Membersihkan Data di Python

Memilah pasangan yang diinginkan

potential_matches[potential_matches.sum(axis = 1) => 2]
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3964-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
Membersihkan Data di Python

Ayo berlatih!

Membersihkan Data di Python

Preparing Video For Download...