Generar pares

Limpieza de datos en Python

Adel Nehme

VP of AI Curriculum, DataCamp

Motivación

Tablas de juegos de la NBA

Limpieza de datos en Python

Cuando las uniones no funcionan

Vinculación de registros

Limpieza de datos en Python

Vinculación de registros

El paquete « recordlinkage »

Limpieza de datos en Python

Nuestros DataFrames

census_A

             given_name  surname date_of_birth         suburb state  address_1
rec_id                                                                
rec-1070-org   michaela  neumann      19151111  winston hills   cal  stanley street 
rec-1016-org   courtney  painter      19161214      richlands   txs  pinkerton circuit 
...

census_B

               given_name  surname date_of_birth             suburb state  address_1
rec_id                                                                      
rec-561-dup-0       elton      NaN      19651013         windermere   ny   light setreet 
rec-2642-dup-0   mitchell    maxon      19390212         north ryde   cal  edkins street 
...
Limpieza de datos en Python

Generar pares

Limpieza de datos en Python

Generar pares

Limpieza de datos en Python

Bloqueo

Limpieza de datos en Python

Generar pares

# Import recordlinkage
import recordlinkage

# Create indexing object indexer = recordlinkage.Index()
# Generate pairs blocked on state indexer.block('state') pairs = indexer.index(census_A, census_B)
Limpieza de datos en Python

Generar pares

print(pairs)
MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org', 
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org', 
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org', 
'rec-1280-org',...........  
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61, 
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])
Limpieza de datos en Python

Comparación de los DataFrames

# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object compare_cl = recordlinkage.Compare()
# Find exact matches for pairs of date_of_birth and state compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth') compare_cl.exact('state', 'state', label='state')
# Find similar matches for pairs of surname and address_1 using string similarity compare_cl.string('surname', 'surname', threshold=0.85, label='surname') compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
# Find matches potential_matches = compare_cl.compute(pairs, census_A, census_B)
Limpieza de datos en Python

Encontrar pares coincidentes

print(potential_matches)
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...
Limpieza de datos en Python

Encontrar los únicos pares que queremos

potential_matches[potential_matches.sum(axis = 1) => 2]
                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2                                                
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3964-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
Limpieza de datos en Python

¡Vamos a practicar!

Limpieza de datos en Python

Preparing Video For Download...