Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
fake_data.name()
'Kelly Clark'
fake_data.name_male()
'Antonio Henderson'
fake_data.name_female()
'Jennifer Ortega'
clients_df
gender active
0 Female No
1 Male Yes
2 Male No
3 Female Yes
4 Male Yes
... ... ...
1465 Male Yes
1466 Male Yes
1467 Male Yes
1468 Male Yes
1469 Male Yes
1470 rows × 2 columns
# Import the Faker class
from faker import Faker

# Initialize the Faker class
fake_data = Faker()

# Generate a name according to gender, unique within the dataset
clients_df['name'] = [fake_data.unique.name_female() if x == "Female"
                      else fake_data.unique.name_male()
                      for x in clients_df['gender']]
# Explore the dataset
clients_df
gender active name
0 Female No Michelle Lang
1 Male Yes Robert Norton
2 Male No Matthew Brown
3 Female Yes Sherry Jones
4 Male Yes Steven Vega
... ... ... ...
1465 Male Yes Bradley Smith
1466 Male Yes Tyler Yu
1467 Male Yes Mr. Joshua Gallegos
1468 Male Yes Brian Aguilar
1469 Male Yes David Johnson
1470 rows × 3 columns
# Generate a random city for each row
clients_df['city'] = [fake_data.city() for x in range(len(clients_df))]
clients_df.head()
gender active name city
0 Female No Stacy Hooper Reedland
1 Male Yes Michael Rogers North Michellestad
2 Male No James Sanchez West Josephburgh
3 Female Yes Taylor Berger Hermanton
4 Male Yes Joshua Coleman South Amandaland
# Generate emails with different domains
clients_df['contact email'] = [fake_data.company_email() for x in range(len(clients_df))]
# Explore the dataset
clients_df.head()
gender active name city contact email
0 Female No Stacy Hooper Reedland [email protected]
1 Male Yes Michael Rogers North Michellestad [email protected]
2 Male No James Sanchez West Josephburgh [email protected]
3 Female Yes Taylor Berger Hermanton [email protected]
4 Male Yes Joshua Coleman South Amandaland [email protected]
# Generate emails with company domains and a username resembling the name
clients_df['contact email'] = [x.replace(" ", "") + "@" + fake_data.domain_name()
                               for x in clients_df['name']]
# Explore the resulting DataFrame
clients_df.head()
gender active name city contact email
0 Female No Stacy Hooper Reedland [email protected]
1 Male Yes Michael Rogers North Michellestad [email protected]
2 Male No James Sanchez West Josephburgh [email protected]
3 Female Yes Taylor Berger Hermanton [email protected]
4 Male Yes Joshua Coleman South Amandaland [email protected]
# Generate dates between two points in time
clients_df['date'] = [fake_data.date_between(start_date="-10y", end_date="now")
                      for x in range(len(clients_df))]
# Explore the resulting DataFrame
clients_df.head()
gender active name city contact email date
0 Female No Stacy Hooper Reedland [email protected] 2019-11-20
1 Male Yes Michael Rogers North Michellestad [email protected] 2015-02-22
2 Male No James Sanchez West Josephburgh [email protected] 2015-12-11
3 Female Yes Taylor Berger Hermanton [email protected] 2012-12-13
4 Male Yes Joshua Coleman South Amandaland [email protected] 2014-05-22
When imitating a real dataset, we can avoid revealing the true names of the values.
# Import numpy
import numpy as np

# Obtain or specify the probabilities
p = (0.58, 0.23, 0.16, 0.03)
cities = ["New York", "Chicago", "Seattle", "Dallas"]

# Generate the selected cities following a distribution
clients_df['city'] = np.random.choice(cities, size=len(clients_df), p=p)
# View the resulting dataset
clients_df.head()
gender active name city contact email date
0 Female No Stacy Hooper Chicago [email protected] 2019-11-20
1 Male Yes Michael Rogers New York [email protected] 2015-02-22
2 Male No James Sanchez New York [email protected] 2015-12-11
3 Female Yes Taylor Berger Chicago [email protected] 2012-12-13
4 Male Yes Joshua Coleman New York [email protected] 2014-05-22
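The probabilities above are typed in by hand. When the original column is available, they could instead be estimated from it. The following is a sketch under stated assumptions: `real_cities` is a hypothetical stand-in for the real column, built here so that its frequencies match the slide's values, and the seed and sample size are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical "real" column; counts chosen to mirror the slide's probabilities.
real_cities = pd.Series(
    ["New York"] * 58 + ["Chicago"] * 23 + ["Seattle"] * 16 + ["Dallas"] * 3
)

# Relative frequency of each city in the real data (sums to 1).
freq = real_cities.value_counts(normalize=True)

# Sample synthetic cities following the observed distribution.
rng = np.random.default_rng(seed=0)  # arbitrary seed for repeatability
synthetic_cities = rng.choice(freq.index.to_numpy(), size=1000, p=freq.to_numpy())

# Compare observed and synthetic proportions side by side.
print(freq)
print(pd.Series(synthetic_cities).value_counts(normalize=True))
```

The synthetic column keeps roughly the same proportions as the original while the row-level link between a client and their actual city is lost.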