Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
fake_data.name()
'Kelly Clark'
fake_data.name_male()
'Antonio Henderson'
fake_data.name_female()
'Jennifer Ortega'
clients_df
gender active
0 Female No
1 Male Yes
2 Male No
3 Female Yes
4 Male Yes
... ... ...
1465 Male Yes
1466 Male Yes
1467 Male Yes
1468 Male Yes
1469 Male Yes
1470 rows × 2 columns
# Import the Faker class
from faker import Faker

# Initialize a Faker instance
fake_data = Faker()

# Generate a name according to gender, unique within the dataset
clients_df['name'] = [fake_data.unique.name_female() if x == "Female"
                      else fake_data.unique.name_male()
                      for x in clients_df['gender']]

# Explore the dataset
clients_df
gender active name
0 Female No Michelle Lang
1 Male Yes Robert Norton
2 Male No Matthew Brown
3 Female Yes Sherry Jones
4 Male Yes Steven Vega
... ... ... ...
1465 Male Yes Bradley Smith
1466 Male Yes Tyler Yu
1467 Male Yes Mr. Joshua Gallegos
1468 Male Yes Brian Aguilar
1469 Male Yes David Johnson
1470 rows × 3 columns
# Generate a random city
clients_df['city'] = [fake_data.city() for x in range(len(clients_df))]
clients_df.head()
gender active name city
0 Female No Stacy Hooper Reedland
1 Male Yes Michael Rogers North Michellestad
2 Male No James Sanchez West Josephburgh
3 Female Yes Taylor Berger Hermanton
4 Male Yes Joshua Coleman South Amandaland
# Generate emails with different domains
clients_df['contact email'] = [fake_data.company_email() for x in range(len(clients_df))]

# Explore the dataset
clients_df.head()
gender active name city contact email
0 Female No Stacy Hooper Reedland [email protected]
1 Male Yes Michael Rogers North Michellestad [email protected]
2 Male No James Sanchez West Josephburgh [email protected]
3 Female Yes Taylor Berger Hermanton [email protected]
4 Male Yes Joshua Coleman South Amandaland [email protected]
# Generate emails with company domains, using the name as the username
clients_df['contact email'] = [x.replace(" ", "") + "@" + fake_data.domain_name()
                               for x in clients_df['name']]

# See the resulting DataFrame
clients_df.head()
gender active name city contact email
0 Female No Stacy Hooper Reedland [email protected]
1 Male Yes Michael Rogers North Michellestad [email protected]
2 Male No James Sanchez West Josephburgh [email protected]
3 Female Yes Taylor Berger Hermanton [email protected]
4 Male Yes Joshua Coleman South Amandaland [email protected]
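Removing only the spaces keeps capital letters and punctuation, so a generated name such as "Mr. Joshua Gallegos" would yield the username "Mr.JoshuaGallegos". A minimal sketch of one way to normalize the username first follows. The helper email_from_name is hypothetical, and example.com is a reserved placeholder domain used here instead of fake_data.domain_name() so the snippet runs without Faker.

```python
import re

def email_from_name(name, domain):
    """Hypothetical helper: build a lowercase, dot-separated email address."""
    # Keep only runs of ASCII letters and digits; punctuation is dropped.
    # Note: non-ASCII letters would be dropped too in this simple version.
    parts = re.findall(r"[a-z0-9]+", name.lower())
    return ".".join(parts) + "@" + domain

email_plain = email_from_name("Stacy Hooper", "example.com")
email_prefixed = email_from_name("Mr. Joshua Gallegos", "example.com")
```

Here email_plain is "stacy.hooper@example.com" and email_prefixed is "mr.joshua.gallegos@example.com".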
# Generate dates between two points in time
clients_df['date'] = [fake_data.date_between(start_date="-10y", end_date="now")
                      for x in range(len(clients_df))]

# See the resulting DataFrame
clients_df.head()
gender active name city contact email date
0 Female No Stacy Hooper Reedland [email protected] 2019-11-20
1 Male Yes Michael Rogers North Michellestad [email protected] 2015-02-22
2 Male No James Sanchez West Josephburgh [email protected] 2015-12-11
3 Female Yes Taylor Berger Hermanton [email protected] 2012-12-13
4 Male Yes Joshua Coleman South Amandaland [email protected] 2014-05-22
When imitating a real dataset, you can avoid leaking the real value names by drawing substitute values from a chosen distribution.
# Import NumPy
import numpy as np

# Obtain or specify the probabilities
p = (0.58, 0.23, 0.16, 0.03)
cities = ["New York", "Chicago", "Seattle", "Dallas"]

# Generate cities from the selection according to the distribution
clients_df['city'] = np.random.choice(cities, size=len(clients_df), p=p)

# See the resulting dataset
clients_df.head()
gender active name city contact email date
0 Female No Stacy Hooper Chicago [email protected] 2019-11-20
1 Male Yes Michael Rogers New York [email protected] 2015-02-22
2 Male No James Sanchez New York [email protected] 2015-12-11
3 Female Yes Taylor Berger Chicago [email protected] 2012-12-13
4 Male Yes Joshua Coleman New York [email protected] 2014-05-22
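The slide gives the probabilities directly. One way they might be obtained from a real column is to count how often each value occurs and then sample substitute names with those frequencies. The sketch below assumes this approach. The real_cities values and the substitutes mapping are invented for illustration, and a seeded np.random.default_rng generator is used in place of np.random.choice only to make the run repeatable.

```python
import numpy as np

# Hypothetical "real" column; these values are made up for illustration
real_cities = np.array(["Boston"] * 6 + ["Austin"] * 3 + ["Denver"] * 1)

# Estimate the empirical distribution of the real values
values, counts = np.unique(real_cities, return_counts=True)
probabilities = counts / counts.sum()

# Map each real value to a substitute so the real names are not exposed
substitutes = {"Austin": "City A", "Boston": "City B", "Denver": "City C"}
fake_values = [substitutes[v] for v in values]

# Sample substitute names with the same relative frequencies
rng = np.random.default_rng(seed=42)
synthetic_cities = rng.choice(fake_values, size=1000, p=probabilities)
```

The synthetic column keeps roughly the original proportions (about 60%, 30% and 10% here) while containing none of the real city names.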