Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
fake_data.name()
'Kelly Clark'
fake_data.name_male()
'Antonio Henderson'
fake_data.name_female()
'Jennifer Ortega'
clients_df
gender active
0 Female No
1 Male Yes
2 Male No
3 Female Yes
4 Male Yes
... ... ...
1465 Male Yes
1466 Male Yes
1467 Male Yes
1468 Male Yes
1469 Male Yes
1470 rows × 2 columns
# Import the Faker class
from faker import Faker
# Initialize a Faker instance
fake_data = Faker()
# Generate a name according to the gender, unique within the dataset
clients_df['name'] = [fake_data.unique.name_female() if x == "Female"
                      else fake_data.unique.name_male()
                      for x in clients_df['gender']]
# Explore the dataset
clients_df
gender active name
0 Female No Michelle Lang
1 Male Yes Robert Norton
2 Male No Matthew Brown
3 Female Yes Sherry Jones
4 Male Yes Steven Vega
... ... ... ...
1465 Male Yes Bradley Smith
1466 Male Yes Tyler Yu
1467 Male Yes Mr. Joshua Gallegos
1468 Male Yes Brian Aguilar
1469 Male Yes David Johnson
1470 rows × 3 columns
# Generating a random city
clients_df['city'] = [fake_data.city() for x in range(len(clients_df))]
clients_df.head()
gender active name city
0 Female No Stacy Hooper Reedland
1 Male Yes Michael Rogers North Michellestad
2 Male No James Sanchez West Josephburgh
3 Female Yes Taylor Berger Hermanton
4 Male Yes Joshua Coleman South Amandaland
# Generating emails with different domains
clients_df['contact email'] = [fake_data.company_email() for x in range(len(clients_df))]
# Explore the dataset
clients_df.head()
gender active name city contact email
0 Female No Stacy Hooper Reedland [email protected]
1 Male Yes Michael Rogers North Michellestad [email protected]
2 Male No James Sanchez West Josephburgh [email protected]
3 Female Yes Taylor Berger Hermanton [email protected]
4 Male Yes Joshua Coleman South Amandaland [email protected]
# Generating emails with company-like domains and usernames derived from the name
clients_df['contact email'] = [x.replace(" ", "") + "@" + fake_data.domain_name() for x in clients_df['name']]
# Explore the resulting DataFrame
clients_df.head()
gender active name city contact email
0 Female No Stacy Hooper Reedland [email protected]
1 Male Yes Michael Rogers North Michellestad [email protected]
2 Male No James Sanchez West Josephburgh [email protected]
3 Female Yes Taylor Berger Hermanton [email protected]
4 Male Yes Joshua Coleman South Amandaland [email protected]
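Removing only the spaces keeps capital letters and punctuation, so a name such as 'Mr. Joshua Gallegos' would yield a username like 'Mr.JoshuaGallegos'. A minimal sketch of a stricter normalization follows; the helper name `email_from_name` and the domain `example.com` are hypothetical and not part of the original slides:

```python
import re


def email_from_name(name, domain):
    """Build an email-like string from a full name and a domain."""
    # Lowercase the name and keep only runs of letters and digits
    parts = re.findall(r"[a-z0-9]+", name.lower())
    # Join the name parts with dots to form the username
    return ".".join(parts) + "@" + domain


print(email_from_name("Mr. Joshua Gallegos", "example.com"))
# mr.joshua.gallegos@example.com
```

In the list comprehension above, the helper could be called as `email_from_name(x, fake_data.domain_name())` for each value in the name column.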
# Generating dates between two times
clients_df['date'] = [fake_data.date_between(start_date="-10y", end_date="now") for x in range(len(clients_df))]
# Explore the resulting DataFrame
clients_df.head()
gender active name city contact email date
0 Female No Stacy Hooper Reedland [email protected] 2019-11-20
1 Male Yes Michael Rogers North Michellestad [email protected] 2015-02-22
2 Male No James Sanchez West Josephburgh [email protected] 2015-12-11
3 Female Yes Taylor Berger Hermanton [email protected] 2012-12-13
4 Male Yes Joshua Coleman South Amandaland [email protected] 2014-05-22
When imitating a real dataset, we can generate values that follow its distribution instead of copying the real values, which avoids leaking them.
# Import numpy
import numpy as np
# Obtain or specify the probabilities
p = (0.58, 0.23, 0.16, 0.03)
cities = ["New York", "Chicago", "Seattle", "Dallas"]
# Generate the cities from the selected ones following a distribution
clients_df['city'] = np.random.choice(cities, size=len(clients_df), p=p)
# See the resulting dataset
clients_df.head()
gender active name city contact email date
0 Female No Stacy Hooper Chicago [email protected] 2019-11-20
1 Male Yes Michael Rogers New York [email protected] 2015-02-22
2 Male No James Sanchez New York [email protected] 2015-12-11
3 Female Yes Taylor Berger Chicago [email protected] 2012-12-13
4 Male Yes Joshua Coleman New York [email protected] 2014-05-22
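The probabilities do not have to be typed by hand. One way to obtain them, sketched below, is to take the relative frequencies of an existing column with `value_counts(normalize=True)` and pass them to a seeded NumPy generator. The `real_cities` series, the seed, and the sample size are hypothetical stand-ins chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a real column whose distribution we want to imitate
real_cities = pd.Series(
    ["New York"] * 58 + ["Chicago"] * 23 + ["Seattle"] * 16 + ["Dallas"] * 3
)

# Relative frequency of each category (the values sum to 1)
freqs = real_cities.value_counts(normalize=True)

# Seeded generator so the sampling is reproducible
rng = np.random.default_rng(42)

# Draw synthetic cities according to the observed frequencies
synthetic = rng.choice(freqs.index.to_numpy(), size=1000, p=freqs.to_numpy())

print(pd.Series(synthetic).value_counts(normalize=True))
```

With enough samples, the printed proportions should land close to the original frequencies, while individual rows no longer correspond to any real record.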