Data masking and data generation with Faker

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Quasi-identifiers

Non-sensitive personally identifiable information can also act as a quasi-identifier. Image: the text "Head of government in Europe" and a birth date, with lines pointing to the face of Angela Merkel

1 Photo of Angela Merkel from Wikimedia Commons

Quasi-identifiers

When combined with other quasi-identifiers, they become identifying.

  • Age
  • Gender
  • Occupation
  • Personal dates: birth, death, healthcare admission, visit, etc.

Drawing of a man surrounded by pop-ups showing what appears to be personal information

Quasi-identifiers and re-identification attacks

Diagram showing the intersection of two datasets, medical data and a voter list; the attributes they share are zip code, birth date, and gender
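
The linkage shown in the diagram can be sketched with a pandas merge. This is a minimal illustration, not a real attack: both tables, their column names, and every value in them are made up for the example.

```python
import pandas as pd

# Made-up "anonymized" medical records: names removed, quasi-identifiers kept
medical = pd.DataFrame({
    "zip_code": ["02138", "02139", "02141"],
    "birth_date": ["1945-07-31", "1962-03-14", "1980-11-02"],
    "gender": ["M", "F", "F"],
    "diagnosis": ["hypertension", "asthma", "diabetes"],
})

# Made-up public voter list that still contains names
voters = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "zip_code": ["02138", "02139", "02144"],
    "birth_date": ["1945-07-31", "1962-03-14", "1975-01-20"],
    "gender": ["M", "F", "M"],
})

# Join the two tables on the attributes they share
quasi_identifiers = ["zip_code", "birth_date", "gender"]
linked = medical.merge(voters, on=quasi_identifiers, how="inner")

# Each matched row attaches a name to a diagnosis
print(linked[["name", "diagnosis"]])
```

Rows whose combination of zip code, birth date, and gender appears in both tables are re-identified, even though the medical table never contained a name.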

More anonymization techniques

Anonymization techniques for quasi-identifiers reduce data disclosure risks. Two tables with medical and personal data: on the left, the original data; on the right, the resulting dataset with the name attribute suppressed
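
Attribute suppression, as in the two tables described above, could look like the following in pandas. The DataFrame and its column names are hypothetical stand-ins for the data on the slide.

```python
import pandas as pd

# Hypothetical stand-in for the original table on the slide
records = pd.DataFrame({
    "name": ["Person A", "Person B"],
    "age": [34, 51],
    "diagnosis": ["asthma", "diabetes"],
})

# Suppress the identifying attribute by dropping the whole column
suppressed = records.drop(columns=["name"])

print(suppressed.columns.tolist())
```

`drop` returns a new DataFrame, so the original `records` table is left unchanged.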

Data masking

Original DataFrame

     name                phone
0    Cassandra Nelson    4399406975395
1    Brian Moss          0389407128613
2    Melody Gill         8283308773967
3    Sandra Huber        4366608954250
4    Patricia Webster    4466462475574

DataFrame with the names masked

     name    phone
0    xxxx    4399406975395
1    xxxx    0389407128613
2    xxxx    8283308773967
3    xxxx    4366608954250
4    xxxx    4466462475574

Replacing sensitive values with other characters, such as "x", is known as data masking.

Data masking

# Explore the DataFrame
df.head()
     country         card_number           email
0    Finland         3546746666030419      [email protected]
1    Belarus         4303032415762821      [email protected]
2    Turkmenistan    4536883671157         [email protected]
3    Puerto Rico     3568819286614160      [email protected]
4    Angola          2514167462583016      [email protected]

Data masking

# Uniformly mask the card number column
df['card_number'] = '****'

# See the resulting DataFrame
df.head()
     country         card_number     email
0    Finland         ****            [email protected]
1    Belarus         ****            [email protected]
2    Turkmenistan    ****            [email protected]
3    Puerto Rico     ****            [email protected]
4    Angola          ****            [email protected]

Partial data masking

Screenshot of the reset your password section page of Facebook, showing email partially masked

Partial data masking

# Mask username from email
df2['email'] = df2['email'].apply(lambda s: s[0] + '****' + s[s.find('@'):] )


# See the resulting pseudonymized data
df2.head()
     country         card_number           email
0    Finland         3546746666030419      f****@gmail.com
1    Belarus         4303032415762821      m****@gmail.com
2    Turkmenistan    4536883671157         a****@gmail.com
3    Puerto Rico     3568819286614160      k****@gmail.com
4    Angola          2514167462583016      d****@gmail.com
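
The one-line lambda above assumes every value contains an "@". A slightly more defensive version is sketched below, together with a similar helper that keeps only the last digits of a card number. Both function names and their parameters are hypothetical, chosen for illustration.

```python
def mask_email(email, mask="****"):
    """Keep the first character and the domain; hide the rest of the username."""
    at = email.find("@")
    if at <= 0:
        # No '@' (or nothing before it): mask the whole value instead of leaking it
        return mask
    return email[0] + mask + email[at:]


def mask_card_number(card_number, visible=4, mask_char="*"):
    """Replace all but the last `visible` characters with `mask_char`."""
    digits = str(card_number)
    if len(digits) <= visible:
        return mask_char * len(digits)
    return mask_char * (len(digits) - visible) + digits[-visible:]


print(mask_email("jane.doe@example.com"))    # j****@example.com
print(mask_card_number("3546746666030419"))  # ************0419
```

Either helper can be passed to `apply` on a column, for example `df2['email'].apply(mask_email)`.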

Generating synthetic data

On the left, a table showing the original data including card number values, on the right synthetically generated new credit card numbers.

  • Replacing sensitive information with newly generated data, similar to the original.

Generating synthetic data

# Import Faker class
from faker import Faker


# Create fake data generator
fake_data = Faker()

# Generate a credit card number
fake_data.credit_card_number()
'3542216874440804'

Generating synthetic data

# Mask card number with newly generated data using a lambda function
df['card_number'] = df['card_number'].apply(lambda x: fake_data.credit_card_number())

# See the resulting pseudonymized data
df.head()
     country         card_number            email
0    Finland         3596625386355448       [email protected]
1    Belarus         376297265347524        [email protected]
2    Turkmenistan    4377494880888682       [email protected]
3    Puerto Rico     30553931809810         [email protected]
4    Angola          4241735748382          [email protected]

Generating other types of data

fake_data.name()
'Kelly Clark'
fake_data.name_male()
'Antonio Henderson'
fake_data.name_female()
'Jennifer Ortega'

Time to practice!
