Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
Non-sensitive personally identifiable information (PII) can also serve as quasi-identifiers.
When combined with other quasi-identifiers, these values can become identifying.
Anonymization techniques applied to quasi-identifiers reduce data disclosure risks.
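To make the combination risk concrete, here is a minimal sketch that counts how many rows share each combination of quasi-identifiers. The DataFrame, its column names, and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data: no single column identifies a person on its own
people = pd.DataFrame({
    "zip_code": ["10001", "10001", "10002", "10002", "10002"],
    "birth_year": [1985, 1985, 1990, 1990, 1972],
    "gender": ["F", "F", "M", "M", "M"],
})

quasi_identifiers = ["zip_code", "birth_year", "gender"]

# Count how many rows share each combination of quasi-identifier values
combo_counts = people.groupby(quasi_identifiers).size()

# A combination that occurs only once points to exactly one person
unique_combos = combo_counts[combo_counts == 1]
print(unique_combos)
```

Any combination with a count of 1 singles out one individual, even though each column looks harmless by itself.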
Original DataFrame
name phone
0 Cassandra Nelson 4399406975395
1 Brian Moss 0389407128613
2 Melody Gill 8283308773967
3 Sandra Huber 4366608954250
4 Patricia Webster 4466462475574
DataFrame with the names masked
name phone
0 xxxx 4399406975395
1 xxxx 0389407128613
2 xxxx 8283308773967
3 xxxx 4366608954250
4 xxxx 4466462475574
Replacing sensitive values with other characters like "x" is also known as data masking.
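The masked names shown above can be produced with a single column assignment. This is a minimal sketch, assuming the data lives in a pandas DataFrame; the variable names `contacts` and `masked` are hypothetical.

```python
import pandas as pd

# First two rows of the original DataFrame shown above
contacts = pd.DataFrame({
    "name": ["Cassandra Nelson", "Brian Moss"],
    "phone": ["4399406975395", "0389407128613"],
})

# Work on a copy so the original data stays untouched
masked = contacts.copy()

# Assigning a scalar replaces every value in the column
masked["name"] = "xxxx"
print(masked)
```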
# Explore the DataFrame
df.head()
country card_number email
0 Finland 3546746666030419 [email protected]
1 Belarus 4303032415762821 [email protected]
2 Turkmenistan 4536883671157 [email protected]
3 Puerto Rico 3568819286614160 [email protected]
4 Angola 2514167462583016 [email protected]
# Uniformly mask the card number column
df['card_number'] = '****'
# See the resulting DataFrame
df.head()
country card_number email
0 Finland **** [email protected]
1 Belarus **** [email protected]
2 Turkmenistan **** [email protected]
3 Puerto Rico **** [email protected]
4 Angola **** [email protected]
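Uniform masking removes all information from the column. A common variation, which is not part of the original example, keeps the last few digits visible. The sketch below assumes card numbers are stored as strings; `mask_card_number` is a hypothetical helper.

```python
import pandas as pd

def mask_card_number(card_number, visible=4, mask_char="*"):
    """Hypothetical helper: hide all but the last `visible` characters."""
    digits = str(card_number)
    if len(digits) <= visible:
        # Too short to reveal anything safely, so mask everything
        return mask_char * len(digits)
    return mask_char * (len(digits) - visible) + digits[-visible:]

cards = pd.Series(["3546746666030419", "4536883671157"])
masked_cards = cards.apply(mask_card_number)
print(masked_cards.tolist())
```

Whether the remaining digits are safe to expose depends on the dataset, so this trades some privacy for utility.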
# Mask the username in each email, keeping its first character and the domain
df2['email'] = df2['email'].apply(lambda s: s[0] + '****' + s[s.find('@'):])
# See the resulting pseudonymized data
df2.head()
country card_number email
0 Finland 3546746666030419 f****@gmail.com
1 Belarus 4303032415762821 m****@gmail.com
2 Turkmenistan 4536883671157 a****@gmail.com
3 Puerto Rico 3568819286614160 k****@gmail.com
4 Angola 2514167462583016 d****@gmail.com
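The lambda above assumes every value contains an "@": when it does not, `s.find('@')` returns -1 and the slice silently keeps the last character, and an empty string raises an IndexError. A more defensive sketch follows; `mask_email` is a hypothetical helper and the addresses are made up.

```python
def mask_email(email, mask="****"):
    """Hypothetical helper: keep the first character and the domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a well-formed address, so hide the whole value
        return mask
    return local[0] + mask + sep + domain

print(mask_email("jane.doe@example.com"))
print(mask_email("not-an-email"))
```

It can be applied to a column in the same way, for example `df2['email'].apply(mask_email)`.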
# Import Faker class
from faker import Faker
# Create fake data generator
fake_data = Faker()
# Generate a credit card number
fake_data.credit_card_number()
'3542216874440804'
# Mask card number with new generated data using a lambda function
df['card_number'] = df['card_number'].apply(lambda x: fake_data.credit_card_number())
# See the resulting pseudonymized data
df.head()
country card_number email
0 Finland 3596625386355448 [email protected]
1 Belarus 376297265347524 [email protected]
2 Turkmenistan 4377494880888682 [email protected]
3 Puerto Rico 30553931809810 [email protected]
4 Angola 4241735748382 [email protected]
fake_data.name()
'Kelly Clark'
fake_data.name_male()
'Antonio Henderson'
fake_data.name_female()
'Jennifer Ortega'