Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
"The ability to ensure flows of information that satisfy social and legal norms."
Personally identifiable information (PII): data that, when used alone or with other relevant data, can identify someone.
Non-sensitive PII (quasi-identifiers): data that cannot be used alone to trace a person, such as gender, occupation, zip code, or city of birth.
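Columns of this kind (often called quasi-identifiers) say little one at a time, but combining them can still single someone out. The sketch below illustrates that with a tiny made-up table. The DataFrame `people`, its values, and the list `quasi_identifiers` are all hypothetical and are not taken from the course data.

```python
import pandas as pd

# Hypothetical toy records; none of these values come from the course data.
people = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F"],
    "occupation": ["nurse", "nurse", "teacher", "pilot", "nurse"],
    "zip_code": ["10001", "10001", "10001", "94105", "94105"],
})

quasi_identifiers = ["gender", "occupation", "zip_code"]

# On its own, gender never isolates a row: every value is shared.
gender_always_shared = people.duplicated(subset=["gender"], keep=False).all()

# Combined, some rows have a combination of values that nobody else shares.
shared_combination = people.duplicated(subset=quasi_identifiers, keep=False)
unique_rows = people[~shared_combination]

print(gender_always_shared)
print(unique_rows)
```

Rows that end up in `unique_rows` are the ones most exposed to re-identification, which is one reason such columns may need treatment even when names are already removed.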
GDPR (General Data Protection Regulation): protects the PII of people living, or whose data is processed, within the European Union.
Data suppression: removing selected information to protect the privacy of subjects.
# Attribute suppression on Sensitive PII "name"
suppressed_salaries = salaries.drop('name', axis="columns")

# Explore obtained dataset
suppressed_salaries.head()
   gender  status        salary  pay_basis  position_title
0  Male    Employee   64400.0  Per Annum  DEPUTY DIRECTOR
1  Male    Employee   43600.0  Per Annum  ASSOCIATE DIRECTOR
2  Male    Employee  120000.0  Per Annum  SPECIAL ASSISTANT TO THE PRESIDENT AND DEPUTY ...
3  Male    Employee   86200.0  Per Annum  LEAD ADVANCE REPRESENTATIVE
4  Male    Employee  106000.0  Per Annum  SPECIAL ASSISTANT TO THE PRESIDENT AND DIRECTO...
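A self-contained sketch of the same attribute suppression, for readers who want to run it without the course dataset. The two-row `salaries` DataFrame here is a hypothetical stand-in with invented names and values. It also shows that `drop` returns a new DataFrame, so the original still holds the sensitive column.

```python
import pandas as pd

# Hypothetical stand-in for the salaries data; names and values are invented.
salaries = pd.DataFrame({
    "name": ["Ana Diaz", "Bo Chen"],
    "gender": ["Female", "Male"],
    "salary": [64400.0, 43600.0],
})

# Attribute suppression: remove the whole "name" column.
suppressed_salaries = salaries.drop(columns=["name"])

# drop() does not modify its input, so "name" is still in the original.
remaining_columns = list(suppressed_salaries.columns)
original_still_has_name = "name" in salaries.columns

print(remaining_columns)
print(original_still_has_name)
```

Because the original DataFrame is unchanged, only `suppressed_salaries` should be shared or written out.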
# Explore the DataFrame
salaries.head()
   hours  performance         salary
0     72           51     $80,500.00
1     20           99  $2,805,000.00
2     75           62     $75,800.00
3     74           58     $60,000.00
4     70           54     $79,000.00
# Drop rows with salaries higher than 2,000,000
# (assumes the "salary" column already holds numeric values, not "$80,500.00"-style strings)
salaries = salaries.drop(salaries[salaries['salary'] > 2000000].index)
# See resulting DataFrame
salaries.head()
   hours  performance  salary
0     72           51   80500
2     75           62   75800
3     74           58   60000
4     70           54   79000
5     68           53   62000
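The first preview shows salaries as currency strings, while the drop compares against a number, so some conversion presumably happens in between. The slides do not show that step. The sketch below is one possible way to do it, not necessarily what the course used. It rebuilds the first three rows of the preview, parses the strings, and then suppresses the outlier record with a boolean mask.

```python
import pandas as pd

# First three rows of the preview, with salary still stored as text.
salaries = pd.DataFrame({
    "hours": [72, 20, 75],
    "performance": [51, 99, 62],
    "salary": ["$80,500.00", "$2,805,000.00", "$75,800.00"],
})

# Assumed cleaning step: strip "$" and "," and convert to float.
salaries["salary"] = (
    salaries["salary"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Record suppression: keep only rows at or below the threshold.
salaries = salaries[salaries["salary"] <= 2_000_000]

print(salaries)
```

Filtering with a mask keeps the original index labels, which is why the label of the removed row is missing from the result, just as in the output above.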