What's private, and why do we care?

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Facebook and Cambridge Analytica scandal

Impact of data privacy

  • Unauthorized access to 87 million people's personal information
  • Data used to build psychological profiles of American voters
  • Profiles used to persuade those voters during political campaigns

Mark Zuckerberg testifying before the U.S. Congress, facing the consequences of not complying with data privacy rights

What's privacy?

People walking on the street

Information flow and privacy

  • How your personal information flows

 

"The ability to ensure flows of information that satisfy social and legal norms."

Facial recognition system applied to the face of a woman in the street, with a GPS icon above her head

Personally identifiable information (PII)

Data that, when used alone or with other relevant data, can identify someone.

Paper form with space for writing personal information details, and a pen

Sensitive PII

  • Clearly about someone
  • Exposure can lead to harm, embarrassment or inconvenience

Drawing of a man surrounded by pop-ups showing what appears to be personal information

Sensitive PII

  • Full name
  • Social Security Number (SSN)
  • Financial information
  • Medical records

GDPR image showing the maximum penalties for those who don't comply with the regulations

Non-sensitive PII

Data that cannot be used alone to trace a person

  • Gender
  • Occupation
  • Zip code
  • City of birth
Combined with other information, these attributes can still identify someone!

Text saying "Head of government in Europe" and a birth date, along with lines pointing to the face of Angela Merkel

1 Photo of Angela Merkel from Wikimedia Commons.
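
The point above can be checked on data: count how many rows are unique on each attribute alone, then on the combination. The sketch below uses a small made-up table; the rows and the column names (gender, occupation, zip_code) are assumptions for illustration, not data from the course.

```python
import pandas as pd

# Hypothetical records containing only non-sensitive PII
people = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F"],
    "occupation": ["Teacher", "Teacher", "Teacher", "Pilot", "Pilot"],
    "zip_code": ["10115", "10115", "10115", "80331", "80331"],
})

quasi_identifiers = ["gender", "occupation", "zip_code"]

# Rows that are unique when looking at one attribute at a time
unique_alone = {
    col: int((~people.duplicated(subset=[col], keep=False)).sum())
    for col in quasi_identifiers
}

# Rows that are unique when the attributes are combined
unique_combined = int((~people.duplicated(subset=quasi_identifiers, keep=False)).sum())

print(unique_alone)
print(unique_combined)
```

In this toy table no single attribute singles anyone out, yet three of the five rows become unique once the attributes are combined.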

GDPR: EU General Data Protection Regulation

Protects the PII of people in the European Union (EU), including when their data is processed by organizations based outside it.

Key principles of the GDPR

  1. Lawfulness, fairness, and transparency
  2. Purpose limitation
  3. Data minimization
  4. Accuracy
  5. Storage limitation
  6. Integrity and confidentiality
  7. Accountability

GDPR image showing the maximum penalties for those who don't comply with the regulations

Data suppression

Removing selected information to protect the privacy of subjects.

Attribute suppression
  • Removing columns entirely
Cell/record suppression
  • Removing or replacing data in rows or cells
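
To make the distinction concrete, here is a minimal sketch of attribute, cell, and record suppression applied to the same toy table. The DataFrame, its column names, and the 2,000,000 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
employees = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo"],
    "occupation": ["Nurse", "Engineer", "CEO"],
    "salary": [52000, 61000, 2805000],
})

# Attribute suppression: remove the whole "name" column
no_names = employees.drop("name", axis="columns")

# Cell suppression: blank out only the outlying salary values (they become NaN)
cell_suppressed = no_names.copy()
cell_suppressed["salary"] = cell_suppressed["salary"].mask(
    cell_suppressed["salary"] > 2_000_000
)

# Record suppression: remove every row that holds an outlying salary
record_suppressed = no_names[no_names["salary"] <= 2_000_000]

print(cell_suppressed)
print(record_suppressed)
```

Cell suppression keeps the row, so the other attributes stay available for analysis. Record suppression removes the subject from the dataset altogether.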

Attribute suppression on a dataset

# Attribute suppression on the sensitive PII column "name"
suppressed_salaries = salaries.drop('name', axis="columns")

# Explore the resulting dataset
suppressed_salaries.head()
     gender    status    salary     pay_basis    position_title
0    Male    Employee    64400.0    Per Annum    DEPUTY DIRECTOR
1    Male    Employee    43600.0    Per Annum    ASSOCIATE DIRECTOR
2    Male    Employee    120000.0   Per Annum    SPECIAL ASSISTANT TO THE PRESIDENT AND DEPUTY ...
3    Male    Employee    86200.0    Per Annum    LEAD ADVANCE REPRESENTATIVE
4    Male    Employee    106000.0   Per Annum    SPECIAL ASSISTANT TO THE PRESIDENT AND DIRECTO...

Record suppression on a dataset

# Explore the DataFrame
salaries.head()
      hours    performance    salary
0     72       51             80500
1     20       99             2805000
2     75       62             75800
3     74       58             60000
4     70       54             79000

# Drop rows with salaries higher than 2,000,000
salaries = salaries.drop(salaries[salaries["salary"] > 2000000].index)

# See the resulting DataFrame
salaries.head()
      hours    performance    salary 
0     72       51             80500 
2     75       62             75800 
3     74       58             60000 
4     70       54             79000 
5     68       53             62000 

Suppression and linkage attacks

Image showing two tables: on the left, the medical data that remains after attribute suppression; on the right, voter registration data that shares some of the same attributes
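
A linkage attack of this kind can be sketched as a join on the attributes the two tables share. Both tables below are invented for illustration: the names, values, and column names (zip_code, birth_date, gender, diagnosis) are assumptions, not real records.

```python
import pandas as pd

# Hypothetical medical data after attribute suppression (names removed)
medical = pd.DataFrame({
    "zip_code": ["02138", "02139", "02141"],
    "birth_date": ["1945-07-31", "1962-03-14", "1980-11-02"],
    "gender": ["M", "F", "F"],
    "diagnosis": ["Hypertension", "Asthma", "Diabetes"],
})

# Hypothetical public voter registration data (names included)
voters = pd.DataFrame({
    "name": ["John Doe", "Jane Roe", "Mary Major"],
    "zip_code": ["02138", "02139", "02140"],
    "birth_date": ["1945-07-31", "1962-03-14", "1975-01-20"],
    "gender": ["M", "F", "F"],
})

# Join on the attributes both tables share
shared = ["zip_code", "birth_date", "gender"]
linked = medical.merge(voters, on=shared, how="inner")

# Each matched row now pairs a name with a diagnosis
print(linked[["name", "diagnosis"]])
```

Dropping the name column did not protect the two matched subjects, because the remaining attributes were enough to link them to a public record.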

Let's practice!
