Introduction to K-anonymity

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Why is k-anonymity important?

Image showing the governor of Massachusetts on the left, the text "Identifying the governor of Massachusetts" in the middle, and Harvard Professor Latanya Sweeney on the right

Why is it important?

  • Six people had his birth date
  • Only three were men
  • He was the only one in his ZIP code

Diagram showing two overlapping datasets, medical data and a voter list, whose intersection is ZIP code, birth date, and gender
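
The linkage in the diagram can be sketched as a join on the three shared columns. Everything below is hypothetical toy data: the column names, the records, and the placeholder names are assumptions made for illustration, not the real medical or voter data.

```python
import pandas as pd

# Hypothetical "anonymized" medical records: names removed, demographics kept
medical = pd.DataFrame({
    "zip_code": ["02138", "02139", "02138"],
    "birth_date": ["1950-01-01", "1962-05-17", "1971-11-30"],
    "gender": ["M", "F", "M"],
    "diagnosis": ["Flu", "Asthma", "Anxiety disorders"],
})

# Hypothetical public voter list: names listed next to the same demographics
voters = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "zip_code": ["02138", "02139", "02140"],
    "birth_date": ["1950-01-01", "1962-05-17", "1980-02-02"],
    "gender": ["M", "F", "F"],
})

# Join on the shared columns: every match attaches a name to a diagnosis
linked = medical.merge(voters, on=["zip_code", "birth_date", "gender"], how="inner")
print(linked[["name", "diagnosis"]])
```

Any medical record whose combination of ZIP code, birth date, and gender matches exactly one voter is re-identified by this join, even though the medical data contains no names.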

How to prevent this attack?

Only one record had the demographic values of the governor.

Deleting all the demographic information would leave the data useless for analysis.

What if there were a middle ground that ensures the demographic values are no longer unique in the dataset?

Definition of k-anonymity

A dataset is k-anonymous when, for each individual, at least k individuals in the dataset share the same values for the set of attributes that might be identifying.

Definition of k-anonymity

A large group of people crossing a street

K-anonymous dataset

2-anonymous:

     ZIP code    Age
0    4217        34
1    4217        34
2    1742        77
3    1742        77

Every combination of values for the identifying columns in the dataset appears in at least k different records.

Not 2-anonymous:

     ZIP code    Age
0    4217        34
1    4217        34
2    1742        77
3    1743        77
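
The two tables above can also be checked programmatically. The sketch below is one possible way to do it; the helper name `is_k_anonymous` and the DataFrame names are assumptions introduced for illustration.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Return True if every observed combination of quasi-identifier
    values appears in at least k rows of df."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# The two example tables from the slide
anonymous_df = pd.DataFrame({
    "ZIP code": [4217, 4217, 1742, 1742],
    "Age": [34, 34, 77, 77],
})
not_anonymous_df = pd.DataFrame({
    "ZIP code": [4217, 4217, 1742, 1743],
    "Age": [34, 34, 77, 77],
})

print(is_k_anonymous(anonymous_df, ["ZIP code", "Age"], k=2))      # True
print(is_k_anonymous(not_anonymous_df, ["ZIP code", "Age"], k=2))  # False
```

The second table fails because the combinations (1742, 77) and (1743, 77) each appear only once.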

K-anonymity: Terminology

Diagram showing the terms of direct identifiers, quasi-identifiers and sensitive attributes

Medical data dataset

# Explore DataFrame 
medical_df.head()
    Age    Department   Condition
0    34    Marketing    Anxiety disorders
1    46    Finance      Flu
2    41    Finance      Flu
3    62    Marketing    Anxiety disorders
4    44    Marketing    Anxiety disorders

Privacy attributes

Privacy attributes:

  • Identifying: for example, SSNs and ID numbers
  • Quasi-identifying: Age and Department in our dataset
  • Sensitive: the medical condition (Condition) in our dataset

Exploring unique combinations of quasi-identifiers

# Count how many records share each combination of Age and Department
medical_df.groupby(['Age','Department']).size().reset_index(name='Count')
     Age    Department    Count
0     30    Production    2
1     31    Marketing     1
2     32    Marketing     1
3     32    Production    1
4     33    Production    1
5     34    Finance       1
6     34    Marketing     1
7     34    Production    1
8     35    Marketing     1
9     36    Finance       2
10    38    Finance       1
11    38    Production    1
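
Most combinations above have a count of 1, so those records are unique on their quasi-identifiers. The smallest count is also the largest k for which the data is k-anonymous. A minimal sketch of that reading, using a small hypothetical `sample_df` (modeled on a few rows of the output above, not the real `medical_df`):

```python
import pandas as pd

# Hypothetical sample that mimics a few of the combinations shown above
sample_df = pd.DataFrame({
    "Age": [30, 30, 31, 36, 36],
    "Department": ["Production", "Production", "Marketing", "Finance", "Finance"],
})

# Count how many records share each combination of quasi-identifiers
counts = sample_df.groupby(["Age", "Department"]).size().reset_index(name="Count")

# The smallest group size is the largest k the dataset satisfies
max_k = int(counts["Count"].min())

# Combinations that appear exactly once point to a single record
unique_rows = counts[counts["Count"] == 1]

print(max_k)
print(unique_rows)
```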

Approach: generalization

import pandas as pd

# Generalize Age by cutting it into 4 equal-width intervals
medical_df['Age_group'] = pd.cut(medical_df['Age'], bins=4)

# Explore the dataset with intervals
medical_df.head()
    Age    Department    Condition            Age_group
0    34    Marketing     Anxiety disorders    (29.964, 39.0]
1    46    Finance       Flu                  (39.0, 48.0]
2    41    Finance       Flu                  (39.0, 48.0]
3    62    Marketing     Anxiety disorders    (57.0, 66.0]
4    44    Marketing     Anxiety disorders    (39.0, 48.0]
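
With an integer `bins`, `pd.cut` derives equal-width intervals from the range of the data and widens that range slightly so the minimum is included, which is presumably why the lowest edge above is 29.964 rather than 30. When rounder groups are preferred, explicit edges and labels can be passed instead. The sketch below assumes decade-wide bins and label names chosen purely for illustration.

```python
import pandas as pd

# The five ages shown in the table above
ages = pd.Series([34, 46, 41, 62, 44], name="Age")

# Hypothetical decade-wide bins; intervals are right-inclusive, e.g. (29, 39]
age_group = pd.cut(
    ages,
    bins=[29, 39, 49, 59, 69],
    labels=["30-39", "40-49", "50-59", "60-69"],
)

print(age_group.tolist())
# ['30-39', '40-49', '40-49', '60-69', '40-49']
```

Wider bins make groups larger and k-anonymity easier to reach, at the cost of less precise ages for analysis.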

Approach: generalization

# Count how many records share each combination of Age_group and Department
medical_df.groupby(['Age_group','Department']).size().reset_index(name='Count')
     Age_group         Department    Count
0    (29.964, 39.0]    Finance       4
1    (29.964, 39.0]    Marketing     4
2    (29.964, 39.0]    Production    6
3    (39.0, 48.0]      Finance       8
4    (39.0, 48.0]      Marketing     5
5    (39.0, 48.0]      Production    4
6    (48.0, 57.0]      Finance       3
7    (48.0, 57.0]      Marketing     2
8    (48.0, 57.0]      Production    4

Approach: generalization

# Set k to 2, for a 2-anonymous dataset
k = 2

# Count how many records share each combination of Age_group and Department
df_count = medical_df.groupby(['Age_group','Department']).size().reset_index(name='Count')

# Keep only the combinations with fewer than k records;
# an empty result means the dataset is k-anonymous for these columns
df_count[df_count['Count'] < k]
    Age_group    Department    Count
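
The result above is empty, so no combination falls below k. For contrast, the sketch below runs the same check on a small hypothetical `toy_df` where some groups do fall below k. The data and bin edges are assumptions for illustration. `observed=True` is passed so that only combinations that actually occur are counted, since grouping on a categorical column can otherwise add empty combinations with a count of 0, depending on the pandas version.

```python
import pandas as pd

# Hypothetical toy data, not the real medical_df
toy_df = pd.DataFrame({
    "Age": [31, 35, 38, 42, 45, 58],
    "Department": ["Finance", "Finance", "Marketing",
                   "Marketing", "Marketing", "Finance"],
})

k = 2

# Generalize Age into assumed 10-year intervals
toy_df["Age_group"] = pd.cut(toy_df["Age"], bins=[29, 39, 49, 59])

# Count records per observed combination of quasi-identifiers
df_count = (
    toy_df.groupby(["Age_group", "Department"], observed=True)
    .size()
    .reset_index(name="Count")
)

# Combinations shared by fewer than k records violate k-anonymity
violations = df_count[df_count["Count"] < k]
print(violations)
```

Here two combinations contain a single record each, so this toy dataset is not 2-anonymous until Age is generalized further.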

Let's k-anonymize!
