Introduction to K-anonymity

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Why is k-anonymity important?

Image presenting the governor of Massachusetts on the left, text that says "Identifying the governor of Massachusetts", and Harvard Professor Latanya Sweeney on the right

Why is it important?

Six people shared his birth date
Only three of them were men
He was the only one of them in his ZIP code

Diagram showing the intersection of two datasets, medical data and voter list, whose shared attributes are ZIP code, birth date, and gender
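The linkage in the diagram can be sketched as a join on the shared attributes. This is a minimal illustration only: both tables, all column names, and every value below are invented toy data, not the records used in the original study.

```python
import pandas as pd

# Invented toy data: a "de-identified" medical table with no names.
medical = pd.DataFrame({
    "zip_code": ["02138", "02138", "02139"],
    "birth_date": ["1950-01-15", "1962-03-14", "1950-01-15"],
    "gender": ["M", "F", "M"],
    "diagnosis": ["Condition A", "Condition B", "Condition C"],
})

# Invented toy data: a public voter list that does include names.
voters = pd.DataFrame({
    "name": ["Person 1", "Person 2", "Person 3"],
    "zip_code": ["02138", "02138", "02139"],
    "birth_date": ["1950-01-15", "1962-03-14", "1971-09-30"],
    "gender": ["M", "F", "M"],
})

# Joining on the attributes both tables share attaches a name
# to every medical record whose combination matches a voter.
shared = ["zip_code", "birth_date", "gender"]
linked = medical.merge(voters, on=shared, how="inner")

print(linked[["name", "diagnosis"]])
```

Any medical record whose combination of the three shared attributes matches exactly one voter is re-identified, even though the medical table never contained a name.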

How to prevent this attack?

Only one record had the demographic values of the governor.

Deleting all the demographic information would leave the data useless for analysis.

Is there a middle ground that ensures the demographic values are no longer unique in the dataset?

Definition of k-anonymity

At least k individuals in the dataset share the set of attributes that might become identifying for each individual.

Definition of k-anonymity

A large group of people crossing a street

K-anonymous dataset

2-anonymous:

     ZIP code    Age
0    4217        34
1    4217        34
2    1742        77
3    1742        77

Every combination of values for the identifying columns in the dataset appears in at least k different records.

Not 2-anonymous:

     ZIP code    Age
0    4217        34
1    4217        34
2    1742        77
3    1743        77
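The two tables above can be checked programmatically. The helper below is a minimal sketch, and `is_k_anonymous` is a hypothetical name introduced here for illustration, not a pandas function.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in df is shared by at least k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# The 2-anonymous table from the slide
anonymous = pd.DataFrame({
    "ZIP code": [4217, 4217, 1742, 1742],
    "Age": [34, 34, 77, 77],
})

# The table that is not 2-anonymous: rows 2 and 3 differ in ZIP code
not_anonymous = pd.DataFrame({
    "ZIP code": [4217, 4217, 1742, 1743],
    "Age": [34, 34, 77, 77],
})

print(is_k_anonymous(anonymous, ["ZIP code", "Age"], k=2))      # True
print(is_k_anonymous(not_anonymous, ["ZIP code", "Age"], k=2))  # False
```

In the second table, the combinations (1742, 77) and (1743, 77) each appear only once, so the check fails for k = 2.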

K-anonymity: Terminology

Diagram showing the terms of direct identifiers, quasi-identifiers and sensitive attributes

Medical data dataset

# Explore DataFrame 
medical_df.head()

    Age    Department   Condition
0    34    Marketing    Anxiety disorders
1    46    Finance      Flu
2    41    Finance      Flu
3    62    Marketing    Anxiety disorders
4    44    Marketing    Anxiety disorders

Privacy attributes

Privacy attributes:

Identifying: for example, SSNs and ID numbers.
Quasi-identifying: Age and Department in our dataset.
Sensitive: the medical condition (Condition) in our dataset.

Exploring unique combinations of quasi-identifiers

# Count how many records share each combination of Age and Department
medical_df.groupby(['Age','Department']).size().reset_index(name='Count')

      Age    Department    Count
0     30     Production    2
1     31     Marketing     1
2     32     Marketing     1
3     32     Production    1
4     33     Production    1
5     34     Finance       1
6     34     Marketing     1
7     34     Production    1
8     35     Marketing     1
9     36     Finance       2
10    38     Finance       1
11    38     Production    1
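The smallest count in such a table is the largest k for which the data is k-anonymous, and a count of 1 marks a record that is unique on its quasi-identifiers. The sketch below assumes a stand-in DataFrame, `sample_df`, built only from the five rows shown earlier by `medical_df.head()`; it is not the full dataset.

```python
import pandas as pd

# Stand-in data: only the five rows displayed by medical_df.head()
sample_df = pd.DataFrame({
    "Age": [34, 46, 41, 62, 44],
    "Department": ["Marketing", "Finance", "Finance", "Marketing", "Marketing"],
    "Condition": ["Anxiety disorders", "Flu", "Flu",
                  "Anxiety disorders", "Anxiety disorders"],
})

# Count how many records share each combination of quasi-identifiers
counts = (
    sample_df.groupby(["Age", "Department"])
    .size()
    .reset_index(name="Count")
)

# The smallest group size is the largest k the data satisfies
largest_k = int(counts["Count"].min())

# Combinations held by exactly one record
unique_combinations = counts[counts["Count"] == 1]

print(largest_k)
print(len(unique_combinations))
```

A result of 1 for `largest_k` means at least one record can be singled out by Age and Department alone.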

Approach: generalization

import pandas as pd

# Generalize Age by cutting it into 4 equal-width intervals
medical_df['Age_group'] = pd.cut(medical_df['Age'], bins=4)


# Explore the dataset with intervals
medical_df.head()

    Age    Department    Condition            Age_group
0    34    Marketing     Anxiety disorders    (29.964, 39.0]
1    46    Finance       Flu                  (39.0, 48.0]
2    41    Finance       Flu                  (39.0, 48.0]
3    62    Marketing     Anxiety disorders    (57.0, 66.0]
4    44    Marketing     Anxiety disorders    (39.0, 48.0]
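With `bins=4`, `pd.cut` derives equal-width edges from the data, and it widens the range slightly so that the minimum value is included, which is where a lower edge such as 29.964 comes from. When rounder groups are preferred, explicit edges and labels can be passed instead. The edges and labels below are assumptions chosen for illustration, not values from the course.

```python
import pandas as pd

# The five ages displayed by medical_df.head()
ages = pd.Series([34, 46, 41, 62, 44], name="Age")

# Assumed edges: decades from 30 to 70.
# right=False makes each interval closed on the left: [30, 40), [40, 50), ...
edges = [30, 40, 50, 60, 70]
labels = ["30-39", "40-49", "50-59", "60-69"]

age_group = pd.cut(ages, bins=edges, labels=labels, right=False)

print(age_group.tolist())
```

Wider intervals put more records in each group, which makes k-anonymity easier to reach but leaves less detail for analysis.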

Approach: generalization

# Count how many records share each combination of Age_group and Department
medical_df.groupby(['Age_group','Department']).size().reset_index(name='Count')

     Age_group         Department    Count
0    (29.964, 39.0]    Finance       4
1    (29.964, 39.0]    Marketing     4
2    (29.964, 39.0]    Production    6
3    (39.0, 48.0]      Finance       8
4    (39.0, 48.0]      Marketing     5
5    (39.0, 48.0]      Production    4
6    (48.0, 57.0]      Finance       3
7    (48.0, 57.0]      Marketing     2
8    (48.0, 57.0]      Production    4

Approach: generalization

# Set k to be 2, for a 2-anonymous dataset
k = 2


# Count how many records share each combination of Age_group and Department
df_count = medical_df.groupby(['Age_group','Department']).size().reset_index(name='Count')


# Filter the rows that have count less than k
df_count[df_count['Count'] < k]

    Age_group    Department    Count
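An empty result means no combination appears fewer than k times. Two details are worth a sketch. First, `Age_group` is a categorical column, and in some pandas versions grouping by it also lists combinations that never occur, with a count of 0; passing `observed=True` restricts the result to combinations present in the data. Second, if some groups still fall below k after generalization, one option (suppression, which is not covered in the slides above) is to drop those rows. The data below is a hypothetical sample, not the course dataset.

```python
import pandas as pd

# Hypothetical sample, not the course dataset
df = pd.DataFrame({
    "Age": [31, 33, 36, 38, 42, 44, 47, 62],
    "Department": ["Marketing", "Marketing", "Finance", "Finance",
                   "Finance", "Finance", "Finance", "Marketing"],
})

# Generalize Age into decade-wide intervals (assumed edges)
df["Age_group"] = pd.cut(df["Age"], bins=[30, 40, 50, 60, 70], right=False)

k = 2
quasi_identifiers = ["Age_group", "Department"]

# Size of the group each row belongs to; observed=True ignores
# category combinations that never occur in the data
group_size = df.groupby(quasi_identifiers, observed=True)["Age"].transform("size")

# Rows whose combination appears fewer than k times
violating = df[group_size < k]

# Suppression: keep only rows that belong to a group of at least k
suppressed = df[group_size >= k]

print(violating)
print(len(suppressed))
```

Here only the single record in the oldest age group violates 2-anonymity, so suppression removes one row and keeps the other seven.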

Let's k-anonymize!
