Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
Only one record had the demographic values of the governor.
Deleting all the demographic information would leave the data useless for analysis.
Is there a middle ground that ensures the demographic values are no longer unique in the dataset?
A dataset is k-anonymous when at least k individuals share each combination of attributes that might identify an individual.
2-anonymous:
ZIP code Age
0 4217 34
1 4217 34
2 1742 77
3 1742 77
Every combination of values for the identifying columns in the dataset appears in at least k different records.
Not 2-anonymous:
ZIP code Age
0 4217 34
1 4217 34
2 1742 77
3 1743 77
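The definition above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper name is_k_anonymous and the two small DataFrames (rebuilt from the ZIP code and Age tables above) are assumptions made for illustration.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records (hypothetical helper)."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# The two example tables from the slides, rebuilt as DataFrames
anonymous_df = pd.DataFrame(
    {"ZIP code": [4217, 4217, 1742, 1742], "Age": [34, 34, 77, 77]}
)
not_anonymous_df = pd.DataFrame(
    {"ZIP code": [4217, 4217, 1742, 1743], "Age": [34, 34, 77, 77]}
)

print(is_k_anonymous(anonymous_df, ["ZIP code", "Age"], k=2))      # True
print(is_k_anonymous(not_anonymous_df, ["ZIP code", "Age"], k=2))  # False
```

In the second table, the single changed ZIP code (1743) leaves two combinations with only one record each, so the check fails.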
# Explore DataFrame
medical_df.head()
Age Department Condition
0 34 Marketing Anxiety disorders
1 46 Finance Flu
2 41 Finance Flu
3 62 Marketing Anxiety disorders
4 44 Marketing Anxiety disorders
Privacy attributes:
# Count how many records share each combination of Age and Department
medical_df.groupby(['Age','Department']).size().reset_index(name='Count')
Age Department Count
0 30 Production 2
1 31 Marketing 1
2 32 Marketing 1
3 32 Production 1
4 33 Production 1
5 34 Finance 1
6 34 Marketing 1
7 34 Production 1
8 35 Marketing 1
9 36 Finance 2
10 38 Finance 1
11 38 Production 1
# Generalize Age by creating 4 groups of intervals
medical_df['Age_group'] = pd.cut(medical_df['Age'], bins=4)
# Explore the dataset with intervals
medical_df.head()
Age Department Condition Age_group
0 34 Marketing Anxiety disorders (29.964, 39.0]
1 46 Finance Flu (39.0, 48.0]
2 41 Finance Flu (39.0, 48.0]
3 62 Marketing Anxiety disorders (57.0, 66.0]
4 44 Marketing Anxiety disorders (39.0, 48.0]
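With bins=4, pd.cut derives equal-width intervals from the observed minimum and maximum, which is why the edges look like 29.964. As a possible alternative, pd.cut also accepts explicit bin edges. The sketch below assumes hypothetical edges (29, 39, 49, 59, 69) and reuses only the five ages shown in the output above; it is not the binning used in the slides.

```python
import pandas as pd

# The five ages shown in medical_df.head()
ages = pd.Series([34, 46, 41, 62, 44], name="Age")

# Hypothetical, hand-picked edges; each interval is open on the left
# and closed on the right, e.g. (29, 39] contains 39 but not 29
age_groups = pd.cut(ages, bins=[29, 39, 49, 59, 69])

print(pd.DataFrame({"Age": ages, "Age_group": age_groups}))
```

Ages that fall outside the given edges become NaN, so explicit edges should cover the full range of the column.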
# Count how many records share each combination of Age_group and Department
medical_df.groupby(['Age_group','Department']).size().reset_index(name='Count')
Age_group Department Count
0 (29.964, 39.0] Finance 4
1 (29.964, 39.0] Marketing 4
2 (29.964, 39.0] Production 6
3 (39.0, 48.0] Finance 8
4 (39.0, 48.0] Marketing 5
5 (39.0, 48.0] Production 4
6 (48.0, 57.0] Finance 3
7 (48.0, 57.0] Marketing 2
8 (48.0, 57.0] Production 4
# Set k to be 2, for a 2-anonymous dataset
k = 2
# Count how many records share each combination of Age_group and Department
df_count = medical_df.groupby(['Age_group','Department']).size().reset_index(name='Count')
# Filter the combinations that have a count less than k
df_count[df_count['Count'] < k]
Age_group Department Count
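An empty result means no combination falls below k. The steps above can be wrapped in one function, sketched here under stated assumptions: the helper name groups_below_k and the six-row sample_df are hypothetical, and the explicit bin edges are chosen for illustration. Because pd.cut returns a categorical column, the sketch passes observed=True so that interval and department combinations that never occur are not reported with a count of 0 (the default for this parameter depends on the pandas version).

```python
import pandas as pd

def groups_below_k(df, quasi_identifiers, k):
    """Return the combinations of quasi-identifier values that appear
    in fewer than k records (hypothetical helper)."""
    counts = (
        df.groupby(quasi_identifiers, observed=True)
        .size()
        .reset_index(name="Count")
    )
    return counts[counts["Count"] < k]

# Hypothetical sample data, not the medical_df from the slides
sample_df = pd.DataFrame({
    "Age": [31, 34, 36, 44, 46, 62],
    "Department": ["Marketing", "Marketing", "Finance",
                   "Finance", "Finance", "Marketing"],
})
sample_df["Age_group"] = pd.cut(sample_df["Age"], bins=[29, 39, 49, 59, 69])

# Two combinations have a single record each, so they violate k = 2
violations = groups_below_k(sample_df, ["Age_group", "Department"], k=2)
print(violations)
```

Any rows listed here would need further generalization (wider bins) or suppression before the dataset could be called k-anonymous.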