Data privacy and anonymization in Python
Rebeca Gonzalez
Data engineer


Only one record had the governor's demographic values.
Removing all demographic information makes the data unusable.
What if there is a middle ground in which those demographic values are no longer unique?
k-anonymity: at least k individuals in the dataset share the set of attributes that could identify each individual.

2-anonymous:
ZIP code Age
0 4217 34
1 4217 34
2 1742 77
3 1742 77
Every combination of values for the identifying columns appears in at least k different records.
Not 2-anonymous:
ZIP code Age
0 4217 34
1 4217 34
2 1742 77
3 1743 77
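The definition above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper name `is_k_anonymous` and the two DataFrame names are hypothetical, and the data is copied from the two small tables shown above.

```python
import pandas as pd


def is_k_anonymous(df, quasi_identifiers, k):
    """Return True if every combination of values in the given
    columns appears in at least k rows of df (hypothetical helper)."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())


# The two example tables from the slides
anonymous_df = pd.DataFrame(
    {"ZIP code": [4217, 4217, 1742, 1742], "Age": [34, 34, 77, 77]}
)
not_anonymous_df = pd.DataFrame(
    {"ZIP code": [4217, 4217, 1742, 1743], "Age": [34, 34, 77, 77]}
)

print(is_k_anonymous(anonymous_df, ["ZIP code", "Age"], k=2))      # True
print(is_k_anonymous(not_anonymous_df, ["ZIP code", "Age"], k=2))  # False
```

In the second table, the combinations (1742, 77) and (1743, 77) each occur only once, so the check fails for k = 2.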

# Explore DataFrame
medical_df.head()
Age Department Condition
0 34 Marketing Anxiety disorders
1 46 Finance Flu
2 41 Finance Flu
3 62 Marketing Anxiety disorders
4 44 Marketing Anxiety disorders
Privacy attributes:
# Count how many records share each combination of Age and Department
medical_df.groupby(['Age','Department']).size().reset_index(name='Count')
Age Department Count
0 30 Production 2
1 31 Marketing 1
2 32 Marketing 1
3 32 Production 1
4 33 Production 1
5 34 Finance 1
6 34 Marketing 1
7 34 Production 1
8 35 Marketing 1
9 36 Finance 2
10 38 Finance 1
11 38 Production 1
# Generalize Age by creating 4 groups of intervals
medical_df['Age_group'] = pd.cut(medical_df['Age'], bins=4)
# Explore the dataset with intervals
medical_df.head()
Age Department Condition Age_group
0 34 Marketing Anxiety disorders (29.964, 39.0]
1 46 Finance Flu (39.0, 48.0]
2 41 Finance Flu (39.0, 48.0]
3 62 Marketing Anxiety disorders (57.0, 66.0]
4 44 Marketing Anxiety disorders (39.0, 48.0]
# Count how many records share each combination of Age_group and Department
medical_df.groupby(['Age_group','Department']).size().reset_index(name='Count')
Age_group Department Count
0 (29.964, 39.0] Finance 4
1 (29.964, 39.0] Marketing 4
2 (29.964, 39.0] Production 6
3 (39.0, 48.0] Finance 8
4 (39.0, 48.0] Marketing 5
5 (39.0, 48.0] Production 4
6 (48.0, 57.0] Finance 3
7 (48.0, 57.0] Marketing 2
8 (48.0, 57.0] Production 4
# Set k to 2, for a 2-anonymous dataset
k = 2
# Count how many records share each combination of Age_group and Department
df_count = medical_df.groupby(['Age_group','Department']).size().reset_index(name='Count')
# Keep only the combinations that occur fewer than k times
df_count[df_count['Count'] < k]
Age_group Department Count
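The generalize-then-check steps above can be tried end to end on a small table. This is only a sketch under stated assumptions: `toy_df` and its values are made up for illustration (they are not the course's `medical_df`), and `groups_below_k` is a hypothetical helper. Passing `observed=True` is an added precaution, since grouping on the categorical column produced by `pd.cut` can otherwise list empty combinations with a count of 0, depending on the pandas version.

```python
import pandas as pd

# Hypothetical toy data, not the medical_df from the slides
toy_df = pd.DataFrame({
    "Age": [30, 31, 36, 38, 45, 47, 50, 62],
    "Department": ["Finance", "Finance", "Marketing", "Marketing",
                   "Finance", "Finance", "Marketing", "Finance"],
})


def groups_below_k(df, columns, k):
    """Return the value combinations that occur in fewer than k rows."""
    counts = (
        df.groupby(columns, observed=True)  # skip unobserved categories
          .size()
          .reset_index(name="Count")
    )
    return counts[counts["Count"] < k]


k = 2

# Before generalization: every (Age, Department) pair is unique
raw_violations = groups_below_k(toy_df, ["Age", "Department"], k)

# Generalize Age into 2 equal-width intervals, then check again
toy_df["Age_group"] = pd.cut(toy_df["Age"], bins=2)
generalized_violations = groups_below_k(toy_df, ["Age_group", "Department"], k)

print(len(raw_violations), len(generalized_violations))
print(generalized_violations)
```

On this toy data, generalization reduces the violating combinations from eight to one, which shows that binning alone does not guarantee k-anonymity: any remaining rows still have to be handled before the dataset can be called 2-anonymous.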