Anonymizing with data generalization

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Data generalization

A technique that replaces data values with less precise ones.

The purpose is to remove some identifying detail while retaining the data's utility for analysis.

Example: "Dancer" -> "Artist"

     name               location
0    Amanda Hooper      146 Rodgers Field\nGregoryview, MS 71630
1    Sarah Smith        817 Garcia Shoal\nJonesville, AR 30299
2    Sean Boyd III      1938 90th St\nDallas, TX 59715

     name               location
0    Amanda Hooper      998 Boone Estate\nReedborough, MS 71630
1    Sarah Smith        7255 Shelby Rapids Apt. 455\nKarenland, AK 30299
2    Sean Boyd III      791 Crist Parks\nGreenton, TX 59715

Data generalization

# Original data
df_employees.head()

     first name   last name    age   ssn
0    Amber        Brown        91    798-29-4785
1    William      Gibson       34    431-66-8381
2    Daniel       Lee          92    825-91-5550
3    Andrea       Stevenson    64    188-59-3544
4    Julie        Horn         35    020-60-6388

# Generalized data: Age in intervals and masked SSN
generalized_df.head()

    First name    Last name    Age        SSN
0    Amber        Brown        (80, 99]    798-**-****
1    William      Gibson       (30, 50]    431-**-****
2    Daniel       Lee          (80, 99]    825-**-****
3    Andrea       Stevenson    (60, 80]    188-**-****
4    Julie        Horn         (30, 50]    020-**-****

For example, a 34-year-old is placed in the (30, 50] interval. Grouping values into ranges like this is also known as binning.
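The generalized table above could be produced along these lines. This is a minimal sketch, not necessarily how the course builds it: the bin edges in age_bins are an assumption chosen to reproduce the intervals shown, and the column names are kept as in the original data.

```python
import pandas as pd

# Toy rows copied from the employee table above
df_employees = pd.DataFrame({
    "first name": ["Amber", "William", "Daniel", "Andrea", "Julie"],
    "last name": ["Brown", "Gibson", "Lee", "Stevenson", "Horn"],
    "age": [91, 34, 92, 64, 35],
    "ssn": ["798-29-4785", "431-66-8381", "825-91-5550",
            "188-59-3544", "020-60-6388"],
})

generalized_df = df_employees.copy()

# Assumed bin edges; pd.cut builds intervals open on the left, closed on the right
age_bins = [0, 30, 50, 60, 80, 99]
generalized_df["age"] = pd.cut(df_employees["age"], bins=age_bins)

# Keep the first three digits of the SSN and mask the rest
generalized_df["ssn"] = df_employees["ssn"].str[:3] + "-**-****"

print(generalized_df)
```

Wider bins hide more detail but also blur the age distribution, so the edges are a trade-off between privacy and utility.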

Data aggregation

Image showing a list of professional occupations on the left and the overall general categories on the right.
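One way to sketch this kind of category generalization in pandas is a lookup dictionary applied with Series.map. The mapping below is hypothetical: only "Dancer" -> "Artist" comes from the slides, and the other occupations and categories are made up for illustration.

```python
import pandas as pd

# Hypothetical occupation column
occupations = pd.Series(
    ["Dancer", "Painter", "Nurse", "Surgeon", "Dancer"], name="occupation"
)

# Hypothetical lookup table; only "Dancer" -> "Artist" is taken from the slides
occupation_to_category = {
    "Dancer": "Artist",
    "Painter": "Artist",
    "Nurse": "Healthcare worker",
    "Surgeon": "Healthcare worker",
}

# Replace each specific occupation with its broader category
generalized = occupations.map(occupation_to_category)

print(generalized.tolist())
```

Values missing from the dictionary become NaN with map, which makes unmapped occupations easy to spot before releasing the data.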

Medical dataset

# Explore the dataset
df_medical.head()

    age    gender    department    condition
0    30    F         Finance       Anxiety disorders
1    42    M         Production    Bronchitis
2    35    F         Marketing     Dysthymia
3    39    F         Production    Dysthymia
4    40    M         Marketing     Flu

Medical dataset

# Explore the histogram of the age variable
df_medical['age'].hist(bins=15)

Histogram of the age attribute from the dataset, generated with the hist method

Generalization

# Apply generalization by transforming to binary data
df_medical['age'] = df_medical['age'].apply(lambda x: ">=40" if x >= 40 else "<40")


# See results
df_medical.head()

    age    gender    department    condition
0    <40   F         Finance       Anxiety disorders
1    >=40  M         Production    Bronchitis
2    <40   F         Marketing     Dysthymia
3    <40   F         Production    Dysthymia
4    >=40  M         Marketing     Flu

Top and bottom coding

# Explore the histogram of the age variable
df_medical['age'].hist(bins=15)

Histogram of the age attribute from the dataset, generated with the hist method

Few people are younger than 25 and few are older than 55
Apply bounds to reduce the risk of re-identification for the people at these outlying ages
Works best when there are very few observations within a category, especially at the tails of the distribution
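Before settling on bounds, it can help to count how many observations fall outside them. The sketch below uses a small hypothetical ages Series standing in for df_medical['age']; the bounds 25 and 55 are the ones read off the histogram.

```python
import pandas as pd

# Hypothetical ages standing in for df_medical['age']
ages = pd.Series([22, 24, 30, 35, 39, 40, 42, 47, 51, 56, 59], name="age")

lower_bound, upper_bound = 25, 55

# Number of observations in each tail
n_below = int((ages < lower_bound).sum())
n_above = int((ages > upper_bound).sum())

# Share of rows that top and bottom coding would change
share_affected = (n_below + n_above) / len(ages)

print(n_below, n_above, round(share_affected, 2))
```

If the share affected is large, the bounds cut into the bulk of the distribution rather than its tails, and the coded column loses more utility.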

Top coding

# Filter to see rows affected
df_medical[df_medical['age'] >= 55]

      age   gender   department     condition
26    56    F        Production     Flu
65    55    M        Finance        Dysthymia
126   59    F        Production     Anxiety disorders
139   58    F        Finance        Dysthymia
142   59    M        Marketing      Flu
145   57    M        Marketing      Anxiety disorders

Top code

# Top code the age to 55
df_medical.loc[df_medical['age'] > 55, 'age'] = 55


# Filter to see rows affected
df_medical[df_medical['age'] >= 55]

      age   gender   department     condition
26    55    F        Production     Flu
65    55    M        Finance        Dysthymia
126   55    F        Production     Anxiety disorders
139   55    F        Finance        Dysthymia
142   55    M        Marketing      Flu
145   55    M        Marketing      Anxiety disorders

Bottom coding

# Explore the histogram of the age variable
df_medical['age'].hist(bins=15)

Histogram of the age attribute, generated with the hist method, after applying top coding. The outliers on the right side, corresponding to the highest values, no longer appear.

Bottom code

# Bottom code the age to 25
df_medical.loc[df_medical['age'] < 25, 'age'] = 25


# Explore the histogram of the age variable
df_medical['age'].hist(bins=15)

Histogram of the age attribute, generated with the hist method, after applying top and bottom coding. The outliers no longer appear.
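As an alternative to two separate .loc assignments, pandas' Series.clip can apply both bounds in one call. This is a sketch on a hypothetical ages Series rather than the course's df_medical, using the same bounds of 25 and 55.

```python
import pandas as pd

# Hypothetical ages standing in for df_medical['age']
ages = pd.Series([19, 24, 30, 42, 55, 56, 59], name="age")

# Values below 25 become 25, values above 55 become 55
coded = ages.clip(lower=25, upper=55)

print(coded.tolist())
```

clip returns a new Series, so the result has to be assigned back to the column to change the DataFrame.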

Data generalization and privacy models

Generalization works better when combined with suppression and masking, following a privacy model such as k-anonymity.

Privacy models specify conditions that the dataset must satisfy to keep disclosure risk under control.
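As a rough illustration of such a condition, k-anonymity is commonly stated as: every combination of quasi-identifier values must be shared by at least k rows. The check below is a sketch only; the toy data, the choice of age and gender as quasi-identifiers, and k = 2 are all assumptions.

```python
import pandas as pd

# Toy data with an already generalized age column (hypothetical rows)
df = pd.DataFrame({
    "age": ["<40", "<40", ">=40", ">=40", "<40"],
    "gender": ["F", "F", "M", "M", "F"],
    "condition": ["Flu", "Dysthymia", "Bronchitis", "Flu", "Dysthymia"],
})

# Assumed quasi-identifiers and threshold
quasi_identifiers = ["age", "gender"]
k = 2

# Size of each group of rows sharing the same quasi-identifier values
group_sizes = df.groupby(quasi_identifiers).size()

# The condition holds when the smallest group has at least k rows
is_k_anonymous = bool(group_sizes.min() >= k)

print(group_sizes)
print(is_k_anonymous)
```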

We'll learn how to implement this privacy model in the next chapter!

Time to practice!
