Anonymizing continuous data

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Instructor

Continuous variables

age
height
weight
temperature
date and time

Continuous variables

# See the dataset
hr.head()

    Age    BusinessTravel        Department                EducationField    EmployeeNumber
0    41    Travel_Rarely         Sales                     Life Sciences     1
1    49    Travel_Frequently     Research & Development    Life Sciences     2
2    37    Travel_Rarely         Research & Development    Other             4
3    33    Travel_Frequently     Research & Development    Life Sciences     5
4    27    Travel_Rarely         Research & Development    Medical           7

Continuous distributions

Use the best continuous distribution for our data.

Create a histogram
Try continuous functions, fitting them to the histogram.
Keep the function that has the smallest error between itself and the histogram.

Continuous distribution

Plot of the distribution in the Age variable

Applying a distribution

import scipy.stats


# Fit the genlogistic distribution to the continuous variable Age
params = scipy.stats.genlogistic.fit(hr['Age'])


# See the parameters of the continuous functions
print(params)

(4.9899067653418285, 22.32808853181744, 7.046590524738551)

Sampling from the continuous distribution

# Sample from the genlogistic distribution
df['Age'] = scipy.stats.genlogistic.rvs(size=len(df.index), *params)


# See the resulting dataset
df['Age'].head()

     Age          BusinessTravel     Department                EducationField    EmployeeNumber
0    40.767259    Travel_Rarely      Sales                     Life Sciences     1
1    45.730504    Travel_Frequently  Research & Development    Life Sciences     2
2    41.910050    Travel_Rarely      Research & Development    Other             4
3    35.275320    Travel_Frequently  Research & Development    Life Sciences     5
4    40.198134    Travel_Rarely      Research & Development    Medical           7

Sampling from the continuous distribution

# Round the values to obtain discrete values
df['Age'] = df['Age'].round()

     Age   BusinessTravel     Department                EducationField    EmployeeNumber
0    41    Travel_Rarely      Sales                     Life Sciences     1
1    46    Travel_Frequently  Research & Development    Life Sciences     2
2    42    Travel_Rarely      Research & Development    Other             4
3    35    Travel_Frequently  Research & Development    Life Sciences     5
4    40    Travel_Rarely      Research & Development    Medical           7

Let's practice!

Data Privacy and Anonymization in Python