Anonymizing continuous data

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Instructor

Continuous variables

  • age
  • height
  • weight
  • temperature
  • date and time

Continuous variables

# See the dataset
hr.head()
    Age    BusinessTravel        Department                EducationField    EmployeeNumber
0    41    Travel_Rarely         Sales                     Life Sciences     1
1    49    Travel_Frequently     Research & Development    Life Sciences     2
2    37    Travel_Rarely         Research & Development    Other             4
3    33    Travel_Frequently     Research & Development    Life Sciences     5
4    27    Travel_Rarely         Research & Development    Medical           7

Continuous distributions

Find the continuous distribution that best fits our data.

  • Create a histogram of the variable.
  • Try candidate continuous distributions, fitting each one to the histogram.
  • Keep the distribution that has the smallest error between itself and the histogram.
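
The three steps above can be sketched as follows. This is only an illustration, not the course's own implementation: the synthetic `ages` array stands in for hr['Age'], the shortlist in `candidates` is a hypothetical choice, and the sum of squared errors between the histogram density and each fitted PDF is one assumed way to measure the error.

```python
import numpy as np
import scipy.stats

# Synthetic stand-in for hr['Age'] (assumption: the real column is not loaded here)
rng = np.random.default_rng(42)
ages = rng.normal(loc=37, scale=9, size=1000)

# Step 1: histogram as a density, with bin centers as x-coordinates
density, edges = np.histogram(ages, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Step 2: fit each candidate distribution (hypothetical shortlist)
candidates = ["norm", "logistic", "genlogistic", "expon"]
errors = {}
for name in candidates:
    dist = getattr(scipy.stats, name)
    params = dist.fit(ages)
    pdf = dist.pdf(centers, *params)
    # Sum of squared errors between the histogram and the fitted PDF
    errors[name] = float(np.sum((density - pdf) ** 2))

# Step 3: keep the candidate with the smallest error
best_name = min(errors, key=errors.get)
print(best_name, errors[best_name])
```

The number of bins and the error metric both influence which distribution wins, so it is worth trying a few settings before settling on one.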

Continuous distribution

Plot of the distribution of the Age variable


Applying a distribution

import scipy.stats


# Fit the genlogistic distribution to the continuous variable Age
params = scipy.stats.genlogistic.fit(hr['Age'])

# See the parameters of the continuous function
print(params)
(4.9899067653418285, 22.32808853181744, 7.046590524738551)
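
To make the printed tuple easier to read, here is a small sketch of how it can be unpacked. In SciPy, `fit` returns the shape parameters first, followed by `loc` and `scale`; for `genlogistic` the single shape parameter is `c`. The variable names below are chosen for illustration.

```python
import scipy.stats

# Parameters as printed on the slide
params = (4.9899067653418285, 22.32808853181744, 7.046590524738551)
c, loc, scale = params  # shape, location, scale

# "Freeze" the distribution with the fitted parameters
fitted = scipy.stats.genlogistic(c, loc=loc, scale=scale)

# Summary statistics of the fitted distribution
mean_age = fitted.mean()
median_age = fitted.ppf(0.5)
print(mean_age, median_age, fitted.std())
```

A frozen distribution like `fitted` exposes the same methods (`pdf`, `cdf`, `rvs`, and so on) without having to pass the parameters again.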

Sampling from the continuous distribution

# Sample from the genlogistic distribution
df['Age'] = scipy.stats.genlogistic.rvs(*params, size=len(df.index))

# See the resulting dataset
df.head()
     Age          BusinessTravel     Department                EducationField    EmployeeNumber
0    40.767259    Travel_Rarely      Sales                     Life Sciences     1
1    45.730504    Travel_Frequently  Research & Development    Life Sciences     2
2    41.910050    Travel_Rarely      Research & Development    Other             4
3    35.275320    Travel_Frequently  Research & Development    Life Sciences     5
4    40.198134    Travel_Rarely      Research & Development    Medical           7

Sampling from the continuous distribution

# Round the values and cast to int to obtain discrete values
df['Age'] = df['Age'].round().astype(int)
     Age   BusinessTravel     Department                EducationField    EmployeeNumber
0    41    Travel_Rarely      Sales                     Life Sciences     1
1    46    Travel_Frequently  Research & Development    Life Sciences     2
2    42    Travel_Rarely      Research & Development    Other             4
3    35    Travel_Frequently  Research & Development    Life Sciences     5
4    40    Travel_Rarely      Research & Development    Medical           7
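
Putting the fit, sample, and round steps together, a minimal end-to-end sketch might look like the following. The small synthetic DataFrame is an assumption standing in for the HR dataset, and `random_state=42` is added only to make the sampled values reproducible.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Synthetic stand-in for the HR dataset (assumption: the real data is not loaded here)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(loc=37, scale=9, size=500).round().astype(int),
    "EmployeeNumber": np.arange(1, 501),
})
original_mean = df["Age"].mean()

# Fit the distribution to the original column
params = scipy.stats.genlogistic.fit(df["Age"])

# Replace the column with values sampled from the fitted distribution
sampled = scipy.stats.genlogistic.rvs(*params, size=len(df.index), random_state=42)

# Round and cast so the column holds whole-number ages again
df["Age"] = sampled.round().astype(int)

print(df.head())
print(original_mean, df["Age"].mean())
```

The individual ages no longer correspond to real employees, while summary statistics such as the mean should stay close to those of the original column.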

Let's practice!
