Data Privacy and Anonymization in Python
Rebeca Gonzalez
Instructor
# See the dataset
hr.head()
   Age     BusinessTravel              Department  EducationField  EmployeeNumber
0   41      Travel_Rarely                   Sales   Life Sciences               1
1   49  Travel_Frequently  Research & Development   Life Sciences               2
2   37      Travel_Rarely  Research & Development           Other               4
3   33  Travel_Frequently  Research & Development   Life Sciences               5
4   27      Travel_Rarely  Research & Development         Medical               7
Use the continuous distribution that best fits the data (here, genlogistic).
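The slides do not show how the "best" distribution is identified. A minimal sketch of one possible approach follows. It assumes a small hand-picked candidate list and uses the Kolmogorov-Smirnov statistic as the goodness-of-fit measure. Both are illustrative choices, and `ages` is synthetic stand-in data rather than the HR dataset.

```python
import numpy as np
import scipy.stats

# Stand-in data for illustration only (not the real hr['Age'] column)
rng = np.random.default_rng(0)
ages = rng.normal(loc=37, scale=9, size=500)

# Hypothetical candidate list; any scipy.stats continuous distribution could be added
candidates = ["norm", "logistic", "genlogistic"]

results = {}
for name in candidates:
    dist = getattr(scipy.stats, name)
    # Maximum-likelihood fit; returns shape parameters (if any), then loc and scale
    params = dist.fit(ages)
    # KS statistic: largest gap between the empirical CDF and the fitted CDF
    ks_stat, _ = scipy.stats.kstest(ages, name, args=params)
    results[name] = ks_stat

# A lower KS statistic means the fitted CDF tracks the data more closely
best_name = min(results, key=results.get)
print(best_name, results[best_name])
```

Other criteria, such as the sum of squared errors against a histogram or the log-likelihood, could be swapped in for the KS statistic.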
import scipy.stats
# Work on a copy so that hr keeps the original values
df = hr.copy()
# Fit the genlogistic distribution to the continuous variable Age
params = scipy.stats.genlogistic.fit(hr['Age'])
# See the fitted parameters (shape c, loc, scale)
print(params)
(4.9899067653418285, 22.32808853181744, 7.046590524738551)
# Sample from the fitted genlogistic distribution
df['Age'] = scipy.stats.genlogistic.rvs(*params, size=len(df.index))
# See the resulting dataset
df.head()
         Age     BusinessTravel              Department  EducationField  EmployeeNumber
0  40.767259      Travel_Rarely                   Sales   Life Sciences               1
1  45.730504  Travel_Frequently  Research & Development   Life Sciences               2
2  41.910050      Travel_Rarely  Research & Development           Other               4
3  35.275320  Travel_Frequently  Research & Development   Life Sciences               5
4  40.198134      Travel_Rarely  Research & Development         Medical               7
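After sampling, it can be useful to check that the synthetic column still resembles the original one. The sketch below is one way to do that, assuming a two-sample Kolmogorov-Smirnov test and simple summary statistics are adequate checks. The `original` array is stand-in data, not the real `hr['Age']` column.

```python
import numpy as np
import scipy.stats

# Stand-in for the original Age column (illustration only)
rng = np.random.default_rng(1)
original = rng.normal(loc=37, scale=9, size=1000)

# Fit and sample, mirroring the steps on the slide
params = scipy.stats.genlogistic.fit(original)
synthetic = scipy.stats.genlogistic.rvs(*params, size=len(original), random_state=rng)

# Compare basic summary statistics of both columns
print("mean:", original.mean(), synthetic.mean())
print("std: ", original.std(), synthetic.std())

# Two-sample KS test: a small statistic suggests similar distributions
ks_stat, p_value = scipy.stats.ks_2samp(original, synthetic)
print("KS statistic:", ks_stat, "p-value:", p_value)
```

The individual values differ from the originals, but the overall shape of the column should be roughly preserved.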
# Round the values to obtain discrete values
df['Age'] = df['Age'].round()
   Age     BusinessTravel              Department  EducationField  EmployeeNumber
0   41      Travel_Rarely                   Sales   Life Sciences               1
1   46  Travel_Frequently  Research & Development   Life Sciences               2
2   42      Travel_Rarely  Research & Development           Other               4
3   35  Travel_Frequently  Research & Development   Life Sciences               5
4   40      Travel_Rarely  Research & Development         Medical               7
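Note that `round()` keeps the column as floats, and sampling from a fitted distribution can produce values outside a plausible age range. A minimal sketch of a follow-up cleaning step is shown below. The bounds `min_age` and `max_age` and the sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sampled ages, including two values outside the assumed range
df = pd.DataFrame({"Age": [40.767259, 45.730504, 41.910050, 12.3, 71.8]})

# Assumed bounds; in practice they could come from the original column's min and max
min_age, max_age = 18, 60

# Round, clip to the allowed range, then cast to integers
df["Age"] = df["Age"].round().clip(lower=min_age, upper=max_age).astype(int)
print(df["Age"].tolist())  # [41, 46, 42, 18, 60]
```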