Anonymizing continuous data

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Instructor

Continuous variables

  • age
  • height
  • weight
  • temperature
  • date and time

Continuous variables

# Explore the dataset
hr.head()
    Age    BusinessTravel        Department                EducationField    EmployeeNumber
0    41    Travel_Rarely         Sales                     Life Sciences     1
1    49    Travel_Frequently     Research & Development    Life Sciences     2
2    37    Travel_Rarely         Research & Development    Other             4
3    33    Travel_Frequently     Research & Development    Life Sciences     5
4    27    Travel_Rarely         Research & Development    Medical           7

Continuous distributions

Use the continuous distribution that best fits our data.

  • Build a histogram of the data.
  • Try candidate continuous distributions and fit each one to the histogram.
  • Keep the distribution with the smallest error relative to the histogram.
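
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the helper name `best_fit_distribution`, the candidate list, the bin count, and the synthetic `ages` sample (standing in for `hr['Age']`) are all assumptions made for the example.

```python
import numpy as np
import scipy.stats


def best_fit_distribution(data, candidates, bins=30):
    """Return (name, params, sse) for the candidate with the smallest squared error."""
    # Step 1: build a density-normalised histogram of the data
    density, edges = np.histogram(data, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2

    best = None
    for name in candidates:
        dist = getattr(scipy.stats, name)
        # Step 2: fit the candidate and evaluate its PDF at the bin centres
        params = dist.fit(data)
        pdf = dist.pdf(centers, *params)
        # Step 3: keep the candidate with the smallest sum of squared errors
        sse = float(np.sum((density - pdf) ** 2))
        if best is None or sse < best[2]:
            best = (name, params, sse)
    return best


# Synthetic stand-in for hr['Age'] (assumption: the real dataset is not used here)
rng = np.random.default_rng(0)
ages = rng.normal(loc=37, scale=9, size=1000)

name, params, sse = best_fit_distribution(ages, ["norm", "genlogistic", "expon"])
print(name, params, sse)
```

Comparing the fitted PDF with a density-normalised histogram is only one possible error measure. Other choices, such as a goodness-of-fit test, could be swapped in without changing the overall procedure.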

Continuous distributions

Distribution plot of the Age variable


Applying the distribution

import scipy.stats


# Fit the genlogistic distribution to the continuous Age variable
params = scipy.stats.genlogistic.fit(hr['Age'])

# Inspect the parameters of the fitted distribution
print(params)
(4.9899067653418285, 22.32808853181744, 7.046590524738551)
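
As a hedged aside on how to read that tuple: `fit()` returns the shape parameters first, followed by `loc` and `scale`, and `genlogistic` has a single shape parameter `c`. The sketch below reuses the values printed above. The variable names `c`, `loc`, `scale`, and `age_dist` are chosen here for illustration only.

```python
import scipy.stats

# Parameters printed on the slide, in the order returned by fit()
params = (4.9899067653418285, 22.32808853181744, 7.046590524738551)

# genlogistic has one shape parameter (c), followed by loc and scale
c, loc, scale = params

# "Freeze" the distribution so the parameters do not need to be passed again
age_dist = scipy.stats.genlogistic(c, loc=loc, scale=scale)

# Mean age implied by the fitted model
print(age_dist.mean())

# 5th, 50th and 95th percentiles of the fitted model
percentiles = age_dist.ppf([0.05, 0.5, 0.95])
print(percentiles)
```

Checking that the implied mean and percentiles look plausible for employee ages is a quick sanity check before sampling from the model.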

Sampling from continuous distributions

# Sample from the genlogistic distribution, one value per row
df['Age'] = scipy.stats.genlogistic.rvs(*params, size=len(df.index))

# Explore the resulting dataset
df.head()
     Age          BusinessTravel     Department                EducationField    EmployeeNumber
0    40.767259    Travel_Rarely      Sales                     Life Sciences     1
1    45.730504    Travel_Frequently  Research & Development    Life Sciences     2
2    41.910050    Travel_Rarely      Research & Development    Other             4
3    35.275320    Travel_Frequently  Research & Development    Life Sciences     5
4    40.198134    Travel_Rarely      Research & Development    Medical           7
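
One way to check that the replacement keeps the overall shape of the column while changing the individual values is to compare summary statistics before and after. The sketch below is illustrative only: it uses a synthetic `Age` column rather than the HR dataset, and the `random_state` seed and the names `original` and `summary` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Synthetic stand-in for the HR data (assumption: not the real dataset)
rng = np.random.default_rng(42)
df = pd.DataFrame({"Age": rng.normal(loc=37, scale=9, size=1000).round()})
original = df["Age"].copy()

# Fit, then replace every age with a fresh draw from the fitted distribution
params = scipy.stats.genlogistic.fit(original)
df["Age"] = scipy.stats.genlogistic.rvs(*params, size=len(df), random_state=42)

# Compare summary statistics of the original and the resampled column
summary = pd.DataFrame({
    "original": original.describe(),
    "sampled": df["Age"].describe(),
})
print(summary.loc[["mean", "std", "min", "max"]])
```

The means and standard deviations should be close if the fit is reasonable, although the minimum and maximum can differ because the sampled values are not tied to any original row.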

Sampling from continuous distributions

# Round the sampled values to get discrete (whole-number) ages
df['Age'] = df['Age'].round().astype(int)
df.head()
     Age   BusinessTravel     Department                EducationField    EmployeeNumber
0    41    Travel_Rarely      Sales                     Life Sciences     1
1    46    Travel_Frequently  Research & Development    Life Sciences     2
2    42    Travel_Rarely      Research & Development    Other             4
3    35    Travel_Frequently  Research & Development    Life Sciences     5
4    40    Travel_Rarely      Research & Development    Medical           7

Let's practice!
