Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
from faker import Faker
fake_data = Faker()
fake_data.name()
'Kelly Clark'
fake_data.name_male()
'Antonio Henderson'
fake_data.name_female()
'Jennifer Ortega'
clients_df
gender active
0 Female No
1 Male Yes
2 Male No
3 Female Yes
4 Male Yes
... ... ...
1465 Male Yes
1466 Male Yes
1467 Male Yes
1468 Male Yes
1469 Male Yes
1470 rows × 2 columns
# Import the Faker class
from faker import Faker

# Initialize Faker
fake_data = Faker()

# Generate a name according to gender that is unique within the dataset
clients_df['name'] = [
    fake_data.unique.name_female() if x == "Female"
    else fake_data.unique.name_male()
    for x in clients_df['gender']
]

# Explore the dataset
clients_df
gender active name
0 Female No Michelle Lang
1 Male Yes Robert Norton
2 Male No Matthew Brown
3 Female Yes Sherry Jones
4 Male Yes Steven Vega
... ... ... ...
1465 Male Yes Bradley Smith
1466 Male Yes Tyler Yu
1467 Male Yes Mr. Joshua Gallegos
1468 Male Yes Brian Aguilar
1469 Male Yes David Johnson
1470 rows × 3 columns
# Generate a random city
clients_df['city'] = [fake_data.city() for x in range(len(clients_df))]
clients_df.head()
gender active name city
0 Female No Stacy Hooper Reedland
1 Male Yes Michael Rogers North Michellestad
2 Male No James Sanchez West Josephburgh
3 Female Yes Taylor Berger Hermanton
4 Male Yes Joshua Coleman South Amandaland
# Generate emails with different domains
clients_df['contact email'] = [fake_data.company_email() for x in range(len(clients_df))]

# Explore the dataset
clients_df.head()
gender active name city contact email
0 Female No Stacy Hooper Reedland [email protected]
1 Male Yes Michael Rogers North Michellestad [email protected]
2 Male No James Sanchez West Josephburgh [email protected]
3 Female Yes Taylor Berger Hermanton [email protected]
4 Male Yes Joshua Coleman South Amandaland [email protected]
# Generate emails with company-like domains and a username derived from the name
clients_df['contact email'] = [x.replace(" ", "") + "@" + fake_data.domain_name() for x in clients_df['name']]

# Explore the resulting DataFrame
clients_df.head()
gender active name city contact email
0 Female No Stacy Hooper Reedland [email protected]
1 Male Yes Michael Rogers North Michellestad [email protected]
2 Male No James Sanchez West Josephburgh [email protected]
3 Female Yes Taylor Berger Hermanton [email protected]
4 Male Yes Joshua Coleman South Amandaland [email protected]
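Generated names can carry a title (for example "Mr. Joshua Gallegos"), and removing spaces alone would keep "Mr." in the address. A minimal sketch of a cleanup step, using only the standard library. The helper names and the PREFIXES set are hypothetical, and the regex assumes ASCII names, so accented letters would be dropped.

```python
import re

# Hypothetical list of honorifics to drop; extend as needed.
PREFIXES = {"mr", "mrs", "ms", "miss", "dr"}

def username_from_name(name: str) -> str:
    """Turn a display name into a lowercase, dot-separated username."""
    # Split on anything that is not an ASCII letter or digit, then lowercase.
    parts = [p.lower() for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    # Drop leading honorifics such as "Mr." so they stay out of the address.
    while parts and parts[0] in PREFIXES:
        parts.pop(0)
    return ".".join(parts)

def email_from_name(name: str, domain: str) -> str:
    """Combine the cleaned username with a domain supplied by the caller."""
    return f"{username_from_name(name)}@{domain}"

print(email_from_name("Mr. Joshua Gallegos", "example.com"))
# joshua.gallegos@example.com
```

In the DataFrame workflow, the domain argument could come from fake_data.domain_name() instead of a fixed string.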
# Generate a date between two points in time
clients_df['date'] = [fake_data.date_between(start_date="-10y", end_date="now") for x in range(len(clients_df))]

# Explore the resulting DataFrame
clients_df.head()
gender active name city contact email date
0 Female No Stacy Hooper Reedland [email protected] 2019-11-20
1 Male Yes Michael Rogers North Michellestad [email protected] 2015-02-22
2 Male No James Sanchez West Josephburgh [email protected] 2015-12-11
3 Female Yes Taylor Berger Hermanton [email protected] 2012-12-13
4 Male Yes Joshua Coleman South Amandaland [email protected] 2014-05-22
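To make the idea of drawing a date between two points in time concrete, here is a rough standard-library sketch. It is not Faker's implementation. The function name, the fixed reference date standing in for "now", and the 3650-day window are assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta

def random_date_between(start: date, end: date, rng: random.Random) -> date:
    """Return a date drawn uniformly from the inclusive range [start, end]."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))

rng = random.Random(0)              # fixed seed so the sketch is reproducible
end = date(2022, 1, 1)              # hypothetical reference date standing in for "now"
start = end - timedelta(days=3650)  # roughly ten years earlier

dates = [random_date_between(start, end, rng) for _ in range(5)]
print(dates)
```

Passing an explicit random.Random instance keeps the generated dates repeatable across runs, which helps when the synthetic dataset has to be regenerated.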
When imitating a real dataset, we can avoid leaking the real names of its values.
# Import numpy
import numpy as np

# Obtain or specify the probabilities
p = (0.58, 0.23, 0.16, 0.03)
cities = ["New York", "Chicago", "Seattle", "Dallas"]

# Generate the chosen cities following a distribution
clients_df['city'] = np.random.choice(cities, size=len(clients_df), p=p)

# Explore the resulting dataset
clients_df.head()
gender active name city contact email date
0 Female No Stacy Hooper Chicago [email protected] 2019-11-20
1 Male Yes Michael Rogers New York [email protected] 2015-02-22
2 Male No James Sanchez New York [email protected] 2015-12-11
3 Female Yes Taylor Berger Chicago [email protected] 2012-12-13
4 Male Yes Joshua Coleman New York [email protected] 2014-05-22
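The probabilities can also be obtained from the real column instead of being typed by hand. A minimal sketch, assuming a small made-up real_cities series and a hypothetical list of replacement names. Only the relative frequencies of the real data are reused, while its category names never reach the output.

```python
import numpy as np
import pandas as pd

# Hypothetical "real" column; in practice this would be the sensitive source data.
real_cities = pd.Series(["Ankara", "Ankara", "Izmir", "Ankara", "Bursa", "Izmir"])

# Relative frequency of each category, sorted from most to least common.
freqs = real_cities.value_counts(normalize=True)

# Public replacement names, one per real category, in the same order as freqs.
replacement_names = ["New York", "Chicago", "Seattle"]

# Sample synthetic values that follow the real distribution.
rng = np.random.default_rng(seed=42)
synthetic = rng.choice(replacement_names, size=1000, p=freqs.to_numpy())

# Observed shares in the synthetic column, for comparison with freqs.
print(pd.Series(synthetic).value_counts(normalize=True))
```

Note that the distribution itself can still be informative about the original data, so whether sharing it is acceptable depends on the privacy requirements of the dataset.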