Generating realistic datasets with Faker

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Generating data with Faker

fake_data.name()
'Kelly Clark'
fake_data.name_male()
'Antonio Henderson'
fake_data.name_female()
'Jennifer Ortega'

Clients DataFrame

clients_df
      gender  active
0     Female  No
1     Male    Yes
2     Male    No
3     Female  Yes
4     Male    Yes
...   ...     ...
1465  Male    Yes
1466  Male    Yes
1467  Male    Yes
1468  Male    Yes
1469  Male    Yes
1470 rows × 2 columns

Generating a dataset with Faker

  • Generate unique names that match the gender
  • Generate random cities
  • Generate specific cities following a probability distribution
  • Generate emails
  • Generate dates within a time range

Matching names to gender

Generating unique names in the dataset

Avoiding duplicates
# Import the Faker class
from faker import Faker

# Initialize a Faker instance
fake_data = Faker()

# Generate a name that matches each client's gender;
# the .unique proxy avoids repeating a name already generated
clients_df['name'] = [fake_data.unique.name_female() if x == "Female"
                      else fake_data.unique.name_male()
                      for x in clients_df['gender']]

Matching names to gender

# Explore the dataset
clients_df
      gender  active  name
0     Female  No      Michelle Lang
1     Male    Yes     Robert Norton
2     Male    No      Matthew Brown
3     Female  Yes     Sherry Jones
4     Male    Yes     Steven Vega
...   ...     ...     ...
1465  Male    Yes     Bradley Smith
1466  Male    Yes     Tyler Yu
1467  Male    Yes     Mr. Joshua Gallegos
1468  Male    Yes     Brian Aguilar
1469  Male    Yes     David Johnson
1470 rows × 3 columns

Generating random cities

# Generate a random city for every row
clients_df['city'] = [fake_data.city()
                      for _ in range(len(clients_df))]

clients_df.head()
   gender  active  name            city
0  Female  No      Stacy Hooper    Reedland
1  Male    Yes     Michael Rogers  North Michellestad
2  Male    No      James Sanchez   West Josephburgh
3  Female  Yes     Taylor Berger   Hermanton
4  Male    Yes     Joshua Coleman  South Amandaland

Generating emails

# Generate emails with different company domains
clients_df['contact email'] = [fake_data.company_email()
                               for _ in range(len(clients_df))]

# Explore the dataset
clients_df.head()
   gender  active  name            city                contact email
0  Female  No      Stacy Hooper    Reedland            [email protected]
1  Male    Yes     Michael Rogers  North Michellestad  [email protected]
2  Male    No      James Sanchez   West Josephburgh    [email protected]
3  Female  Yes     Taylor Berger   Hermanton           [email protected]
4  Male    Yes     Joshua Coleman  South Amandaland    [email protected]

Generating emails

# Generate emails with company-like domains and a username derived from the name
clients_df['contact email'] = [x.replace(" ", "") + "@" +
                               fake_data.domain_name()
                               for x in clients_df['name']]

# Explore the resulting DataFrame
clients_df.head()
   gender  active  name            city                contact email
0  Female  No      Stacy Hooper    Reedland            [email protected]
1  Male    Yes     Michael Rogers  North Michellestad  [email protected]
2  Male    No      James Sanchez   West Josephburgh    [email protected]
3  Female  Yes     Taylor Berger   Hermanton           [email protected]
4  Male    Yes     Joshua Coleman  South Amandaland    [email protected]
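
Removing spaces alone leaves capital letters, and titles such as the "Mr." in "Mr. Joshua Gallegos" end up inside the address. A minimal sketch of one possible clean-up, using only the standard library: the helper `to_username` and the placeholder domain `example.com` are assumptions made for this illustration, not part of Faker or of the slides.

```python
import re

def to_username(full_name: str) -> str:
    """Turn a display name into a lowercase email local part (hypothetical helper)."""
    # Drop a leading title such as "Mr." or "Dr." if one is present
    without_title = re.sub(
        r"^(mr|mrs|ms|miss|dr)\.?\s+", "", full_name.strip(), flags=re.IGNORECASE
    )
    # Join the remaining words with dots
    dotted = ".".join(without_title.lower().split())
    # Keep only ASCII letters, digits and dots, then trim stray dots at the ends
    return re.sub(r"[^a-z0-9.]", "", dotted).strip(".")

# example.com is a placeholder domain, not one produced by Faker
for name in ["Stacy Hooper", "Mr. Joshua Gallegos"]:
    print(to_username(name) + "@example.com")
# stacy.hooper@example.com
# joshua.gallegos@example.com
```

In the list comprehension above, `x.replace(" ", "")` could be swapped for `to_username(x)` while still taking the domain from `fake_data.domain_name()`. Note that this sketch drops non-ASCII letters, which may not be acceptable for every locale.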

Generating dates

Dates between two times

# Generate dates between ten years ago and now
clients_df['date'] = [fake_data.date_between(start_date="-10y", end_date="now")
                      for _ in range(len(clients_df))]

# Explore the resulting DataFrame
clients_df.head()
   gender  active  name            city                contact email      date
0  Female  No      Stacy Hooper    Reedland            [email protected]  2019-11-20
1  Male    Yes     Michael Rogers  North Michellestad  [email protected]  2015-02-22
2  Male    No      James Sanchez   West Josephburgh    [email protected]  2015-12-11
3  Female  Yes     Taylor Berger   Hermanton           [email protected]  2012-12-13
4  Male    Yes     Joshua Coleman  South Amandaland    [email protected]  2014-05-22

Generating cities with a probability distribution

When imitating a real dataset, we can avoid exposing the real names of its values.

# Import numpy
import numpy as np

# Obtain or specify the probabilities
p = (0.58, 0.23, 0.16, 0.03)
cities = ["New York", "Chicago", "Seattle", "Dallas"]

# Generate the cities from the selected ones following the distribution
clients_df['city'] = np.random.choice(cities, size=len(clients_df), p=p)

Generating cities with a probability distribution

# See the resulting dataset
clients_df.head()
   gender  active  name            city      contact email      date
0  Female  No      Stacy Hooper    Chicago   [email protected]  2019-11-20
1  Male    Yes     Michael Rogers  New York  [email protected]  2015-02-22
2  Male    No      James Sanchez   New York  [email protected]  2015-12-11
3  Female  Yes     Taylor Berger   Chicago   [email protected]  2012-12-13
4  Male    Yes     Joshua Coleman  New York  [email protected]  2014-05-22
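
The probabilities can be obtained from the real data rather than typed in by hand. One possible way, sketched below, is to take the relative frequencies with pandas `value_counts(normalize=True)` and sample substitute labels with those frequencies. The toy column `real_cities`, its town names and the substitute labels are invented for this illustration, and the sketch uses NumPy's seeded `default_rng` generator in place of `np.random.choice` so that it is repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical "real" column whose category names should not be exposed
real_cities = pd.Series(["Springfield"] * 6 + ["Rivertown"] * 3 + ["Lakeside"] * 1)

# Relative frequency of each real category, most common first
p = real_cities.value_counts(normalize=True).to_numpy()

# Substitute labels, assigned in the same most-common-first order
substitutes = ["New York", "Chicago", "Seattle"]

# Seeded generator so the sketch is repeatable; the seed is arbitrary
rng = np.random.default_rng(0)
synthetic = rng.choice(substitutes, size=1000, p=p)

# Compare the target probabilities with the sampled frequencies
print(p)
print(pd.Series(synthetic).value_counts(normalize=True))
```

The sampled frequencies approximate the target probabilities, and the match gets closer as the number of rows grows.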

Let's generate datasets!

