Safely releasing datasets to the public

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Exploring the dataset

To analyze potential privacy issues, first build domain knowledge about the dataset and examine its statistics.

# Preview the first rows of the dataset
cross_selling.head()

    id    Gender  Age  Driving_License   Region_Code  Previously_Insured   Vehicle_Age   Vintage   Response
0    1    Male    44    1                28.0         0                    > 2 Years     217       1
1    2    Male    76    1                3.0          0                    1-2 Year      183       0
2    3    Male    47    1                28.0         0                    > 2 Years     27        1
3    4    Male    21    1                11.0         1                    < 1 Year      203       0
4    5    Female  29    1                41.0         1                    < 1 Year      39        0

Exploring the dataset

# Inspect column types and non-null counts
cross_selling.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  381109 non-null  int64  
 1   Gender              381109 non-null  object 
 2   Age                 381109 non-null  int64  
 3   Driving_License     381109 non-null  int64  
 4   Region_Code         381109 non-null  float64
 5   Previously_Insured  381107 non-null  int64  
 6   Vehicle_Age         381109 non-null  object 
 7   Vintage             381109 non-null  int64  
 8   Response            381109 non-null  int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 26.2+ MB
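
The `info()` output shows that `Previously_Insured` has two fewer non-null entries than the other columns. One way to confirm which columns contain missing values is to count them per column. The sketch below uses a small made-up DataFrame, `toy_df`, as a stand-in for `cross_selling`; its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cross_selling (values are made up for illustration)
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Age": [44, 29, 47, 21],
    "Previously_Insured": [0, np.nan, 1, np.nan],
})

# Count missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)
```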

Exploring the dataset

# Count the number of unique values in each column
cross_selling.nunique()

id                    381109
Gender                     2
Age                       66
Driving_License            2
Region_Code               53
Previously_Insured         2
Vehicle_Age                3
Vintage                  290
Response                   2
dtype: int64
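
A column whose number of unique values equals the number of rows, like `id` here, identifies each record on its own. As a minimal sketch, assuming a small made-up DataFrame in place of `cross_selling`, the ratio of unique values to rows can flag such columns. The helper name `find_unique_columns` is hypothetical and not part of the course material.

```python
import pandas as pd

def find_unique_columns(df):
    """Return the names of columns where every row has a distinct value."""
    unique_ratio = df.nunique() / len(df)
    return unique_ratio[unique_ratio == 1.0].index.tolist()

# Toy stand-in for cross_selling (values are made up for illustration)
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Gender": ["Male", "Male", "Female", "Male"],
    "Age": [44, 76, 44, 21],
})

unique_columns = find_unique_columns(toy_df)
print(unique_columns)
```

Columns flagged this way are candidates for suppression, although a high but not perfect ratio can also deserve a closer look.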

Suppressing unique attributes

# Apply attribute suppression to the id column
suppressed_df = cross_selling.drop('id', axis="columns")


# Check the head of the resulting DataFrame
suppressed_df.head()

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217       1
1    Male    76    1                 3.0           0                    1-2 Year      183       0
2    Male    47    1                 28.0          0                    > 2 Years     27        1
3    Male    21    1                 11.0          1                    < 1 Year      203       0
4    Female  29    1                 41.0          1                    < 1 Year      39        0

Cleaning the data

# Drop rows that contain null or NaN values
cleaned_df = suppressed_df.dropna(axis="index")

cleaned_df.head()

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217        1
1    Male    76    1                 3.0           0                    1-2 Year      183        0
2    Male    47    1                 28.0          0                    > 2 Years     27         1
3    Male    21    1                 11.0          1                    < 1 Year      203        0
4    Female  29    1                 41.0          1                    < 1 Year      39         0
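
To see how many records the cleaning step removes, the row counts before and after `dropna` can be compared. This is a sketch on a made-up DataFrame rather than on `cross_selling`; the names `toy_df` and `cleaned_toy` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for suppressed_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Previously_Insured": [0, np.nan, 1, 1],
})

# Drop every row that has at least one missing value
cleaned_toy = toy_df.dropna(axis="index")

# Number of rows removed by the cleaning step
rows_dropped = len(toy_df) - len(cleaned_toy)
print(rows_dropped)
```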

Sampling from categorical values

# Compute the probability distribution
cleaned_df['Gender'].value_counts(normalize=True)

Male      0.540957
Female    0.459043
Name: Gender, dtype: float64

Sampling from categorical values

import numpy as np

# Get the probability distribution of the Gender column
distributions = cleaned_df['Gender'].value_counts(normalize=True)

# Sample new values from the computed probability distribution
cleaned_df['Gender'] = np.random.choice(distributions.index, 
                                        p=distributions, 
                                        size=len(cleaned_df))

Sampling from categorical values

# View the resulting dataset
cleaned_df

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217        1
1    Male    76    1                 3.0           0                    1-2 Year      183        0
2    Male    47    1                 28.0          0                    > 2 Years     27         1
3    Female  21    1                 11.0          1                    < 1 Year      203        0
4    Male    29    1                 41.0          1                    < 1 Year      39         0
...    ...    ...    ...    ...    ...    ...    ...    ...
381107 rows × 8 columns

Sampling from categorical values

# Compute the probability distribution after sampling
cleaned_df['Gender'].value_counts(normalize=True)

Male      0.541973
Female    0.458027
Name: Gender, dtype: float64
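
The proportions after sampling stay close to the original ones, which is the point of the technique: individual values change while the overall distribution is roughly preserved. The sketch below reproduces this on a made-up column and measures the largest shift in proportions. It assumes NumPy's `default_rng` with an arbitrary seed so the result is reproducible, and it works on a copy instead of overwriting the original DataFrame; both choices are illustrative, not taken from the slides.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cleaned_df (values are made up for illustration)
toy_df = pd.DataFrame({"Gender": ["Male"] * 540 + ["Female"] * 460})

# Original probability distribution of the column
original_dist = toy_df["Gender"].value_counts(normalize=True)

# Seeded generator so the sampling is reproducible (the seed is arbitrary)
rng = np.random.default_rng(42)

# Replace the column with values sampled from its own distribution
sampled_df = toy_df.copy()
sampled_df["Gender"] = rng.choice(
    original_dist.index.to_numpy(),
    p=original_dist.to_numpy(),
    size=len(sampled_df),
)

# Largest absolute change in any category's proportion
sampled_dist = sampled_df["Gender"].value_counts(normalize=True)
max_shift = (sampled_dist - original_dist).abs().max()
print(max_shift)
```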

Removing column names

# Replace the column names with numbers
cleaned_df.columns = range(len(cleaned_df.columns))

     0        1    2    3       4    5            6      7
0    Male    44    1    28.0    0    > 2 Years    217    1
1    Male    76    1    3.0     0    1-2 Year     183    0
2    Male    47    1    28.0    0    > 2 Years    27     1
3    Female  21    1    11.0    1    < 1 Year     203    0
4    Male    29    1    41.0    1    < 1 Year     39     0
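
The renaming step can be tried on a small made-up DataFrame, as sketched below. Keeping the original names in a private `column_mapping` dictionary is an illustrative assumption for cases where the data owner needs to interpret the columns later; the slides do not prescribe it.

```python
import pandas as pd

# Toy stand-in for cleaned_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Age": [44, 29],
    "Response": [1, 0],
})

# Keep the original names privately (illustrative assumption, not from the slides)
column_mapping = dict(enumerate(toy_df.columns))

# Replace the column names with their positions
anonymized_df = toy_df.copy()
anonymized_df.columns = range(len(anonymized_df.columns))
print(anonymized_df.head())
```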

Let's practice!
