Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
To analyze potential privacy issues, first build domain knowledge and a statistical understanding of the data.
# Explore the dataset
cross_selling.head()
id Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vintage Response
0 1 Male 44 1 28.0 0 > 2 Years 217 1
1 2 Male 76 1 3.0 0 1-2 Year 183 0
2 3 Male 47 1 28.0 0 > 2 Years 27 1
3 4 Male 21 1 11.0 1 < 1 Year 203 0
4 5 Female 29 1 41.0 1 < 1 Year 39 0
# Explore the dataset: column types and non-null counts
cross_selling.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 9 columns):
# Column Non-Null Count Dtype
------ -------------- -----
0 id 381109 non-null int64
1 Gender 381109 non-null object
2 Age 381109 non-null int64
3 Driving_License 381109 non-null int64
4 Region_Code 381109 non-null float64
5 Previously_Insured 381107 non-null int64
6 Vehicle_Age 381109 non-null object
7 Vintage 381109 non-null int64
8 Response 381109 non-null int64
dtypes: float64(1), int64(6), object(2)
memory usage: 26.2+ MB
# Count the number of unique values in each column of the DataFrame
cross_selling.nunique()
id 381109
Gender 2
Age 66
Driving_License 2
Region_Code 53
Previously_Insured 2
Vehicle_Age 3
Vintage 290
Response 2
dtype: int64
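The unique counts above suggest that id takes a different value on every row, which makes it a likely direct identifier. A minimal sketch of that check, assuming a small toy DataFrame in place of cross_selling (the values and the helper name uniqueness_ratio are illustrative, not part of the original slides):

```python
import pandas as pd

# Toy stand-in for cross_selling; values are illustrative only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Male", "Male", "Female"],
    "Age": [44, 76, 44, 21, 29],
    "Response": [1, 0, 1, 0, 0],
})

def uniqueness_ratio(df: pd.DataFrame) -> pd.Series:
    """Share of distinct values per column (1.0 means every row is unique)."""
    return df.nunique() / len(df)

ratios = uniqueness_ratio(toy)

# Columns where every value is distinct are candidate direct identifiers
candidate_ids = ratios[ratios == 1.0].index.tolist()
print(candidate_ids)
```

A ratio of 1.0 is only a hint: on a small sample an ordinary attribute can look unique by chance, so domain knowledge is still needed to decide what to suppress.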
# Apply attribute suppression to the id column
suppressed_df = cross_selling.drop('id', axis="columns")
# Check the head of the resulting DataFrame
suppressed_df.head()
Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vintage Response
0 Male 44 1 28.0 0 > 2 Years 217 1
1 Male 76 1 3.0 0 1-2 Year 183 0
2 Male 47 1 28.0 0 > 2 Years 27 1
3 Male 21 1 11.0 1 < 1 Year 203 0
4 Female 29 1 41.0 1 < 1 Year 39 0
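Attribute suppression can be wrapped in a small helper when several columns need to go. This is a sketch under assumptions: suppress_attributes is a hypothetical name, and the toy data stands in for cross_selling.

```python
import pandas as pd

# Toy stand-in for cross_selling; values are illustrative only
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "Gender": ["Male", "Male", "Female"],
    "Age": [44, 76, 29],
})

def suppress_attributes(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a new DataFrame without the given columns (hypothetical helper)."""
    # DataFrame.drop returns a new object, so the input is left untouched
    return df.drop(columns=columns)

suppressed = suppress_attributes(toy, ["id"])
print(suppressed.columns.tolist())
```

Because drop returns a new DataFrame, the original data stays available for comparison until it is explicitly discarded.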
# Drop rows that contain null or NaN values
cleaned_df = suppressed_df.dropna(axis="index")
cleaned_df.head()
Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vintage Response
0 Male 44 1 28.0 0 > 2 Years 217 1
1 Male 76 1 3.0 0 1-2 Year 183 0
2 Male 47 1 28.0 0 > 2 Years 27 1
3 Male 21 1 11.0 1 < 1 Year 203 0
4 Female 29 1 41.0 1 < 1 Year 39 0
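Before dropping rows, it can help to see how many values are missing and how many rows will be lost. A minimal sketch, assuming a toy DataFrame with two missing values (illustrative data, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in with two missing values; values are illustrative only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Previously_Insured": [0.0, np.nan, 1.0, np.nan],
})

# Count missing values per column
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value
cleaned = toy.dropna(axis="index")

# Number of rows removed by the drop
rows_dropped = len(toy) - len(cleaned)
print(missing_per_column)
print(rows_dropped)
```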
# Compute the probability distribution
cleaned_df['Gender'].value_counts(normalize=True)
Male 0.540957
Female 0.459043
Name: Gender, dtype: float64
import numpy as np
# Get the probability distribution values
distributions = cleaned_df['Gender'].value_counts(normalize=True)
# Sample from the computed probability distribution
cleaned_df['Gender'] = np.random.choice(distributions.index, p=distributions, size=len(cleaned_df))
# View the resulting dataset
cleaned_df
Gender Age Driving_License Region_Code Previously_Insured Vehicle_Age Vintage Response
0 Male 44 1 28.0 0 > 2 Years 217 1
1 Male 76 1 3.0 0 1-2 Year 183 0
2 Male 47 1 28.0 0 > 2 Years 27 1
3 Female 21 1 11.0 1 < 1 Year 203 0
4 Male 29 1 41.0 1 < 1 Year 39 0
... ... ... ... ... ... ... ... ...
381107 rows × 8 columns
# Recompute the probability distribution after sampling
cleaned_df['Gender'].value_counts(normalize=True)
Male 0.541973
Female 0.458027
Name: Gender, dtype: float64
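The sampling step replaces each value with a random draw, so individual rows no longer carry the original value while the overall proportions stay close. A self-contained sketch of the same idea, assuming a toy column and a seeded NumPy Generator (the seed and the toy proportions are arbitrary choices, not from the slides):

```python
import numpy as np
import pandas as pd

# Seeded generator so the run is reproducible; the seed value is arbitrary
rng = np.random.default_rng(seed=42)

# Toy stand-in column: 54% Male, 46% Female (illustrative only)
toy = pd.DataFrame({"Gender": ["Male"] * 540 + ["Female"] * 460})

# Empirical probability of each category
distribution = toy["Gender"].value_counts(normalize=True)

# Work on a copy so the original column is not overwritten
anonymized = toy.copy()
anonymized["Gender"] = rng.choice(
    distribution.index.to_numpy(),
    p=distribution.to_numpy(),
    size=len(anonymized),
)

# Proportions after sampling should be close to the originals
new_distribution = anonymized["Gender"].value_counts(normalize=True)
print(new_distribution)
```

Note that sampling each column independently breaks its relationship with the other columns, which may or may not be acceptable for a given analysis.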
# Rename the columns with numbers
cleaned_df.columns = range(len(cleaned_df.columns))
0 1 2 3 4 5 6 7
0 Male 44 1 28.0 0 > 2 Years 217 1
1 Male 76 1 3.0 0 1-2 Year 183 0
2 Male 47 1 28.0 0 > 2 Years 27 1
3 Female 21 1 11.0 1 < 1 Year 203 0
4 Male 29 1 41.0 1 < 1 Year 39 0
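Replacing column names with numbers hides what each attribute means, but the names are lost unless a mapping is kept. A minimal sketch that keeps such a mapping, assuming a toy DataFrame (column_mapping is a hypothetical name, and where to store it is left open):

```python
import pandas as pd

# Toy stand-in for cleaned_df; values are illustrative only
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Age": [44, 29],
    "Response": [1, 0],
})

# Map each original column name to its position
column_mapping = {name: position for position, name in enumerate(toy.columns)}

# Rename on a copy; the original DataFrame keeps its names
masked = toy.rename(columns=column_mapping)

# Invert the mapping to restore the original names when needed
reverse_mapping = {position: name for name, position in column_mapping.items()}
restored = masked.rename(columns=reverse_mapping)
print(masked.columns.tolist())
```

If the mapping is kept, it should presumably be stored apart from the masked data, since anyone holding both can undo the renaming.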