Safely releasing datasets to the public

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Exploring datasets

To analyze potential privacy risks, first build domain knowledge and a statistical picture of the data.

# Explore the dataset
cross_selling.head()

    id    Gender  Age  Driving_License   Region_Code  Previously_Insured   Vehicle_Age   Vintage   Response
0    1    Male    44    1                28.0         0                    > 2 Years     217       1
1    2    Male    76    1                3.0          0                    1-2 Year      183       0
2    3    Male    47    1                28.0         0                    > 2 Years     27        1
3    4    Male    21    1                11.0         1                    < 1 Year      203       0
4    5    Female  29    1                41.0         1                    < 1 Year      39        0

Exploring datasets

# Explore the dataset
cross_selling.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  381109 non-null  int64  
 1   Gender              381109 non-null  object 
 2   Age                 381109 non-null  int64  
 3   Driving_License     381109 non-null  int64  
 4   Region_Code         381109 non-null  float64
 5   Previously_Insured  381107 non-null  int64  
 6   Vehicle_Age         381109 non-null  object 
 7   Vintage             381109 non-null  int64  
 8   Response            381109 non-null  int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 26.2+ MB

Exploring datasets

# Count the unique values in each column of the DataFrame
cross_selling.nunique()

id                    381109
Gender                     2
Age                       66
Driving_License            2
Region_Code               53
Previously_Insured         2
Vehicle_Age                3
Vintage                  290
Response                   2
dtype: int64
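
One way to read these counts is as a uniqueness ratio: a column with as many distinct values as there are rows (here `id`, with 381109 of 381109) identifies every record on its own. The sketch below illustrates that idea under stated assumptions: `toy` is a small made-up DataFrame standing in for `cross_selling`, and `uniqueness_ratio` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def uniqueness_ratio(df: pd.DataFrame) -> pd.Series:
    """Share of distinct values per column (1.0 means every row is unique)."""
    return (df.nunique() / len(df)).sort_values(ascending=False)


# Toy data standing in for cross_selling (values are illustrative only)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Gender": ["Male", "Male", "Female", "Male"],
    "Age": [44, 76, 44, 21],
})

ratios = uniqueness_ratio(toy)

# Columns that are unique for every row are candidates for suppression
candidates = ratios[ratios == 1.0].index.tolist()
print(ratios)
print(candidates)
```

A ratio of exactly 1.0 is a strict threshold. Columns with a high but not perfect ratio may still deserve a closer look.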

Suppressing unique attributes

# Apply attribute suppression to the id column
suppressed_df = cross_selling.drop('id', axis="columns")


# Check the first rows of the resulting DataFrame
suppressed_df.head()

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217       1
1    Male    76    1                 3.0           0                    1-2 Year      183       0
2    Male    47    1                 28.0          0                    > 2 Years     27        1
3    Male    21    1                 11.0          1                    < 1 Year      203       0
4    Female  29    1                 41.0          1                    < 1 Year      39        0
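
Note that `drop` returns a new DataFrame, so the original frame still holds the identifier and should not be released by accident. A minimal sketch on made-up data (`toy` is illustrative, not the course dataset):

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({"id": [1, 2, 3], "Age": [44, 76, 47]})

# drop returns a new DataFrame; the original keeps its id column
suppressed = toy.drop(columns=["id"])

print(list(toy.columns))         # original is unchanged
print(list(suppressed.columns))  # id has been removed
```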

Data cleaning

# Drop rows that contain null or NaN values
cleaned_df = suppressed_df.dropna(axis="index")

cleaned_df.head()

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217        1
1    Male    76    1                 3.0           0                    1-2 Year      183        0
2    Male    47    1                 28.0          0                    > 2 Years     27         1
3    Male    21    1                 11.0          1                    < 1 Year      203        0
4    Female  29    1                 41.0          1                    < 1 Year      39         0
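
The first rows look the same because the missing values sit elsewhere: the `info()` output lists 381107 non-null entries for `Previously_Insured` out of 381109 rows. A quick way to confirm what `dropna` removed is to count missing values before dropping and compare row counts afterwards. The sketch below assumes a small made-up DataFrame (`toy`), not the course data.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value (illustrative only)
toy = pd.DataFrame({
    "Age": [44, 76, 47, 21],
    "Previously_Insured": [0, np.nan, 0, 1],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()

# Drop every row that contains at least one missing value
toy_cleaned = toy.dropna(axis="index")

rows_dropped = len(toy) - len(toy_cleaned)
print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```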

Sampling from categorical values

# Calculate the probability distribution
cleaned_df['Gender'].value_counts(normalize=True)

Male      0.540957
Female    0.459043
Name: Gender, dtype: float64

Sampling from categorical values

import numpy as np

# Get the probability distribution values
distributions = cleaned_df['Gender'].value_counts(normalize=True)


# Sample from the calculated probability distribution
cleaned_df['Gender'] = np.random.choice(distributions.index, 
                                        p=distributions, 
                                        size=len(cleaned_df))
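
The same step can be sketched with a seeded generator so that the draw is reproducible, working on an explicit copy so the source frame stays untouched. This is a variation under stated assumptions, not the slide's exact code: `toy` is made-up data, the seed value is arbitrary, and `np.random.default_rng` replaces the legacy `np.random.choice` call.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({"Gender": ["Male"] * 6 + ["Female"] * 4})

# Work on an explicit copy so the original frame is left untouched
sampled = toy.copy()

# Empirical probability of each category
distributions = sampled["Gender"].value_counts(normalize=True)

# Seeded generator so the draw is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)
sampled["Gender"] = rng.choice(
    distributions.index.to_numpy(),
    p=distributions.to_numpy(),
    size=len(sampled),
)

print(sampled["Gender"].value_counts(normalize=True))
```

Each row receives a freshly drawn value, so the link between a person and their recorded gender is broken while the overall proportions stay close to the original.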

Sampling from categorical values

# View the resulting dataset
cleaned_df

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217        1
1    Male    76    1                 3.0           0                    1-2 Year      183        0
2    Male    47    1                 28.0          0                    > 2 Years     27         1
3    Female  21    1                 11.0          1                    < 1 Year      203        0
4    Male    29    1                 41.0          1                    < 1 Year      39         0
...    ...    ...    ...    ...    ...    ...    ...    ...
381107 rows × 8 columns

Sampling from categorical values

# Calculate the probability distribution again after sampling
cleaned_df['Gender'].value_counts(normalize=True)

Male      0.541973
Female    0.458027
Name: Gender, dtype: float64
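
In the output above, the share of `Male` moves from 0.540957 to 0.541973, a shift of about 0.001. A rough way to quantify this kind of drift is the largest absolute change in any category's share. The sketch below assumes synthetic data generated with proportions similar to the slide (not the course dataset) and an arbitrary seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Synthetic column with proportions similar to the slide (illustrative only)
original = pd.Series(rng.choice(["Male", "Female"], p=[0.54, 0.46], size=10_000))

# Distribution before resampling
before = original.value_counts(normalize=True)

# Resample every value from the empirical distribution
resampled = pd.Series(
    rng.choice(before.index.to_numpy(), p=before.to_numpy(), size=len(original))
)

# Distribution after resampling
after = resampled.value_counts(normalize=True)

# Largest absolute change in any category's share
max_shift = (before - after.reindex(before.index, fill_value=0)).abs().max()
print(f"Largest shift in category share: {max_shift:.4f}")
```

The shift tends to shrink as the number of rows grows, which is why sampling preserves the distribution well on a dataset with hundreds of thousands of rows.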

Removing column names

# Replace the column names with numbers
cleaned_df.columns = range(len(cleaned_df.columns))

     0        1    2    3       4    5            6      7
0    Male    44    1    28.0    0    > 2 Years    217    1
1    Male    76    1    3.0     0    1-2 Year     183    0
2    Male    47    1    28.0    0    > 2 Years    27     1
3    Female  21    1    11.0    1    < 1 Year     203    0
4    Male    29    1    41.0    1    < 1 Year     39     0
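
If the data owner needs to map results back to the original attributes, one option is to keep a private lookup from column number to column name and leave it out of the release. The sketch below is an assumption about how that could be done, using made-up data; `column_lookup` is a hypothetical name.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Age": [44, 29],
    "Response": [1, 0],
})

# Private lookup from position to original name; keep it out of the release
column_lookup = dict(enumerate(toy.columns))

# Released copy with numbered columns
released = toy.copy()
released.columns = range(len(released.columns))

# The data owner can restore the names later
restored = released.rename(columns=column_lookup)

print(list(released.columns))
print(list(restored.columns))
```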

Let's practice!
