Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data Engineer



# Import the scikit-learn naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

# Import the differentially private naive Bayes classifier
# (aliased so it does not shadow the scikit-learn class)
from diffprivlib.models import GaussianNB as dp_GaussianNB
from sklearn.naive_bayes import GaussianNB

# Build the non-private classifier
nonprivate_clf = GaussianNB()

# Fit the model to the data
nonprivate_clf.fit(X_train, y_train)

print("The accuracy of the non-private model is ", nonprivate_clf.score(X_test, y_test))
The accuracy of the non-private model is 0.8333333333333334
from diffprivlib.models import GaussianNB as dp_GaussianNB

# Build the private classifier with an empty constructor
private_clf = dp_GaussianNB()

# Fit the model to the data and check the score
private_clf.fit(X_train, y_train)

print("The accuracy of the private model is ", private_clf.score(X_test, y_test))
The accuracy of the private model is 0.7
PrivacyLeakWarning: Bounds have not been specified and will be calculated
on the data provided. This will result in additional privacy leakage.
To ensure differential privacy and no additional privacy leakage, specify bounds for each dimension.
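To avoid this privacy leakage, specify the minimum and maximum values of the data yourself with the bounds argument. It is a tuple of the form (min, max) and can hold either:
scalars that apply to every feature, e.g. (0,100)
arrays with one entry per feature, e.g. ([0,1,0,2],[10,80,5,70])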
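The two forms of bounds shown above can be illustrated with a minimal standard-library sketch. It expands either form to one (min, max) pair per feature and clips values that fall outside the range. The helpers `expand_bounds` and `clip_row` are hypothetical names used only for illustration; they are not part of diffprivlib, and the library's own handling may differ in detail.

```python
def expand_bounds(bounds, n_features):
    """Return (lower, upper) lists with one entry per feature."""
    lower, upper = bounds
    # A scalar bound applies to every feature
    if not isinstance(lower, (list, tuple)):
        lower = [lower] * n_features
    if not isinstance(upper, (list, tuple)):
        upper = [upper] * n_features
    if len(lower) != n_features or len(upper) != n_features:
        raise ValueError("bounds must have one entry per feature")
    if any(lo > hi for lo, hi in zip(lower, upper)):
        raise ValueError("each lower bound must not exceed its upper bound")
    return list(lower), list(upper)


def clip_row(row, bounds):
    """Clip each value in a row to its feature's [lower, upper] range."""
    lower, upper = expand_bounds(bounds, len(row))
    return [min(max(v, lo), hi) for v, lo, hi in zip(row, lower, upper)]


row = [12, 50, -3, 71]

# Scalar form: the same range for all four features
scalar_clipped = clip_row(row, (0, 100))

# Per-feature form: one lower and one upper value per feature
vector_clipped = clip_row(row, ([0, 1, 0, 2], [10, 80, 5, 70]))

print(scalar_clipped)  # [12, 50, 0, 71]
print(vector_clipped)  # [10, 50, 0, 70]
```

With the scalar form only the negative value is clipped, while the tighter per-feature form also clips the first and last values.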
# Set bounds that cover at least the min and max values
bounds = (X_train.min(axis=0) - 1, X_train.max(axis=0) + 1)

# Build the classifier with epsilon 0.5
dp_clf = dp_GaussianNB(epsilon=0.5, bounds=bounds)

# Fit the model to the data and check the score
dp_clf.fit(X_train, y_train)

print("The accuracy of the private model is ", dp_clf.score(X_test, y_test))
The accuracy of the private model is 0.807000
# Import the random module
import random

# Set the min and max of the bounds with some added noise
# (12 random offsets: this assumes X_train has 12 feature columns)
bounds = (X_train.min(axis=0) - random.sample(range(0, 30), 12),
          X_train.max(axis=0) + random.sample(range(0, 30), 12))

# Build the classifier with epsilon 0.5
dp_clf = dp_GaussianNB(epsilon=0.5, bounds=bounds)

# Fit the model to the data and check the score
dp_clf.fit(X_train, y_train)

print("The accuracy of private classifier with bounds is ", dp_clf.score(X_test, y_test))
The accuracy of private classifier with bounds is 0.7544444444
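The noisy bounds above hard-code 12 offsets, which only works when the data has 12 feature columns. As a rough sketch of the same idea for any number of features, the hypothetical helper `padded_bounds` below (not a diffprivlib function) widens each column's observed range by a random integer offset. It assumes the data is a list of equal-length numeric rows, and it draws offsets independently with `randrange` rather than without replacement as `random.sample` does.

```python
import random


def padded_bounds(rows, max_pad=30, rng=None):
    """Widen each feature's observed range by a random offset in [0, max_pad).

    rows    : list of equal-length numeric rows (one row per sample)
    max_pad : exclusive upper limit for each random offset, must be >= 1
    rng     : optional random.Random instance, useful for reproducibility
    """
    rng = rng or random.Random()
    # Transpose the rows so that each tuple holds one feature's values
    columns = list(zip(*rows))
    lower = [min(col) - rng.randrange(max_pad) for col in columns]
    upper = [max(col) + rng.randrange(max_pad) for col in columns]
    return lower, upper


# Toy data with three features (illustrative values only)
rows = [[1, 20, 3], [4, 10, 9], [2, 15, 6]]
lower, upper = padded_bounds(rows, max_pad=30, rng=random.Random(0))
print(lower, upper)
```

The resulting pair has the same (min, max) shape as the bounds argument used above, with one entry per feature.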
