Differentially private clustering models

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Comparing models

Image showing 3 clusters in data, after clustering with non-private k-means

Image showing 3 clusters in data, after clustering with private k-means


Comparing the models

Image showing the difference between the two clustering results displayed as pink points

  • Pink points mark where the two models' resulting clusters differ
  • The two models have the majority of results in common
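
Cluster numbers are arbitrary, so the two models may give the same group different labels. Before counting how many points they have in common, the labels need to be matched. A minimal sketch of one way to do that, using made-up toy labels and a hypothetical helper named best_agreement (neither comes from the slides):

```python
from itertools import permutations

def best_agreement(labels_a, labels_b, n_clusters):
    """Highest fraction of matching labels over all renumberings of labels_b."""
    best = 0.0
    for perm in permutations(range(n_clusters)):
        # perm[j] is the label that cluster j of labels_b is renamed to
        matches = sum(a == perm[b] for a, b in zip(labels_a, labels_b))
        best = max(best, matches / len(labels_a))
    return best

# Toy labels, invented for illustration: the grouping is the same except
# for the last point, but the second model numbers its clusters differently.
non_private = [0, 0, 1, 1, 2, 2]
private = [2, 2, 0, 0, 1, 0]

agreement = best_agreement(non_private, private, n_clusters=3)
print(f"Agreement after matching labels: {agreement:.2f}")
```

Trying every permutation is only practical for a small number of clusters, such as the 3 used here.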

Building differentially private clustering models

from diffprivlib.models import KMeans

# Computing the clusters with the DP model
model = KMeans(epsilon=1, n_clusters=3)

# Run the model and obtain clusters
clusters = model.fit_predict(X)

Improving DP clustering models

  • We can pre-process the data before clustering.
  • Feature scaling (such as StandardScaler) and dimensionality reduction (such as PCA) can help:
    • Reduce the inertia of the model
    • Get more accurate segmentation groups
  • With diffprivlib models, we do this just as we would with sklearn models.
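
Inertia is the sum of squared distances from each point to its nearest cluster centre, so lower values mean tighter clusters. A minimal sketch of that calculation, with a hypothetical inertia helper and made-up points and centroids (sklearn's KMeans reports the same quantity as inertia_ after fitting):

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centroids
        )
    return total

# Toy data, invented for illustration: two pairs of points far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good_centroids = [(0.0, 1.0), (10.0, 1.0)]   # one centroid per pair
poor_centroids = [(5.0, 1.0), (20.0, 1.0)]   # one centroid between the pairs

tight = inertia(points, good_centroids)
loose = inertia(points, poor_centroids)
print(tight, loose)
```

Well-placed centroids give a much smaller inertia than poorly placed ones on the same data.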

Improving DP clustering models

from sklearn.decomposition import PCA
from diffprivlib.models import KMeans as dp_Kmeans

# Initialize PCA
pca = PCA()

# Fit transform data with PCA
X = pca.fit_transform(X)

# Computing the clusters with the DP model
model = dp_Kmeans(epsilon=1, n_clusters=3)

# Run the model and obtain clusters
clusters = model.fit_predict(X)

Improving DP clustering models

Image showing two resulting scatter plots with the clusters


Improving DP clustering models

Image showing the difference between the two clustering results displayed as pink points

  • Improved results by using data transformations

Elbow method

Image of the resulting plot after applying the elbow method on data
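
The elbow is normally read off the plot by eye: it is the number of clusters after which inertia stops dropping sharply. As a rough sketch, one simple heuristic (an assumption here, not something the slides prescribe) picks the k where the drop in inertia slows down the most. The helper name elbow_k and the inertia values are hypothetical.

```python
def elbow_k(inertias_by_k):
    """Return the k where the decrease in inertia slows down the most."""
    ks = sorted(inertias_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = inertias_by_k[prev_k] - inertias_by_k[k]
        drop_after = inertias_by_k[k] - inertias_by_k[next_k]
        bend = drop_before - drop_after  # large when the curve flattens at k
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical inertia values for k = 1..6, invented for illustration.
inertias_by_k = {1: 1000.0, 2: 600.0, 3: 150.0, 4: 120.0, 5: 100.0, 6: 90.0}

chosen_k = elbow_k(inertias_by_k)
print(f"Elbow at k = {chosen_k}")
```

In practice the inertia values would come from fitting a clustering model once per candidate k.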


Epsilon

from diffprivlib.models import KMeans as dp_Kmeans

# Computing the clusters with the DP model
model = dp_Kmeans(epsilon=0.2, n_clusters=3)

# Run the model and obtain clusters
clusters = model.fit_predict(X)
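
A smaller epsilon means stronger privacy but more noise, which is why clusters computed with epsilon=0.2 tend to be less accurate than with epsilon=1. The sketch below illustrates this with the generic Laplace mechanism, where the noise scale is sensitivity / epsilon. It is an illustration only and does not reproduce how diffprivlib's KMeans adds noise internally; the helper names and the sensitivity of 1 are assumptions.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def mean_abs_noise(epsilon, sensitivity=1.0, draws=2000, seed=0):
    """Average size of the Laplace noise added for a given epsilon."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon  # smaller epsilon gives a larger scale
    return sum(abs(laplace_noise(scale, rng)) for _ in range(draws)) / draws

noise_eps_1 = mean_abs_noise(epsilon=1)
noise_eps_02 = mean_abs_noise(epsilon=0.2)
print(f"epsilon=1:   average noise {noise_eps_1:.2f}")
print(f"epsilon=0.2: average noise {noise_eps_02:.2f}")
```

With the same sensitivity, lowering epsilon from 1 to 0.2 makes the noise scale five times larger.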

Epsilon

Image showing two resulting scatter plots with the clusters


Let's practice!
