Creating synthetic datasets using scikit-learn

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Generating datasets with scikit-learn

  • We can create datasets that sample from probability distributions

  • Such as the normal distribution

Histogram of a normally distributed sample

Normal distribution

Often occurs in nature:

  • Heights
  • Blood pressure
  • IQ scores

Histogram of height values of a large dataset that follow a normal distribution

Sample from a normal distribution

import numpy as np
import pandas as pd

# Create a new pandas DataFrame
new_measures = pd.DataFrame()

# Select the mean (center) and the standard deviation of the sample
mean = 65
standard_deviation = 2

# Generate the sample
new_measures['Height'] = np.random.normal(mean, standard_deviation, 10000)

Sample from a normal distribution

# Draw histogram to see the resulting heights distribution
new_measures['Height'].hist(bins=50)

Histogram of the resulting data
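
Besides looking at the histogram, the sample can be checked numerically. The sketch below is only an illustration: it assumes NumPy's seeded Generator API instead of np.random.normal so the result is reproducible, and the seed value 42 is an arbitrary choice that does not come from the slides.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is an arbitrary assumption)
rng = np.random.default_rng(seed=42)

mean = 65
standard_deviation = 2
heights = rng.normal(loc=mean, scale=standard_deviation, size=10_000)

# Summary statistics of the generated sample
sample_mean = heights.mean()
sample_std = heights.std(ddof=1)

# Share of values within one standard deviation of the mean
# (roughly 68% is expected for a normal distribution)
within_one_std = np.mean(np.abs(heights - mean) <= standard_deviation)

print(f"sample mean: {sample_mean:.2f}")
print(f"sample standard deviation: {sample_std:.2f}")
print(f"share within one standard deviation: {within_one_std:.2%}")
```

With 10,000 draws, the sample mean and standard deviation should land close to the requested 65 and 2.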

Creating datasets using scikit-learn

Scikit-learn has simple and easy-to-use functions for generating datasets to perform:

  • Classification
  • Clustering
  • Regression

Synthetic data for classification and clustering

make_classification()

  • Creates normally distributed clusters of points
  • Can create correlated and uninformative features

make_blobs()

  • Gives greater control over the centers and standard deviations of clusters
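
To make "correlated and uninformative features" concrete, here is a minimal sketch. The parameter values are assumptions chosen for illustration; shuffle=False is used only so the columns keep their documented order (informative, then redundant, then noise), and random_state=0 is an arbitrary seed.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=2,   # features that carry the class signal
    n_redundant=2,     # linear combinations of the informative features
    n_repeated=0,
    n_classes=2,
    shuffle=False,     # keep column order: informative, redundant, noise
    random_state=0,    # arbitrary seed for reproducibility
)

informative = X[:, :2]
redundant = X[:, 2:4]
noise = X[:, 4:]       # the remaining uninformative feature

# The redundant columns should be recoverable from the informative ones
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
max_residual = np.abs(informative @ coef - redundant).max()

print(X.shape)
print(f"largest reconstruction error: {max_residual:.2e}")
```

A reconstruction error near zero indicates that the redundant features add no new information beyond the informative ones.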

Synthetic data for classification

# Import make_classification from sklearn datasets module
from sklearn.datasets import make_classification


# Generate the samples and their labels
x, y = make_classification(n_samples=1000,
                           n_classes=2,
                           n_informative=2,
                           n_features=4,
                           n_clusters_per_class=2,
                           class_sep=1)

Synthetic data for classification

# See the generated data and labels
print(x.shape)
print(y.shape)
print(x)
(1000, 4)
(1000,)
[[ 1.22914870e+00 -2.62386795e+00  2.25878743e+00  2.55377055e+00]
 [-1.10279812e+00 -1.15816087e+00  1.55571279e+00  7.80565898e-02]
 [ 2.65581977e-03 -2.33278818e+00  2.37837858e+00  1.57533194e+00]
 ...
 [ 4.51006972e-01  7.53435745e-01 -9.21597108e-01 -2.20659747e-01]
 [ 5.31925876e-01  7.42210504e-01 -9.37625248e-01 -1.61488855e-01]
 [ 1.62862108e+00 -2.72435345e+00  2.22562940e+00  2.87628246e+00]]
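
The printed values change on every run because no seed is set. A small sketch of how the same call might be made reproducible, and how the class balance could be inspected, follows. The random_state value is an arbitrary assumption and is not part of the slides.

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(n_samples=1000, n_classes=2, n_informative=2,
              n_features=4, n_clusters_per_class=2, class_sep=1)

# Fixing random_state (arbitrary value) makes repeated calls return the same data
x, y = make_classification(random_state=0, **params)
x_again, y_again = make_classification(random_state=0, **params)
same_data = np.array_equal(x, x_again) and np.array_equal(y, y_again)

# Number of samples per class label
class_counts = np.bincount(y)

print(same_data)
print(class_counts)
```

By default the classes are generated roughly balanced, so the two counts should be close to 500 each.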

Synthetic data for classification

Plot of data points in the generated 2 class dataset

Synthetic data for classification

Three plots showing the generated dataset with different class_sep values. On the left, the data points are very close together, while on the right they are widely separated.
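
The effect of class_sep can also be measured without plotting. This is a rough sketch under stated assumptions: class_mean_distance is a hypothetical helper, one cluster per class and no label noise (flip_y=0.0) are chosen to keep the comparison simple, and the class_sep values and seed are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification

def class_mean_distance(class_sep):
    """Hypothetical helper: distance between the two class means."""
    x, y = make_classification(
        n_samples=1000,
        n_classes=2,
        n_informative=2,
        n_features=4,
        n_clusters_per_class=1,  # assumption: one cluster per class
        flip_y=0.0,              # assumption: no randomly flipped labels
        class_sep=class_sep,
        random_state=0,          # arbitrary seed, same for every call
    )
    mean_class_0 = x[y == 0].mean(axis=0)
    mean_class_1 = x[y == 1].mean(axis=0)
    return np.linalg.norm(mean_class_0 - mean_class_1)

distances = {sep: class_mean_distance(sep) for sep in (0.1, 1.0, 5.0)}
for sep, distance in distances.items():
    print(f"class_sep={sep}: distance between class means = {distance:.2f}")
```

Larger class_sep values should push the class means further apart, which matches the left-to-right progression in the plots.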

Synthetic data for clustering

# Import make_blobs from the sklearn datasets module
from sklearn.datasets import make_blobs

# Specify a value for the standard deviation
standard_deviation = 1.5

# Generate the data and labels of the dataset (n_samples defaults to 100)
x, labels = make_blobs(n_features=3, centers=4, cluster_std=standard_deviation)

# See the shape of the generated data
print(x.shape)
(100, 3)

Synthetic data for clustering

Plot showing the resulting clustering data points, one color for each cluster. 4 centers for 4 clusters.

Synthetic data for clustering

Three plots showing how the standard deviation affects the generated data points. On the left, the points stay close to their cluster centers, while on the right they are widely dispersed.
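
The same effect can be sketched in numbers by passing explicit centers to make_blobs and measuring how far the points fall from them. The center coordinates, the cluster_std values, n_samples=400, and the seed below are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical cluster centers in 3 dimensions (4 centers for 4 clusters)
centers = np.array([
    [0.0, 0.0, 0.0],
    [10.0, 10.0, 10.0],
    [-10.0, 10.0, 0.0],
    [10.0, -10.0, 0.0],
])

spread = {}
for std in (0.5, 1.5, 5.0):
    x, labels = make_blobs(n_samples=400, centers=centers,
                           cluster_std=std, random_state=0)
    # Distance of every point to the center of its own cluster
    distances = np.linalg.norm(x - centers[labels], axis=1)
    spread[std] = distances.mean()

for std, mean_distance in spread.items():
    print(f"cluster_std={std}: mean distance to center = {mean_distance:.2f}")
```

The mean distance to the cluster center should grow with cluster_std, mirroring the increasing dispersion from the left plot to the right one.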

Let's practice!
