Creating synthetic datasets using scikit-learn

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Generating datasets with scikit-learn

We can create datasets that sample from probability distributions
Such as the normal distribution

Histogram of normally distributed data

Normal distribution

Often occurs in nature

Heights
Blood pressure
IQ scores

Histogram of height values of a large dataset that follow a normal distribution

Sample from a normal distribution

import numpy as np
import pandas as pd


# Create a new pandas DataFrame
new_measures = pd.DataFrame()


# Select the mean (center) and the standard deviation of the sample
mean = 65
standard_deviation = 2


# Generate the sample
new_measures['Height'] = np.random.normal(mean, standard_deviation, 10000)

Sample from a normal distribution

# Draw histogram to see the resulting heights distribution
new_measures['Height'].hist(bins=50)
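
As a quick numeric complement to the histogram, the sample's mean and standard deviation can be compared with the values requested. This is a minimal sketch: the seeded `np.random.default_rng` generator and the seed value 42 are assumptions added here for reproducibility, not part of the original slides.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator so the sample is reproducible
rng = np.random.default_rng(seed=42)

# Same parameters as in the slides: mean 65, standard deviation 2
heights = pd.DataFrame({"Height": rng.normal(loc=65, scale=2, size=10_000)})

# With 10,000 draws, the sample statistics should land close to 65 and 2
sample_mean = heights["Height"].mean()
sample_std = heights["Height"].std()
print(round(sample_mean, 2), round(sample_std, 2))
```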

Histogram of the resulting data

Creating datasets using scikit-learn

Scikit-learn has simple and easy-to-use functions for generating datasets to perform:

Classification
Clustering
Regression
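
For the regression case, a minimal sketch using `make_regression()` from `sklearn.datasets`. The parameter values and `random_state=0` are illustrative assumptions. With `noise=0.0` and the default bias of 0, the target should be an exact linear combination of the features, which the last line checks.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True also returns the coefficients used to build the target
X_reg, y_reg, coef = make_regression(n_samples=200,
                                     n_features=3,
                                     n_informative=3,
                                     noise=0.0,
                                     coef=True,
                                     random_state=0)

print(X_reg.shape)
print(y_reg.shape)

# Without noise, the target is reproduced by the features times the coefficients
print(np.allclose(X_reg @ coef, y_reg))
```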

Synthetic data for classification and clustering

`make_classification()`

It allocates normally-distributed clusters of points
Can create correlated and uninformative features

`make_blobs()`

Gives greater control over the centers and standard deviations of clusters
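
To illustrate the correlated and uninformative features mentioned for `make_classification()`, here is a small sketch. The feature counts and `random_state=0` are assumptions chosen for the example. Redundant features are built as linear combinations of the informative ones, so they should not raise the rank of the data matrix, while the remaining noise features do.

```python
import numpy as np
from sklearn.datasets import make_classification

# 6 features: 2 informative, 2 redundant (linear combinations of the
# informative ones), and the remaining 2 are uninformative noise
X_demo, y_demo = make_classification(n_samples=500,
                                     n_features=6,
                                     n_informative=2,
                                     n_redundant=2,
                                     n_repeated=0,
                                     random_state=0)

# Only the informative and noise columns are linearly independent
rank = np.linalg.matrix_rank(X_demo)
print(X_demo.shape)
print(rank)
```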

Synthetic data for classification

# Import make_classification from sklearn datasets module
from sklearn.datasets import make_classification


# Generate the samples and their labels
x, y = make_classification(n_samples=1000,
                           n_classes=2,
                           n_informative=2,
                           n_features=4,
                           n_clusters_per_class=2,
                           class_sep=1)

Synthetic data for classification

# See the generated data and labels
print(x.shape)
print(y.shape)
print(x)

(1000, 4)
(1000,)
[[ 1.22914870e+00 -2.62386795e+00  2.25878743e+00  2.55377055e+00]
 [-1.10279812e+00 -1.15816087e+00  1.55571279e+00  7.80565898e-02]
 [ 2.65581977e-03 -2.33278818e+00  2.37837858e+00  1.57533194e+00]
 ...
 [ 4.51006972e-01  7.53435745e-01 -9.21597108e-01 -2.20659747e-01]
 [ 5.31925876e-01  7.42210504e-01 -9.37625248e-01 -1.61488855e-01]
 [ 1.62862108e+00 -2.72435345e+00  2.22562940e+00  2.87628246e+00]]
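
The values printed above change on every run because no seed is set. A minimal sketch of making the dataset reproducible with the `random_state` parameter follows. The seed value 7 and the `params` dictionary are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as in the slides, collected in a dict for reuse
params = dict(n_samples=1000, n_classes=2, n_informative=2,
              n_features=4, n_clusters_per_class=2, class_sep=1)

# Assumption: any fixed integer seed works; 7 is arbitrary
x_a, y_a = make_classification(random_state=7, **params)
x_b, y_b = make_classification(random_state=7, **params)

# Identical seeds should give identical data and labels
same = np.array_equal(x_a, x_b) and np.array_equal(y_a, y_b)
print(same)

# Number of samples per class
class_counts = np.bincount(y_a)
print(class_counts)
```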

Synthetic data for classification

Plot of data points in the generated two-class dataset

Synthetic data for classification

Three plots showing the generated dataset and its data points with different class_sep values. On the left, the data points are very close together, while on the right they are widely separated.
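
The effect of `class_sep` can also be checked numerically. The sketch below measures the distance between the two class means for a few `class_sep` values. The specific values, `n_clusters_per_class=1`, `flip_y=0.0`, and `random_state=0` are assumptions chosen to keep the comparison simple.

```python
import numpy as np
from sklearn.datasets import make_classification

centroid_distance = {}
for sep in (0.1, 1.0, 5.0):
    # One cluster per class and no label noise, so class means are easy to compare
    x_sep, y_sep = make_classification(n_samples=1000,
                                       n_classes=2,
                                       n_informative=2,
                                       n_features=4,
                                       n_clusters_per_class=1,
                                       flip_y=0.0,
                                       class_sep=sep,
                                       random_state=0)
    # Distance between the mean point of class 0 and the mean point of class 1
    gap = x_sep[y_sep == 0].mean(axis=0) - x_sep[y_sep == 1].mean(axis=0)
    centroid_distance[sep] = np.linalg.norm(gap)
    print(sep, round(centroid_distance[sep], 2))
```

Larger `class_sep` values should give a larger distance between the class means, matching the plots.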

Synthetic data for clustering

# Import make_blobs from the sklearn datasets module to generate clustering datasets
from sklearn.datasets import make_blobs


# Specify a value for standard deviation
standard_deviation = 1.5


# Generate the data and labels of the dataset
x, labels = make_blobs(n_features=3,
                       centers=4,
                       cluster_std=standard_deviation)


# See the shape of the generated data
print(x.shape)

(100, 3)

Synthetic data for clustering

Plot showing the resulting clustering data points, one color for each cluster. 4 centers for 4 clusters.

Synthetic data for clustering

Three plots showing how the standard deviation affects the generated data points. On the left, the points stay close to their cluster centers, while on the right they are widely dispersed.
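
The same effect can be sketched numerically by passing explicit center coordinates to `make_blobs()` and measuring how far points fall from their own center. The center coordinates, the standard deviation values, and `random_state=0` are hypothetical choices for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Hypothetical 2-D cluster centers, passed explicitly instead of a count
fixed_centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])

mean_spread = {}
for std in (0.5, 1.5, 4.0):
    pts, lbl = make_blobs(n_samples=600,
                          centers=fixed_centers,
                          cluster_std=std,
                          random_state=0)
    # Distance of each point to the center of the cluster it was drawn from
    dist = np.linalg.norm(pts - fixed_centers[lbl], axis=1)
    mean_spread[std] = dist.mean()
    print(std, round(mean_spread[std], 2))
```

The average distance to the center should grow with `cluster_std`, which is the dispersion visible in the plots.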

Let's practice!
