Safely release datasets to the public

Data Privacy and Anonymization in Python

Rebeca Gonzalez

Data engineer

Exploring datasets

Before analyzing possible privacy concerns in a dataset, first build some domain and statistical knowledge of it.

# Explore the dataset
cross_selling.head()

    id    Gender  Age  Driving_License   Region_Code  Previously_Insured   Vehicle_Age   Vintage   Response
0    1    Male    44    1                28.0         0                    > 2 Years     217       1
1    2    Male    76    1                3.0          0                    1-2 Year      183       0
2    3    Male    47    1                28.0         0                    > 2 Years     27        1
3    4    Male    21    1                11.0         1                    < 1 Year      203       0
4    5    Female  29    1                41.0         1                    < 1 Year      39        0


# Explore the dataset
cross_selling.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  381109 non-null  int64  
 1   Gender              381109 non-null  object 
 2   Age                 381109 non-null  int64  
 3   Driving_License     381109 non-null  int64  
 4   Region_Code         381109 non-null  float64
 5   Previously_Insured  381107 non-null  int64  
 6   Vehicle_Age         381109 non-null  object 
 7   Vintage             381109 non-null  int64  
 8   Response            381109 non-null  int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 26.2+ MB


# Count the number of unique values in each column
cross_selling.nunique()

id                    381109
Gender                     2
Age                       66
Driving_License            2
Region_Code               53
Previously_Insured         2
Vehicle_Age                3
Vintage                  290
Response                   2
dtype: int64

Suppressing unique attributes

# Apply attribute suppression on the id column
suppressed_df = cross_selling.drop('id', axis="columns")


# Check the head of the resulting DataFrame
suppressed_df.head()

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217       1
1    Male    76    1                 3.0           0                    1-2 Year      183       0
2    Male    47    1                 28.0          0                    > 2 Years     27        1
3    Male    21    1                 11.0          1                    < 1 Year      203       0
4    Female  29    1                 41.0          1                    < 1 Year      39        0
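The id column was picked by reading the nunique() output: it has as many distinct values as the dataset has rows. A minimal sketch of the same check done programmatically is shown here. The toy data and the helper name find_unique_columns are assumptions made for illustration and are not part of the course dataset.

```python
import pandas as pd

# Toy stand-in for cross_selling (hypothetical values)
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Male", "Male", "Female"],
    "Age": [44, 76, 44, 21, 29],
})

def find_unique_columns(df):
    """Return the columns whose distinct-value count equals the row count."""
    counts = df.nunique()
    return counts[counts == len(df)].index.tolist()

# Columns that identify every row uniquely are candidates for suppression
unique_cols = find_unique_columns(toy_df)
suppressed_toy = toy_df.drop(columns=unique_cols)

print(unique_cols)
print(suppressed_toy.columns.tolist())
```

On a small sample, an ordinary column such as Age can be unique in every row by chance, so the returned list is best treated as a set of candidates to review rather than columns to drop automatically.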

Cleaning data

# Drop rows that contain missing values (None or NaN)
cleaned_df = suppressed_df.dropna(axis="index")

cleaned_df.head()

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217        1
1    Male    76    1                 3.0           0                    1-2 Year      183        0
2    Male    47    1                 28.0          0                    > 2 Years     27         1
3    Male    21    1                 11.0          1                    < 1 Year      203        0
4    Female  29    1                 41.0          1                    < 1 Year      39         0
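Before dropping rows, it can help to see where the missing values are and how many rows the cleaning step removes. The following is a small sketch on made-up data. The toy frame and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one missing value (hypothetical values)
toy_df = pd.DataFrame({
    "Age": [44, 76, 47, 21],
    "Previously_Insured": [0, np.nan, 0, 1],
})

# Count missing values per column
missing_per_column = toy_df.isna().sum()

# Drop every row that contains at least one missing value
cleaned_toy = toy_df.dropna(axis="index")

# Compare row counts before and after cleaning
rows_dropped = len(toy_df) - len(cleaned_toy)

print(missing_per_column)
print(rows_dropped)
```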

Sampling from categorical values

# Compute the probability distribution 
cleaned_df['Gender'].value_counts(normalize=True)

Male      0.540957
Female    0.459043
Name: Gender, dtype: float64


# Import NumPy for random sampling
import numpy as np

# Get the probability distribution values
distributions = cleaned_df['Gender'].value_counts(normalize=True)

# Sample from the calculated probability distribution
cleaned_df['Gender'] = np.random.choice(distributions.index, 
                                        p=distributions, 
                                        size=len(cleaned_df))


# See the resulting dataset
cleaned_df

    Gender   Age   Driving_License   Region_Code   Previously_Insured   Vehicle_Age   Vintage   Response
0    Male    44    1                 28.0          0                    > 2 Years     217        1
1    Male    76    1                 3.0           0                    1-2 Year      183        0
2    Male    47    1                 28.0          0                    > 2 Years     27         1
3    Female  21    1                 11.0          1                    < 1 Year      203        0
4    Male    29    1                 41.0          1                    < 1 Year      39         0
...    ...    ...    ...    ...    ...    ...    ...    ...
381107 rows × 8 columns


# Compute the probability distribution again, after sampling
cleaned_df['Gender'].value_counts(normalize=True)

Male      0.541973
Female    0.458027
Name: Gender, dtype: float64
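The proportions after sampling are close to the original ones but not identical, because each row's value is drawn at random. A self-contained sketch of the same idea on toy data follows, using a seeded NumPy generator so that the result is reproducible. The toy frame, the seed, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same sample (seed is arbitrary)
rng = np.random.default_rng(seed=42)

# Toy stand-in for the Gender column (hypothetical values)
toy_df = pd.DataFrame({"Gender": ["Male"] * 54 + ["Female"] * 46})

# Empirical distribution of the original column
original = toy_df["Gender"].value_counts(normalize=True)

# Replace every value with a draw from that distribution
toy_df["Gender"] = rng.choice(original.index.to_numpy(),
                              p=original.to_numpy(),
                              size=len(toy_df))

# Distribution after sampling: similar to, but usually not equal to, the original
resampled = toy_df["Gender"].value_counts(normalize=True)

print(original)
print(resampled)
```

Each row now holds a value that is unrelated to the person's real value, while the overall proportions stay roughly the same. The larger the dataset, the closer the sampled proportions tend to be to the original ones.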

Removing column names

# Replace column names with numbers
cleaned_df.columns = range(len(cleaned_df.columns))

     0        1    2    3       4    5            6      7
0    Male    44    1    28.0    0    > 2 Years    217    1
1    Male    76    1    3.0     0    1-2 Year     183    0
2    Male    47    1    28.0    0    > 2 Years    27     1
3    Female  21    1    11.0    1    < 1 Year     203    0
4    Male    29    1    41.0    1    < 1 Year     39     0
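If the data owner needs to map the numbered columns back to their names later, one option is to keep a private lookup before renaming. This is a sketch under that assumption and is not a step shown in the slides. The toy frame and the names column_key, released, and restored are hypothetical.

```python
import pandas as pd

# Toy stand-in for cleaned_df (hypothetical values)
toy_df = pd.DataFrame({"Gender": ["Male", "Female"], "Age": [44, 29]})

# Private lookup from column number to original name; not for release
column_key = dict(enumerate(toy_df.columns))

# Version to release: column names replaced with numbers
released = toy_df.copy()
released.columns = range(len(released.columns))

# The data owner can restore the names with the private lookup
restored = released.rename(columns=column_key)

print(list(released.columns))
print(list(restored.columns))
```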

Let's practice!
