Data Privacy and Anonymization in Python
Rebeca Gonzalez
Data engineer
To analyze possible privacy concerns, first build some domain and statistical knowledge of the dataset.
# Explore the dataset
cross_selling.head()
   id  Gender  Age  Driving_License  Region_Code  Previously_Insured  Vehicle_Age  Vintage  Response
0   1    Male   44                1         28.0                   0    > 2 Years      217         1
1   2    Male   76                1          3.0                   0     1-2 Year      183         0
2   3    Male   47                1         28.0                   0    > 2 Years       27         1
3   4    Male   21                1         11.0                   1     < 1 Year      203         0
4   5  Female   29                1         41.0                   1     < 1 Year       39         0
# Explore the dataset
cross_selling.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   id                  381109 non-null  int64
 1   Gender              381109 non-null  object
 2   Age                 381109 non-null  int64
 3   Driving_License     381109 non-null  int64
 4   Region_Code         381109 non-null  float64
 5   Previously_Insured  381107 non-null  int64
 6   Vehicle_Age         381109 non-null  object
 7   Vintage             381109 non-null  int64
 8   Response            381109 non-null  int64
dtypes: float64(1), int64(6), object(2)
memory usage: 26.2+ MB
# Count the unique values in each column of the DataFrame
cross_selling.nunique()
id 381109
Gender 2
Age 66
Driving_License 2
Region_Code 53
Previously_Insured 2
Vehicle_Age 3
Vintage 290
Response 2
dtype: int64
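The counts above show that `id` has as many unique values as the dataset has rows, which marks it as a direct identifier. A minimal sketch of how such columns could be flagged automatically follows. The helper name `find_unique_identifier_columns` and the toy DataFrame are assumptions made for illustration and are not part of the slides.

```python
import pandas as pd


def find_unique_identifier_columns(df: pd.DataFrame) -> list:
    """Return the columns whose values are all distinct (hypothetical helper)."""
    n_rows = len(df)
    unique_counts = df.nunique()
    return [col for col in df.columns if unique_counts[col] == n_rows]


# Toy data for illustration only, not the cross_selling dataset
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Gender": ["Male", "Male", "Female", "Male"],
    "Age": [44, 76, 47, 44],
})

candidates = find_unique_identifier_columns(toy)
print(candidates)
```

A column flagged this way is only a candidate for suppression. Whether it really identifies a person still depends on domain knowledge.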
# Apply attribute suppression on the id column
suppressed_df = cross_selling.drop('id', axis="columns")
# Check the head of the resulting DataFrame
suppressed_df.head()
   Gender  Age  Driving_License  Region_Code  Previously_Insured  Vehicle_Age  Vintage  Response
0    Male   44                1         28.0                   0    > 2 Years      217         1
1    Male   76                1          3.0                   0     1-2 Year      183         0
2    Male   47                1         28.0                   0    > 2 Years       27         1
3    Male   21                1         11.0                   1     < 1 Year      203         0
4  Female   29                1         41.0                   1     < 1 Year       39         0
# Drop null and NaN rows
cleaned_df = suppressed_df.dropna(axis="index")
cleaned_df.head()
   Gender  Age  Driving_License  Region_Code  Previously_Insured  Vehicle_Age  Vintage  Response
0    Male   44                1         28.0                   0    > 2 Years      217         1
1    Male   76                1          3.0                   0     1-2 Year      183         0
2    Male   47                1         28.0                   0    > 2 Years       27         1
3    Male   21                1         11.0                   1     < 1 Year      203         0
4  Female   29                1         41.0                   1     < 1 Year       39         0
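Before dropping rows, it can help to check how many values are missing and how many rows the drop removes. The sketch below does this on a small made-up DataFrame. The variable names and the data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only, with one missing value
toy = pd.DataFrame({
    "Age": [44, 76, 47, 21],
    "Previously_Insured": [0, np.nan, 1, 1],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
cleaned = toy.dropna(axis="index")
rows_dropped = len(toy) - len(cleaned)
print(rows_dropped)
```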
# Compute the probability distribution
cleaned_df['Gender'].value_counts(normalize=True)
Male 0.540957
Female 0.459043
Name: Gender, dtype: float64
import numpy as np
# Get the probability distribution values
distributions = cleaned_df['Gender'].value_counts(normalize=True)
# Sample from the calculated probability distributions
cleaned_df['Gender'] = np.random.choice(distributions.index, p=distributions, size=len(cleaned_df))
# See the resulting dataset
cleaned_df
     Gender  Age  Driving_License  Region_Code  Previously_Insured  Vehicle_Age  Vintage  Response
0      Male   44                1         28.0                   0    > 2 Years      217         1
1      Male   76                1          3.0                   0     1-2 Year      183         0
2      Male   47                1         28.0                   0    > 2 Years       27         1
3    Female   21                1         11.0                   1     < 1 Year      203         0
4      Male   29                1         41.0                   1     < 1 Year       39         0
...     ...  ...              ...          ...                 ...          ...      ...       ...
381107 rows × 8 columns
# Compute the probability distribution
cleaned_df['Gender'].value_counts(normalize=True)
Male 0.541973
Female 0.458027
Name: Gender, dtype: float64
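The proportions before and after sampling are close but not identical, because every value is redrawn at random from the column's empirical distribution. A minimal sketch of the same idea with a seeded generator, so the result is reproducible, is shown below. The helper `resample_column`, the seed value, and the toy data are assumptions for illustration and are not taken from the slides.

```python
import numpy as np
import pandas as pd


def resample_column(series: pd.Series, rng: np.random.Generator) -> np.ndarray:
    """Redraw every value from the column's own empirical distribution."""
    distribution = series.value_counts(normalize=True)
    return rng.choice(
        distribution.index.to_numpy(),
        p=distribution.to_numpy(),
        size=len(series),
    )


# Toy data for illustration only: 54% Male, 46% Female
toy = pd.DataFrame({"Gender": ["Male"] * 540 + ["Female"] * 460})

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary
toy["Gender"] = resample_column(toy["Gender"], rng)

# The proportions stay close to the original ones, but a row no longer
# necessarily carries its original value
print(toy["Gender"].value_counts(normalize=True))
```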
# Replace column names with numbers
cleaned_df.columns = range(len(cleaned_df.columns))
        0   1  2     3  4          5    6  7
0    Male  44  1  28.0  0  > 2 Years  217  1
1    Male  76  1   3.0  0   1-2 Year  183  0
2    Male  47  1  28.0  0  > 2 Years   27  1
3  Female  21  1  11.0  1   < 1 Year  203  0
4    Male  29  1  41.0  1   < 1 Year   39  0
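Replacing column names with numbers hides what each attribute means. If the original names must be recoverable later, one option is to store the number-to-name mapping separately from the masked data. This workflow is an assumption and is not shown in the slides. The sketch below uses a made-up DataFrame.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Age": [44, 29],
    "Response": [1, 0],
})

# Keep the number-to-name mapping so the renaming can be undone later
column_mapping = dict(enumerate(toy.columns))

# Replace column names with numbers on a copy of the data
masked = toy.copy()
masked.columns = range(len(masked.columns))
print(list(masked.columns))

# Restore the original names from the stored mapping
restored = masked.rename(columns=column_mapping)
print(list(restored.columns))
```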