Data cleaning and exploratory analysis

A/B Testing in Python

Moe Lotfy, PhD

Principal Data Science Manager

Cleaning missing values

  • Missing values
    • Drop, ignore, impute
# Calculate the mean order value (missing values are ignored by default)
checkout['order_value'].mean()
30.0096
# Impute missing values with zeros, then recompute the mean
checkout['order_value'].fillna(0).mean()
25.3581
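The three options in the bullet above (drop, ignore, impute) can be compared side by side. The following is a minimal sketch on a hypothetical toy DataFrame named `toy`, not the course's `checkout` data; the median imputation is an added illustration of an alternative to zero-filling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the checkout table.
toy = pd.DataFrame({"order_value": [20.0, 30.0, np.nan, 40.0, np.nan]})

# Ignore: pandas skips NaN by default (skipna=True).
mean_ignore = toy["order_value"].mean()

# Drop: remove rows with a missing order value, then average.
mean_drop = toy.dropna(subset=["order_value"])["order_value"].mean()

# Impute with zero: missing rows now count in the denominator.
mean_zero = toy["order_value"].fillna(0).mean()

# Impute with the median: keeps every row without pulling the mean toward zero.
median_value = toy["order_value"].median()
mean_median = toy["order_value"].fillna(median_value).mean()

print(mean_ignore, mean_drop, mean_zero, mean_median)
```

Ignoring and dropping give the same mean for a single column, while zero-filling lowers it, which is the same effect seen in the drop from 30.0096 to 25.3581 above.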

Cleaning duplicates

  • Duplicates
    • Identical rows should be dropped
# Check for duplicate rows due to logging issues 
print(len(checkout))
print(len(checkout.drop_duplicates(keep='first')))
9000
9000
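Both lengths above are equal, so the dataset contains no fully identical rows. When the counts do differ, the duplicates can be counted and removed as in this sketch, which uses a hypothetical toy DataFrame with one repeated row.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "checkout_page": ["A", "B", "B", "C"],
    "purchased": [1, 0, 0, 1],
})

# Flag every repeat of an earlier row (the first occurrence is not flagged).
n_duplicates = int(toy.duplicated(keep="first").sum())

# Keep the first occurrence of each row and rebuild a clean index.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates, len(toy), len(deduped))
```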

Cleaning duplicates

  • Duplicates
    • Duplicate users should be handled with care.
# Unique users in group B
print(checkout[checkout['checkout_page'] == 'B']['user_id'].nunique())
# Unique users who purchased at least once
print(checkout[checkout['checkout_page'] == 'B'].groupby('user_id')['purchased'].max().sum())
# Total purchase events in group B
print(checkout[checkout['checkout_page'] == 'B']['purchased'].sum())
2938
2491.0
2541.0
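The gap between 2491 purchasing users and 2541 purchase events means some users in group B purchased more than once. Dividing by the 2938 unique users gives a conversion rate of about 84.8% when counting users and about 86.5% when counting events. The sketch below reproduces the three counts on a hypothetical toy DataFrame to show how the choice of numerator changes the rate.

```python
import pandas as pd

# Hypothetical toy data: users 1 and 3 appear twice in group B.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "checkout_page": ["B"] * 6,
    "purchased": [1, 1, 0, 1, 0, 0],
})
group_b = toy[toy["checkout_page"] == "B"]

# Denominator: each user counts once.
unique_users = group_b["user_id"].nunique()

# User-level numerator: a user counts once if they ever purchased.
converted_users = int(group_b.groupby("user_id")["purchased"].max().sum())

# Event-level numerator: repeat purchases are counted again.
purchase_events = int(group_b["purchased"].sum())

user_rate = converted_users / unique_users
event_rate = purchase_events / unique_users

print(unique_users, converted_users, purchase_events)
print(user_rate, event_rate)
```

Which rate is appropriate depends on whether the experiment's unit of analysis is the user or the purchase event.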

EDA summary stats

  • Mean, count, and standard deviation summary
checkout.groupby('checkout_page')['order_value'].agg(['mean', 'count', 'std'])
                    mean  count       std
checkout_page
A              24.956437   2461  2.418837
B              29.876202   2541  7.277644
C              34.917589   2603  4.869816
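Passing a list to `agg` fixes the column order of the summary. As a minimal sketch on a hypothetical toy DataFrame, the same summary looks like this; note that `count` excludes missing values, so it reports the number of recorded order values rather than the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing order value in variant B.
toy = pd.DataFrame({
    "checkout_page": ["A", "A", "B", "B", "B"],
    "order_value": [10.0, 20.0, 30.0, 50.0, np.nan],
})

# A list (not a set) keeps the columns in the order given.
summary = toy.groupby("checkout_page")["order_value"].agg(["mean", "count", "std"])

print(summary)
```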

EDA plotting

  • Bar plots
sns.barplot(x=checkout['checkout_page'], y=checkout['order_value'], estimator=np.mean)
plt.title('Average Order Value per Checkout Page Variant')
plt.xlabel('Checkout Page Variant')
plt.ylabel('Order Value [$]')

Bar plot of Average Order Value per Checkout Page Variant


EDA plotting

  • Histograms
sns.displot(data=checkout, x='order_value', hue='checkout_page', kde=True)

Histogram of order value


EDA plotting

  • Time series (line plots)
# errorbar=None hides the confidence band (seaborn >= 0.12; older versions use ci=None)
sns.lineplot(data=AdSmart, x='date', y='yes', hue='experiment', errorbar=None)

Time series plot of proportion of users responding yes to an ad
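By default, `sns.lineplot` averages `y` over rows that share the same `x` and `hue`, so each point on the line is the daily proportion of "yes" responses per group. The sketch below computes those same values with pandas on a hypothetical toy DataFrame; the dates and group labels are illustrative and are not taken from the AdSmart data.

```python
import pandas as pd

# Hypothetical toy data: one row per user response.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-07-03", "2020-07-03", "2020-07-03", "2020-07-03",
        "2020-07-04", "2020-07-04",
    ]),
    "experiment": ["control", "control", "exposed", "exposed",
                   "control", "exposed"],
    "yes": [1, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the proportion of ones.
daily = (
    toy.groupby(["date", "experiment"])["yes"]
    .mean()
    .unstack("experiment")  # one column per experiment group
)

print(daily)
```

Inspecting the table before plotting makes it easier to spot days with very few responses, where the plotted proportion is noisy.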

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

Let's practice!
