Sanity checks: Internal validity

A/B Testing in Python

Moe Lotfy, PhD

Principal Data Science Manager

Sample Ratio Mismatch (SRM)

Sample Ratio Mismatch (SRM)
- Allocation across variants deviates from design
- Detected with a chi-square goodness-of-fit test

Chi-square formula
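
For reference, the standard chi-square goodness-of-fit statistic compares the observed count $O_i$ in each of the $k$ variants with the count $E_i$ expected under the designed allocation (the symbols $O_i$, $E_i$, and $k$ are notation introduced here, not taken from the slides):

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad \text{degrees of freedom} = k - 1$$

As a worked check with the AdSmart counts used in the Python example (4071 control and 4006 exposed users, so $E_i = 8077 / 2 = 4038.5$ under a 50/50 design):

$$\chi^2 = \frac{(4071 - 4038.5)^2}{4038.5} + \frac{(4006 - 4038.5)^2}{4038.5} \approx 0.523$$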

Sample ratio mismatch allocation example

SRM Python example

# Calculate the unique IDs per variant
AdSmart.groupby('experiment')['auction_id'].nunique()

experiment
control    4071
exposed    4006

# Assign the unique counts to each variant
control_users=AdSmart[AdSmart['experiment']=='control']['auction_id'].nunique()
exposed_users=AdSmart[AdSmart['experiment']=='exposed']['auction_id'].nunique()
total_users=control_users+exposed_users
# Calculate allocation ratios per variant
control_perc = control_users / total_users
exposed_perc = exposed_users / total_users
print("Percentage of users in the Control group:",100*round(control_perc,5),"%")
print("Percentage of users in the Exposed group:",100*round(exposed_perc,5),"%")

Percentage of users in the Control group: 50.402 %
Percentage of users in the Exposed group: 49.598 %

¹ Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

SRM Python example

# Create lists of observed and expected counts per variant
observed = [ control_users, exposed_users ]
expected = [ total_users/2, total_users/2 ]
# Import chisquare from scipy library
from scipy.stats import chisquare
# Run chisquare test on observed and expected lists
chi = chisquare(observed, f_exp=expected)
# Print test results and interpretation
print(chi)
if chi[1] < 0.01:
    print("SRM may be present")
else:
    print("SRM likely not present")

Power_divergenceResult(statistic=0.5230902562832735, pvalue=0.4695264353014863)
SRM likely not present
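
The same check can be wrapped in a small helper so that it also covers designs that are not 50/50. This is a minimal sketch: the function name `srm_check`, its parameters, and the 90/10 counts are hypothetical and chosen for illustration; only `scipy.stats.chisquare` and the 4071/4006 counts come from the example above.

```python
from scipy.stats import chisquare

def srm_check(observed, design_ratios, alpha=0.01):
    """Chi-square goodness-of-fit test of observed counts against design ratios.

    Returns (statistic, pvalue, srm_flag); srm_flag is True when pvalue < alpha.
    """
    total = sum(observed)
    ratio_sum = sum(design_ratios)
    # Expected counts under the designed allocation (ratios are normalized first)
    expected = [total * r / ratio_sum for r in design_ratios]
    statistic, pvalue = chisquare(observed, f_exp=expected)
    return statistic, pvalue, pvalue < alpha

# AdSmart counts from the example above, 50/50 design
stat, p, flag = srm_check([4071, 4006], [0.5, 0.5])
print(stat, p, flag)

# Hypothetical counts for a 90/10 design (illustrative only)
stat2, p2, flag2 = srm_check([9000, 1100], [0.9, 0.1])
print(stat2, p2, flag2)
```

The strict threshold of 0.01 mirrors the example above; a different `alpha` can be passed if the team uses another convention.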


SRM root-causing

Common causes of SRM:$^1$

- Assignment: incorrect bucketing or faulty randomization functions
- Execution: variants starting at different times or ramping up at different rates
- Data logging: logging delays or bot filtering
- Interference: an experimenter pausing a variant
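
One way to narrow down which of these causes applies is to repeat the SRM test within segments, for example per day or per browser, and see where the imbalance concentrates. The sketch below assumes an equal-allocation design; the helper `srm_by_segment`, the `day` column, and the toy data are hypothetical and not part of the AdSmart dataset.

```python
import pandas as pd
from scipy.stats import chisquare

def srm_by_segment(df, segment_col, variant_col, id_col, alpha=0.01):
    """Run an equal-allocation chi-square SRM check inside each segment."""
    variants = sorted(df[variant_col].unique())
    rows = []
    for segment, group in df.groupby(segment_col):
        # Unique IDs per variant; variants missing from a segment count as 0
        counts = (group.groupby(variant_col)[id_col].nunique()
                       .reindex(variants, fill_value=0))
        total = counts.sum()
        expected = [total / len(variants)] * len(variants)
        _, pvalue = chisquare(counts.values, f_exp=expected)
        rows.append({"segment": segment, "n": int(total),
                     "pvalue": pvalue, "srm_flag": pvalue < alpha})
    return pd.DataFrame(rows)

# Toy data: day d1 is balanced, day d2 is skewed toward control
records = ([("d1", "control")] * 500 + [("d1", "exposed")] * 500
           + [("d2", "control")] * 600 + [("d2", "exposed")] * 400)
toy = pd.DataFrame(records, columns=["day", "experiment"])
toy["auction_id"] = range(len(toy))

report = srm_by_segment(toy, "day", "experiment", "auction_id")
print(report)
```

Testing many segments increases the chance of a false alarm in at least one of them, so a flagged segment is best treated as a lead to investigate rather than as proof of a cause.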

¹ Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners

A/A tests

A/A test
- Presents an identical experience to two groups of users
- Reveals bugs in experimental setup
- No statistically significant differences are expected between the groups' metrics
- False positives can still occur at the specified $\alpha$ (e.g. 5% of the time for $\alpha = 0.05$)
- Reveals imbalances in distributions across groups (e.g. browsers, devices, etc.)
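
The false positive point can be illustrated with a small simulation: when both groups are drawn from the same distribution, roughly a fraction $\alpha$ of A/A tests still come out "significant". This is a sketch under assumed settings; the normal metric, group size, number of simulations, and seed are all arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=42)
alpha = 0.05
n_sims, n_users = 2000, 500

# Both groups come from the same distribution, so any "effect" is pure noise
group_a = rng.normal(loc=10.0, scale=2.0, size=(n_sims, n_users))
group_b = rng.normal(loc=10.0, scale=2.0, size=(n_sims, n_users))

# One two-sample t-test per simulated A/A experiment (one row per experiment)
_, pvalues = ttest_ind(group_a, group_b, axis=1)

false_positive_rate = np.mean(pvalues < alpha)
print("Share of A/A tests flagged as significant:", false_positive_rate)
```

A share far above $\alpha$ in a real A/A test would point to a problem in the experimental setup rather than to chance.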

Distribution balance Python example

Balanced browsers distribution
Valid test

checkout.groupby('checkout_page')['browser'].value_counts(normalize=True)

checkout_page  browser
A              chrome     0.341333
               safari     0.332000
               firefox    0.326667
B              safari     0.352000
               firefox    0.325000
               chrome     0.323000
C              safari     0.346000
               chrome     0.330000
               firefox    0.324000

Imbalanced browsers distribution
Invalid test

AdSmart.groupby('experiment')['browser'].value_counts(normalize=True)

experiment  browser                   
control     Chrome Mobile                 0.591992
            Facebook                      0.137804
            Samsung Internet              0.120855
            Chrome Mobile WebView         0.071727
            Mobile Safari                 0.060427
            Chrome Mobile iOS             0.008352
            Mobile Safari UI/WKWebView    0.007369
exposed     Chrome Mobile                 0.535197
            Chrome Mobile WebView         0.298802
            Samsung Internet              0.082876
            Facebook                      0.050674
            Mobile Safari                 0.022716
            Chrome Mobile iOS             0.004244
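
Comparing proportions by eye can be complemented with a chi-square test of independence between variant and browser. The sketch below is one possible way to do this, not the method used in the slides; the helper `balance_check` and the toy data are hypothetical, while `pd.crosstab` and `scipy.stats.chi2_contingency` are existing library functions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def balance_check(df, variant_col, attribute_col, alpha=0.01):
    """Test whether an attribute's distribution is independent of the variant."""
    # Contingency table of raw counts: one row per variant, one column per value
    table = pd.crosstab(df[variant_col], df[attribute_col])
    _, pvalue, _, _ = chi2_contingency(table)
    return table, pvalue, pvalue < alpha

# Toy data: control is 60/40 chrome/safari, exposed is 50/50
skewed = pd.DataFrame({
    "experiment": ["control"] * 1000 + ["exposed"] * 1000,
    "browser": ["chrome"] * 600 + ["safari"] * 400
               + ["chrome"] * 500 + ["safari"] * 500,
})
# Toy data: both variants are 50/50 chrome/safari
balanced = pd.DataFrame({
    "experiment": ["control"] * 1000 + ["exposed"] * 1000,
    "browser": (["chrome"] * 500 + ["safari"] * 500) * 2,
})

table, p_skewed, flag_skewed = balance_check(skewed, "experiment", "browser")
_, p_balanced, flag_balanced = balance_check(balanced, "experiment", "browser")
print(table)
print("Skewed p-value:", p_skewed, "| imbalance flagged:", flag_skewed)
print("Balanced p-value:", p_balanced, "| imbalance flagged:", flag_balanced)
```

Categories with very few users (such as the rarest browsers in the output above) can make the chi-square approximation unreliable, so grouping them into an "other" bucket first may be worth considering.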


Let's practice!
