Multiple comparisons tests

A/B Testing in Python

Moe Lotfy, PhD

Principal Data Science Manager

Introduction to the multiple comparisons problem

  • Single comparison:
    • Control (A) versus Treatment (B)
    • One metric
    • No subcategories

 

Bar chart of average sales for two variants.

  • Multiple comparisons:
    • Multiple variants (A/B/n tests)
    • Multiple metrics
    • Granular categories

 

Two bar charts with multiple variants and metrics.

Family-wise error rate

  • P(making Type I error) = $\alpha$ = 0.05
  • P(not making Type I error) = 1 - $\alpha$
  • P(not making Type I error in m independent tests) = (1 - $\alpha$)$^m$
  • P(making at least one Type I error in m tests) = 1 - (1 - $\alpha$)$^m$ = FWER

Family-wise error rate (FWER): the probability of making one or more Type I errors when performing multiple hypothesis tests.

  • For a single test, FWER = 1 - (1 - $\alpha$)$^1$ = $\alpha$ = 0.05
  • But what if we perform more than one test?

Family-wise error rate

import matplotlib.pyplot as plt
import numpy as np

alpha = 0.05                # per-test significance level
x = np.arange(0, 21)        # number of tests m: 0, 1, ..., 20
y = 1 - (1 - alpha) ** x    # FWER = 1 - (1 - alpha)^m

plt.plot(x, y, marker='o')
plt.title('FWER vs Number of Tests')
plt.xlabel('Number of Tests')
plt.ylabel('FWER')
plt.show()

  • FWER for 10 tests = 1 - (1 - $\alpha$)$^{10}$
  • With $\alpha$ = 0.05, FWER for 10 tests ≈ 0.40 (40%)

Line plot of FWER as a function of the number of tests. FWER rises as the number of tests increases.
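
The figure quoted for 10 tests can be checked without plotting. The following is a minimal sketch, assuming independent tests that all use the same $\alpha$; the helper name `fwer` is made up for illustration and is not part of any library.

```python
def fwer(alpha, m):
    """Probability of at least one Type I error across m independent tests."""
    return 1 - (1 - alpha) ** m


# Print the FWER for a few family sizes at the usual alpha = 0.05
for m in (1, 5, 10, 20):
    print(f"m = {m:2d}: FWER = {fwer(0.05, m):.3f}")
```

For m = 10 this gives roughly 0.40, which matches the value on the slide.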

Correction methods

  • The simplest and most popular approach is the Bonferroni correction
  • Set the adjusted $\alpha^*$ to the individual test $\alpha$ divided by the number of tests m

$\alpha^* = \alpha / m$

  • The Sidak correction is slightly less stringent
  • Set FWER to the desired $\alpha$, then solve 1 - (1 - $\alpha_s$)$^m$ = $\alpha$ for $\alpha_s$

$\alpha_s = 1 - (1 - \alpha)^{1/m}$
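
To see how close the two corrections are in practice, the adjusted thresholds can be computed side by side. This is a rough sketch under the same independence assumption as the FWER formula; the function names `bonferroni_alpha` and `sidak_alpha` are hypothetical helpers written for this example.

```python
def bonferroni_alpha(alpha, m):
    """Bonferroni: split the desired alpha evenly across m tests."""
    return alpha / m


def sidak_alpha(alpha, m):
    """Sidak: solve 1 - (1 - alpha_s)^m = alpha for alpha_s."""
    return 1 - (1 - alpha) ** (1 / m)


alpha = 0.05
for m in (3, 10, 20):
    a_b = bonferroni_alpha(alpha, m)
    a_s = sidak_alpha(alpha, m)
    print(f"m = {m:2d}: Bonferroni = {a_b:.5f}, Sidak = {a_s:.5f}")
```

The Sidak threshold is always a little larger than the Bonferroni one, which is why it is described as less stringent, but the gap is small for typical values of m.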

Bonferroni correction example

  • Without correction, all three tests are considered significant
    • but the probability of making at least one Type I error is inflated to about 14%
  • With a Bonferroni correction, A versus D is no longer significant, but FWER is controlled at about 0.049

Bar chart of average sales for 4 variants with p-values.
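
The 14% and 0.049 figures follow from the FWER formula with m = 3. The sketch below reproduces that arithmetic; it assumes the three tests each compare one treatment against control A and are independent, which the slide does not state explicitly.

```python
alpha = 0.05
m = 3  # assumed: three comparisons against control (A vs B, A vs C, A vs D)

# FWER when every test uses the unadjusted alpha
fwer_uncorrected = 1 - (1 - alpha) ** m

# FWER when every test uses the Bonferroni-adjusted alpha
alpha_bonf = alpha / m
fwer_bonferroni = 1 - (1 - alpha_bonf) ** m

print(f"Uncorrected FWER:      {fwer_uncorrected:.3f}")
print(f"Bonferroni alpha:      {alpha_bonf:.4f}")
print(f"FWER after correction: {fwer_bonferroni:.3f}")
```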

statsmodels multipletests function

import statsmodels.stats.multitest as smt

pvals = [0.023, 0.0005, 0.00004]
# Returns (reject flags, corrected p-values, Sidak alpha, Bonferroni alpha)
corrected = smt.multipletests(pvals, alpha=0.05, method='bonferroni')
print("Significant Test:", corrected[0])
print("Corrected P-values:", corrected[1])
print("Bonferroni Corrected alpha: {:.4f}".format(corrected[3]))

Significant Test: [False  True  True]
Corrected P-values: [0.069   0.0015  0.00012]
Bonferroni Corrected alpha: 0.0167
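
The corrected p-values above can be reproduced by hand, which makes the adjustment easier to follow. This is a plain-Python sketch of the standard Bonferroni and Sidak p-value adjustments, not a copy of the statsmodels implementation; the variable names are chosen for this example.

```python
pvals = [0.023, 0.0005, 0.00004]
alpha = 0.05
m = len(pvals)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
bonf_pvals = [min(p * m, 1.0) for p in pvals]

# Sidak: probability of at least one p-value this small across m independent tests
sidak_pvals = [1 - (1 - p) ** m for p in pvals]

# A test is significant if its corrected p-value does not exceed alpha
reject_bonf = [p <= alpha for p in bonf_pvals]
reject_sidak = [p <= alpha for p in sidak_pvals]

print("Bonferroni corrected p-values:", bonf_pvals)
print("Sidak corrected p-values:     ", sidak_pvals)
print("Significant (Bonferroni):", reject_bonf)
print("Significant (Sidak):     ", reject_sidak)
```

Comparing a corrected p-value with the original $\alpha$ is equivalent to comparing the raw p-value with the adjusted $\alpha$, so both views lead to the same decisions.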

Let's practice!
