Experimental data setup

Experimental Design in Python

James Chapman

Curriculum Manager, DataCamp

The problem with randomization

1) Uneven issue: different number of subjects in groups

Two groups with different numbers of subjects in them.
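
The uneven issue can be reproduced with a minimal sketch. It assumes 24 hypothetical subjects, each assigned by an independent coin flip; the subject count, seed, and group labels are illustrative and not taken from the course data.

```python
import numpy as np

# Hypothetical setup: 24 subjects, each assigned by an independent coin flip
rng = np.random.default_rng(seed=0)
n_subjects = 24
assignment = rng.choice(["treatment", "control"], size=n_subjects)

# The counts always sum to 24, but nothing forces an even 12/12 split
n_treatment = int((assignment == "treatment").sum())
n_control = int((assignment == "control").sum())
print(n_treatment, n_control)
```

Re-running with different seeds shows how the group sizes drift apart when only chance decides them.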

The problem with randomization

2) Covariate issue: High variability in some covariates → group imbalances in randomization

Control and treatment groups designed to measure the impact of age on reaction times. The two groups have an unequal distribution of physical fitness, which, although not the primary focus of the experiment, may affect the reaction time variable.

Result: harder to measure treatment effect!

Block randomization

 

  • Split into blocks of size $n$ first then randomly split
    • Fixes uneven issue

An example of block randomization showing two grids of orange and white squares

  • 24 subjects split into two groups then randomized
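
One way to sketch the 24-subject example: fix the two block sizes up front and let randomness decide only which subjects land in which block. The `subjects` frame and `subject_id` column are hypothetical stand-ins, not part of the course dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical subjects; only an id column is assumed
subjects = pd.DataFrame({"subject_id": range(24)})

# Shuffle the row labels, then give the first 12 to block 1 and the rest to block 2
shuffled = rng.permutation(subjects.index.to_numpy())
subjects["Block"] = 0
subjects.loc[shuffled[:12], "Block"] = 1
subjects.loc[shuffled[12:], "Block"] = 2

block_sizes = subjects["Block"].value_counts().sort_index()
print(block_sizes)
```

Because the sizes are set before shuffling, the two blocks are equal by construction, whatever the seed.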

Our dataset

  • E-commerce dataset (ecom) (1000 subjects)
    • Average basket size
    • Average time on the website
    • Power user (Average 40+ daily minutes on website)
   basket_size  web_time  power_user
0          227         7           0
1          123         5           0
2           98        16           0
3          211        45           1
4          133        17           0

Block randomization in Python

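# Randomly sample half of the subjects into the first block
group1 = ecom.sample(frac=0.5, random_state=42, replace=False)
group1['Block'] = 1
# The remaining subjects form the second block
group2 = ecom.drop(group1.index)
group2['Block'] = 2
print(len(group1), len(group2))
500 500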

Visualizing splits

 

import seaborn as sns
import matplotlib.pyplot as plt
sns.displot(data=ecom, 
            x='basket_size', 
            hue='power_user', 
            fill=True, 
            kind='kde')    
plt.show()

Confounding = a variable other than the treatment might be causing the observed effect

 

A seaborn distribution plot showing two normal-like distributions that are slightly see-through. A larger blue one and a smaller orange one that overlap slightly with the orange more to the right
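
A numeric companion to the plot is a per-group summary of the outcome. The sketch below uses a synthetic stand-in (`ecom_demo`) with invented values, since the real `ecom` data is not reproduced here; only the column names and the 40-minute power-user threshold come from the dataset description.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Synthetic stand-in for ecom: every value here is invented for illustration
n = 1000
web_time = rng.integers(1, 60, size=n)
power_user = (web_time >= 40).astype(int)
ecom_demo = pd.DataFrame({
    # The +50 shift for power users is an assumption that builds in a confounder
    "basket_size": rng.normal(150, 40, size=n).round() + np.where(power_user == 1, 50, 0),
    "web_time": web_time,
    "power_user": power_user,
})

# Compare the outcome across the covariate before assigning any treatment
summary = ecom_demo.groupby("power_user")["basket_size"].agg(["count", "mean", "std"])
print(summary)
```

If the group means differ noticeably before any treatment is applied, the covariate is a candidate confounder and worth stratifying on.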

Stratified randomization

 

  • Splitting based on covariate(s) first
    • Then randomization
  • Green = All power users (Yellow = Not power users)
    • Then split Treatment/Control
  • Fixes covariate/confounding issue
  • Can be done for multiple covariates - but gets complex!

An image of two grids of data, one green and one yellow. They both have either the letter T or C in all the cells and the yellow grid is larger
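
The split-then-randomize idea can be sketched as a small helper that works for one or several covariates. `stratified_assign` is a hypothetical function written for illustration, not a pandas API, and the demo frame is made up.

```python
import numpy as np
import pandas as pd

def stratified_assign(df, strata_cols, seed=42):
    """Randomly assign half of each stratum to 'T' and the rest to 'C'."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["T_C"] = "C"
    # .groups maps each stratum to the row labels that belong to it
    for _, labels in out.groupby(strata_cols).groups.items():
        shuffled = rng.permutation(np.asarray(labels))
        out.loc[shuffled[: len(shuffled) // 2], "T_C"] = "T"
    return out

# Made-up data: 10 power users and 30 non-power users
demo = pd.DataFrame({"power_user": [1] * 10 + [0] * 30})
assigned = stratified_assign(demo, ["power_user"])
counts = assigned.groupby(["power_user", "T_C"]).size()
print(counts)
```

Passing more column names in `strata_cols` stratifies on their combinations, which is where the number of strata grows quickly and some may become too small to split evenly.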

Our first stratum

  • Separate power users
  • Sample into Treatment/Control
# .copy() avoids pandas' SettingWithCopyWarning when adding columns
strata_1 = ecom[ecom['power_user'] == 1].copy()
strata_1['Block'] = 1

strata_1_g1 = strata_1.sample(frac=0.5, replace=False)
strata_1_g1['T_C'] = 'T'
strata_1_g2 = strata_1.drop(strata_1_g1.index)
strata_1_g2['T_C'] = 'C'

The second stratum

  • Separate non-power users
  • Sample into Treatment/Control
strata_2 = ecom.drop(strata_1.index)
strata_2['Block'] = 2

strata_2_g1 = strata_2.sample(frac=0.5, replace=False)
strata_2_g1['T_C'] = 'T'
strata_2_g2 = strata_2.drop(strata_2_g1.index)
strata_2_g2['T_C'] = 'C'

Confirming stratification

  • Join blocks and groups
  • Use groupby to check allocation
import pandas as pd

ecom_stratified = pd.concat([strata_1_g1, strata_1_g2, strata_2_g1, strata_2_g2])

ecom_stratified.groupby(['Block','T_C', 'power_user']).size()
Block  T_C  power_user
1      C    1              50
       T    1              50
2      C    0             450
       T    0             450
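
A further check, sketched under assumptions, is to compare the share of power users within each arm. The frame below rebuilds only the allocation counts shown above (50/50/450/450); `allocation` is a hypothetical name and no other columns are assumed.

```python
import pandas as pd

# Rebuild the allocation counts from the groupby output: 50 T and 50 C power users,
# 450 T and 450 C non-power users
allocation = pd.DataFrame({
    "T_C": ["T"] * 50 + ["C"] * 50 + ["T"] * 450 + ["C"] * 450,
    "power_user": [1] * 100 + [0] * 900,
})

# Row-normalized crosstab: proportion of each power_user value within T and within C
share = pd.crosstab(allocation["T_C"], allocation["power_user"], normalize="index")
print(share)
```

Matching proportions in both rows indicate that the covariate is balanced across Treatment and Control.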

Let's practice!
