Experimental data setup

Experimental Design in Python

James Chapman

Curriculum Manager, DataCamp

The problem with randomization

1) Uneven issue: different number of subjects in groups

Two groups with different numbers of subjects in them.
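
A minimal sketch of how the uneven issue can arise, assuming each subject is assigned by an independent coin flip. The subject count and seed are hypothetical; an odd count is used so that the imbalance is guaranteed, whereas with an even count equal groups are possible but not assured.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n_subjects = 25                 # hypothetical subject count (odd on purpose)

# Coin-flip assignment: each subject independently lands in T or C
assignment = rng.choice(['T', 'C'], size=n_subjects)

n_treatment = int((assignment == 'T').sum())
n_control = int((assignment == 'C').sum())
print(n_treatment, n_control)
```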

The problem with randomization

2) Covariate issue: High variability in some covariates → group imbalances in randomization

Control and treatment groups designed to measure the impact of age on reaction times. The two groups have an unequal distribution of physical fitness, which, although not the primary focus of the experiment, may affect reaction times.

Result: harder to measure treatment effect!

Block randomization

Split into blocks of size $n$ first, then randomly split
- Fixes uneven issue

An example of block randomization showing two grids of orange and white squares

24 subjects split into two groups then randomized
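
One common reading of this idea is permuted-block randomization, sketched here under assumptions: 24 hypothetical subjects, a block size of 4, and an equal number of T and C labels shuffled inside every block. The column names `subject_id`, `block`, and `group` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed
subjects = pd.DataFrame({'subject_id': range(24)})  # 24 hypothetical subjects
block_size = 4  # plays the role of n; must be even for a 50/50 split

# Consecutive subjects form blocks of size n
subjects['block'] = subjects.index // block_size

# Within each block, shuffle an equal number of T and C labels
labels = np.array(['T', 'C'] * (block_size // 2))
n_blocks = len(subjects) // block_size
subjects['group'] = np.concatenate(
    [rng.permutation(labels) for _ in range(n_blocks)]
)

print(subjects['group'].value_counts())
```

Because every block contributes the same number of T and C labels, the overall group sizes come out equal by construction.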

Our dataset

E-commerce dataset (ecom) (1000 subjects)
- Average basket size
- Average time on the website
- Power user (Average 40+ daily minutes on website)

   basket_size  web_time  power_user
0          227         7           0
1          123         5           0
2           98        16           0
3          211        45           1
4          133        17           0

Block randomization in Python

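# Randomly draw half of the subjects, without replacement
group1 = ecom.sample(frac=0.5, random_state=42, replace=False)
group1['Block'] = 1
# Everyone not drawn goes into the second group
group2 = ecom.drop(group1.index)
group2['Block'] = 2
print(len(group1), len(group2))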

500 500
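
Equal group sizes do not by themselves guarantee balanced covariates. The sketch below repeats the split on a hypothetical stand-in for `ecom` (synthetic values, same column names) and counts power users in each group; the two counts need not match.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
web_time = rng.integers(1, 60, size=1000)

# Hypothetical stand-in for ecom: synthetic values, same column names
ecom = pd.DataFrame({
    'basket_size': rng.integers(50, 300, size=1000),
    'web_time': web_time,
    'power_user': (web_time >= 40).astype(int),  # 40+ minutes, as on the slide
})

group1 = ecom.sample(frac=0.5, random_state=42, replace=False)
group2 = ecom.drop(group1.index)

# Sizes are equal, but the power-user counts are left to chance
print(len(group1), len(group2))
print(group1['power_user'].sum(), group2['power_user'].sum())
```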

Visualizing splits

import seaborn as sns
import matplotlib.pyplot as plt

# KDE of basket size, drawn separately for power users and other users
sns.displot(data=ecom,
            x='basket_size',
            hue='power_user',
            fill=True,
            kind='kde')
plt.show()

Confounding = variable might cause the effect rather than treatment

A seaborn distribution plot showing two normal-like distributions that are slightly see-through. A larger blue one and a smaller orange one that overlap slightly with the orange more to the right
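
A numeric companion to the plot can help when the shift is hard to judge by eye. This is a sketch on synthetic data: the group sizes, means, and spreads are assumptions chosen to mimic the picture described above, not values from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
n_power, n_regular = 100, 900   # assumed group sizes

# Synthetic stand-in: power users are assumed to have larger baskets
ecom = pd.DataFrame({
    'power_user': [1] * n_power + [0] * n_regular,
    'basket_size': np.concatenate([
        rng.normal(200, 30, n_power),
        rng.normal(130, 30, n_regular),
    ]),
})

# Per-group summary of the outcome variable
summary = ecom.groupby('power_user')['basket_size'].agg(['count', 'mean', 'std'])
print(summary.round(1))
```

If the group means differ clearly, an uneven share of power users between Treatment and Control could masquerade as a treatment effect.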

Stratified randomization

- Split based on covariate(s) first
  - Green = power users, Yellow = not power users
- Then randomize within each split
  - Split into Treatment/Control
- Fixes covariate/confounding issue
- Can be done for multiple covariates, but gets complex!

An image of two grids of data, one green and one yellow. They both have either the letter T or C in all the cells and the yellow grid is larger
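
For more than one covariate, the same idea can be sketched with a `groupby` over all stratifying columns. `stratified_assign` is a hypothetical helper written for this illustration, and the `returning` column is an invented second covariate; neither comes from the course material.

```python
import numpy as np
import pandas as pd

def stratified_assign(df, strata_cols, seed=0):
    """Hypothetical helper: shuffle each stratum, then split it into T and C."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out['T_C'] = ''
    # .groups maps each stratum key to the index labels of its rows
    for idx in out.groupby(strata_cols).groups.values():
        shuffled = rng.permutation(idx.to_numpy())
        half = len(shuffled) // 2
        out.loc[shuffled[:half], 'T_C'] = 'T'
        out.loc[shuffled[half:], 'T_C'] = 'C'
    return out

rng = np.random.default_rng(3)  # arbitrary seed
ecom = pd.DataFrame({
    'power_user': rng.integers(0, 2, size=200),
    'returning': rng.integers(0, 2, size=200),  # invented second covariate
})

assigned = stratified_assign(ecom, ['power_user', 'returning'])
counts = (assigned.groupby(['power_user', 'returning', 'T_C'])
          .size()
          .unstack('T_C', fill_value=0))
print(counts)
```

Each extra covariate multiplies the number of strata, which is where the complexity comes from: small strata leave few subjects to split.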

Our first stratum

Separate power users
Sample into Treatment/Control

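# .copy() avoids a SettingWithCopyWarning when the Block column is added
strata_1 = ecom[ecom['power_user'] == 1].copy()
strata_1['Block'] = 1

# Randomly assign half of the stratum to Treatment
strata_1_g1 = strata_1.sample(frac=0.5, replace=False)
strata_1_g1['T_C'] = 'T'

# The remaining half becomes Control
strata_1_g2 = strata_1.drop(strata_1_g1.index)
strata_1_g2['T_C'] = 'C'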

The second stratum

Separate non-power users
Sample into Treatment/Control

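# Everyone who is not in the first stratum (the non-power users)
strata_2 = ecom.drop(strata_1.index)
strata_2['Block'] = 2

# Randomly assign half of the stratum to Treatment
strata_2_g1 = strata_2.sample(frac=0.5, replace=False)
strata_2_g1['T_C'] = 'T'
# The remaining half becomes Control
strata_2_g2 = strata_2.drop(strata_2_g1.index)
strata_2_g2['T_C'] = 'C'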

Confirming stratification

Join blocks and groups
Use groupby to check allocation

import pandas as pd

ecom_stratified = pd.concat([strata_1_g1, strata_1_g2, strata_2_g1, strata_2_g2])

ecom_stratified.groupby(['Block', 'T_C', 'power_user']).size()

Block  T_C  power_user
1      C    1              50
       T    1              50
2      C    0             450
       T    0             450
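
Another way to confirm the allocation is to compare the share of power users in Treatment and Control. The sketch below rebuilds the stratified split on a hypothetical stand-in with 100 power users and 900 others (matching the counts in the table), then uses `pd.crosstab` with row normalization.

```python
import pandas as pd

# Hypothetical stand-in: 100 power users and 900 others
ecom = pd.DataFrame({'power_user': [1] * 100 + [0] * 900})

parts = []
# sort=False keeps first-appearance order, so power users become block 1
for block, (_, stratum) in enumerate(ecom.groupby('power_user', sort=False), start=1):
    treated = stratum.sample(frac=0.5, random_state=42)
    control = stratum.drop(treated.index)
    parts.append(treated.assign(Block=block, T_C='T'))
    parts.append(control.assign(Block=block, T_C='C'))
ecom_stratified = pd.concat(parts)

# Row-normalized table: share of each power_user value within T and within C
share = pd.crosstab(ecom_stratified['T_C'],
                    ecom_stratified['power_user'],
                    normalize='index')
print(share)
```

Identical rows in this table indicate that the covariate is balanced across Treatment and Control.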

Let's practice!
