Setting up experiments

Experimental Design in Python

James Chapman

Curriculum Manager, DataCamp

Experimental Design definition

A process
An objective and controlled way
Draw specific conclusions
- In reference to a hypothesis

¹ https://www.sciencedirect.com/topics/earth-and-planetary-sciences/experimental-design

Forming robust statements

X probably had an effect on Y. There is likely some small risk of error

Precise and Quantified language

P-value analysis indicates X had an effect on Y with a 10% risk of Type I error

Type I error: incorrectly rejecting a true null hypothesis (a false positive)

Goal: Experimental design and statistical analysis
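
To make the "10% risk of Type I error" statement concrete, here is a minimal simulation sketch. It assumes a two-sample z-test with a known standard deviation, chosen only so the example needs nothing beyond the standard library. The helper names, group sizes, and distribution parameters are illustrative assumptions, not part of the course material. Both groups are drawn from the same distribution, so every rejection is a Type I error.

```python
import random
from statistics import NormalDist, mean


def two_sided_p_value(group_a, group_b, sigma):
    """Two-sample z-test p-value, assuming a known common standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def type_i_error_rate(alpha=0.10, n_experiments=2000, n_per_group=50, seed=42):
    """Share of experiments that reject the null even though it is true."""
    rng = random.Random(seed)
    mu, sigma = 175.0, 5.0  # assumed population parameters (illustrative)
    false_rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the same population: the null is true
        a = [rng.gauss(mu, sigma) for _ in range(n_per_group)]
        b = [rng.gauss(mu, sigma) for _ in range(n_per_group)]
        if two_sided_p_value(a, b, sigma) < alpha:
            false_rejections += 1
    return false_rejections / n_experiments


rate = type_i_error_rate()
print(f"Share of true-null experiments rejected at alpha=0.10: {rate:.3f}")
```

The printed share should land close to the chosen alpha, which is what a "10% risk of Type I error" means in practice.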

Why experimental design?

Useful in many fields:

Medical research
Marketing
Product analytics
Agriculture
Government policy

Some terminology...

Subjects = what we are experimenting on (people, employees, users, etc.)

A subject: a person in this case.

Treatment = some change given to one group

A treatment being applied to a group of subjects.

A treatment being applied to a treatment group.

Control = the group not given any change

Subjects split into treatment and control groups.

Assigning subjects to groups

How to assign subjects to groups?
- Option 1 - non-random ('split' the DataFrame)
- Option 2 - random assignment

Example: heights of 200 subjects in the heights DataFrame

    id  height
0    0  177.98
1    1  174.17
2    2  178.89
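
The snippets in this deck assume a heights DataFrame already exists. Below is a minimal stand-in for experimenting with them. The normal distribution, its mean of 175 cm and standard deviation of 5.5 cm, and the seed are illustrative assumptions, not values taken from the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course's heights data: 200 subjects,
# heights in cm drawn from an assumed normal distribution.
rng = np.random.default_rng(seed=0)
heights = pd.DataFrame({
    "id": np.arange(200),
    "height": rng.normal(loc=175, scale=5.5, size=200).round(2),
})

print(heights.head(3))
print(heights.shape)
```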

Non-random assignment

Assignment by slicing the DataFrame

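import pandas as pd

# Split by row position: first 100 rows vs. the remaining rows
group1_nonrandom = heights.iloc[0:100, :]
group2_nonrandom = heights.iloc[100:, :]

# Put the summary statistics of both groups side by side
compare_df = pd.concat(
    [group1_nonrandom['height'].describe(),
     group2_nonrandom['height'].describe()],
    axis=1)
compare_df.columns = ['group1', 'group2']
print(compare_df)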

Very different! (Means about 9 cm apart)

       group1  group2
count  100.00  100.00
mean   170.32  179.19 <--
std      3.28    3.50
min    159.28  175.03
25%    168.06  176.57
50%    170.75  178.03
75%    173.09  180.79
max    174.92  191.32
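
In the output above the two groups do not overlap at all: the max of group1 is below the min of group2. Slicing produces this pattern when rows happen to be ordered by the measured variable. The source does not say how heights is ordered, so the sketch below uses a synthetic DataFrame sorted by height purely to illustrate the effect.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption, not the course dataset), sorted by height
rng = np.random.default_rng(seed=1)
heights = pd.DataFrame({"height": rng.normal(175, 5.5, size=200)})
heights = heights.sort_values("height").reset_index(drop=True)

# Slicing an ordered DataFrame puts the shortest subjects in one group
group1_nonrandom = heights.iloc[:100]
group2_nonrandom = heights.iloc[100:]

mean_gap = group2_nonrandom["height"].mean() - group1_nonrandom["height"].mean()
no_overlap = group1_nonrandom["height"].max() <= group2_nonrandom["height"].min()

print(f"Difference in group means: {mean_gap:.2f} cm")
print(f"Groups do not overlap: {no_overlap}")
```

Any ordering in the data (by date, by region, by signup time) can leak into the groups in the same way, which is the motivation for random assignment.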

Random assignment

Use .sample()
- n (number of rows) or frac (fraction of rows, 0-1)

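# Randomly draw half of the rows, without replacement
group1 = heights.sample(frac=0.5,
                        replace=False,
                        random_state=42)
# The remaining rows form group2
group2 = heights.drop(group1.index)

# Recompute the comparison for the new groups
compare_df = pd.concat(
    [group1['height'].describe(), group2['height'].describe()],
    axis=1)
compare_df.columns = ['group1', 'group2']
print(compare_df)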

Much closer! (Means <1 cm apart)

       group1  group2
count  100.00  100.00
mean   175.10  174.41 <--
std      5.39    5.78
min    163.07  159.28
25%    171.32  170.17
50%    175.22  174.86
75%    178.32  177.85
max    189.78  191.32
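
As a variation, .sample() also accepts n for an exact group size, and the assignment can be stored as a label column instead of two separate DataFrames. The sketch below uses synthetic heights and a hypothetical helper named assign_groups. Neither comes from the course.

```python
import numpy as np
import pandas as pd


def assign_groups(df, n_treatment, seed=42):
    """Return a copy of df with a 'group' column: 'treatment' or 'control'."""
    treated_index = df.sample(n=n_treatment, replace=False,
                              random_state=seed).index
    labelled = df.copy()
    labelled["group"] = "control"
    labelled.loc[treated_index, "group"] = "treatment"
    return labelled


# Synthetic data (an assumption), deliberately sorted by height to show
# that random assignment is not affected by row order
rng = np.random.default_rng(seed=2)
heights = pd.DataFrame({"id": np.arange(200),
                        "height": rng.normal(175, 5.5, size=200)})
heights = heights.sort_values("height").reset_index(drop=True)

labelled = assign_groups(heights, n_treatment=100)
summary = labelled.groupby("group")["height"].describe()
print(summary[["count", "mean", "std"]])
```

Keeping a single DataFrame with a group column makes later steps such as groupby comparisons simpler, at the cost of one extra column.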

Assignment summary

Subjects should be randomly assigned to groups
- So that observed changes can be correctly attributed to the treatment

Random subject assignment: .sample()
Check that groups are comparable: .describe()

Subjects assigned randomly to treatment and control groups.

Let's practice!
