Power and sample size

A/B Testing in R

Lauryn Burleigh

Data Scientist

Power defined

Probability of rejecting the null hypothesis when false
Equals 1 minus the Type II error rate (failing to reject a false null)
Ideal: small Type II error-rate, large power

A normal null hypothesis distribution on the left and an alternative hypothesis distribution on the right, partially overlapping. A dashed vertical line marks the critical value (significance threshold) where the distributions overlap. The area under the alternative distribution left of the line is the Type II error rate, and the area right of the line is the power.
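A minimal simulation sketch of this definition: power is estimated as the share of simulated experiments that reject a false null, and the Type II error rate is its complement. The sample size, true mean, and number of simulations are illustrative assumptions, not values from the course.

```r
# Estimate power by simulation (all values below are illustrative assumptions)
set.seed(42)          # make the simulation reproducible
n_sims    <- 2000     # number of simulated experiments
n         <- 15       # observations per experiment
true_mean <- 0.8      # the null (mean = 0) is false by construction

# Run a one-sample t-test on each simulated sample and keep the p-value
p_values <- replicate(
  n_sims,
  t.test(rnorm(n, mean = true_mean, sd = 1), mu = 0)$p.value
)

power_sim  <- mean(p_values < 0.05)  # share of correct rejections
type2_rate <- 1 - power_sim          # share of missed effects (Type II errors)
print(c(power = power_sim, type_II = type2_rate))
```

A larger assumed effect or a larger sample would likely push the estimated power up and the Type II error rate down.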

Power benefits

Test usefulness

Rejecting the null hypothesis when it should be rejected

Power

Determine sample size needed
Check if results can be trusted

Sample size

Estimated effect size
Desired power (commonly 0.8)
Alpha (commonly 0.05)

Overlapping t-distributions for various sample sizes, with the t-value on the x-axis. A smaller sample size has a wider, heavier-tailed distribution than larger sample sizes, so smaller samples need a larger t-value to reach significance.

library(pwr)
pwr.t.test(d = 0.8, power = 0.8,
           sig.level = 0.05,
           type = "one.sample",
           alternative = "two.sided")

     One-sample t test power calculation 
              n = 14.30276
              d = 0.8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
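Because a fractional participant cannot be recruited, the computed n is rounded up in practice. The sketch below uses base R's `power.t.test()` from the stats package as a stand-in for `pwr.t.test()` (a swapped-in function, chosen so no extra package is needed); `delta` with `sd = 1` plays the role of `d`. The extra effect sizes are illustrative assumptions.

```r
# Base R alternative to pwr.t.test(); delta / sd plays the role of d
res <- power.t.test(delta = 0.8, sd = 1, power = 0.8,
                    sig.level = 0.05,
                    type = "one.sample",
                    alternative = "two.sided")

# A fractional n cannot be collected, so round up to keep at least 80% power
n_needed <- ceiling(res$n)
print(n_needed)

# Smaller expected effects need more participants (effect sizes are illustrative)
effect_sizes <- c(0.2, 0.5, 0.8)
n_by_effect <- sapply(effect_sizes, function(d) {
  ceiling(power.t.test(delta = d, sd = 1, power = 0.8,
                       sig.level = 0.05, type = "one.sample")$n)
})
print(setNames(n_by_effect, effect_sizes))
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target.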

Effect size

Expected size of effect
Mean of control group - mean of experimental group, standardized by the standard deviation (Cohen's d, the d argument in pwr.t.test)

Prior to analysis

Find effect size with:

Background information
Preliminary data

After analysis

Find effect size with:

Full data set
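One way to estimate the effect size from preliminary data is Cohen's d for two independent groups. This is a minimal sketch: the helper `cohens_d()` and both data vectors are hypothetical and made up for illustration.

```r
# Cohen's d for two independent groups: mean difference in pooled-SD units
# (cohens_d is a hypothetical helper, not part of pwr)
cohens_d <- function(control, experimental) {
  n1 <- length(control)
  n2 <- length(experimental)
  pooled_sd <- sqrt(((n1 - 1) * var(control) + (n2 - 1) * var(experimental)) /
                      (n1 + n2 - 2))
  (mean(control) - mean(experimental)) / pooled_sd
}

# Hypothetical preliminary data
control      <- c(2, 4, 6)
experimental <- c(1, 3, 5)

d_est <- cohens_d(control, experimental)
print(d_est)
```

The resulting estimate could then be passed as the `d` argument of a power calculation.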

Power analysis of test

Higher power = higher probability of correctly rejecting a false null hypothesis

Three aspects needed:

Sample size
Effect size
Alpha

library(pwr)
pwr.t.test(n = 20, sig.level = 0.045, 
           d = .81, type = "one.sample")

     One-sample t test power calculation 
              n = 20
              d = 0.81
      sig.level = 0.045
          power = 0.9223189
    alternative = two.sided
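To see how sample size drives power when effect size and alpha are held fixed, the sketch below repeats the calculation over several sample sizes. It uses base R's `power.t.test()` as a stand-in for `pwr.t.test()`; the list of sample sizes is an illustrative assumption.

```r
# Power of a one-sample, two-sided t-test at several sample sizes,
# holding effect size (0.81) and alpha (0.045) fixed as in the slide
sample_sizes <- c(5, 10, 20, 40)   # illustrative choices

power_by_n <- sapply(sample_sizes, function(n) {
  power.t.test(n = n, delta = 0.81, sd = 1, sig.level = 0.045,
               type = "one.sample", alternative = "two.sided")$power
})

# Power rises with sample size
print(round(setNames(power_by_n, sample_sizes), 3))
```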

Pizza distributions

Similar distributions

No significant difference

Two histograms, Pepperoni in pink and Cheese in blue, with the observed values on the x-axis and the number of times each value appeared on the y-axis. The peaks are near each other on the x-axis.

Different distributions

Likely significant difference

Pizza hypotheses

Null hypothesis distribution in pink with a mean difference of 0 and alternative hypothesis distribution in blue with a mean difference of 3.5 and the critical rejection value represented with a vertical line at 1.64.

Similar: left of rejection value
Different: right of rejection value

library(ggplot2)
ggplot(HypDists, 
       aes(x = Time, fill = Hypothesis)) + 
  geom_histogram() + 
  xlab("Difference Between Groups") + 
  geom_vline(xintercept = 1.64)

Pizza power

Similar: left of rejection value
Different: right of rejection value
Power: probability of correctly detecting that the topping distributions differ (avoiding a Type II error)

library(ggplot2)
ggplot(HypDists, 
       aes(x = Time, fill = Hypothesis)) + 
  geom_histogram() + 
  xlab("Difference Between Groups") + 
  geom_vline(xintercept = 1.64)
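The areas on either side of the rejection value can be computed directly if both hypothesis distributions are treated as normal. This is only a sketch: the means (0 and 3.5) and the rejection value (1.64) come from the plot, but the standard deviation of 1 is an assumption, since the slides do not state it.

```r
critical_value <- 1.64   # rejection value from the plot
null_mean      <- 0      # mean difference under the null
alt_mean       <- 3.5    # mean difference under the alternative
assumed_sd     <- 1      # ASSUMPTION: not given in the slides

# Alpha: area of the null distribution right of the rejection value
alpha <- 1 - pnorm(critical_value, mean = null_mean, sd = assumed_sd)

# Type II error: area of the alternative distribution left of the rejection value
beta <- pnorm(critical_value, mean = alt_mean, sd = assumed_sd)

# Power: area of the alternative distribution right of the rejection value
power <- 1 - beta
print(round(c(alpha = alpha, type_II = beta, power = power), 3))
```

With a larger assumed standard deviation the two distributions would overlap more, so the power would be lower.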

Let's practice!
