Power and sample size

A/B Testing in R

Lauryn Burleigh

Data Scientist

Power defined

  • Probability of rejecting the null hypothesis when it is false
  • Equal to 1 minus the probability of a Type II error (failing to reject a false null)
  • Ideal: small Type II error rate, large power

A normal null hypothesis distribution on the left and alternative hypothesis distribution on the right, partially overlapped. A dashed vertical line marks the critical value where the distributions overlap. The area under the alternative hypothesis left of the line is the Type II error and the area under the alternative hypothesis right of the line is the power.
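The definition above can be checked by simulation. This is a minimal sketch, assuming a one-sample t-test and made-up values for the sample size, true effect, and alpha (none of these come from the course data): it repeats the experiment many times and counts how often the false null is rejected.

```r
# Monte Carlo estimate of power for a one-sample t-test.
# All numbers below are illustrative assumptions, not course data.
set.seed(42)

n_sims <- 5000   # number of simulated experiments
n      <- 15     # observations per experiment
effect <- 0.8    # true mean in standard-deviation units (null mean is 0)
alpha  <- 0.05   # significance level

# For each simulated experiment, record whether the null was rejected
rejected <- replicate(n_sims, {
  x <- rnorm(n, mean = effect, sd = 1)
  t.test(x, mu = 0)$p.value < alpha
})

power_hat   <- mean(rejected)   # share of experiments that rejected the false null
type_ii_hat <- 1 - power_hat    # share that missed the effect (Type II error)

power_hat
type_ii_hat
```

The two estimates always sum to 1, which is the relationship between power and the Type II error rate described in the bullets.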


Power benefits

Test usefulness

  • Rejecting the null hypothesis when it should be rejected

A normal null hypothesis distribution on the left and alternative hypothesis distribution on the right, partially overlapped. A dashed vertical line marks the critical value where the distributions overlap. The area under the alternative hypothesis left of the line is the Type II error and the area under the alternative hypothesis right of the line is the power.

Power

  • Determine sample size needed
  • Check if results can be trusted

Sample size

  • Estimated effect size
  • Desired power (commonly 0.8)
  • Alpha (commonly 0.05)

Overlapping t-distributions for various sample sizes, with the t-value plotted on the x-axis. Smaller sample sizes have wider, heavier-tailed distributions than larger sample sizes, so a smaller sample needs a larger t-value to reach significance.

library(pwr)
pwr.t.test(d = 0.8, power = 0.8,
           sig.level = 0.05,
           type = "one.sample",
           alternative = "two.sided")
     One-sample t test power calculation 
              n = 14.30276
              d = 0.8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
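To see how strongly the required sample size depends on the estimated effect size, the same call can be repeated over several values of `d`. This is a sketch only: the three effect sizes are Cohen's conventional small, medium, and large values, chosen here for illustration rather than taken from the course.

```r
library(pwr)

# Effect sizes to compare; 0.2, 0.5, and 0.8 are Cohen's conventional
# small, medium, and large values, used here only as an illustration.
effect_sizes <- c(small = 0.2, medium = 0.5, large = 0.8)

required_n <- sapply(effect_sizes, function(d) {
  result <- pwr.t.test(d = d, power = 0.8, sig.level = 0.05,
                       type = "one.sample", alternative = "two.sided")
  ceiling(result$n)  # round up: a fraction of an observation cannot be sampled
})

required_n
```

Rounding up turns the n = 14.30276 reported above into 15 observations, and smaller effects call for noticeably larger samples.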

Effect size

  • Expected size of effect
  • Standardized difference between group means: (mean of control group - mean of experimental group) / standard deviation

A normal null hypothesis distribution on the left and alternative hypothesis distribution on the right, partially overlapped. A dashed vertical line marks the critical value where the distributions overlap. The area under the alternative hypothesis left of the line is the Type II error and the area under the alternative hypothesis right of the line is the power. A red bar shows the effect size as the distance between the peaks of the two distributions.

Prior to analysis

Find effect size with:

  • Background information
  • Preliminary data

 

After analysis

Find effect size with:

  • Full data set
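Whether it comes from preliminary data or the full data set, the effect size passed to `pwr.t.test()` as `d` is a standardized mean difference (Cohen's d). A minimal sketch of that calculation follows; the `cohens_d()` helper and the simulated groups are hypothetical and only stand in for real data.

```r
# Hypothetical preliminary data, simulated for illustration only
set.seed(1)
control      <- rnorm(30, mean = 10, sd = 2)
experimental <- rnorm(30, mean = 9,  sd = 2)

# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  pooled_sd <- sqrt(((n_x - 1) * var(x) + (n_y - 1) * var(y)) /
                      (n_x + n_y - 2))
  (mean(x) - mean(y)) / pooled_sd
}

d_estimate <- cohens_d(control, experimental)
d_estimate
```

The resulting estimate can then be supplied as `d` when planning the sample size or checking the power of a completed test.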

Power analysis of test

Higher power = higher probability of correctly rejecting the null hypothesis

Three aspects needed:

  • Sample size
  • Effect size
  • Alpha
library(pwr)
pwr.t.test(n = 20, sig.level = 0.045, 
           d = .81, type = "one.sample")
     One-sample t test power calculation 
              n = 20
              d = 0.81
      sig.level = 0.045
          power = 0.9223189
    alternative = two.sided
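Holding the effect size and alpha from the call above fixed, the power can be recomputed for a range of sample sizes to see how quickly it grows. The sample sizes below are arbitrary illustrative choices, not values from the course.

```r
library(pwr)

# Illustrative sample sizes; only n = 20 appears in the example above
sample_sizes <- c(5, 10, 15, 20, 30)

# Power of a one-sample, two-sided t-test at each sample size
power_by_n <- sapply(sample_sizes, function(n) {
  pwr.t.test(n = n, d = 0.81, sig.level = 0.045,
             type = "one.sample")$power
})

data.frame(n = sample_sizes, power = round(power_by_n, 3))
```

The row for n = 20 should match the power reported in the output above, and power rises with every increase in sample size.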

Pizza distributions

Similar distributions

No significant difference

Two histograms, Pepperoni in pink and Cheese in blue, with the observed values on the x-axis and the number of times each value appeared on the y-axis. The peaks sit near each other on the x-axis.

Different distributions

Likely significant difference

Two histograms, Pepperoni in pink and Cheese in blue, with the observed values on the x-axis and the number of times each value appeared on the y-axis. The peaks are clearly separated on the x-axis.


Pizza hypotheses

Null hypothesis distribution in pink with a mean difference of 0 and alternative hypothesis distribution in blue with a mean difference of 3.5 and the critical rejection value represented with a vertical line at 1.64.

  • Similar: left of rejection value
  • Different: right of rejection value
library(ggplot2)
ggplot(HypDists, 
       aes(x = Time, fill = Hypothesis)) + 
  geom_histogram() + 
  xlab("Difference Between Groups") + 
  geom_vline(xintercept = 1.64)
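The `HypDists` data frame is not defined on the slide. One way to build a stand-in, assuming both hypotheses are normal with a standard deviation of 1 (an assumption suggested by the 1.64 cutoff, not stated in the source), is sketched here; only the column names `Time` and `Hypothesis` come from the plotting call above.

```r
# Hypothetical stand-in for HypDists, simulated for illustration only
set.seed(123)
n_draws <- 1000

HypDists <- data.frame(
  Time = c(rnorm(n_draws, mean = 0,   sd = 1),   # null: no difference
           rnorm(n_draws, mean = 3.5, sd = 1)),  # alternative: difference of 3.5
  Hypothesis = rep(c("Null", "Alternative"), each = n_draws)
)

# Group means should sit near 0 and 3.5
aggregate(Time ~ Hypothesis, data = HypDists, FUN = mean)
```

A data frame shaped like this can be passed to the `ggplot()` call above to draw the two distributions and the rejection line.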

Pizza power

Null hypothesis distribution in pink with a mean difference of 0 and alternative hypothesis distribution in blue with a mean difference of 3.5 and the critical rejection value represented with a vertical line at 1.64.

  • Similar: left of rejection value
  • Different: right of rejection value
  • Power: probability of avoiding a Type II error, that is, of not wrongly concluding that the topping distributions are the same
library(ggplot2)
ggplot(HypDists, 
       aes(x = Time, fill = Hypothesis)) + 
  geom_histogram() + 
  xlab("Difference Between Groups") + 
  geom_vline(xintercept = 1.64)
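The power shown in the figure can also be approximated numerically. This sketch assumes both distributions are normal with a standard deviation of 1, which the source does not state; under that assumption, power is the share of the alternative distribution that lies right of the rejection value.

```r
# Assumption: both distributions are normal with standard deviation 1
critical_value <- 1.64
alt_mean       <- 3.5

# Type II error: alternative-distribution mass left of the rejection value
type_ii_error <- pnorm(critical_value, mean = alt_mean, sd = 1)

# Power: alternative-distribution mass right of the rejection value
power <- 1 - type_ii_error

# Type I error rate implied by the cutoff under the null (mean 0)
alpha <- 1 - pnorm(critical_value, mean = 0, sd = 1)

round(c(alpha = alpha, type_ii_error = type_ii_error, power = power), 3)
```

With these assumed values the cutoff corresponds to an alpha of about 0.05 and a power of about 0.97. A wider spread or a smaller mean difference would lower the power.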

Let's practice!
