Assumptions in hypothesis testing

Hypothesis Testing in R

Richie Cotton

Data Evangelist at DataCamp

Randomness

Assumption

The samples are random subsets of larger populations.

Consequence
  • The sample is not representative of the population.
How to check this
  • Understand how your data was collected.
  • Speak to the data collector/domain expert.

A logo with the phrase 'Responsibly Sourced Ingredients'.

1 Sampling techniques are discussed in "Sampling in R".

Independence of observations

Assumption

Each observation (row) in the dataset is independent.

Consequence
  • Increased chance of false negative/positive error.
How to check this
  • Understand how your data was collected.

Large sample size

Assumption

The sample is big enough to mitigate uncertainty, and so that the Central Limit Theorem applies.

Consequence
  • Wider confidence intervals.
  • Increased chance of false negative/positive error.
How to check this
  • It depends on the test.

Large sample size: t-test

One sample
  • At least 30$^{1}$ observations in the sample.

$n \ge 30$

$n$: sample size

Two samples
  • At least 30 observations in each sample.

$n_{1} \ge 30, n_{2} \ge 30$

$n_{i}$: sample size for group $i$

Paired samples
  • At least 30 pairs of observations across the samples.

Number of rows in your data $\ge 30$

ANOVA
  • At least 30 observations in each sample.

$n_{i} \ge 30$ for all values of $i$

1 Sometimes you can get away with fewer than 30; the important thing is that the null distribution appears normal.
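
The per-group thresholds above can be checked by counting rows in each group. The following is a minimal sketch in base R, assuming a hypothetical data frame `scores` with a grouping column `group` and a numeric column `value`; the names, the simulated data, and the `min_n` variable are illustrative and not part of the course material.

```r
# Hypothetical data: one grouping column and one numeric response.
set.seed(42)
scores <- data.frame(
  group = rep(c("a", "b", "c"), times = c(35, 40, 25)),
  value = rnorm(100)
)

# Threshold from the slide: at least 30 observations per sample.
min_n <- 30

# Count observations per group and compare each count to the threshold.
group_sizes <- table(scores$group)
meets_threshold <- group_sizes >= min_n

print(group_sizes)
print(meets_threshold)

# TRUE only if every group reaches the threshold.
all_large_enough <- all(meets_threshold)
print(all_large_enough)
```

For a one-sample or paired test, the same idea reduces to comparing `nrow()` of the data with the threshold.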

Large sample size: proportion tests

One sample
  • Number of successes in sample is greater than or equal to 10.

$n \times \hat{p} \ge 10$

  • Number of failures in sample is greater than or equal to 10.

$n \times (1 - \hat{p}) \ge 10$

$n$: sample size
$\hat{p}$: proportion of successes in sample

Two samples
  • Number of successes in each sample is greater than or equal to 10.

$n_{1} \times \hat{p}_{1} \ge 10$

$n_{2} \times \hat{p}_{2} \ge 10$

  • Number of failures in each sample is greater than or equal to 10.

$n_{1} \times (1 - \hat{p}_{1}) \ge 10$

$n_{2} \times (1 - \hat{p}_{2}) \ge 10$
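
Because $n \times \hat{p}$ is the number of successes and $n \times (1 - \hat{p})$ is the number of failures, these conditions amount to comparing counts with 10. A small sketch in base R, using made-up counts for two hypothetical groups (the names `group1`, `group2`, and `min_count` are assumptions for illustration):

```r
# Hypothetical sample sizes and success counts for two groups.
n <- c(group1 = 120, group2 = 45)
successes <- c(group1 = 30, group2 = 38)

# Sample proportions of successes.
p_hat <- successes / n

# n * p_hat is the success count; n * (1 - p_hat) is the failure count.
# Failures are computed by subtraction to avoid floating-point rounding.
n_success <- successes
n_failure <- n - successes

# Threshold from the slide: at least 10 successes and 10 failures.
min_count <- 10
check <- data.frame(
  p_hat = p_hat,
  n_success = n_success,
  n_failure = n_failure,
  success_ok = n_success >= min_count,
  failure_ok = n_failure >= min_count
)
print(check)
```

In this made-up example, `group2` has only 7 failures, so the large sample condition would not hold for a two-sample test.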

Large sample size: chi-square tests

  • The number of successes in each group is greater than or equal to 5.

$n_{i} \times \hat{p}_{i} \ge 5$ for all values of $i$

  • The number of failures in each group is greater than or equal to 5.

$n_{i} \times (1 - \hat{p}_{i}) \ge 5$ for all values of $i$

$n_{i}$: sample size for group $i$
$\hat{p}_{i}$: proportion of successes in sample group $i$
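
One way to apply this rule is to lay the counts out as a table of groups by outcome and compare every cell with 5. The sketch below uses base R with invented counts; the matrix `counts`, its labels, and `min_count` are hypothetical.

```r
# Hypothetical contingency table: rows are groups, columns are outcomes.
counts <- matrix(
  c(12,  8,
    20,  4,
     9, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(group = c("g1", "g2", "g3"),
                  outcome = c("success", "failure"))
)

# Threshold from the slide: at least 5 successes and 5 failures per group.
min_count <- 5
cell_ok <- counts >= min_count
print(cell_ok)

# Groups where either count falls below the threshold.
failing_groups <- rownames(counts)[!apply(cell_ok, 1, all)]
print(failing_groups)
```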

Sanity check

If the bootstrap distribution doesn't look normal, assumptions likely aren't valid.
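
A rough sketch of this sanity check in base R: resample the data with replacement many times, compute the statistic each time, and inspect the shape of the result. The sample `x`, the number of resamples `n_boot`, and the choice of the mean as the statistic are assumptions made for illustration.

```r
set.seed(123)

# Hypothetical sample; replace with your own numeric vector.
x <- rexp(50, rate = 1)

# Resample with replacement and record the mean of each resample.
n_boot <- 2000
boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

# Visual checks: a roughly bell-shaped histogram and points that
# follow the reference line in the normal Q-Q plot.
hist(boot_means, breaks = 30, main = "Bootstrap distribution of the mean")
qqnorm(boot_means)
qqline(boot_means)
```

Strong skew in the histogram or clear curvature in the Q-Q plot suggests the sample may be too small for the test's normal approximation.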

Let's practice!
