Assumptions in hypothesis testing

Hypothesis Testing in R

Richie Cotton

Data Evangelist at DataCamp

Randomness

Assumption

The samples are random subsets of larger populations.

Consequence

The sample is not representative of the population.

How to check this

Understand how your data was collected.
Speak to the data collector/domain expert.

A logo with the phrase 'Responsibly Sourced Ingredients'.

¹ Sampling techniques are discussed in "Sampling in R".

Independence of observations

Assumption

Each observation (row) in the dataset is independent.

Consequence

Increased chance of false negative or false positive errors.

How to check this

Understand how your data was collected.

Large sample size

Assumption

The sample is big enough to mitigate uncertainty, so that the Central Limit Theorem applies.

Consequence

Wider confidence intervals.
Increased chance of false negative or false positive errors.

How to check this

It depends on the test.

Large sample size: t-test

One sample

At least 30$^{1}$ observations in the sample.

$n \ge 30$

$n$: sample size

Two samples

At least 30 observations in each sample.

$n_{1} \ge 30, n_{2} \ge 30$

$n_{i}$: sample size for group $i$

Paired samples

At least 30 pairs of observations across the samples.

Number of rows in your data $\ge 30$

ANOVA

At least 30 observations in each sample.

$n_{i} \ge 30$ for all values of $i$

¹ Sometimes you can get away with fewer than 30; the important thing is that the null distribution appears normal.
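
The thresholds above can be checked by counting observations per group. The following is a minimal sketch in base R; the data frame `scores` and its columns `group` and `value` are hypothetical and exist only for illustration.

```r
# Hypothetical data: 'group' and 'value' are illustrative column names.
set.seed(42)
scores <- data.frame(
  group = rep(c("a", "b", "c"), times = c(35, 40, 28)),
  value = rnorm(103)
)

# Count the observations in each group.
group_sizes <- table(scores$group)

# TRUE for each group that meets the n >= 30 rule of thumb.
meets_threshold <- group_sizes >= 30

# For a two-sample t-test or ANOVA, every group has to pass.
all_groups_ok <- all(meets_threshold)

print(group_sizes)
print(all_groups_ok)
```

Here group `c` has only 28 observations, so the check fails. For a paired test, the equivalent check would be on the number of rows (pairs) rather than on group sizes.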

Large sample size: proportion tests

One sample

Number of successes in sample is greater than or equal to 10.

$n \times \hat{p} \ge 10$

Number of failures in sample is greater than or equal to 10.

$n \times (1 - \hat{p}) \ge 10$

$n$: sample size
$\hat{p}$: proportion of successes in sample
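
As a rough illustration of the one-sample conditions, the sketch below counts successes and failures in a logical vector. The vector `outcome` is made-up data, not from the course.

```r
# Hypothetical binary outcome: TRUE marks a "success".
outcome <- rep(c(TRUE, FALSE), times = c(12, 88))

n <- length(outcome)      # sample size
p_hat <- mean(outcome)    # proportion of successes in the sample

# Counting directly is equivalent to n * p_hat and n * (1 - p_hat),
# and avoids floating-point rounding in the comparison.
n_successes <- sum(outcome)
n_failures <- sum(!outcome)

# Both counts need to be at least 10.
large_enough <- n_successes >= 10 && n_failures >= 10
print(large_enough)
```

For two samples, the same two counts would be computed separately within each sample.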

Two samples

Number of successes in each sample is greater than or equal to 10.

$n_{1} \times \hat{p}_{1} \ge 10$

$n_{2} \times \hat{p}_{2} \ge 10$

Number of failures in each sample is greater than or equal to 10.

$n_{1} \times (1 - \hat{p}_{1}) \ge 10$

$n_{2} \times (1 - \hat{p}_{2}) \ge 10$

Large sample size: chi-square tests

The number of successes in each group is greater than or equal to 5.

$n_{i} \times \hat{p}_{i} \ge 5$ for all values of $i$

The number of failures in each group is greater than or equal to 5.

$n_{i} \times (1 - \hat{p}_{i}) \ge 5$ for all values of $i$

$n_{i}$: sample size for group $i$
$\hat{p}_{i}$: proportion of successes in sample group $i$
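
One possible way to check these counts for every group at once is to build a contingency table. The helper `check_cell_counts()` below is a hypothetical name introduced for this sketch, and the data are invented.

```r
# Hypothetical helper: TRUE only if every group has at least `min_count`
# successes and at least `min_count` failures.
check_cell_counts <- function(group, success, min_count = 5) {
  # Fixing the levels keeps both columns even if one outcome never occurs.
  success <- factor(success, levels = c(FALSE, TRUE))
  counts <- table(group, success)  # rows: groups; columns: FALSE, TRUE
  all(counts >= min_count)
}

# Illustrative data; the group names and counts are made up.
group <- rep(c("north", "south", "east"), times = c(40, 30, 20))
success <- c(
  rep(c(TRUE, FALSE), times = c(25, 15)),  # north
  rep(c(TRUE, FALSE), times = c(24, 6)),   # south
  rep(c(TRUE, FALSE), times = c(17, 3))    # east: only 3 failures
)

# Fails because "east" has fewer than 5 failures.
print(check_cell_counts(group, success))

# Passes once "east" is excluded.
keep <- group != "east"
print(check_cell_counts(group[keep], success[keep]))
```

Setting `min_count = 10` would give the stricter threshold used for the two-sample proportion test.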

Sanity check

If the bootstrap distribution doesn't look normal, assumptions likely aren't valid.
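
A minimal sketch of this sanity check in base R is shown below. It assumes the statistic of interest is the sample mean; `sample_values` is simulated data and the number of replicates is an arbitrary choice.

```r
set.seed(123)

# Hypothetical sample: 50 draws from a right-skewed (exponential) distribution.
sample_values <- rexp(50, rate = 1)

# Resample with replacement and record the mean of each resample.
n_replicates <- 2000
boot_means <- replicate(
  n_replicates,
  mean(sample(sample_values, size = length(sample_values), replace = TRUE))
)

# Visual checks: a roughly bell-shaped histogram and a Q-Q plot that
# follows the reference line both suggest approximate normality.
hist(boot_means, breaks = 30, main = "Bootstrap distribution of the mean")
qqnorm(boot_means)
qqline(boot_means)
```

If the histogram is clearly skewed or the Q-Q points bend away from the line, that is a hint that the sample may be too small for the test's assumptions.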

Let's practice!
