Non-parametric ANOVA and unpaired t-tests

Hypothesis Testing in R

Richie Cotton

Data Evangelist at DataCamp

Non-parametric tests

A non-parametric test is a hypothesis test that doesn't assume the test statistic follows a particular probability distribution (such as a normal or t-distribution).

There are two types of non-parametric hypothesis test:

  1. Simulation-based.
  2. Rank-based.

t_test()

$H_{0}$: $\mu_{child} - \mu_{adult} = 0$     $H_{A}$: $\mu_{child} - \mu_{adult} > 0$

library(infer)
stack_overflow %>% 
  t_test(
    converted_comp ~ age_first_code_cut,
    order = c("child", "adult"),
    alternative = "greater"
  )
# A tibble: 1 x 6
  statistic  t_df p_value alternative lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>
1      2.40 2083. 0.00814 greater        8438.      Inf

Calculating the null distribution

Simulation-based pipeline
null_distn <- stack_overflow %>% 
  specify(converted_comp ~ age_first_code_cut) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 5000, type = "permute") %>% 
  calculate(
    stat = "diff in means", 
    order = c("child", "adult")
  )
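
To make the permutation step more concrete, here is a minimal base-R sketch of the same idea: shuffle the group labels, recompute the difference in means, and repeat. The `toy` data frame, its column names, and the `diff_in_means()` helper are hypothetical stand-ins, and the values are simulated rather than taken from `stack_overflow`; this only approximates what the pipeline above does.

```r
# Hypothetical toy data standing in for stack_overflow (values are simulated).
set.seed(42)
toy <- data.frame(
  comp  = c(rnorm(30, mean = 110, sd = 20), rnorm(40, mean = 100, sd = 20)),
  group = rep(c("child", "adult"), times = c(30, 40))
)

# Difference in group means, child minus adult (hypothetical helper).
diff_in_means <- function(values, labels) {
  mean(values[labels == "child"]) - mean(values[labels == "adult"])
}

# One permutation replicate shuffles the labels and recomputes the statistic.
# Repeating this many times approximates the null distribution.
null_stats <- replicate(
  1000,
  diff_in_means(toy$comp, sample(toy$group))
)

length(null_stats)  # one statistic per replicate
mean(null_stats)    # expected to be near 0 when labels are independent of values
```

Shuffling the labels breaks any link between group and value, which is what `null = "independence"` expresses in the pipeline.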

Calculating the observed statistic

Simulation-based pipeline
obs_stat <- stack_overflow %>% 
  specify(converted_comp ~ age_first_code_cut) %>% 
  calculate(
    stat = "diff in means", 
    order = c("child", "adult")
  )

Get the p-value

Simulation-based pipeline
get_p_value(
  null_distn, obs_stat, 
  direction = "greater"
)
# A tibble: 1 x 1
  p_value
    <dbl>
1  0.0066
t-test, for comparison
library(infer)
stack_overflow %>% 
  t_test(
    converted_comp ~ age_first_code_cut,
    order = c("child", "adult"),
    alternative = "greater"
  )
# A tibble: 1 x 6
  statistic  t_df p_value alternative lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>
1      2.40 2083. 0.00814 greater        8438.      Inf
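
The two p-values shown here (0.0066 from simulation, 0.00814 from the t-test) are close but not identical, because the simulated value depends on the random permutations drawn. As a rough sketch of how a right-tailed permutation p-value can be computed, assuming simulated toy samples rather than the `stack_overflow` data, it is the proportion of permuted statistics at least as large as the observed one. All variable names below are hypothetical.

```r
set.seed(1)
# Hypothetical toy samples (simulated, not the stack_overflow data).
child <- rnorm(30, mean = 110, sd = 20)
adult <- rnorm(40, mean = 100, sd = 20)
values  <- c(child, adult)
n_child <- length(child)

# Observed statistic: difference in means, child minus adult.
obs_stat <- mean(child) - mean(adult)

# Null distribution: randomly reassign which values count as "child".
null_stats <- replicate(2000, {
  idx <- sample(length(values), n_child)
  mean(values[idx]) - mean(values[-idx])
})

# Right-tailed p-value: share of permuted statistics >= the observed one.
p_value <- mean(null_stats >= obs_stat)
p_value
```

Increasing the number of replicates reduces the run-to-run variation in the estimated p-value.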

Ranks of vectors

x <- c(1, 15, 3, 10, 6)
rank(x)
[1] 1 5 2 4 3

A Wilcoxon-Mann-Whitney test (a.k.a. Wilcoxon rank sum test) is (very roughly) a t-test on the ranks of the numeric input.
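
As an illustration of how ranks feed into the test, the sketch below computes the W statistic by hand for two small, hypothetical samples with no tied values and compares it with the value returned by `wilcox.test()`. The vectors `a` and `b` are made up for this example.

```r
# Hypothetical toy samples with no tied values.
a <- c(1, 15, 3, 10, 6)
b <- c(2, 4, 20, 8)

# Rank all values together, then pick out the ranks belonging to group a.
ranks  <- rank(c(a, b))
rank_a <- ranks[seq_along(a)]

# W: rank sum of group a minus the smallest rank sum it could possibly have.
w_by_hand <- sum(rank_a) - length(a) * (length(a) + 1) / 2

# The same statistic as reported by the built-in test.
w_builtin <- unname(wilcox.test(a, b)$statistic)

c(by_hand = w_by_hand, builtin = w_builtin)
```

Because only the ranks enter the calculation, replacing 20 with 2000 in `b` would leave W unchanged, which is why rank-based tests are less sensitive to outliers.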


Wilcoxon-Mann-Whitney test

wilcox.test(
  converted_comp ~ age_first_code_cut,
  data = stack_overflow,
  alternative = "greater",
  correct = FALSE
) 
    Wilcoxon rank sum test

data:  converted_comp by age_first_code_cut
W = 967298, p-value <2e-16
alternative hypothesis: true location shift is greater than 0
Note: the Wilcoxon-Mann-Whitney test is also known as the "Wilcoxon rank-sum test" and the "Mann-Whitney U test".

Kruskal-Wallis test

The Kruskal-Wallis test is to the Wilcoxon-Mann-Whitney test as ANOVA is to the t-test: it extends the rank-based comparison from two groups to more than two groups.

kruskal.test(
  converted_comp ~ job_sat,
  data = stack_overflow
)
    Kruskal-Wallis rank sum test

data:  converted_comp by job_sat
Kruskal-Wallis chi-square = 81, df = 4, p-value <2e-16
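
For intuition about where the reported statistic comes from, here is a small sketch that computes the Kruskal-Wallis statistic by hand from ranks and checks it against `kruskal.test()`. It assumes hypothetical toy data with three groups and no tied values; the formula used is the standard no-ties form, so it would need a tie correction for real data with ties.

```r
# Hypothetical toy data: three groups of four values, no ties.
values <- c(12, 15, 11, 19, 22, 25, 21, 30, 8, 9, 14, 7)
groups <- factor(rep(c("low", "mid", "high"), each = 4))

n_total  <- length(values)
ranks    <- rank(values)
rank_sum <- tapply(ranks, groups, sum)     # sum of ranks within each group
n_group  <- tapply(ranks, groups, length)  # group sizes

# H statistic (no-ties form): large when group rank sums differ a lot.
h_by_hand <- 12 / (n_total * (n_total + 1)) * sum(rank_sum^2 / n_group) -
  3 * (n_total + 1)

# Approximate p-value from a chi-squared distribution with (groups - 1) df.
p_by_hand <- pchisq(h_by_hand, df = nlevels(groups) - 1, lower.tail = FALSE)

kw <- kruskal.test(values ~ groups)
c(by_hand = h_by_hand, builtin = unname(kw$statistic))
```

The degrees of freedom equal the number of groups minus one, which matches `df = 4` for the five `job_sat` categories in the output above.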

Let's practice!
