Non-parametric ANOVA and unpaired t-tests

Hypothesis Testing in R

Richie Cotton

Data Evangelist at DataCamp

Non-parametric tests

A non-parametric test is a hypothesis test that doesn't assume a probability distribution for the test statistic.

There are two types of non-parametric hypothesis test:

Simulation-based.
Rank-based.

t_test()

$H_{0}$: $\mu_{child} - \mu_{adult} = 0$

$H_{A}$: $\mu_{child} - \mu_{adult} > 0$

library(infer)
stack_overflow %>% 
  t_test(
    converted_comp ~ age_first_code_cut,
    order = c("child", "adult"),
    alternative = "greater"
  )

# A tibble: 1 x 6
  statistic  t_df p_value alternative lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>
1      2.40 2083. 0.00814 greater        8438.      Inf
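The statistic in that output can be reproduced by hand. The sketch below uses base R and made-up toy vectors (child and adult are hypothetical, not the stack_overflow data), and assumes the unequal-variance (Welch) form of the t-test, which is the default for stats::t.test() and would be consistent with the non-integer degrees of freedom shown above.

```r
# Hypothetical toy samples, for illustration only.
child <- c(52, 61, 48, 75, 66, 70)
adult <- c(40, 58, 45, 50, 39)

# Welch standard error: each group contributes its own variance term.
se <- sqrt(var(child) / length(child) + var(adult) / length(adult))

# Test statistic: difference in sample means divided by the standard error.
t_by_hand <- (mean(child) - mean(adult)) / se

# Same test via base R; var.equal = FALSE (Welch) is the default.
tt <- t.test(child, adult, alternative = "greater")

t_by_hand
unname(tt$statistic)
```

The two printed values should agree, since t.test() computes the same ratio internally.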

Calculating the null distribution

Simulation-based pipeline

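null_distn <- stack_overflow %>% 
  # Response ~ explanatory variable
  specify(converted_comp ~ age_first_code_cut) %>%
  # Null hypothesis: compensation is independent of age group
  hypothesize(null = "independence") %>%
  # Shuffle the response 5000 times
  generate(reps = 5000, type = "permute") %>%
  # Difference in means (child minus adult) for each shuffle
  calculate(
    stat = "diff in means", 
    order = c("child", "adult")
  )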
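To see what the permutation step is doing, here is a minimal base R sketch of the same idea. It is not infer's implementation: the data are made up, and diff_in_means() is a hypothetical helper written for this example.

```r
# Hypothetical toy data standing in for the two age groups.
set.seed(42)
comp  <- c(52, 61, 48, 75, 66, 40, 58, 45, 50, 39)
group <- rep(c("child", "adult"), each = 5)

# Difference in group means, in the order child minus adult.
diff_in_means <- function(values, labels) {
  mean(values[labels == "child"]) - mean(values[labels == "adult"])
}

# Under independence the labels are exchangeable,
# so shuffle them and recompute the statistic each time.
null_stats <- replicate(1000, diff_in_means(comp, sample(group)))

# Observed statistic, computed on the unshuffled labels.
obs <- diff_in_means(comp, group)
obs
```

Each element of null_stats plays the role of one row of null_distn; obs plays the role of obs_stat.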

Calculating the observed statistic

Simulation-based pipeline

obs_stat <- stack_overflow %>% 
  specify(converted_comp ~ age_first_code_cut) %>% 
  calculate(
    stat = "diff in means", 
    order = c("child", "adult")
  )

Get the p-value

Simulation-based pipeline

get_p_value(
  null_distn, obs_stat, 
  direction = "greater"
)

# A tibble: 1 x 1
  p_value
    <dbl>
1  0.0066
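A simulation-based p-value is just a proportion. The sketch below uses made-up numbers and assumes the "greater" direction counts null statistics at least as large as the observed one, which is my understanding of the convention rather than something stated on the slide.

```r
# Hypothetical null statistics and observed statistic, for illustration only.
null_stats <- c(-3, -1, 0, 2, 4, 5, -2, 1, 6, -4)
obs <- 4

# Right-tailed p-value: share of null statistics >= the observed statistic.
p_greater <- mean(null_stats >= obs)
p_greater
```

Under the same convention, a p-value of 0.0066 from 5000 reps would correspond to 33 permuted statistics at or above the observed one.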

t-test, for comparison

The same t_test() call as before gives p_value = 0.00814, against 0.0066 from the simulation-based pipeline.

Ranks of vectors

x <- c(1, 15, 3, 10, 6)

rank(x)

[1] 1 5 2 4 3
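Real data often contain ties. A short sketch with a made-up vector shows rank()'s default behavior, where tied values share the average of the ranks they would otherwise occupy.

```r
# Hypothetical vector with a tie: the two 10s would occupy ranks 1 and 2.
y <- c(10, 20, 10, 30)

# Default ties.method = "average" gives each tied value rank 1.5.
r <- rank(y)
r
```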

A Wilcoxon-Mann-Whitney test (a.k.a. Wilcoxon rank sum test) is (very roughly) a t-test on the ranks of the numeric input.
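The link to ranks can be checked by hand. This sketch uses made-up samples a and b (hypothetical, with no ties) and relies on the fact that the W reported by wilcox.test() is the rank sum of the first group minus its smallest possible value, n1 * (n1 + 1) / 2.

```r
# Hypothetical toy samples, for illustration only.
a <- c(12, 15, 9, 20, 17, 14)
b <- c(8, 11, 7, 13, 10, 6)

# Rank the pooled values, then split the ranks back into the two groups.
pooled_ranks <- rank(c(a, b))
ranks_a <- pooled_ranks[seq_along(a)]
ranks_b <- pooled_ranks[-seq_along(a)]

# W by hand: rank sum of the first group minus n1 * (n1 + 1) / 2.
w_by_hand <- sum(ranks_a) - length(a) * (length(a) + 1) / 2
w_from_r  <- unname(wilcox.test(a, b)$statistic)

# The "very rough" analogy: a t-test run on the ranks instead of the raw values.
t_on_ranks <- t.test(ranks_a, ranks_b)

c(w_by_hand = w_by_hand, w_from_r = w_from_r)
```

The t-test on ranks will not give the same p-value as wilcox.test(), but both respond to the same thing: one group's ranks sitting higher than the other's.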

Wilcoxon-Mann-Whitney test

wilcox.test(
  converted_comp ~ age_first_code_cut,
  data = stack_overflow,
  alternative = "greater",
  correct = FALSE
)

    Wilcoxon rank sum test

data:  converted_comp by age_first_code_cut
W = 967298, p-value <2e-16
alternative hypothesis: true location shift is greater than 0

The Wilcoxon-Mann-Whitney test is also known as the "Wilcoxon rank-sum test" and the "Mann-Whitney U test".

Kruskal-Wallis test

The Kruskal-Wallis test is to the Wilcoxon-Mann-Whitney test as ANOVA is to the t-test: it extends the rank-based comparison from two groups to three or more.

kruskal.test(
  converted_comp ~ job_sat,
  data = stack_overflow
)

    Kruskal-Wallis rank sum test

data:  converted_comp by job_sat
Kruskal-Wallis chi-squared = 81, df = 4, p-value <2e-16
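The analogy with ANOVA and the t-test can be checked on toy data. In this sketch the vectors low, mid and high are made up; it shows that with only two groups, kruskal.test() gives the same p-value as a two-sided wilcox.test() using the normal approximation without continuity correction.

```r
# Hypothetical toy samples, for illustration only.
low  <- c(8, 11, 7, 13, 10, 6)
mid  <- c(12, 15, 9, 20, 17, 14)
high <- c(22, 25, 19, 24, 21, 18)

# Three groups: degrees of freedom are (number of groups - 1).
kw3 <- kruskal.test(list(low, mid, high))

# Two groups: Kruskal-Wallis reduces to the rank sum test.
kw2 <- kruskal.test(list(low, mid))
wmw <- wilcox.test(low, mid, exact = FALSE, correct = FALSE)

c(kruskal = kw2$p.value, wilcoxon = wmw$p.value)
```

This mirrors how a one-way ANOVA on two groups matches a two-sided equal-variance t-test.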

Let's practice!
