Performing t-tests

Hypothesis Testing in R

Richie Cotton

Data Evangelist at DataCamp

Two-sample problems

Another problem is to compare sample statistics across groups of a variable.
converted_comp is a numerical variable.
age_first_code_cut is a categorical variable with two levels, "child" and "adult".
Are users who first programmed as a child compensated more, on average, than those who started as adults?

Hypotheses

$H_{0}$: The mean compensation (in USD) is the same for those who first coded as a child and those who first coded as an adult.

$H_{0}$: $\mu_{child} = \mu_{adult}$

$H_{0}$: $\mu_{child} - \mu_{adult} = 0$

$H_{A}$: The mean compensation (in USD) is greater for those who first coded as a child than for those who first coded as an adult.

$H_{A}$: $\mu_{child} > \mu_{adult}$

$H_{A}$: $\mu_{child} - \mu_{adult} > 0$

Calculating groupwise summary statistics

stack_overflow %>% 
  group_by(age_first_code_cut) %>% 
  summarize(mean_compensation = mean(converted_comp))

# A tibble: 2 x 2
  age_first_code_cut mean_compensation
  <chr>                          <dbl>
1 adult                        111544.
2 child                        138275.

Test statistics

The sample mean estimates the population mean.
$\bar{x}$ denotes a sample mean.
$\bar{x}_{child}$ is the original sample mean compensation for coding first as a child.
$\bar{x}_{adult}$ is the original sample mean compensation for coding first as an adult.
$\bar{x}_{child} - \bar{x}_{adult}$ is a test statistic.
z-scores are one type of (standardized) test statistic.
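
As a quick illustration, the unstandardized test statistic can be computed from the groupwise means reported in the earlier summary. This is a minimal sketch: the variable names (including `diff_means`) are assumptions made for the example, and the means are the rounded values printed in that summary table.

```r
# Rounded groupwise means from the summary table (USD)
xbar_child <- 138275
xbar_adult <- 111544

# Difference in sample means: the unstandardized test statistic
diff_means <- xbar_child - xbar_adult
diff_means
```

A positive difference points in the direction of the alternative hypothesis, but it has to be standardized before it can be judged against a reference distribution.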

Standardizing the test statistic

$z = \dfrac{\text{sample stat} - \text{population parameter}}{\text{standard error}}$

$t = \dfrac{\text{difference in sample stats} - \text{difference in population parameters}}{\text{standard error}}$

$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) - (\mu_{\text{child}} - \mu_{\text{adult}})}{SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}$

Standard error

$SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) \approx \sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}$

$s$ is the sample standard deviation of the variable within each group.

$n$ is the sample size (number of observations/rows in sample).
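
The approximation above can be wrapped in a small helper. This is only a sketch: `se_diff_means()` is a hypothetical function name, not part of any package, and the inputs in the usage line are made-up values chosen so the arithmetic is easy to check by hand.

```r
# Hypothetical helper: approximate standard error of a difference in means,
# given each group's sample standard deviation (s) and sample size (n)
se_diff_means <- function(s1, n1, s2, n2) {
  sqrt(s1 ^ 2 / n1 + s2 ^ 2 / n2)
}

# Made-up inputs: 3^2 / 9 + 4^2 / 16 = 1 + 1 = 2, so the result is sqrt(2)
se_example <- se_diff_means(s1 = 3, n1 = 9, s2 = 4, n2 = 16)
se_example
```

Each group contributes its own variance term, so the smaller or noisier group tends to dominate the standard error.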

Assuming the null hypothesis is true

$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) - (\mu_{\text{child}} - \mu_{\text{adult}})}{SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}$

$H_{0}$: $\mu_{\text{child}} - \mu_{\text{adult}} = 0$

$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}}) }{SE(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}$

$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$

stack_overflow %>%
  group_by(age_first_code_cut) %>%
  summarize(
    xbar = mean(converted_comp),
    s = sd(converted_comp),
    n = n()
  )

# A tibble: 2 x 4
  age_first_code_cut    xbar       s     n
  <chr>                <dbl>   <dbl> <int>
1 adult              111544. 270381.  1579
2 child              138275. 278130.  1001

Calculating the test statistic

$t = \dfrac{(\bar{x}_{\text{child}} - \bar{x}_{\text{adult}})}{\sqrt{\dfrac{s_{\text{child}}^2}{n_{\text{child}}} + \dfrac{s_{\text{adult}}^2}{n_{\text{adult}}}}}$

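# Rounded summary statistics from the groupwise table above
xbar_child <- 138275; xbar_adult <- 111544
s_child <- 278130; s_adult <- 270381
n_child <- 1001; n_adult <- 1579
numerator <- xbar_child - xbar_adult
denominator <- sqrt(
  s_child ^ 2 / n_child + s_adult ^ 2 / n_adult
)
t_stat <- numerator / denominator
t_stat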

2.4046
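
One way to sanity-check a manual calculation like this is to compare it with base R's `t.test()`, which with `var.equal = FALSE` (its default) uses the same unpooled standard error. The sketch below runs on simulated data, not the Stack Overflow survey; the data frame `sim`, its group sizes, and its means are all assumptions made purely for illustration.

```r
# Simulated data (hypothetical; not the Stack Overflow survey)
set.seed(42)
sim <- data.frame(
  age_first_code_cut = rep(c("adult", "child"), times = c(60, 40)),
  converted_comp = c(
    rnorm(60, mean = 100, sd = 20),
    rnorm(40, mean = 110, sd = 25)
  )
)

# Manual calculation, following the formula above
child <- sim$converted_comp[sim$age_first_code_cut == "child"]
adult <- sim$converted_comp[sim$age_first_code_cut == "adult"]
t_manual <- (mean(child) - mean(adult)) /
  sqrt(var(child) / length(child) + var(adult) / length(adult))

# Built-in test; passing child first keeps the sign as child minus adult
welch <- t.test(child, adult, var.equal = FALSE)
t_builtin <- unname(welch$statistic)

# Compare the two statistics up to floating-point tolerance
all.equal(t_manual, t_builtin)
```

Passing the groups as `t.test(child, adult)` rather than through the formula interface is a deliberate choice here: with a formula, the groups are ordered by factor level ("adult" before "child"), which would flip the sign of the statistic.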

