Gepaarde t-toetsen

Hypothesis Testing in R

Richie Cotton

Data Evangelist at DataCamp

Dataset Republikeinse stemmen VS

staat county repub_percent_08 repub_percent_12
Alabama Bullock 25.69 23.51
Alabama Chilton 78.49 79.78
Alabama Clay 73.09 72.31
Alabama Cullman 81.85 84.16
Alabama Escambia 63.89 62.46
Alabama Fayette 73.93 76.19
Alabama Franklin 68.83 69.68
... ... ... ...

500 rijen; elke rij is county-stemmen bij een presidentsverkiezing.

1 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
Hypothesis Testing in R

Hypothesen

Vraag: Was het percentage stemmen voor de Republikeinse kandidaat lager in 2008 dan in 2012?

$H_{0}$: $\mu_{2008} - \mu_{2012} = 0$

$H_{A}$: $\mu_{2008} - \mu_{2012} < 0$

Stel het significantieniveau $\alpha = 0.05$ in.

De data is gepaard, want elk percentage hoort bij dezelfde county.

Hypothesis Testing in R

Van twee steekproeven naar één

sample_data <- repub_votes_potus_08_12 %>% 
  mutate(diff = repub_percent_08 - repub_percent_12)
ggplot(sample_data, aes(x = diff)) +
  geom_histogram(binwidth = 1)

Histogram van de variabele diff - meeste waarden liggen tussen -10 en 10 met enkele uitschieters.

Hypothesis Testing in R

Steekproefstatistiek van het verschil berekenen

sample_data %>% 
  summarize(xbar_diff = mean(diff))
  xbar_diff
1 -2.643027
Hypothesis Testing in R

Aangepaste hypothesen

Oude hypothesen

$H_{0}$: $\mu_{2008} - \mu_{2012} = 0$

$H_{A}$: $\mu_{2008} - \mu_{2012} < 0$

 

Nieuwe hypothesen

$H_{0}$: $\mu_{\text{diff}} = 0$

$H_{A}$: $ \mu_{\text{diff}} < 0$

$t = \dfrac{\bar{x}_{\text{diff}} - \mu_{\text{diff}}}{\sqrt{\dfrac{s_{diff}^2}{n_{\text{diff}}}}}$

$df = n_{diff} - 1$

Hypothesis Testing in R

De p-waarde berekenen

n_diff <- nrow(sample_data)
s_diff <- sample_data %>% 
  summarize(sd_diff = sd(diff)) %>%
  pull(sd_diff)
t_stat <- (xbar_diff - 0) / sqrt(s_diff ^ 2 / n_diff)
-16.06374
degrees_of_freedom <- n_diff - 1
499

$t = \dfrac{\bar{x}_{\text{diff}} - \mu_{\text{diff}}}{\sqrt{\dfrac{s_{\text{diff}}^2}{n_{\text{diff}}}}}$

$df = n_{\text{diff}} - 1$

 

p_value <- pt(t_stat, df = degrees_of_freedom)
2.084965e-47
Hypothesis Testing in R

Verschillen testen tussen twee gemiddelden met t.test()

t.test(

# Vector met verschillen sample_data$diff,
# Kies uit "two.sided", "less", "greater" alternative = "less",
# Populatieparameter onder H0 mu = 0
)
    One Sample t-test

data:  sample_data$diff
t = -16.064, df = 499, p-value < 2.2e-16
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
     -Inf -2.37189
sample estimates:
mean of x 
-2.643027
Hypothesis Testing in R

t.test() met paired = TRUE

t.test(
  sample_data$repub_percent_08,
  sample_data$repub_percent_12,
  alternative = "less",
  mu = 0,
  paired = TRUE
)
    Paired t-test

data:  sample_data$repub_percent_08 and 
       sample_data$repub_percent_12
t = -16.064, df = 499, p-value < 2.2e-16
alternative hypothesis: true difference in means 
                        is less than 0
95 percent confidence interval:
     -Inf -2.37189
sample estimates:
mean of the differences 
              -2.643027
Hypothesis Testing in R

Ongepaarde t.test()

t.test(
  x = sample_data$repub_percent_08,
  y = sample_data$repub_percent_12,
  alternative = "less",
  mu = 0
)

Ongepaarde t-toets heeft meer kans op een vals-negatieve fout (minder power).

    Welch Two Sample t-test

data:  sample_data$repub_percent_08 and 
       sample_data$repub_percent_12
t = -2.8788, df = 992.76, p-value = 0.002039
alternative hypothesis: true difference in means
                        is less than 0
95 percent confidence interval:
      -Inf -1.131469
sample estimates:
mean of x mean of y 
 56.52034  59.16337 
Hypothesis Testing in R

Laten we oefenen!

Hypothesis Testing in R

Preparing Video For Download...