Hypothesis Testing in R
Richie Cotton
Data Evangelist at DataCamp
[Figure: control vs. treatment groups]
library(dplyr)
glimpse(stack_overflow)
Rows: 2,261
Columns: 8
$ respondent <dbl> 36, 47, 69, 125, 147, 152, 166, 170, 187, 196, 221,…
$ age_first_code_cut <chr> "adult", "child", "child", "adult", "adult", "adult…
$ converted_comp <dbl> 77556, 74970, 594539, 2000000, 37816, 121980, 48644…
$ job_sat <fct> Slightly satisfied, Very satisfied, Very satisfied,…
$ purple_link <chr> "Hello, old friend", "Hello, old friend", "Hello, o…
$ age_cat <chr> "At least 30", "At least 30", "Under 30", "At least…
$ age <dbl> 34, 53, 25, 41, 28, 30, 28, 26, 43, 23, 24, 35, 37,…
$ hobbyist <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Ye…
A hypothesis:
The mean annual compensation of the population of data scientists is $110,000.
The point estimate (sample statistic):
# Base R
mean_comp_samp <- mean(stack_overflow$converted_comp)
# Equivalent dplyr pipeline
mean_comp_samp <- stack_overflow %>%
  summarize(mean_compensation = mean(converted_comp)) %>%
  pull(mean_compensation)
119574.7
# Step 3. Repeat steps 1 & 2 many times
so_boot_distn <- replicate(
  n = 5000,
  expr = {
    # Step 1. Resample
    stack_overflow %>%
      slice_sample(prop = 1, replace = TRUE) %>%
      # Step 2. Calculate point estimate
      summarize(mean_compensation = mean(converted_comp)) %>%
      pull(mean_compensation)
  }
)
library(ggplot2)
# Histogram of the bootstrap distribution of the sample mean
tibble(resample_mean = so_boot_distn) %>%
  ggplot(aes(resample_mean)) +
  geom_histogram(binwidth = 1000)
std_error <- sd(so_boot_distn)
5511.674
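As a sanity check on the bootstrap standard error, the sketch below compares it with the familiar approximation sd(x) / sqrt(n) for the standard error of a mean. It uses synthetic data, since `stack_overflow` is not available here; the vector `comp` and its lognormal parameters are assumptions for illustration, not the survey values.

```r
# Synthetic stand-in for converted_comp (hypothetical data, not the survey)
set.seed(42)
comp <- rlnorm(2000, meanlog = 11, sdlog = 0.8)

# Bootstrap: resample with replacement, record the mean of each resample
boot_means <- replicate(
  n = 2000,
  expr = mean(sample(comp, size = length(comp), replace = TRUE))
)

# Standard error estimated from the bootstrap distribution
se_boot <- sd(boot_means)

# Approximation for the standard error of a sample mean
se_formula <- sd(comp) / sqrt(length(comp))

c(bootstrap = se_boot, formula = se_formula)
```

The two values should be close but not identical, because the bootstrap estimate carries its own resampling noise.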
$\text{standardized value} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$
$z = \dfrac{\text{sample stat} - \text{hypoth. param. value}}{\text{standard error}}$
$z = \dfrac{\$119,574.7 - \$110,000}{\$5511.67} = 1.737$
mean_comp_samp
119574.7
mean_comp_hyp <- 110000
std_error
5511.674
z_score <- (mean_comp_samp - mean_comp_hyp) / std_error
1.737171
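The same calculation can be wrapped in a small function so it is easy to reuse with other statistics. This is a minimal sketch; the name `standardize_stat` and its arguments are hypothetical, and the inputs are the rounded values printed above.

```r
# Hypothetical helper: the name and arguments are illustrative only
standardize_stat <- function(sample_stat, hyp_value, std_error) {
  (sample_stat - hyp_value) / std_error
}

# Rounded values copied from the output above
z <- standardize_stat(
  sample_stat = 119574.7,
  hyp_value = 110000,
  std_error = 5511.674
)
round(z, 3)
```

Because the inputs are rounded, the result agrees with `z_score` only to a few decimal places.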
Determine whether a sample statistic is close to or far from its expected (or "hypothesized") value.
Standard normal distribution: the normal distribution with mean 0 and standard deviation 1.
tibble(x = seq(-4, 4, 0.01)) %>%
  ggplot(aes(x)) +
  stat_function(fun = dnorm) +
  ylab("PDF(x)")
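A few properties of the plotted curve can be checked numerically in base R. The sketch below is illustrative only; the sample `x` is simulated, and its mean of 50 and standard deviation of 10 are arbitrary assumptions.

```r
# Peak height of the standard normal PDF is 1 / sqrt(2 * pi)
peak <- dnorm(0)

# A probability density integrates to 1 over the whole real line
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Standardizing a sample gives it mean 0 and standard deviation 1
set.seed(1)
x <- rnorm(1000, mean = 50, sd = 10)  # hypothetical sample
z <- (x - mean(x)) / sd(x)

c(peak = peak, area = total_area, mean_z = mean(z), sd_z = sd(z))
```

Standardizing is what lets a statistic measured in dollars be compared against this unitless curve.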