Statistical significance

Hypothesis Testing in R

Richie Cotton

Data Evangelist at DataCamp

p-value recap

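  • A p-value measures how compatible the data are with the null hypothesis: it is the probability of a result at least as extreme as the one observed, assuming $H_{0}$ is true.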
  • Large p-value → fail to reject null hypothesis.
  • Small p-value → reject null hypothesis.
  • Where is the cutoff point?

Significance level

The significance level of a hypothesis test ($\alpha$) is the threshold point for "beyond a reasonable doubt".

  • Common values of $\alpha$ are 0.1, 0.05, and 0.01.
  • If $p \le \alpha$, reject $H_{0}$, else fail to reject $H_{0}$.
  • $\alpha$ should be set prior to conducting the hypothesis test.
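
The decision rule above can be sketched as a small helper. This is a minimal illustration rather than course code: the function name `decide` and the example p-values are hypothetical.

```r
# Hypothetical helper: apply the rule "reject H0 if p <= alpha".
decide <- function(p_value, alpha = 0.05) {
  # alpha must be a valid probability, fixed before looking at the data
  stopifnot(alpha > 0, alpha < 1)
  ifelse(p_value <= alpha, "reject H0", "fail to reject H0")
}

# Made-up p-values, chosen only to show both outcomes and the boundary case
decide(c(0.003, 0.05, 0.2), alpha = 0.05)

# The same p-value can lead to a different decision under a stricter alpha
decide(0.03, alpha = 0.01)
```

Because the decision depends on $\alpha$, choosing it after seeing the p-value would let the analyst pick whichever outcome they prefer.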

Calculating the p-value

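library(dplyr)
# Significance level, chosen before running the test
alpha <- 0.05
# Sample proportion of respondents who first coded as children
prop_child_samp <- stack_overflow %>%
  summarize(
    point_estimate = mean(age_first_code_cut == "child")
  ) %>%
  pull(point_estimate)
# Hypothesized proportion under the null hypothesis
prop_child_hyp <- 0.35
# Standard error of the sample proportion (precomputed)
std_error <- 0.0096028
# Standardized test statistic
z_score <- (prop_child_samp - prop_child_hyp) / std_error
# Right-tailed test: probability of a z-score at least this large under the null
p_value <- pnorm(z_score, lower.tail = FALSE)
p_value
3.818e-05
p_value <= alpha
TRUE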

p_value is less than or equal to alpha, so reject $H_{0}$ and accept $H_{A}$.

The proportion of data scientists starting programming as children is greater than 35%.
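
The calculation uses `lower.tail = FALSE` because the alternative hypothesis is "greater than", so only the right tail of the normal distribution counts as extreme. A minimal sketch of how the tail choice could follow from the alternative is given here; the helper name `p_value_from_z` and the z-score of 1.5 are illustrative assumptions, not values from the slides.

```r
# Hypothetical helper: convert a z-score to a p-value for a given alternative.
p_value_from_z <- function(z_score,
                           alternative = c("greater", "less", "two_sided")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    greater   = pnorm(z_score, lower.tail = FALSE),          # right tail
    less      = pnorm(z_score, lower.tail = TRUE),           # left tail
    two_sided = 2 * pnorm(abs(z_score), lower.tail = FALSE)  # both tails
  )
}

# Illustrative z-score, not the value from the slide
p_value_from_z(1.5, "greater")
p_value_from_z(1.5, "two_sided")
```

For the same z-score, the two-sided p-value is twice the one-sided one, so the direction of the alternative should be fixed before testing, just like $\alpha$.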


Confidence intervals

For a significance level of 0.05, it's common to choose a confidence level of 1 - 0.05 = 0.95, that is, a 95% confidence interval.

# 95% interval: cut 2.5% from each tail of the bootstrap distribution
conf_int <- first_code_boot_distn %>%
  summarize(
    lower = quantile(first_code_child_rate, 0.025),
    upper = quantile(first_code_child_rate, 0.975)
  )
conf_int
# A tibble: 1 x 2
  lower upper
  <dbl> <dbl>
1 0.369 0.407
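
The hypothesized proportion of 0.35 lies below the lower bound of 0.369, which agrees with the decision to reject $H_{0}$. The sketch below generalizes the quantile cutoffs to any $\alpha$ using base R. The helper name `quantile_ci` and the simulated bootstrap values are assumptions made for illustration; they stand in for `first_code_boot_distn`, which is not reproduced here.

```r
# Hypothetical helper: percentile interval at confidence level 1 - alpha.
quantile_ci <- function(boot_stats, alpha = 0.05) {
  bounds <- quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2),
                     names = FALSE)
  c(lower = bounds[1], upper = bounds[2])
}

# Simulated stand-in for a bootstrap distribution (made-up data, for
# illustration only): resample a 0/1 vector and record each resample's mean.
set.seed(42)
is_child <- rbinom(500, size = 1, prob = 0.4)
boot_props <- replicate(1000, mean(sample(is_child, replace = TRUE)))
quantile_ci(boot_props, alpha = 0.05)

# Check with the interval reported on the slide: is 0.35 inside it?
prop_child_hyp <- 0.35
prop_child_hyp >= 0.369 & prop_child_hyp <= 0.407
```

The final expression evaluates to `FALSE`, since 0.35 is smaller than 0.369.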

Types of errors

                     Truly didn't commit crime   Truly committed crime
Verdict not guilty   correct                     they got away with it
Verdict guilty       wrongful conviction         correct

                 actual $H_{0}$   actual $H_{A}$
chosen $H_{0}$   correct          false negative
chosen $H_{A}$   false positive   correct

False positives are Type I errors; false negatives are Type II errors.
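
The link between $\alpha$ and Type I errors can be checked by simulation: when $H_{0}$ is true, a test at level $\alpha$ should reject in roughly a fraction $\alpha$ of repeated studies. The sketch below is an illustration under stated assumptions, not the course's method: the sample size, the number of simulations, and the analytical standard error formula are all choices made for this example.

```r
# Simulate many studies in which H0 is true and count how often a
# right-tailed z-test for a proportion rejects it anyway (Type I errors).
set.seed(2024)
alpha <- 0.05
prop_hyp <- 0.35      # null value; here it is also the true proportion
n <- 1000             # hypothetical sample size
n_sims <- 5000        # number of simulated studies

# Analytical standard error under H0 (an assumption of this sketch)
std_error <- sqrt(prop_hyp * (1 - prop_hyp) / n)

# One sample proportion per simulated study
prop_samp <- rbinom(n_sims, size = n, prob = prop_hyp) / n
z_scores <- (prop_samp - prop_hyp) / std_error
p_values <- pnorm(z_scores, lower.tail = FALSE)

# Share of simulated studies that (wrongly) reject H0
type_1_rate <- mean(p_values <= alpha)
type_1_rate
```

The simulated rate should land close to `alpha`, with small deviations due to simulation noise and the discreteness of counts.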


Possible errors in our example

If $p \le \alpha$, we reject $H_{0}$:

  • A false positive (Type I) error could have occurred: we thought that data scientists started coding as children at a higher rate when in reality they did not.

If $p > \alpha$, we fail to reject $H_{0}$:

  • A false negative (Type II) error could have occurred: we thought that data scientists coded as children at the same rate as software engineers when in reality they coded as children at a higher rate.
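
How likely a Type II error is depends on how far the true proportion sits from the hypothesized one. A rough sketch using the normal approximation is given below; the helper name `type_2_rate`, the sample size, and the "true" proportions are hypothetical values picked for illustration.

```r
# Hypothetical helper: approximate Type II error rate of a right-tailed
# z-test for a proportion, using the normal approximation.
type_2_rate <- function(prop_true, prop_hyp, n, alpha = 0.05) {
  se_hyp <- sqrt(prop_hyp * (1 - prop_hyp) / n)     # standard error under H0
  se_true <- sqrt(prop_true * (1 - prop_true) / n)  # standard error under the truth
  # Smallest sample proportion that leads to rejecting H0
  cutoff <- prop_hyp + qnorm(1 - alpha) * se_hyp
  # Probability that the sample proportion falls below the cutoff
  pnorm(cutoff, mean = prop_true, sd = se_true)
}

# Made-up scenarios: true proportion slightly vs. clearly above 0.35
type_2_rate(prop_true = 0.36, prop_hyp = 0.35, n = 1000)
type_2_rate(prop_true = 0.40, prop_hyp = 0.35, n = 1000)
```

A true proportion close to 0.35 is easy to miss, so the Type II rate is high; a larger gap makes a false negative much less likely.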

Let's practice!
