Chi-square test of independence

Hypothesis Testing in R

Richie Cotton

Data Evangelist at DataCamp

Revisiting the proportion test

library(infer)
stack_overflow %>% 
  prop_test(
    hobbyist ~ age_cat,
    order = c("At least 30", "Under 30"),
    alternative = "two-sided",
    correct = FALSE
  )
# A tibble: 1 x 6
  statistic chisq_df   p_value alternative lower_ci upper_ci
      <dbl>    <dbl>     <dbl> <chr>          <dbl>    <dbl>
1      17.8        1 0.0000248 two.sided     0.0605    0.165

Independence of variables

Previous hypothesis test result: there is evidence that the hobbyist and age_cat variables have an association.

If the proportion of successes in the response variable is the same across all categories of the explanatory variable, the two variables are statistically independent.

1 Response and explanatory variables are defined in "Introduction to Regression in R", Chapter 1.
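As a rough illustration of that definition, the base R sketch below builds a hypothetical 2 x 2 table of counts in which the proportion of successes is the same in both explanatory categories. The counts and the group labels "A" and "B" are invented for illustration and do not come from stack_overflow.

```r
# Hypothetical counts, invented for illustration (not from stack_overflow).
# Rows are response outcomes; columns are explanatory categories.
tab <- matrix(
  c(30, 70,    # group A: 30 successes, 70 failures
    60, 140),  # group B: 60 successes, 140 failures
  nrow = 2,
  dimnames = list(response = c("success", "failure"),
                  explanatory = c("A", "B"))
)

# Proportion of each response outcome within each explanatory category
props <- prop.table(tab, margin = 2)
print(props)

# With identical proportions, the observed counts equal the expected counts,
# so the chi-square statistic is (numerically) zero.
res <- chisq.test(tab, correct = FALSE)
print(unname(res$statistic))
```

Both columns have a success proportion of 0.3, so this table shows no departure from independence at all.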

Job satisfaction and age category

stack_overflow %>% 
  count(age_cat)
# A tibble: 2 x 2
  age_cat         n
  <chr>       <int>
1 At least 30  1050
2 Under 30     1211
stack_overflow %>% 
  count(job_sat)
# A tibble: 5 x 2
  job_sat                   n
  <fct>                 <int>
1 Very dissatisfied       159
2 Slightly dissatisfied   342
3 Neither                 201
4 Slightly satisfied      680
5 Very satisfied          879

Declaring the hypotheses

$H_{0}$: Age categories are independent of job satisfaction levels.

$H_{A}$: Age categories are not independent of job satisfaction levels.

alpha <- 0.1
  • Test statistic denoted $\chi^{2}$.
  • Assuming independence, how far away are the observed results from the expected values?
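The "expected values" in the last bullet can be sketched in base R. Under independence, the expected count in each cell is (row total x column total) / n, so it depends only on the marginal counts reported by count() earlier. The helper chi_sq_stat() is a hypothetical name introduced here, not part of infer. It assumes the usual Pearson form of the statistic, the sum of (observed - expected)^2 / expected over all cells.

```r
# Marginal counts as reported by count() in this section
age_counts <- c("At least 30" = 1050, "Under 30" = 1211)
job_counts <- c(
  "Very dissatisfied" = 159, "Slightly dissatisfied" = 342,
  "Neither" = 201, "Slightly satisfied" = 680, "Very satisfied" = 879
)
n <- sum(age_counts)  # total respondents; sum(job_counts) gives the same total

# Expected count per cell under independence: row total * column total / n
expected <- outer(age_counts, job_counts) / n
print(round(expected, 1))

# Hypothetical helper (not part of infer): Pearson chi-square statistic
# from matrices of observed and expected counts
chi_sq_stat <- function(observed, expected) {
  sum((observed - expected)^2 / expected)
}
```

The observed cell counts are not listed in this section, so the sketch stops at the expected table. Passing a matrix of observed counts with the same shape to chi_sq_stat() would measure how far the data sit from independence.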

Exploratory visualization: proportional stacked bar plot

ggplot(stack_overflow, aes(job_sat, fill = age_cat)) +
  geom_bar(position = "fill") +
  ylab("proportion")

Proportional stacked bar plot of job satisfaction filled by age category


Chi-square independence test using chisq_test()

library(infer)
stack_overflow %>% 
  chisq_test(age_cat ~ job_sat)
# A tibble: 1 x 3
  statistic chisq_df p_value
      <dbl>    <int>   <dbl>
1      5.55        4   0.235

Degrees of freedom:

$(\text{No. of response categories} - 1) \times (\text{No. of explanatory categories} - 1)$

$(2 - 1) \times (5 - 1) = 4$
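The degrees-of-freedom arithmetic can be checked in base R, along with the critical value it implies at the significance level set earlier (alpha <- 0.1). This is a minimal sketch. The variable names are illustrative, and qchisq() is the base R quantile function for the chi-square distribution.

```r
alpha <- 0.1

n_response_cats <- 2     # age_cat: "At least 30", "Under 30"
n_explanatory_cats <- 5  # job_sat: five satisfaction levels

# Degrees of freedom for a test of independence
df <- (n_response_cats - 1) * (n_explanatory_cats - 1)
print(df)

# Right-tail critical value: statistics above this reject H0 at level alpha
critical_value <- qchisq(1 - alpha, df = df)
print(critical_value)

# The statistic reported by chisq_test() was 5.55
print(5.55 > critical_value)
```

The reported statistic falls below the critical value, which agrees with the p-value of 0.235 being larger than alpha.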


Swapping the variables?

ggplot(stack_overflow, aes(age_cat, fill = job_sat)) +
  geom_bar(position = "fill") +
  ylab("proportion")

Proportional stacked bar plot of age category filled by job satisfaction


Chi-square both ways

library(infer)
stack_overflow %>% 
  chisq_test(age_cat ~ job_sat)
# A tibble: 1 x 3
  statistic chisq_df p_value
      <dbl>    <int>   <dbl>
1      5.55        4   0.235

library(infer)
stack_overflow %>% 
  chisq_test(job_sat ~ age_cat)
# A tibble: 1 x 3
  statistic chisq_df p_value
      <dbl>    <int>   <dbl>
1      5.55        4   0.235

Ask: Are the variables X and Y independent?

Not: Is variable X independent from variable Y?
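The symmetry can also be seen without infer. The sketch below uses base R's chisq.test() on a hypothetical 2 x 5 table of counts (invented for illustration, not the stack_overflow data) and on its transpose, which corresponds to swapping the roles of the two variables.

```r
# Hypothetical 2 x 5 table of counts, invented for illustration
tab <- matrix(
  c(20, 35, 25, 60, 80,
    30, 40, 20, 70, 90),
  nrow = 2, byrow = TRUE
)

forward <- chisq.test(tab)     # rows as one variable, columns as the other
swapped <- chisq.test(t(tab))  # roles of the two variables exchanged

# Statistic, degrees of freedom, and p-value agree in both directions
print(c(forward$statistic, swapped$statistic))
print(c(forward$parameter, swapped$parameter))
print(c(forward$p.value, swapped$p.value))
```

Transposing the table only relabels rows and columns, so every cell keeps the same observed and expected count and the test result is unchanged.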


What about direction and tails?

args(chisq_test)
function (x, formula, response = NULL, explanatory = NULL, ...)
  • The squared differences between observed and expected counts are always non-negative.
  • Chi-square tests are almost always right-tailed. $^{1}$
1 Left-tailed chi-square tests are used in statistical forensics to detect if a fit is suspiciously good because the data was fabricated. Chi-square tests of variance can be two-tailed. These are niche uses, though.
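To make the right-tailed convention concrete, the sketch below recomputes the p-value from the rounded statistic and degrees of freedom reported by chisq_test() earlier, using base R's pchisq(). Because the statistic is rounded to 5.55, the result is only approximate.

```r
chi_sq <- 5.55  # statistic reported by chisq_test() (rounded)
df <- 4         # degrees of freedom reported by chisq_test()

# Right-tailed p-value: probability of a statistic at least this large
p_right <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(round(p_right, 3))  # about 0.235, matching the reported p_value

# Left-tail probability, shown only for comparison
p_left <- pchisq(chi_sq, df = df, lower.tail = TRUE)
print(round(p_left, 3))
```

The two tail probabilities sum to one, and the right tail is the one that matches the p_value column in the chisq_test() output.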

Let's practice!
