Chi-square goodness of fit tests

Hypothesis Testing in Python

James Chapman

Curriculum Manager, DataCamp

Purple links

How do you feel when you discover that you've already visited the top resource?

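# Count how often each answer occurs in the stack_overflow DataFrame
purple_link_counts = stack_overflow['purple_link'].value_counts()

# Turn the counts into a DataFrame sorted alphabetically by answer
purple_link_counts = purple_link_counts.rename_axis('purple_link')\
                                       .reset_index(name='n')\
                                       .sort_values('purple_link')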

         purple_link     n
2             Amused   368
3            Annoyed   263
0  Hello, old friend  1225
1        Indifferent   405

Declaring the hypotheses

import pandas as pd

# Hypothesized share of each answer, in the same alphabetical order
hypothesized = pd.DataFrame({
  'purple_link': ['Amused', 'Annoyed', 'Hello, old friend', 'Indifferent'],
  'prop': [1/6, 1/6, 1/2, 1/6]})

         purple_link      prop
0             Amused  0.166667
1            Annoyed  0.166667
2  Hello, old friend  0.500000
3        Indifferent  0.166667

$H_{0}$: The sample matches the hypothesized distribution

$H_{A}$: The sample does not match the hypothesized distribution

$\chi^{2}$ measures how far observed results are from expectations in each group

alpha = 0.01
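
As a sketch of what the statistic computes, writing $O_{i}$ for the observed count and $E_{i}$ for the hypothesized count in category $i$ (notation assumed here, not taken from the slides):

$$\chi^{2} = \sum_{i=1}^{k} \frac{(O_{i} - E_{i})^{2}}{E_{i}}$$

With $k = 4$ answer categories and no parameters estimated from the data, the statistic is compared against a chi-square distribution with $k - 1 = 3$ degrees of freedom.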

Hypothesized counts by category

# Scale each hypothesized proportion by the sample size to get a count
n_total = len(stack_overflow)
hypothesized["n"] = hypothesized["prop"] * n_total

         purple_link      prop            n
0             Amused  0.166667   376.833333
1            Annoyed  0.166667   376.833333
2  Hello, old friend  0.500000  1130.500000
3        Indifferent  0.166667   376.833333
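
A quick sanity check, sketched in plain Python with the counts and proportions from the tables above hard-coded (the variable names are illustrative, not from the course code): the hypothesized proportions should sum to 1, so that the hypothesized counts sum to the sample size. Recent SciPy versions expect the observed and expected totals to agree when the test is run.

```python
from fractions import Fraction

# Observed counts copied from the purple_link_counts table
observed = {'Amused': 368, 'Annoyed': 263,
            'Hello, old friend': 1225, 'Indifferent': 405}

# Hypothesized proportions as exact fractions to avoid rounding error
props = {'Amused': Fraction(1, 6), 'Annoyed': Fraction(1, 6),
         'Hello, old friend': Fraction(1, 2), 'Indifferent': Fraction(1, 6)}

n_total = sum(observed.values())  # sample size implied by the table

# Hypothesized count for each category = proportion * sample size
expected = {answer: prop * n_total for answer, prop in props.items()}

proportions_sum_to_one = sum(props.values()) == 1
totals_match = sum(expected.values()) == n_total

print(n_total, proportions_sum_to_one, totals_match)
for answer, count in expected.items():
    print(f'{answer:>17}  {float(count):.6f}')
```

Exact fractions are used only so the equality checks are not affected by floating-point rounding; the pandas version above works with ordinary floats.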

Visualizing counts

import matplotlib.pyplot as plt

# Observed counts in red
plt.bar(purple_link_counts['purple_link'], purple_link_counts['n'],
        color='red', label='Observed')

# Hypothesized counts overlaid in semi-transparent blue
plt.bar(hypothesized['purple_link'], hypothesized['n'], alpha=0.5,
        color='blue', label='Hypothesized')

plt.legend()
plt.show()

Bar plot of number of answers versus purple_link answer, with the observed counts in red and the hypothesized counts in blue.

Chi-square goodness of fit test

print(hypothesized)

         purple_link      prop            n
0             Amused  0.166667   376.833333
1            Annoyed  0.166667   376.833333
2  Hello, old friend  0.500000  1130.500000
3        Indifferent  0.166667   376.833333

from scipy.stats import chisquare

# Both inputs list the categories in the same (alphabetical) order
chisquare(f_obs=purple_link_counts['n'], f_exp=hypothesized['n'])

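Power_divergenceResult(statistic=44.59840778416629, pvalue=1.1261810719413759e-09)

The p-value is far below alpha = 0.01, so we reject $H_{0}$: the sample does not match the hypothesized distribution.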
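
To see where the reported statistic comes from, here is a minimal sketch that recomputes it by hand using only the standard library. The counts and proportions are hard-coded from the tables above, and the variable names are illustrative. The p-value formula is a closed form that holds only for 3 degrees of freedom, used here in place of a SciPy call.

```python
import math

observed = [368, 263, 1225, 405]      # counts from purple_link_counts
props = [1/6, 1/6, 1/2, 1/6]          # hypothesized proportions
alpha = 0.01

n_total = sum(observed)
expected = [p * n_total for p in props]

# Sum of squared differences, each scaled by its hypothesized count
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Upper-tail probability for 3 degrees of freedom (4 categories - 1).
# This closed form is specific to df = 3.
p_value = (math.erfc(math.sqrt(chi2_stat / 2))
           + math.sqrt(2 * chi2_stat / math.pi) * math.exp(-chi2_stat / 2))

reject_null = p_value < alpha
print(chi2_stat, p_value, reject_null)
```

Most of the statistic comes from the 'Annoyed' category, where the observed count (263) falls well short of the hypothesized 376.83.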

Let's practice!
