The Problem of Overdispersion

Generalized Linear Models in Python

Ita Cirovic Donev

Data Science Consultant

Understanding the data

Distribution plot of the number of satellites (crab)

# mean of y
y_mean = crab['sat'].mean()
print(round(y_mean, 3))
2.919
# variance of y
y_variance = crab['sat'].var()
print(round(y_variance, 3))
9.912

Mean not equal to variance

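  • The Poisson distribution assumes $variance = mean$
  • $variance > mean$ $\rightarrow$ overdispersion
  • $variance < mean$ $\rightarrow$ underdispersion

Consequences of ignoring overdispersion in a Poisson model:

  • Standard errors are underestimated (too small)
  • p-values are too small, so effects can appear significant when they are not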
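
To make the consequence concrete, here is a minimal simulation sketch using NumPy only. It does not use the crab data: the mean `mu`, the dispersion `alpha`, the sample size, and the gamma-Poisson mixture used to generate overdispersed counts are all illustrative assumptions. It compares the standard error of the sample mean implied by the Poisson assumption with the variability actually observed across repeated samples.

```python
import numpy as np

rng = np.random.default_rng(42)

mu, alpha = 3.0, 0.8   # hypothetical mean and dispersion (not from the crab data)
n, reps = 200, 2000    # sample size and number of simulated samples

# Gamma-Poisson mixture: counts with E(y) = mu and Var(y) = mu + alpha * mu**2
lam = rng.gamma(shape=1 / alpha, scale=alpha * mu, size=(reps, n))
y = rng.poisson(lam)

sample_means = y.mean(axis=1)

# Standard error of the mean implied by the Poisson assumption Var(y) = E(y)
poisson_se = np.sqrt(sample_means / n).mean()

# Actual variability of the sample mean across the simulated samples
empirical_se = sample_means.std(ddof=1)

print(f"Poisson-based SE: {poisson_se:.3f}")
print(f"Empirical SE:     {empirical_se:.3f}")
```

Under these assumptions the Poisson-based standard error should come out clearly smaller than the empirical one, which is what makes the resulting p-values too optimistic.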

How to check for overdispersion?

Summary of the fitted model with highlights on df residuals and Pearson Chi square statistic.


Compute estimated overdispersion

ratio = crab_fit.pearson_chi2 / crab_fit.df_resid
print(ratio)
3.134
  • Ratio $\approx 1$ $\rightarrow$ approximately Poisson

  • Ratio $< 1$ $\rightarrow$ underdispersion

  • Ratio $> 1$ $\rightarrow$ overdispersion
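
The ratio can also be computed by hand, which shows what the Pearson chi-square statistic measures under a Poisson variance. The sketch below is an illustration on simulated counts, not on the crab data: the helper `dispersion_ratio` is a hypothetical name, and the model is assumed to be intercept-only, so the fitted mean is simply the sample mean.

```python
import numpy as np

def dispersion_ratio(y, mu, n_params):
    """Pearson chi-square (Poisson variance) divided by residual degrees of freedom."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Sum of squared Pearson residuals: (observed - fitted)^2 / fitted
    pearson_chi2 = np.sum((y - mu) ** 2 / mu)
    df_resid = y.size - n_params
    return pearson_chi2 / df_resid

rng = np.random.default_rng(0)
n = 50_000

# Poisson counts, and overdispersed counts with the same mean of 3
y_pois = rng.poisson(3.0, size=n)
y_over = rng.poisson(rng.gamma(shape=1.25, scale=2.4, size=n))

# Intercept-only model: the fitted mean is the sample mean, one estimated parameter
ratio_pois = dispersion_ratio(y_pois, np.full(n, y_pois.mean()), n_params=1)
ratio_over = dispersion_ratio(y_over, np.full(n, y_over.mean()), n_params=1)

print(f"Poisson counts:       {ratio_pois:.3f}")
print(f"Overdispersed counts: {ratio_over:.3f}")
```

For an intercept-only model this ratio is just the sample variance divided by the sample mean, which links the check here to the mean and variance comparison made earlier.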


Negative Binomial Regression

  • $E(y)=\lambda$
  • $Var(y) = \lambda+\alpha\lambda^2$
  • $\alpha$ - dispersion parameter; as $\alpha \rightarrow 0$ the variance reduces to the Poisson case $Var(y) = \lambda$
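
As a quick check of the variance formula, the sketch below evaluates $\lambda+\alpha\lambda^2$ for illustrative values and compares it with the sample variance of simulated counts. The helper `nb_variance`, the values of `lam` and `alpha`, and the gamma-Poisson construction of the counts are assumptions made for this example.

```python
import numpy as np

def nb_variance(lam, alpha):
    """Negative Binomial variance for mean lam and dispersion alpha."""
    return lam + alpha * lam ** 2

lam, alpha = 3.0, 0.8                  # illustrative values
theoretical = nb_variance(lam, alpha)  # 3 + 0.8 * 9

# Simulate counts whose rate is gamma-distributed with mean lam
rng = np.random.default_rng(1)
rates = rng.gamma(shape=1 / alpha, scale=alpha * lam, size=200_000)
y = rng.poisson(rates)

print(f"Theoretical variance: {theoretical:.3f}")
print(f"Sample mean:          {y.mean():.3f}")
print(f"Sample variance:      {y.var(ddof=1):.3f}")
```

Setting `alpha` to 0 in `nb_variance` returns `lam` itself, which is the Poisson case.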

GLM Negative Binomial in Python

import statsmodels.api as sm
from statsmodels.formula.api import glm
# alpha is fixed by the user here; the GLM does not estimate it from the data
model = glm('y ~ x', data=my_data,
            family=sm.families.NegativeBinomial(alpha=1)).fit()

Let's practice!
