Intermediate Regression with statsmodels in Python
Maarten Van den Broeck
Content Developer at DataCamp
Simpson's Paradox occurs when the trend of a model on the whole dataset is very different from the trends shown by models on subsets of the dataset.
trend = slope coefficient
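The definition above can be sketched with a tiny made-up dataset (not the course's `simpsons_paradox` data). In this assumed setup the three group centers rise together, while inside each group `y` falls as `x` rises. The helper `ols_slope` is a hypothetical name for a plain least-squares slope, written with the standard library only.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up data: each group is centered on (c, c), and within a group
# y decreases by one unit for every one-unit increase in x.
centers = {"A": 10, "B": 20, "C": 30}
data = {
    group: [(c + dx, c - dx) for dx in range(-3, 4)]
    for group, c in centers.items()
}

# Slope of a model fitted to each subset separately
group_slopes = {
    group: ols_slope([x for x, _ in pts], [y for _, y in pts])
    for group, pts in data.items()
}

# Slope of a model fitted to the whole dataset, ignoring group
all_points = [pt for pts in data.values() for pt in pts]
whole_slope = ols_slope([x for x, _ in all_points], [y for _, y in all_points])

print("whole dataset:", round(whole_slope, 3))
for group, slope in group_slopes.items():
    print(f"group {group}:", round(slope, 3))
```

The whole-dataset slope comes out positive while every within-group slope is negative, which is the sign reversal the definition describes.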
| x | y | group | 
|---|---|---|
| 62.24344 | 70.60840 | D | 
| 52.33499 | 14.70577 | B | 
| 56.36795 | 46.39554 | C | 
| 66.80395 | 66.17487 | D | 
| 66.53605 | 89.24658 | E | 
| 62.38129 | 91.45260 | E | 
from statsmodels.formula.api import ols

# Fit one model to the whole dataset, ignoring group
mdl_whole = ols("y ~ x",
                data=simpsons_paradox).fit()
print(mdl_whole.params)
Intercept           -38.554  
x                     1.751  
# Fit a separate intercept and slope for each group (no global intercept)
mdl_by_group = ols("y ~ group + group:x + 0",
                   data=simpsons_paradox).fit()
print(mdl_by_group.params)
  groupA    groupB    groupC    groupD    groupE  
 32.5051   67.3886   99.6333  132.3932  123.8242  
groupA:x  groupB:x  groupC:x  groupD:x  groupE:x  
 -0.6266   -1.0105   -0.9940   -0.9908   -0.5364
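One way to read the two coefficient tables side by side is to plug a row from the sample table into both models. The sketch below copies the printed coefficients by hand. The function names `predict_whole` and `predict_by_group` are hypothetical helpers for this illustration, not part of statsmodels.

```python
# Coefficients copied from the two printed parameter tables
WHOLE_INTERCEPT, WHOLE_SLOPE = -38.554, 1.751
GROUP_INTERCEPT = {"A": 32.5051, "B": 67.3886, "C": 99.6333,
                   "D": 132.3932, "E": 123.8242}
GROUP_SLOPE = {"A": -0.6266, "B": -1.0105, "C": -0.9940,
               "D": -0.9908, "E": -0.5364}

def predict_whole(x):
    """Prediction from the single line fitted to the whole dataset."""
    return WHOLE_INTERCEPT + WHOLE_SLOPE * x

def predict_by_group(x, group):
    """Prediction from the line fitted to one group."""
    return GROUP_INTERCEPT[group] + GROUP_SLOPE[group] * x

# First row of the sample table: x = 62.24344, y = 70.60840, group D
x, group = 62.24344, "D"
whole_pred = predict_whole(x)
group_pred = predict_by_group(x, group)
print(f"whole-dataset model: {whole_pred:.2f}")  # 70.43
print(f"group {group} model:       {group_pred:.2f}")  # 70.72

# Predicted change in y when x increases by one unit
whole_step = predict_whole(x + 1) - whole_pred
group_step = predict_by_group(x + 1, group) - group_pred
print(f"whole-dataset step: {whole_step:+.3f}")
print(f"group {group} step:       {group_step:+.3f}")
```

For this row both models predict a value close to the observed `y`, yet they disagree about direction: the whole-dataset model expects `y` to rise with `x`, while the group D line expects it to fall.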
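import matplotlib.pyplot as plt
import seaborn as sns

# One trend line fitted to the whole dataset
sns.regplot(x="x",
            y="y",
            data=simpsons_paradox,
            ci=None)
plt.show()

# One trend line per group
sns.lmplot(x="x",
           y="y",
           data=simpsons_paradox,
           hue="group",
           ci=None)
plt.show()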

If possible, try to plot the dataset.
No model is best in general: the right choice depends on the dataset and on the question you are trying to answer.
Articulate a question before you start modeling.



