Parametric tests

Foundations of Inference in Python

Paul Savala

Assistant Professor of Mathematics

ANOVA

ANOVA - Compares the mean response across the levels of a factor
Response - A numerical measured value
Factor - A categorical variable that defines the groups

A table showing venture capital funding for several companies in several different markets.

ANOVA

investments_df.groupby('market')['funding_total_usd'].mean()

Market        Average funding (USD)
===========   =====================
Advertising      13806610
Analytics        14762930
Biotechnology    20838670
...              ...

Response: Funding
Factor: Market
ANOVA: Compare mean funding by market
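
To make the response/factor vocabulary concrete, here is a minimal sketch on a tiny made-up table. The name `toy_df` and all values in it are hypothetical and are not taken from the investments data; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy data: funding is the response, market is the factor
toy_df = pd.DataFrame({
    'market': ['Advertising', 'Advertising', 'Analytics', 'Analytics',
               'Biotechnology', 'Biotechnology'],
    'funding_total_usd': [1_000_000, 3_000_000, 2_000_000, 6_000_000,
                          5_000_000, 9_000_000],
})

# Mean response within each level of the factor
group_means = toy_df.groupby('market')['funding_total_usd'].mean()
print(group_means)
```

ANOVA asks whether differences between group means like these are larger than what within-group variation alone would explain.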

Assumptions of ANOVA

Responses within each factor level are normally distributed
- Funding amounts within each market are normally distributed
Responses within each factor level have equal population variance
- Funding amounts have the same variance in every market

Normally distributed response

health_df = investments_df[investments_df['market'] == 'Health and Wellness']
health_df['funding_total_usd'].plot(kind='hist')

A histogram with total funding by company on the x-axis, frequency on the y-axis, one very tall bar near zero, and several much smaller bars beyond that.

Log-transformations and normality

import numpy as np

# Log-transform funding to reduce the right skew
health_log = np.log(health_df['funding_total_usd'])

health_log.plot(kind='hist')

A histogram with the log of total funding by company on the x-axis and frequency on the y-axis.
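
The effect of the log transformation can also be checked numerically. The sketch below uses simulated data, not the investments data: the lognormal draws are an assumed stand-in for right-skewed funding amounts, and skewness near zero is used here as a rough indicator of symmetry.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical stand-in for funding: strictly positive, strongly right-skewed
funding = rng.lognormal(mean=14, sigma=2, size=5000)
log_funding = np.log(funding)

# Sample skewness before and after the transformation
raw_skew = stats.skew(funding)
log_skew = stats.skew(log_funding)
print(raw_skew, log_skew)
```

Note that np.log is only finite for positive values. A funding amount of exactly zero would map to negative infinity, so such rows would need handling before transforming.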

Equal variance

investments_df['log_funding'] = np.log(investments_df['funding_total_usd'])

investments_df.groupby('market')['log_funding'].std()

Advertising            2.254390
Analytics              2.152852
Biotechnology          1.946059
...                    ...

Levene test of equal variance

$H_0:$ Populations have equal variance

$H_a:$ Populations have different variances

Equal variance

from scipy import stats

health_df = investments_df[investments_df['market'] == 'Health and Wellness']
analytics_df = investments_df[investments_df['market'] == 'Analytics']


s, p_value = stats.levene(health_df['log_funding'],
                          analytics_df['log_funding'])

print(p_value < 0.05)

False

Conclusion: Fail to reject the null hypothesis. There is no evidence that the two markets differ in the variance of log funding.
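
stats.levene accepts any number of samples, so the same check can cover every market at once. This is a sketch on simulated data: `sim_df` and its values are hypothetical, built so that the third market has three times the spread of the other two.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
base = rng.normal(loc=0, scale=2, size=200)

# Hypothetical data: equal spread in two markets, tripled spread in the third
sim_df = pd.DataFrame({
    'market': np.repeat(['Advertising', 'Analytics', 'Biotechnology'], 200),
    'log_funding': np.concatenate([14 + base, 15 + base, 16 + 3 * base]),
})

# One array of log funding per market
groups = [g['log_funding'].to_numpy() for _, g in sim_df.groupby('market')]

# Levene test across all groups in a single call
stat, p_value = stats.levene(*groups)
print(p_value < 0.05)
```

Here a small p-value would mean rejecting equal variance for at least one market, which calls the equal-variance assumption of ANOVA into question.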

ANOVA in SciPy

s, p_value = stats.f_oneway(health_df['log_funding'], 
                            analytics_df['log_funding'])

print(p_value < 0.05)

True

Conclusion: The two markets have a statistically significant difference in mean log funding.
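
To see what stats.f_oneway reports, the F statistic can be rebuilt by hand as the between-group mean square divided by the within-group mean square. The sketch below uses simulated samples, not the investments data; the group sizes and means are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical log-funding samples for three markets; the third has a higher mean
groups = [rng.normal(14, 1, 100), rng.normal(14, 1, 100), rng.normal(16, 1, 100)]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variation of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_value = stats.f_oneway(*groups)
print(np.isclose(f_manual, f_scipy), p_value < 0.05)
```

A large F means the group means are spread out relative to the noise within groups, which is what drives a small p-value.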

Inference based on ANOVA

$H_0:$ All means are the same
$H_a:$ At least one mean is different
Can't conclude which mean is different without further analysis.
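
One possible form of further analysis is a post-hoc pairwise comparison such as Tukey's HSD. This is not a method covered in the slides above; the sketch assumes SciPy 1.8 or later, where stats.tukey_hsd is available, and uses simulated samples with hypothetical names.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical log-funding samples; only the third market has a shifted mean
advertising = rng.normal(14, 1, 100)
analytics = rng.normal(14, 1, 100)
biotech = rng.normal(16, 1, 100)

# Pairwise comparisons of group means
result = stats.tukey_hsd(advertising, analytics, biotech)

# result.pvalue[i, j] is the p-value for comparing group i with group j
print(result.pvalue < 0.05)
```

Tukey's HSD relies on the same normality and equal-variance assumptions as ANOVA, so the checks above apply to it as well.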

Let's practice!
