Model Errors and Randomness

Introduction to Linear Modeling in Python

Jason Vestuto

Data Scientist

Types of Errors

Measurement error
- e.g.: broken sensor, wrongly recorded measurements
Sampling bias
- e.g.: temperatures only from August, when days are hottest
Random chance

Null Hypothesis

Question: Is our effect due to a relationship or due to random chance?

Answer: check the Null Hypothesis.

Ordered Data

Scatter plot of hiking trip data, plotted on axes distance traveled in miles versus time duration in hours

Grouping Data

Scatter plot of hiking trip data, plotted on axes distance traveled in miles versus time duration in hours, with times less than 5 hours shown in red, and times greater shown in blue

Grouping Data

Two histograms plotted, each roughly bell shaped, one in red for short duration trips, centered near 5 miles, and one in blue for long duration trips, centered near 15 miles

Short Duration Group, mean = 5
Long Duration Group, mean = 15

Test Statistic

import numpy as np

# Group into short- and long-duration trips (split at 5 hours)
group_short = sample_distances[times < 5]
group_long = sample_distances[times >= 5]

# Resample distributions
resample_short = np.random.choice(group_short, size=500, replace=True)
resample_long = np.random.choice(group_long, size=500, replace=True)

# Test Statistic
test_statistic = resample_long - resample_short

# Effect size as mean of test statistic distribution
effect_size = np.mean(test_statistic)
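
The steps above can be tried end to end with a minimal sketch. The data here is synthetic: the slope of 2 miles per hour, the noise level, the sample size, and the seed are all assumptions chosen so the group means land near 5 and 15 miles, and they are not the course dataset. The sketch also swaps in NumPy's `default_rng` generator for reproducibility instead of the global `np.random` functions.

```python
import numpy as np

# Synthetic stand-in for the hiking data (assumed values, not the course dataset)
rng = np.random.default_rng(42)
times = rng.uniform(0, 10, size=1000)                             # hours
sample_distances = 2.0 * times + rng.normal(0, 1.5, size=1000)    # miles

# Group by trip duration, split at 5 hours
group_short = sample_distances[times < 5]
group_long = sample_distances[times >= 5]

# Resample each group with replacement
resample_short = rng.choice(group_short, size=500, replace=True)
resample_long = rng.choice(group_long, size=500, replace=True)

# Test statistic: element-wise difference between the resampled groups
test_statistic = resample_long - resample_short

# Effect size: mean of the test statistic distribution
effect_size = np.mean(test_statistic)
print(f"effect size: {effect_size:.2f} miles")
```

With these assumed parameters the effect size should come out close to the 10-mile gap between the two group means.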

Shuffling and Regrouping

Scatter plot of hiking trip data, plotted on axes distance traveled in miles versus time duration in hours, with red and blue appearing interspersed throughout all data for all time durations

Shuffling and Regrouping

Two histograms overlapping almost entirely, spreading from 0 to 25 miles, of roughly the same shape

Shuffle and Split

# Concatenate and Shuffle
shuffle_bucket = np.concatenate((group_short, group_long))
np.random.shuffle(shuffle_bucket)

# Split in the middle
slice_index = len(shuffle_bucket)//2
shuffled_half1 = shuffle_bucket[0:slice_index]
shuffled_half2 = shuffle_bucket[slice_index:]
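
A quick way to sanity-check a shuffle-and-split step is to confirm that no data point is lost and that the two halves end up with similar means. The sketch below assumes two hypothetical stand-in groups drawn from normal distributions (the means, spreads, and sizes are illustrative only), and slices both halves at the same index so every element lands in exactly one half.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in groups (assumed values, not the course data)
group_short = rng.normal(5, 2, size=480)
group_long = rng.normal(15, 2, size=520)

# Concatenate, then shuffle in place
shuffle_bucket = np.concatenate((group_short, group_long))
rng.shuffle(shuffle_bucket)

# Split in the middle; both slices share the same boundary index
slice_index = len(shuffle_bucket) // 2
shuffled_half1 = shuffle_bucket[:slice_index]
shuffled_half2 = shuffle_bucket[slice_index:]

# Every element is kept, and the halves now have similar means
all_kept = len(shuffled_half1) + len(shuffled_half2) == len(shuffle_bucket)
mean_half1 = np.mean(shuffled_half1)
mean_half2 = np.mean(shuffled_half2)
print(all_kept, mean_half1, mean_half2)
```

Because shuffling breaks the link between distance and duration, the two means should differ only by chance.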

Resample and Test Again

# Resample shuffled populations
shuffled_sample1 = np.random.choice(shuffled_half1, size=500, replace=True)
shuffled_sample2 = np.random.choice(shuffled_half2, size=500, replace=True)

# Recompute effect size
shuffled_test_statistic = shuffled_sample2 - shuffled_sample1
shuffled_effect_size = np.mean(shuffled_test_statistic)

p-Value

Plotted on axes bin counts versus test statistic value bins, two histograms, one in blue centered on x=0 with wide spread, one in red centered on x=10 with narrower spread, and a vertical black line at x=10
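
One common way to turn this picture into a number is to repeat the shuffle, split, and resample steps many times and count how often a shuffled effect size is at least as large as the observed one. The sketch below is an illustration under stated assumptions, not the course's implementation: the data is synthetic, `resampled_effect` is a hypothetical helper, and the 1000 repetitions are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the hiking data (assumed values, not the course dataset)
times = rng.uniform(0, 10, size=1000)
distances = 2.0 * times + rng.normal(0, 1.5, size=1000)
group_short = distances[times < 5]
group_long = distances[times >= 5]

def resampled_effect(a, b, size=500):
    """Mean difference between resamples of b and a (hypothetical helper)."""
    return np.mean(rng.choice(b, size=size) - rng.choice(a, size=size))

# Observed effect size from the duration-ordered groups
observed_effect = resampled_effect(group_short, group_long)

# Null distribution: shuffle, split, resample, and recompute many times
num_shuffles = 1000
bucket = np.concatenate((group_short, group_long))
null_effects = np.empty(num_shuffles)
for i in range(num_shuffles):
    rng.shuffle(bucket)
    half = len(bucket) // 2
    null_effects[i] = resampled_effect(bucket[:half], bucket[half:])

# p-value: fraction of shuffled effects at least as large as the observed one
p_value = np.mean(null_effects >= observed_effect)
print(f"observed effect: {observed_effect:.2f}, p-value: {p_value:.3f}")
```

A small p-value means the shuffled data rarely reproduces the observed effect, which is evidence against the Null Hypothesis that the effect is due to random chance.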

Let's practice!
