Model Errors and Randomness

Introduction to Linear Modeling in Python

Jason Vestuto

Data Scientist

Types of Errors

  1. Measurement error
    • e.g.: broken sensor, wrongly recorded measurements
  2. Sampling bias
    • e.g.: temperatures only from August, when days are hottest
  3. Random chance

Null Hypothesis

Question: Is our effect due to a relationship or due to random chance?

Answer: check the Null Hypothesis.

Ordered Data

Scatter plot of hiking trip data, plotted on axes distance traveled in miles versus time duration in hours

Grouping Data

Scatter plot of hiking trip data, plotted on axes distance traveled in miles versus time duration in hours, with times less than 5 hours shown in red, and times greater shown in blue

Grouping Data

Two histograms plotted, each roughly bell shaped, one in red for short duration trips, centered near 5 miles, and one in blue for long duration trips, centered near 15 miles

  • Short Duration Group, mean = 5
  • Long Duration Group, mean = 15

Test Statistic

import numpy as np

# Group distances into short-duration and long-duration trips
# (trips of exactly 5 hours fall in neither group)
group_short = sample_distances[times < 5]
group_long = sample_distances[times > 5]
# Resample each group with replacement
resample_short = np.random.choice(group_short, size=500, replace=True)
resample_long = np.random.choice(group_long, size=500, replace=True)
# Test statistic: element-wise difference between the resampled distances
test_statistic = resample_long - resample_short
# Effect size as mean of test statistic distribution
effect_size = np.mean(test_statistic)
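
The slide code assumes `times` and `sample_distances` already exist. As a minimal runnable sketch, the version below builds hypothetical hiking data (an assumption: 1000 trips of 0 to 10 hours at roughly 2 miles per hour plus noise, chosen so the group means land near 5 and 15 miles). It uses NumPy's `default_rng` generator in place of the slide's `np.random.choice` so the run is reproducible.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data (not from the slides): durations in hours,
# distances in miles at about 2 mph with Gaussian noise
times = rng.uniform(0, 10, size=1000)
sample_distances = 2.0 * times + rng.normal(0, 2.0, size=1000)

# Group distances by trip duration, as on the slide
group_short = sample_distances[times < 5]
group_long = sample_distances[times > 5]

# Resample each group with replacement
resample_short = rng.choice(group_short, size=500, replace=True)
resample_long = rng.choice(group_long, size=500, replace=True)

# Test statistic and effect size
test_statistic = resample_long - resample_short
effect_size = np.mean(test_statistic)
print(f"effect size: {effect_size:.2f} miles")
```

With these assumed parameters the effect size should come out close to the 10-mile gap between the two group means.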

Shuffling and Regrouping

Scatter plot of hiking trip data, plotted on axes distance traveled in miles versus time duration in hours, with red and blue appearing interspersed throughout all data for all time durations

Shuffling and Regrouping

Two histograms overlapping almost entirely, spreading from 0 to 25 miles, of roughly the same shape

Shuffle and Split

# Concatenate and shuffle (np.random.shuffle reorders the array in place)
shuffle_bucket = np.concatenate((group_short, group_long))
np.random.shuffle(shuffle_bucket)
# Split in the middle; the second half starts at slice_index so no element is dropped
slice_index = len(shuffle_bucket)//2
shuffled_half1 = shuffle_bucket[0:slice_index]
shuffled_half2 = shuffle_bucket[slice_index:]

Resample and Test Again

# Resample shuffled populations
shuffled_sample1 = np.random.choice(shuffled_half1, size=500, replace=True)
shuffled_sample2 = np.random.choice(shuffled_half2, size=500, replace=True)
# Recompute effect size
shuffled_test_statistic = shuffled_sample2 - shuffled_sample1
effect_size = np.mean(shuffled_test_statistic)
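
The shuffle, split, and resample steps can be wrapped into one helper, sketched below. The function name `shuffled_effect_size` and the two synthetic groups (normal distributions centered near 5 and 15 miles) are assumptions made for illustration, not part of the slides.

```python
import numpy as np

def shuffled_effect_size(group_a, group_b, rng, size=500):
    """Effect size after pooling, shuffling, and re-splitting two groups."""
    # np.concatenate returns a new array, so the inputs are left untouched
    pooled = np.concatenate((group_a, group_b))
    rng.shuffle(pooled)
    # Split in the middle without dropping any element
    half = len(pooled) // 2
    half1, half2 = pooled[:half], pooled[half:]
    # Resample each shuffled half with replacement
    sample1 = rng.choice(half1, size=size, replace=True)
    sample2 = rng.choice(half2, size=size, replace=True)
    return np.mean(sample2 - sample1)

rng = np.random.default_rng(0)
# Hypothetical groups standing in for the short and long duration trips
group_short = rng.normal(5.0, 2.0, size=400)
group_long = rng.normal(15.0, 2.0, size=400)

null_effect = shuffled_effect_size(group_short, group_long, rng)
print(f"shuffled effect size: {null_effect:.2f} miles")
```

Because shuffling breaks the link between duration and distance, the shuffled effect size should sit near zero, far from the roughly 10-mile effect of the unshuffled groups.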

p-Value

Plotted on axes bin counts versus test statistic value bins, two histograms, one in blue centered on x=0 with wide spread, one in red centered on x=10 with narrower spread, and a vertical black line at x=10
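
One way to turn this comparison into a number is sketched below. It is a standard permutation test on group means: the shuffle is repeated many times, and the p-value is the fraction of shuffled effects at least as large as the observed one. This may differ in detail from how the histograms in the figure were produced, and the synthetic groups and the name `num_shuffles` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical groups standing in for the short and long duration trips
group_short = rng.normal(5.0, 2.0, size=400)
group_long = rng.normal(15.0, 2.0, size=400)

# Observed effect: difference of the unshuffled group means
observed_effect = np.mean(group_long) - np.mean(group_short)

# Null distribution: effect sizes after repeatedly shuffling the pooled data
pooled = np.concatenate((group_short, group_long))
half = len(pooled) // 2
num_shuffles = 1000
shuffled_effects = np.empty(num_shuffles)
for i in range(num_shuffles):
    shuffled = rng.permutation(pooled)  # returns a shuffled copy
    shuffled_effects[i] = np.mean(shuffled[half:]) - np.mean(shuffled[:half])

# p-value: fraction of shuffled effects at least as large as the observed one
p_value = np.mean(shuffled_effects >= observed_effect)
print(f"observed effect: {observed_effect:.2f}, p-value: {p_value:.3f}")
```

A small p-value means random regrouping rarely reproduces an effect this large, which is evidence against the null hypothesis that the effect is due to chance alone.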

Let's practice!
