The modeling problem for explanation

Modeling with Data in the Tidyverse

Albert Y. Kim

Assistant Professor of Statistical and Data Sciences

Recall: General modeling framework formula

$$ y = f(\vec{x}) + \epsilon $$

Where:

  • $y$: outcome variable of interest
  • $\vec{x}$: explanatory/predictor variables
  • $f()$: function describing the relationship between $y$ and $\vec{x}$, a.k.a. the signal
  • $\epsilon$: unsystematic error component, a.k.a. the noise
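
To make the framework concrete, here is a minimal sketch that simulates data from $y = f(\vec{x}) + \epsilon$ in base R. The particular signal (a straight line), the noise level, the seed, and the name `sim_data` are all assumptions chosen for illustration; they do not come from the course data.

```r
# Hypothetical simulation of y = f(x) + epsilon (all values are illustrative)
set.seed(76)                            # make the random draws reproducible

n <- 100                                # number of observations
x <- runif(n, min = 0, max = 10)        # one explanatory variable

f <- function(x) 2 + 0.5 * x            # the signal, chosen arbitrarily
epsilon <- rnorm(n, mean = 0, sd = 1)   # the noise

y <- f(x) + epsilon                     # outcome = signal + noise

sim_data <- data.frame(x = x, y = y)
head(sim_data)
```

In a simulation both `f` and `epsilon` are known because we picked them. With real data only `x` and `y` are observed.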

The modeling problem

Consider $y = f(\vec{x}) + \epsilon$.

  1. $f()$ and $\epsilon$ are unknown
  2. $n$ observations of $y$ and $\vec{x}$ are known/given in the data
  3. Goal: Fit a model $\hat{f}()$ that approximates $f()$ while ignoring $\epsilon$
  4. Goal restated: Separate the signal from the noise
  5. Can then generate fitted/predicted values $\hat{y} = \hat{f}(\vec{x})$
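
The steps above can be sketched on simulated data, where the true $f()$ is known only because we chose it. This sketch assumes a linear form for $\hat{f}()$ and uses base R's `lm()`; the simulated signal, the seed, and the variable names are illustrative assumptions, not part of the course data.

```r
# Hypothetical data: the true signal is 2 + 0.5 * x, plus random noise
set.seed(76)
n <- 100
x <- runif(n, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(n, mean = 0, sd = 1)

# Step 3: fit a model f-hat (assumed here to be a straight line)
fit <- lm(y ~ x)
coef(fit)                 # estimated intercept and slope

# Step 5: fitted values y-hat = f-hat(x)
y_hat <- fitted(fit)

# What the model leaves over: observed minus fitted
leftover <- y - y_hat
head(data.frame(x, y, y_hat, leftover))
```

The estimated coefficients will typically land near, but not exactly on, the values used to simulate the data, because the fit sees only the noisy observations.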

Modeling for explanation example

EDA of relationship

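library(ggplot2)
library(dplyr)
library(moderndive)  # provides the evals data frame

# Scatterplot with age on the x-axis and teaching score on the y-axis
ggplot(evals, aes(x = age, y = score)) +
  geom_point() +
  labs(x = "age", y = "score",
       title = "Teaching score over age")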

Jittered scatterplot

library(ggplot2)
library(dplyr)
library(moderndive)

# Use geom_jitter() instead of geom_point(): each point is nudged by a small
# random amount so that overlapping points become visible
ggplot(evals, aes(x = age, y = score)) +
  geom_jitter() + 
  labs(x = "age", y = "score",
       title = "Teaching score over age (jittered)")
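
To see why jittering helps, here is a minimal sketch of the idea using base R's `jitter()` on a hypothetical toy data frame. The data, the seed, and the jitter amount of 0.2 are assumptions for illustration only; the plot above relies on `geom_jitter()`'s defaults instead.

```r
# Hypothetical toy data: 60 rows, but only 3 distinct (x, y) positions,
# so a plain scatterplot would show just 3 points (overplotting)
toy <- data.frame(
  x = rep(c(1, 2, 3), times = 20),
  y = rep(c(1, 2, 3), times = 20)
)
nrow(unique(toy))   # number of distinct positions before jittering

set.seed(76)        # jittering is random; a seed makes it reproducible

# Nudge each coordinate by a random amount between -0.2 and 0.2
x_jit <- jitter(toy$x, amount = 0.2)
y_jit <- jitter(toy$y, amount = 0.2)

# After jittering, the points no longer sit exactly on top of each other
length(unique(x_jit))

plot(x_jit, y_jit, xlab = "x (jittered)", ylab = "y (jittered)")
```

Jittering changes only where points are drawn. The underlying data frame is left untouched, so any summaries or models should still be computed from the original values.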

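Correlation coefficient

A number between -1 and 1 that summarizes the strength and direction of the linear relationship between two numerical variables.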

Computing the correlation coefficient

evals %>% 
  summarize(correlation = cor(score, age))
# A tibble: 1 x 1
  correlation
        <dbl>
1      -0.107
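
For readers who want to see what `cor()` computes by default, the sketch below builds the Pearson correlation coefficient from its definition and compares it with the built-in function. It uses hypothetical simulated vectors rather than the `evals` data, and the names `r_manual` and `r_builtin` are illustrative.

```r
# Hypothetical data with a weak negative linear relationship
set.seed(76)
x <- rnorm(50)
y <- -0.3 * x + rnorm(50)

# Pearson correlation from its definition:
# sum of products of deviations, divided by the square root of the
# product of the two sums of squared deviations
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in version (Pearson is the default method)
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)   # the two agree up to floating-point error
```

Because the formula is symmetric in the two variables, `cor(score, age)` and `cor(age, score)` give the same value.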

Let's practice!
