Background on modeling for prediction

Modeling with Data in the Tidyverse

Albert Y. Kim

Assistant Professor of Statistical and Data Sciences

Modeling for prediction example

A dataset of house prices in King County, Washington State, near Seattle (available at Kaggle.com).

Question: Can we predict the sale price of houses based on their features?

Variables:

  • $y$: House sale price in US dollars
  • $\vec{x}$: Features like sqft_living, condition, bedrooms, yr_built, waterfront
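
As a minimal sketch, the outcome and the example features listed above can be pulled out of the house_prices data frame from the moderndive package. The name outcome_and_features is illustrative only and does not appear in the course.

```r
library(dplyr)
library(moderndive)

# Keep the outcome variable (price) and the example features named above.
# `outcome_and_features` is a hypothetical name used only for this sketch.
outcome_and_features <- house_prices %>%
  select(price, sqft_living, condition, bedrooms, yr_built, waterfront)

glimpse(outcome_and_features)
```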

Modeling for prediction example

From the moderndive package for ModernDive:

library(dplyr)
library(moderndive)
glimpse(house_prices)
Observations: 21,613
Variables: 21
$ id            <chr> "7129300520", "6414100192"...
$ date          <dttm> 2014-10-13, 2014-12-09, 2015...
$ price         <dbl> 221900, 538000, 180000, 604000...
...

Exploratory data analysis

library(ggplot2)
ggplot(house_prices, aes(x = price)) +
  geom_histogram() + 
  labs(x = "house price", y = "count")
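
A numerical check can complement the histogram. The sketch below compares the mean and median of price; a mean well above the median would be consistent with a right-skewed distribution. The name price_summary is illustrative and not part of the course code.

```r
library(dplyr)
library(moderndive)

# Compare the mean and median sale price.
# `price_summary` is a hypothetical name used only for this sketch.
price_summary <- house_prices %>%
  summarize(
    mean_price = mean(price),
    median_price = median(price)
  )

price_summary
```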

Histogram of outcome variable

Gapminder data

Log10 rescaling of x-axis
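
The figure for this slide is not recoverable here, so the following is only a sketch of the same idea applied to house_prices: ggplot2's scale_x_log10() rescales the x-axis to a log10 scale while leaving the data in its original units. The name log_axis_plot and the choice of 30 bins are assumptions made for illustration.

```r
library(ggplot2)
library(moderndive)

# Plot price on a log10-scaled x-axis; the data itself is not transformed.
# `log_axis_plot` and bins = 30 are illustrative choices.
log_axis_plot <- ggplot(house_prices, aes(x = price)) +
  geom_histogram(bins = 30) +
  scale_x_log10() +
  labs(x = "house price (log10 scale)", y = "count")

log_axis_plot
```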

Log10 transformation

# log10() transform price
house_prices <- house_prices %>%
  mutate(log10_price = log10(price))

# View the effect of the transformation
house_prices %>%
  select(price, log10_price)
# A tibble: 21,613 x 2
     price log10_price
     <dbl>       <dbl>
 1  221900        5.35
 2  538000        5.73
 3  180000        5.26
 4  604000        5.78
 5  510000        5.71
 6 1225000        6.09
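
To make the transformation concrete, this short sketch uses the first three prices from the output above and shows that raising 10 to the power of the log10 value recovers the original price. The variable names are illustrative.

```r
# First three sale prices from the output above
prices <- c(221900, 538000, 180000)

# log10() transform, rounded to two decimals as in the tibble printout
log10_prices <- log10(prices)
round(log10_prices, 2)
# 5.35 5.73 5.26

# Undo the transformation: 10^log10(price) gives back the price
recovered <- 10^log10_prices
all.equal(recovered, prices)
# TRUE
```

A log10 value of 5 corresponds to $100,000 and a value of 6 to $1,000,000, so each one-unit increase in log10_price means a tenfold increase in price.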

Histogram of new outcome variable

# Histogram of original outcome variable
ggplot(house_prices, aes(x = price)) +
  geom_histogram() + 
  labs(x = "house price", y = "count")
# Histogram of new, log10-transformed outcome variable
ggplot(house_prices, aes(x = log10_price)) +
  geom_histogram() + 
  labs(x = "log10 house price", y = "count")

Comparing before and after log10-transformation
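
One possible way to draw the two histograms side by side is to stack both versions of the outcome into a long data frame and facet on the scale. This is a sketch and not necessarily how the slide's figure was produced; prices_long, scale, value, comparison_plot, and the choice of 30 bins are all illustrative.

```r
library(dplyr)
library(ggplot2)
library(moderndive)

# Stack the original and log10-transformed prices into one long data frame.
# All names here are hypothetical and used only for this sketch.
prices_long <- bind_rows(
  tibble(scale = "price (US dollars)", value = house_prices$price),
  tibble(scale = "log10(price)", value = log10(house_prices$price))
)

# One panel per scale; free axes because the two scales differ greatly
comparison_plot <- ggplot(prices_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ scale, scales = "free") +
  labs(x = NULL, y = "count")

comparison_plot
```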

Let's practice!
