Modeling with tidymodels in R
David Svancer
Data Scientist
Supervised machine learning: the branch of machine learning that uses labeled data for model fitting
Regression
Classification
left_company | miles_from_home | salary |
---|---|---|
no | 1 | 84500 |
yes | 10 | 64820 |
no | 5 | 76490 |
yes | 19 | 68540 |
tidymodels
variable roles
Create training and test sets
Training data
Test data
Vehicle fuel efficiency data from the U.S. Environmental Protection Agency
hwy - highway fuel efficiency in miles per gallon (mpg)
mpg
# A tibble: 234 x 11
hwy cty displ cyl manufacturer model year trans drv fl class
<int> <int> <dbl> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 29 18 1.8 4 audi a4 1999 auto(l5) f p compact
2 29 21 1.8 4 audi a4 1999 manual(m5) f p compact
3 31 20 2 4 audi a4 2008 manual(m6) f p compact
4 30 21 2 4 audi a4 2008 auto(av) f p compact
5 26 16 2.8 6 audi a4 1999 auto(l5) f p compact
# ... with 224 more rows
initial_split()
- prop specifies the proportion to place into training
- strata provides stratification by the outcome variable
Pass split object to training() function
Pass split object to testing() function
library(tidymodels)
mpg_split <- initial_split(mpg,
                           prop = 0.75,
                           strata = hwy)
mpg_training <- mpg_split %>%
  training()
mpg_test <- mpg_split %>%
  testing()
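One way to sanity-check the split is to compare row counts and the distribution of the outcome in each set. The sketch below assumes the `mpg` data that ships with ggplot2 (attached by tidymodels), whose column order differs from the printout above. The `set.seed()` call is an addition for reproducibility and is not part of the slides.

```r
library(tidymodels)

# Assumption: seed added so the random split is reproducible
set.seed(42)

mpg_split <- initial_split(mpg, prop = 0.75, strata = hwy)
mpg_training <- training(mpg_split)
mpg_test <- testing(mpg_split)

# Row counts: the two sets partition the original data
nrow(mpg_training)
nrow(mpg_test)

# Share of rows in training; should be close to prop = 0.75
nrow(mpg_training) / nrow(mpg)

# With strata = hwy, the outcome should be distributed similarly in both sets
summary(mpg_training$hwy)
summary(mpg_test$hwy)
```

The training share may differ slightly from 0.75 because the stratified sampling is done within bins of `hwy`.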
Home sales from the Seattle, Washington area between 2015 and 2016
home_sales
# A tibble: 1,492 x 8
selling_price home_age bedrooms bathrooms sqft_living sqft_lot sqft_basement floors
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 487000 10 4 2.5 2540 5001 0 2
2 465000 10 3 2.25 1530 1245 480 2
3 411000 18 2 2 1130 1148 330 2
4 635000 4 3 2.5 3350 4007 800 2
5 380000 24 5 2.5 2130 8428 0 2
# ... with 1,482 more rows