Introduction to model validation

Model Validation in Python

Kasey Jones

Data Scientist

What is model validation?

Model validation consists of:

  • Ensuring your model performs as expected on new data
  • Testing model performance on holdout datasets
  • Selecting the best model, parameters, and accuracy metrics
  • Achieving the best accuracy for the data given

scikit-learn modeling review

Basic modeling steps:

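from sklearn.ensemble import RandomForestRegressor

# Create a random forest of 500 trees; random_state makes the run reproducible
model = RandomForestRegressor(n_estimators=500, random_state=1111)

# Fit the model on the training data
model.fit(X=X_train, y=y_train)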
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
           oob_score=False, random_state=1111, verbose=0, warm_start=False)

Modeling review continued

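from sklearn.metrics import mean_absolute_error as mae

# Predict on the held-out test set
predictions = model.predict(X_test)

# Report the mean absolute error, rounded to two decimals
print("{0:.2f}".format(mae(y_true=y_test, y_pred=predictions)))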
10.84

Mean Absolute Error Formula

$$ \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} $$
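
To make the formula concrete, here is a minimal sketch that computes the mean absolute error by hand. The values in y_true and y_pred and the helper name mean_absolute_error_manual are hypothetical, chosen only for illustration; they are not from the course data.

```python
# Hypothetical actual and predicted values, for illustration only
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]


def mean_absolute_error_manual(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all n observations."""
    n = len(y_true)
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / n


# Absolute errors are 0.5, 0.0, 1.5, and 1.0, so the sum is 3.0 and n is 4
manual_mae = mean_absolute_error_manual(y_true, y_pred)
print("{0:.2f}".format(manual_mae))  # prints 0.75
```

scikit-learn's mean_absolute_error should return the same value for these inputs, since it uses the same definition when no sample weights are given.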

Review prerequisites

FiveThirtyEight publishes several public datasets, including the Halloween candy power ranking dataset. Each candy has a head-to-head win percentage between 0% and 100%.

Seen vs. unseen data

Training data = seen data

model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)

Testing data = unseen data

model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
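
The two snippets above can be combined into one runnable sketch that compares errors on seen and unseen data. It assumes synthetic data from make_regression and a 75/25 split from train_test_split, since the course data is not loaded here; the sample size, noise level, and n_estimators=100 are illustrative choices, not the course's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; stands in for the real dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=1111)

# Hold out 25% of the rows as unseen testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1111)

# Fit only on the training (seen) data
model = RandomForestRegressor(n_estimators=100, random_state=1111)
model.fit(X_train, y_train)

# Error on data the model has already seen
train_error = mae(y_true=y_train, y_pred=model.predict(X_train))
# Error on data the model has never seen
test_error = mae(y_true=y_test, y_pred=model.predict(X_test))

print("Train MAE: {0:.2f}".format(train_error))
print("Test MAE: {0:.2f}".format(test_error))
```

In a sketch like this, the training error is typically much lower than the testing error, because the trees were fit to those exact rows. The testing error is the more honest estimate of how the model will perform on new data.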

Let's begin!
