Introduction to model validation

Model Validation in Python

Kasey Jones

Data Scientist

What is model validation?

Model validation consists of:

Ensuring your model performs as expected on new data
Testing model performance on holdout datasets
Selecting the best model, parameters, and accuracy metrics
Achieving the best possible accuracy for the given data
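A holdout dataset, as mentioned in the list above, can be sketched roughly as follows. This is a minimal illustration that assumes scikit-learn's train_test_split and a small synthetic dataset; the arrays, the 80/20 split, and the seed are illustrative assumptions, not the course data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative (not the course dataset)
rng = np.random.RandomState(1111)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

print(X_train.shape, X_test.shape)
```

The held-out rows play the role of "new data" when checking whether the model performs as expected.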

scikit-learn modeling review

Basic modeling steps:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=500, random_state=1111)

model.fit(X=X_train, y=y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
           oob_score=False, random_state=1111, verbose=0, warm_start=False)

Modeling review continued

predictions = model.predict(X_test)

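# mae is assumed here to be an alias for scikit-learn's mean_absolute_error
from sklearn.metrics import mean_absolute_error as mae

print("{0:.2f}".format(mae(y_true=y_test, y_pred=predictions)))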

10.84

Mean Absolute Error Formula

$$ \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} $$
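To make the formula concrete, the sketch below computes the mean absolute error by hand with NumPy and compares it with scikit-learn's mean_absolute_error. The four values in y_true and y_pred are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Sum of absolute errors divided by n, exactly as in the formula
manual_mae = np.sum(np.abs(y_true - y_pred)) / len(y_true)

# The same quantity from scikit-learn
sklearn_mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)

print(manual_mae, sklearn_mae)
```

The absolute errors here are 0.5, 0, 2, and 1, so both computations give 3.5 / 4 = 0.875.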

Review prerequisites

FiveThirtyEight publishes several datasets, including the Halloween candy power ranking dataset. Each candy has a head-to-head win percentage between 0% and 100%.

Seen vs. unseen data

Training data = seen data

model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)

Testing data = unseen data

model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
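One way to see why the seen/unseen distinction matters is to score both sets of predictions. The sketch below is an assumption-laden illustration: it uses synthetic data rather than the candy dataset, and 50 trees instead of 500 to keep it quick. On data like this, the error on the training (seen) data typically comes out noticeably lower than on the testing (unseen) data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, noisy regression data (illustrative, not the course dataset)
rng = np.random.RandomState(1111)
X = rng.rand(200, 2)
y = 10 * np.sin(np.pi * X[:, 0]) + 5 * X[:, 1] + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1111
)

# Fewer trees than in the slides, only to keep the example fast
model = RandomForestRegressor(n_estimators=50, random_state=1111)
model.fit(X_train, y_train)

# Error on data the model was fit to (seen) vs. held-out data (unseen)
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print("Train MAE: {0:.2f}".format(train_mae))
print("Test MAE: {0:.2f}".format(test_mae))
```

The training error is an optimistic estimate, which is why the testing error is the one to report when judging how the model will behave on new data.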

Let's begin!
