Creating train, test, and validation datasets

Model Validation in Python

Kasey Jones

Data Scientist

Traditional train/test split

  • Seen data (used for training)
  • Unseen data (unavailable for training)

Splitting data means setting aside a large chunk of the available data for training and a smaller chunk for testing.

Dataset definitions and ratios

Dataset                 Definition
Train                   The sample of data used when fitting models
Test (holdout sample)   The sample of data used to assess model performance

Ratio Examples

  • 80:20
  • 90:10 (used when we have little data)
  • 70:30 (used when model is computationally expensive)
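
To make the ratios concrete, the sketch below converts a train:test ratio into row counts. The helper name `split_counts` and the 1,000-row example are hypothetical, and rounding the test set up is an assumption of this sketch.

```python
import math


def split_counts(n_rows, train_part, test_part):
    """Return (n_train, n_test) for a train:test ratio such as 80:20."""
    # Assumption: round the test set up and give the remainder to training
    n_test = math.ceil(n_rows * test_part / (train_part + test_part))
    return n_rows - n_test, n_test


# Hypothetical dataset of 1,000 rows, split by each ratio from the list above
for train_part, test_part in [(80, 20), (90, 10), (70, 30)]:
    n_train, n_test = split_counts(1000, train_part, test_part)
    print(f"{train_part}:{test_part} -> {n_train} train rows, {n_test} test rows")
```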

The X and y datasets

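import pandas as pd

tic_tac_toe = pd.read_csv("tic-tac-toe.csv")
# Features: the first nine columns, converted to dummy (indicator) variables
X = pd.get_dummies(tic_tac_toe.iloc[:,0:9])
# Target: the tenth column
y = tic_tac_toe.iloc[:, 9]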
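
As a rough illustration of what `pd.get_dummies` does to categorical columns, here is a minimal sketch on a three-row stand-in table. The column names and values are hypothetical and are not taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for two board squares plus an outcome column
toy = pd.DataFrame({
    "top_left": ["x", "o", "b"],
    "top_middle": ["o", "x", "x"],
    "outcome": ["positive", "negative", "positive"],
})

# Each categorical column becomes one indicator column per distinct value
X_toy = pd.get_dummies(toy.iloc[:, 0:2])
y_toy = toy.iloc[:, 2]

print(list(X_toy.columns))
print(X_toy.shape, y_toy.shape)
```

Three distinct values in the first column and two in the second yield five indicator columns in total, while the number of rows is unchanged.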

Python courses covering dummy variables:

Creating holdout samples

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test =\
    train_test_split(X, y, test_size=0.2, random_state=1111)

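Parameters:

  • test_size: fraction (or number) of rows held out for testing
  • train_size: fraction (or number) of rows used for training
  • random_state: seed that makes the split reproducible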
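
The effect of these parameters can be checked on a small array. This is a minimal sketch using made-up data (`X_demo`, `y_demo` are hypothetical names), not the tic-tac-toe file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 feature columns
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 holds out 2 of the 10 rows
a_train, a_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)

# train_size=0.8 asks for 8 training rows; the rest go to the test set
b_train, b_test, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=1111)

# Repeating a call with the same random_state reproduces the same split
c_train, c_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)

print(a_train.shape, a_test.shape)
print(b_train.shape, b_test.shape)
print(np.array_equal(a_test, c_test))
```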

Dataset for preliminary testing?

What do we do when testing different model parameters?

  • 100 versus 1000 trees

To test model parameters, we split the available data into three chunks: one for training, one for validation, and one for testing.

Train, validation, test continued

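# Hold out 20% of all rows as the test set
X_temp, X_test, y_temp, y_test =\
    train_test_split(X, y, test_size=0.2, random_state=1111)
# 25% of the remaining 80% is 20% of all rows, giving a 60:20:20 split
X_train, X_val, y_train, y_val =\
    train_test_split(X_temp, y_temp, test_size=0.25, random_state=11111)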
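
One way the three chunks might be used for the 100 versus 1000 trees question is sketched below. It assumes synthetic data from `make_classification` in place of the tic-tac-toe file and a random forest as the model; both are illustrative choices, not something the slides prescribe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the slides use the tic-tac-toe file)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=1111)

# Same two-step split as above: 20% test, then 25% of the rest for validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1111)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111)

# Score each candidate setting on the validation set only
val_scores = {}
for n_trees in (100, 1000):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=1111)
    model.fit(X_train, y_train)
    val_scores[n_trees] = model.score(X_val, y_val)

best_n = max(val_scores, key=val_scores.get)

# Refit the chosen setting and use the test set exactly once
final_model = RandomForestClassifier(n_estimators=best_n, random_state=1111)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(len(X_train), len(X_val), len(X_test))
print(val_scores, best_n, round(test_accuracy, 3))
```

Keeping the test set out of the comparison means its accuracy remains an estimate on data that played no part in choosing the number of trees.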

It's holdout time
