Creating train, test, and validation datasets

Model Validation in Python

Kasey Jones

Data Scientist

Traditional train/test split

Splitting data means using a large chunk of the available data to train a model and holding out a smaller chunk as a testing dataset.

Dataset	Definition
Train	The sample of data used when fitting models
Test (holdout sample)	The sample of data used to assess model performance

Ratio Examples

import pandas as pd

tic_tac_toe = pd.read_csv("tic-tac-toe.csv")
X = pd.get_dummies(tic_tac_toe.iloc[:,0:9])
y = tic_tac_toe.iloc[:, 9]
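As a rough illustration of what `pd.get_dummies` does to categorical columns like the board squares above, here is a minimal sketch on a tiny made-up DataFrame. The column names (`top_left`, `top_middle`) and the values (`x`, `o`, `b`) are assumptions for the example and may not match the real `tic-tac-toe.csv` file.

```python
import pandas as pd

# Hypothetical miniature board data; the real file's columns and values may differ.
boards = pd.DataFrame({
    "top_left": ["x", "o", "b"],
    "top_middle": ["o", "x", "x"],
})

# Each distinct value in each column becomes its own indicator column,
# named "<original column>_<value>".
encoded = pd.get_dummies(boards)

print(list(encoded.columns))
print(encoded.shape)
```

Every categorical column is replaced by one indicator column per distinct value, so the encoded frame is wider than the original but contains only values a model can consume directly.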


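from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test  =\
    train_test_split(X, y, test_size=0.2, random_state=1111)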
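To make the effect of `test_size` and `random_state` concrete, the following sketch runs the same call on synthetic data and checks the resulting sizes. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the tic-tac-toe data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# test_size=0.2 holds out 20% of the rows; the remaining 80% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111
)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```

Fixing `random_state` matters when comparing models, since every model then sees exactly the same training and testing rows.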

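Parameters:

test_size	The proportion (or number) of samples to hold out for testing
random_state	A seed that makes the random split reproducible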

What do we do when testing different model parameters?

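To test model parameters, we need to split the available data into three chunks: one for training, one for validation, and one for testing. The validation set is used to compare parameter settings, so the test set stays unseen until the final performance check.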

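# First hold out 20% of all data for testing
X_temp, X_test, y_temp, y_test  =\
    train_test_split(X, y, test_size=0.2, random_state=1111)

# Then take 25% of the remaining 80% for validation (20% of the total),
# leaving 60% of the total for training
X_train, X_val, y_train, y_val =\
    train_test_split(X_temp, y_temp, test_size=0.25, random_state=11111)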
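The sketch below shows one way the three chunks might be used together: candidate parameter values are compared on the validation set, and the test set is touched only once at the end. The synthetic data, the choice of `DecisionTreeClassifier`, and the candidate `max_depth` values are all assumptions made for illustration, not part of the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 200 rows, 9 categorical-coded features.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 9))
y_demo = (X_demo[:, 0] == X_demo[:, 4]).astype(int)  # target depends on two features

# Two-step split: 20% test, then 25% of the remainder for validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111
)
sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)  # 60% / 20% / 20% of 200 rows

# Compare candidate parameter values using the validation set only.
candidate_depths = [1, 2, 4, 8]
val_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=1111)
    model.fit(X_train, y_train)
    val_scores[depth] = model.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)

# Score the chosen setting once on the untouched test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=1111)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_depth, test_accuracy)
```

Because the test rows play no part in choosing `max_depth`, the final accuracy is a fairer estimate of performance on new data than the validation scores used during selection.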
