Creating train, test, and validation datasets

Model Validation in Python

Kasey Jones

Data Scientist

Traditional train/test split

  • Seen data (used for training)
  • Unseen data (unavailable for training)

Splitting data means using a large chunk of the available data for training and holding out a smaller chunk as a testing dataset.

Dataset definitions and ratios

Dataset                  Definition
Train                    The sample of data used when fitting models
Test (holdout sample)    The sample of data used to assess model performance

Ratio Examples

  • 80:20
  • 90:10 (used when we have little data)
  • 70:30 (used when model is computationally expensive)

The X and y datasets

import pandas as pd

tic_tac_toe = pd.read_csv("tic-tac-toe.csv")
# Features: one-hot encode the first nine columns
X = pd.get_dummies(tic_tac_toe.iloc[:, 0:9])
# Target: the tenth column
y = tic_tac_toe.iloc[:, 9]
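
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a tiny made-up frame. The column names (`top_left`, `top_middle`) and the cell values are assumptions for illustration only and may not match the real tic-tac-toe.csv file.

```python
import pandas as pd

# Hypothetical miniature board data; names and values are illustrative only
toy = pd.DataFrame({
    "top_left": ["x", "o", "b"],
    "top_middle": ["o", "o", "x"],
})

# Each categorical column becomes one indicator column per distinct value
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
```

Each original column expands into as many indicator columns as it has distinct values, which is why X ends up wider than the nine raw columns.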

Python courses covering dummy variables:

Creating holdout samples

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)

Parameters:

  • test_size
  • train_size
  • random_state
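
The following sketch shows how the three parameters behave. It uses a small made-up NumPy array rather than the tic-tac-toe data, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# test_size as a fraction: 20% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)

# train_size can be given instead; the remaining rows form the test set
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=1111)
print(X_tr2.shape, X_te2.shape)
```

Fixing `random_state` is what makes results comparable between runs; without it, each call shuffles the rows differently.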

Dataset for preliminary testing?

What do we do when testing different model parameters?

  • 100 versus 1000 trees

To test model parameters, we need to split the available data into three chunks: one for training, one for validation, and one for testing.
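
As a rough sketch of why the third chunk matters, the code below compares 100 versus 1000 trees on the validation set and touches the test set only once, at the end. The synthetic data from `make_classification` and the choice of `RandomForestClassifier` are assumptions made for illustration, not something the slides specify.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the tic-tac-toe dataset)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, random_state=1111)

# Hold out the test set first, then carve a validation set from the rest
X_temp, X_test, y_temp, y_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1111)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111)

# Compare candidate parameter values using the validation set only
val_scores = {}
for n_trees in (100, 1000):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=1111)
    model.fit(X_train, y_train)
    val_scores[n_trees] = model.score(X_val, y_val)

# Pick the best setting, then score once on the untouched test set
best_n = max(val_scores, key=val_scores.get)
final_model = RandomForestClassifier(n_estimators=best_n, random_state=1111)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(val_scores, best_n, test_accuracy)
```

Because the test set played no part in choosing the number of trees, its score remains an honest estimate of performance on unseen data.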

Train, validation, test continued

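# First split: hold out 20% of all data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)
# Second split: 25% of the remaining 80% (20% of all data) becomes validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=11111)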
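
The arithmetic behind the two-step split can be checked directly: 25% of the remaining 80% is 20% of the full data, so the result is a 60:20:20 split. The sketch below verifies this on a made-up array of 1000 rows; the data and the `_d` variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 2 features
X_all = np.arange(2000).reshape(1000, 2)
y_all = np.arange(1000)

# Same two-step pattern as on the slide
X_temp_d, X_test_d, y_temp_d, y_test_d = train_test_split(
    X_all, y_all, test_size=0.2, random_state=1111)
X_train_d, X_val_d, y_train_d, y_val_d = train_test_split(
    X_temp_d, y_temp_d, test_size=0.25, random_state=1111)

# Fractions of the full dataset in each chunk
n = len(X_all)
fractions = (len(X_train_d) / n, len(X_val_d) / n, len(X_test_d) / n)
print(fractions)
```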

It's holdout time
