The problems with holdout sets

Model Validation in Python

Kasey Jones

Data Scientist

Transition validation

The traditional training and testing split uses most of the overall data for training and reserves a smaller portion for testing only.

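from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = RandomForestRegressor()

rf.fit(X_train, y_train)

out_of_sample = rf.predict(X_test)
print(mae(y_test, out_of_sample))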
10.24

Traditional training splits

import pandas as pd

cd = pd.read_csv("candy-data.csv")
s1 = cd.sample(60, random_state=1111)
s2 = cd.sample(60, random_state=1112)

Overlapping candies:

print(len([i for i in s1.index if i in s2.index]))
39
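
The overlap check above can also be written with pandas index operations. This is a minimal sketch on a hypothetical stand-in DataFrame rather than the real candy file; the 85-row size and the random `chocolate` column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for candy-data.csv: 85 rows, one binary column (assumed shape).
rng = np.random.default_rng(0)
cd = pd.DataFrame({"chocolate": rng.integers(0, 2, size=85)})

s1 = cd.sample(60, random_state=1111)
s2 = cd.sample(60, random_state=1112)

# Index.intersection returns the row labels that appear in both samples.
overlap = s1.index.intersection(s2.index)
print(len(overlap))
```

With 60 rows drawn twice from 85, at least 35 rows must be shared, so two "different" samples are never independent of each other.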

Traditional training splits

Chocolate Candies:

print(s1.chocolate.value_counts()[0])
print(s2.chocolate.value_counts()[0])
34
30

The split matters

Sample 1 Testing Error

print('Testing error: {0:.2f}'.format(mae(s1_y_test, rfr.predict(s1_X_test))))
10.32

Sample 2 Testing Error

print('Testing error: {0:.2f}'.format(mae(s2_y_test, rfr.predict(s2_X_test))))
11.56
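
To see how much the testing error can move with the split alone, one option is to repeat the split with several seeds while keeping the model fixed. The sketch below uses synthetic data (an assumption, not the candy data), and the seed list and coefficients are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical; stands in for a small dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(85, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=2.0, size=85)

errors = []
for seed in [1111, 1112, 1113, 1114, 1115]:
    # Only the split changes between iterations; the model settings stay fixed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
    rfr.fit(X_train, y_train)
    errors.append(mae(y_test, rfr.predict(X_test)))

print(['{0:.2f}'.format(e) for e in errors])
print('Spread: {0:.2f}'.format(max(errors) - min(errors)))
```

The spread between the best and worst seed gives a rough sense of how much of a single reported testing error is due to the split itself.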

Train, validation, test

# The first split holds out the test set; the second carves the validation set out of the remainder
X_temp, X_test, y_temp, y_test = train_test_split(..., random_state=1111)
X_train, X_val, y_train, y_val = train_test_split(..., random_state=1111)

rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
rfr.fit(X_train, y_train)

print('Validation error: {0:.2f}'.format(mae(y_val, rfr.predict(X_val))))
9.18
print('Testing error: {0:.2f}'.format(mae(y_test, rfr.predict(X_test))))
8.98

Round 2

# Same two-step split as before; only the split seed changes
X_temp, X_test, y_temp, y_test = train_test_split(..., random_state=1171)
X_train, X_val, y_train, y_val = train_test_split(..., random_state=1171)

rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
rfr.fit(X_train, y_train)

print('Validation error: {0:.2f}'.format(mae(y_val, rfr.predict(X_val))))
8.73
print('Testing error: {0:.2f}'.format(mae(y_test, rfr.predict(X_test))))
10.91
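
The slides elide the arguments of the two `train_test_split` calls. A minimal sketch of one common way to fill them in follows. The `test_size` values and the synthetic data are assumptions, not the settings used for the numbers above. Here the first holdout is treated as the test set and the second as the validation set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 100 rows, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# Step 1: hold out 20% of all rows as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111)

# Step 2: take 25% of the remaining 80% (20% of all rows) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Each extra holdout shrinks the training set, which is part of why the errors on a small dataset shift so much between seeds.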

Holdout set exercises
