The problems with holdout sets

Model Validation in Python

Kasey Jones

Data Scientist

Traditional validation

The traditional train/test split uses most of the data for training and holds out a smaller portion that is used only for testing.

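from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# Score on rows the model never saw during fitting
out_of_sample = rf.predict(X_test)
print(mae(y_test, out_of_sample))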

10.24

Traditional training splits

import pandas as pd

cd = pd.read_csv("candy-data.csv")
# Two random samples of 60 candies that differ only in the seed
s1 = cd.sample(60, random_state=1111)
s2 = cd.sample(60, random_state=1112)

Overlapping candies:

print(len([i for i in s1.index if i in s2.index]))
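The overlap check above can be reproduced without the CSV. The following is a minimal sketch on stand-in data: the 85-row frame and its random `chocolate` column are assumptions for illustration, not the real candy data. It uses `Index.intersection` rather than a list comprehension to find the rows both samples share.

```python
import numpy as np
import pandas as pd

# Stand-in for the candy data (assumption): 85 rows with a 0/1 chocolate flag.
rng = np.random.default_rng(0)
df = pd.DataFrame({"chocolate": rng.integers(0, 2, size=85)})

# Same sampling calls as in the slides, applied to the stand-in frame.
s1 = df.sample(60, random_state=1111)
s2 = df.sample(60, random_state=1112)

# Row labels that appear in both samples.
overlap = s1.index.intersection(s2.index)
print(len(overlap))

# Two samples of 60 drawn from 85 rows must share at least 60 + 60 - 85 rows.
min_overlap = 2 * 60 - len(df)
print(min_overlap)
```

Under these assumptions the two samples always share many rows, yet they are never guaranteed to be identical, which is why results can still differ between them.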

Chocolate Candies:

print(s1.chocolate.value_counts()[0])
print(s2.chocolate.value_counts()[0])

34
30
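To compare the composition of two samples side by side, one option is to align their `value_counts()` results in a single table. This is a rough sketch on stand-in data (the 85-row frame is an assumption, not the real candy file). Note that `value_counts()` is indexed by the column's values, so the row labelled `0` counts candies where `chocolate == 0` and the row labelled `1` counts candies where `chocolate == 1`.

```python
import numpy as np
import pandas as pd

# Stand-in for the candy data (assumption): 85 rows with a 0/1 chocolate flag.
rng = np.random.default_rng(0)
df = pd.DataFrame({"chocolate": rng.integers(0, 2, size=85)})

s1 = df.sample(60, random_state=1111)
s2 = df.sample(60, random_state=1112)

# Align both count Series on the chocolate value (0 or 1).
counts = pd.DataFrame({
    "sample_1": s1.chocolate.value_counts(),
    "sample_2": s2.chocolate.value_counts(),
}).fillna(0).astype(int)

print(counts)
```

Each column sums to the sample size, so a difference in one row is mirrored in the other.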

The split matters

Sample 1 Testing Error

print('Testing error: {0:.2f}'.format(mae(s1_y_test, rfr.predict(s1_X_test))))

10.32

Sample 2 Testing Error

print('Testing error: {0:.2f}'.format(mae(s2_y_test, rfr.predict(s2_X_test))))

11.56
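The effect of the split on the reported error can be illustrated by refitting the same model on several different splits. This is a sketch under stated assumptions: the data comes from `make_regression` rather than the candy file, and the seeds and the 80/20 proportion are chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 85 rows, 5 features.
X, y = make_regression(n_samples=85, n_features=5, noise=10.0, random_state=0)

errors = {}
for seed in (1111, 1112, 1113):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # The model seed is fixed, so only the split changes between runs.
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
    rfr.fit(X_train, y_train)
    errors[seed] = mae(y_test, rfr.predict(X_test))

for seed, err in errors.items():
    print('Split seed {0}: testing error {1:.2f}'.format(seed, err))

# Gap between the best and worst split.
spread = max(errors.values()) - min(errors.values())
print('Spread: {0:.2f}'.format(spread))
```

Any spread seen here comes from the choice of rows alone, since the model settings and its seed are identical in every run.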

Train, validation, test

# First split: hold out the final test set
X_temp, X_test, y_temp, y_test = train_test_split(..., random_state=1111)
# Second split: carve the validation set out of the remaining rows
X_train, X_val, y_train, y_val = train_test_split(..., random_state=1111)

rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
rfr.fit(X_train, y_train)

print('Validation error: {0:.2f}'.format(mae(y_val, rfr.predict(X_val))))

9.18

print('Testing error: {0:.2f}'.format(mae(y_test, rfr.predict(X_test))))

8.98

Round 2

# Same two-step split as before, with a different seed
X_temp, X_test, y_temp, y_test = train_test_split(..., random_state=1171)
X_train, X_val, y_train, y_val = train_test_split(..., random_state=1171)

rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
rfr.fit(X_train, y_train)

print('Validation error: {0:.2f}'.format(mae(y_val, rfr.predict(X_val))))

8.73

print('Testing error: {0:.2f}'.format(mae(y_test, rfr.predict(X_test))))

10.91
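The two rounds above can be wrapped in a small helper and run end to end. This is only a sketch: the slides elide the split arguments, so the 60/20/20 proportions, the helper name `three_way_split`, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split


def three_way_split(X, y, seed):
    """Hypothetical helper: 60% train, 20% validation, 20% test."""
    # First split: hold out 20% of all rows as the final test set.
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Second split: 25% of the remaining 80% is 20% of the original rows.
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=0.25, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in data (assumption): 100 rows, 5 features.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

results = {}
for seed in (1111, 1171):
    X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y, seed)
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
    rfr.fit(X_train, y_train)
    val_err = mae(y_val, rfr.predict(X_val))
    test_err = mae(y_test, rfr.predict(X_test))
    results[seed] = (val_err, test_err)
    print('Seed {0}: validation {1:.2f}, testing {2:.2f}'.format(seed, val_err, test_err))
```

Only the split seed changes between the two runs, so any difference in the validation and testing errors reflects which rows landed in each set.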

Holdout set exercises
