The problems with holdout sets

Model Validation in Python

Kasey Jones

Data Scientist

Transition validation

The traditional train/test split uses most of the data for training and holds out a smaller portion solely for testing.

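from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# Score the model on data it never saw during training
out_of_sample = rf.predict(X_test)
print(mae(y_test, out_of_sample))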
10.24

Traditional training splits

import pandas as pd
cd = pd.read_csv("candy-data.csv")
# Two samples of 60 rows that differ only in the random seed
s1 = cd.sample(60, random_state=1111)
s2 = cd.sample(60, random_state=1112)

Overlapping candies:

print(len([i for i in s1.index if i in s2.index]))
39
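How much overlap to expect between two such samples can be worked out without the data. The sketch below uses only the standard library and assumes a dataset of 85 rows (the slides do not state the size, so N_ROWS is an assumption). Python's random module will not reproduce the pandas draws, so the printed overlap will not necessarily be 39; the point is the lower bound and the expected value.

```python
import random

N_ROWS = 85        # assumed dataset size, not stated in the slides
SAMPLE_SIZE = 60   # matches cd.sample(60, ...)

def draw_sample(seed):
    """Draw SAMPLE_SIZE distinct row indices, i.e. sampling without replacement."""
    rng = random.Random(seed)
    return set(rng.sample(range(N_ROWS), SAMPLE_SIZE))

s1 = draw_sample(1111)
s2 = draw_sample(1112)
overlap = len(s1 & s2)

# Pigeonhole bound: two samples of 60 from 85 rows must share
# at least 60 + 60 - 85 rows.
min_overlap = max(0, 2 * SAMPLE_SIZE - N_ROWS)

# For independent draws, each row of s1 is also in s2 with probability 60/85.
expected_overlap = SAMPLE_SIZE * SAMPLE_SIZE / N_ROWS

print(f"overlap: {overlap}")
print(f"minimum possible: {min_overlap}, expected: {expected_overlap:.1f}")
```

Under these assumptions, roughly a third of each sample is unique to it, which is enough to change what a model learns.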

Traditional training splits

Chocolate Candies:

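# Note: [0] is a label lookup when the index holds integers, so if chocolate
# is stored as 0/1 these are the counts of rows where chocolate == 0.
print(s1.chocolate.value_counts()[0])
print(s2.chocolate.value_counts()[0])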
34
30

The split matters

Sample 1 Testing Error

print('Testing error: {0:.2f}'.format(mae(s1_y_test, rfr.predict(s1_X_test))))
10.32

Sample 2 Testing Error

print('Testing error: {0:.2f}'.format(mae(s2_y_test, rfr.predict(s2_X_test))))
11.56
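The same effect can be reproduced on synthetic data. This is a minimal sketch, not the course's code: it assumes scikit-learn is installed, uses make_regression as a stand-in for the candy data, and picks the dataset size, feature count and seeds arbitrarily. The model seed is held fixed so that only the split changes between runs.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small dataset (size and features are assumptions).
X, y = make_regression(n_samples=85, n_features=8, noise=10.0, random_state=0)

errors = {}
for seed in (1111, 1112, 1113, 1114, 1115):
    # Only the split seed varies from one iteration to the next.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
    rfr.fit(X_train, y_train)
    errors[seed] = mae(y_test, rfr.predict(X_test))

for seed, err in errors.items():
    print(f"split seed {seed}: testing error {err:.2f}")
print(f"spread: {max(errors.values()) - min(errors.values()):.2f}")
```

The spread between the best and worst split gives a rough sense of how much a single holdout score can be trusted on a dataset this small.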

Train, validation, test

# Hold out the test set first, then carve the validation set out of the remainder
X_temp, X_test, y_temp, y_test = train_test_split(..., random_state=1111)
X_train, X_val, y_train, y_val = train_test_split(..., random_state=1111)

rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
rfr.fit(X_train, y_train)

print('Validation error: {0:.2f}'.format(mae(y_val, rfr.predict(X_val))))
9.18
print('Testing error: {0:.2f}'.format(mae(y_test, rfr.predict(X_test))))
8.98

Round 2

# Same procedure, different split seed
X_temp, X_test, y_temp, y_test = train_test_split(..., random_state=1171)
X_train, X_val, y_train, y_val = train_test_split(..., random_state=1171)

rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
rfr.fit(X_train, y_train)

print('Validation error: {0:.2f}'.format(mae(y_val, rfr.predict(X_val))))
8.73
print('Testing error: {0:.2f}'.format(mae(y_test, rfr.predict(X_test))))
10.91
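The two rounds above can be sketched end to end as follows. The split arguments are elided in the slides, so the proportions here (60% train, 20% validation, 20% test) are an assumption, as are the helper name three_way_split and the synthetic data from make_regression. The numbers it prints will not match the slides.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed):
    """Hypothetical helper: 60% train, 20% validation, 20% test."""
    # First split: hold out 20% of all rows as the test set.
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    # Second split: 25% of the remaining 80% becomes the validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=0.25, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic stand-in data (an assumption, not the course dataset).
X, y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=0)

results = []
for seed in (1111, 1171):
    X_train, X_val, X_test, y_train, y_val, y_test = three_way_split(X, y, seed)
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
    rfr.fit(X_train, y_train)
    val_err = mae(y_val, rfr.predict(X_val))
    test_err = mae(y_test, rfr.predict(X_test))
    results.append((seed, val_err, test_err))
    print(f"seed {seed}: validation {val_err:.2f}, testing {test_err:.2f}")
```

With only 20 rows in each holdout set, the validation and testing errors can move in opposite directions from one seed to the next, which is the instability the two rounds illustrate.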

Holdout set exercises
