The bias-variance tradeoff

Model Validation in Python

Kasey Jones

Data Scientist

Variance

Variance: following the training data too closely
- Fails to generalize to the test data
- Low training error but high testing error
- Occurs when models are overfit and have high complexity

Overfitting models (high variance)

Overfitting occurs when our predictions follow the training data too closely. If we drew a scatter plot of our predictions against the real values and every point lined up exactly, the model is probably overfit.
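As a rough illustration of high variance, the sketch below fits a decision tree with no depth limit to noisy synthetic data. The dataset, the noise level, and names such as deep_tree are assumptions made for this example and are not part of the course material. Because the tree is free to memorize the training set, training accuracy should come out perfect while testing accuracy falls short of it.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration); flip_y=0.2 assigns about 20%
# of the labels at random, so part of the signal is pure noise.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# No max_depth: the tree keeps splitting until every leaf is pure,
# which means it also fits the noisy labels.
deep_tree = DecisionTreeClassifier(random_state=0)
deep_tree.fit(X_train, y_train)

overfit_train_acc = accuracy_score(y_train, deep_tree.predict(X_train))
overfit_test_acc = accuracy_score(y_test, deep_tree.predict(X_test))

print("Training: {0:.2f}".format(overfit_train_acc))
print("Testing: {0:.2f}".format(overfit_test_acc))
```

The gap between the two scores is the symptom to watch for: low training error paired with noticeably higher testing error.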

Bias

Bias: failing to find the relationship between the data and the response
- High training/testing error
- Occurs when models are underfit

Underfitting models (high bias)

Underfitting occurs when a relationship exists between the predictive variables and the variable we are predicting, but the model fails to find it.
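One way to see high bias is to fit a model that is too simple for the data. The sketch below is an assumed example, not taken from the course: it generates a quadratic relationship and fits a straight line to it. The line scores poorly on both the training and testing sets, while a model that includes a squared term finds the relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed for illustration): y depends on x squared.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A straight line cannot represent a symmetric curve, so it underfits.
straight_line = LinearRegression().fit(X_train, y_train)
linear_train_r2 = straight_line.score(X_train, y_train)
linear_test_r2 = straight_line.score(X_test, y_test)

# Adding a squared feature lets the same linear model capture the curve.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quadratic.fit(X_train, y_train)
quad_test_r2 = quadratic.score(X_test, y_test)

print("Linear    R^2 train: {0:.2f}  test: {1:.2f}".format(
    linear_train_r2, linear_test_r2))
print("Quadratic R^2 test: {0:.2f}".format(quad_test_r2))
```

Unlike overfitting, the training and testing scores of the underfit model are similar to each other; the problem is that both are poor.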

Optimal performance

Well-fit models find the relationship between the predictive variables and the response, but also generalize well to new data.

Bias-Variance Tradeoff

Parameters causing over- and underfitting

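# Shallow trees (max_depth=4): the model may be too simple and underfit
rfc = RandomForestClassifier(n_estimators=100, max_depth=4)
rfc.fit(X_train, y_train)
train_predictions = rfc.predict(X_train)
test_predictions = rfc.predict(X_test)

print("Training: {0:.2f}".format(accuracy_score(y_train, train_predictions)))

Training: 0.84

print("Testing: {0:.2f}".format(accuracy_score(y_test, test_predictions)))

Testing: 0.77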

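# Deep trees (max_depth=14): perfect training accuracy suggests overfitting
rfc = RandomForestClassifier(n_estimators=100, max_depth=14)
rfc.fit(X_train, y_train)
train_predictions = rfc.predict(X_train)
test_predictions = rfc.predict(X_test)

print("Training: {0:.2f}".format(accuracy_score(y_train, train_predictions)))

Training: 1.00

print("Testing: {0:.2f}".format(accuracy_score(y_test, test_predictions)))

Testing: 0.83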

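# Intermediate depth (max_depth=10): training and testing accuracy are close
rfc = RandomForestClassifier(n_estimators=100, max_depth=10)
rfc.fit(X_train, y_train)
train_predictions = rfc.predict(X_train)
test_predictions = rfc.predict(X_test)

print("Training: {0:.2f}".format(accuracy_score(y_train, train_predictions)))

Training: 0.89

print("Testing: {0:.2f}".format(accuracy_score(y_test, test_predictions)))

Testing: 0.86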
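The three runs above can be generalized into a loop over candidate depths. The sketch below is one possible way to do this and is not the course's own code: it uses synthetic data from make_classification in place of the course dataset, and the depth list and the results dictionary are assumptions for illustration. Exact scores will differ from the numbers on these slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the course data (assumed for illustration).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Map each max_depth to its (training accuracy, testing accuracy) pair.
results = {}
for depth in [2, 4, 10, 14]:
    rfc = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                 random_state=1)
    rfc.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, rfc.predict(X_train))
    test_acc = accuracy_score(y_test, rfc.predict(X_test))
    results[depth] = (train_acc, test_acc)
    print("max_depth={0:>2}  Training: {1:.2f}  Testing: {2:.2f}".format(
        depth, train_acc, test_acc))
```

Reading the output row by row shows where training accuracy keeps climbing while testing accuracy stalls, which is the point at which extra depth starts to add variance rather than skill.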

Remember, only you can prevent overfitting!
