Cross-validation

Supervised Learning with scikit-learn

George Boorman

Core Curriculum Manager, DataCamp

Cross-validation motivation

  • Model performance depends on how we split the data

  • A score from a single split may not be representative of the model's ability to generalize to unseen data

  • Solution: Cross-validation!
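The dependence on the split can be seen in a minimal sketch like the one below. It assumes synthetic data from `make_regression` as a stand-in for the course's `X` and `y`, and the variable name `split_scores` is illustrative only. The same model is scored on five different random train/test splits, and the R-squared value changes from split to split.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the slides use their own X and y)
X, y = make_regression(n_samples=200, n_features=5, noise=25.0, random_state=0)

split_scores = []
for seed in range(5):
    # A different random_state produces a different train/test partition
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    reg = LinearRegression().fit(X_train, y_train)
    # For regressors, score() returns R-squared on the given data
    split_scores.append(reg.score(X_test, y_test))

# One R-squared value per split; they generally differ
print(split_scores)
```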

Cross-validation basics

[Figure: table with headings Split 1, Fold 1, Fold 2, Fold 3, Fold 4, and Fold 5]

Split 1: fold 1 reserved as the test set

folds 2-5 used as training data

compute the metric of interest on the test fold

Fold 2 as test data

folds 1, 3, 4, and 5 as training data

calculate metric again

repeat with the third fold

repeat with the fourth fold

repeat with the fifth fold
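The walk-through above can be sketched as an explicit loop. This is a minimal illustration, assuming synthetic data from `make_regression` in place of the course dataset. The names `fold_scores` and `test_counts` are hypothetical helpers introduced here. Each of the five folds takes one turn as the test set while the other four are used for training.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, not the course dataset)
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
# Tracks how many times each sample lands in a test fold
test_counts = np.zeros(len(X), dtype=int)

for train_idx, test_idx in kf.split(X):
    # Fit on the four training folds
    reg = LinearRegression().fit(X[train_idx], y[train_idx])
    # Compute R-squared on the single held-out fold
    fold_scores.append(reg.score(X[test_idx], y[test_idx]))
    test_counts[test_idx] += 1

# Five scores, one per fold
print(fold_scores)
```

Every sample appears in a test fold exactly once, so each observation contributes to exactly one of the five scores.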

Cross-validation and model performance

  • 5 folds = 5-fold CV

  • 10 folds = 10-fold CV

  • k folds = k-fold CV

  • More folds = More computationally expensive
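To make the cost point concrete, the sketch below counts the work implied by different values of k. It assumes a placeholder array of 100 samples, and the `cost` dictionary is an illustrative name. With k folds, the model is fit k times, and each fit uses a fraction (k - 1)/k of the data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder feature matrix with 100 samples (an assumption for illustration)
X = np.zeros((100, 3))

cost = {}
for k in (5, 10):
    kf = KFold(n_splits=k)
    # One model fit is needed per split; record the training-set size of each
    train_sizes = [len(train_idx) for train_idx, _ in kf.split(X)]
    cost[k] = (len(train_sizes), train_sizes[0])

# Maps k to (number of fits, samples per fit)
print(cost)  # {5: (5, 80), 10: (10, 90)}
```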

Cross-validation in scikit-learn

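from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# X (features) and y (target) are assumed to be loaded already
kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()
# Returns one score per fold (R-squared by default for a regressor)
cv_results = cross_val_score(reg, X, y, cv=kf)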
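The metric reported by `cross_val_score` can be changed with its `scoring` parameter. The sketch below is one possible illustration, assuming synthetic data from `make_regression` rather than the course dataset. It compares the default R-squared scores with `"neg_mean_squared_error"`, which scikit-learn reports as a negative number so that higher is always better.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, not the course dataset)
X, y = make_regression(n_samples=120, n_features=4, noise=15.0, random_state=1)

kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()

# Default scoring uses the estimator's own score method (R-squared here)
r2_scores = cross_val_score(reg, X, y, cv=kf)

# Negated mean squared error: values are <= 0, closer to 0 is better
neg_mse = cross_val_score(reg, X, y, cv=kf, scoring="neg_mean_squared_error")

# Flip the sign and take the square root to get RMSE per fold
rmse = (-neg_mse) ** 0.5

print(r2_scores)
print(rmse)
```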

Evaluating cross-validation performance

import numpy as np

print(cv_results)
[0.70262578 0.7659624  0.75188205 0.76914482 0.72551151 0.73608277]
print(np.mean(cv_results), np.std(cv_results))
0.7418682216666667 0.023330243960652888
print(np.quantile(cv_results, [0.025, 0.975]))
[0.7054865  0.76874702]
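The summary statistics above can be reproduced directly from the six reported fold scores. This is a small sketch: the variable names are illustrative, and the scores are copied from the printed output rather than recomputed from any dataset. The 2.5% and 97.5% quantiles give an empirical 95% interval for the fold scores.

```python
import numpy as np

# The six fold scores as printed on the slide
cv_results = np.array(
    [0.70262578, 0.7659624, 0.75188205, 0.76914482, 0.72551151, 0.73608277]
)

mean_score = np.mean(cv_results)
# np.std uses the population formula (ddof=0) by default
std_score = np.std(cv_results)
# Linear interpolation between the sorted scores gives the interval bounds
interval = np.quantile(cv_results, [0.025, 0.975])

print(mean_score, std_score)
print(interval)
```

With only six scores, the interval bounds sit very close to the minimum and maximum fold scores, so they are best read as a rough spread rather than a precise estimate.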

Let's practice!
