Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
Model performance depends on how we split the data into training and test sets
A single split may not be representative of the model's ability to generalize to unseen data
Solution: Cross-validation!
5 folds = 5-fold CV
10 folds = 10-fold CV
k folds = k-fold CV
More folds = more computationally expensive
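To make the fold mechanics concrete, here is a minimal sketch (using a hypothetical 12-sample toy array, not the course dataset) showing how KFold partitions the data: each sample lands in the test set of exactly one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(-1, 1)  # 12 toy samples (assumed for illustration)
kf = KFold(n_splits=6, shuffle=True, random_state=42)

# Each iteration yields the train/test row indices for one fold
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

With `shuffle=True`, rows are randomly permuted before being split into folds, and `random_state` makes that permutation reproducible.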
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)
print(cv_results)
[0.70262578, 0.7659624, 0.75188205, 0.76914482, 0.72551151, 0.73608277]
print(np.mean(cv_results), np.std(cv_results))
0.7418682216666667 0.023330243960652888
print(np.quantile(cv_results, [0.025, 0.975]))
[0.7054865  0.76874702]
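By default, `cross_val_score` uses the estimator's default scorer (R² for regression). A different metric can be requested via the `scoring` parameter; this sketch (on a synthetic dataset, assumed for illustration) evaluates mean squared error instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: stands in for the course dataset)
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=42)

kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()

# "neg_mean_squared_error" returns negated MSE (scikit-learn scorers
# follow a higher-is-better convention), so negate to recover MSE
mse_scores = -cross_val_score(reg, X, y, cv=kf, scoring="neg_mean_squared_error")
print(mse_scores.mean(), mse_scores.std())
```

The same pattern works for any scorer name accepted by scikit-learn, e.g. `"neg_mean_absolute_error"` or `"r2"`.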