Winning a Kaggle Competition in Python
Yauhen Babakhin
Kaggle Grandmaster

Leak in features: using data that will not be available in a real setting
Leak in validation strategy: the validation strategy differs from the real-world setting


# Import TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit
# Create a TimeSeriesSplit object
time_kfold = TimeSeriesSplit(n_splits=5)
# Sort train by date
train = train.sort_values('date')
# Loop through each cross-validation split
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
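To illustrate why the split above avoids a validation leak, here is a minimal sketch on hypothetical toy data. The `train` frame, its column names, and the helper `folds_respect_time` are assumptions made for this example only. It checks whether every training date precedes every test date in each fold, for `TimeSeriesSplit` and for a shuffled `KFold`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, TimeSeriesSplit

# Hypothetical toy data: 20 daily rows, deliberately out of order
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=20, freq="D"),
    "target": rng.normal(size=20),
}).sample(frac=1.0, random_state=0)

# Sort by date so that positional indices follow time order
train = train.sort_values("date").reset_index(drop=True)


def folds_respect_time(splitter, frame):
    """Return True if, in every fold, all training dates precede all test dates."""
    return all(
        frame["date"].iloc[train_index].max() < frame["date"].iloc[test_index].min()
        for train_index, test_index in splitter.split(frame)
    )


time_ok = folds_respect_time(TimeSeriesSplit(n_splits=5), train)
shuffled_ok = folds_respect_time(
    KFold(n_splits=5, shuffle=True, random_state=0), train
)

print("TimeSeriesSplit keeps the future out of training:", time_ok)
print("Shuffled KFold keeps the future out of training:", shuffled_ok)
```

With a shuffled `KFold`, some folds train on rows dated after the test rows, which is the kind of mismatch with the real setting that the validation-strategy leak describes.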
# List for the results
fold_metrics = []

for train_index, test_index in CV_STRATEGY.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]

    # Train a model
    model.fit(cv_train)

    # Make predictions
    predictions = model.predict(cv_test)

    # Calculate the metric
    metric = evaluate(cv_test, predictions)
    fold_metrics.append(metric)
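The loop above is a template: `CV_STRATEGY`, `model`, and `evaluate` are placeholders. One possible runnable instance is sketched below, assuming synthetic regression data, a `KFold` splitter, a `LinearRegression` model, and `mean_squared_error` as the metric. None of these choices come from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (assumption): 100 rows, 3 features, linear target plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Stand-ins for the CV_STRATEGY and model placeholders
cv_strategy = KFold(n_splits=4, shuffle=True, random_state=42)
model = LinearRegression()

# List for the results
fold_metrics = []

for train_index, test_index in cv_strategy.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train on the training part of the fold only
    model.fit(X_train, y_train)

    # Predict on the held-out part and score it
    predictions = model.predict(X_test)
    fold_metrics.append(mean_squared_error(y_test, predictions))

print(fold_metrics)
```

Any splitter with a `split` method, such as `TimeSeriesSplit` or `StratifiedKFold`, could replace `KFold` here, depending on how the real test data relates to the training data.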
| Fold number | Model A MSE | Model B MSE |
|---|---|---|
| Fold 1 | 2.95 | 2.97 |
| Fold 2 | 2.84 | 2.45 |
| Fold 3 | 2.62 | 2.73 |
| Fold 4 | 2.79 | 2.83 |
import numpy as np
# Simple mean over the folds
mean_score = np.mean(fold_metrics)
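# Overall validation score for a metric that is minimized (e.g. MSE):
# adding the standard deviation penalizes unstable folds
overall_score_minimizing = np.mean(fold_metrics) + np.std(fold_metrics)
# Or, for a metric that is maximized (e.g. accuracy), subtract it instead
overall_score_maximizing = np.mean(fold_metrics) - np.std(fold_metrics)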
| Fold number | Model A MSE | Model B MSE |
|---|---|---|
| Fold 1 | 2.95 | 2.97 |
| Fold 2 | 2.84 | 2.45 |
| Fold 3 | 2.62 | 2.73 |
| Fold 4 | 2.79 | 2.83 |
| Mean | 2.80 | 2.75 |
| Overall | 2.919 | 2.935 |
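The last two rows of the table can be reproduced from the per-fold values. The sketch below assumes the overall score is the mean plus the standard deviation, using NumPy's default population standard deviation (`ddof=0`). The dictionary and variable names are illustrative.

```python
import numpy as np

# Per-fold MSE values copied from the table
fold_mse = {
    "Model A": [2.95, 2.84, 2.62, 2.79],
    "Model B": [2.97, 2.45, 2.73, 2.83],
}

scores = {}
for name, metrics in fold_mse.items():
    mean_score = np.mean(metrics)
    # MSE is minimized, so the fold-to-fold spread is added as a penalty
    overall_score = mean_score + np.std(metrics)
    scores[name] = (mean_score, overall_score)
    print(f"{name}: mean={mean_score:.3f}, overall={overall_score:.3f}")

# Lower is better for both criteria
best_by_mean = min(scores, key=lambda n: scores[n][0])
best_by_overall = min(scores, key=lambda n: scores[n][1])
print("Best by mean:", best_by_mean)
print("Best by overall score:", best_by_overall)
```

Model B has the lower mean MSE, but its folds vary more, so Model A comes out ahead once the standard deviation is taken into account.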