Supervised Learning with scikit-learn
George Boorman
Core Curriculum Manager, DataCamp
Measuring model performance with accuracy:
Fraction of correctly classified samples
Not always a useful metric
Classification for predicting fraudulent bank transactions
Suppose 99% of transactions are legitimate and 1% are fraudulent
Could build a classifier that predicts NONE of the transactions are fraudulent
99% accurate!
But terrible at actually predicting fraudulent transactions
Fails at its original purpose
Class imbalance: Uneven frequency of classes
Need a different way to assess performance
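The fraud example can be checked with a few lines of plain Python. This is a minimal sketch, not the course's code: the 1,000 transactions and the 10 fraud cases are made-up numbers chosen to match the 99% figure, and the "classifier" is just a list of zeros.

```python
# Hypothetical data: 1,000 transactions, 10 of them fraudulent (label 1).
y_true = [1] * 10 + [0] * 990

# A "classifier" that predicts that NONE of the transactions are fraudulent.
y_pred = [0] * len(y_true)

# Accuracy: fraction of correctly classified samples.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the fraud class: fraction of actual frauds that were caught.
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = frauds_caught / sum(y_true)

print(f"accuracy: {accuracy:.2f}")          # accuracy: 0.99
print(f"fraud recall: {fraud_recall:.2f}")  # fraud recall: 0.00
```

Accuracy looks excellent while not a single fraudulent transaction is detected. With scikit-learn, `accuracy_score` and `recall_score` from `sklearn.metrics` should report the same two numbers.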
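from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# X (features) and y (target labels) are assumed to be loaded already
knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))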
[[1106   11]
 [ 183   34]]
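To read the matrix: scikit-learn puts actual classes in rows and predicted classes in columns, so for a binary problem the four cells are true negatives, false positives, false negatives, and true positives. The sketch below copies the printed counts into a plain list (the variable names are illustrative) and recovers the overall accuracy from them.

```python
# Counts copied from the printed confusion matrix.
# Rows: actual class (0, 1). Columns: predicted class (0, 1).
cm = [[1106, 11],
      [183, 34]]

# Unpack the four cells, treating class 1 as the positive class.
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

print(f"test samples: {total}")      # test samples: 1334
print(f"accuracy: {accuracy:.2f}")   # accuracy: 0.85
print(f"missed positives: {fn}")     # missed positives: 183
```

The accuracy of 0.85 hides the fact that 183 of the 217 actual positives were predicted as negatives.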
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.86      0.99      0.92      1117
           1       0.76      0.16      0.26       217

    accuracy                           0.85      1334
   macro avg       0.81      0.57      0.59      1334
weighted avg       0.84      0.85      0.81      1334
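The class 1 row of the report can be reproduced by hand from the confusion matrix counts. This is a small sketch using the standard definitions of precision, recall, and F1; the counts are copied from the matrix printed earlier and the variable names are illustrative.

```python
# Counts from the confusion matrix, with class 1 as the positive class.
tn, fp, fn, tp = 1106, 11, 183, 34

# Precision: of the samples predicted as class 1, how many really are class 1.
precision = tp / (tp + fp)

# Recall: of the samples that really are class 1, how many were found.
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Support: number of actual class 1 samples in the test set.
support = tp + fn

print(f"precision: {precision:.2f}")  # precision: 0.76
print(f"recall:    {recall:.2f}")     # recall:    0.16
print(f"f1-score:  {f1:.2f}")         # f1-score:  0.26
print(f"support:   {support}")        # support:   217
```

The low recall for class 1 is the number that overall accuracy concealed: the model finds only about one in six of the positive cases.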