How good is your model?

Supervised Learning with scikit-learn

George Boorman

Core Curriculum Manager, DataCamp

Classification metrics

  • Measuring model performance with accuracy:

    • Fraction of correctly classified samples

    • Not always a useful metric


Class imbalance

  • Classification for predicting fraudulent bank transactions

    • 99% of transactions are legitimate; 1% are fraudulent
  • Could build a classifier that predicts NONE of the transactions are fraudulent

    • 99% accurate!

    • But terrible at actually predicting fraudulent transactions

    • Fails at its original purpose

  • Class imbalance: Uneven frequency of classes

  • Need a different way to assess performance
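The majority-class trap above can be sketched with a synthetic 99:1 dataset and scikit-learn's DummyClassifier (an assumed illustration, not the course's transactions data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: 99% legitimate (0), 1% fraudulent (1)
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant to the dummy model

# Always predict the most frequent class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
y_pred = dummy.predict(X)

print(accuracy_score(y, y_pred))  # 0.99, yet no fraud is ever caught
```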


Confusion matrix for assessing classification performance

  • Confusion matrix: a 2x2 table of predictions, with the fraudulent class as the positive class

                            Predicted: Legitimate   Predicted: Fraudulent
     Actual: Legitimate     True negative           False positive
     Actual: Fraudulent     False negative          True positive

Assessing classification performance

  • Columns of the confusion matrix contain the predicted labels; rows contain the actual labels
  • True positive: fraudulent transaction correctly predicted as fraudulent
  • True negative: legitimate transaction correctly predicted as legitimate
  • False negative: fraudulent transaction incorrectly predicted as legitimate
  • False positive: legitimate transaction incorrectly predicted as fraudulent
  • Accuracy: $\frac{TP \ + \ TN}{TP \ + \ TN \ + \ FP \ + \ FN}$
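As a quick numeric sketch with hypothetical cell counts (not from the course's dataset):

```python
# Hypothetical confusion-matrix cell counts (fraudulent = positive class)
tp, tn, fp, fn = 30, 950, 10, 10

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.98
```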


Precision

  • Precision: $\frac{TP}{TP \ + \ FP}$

  • High precision = lower false positive rate
  • High precision: Not many legitimate transactions are predicted to be fraudulent

Recall

  • Recall: $\frac{TP}{TP \ + \ FN}$

  • High recall = lower false negative rate
  • High recall: Predicted most fraudulent transactions correctly

F1 score

  • F1 score: the harmonic mean of precision and recall: $2 \ * \ \frac{precision \ * \ recall}{precision \ + \ recall}$
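A minimal sketch with made-up labels, computing all three metrics by hand and cross-checking against scikit-learn's built-in functions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels where fraudulent (1) is the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Cell counts for these labels: TP = 2, FP = 1, FN = 2
precision = 2 / (2 + 1)  # TP / (TP + FP)
recall = 2 / (2 + 2)     # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
# scikit-learn produces the same values
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```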

Confusion matrix in scikit-learn

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))
[[1106   11]
 [ 183   34]]

Classification report in scikit-learn

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.86      0.99      0.92      1117
           1       0.76      0.16      0.26       217

    accuracy                           0.85      1334
   macro avg       0.81      0.57      0.59      1334
weighted avg       0.84      0.85      0.81      1334
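As a sanity check, the class-1 row of the report can be re-derived from the confusion matrix printed earlier (TN=1106, FP=11, FN=183, TP=34):

```python
# Cell counts taken from the printed confusion matrix
tn, fp, fn, tp = 1106, 11, 183, 34

precision = tp / (tp + fp)  # 34 / 45
recall = tp / (tp + fn)     # 34 / 217
f1 = 2 * precision * recall / (precision + recall)

# Matches the class-1 row of the classification report
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.76 0.16 0.26
```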

Let's practice!

