How good is your model?

Supervised Learning with scikit-learn

George Boorman

Core Curriculum Manager, DataCamp

Classification metrics

  • Measuring model performance with accuracy:

    • Fraction of correctly classified samples

    • Not always a useful metric


Class imbalance

  • Classification for predicting fraudulent bank transactions

    • 99% of transactions are legitimate; 1% are fraudulent
  • Could build a classifier that predicts NONE of the transactions are fraudulent

    • 99% accurate!

    • But terrible at actually predicting fraudulent transactions

    • Fails at its original purpose

  • Class imbalance: Uneven frequency of classes

  • Need a different way to assess performance
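The majority-class trap above can be sketched with a synthetic 99:1 dataset and scikit-learn's DummyClassifier (an assumed illustration, not the course's transactions data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: 99% legitimate (0), 1% fraudulent (1)
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # features are irrelevant to the dummy model

# Always predict the most frequent class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
y_pred = dummy.predict(X)

print(accuracy_score(y, y_pred))  # 0.99, yet no fraud is ever caught
```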


Confusion matrix for assessing classification performance

  • Confusion matrix: a 2x2 table of predictions, with the fraudulent class as the positive class

                            Predicted: Legitimate   Predicted: Fraudulent
     Actual: Legitimate     True negative           False positive
     Actual: Fraudulent     False negative          True positive

Assessing classification performance

  • Columns of the confusion matrix contain the predicted labels; rows contain the actual labels
  • True positive: fraudulent transaction correctly predicted as fraudulent
  • True negative: legitimate transaction correctly predicted as legitimate
  • False negative: fraudulent transaction incorrectly predicted as legitimate
  • False positive: legitimate transaction incorrectly predicted as fraudulent
  • Accuracy: $\frac{TP \ + \ TN}{TP \ + \ TN \ + \ FP \ + \ FN}$
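As a quick numeric sketch with hypothetical cell counts (not from the course's dataset):

```python
# Hypothetical confusion-matrix cell counts (fraudulent = positive class)
tp, tn, fp, fn = 30, 950, 10, 10

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.98
```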


Precision

  • Precision: $\frac{TP}{TP \ + \ FP}$

  • High precision = lower false positive rate
  • High precision: Not many legitimate transactions are predicted to be fraudulent

Recall

  • Recall: $\frac{TP}{TP \ + \ FN}$

  • High recall = lower false negative rate
  • High recall: Predicted most fraudulent transactions correctly

F1 score

  • F1 score: the harmonic mean of precision and recall: $2 \ * \ \frac{precision \ * \ recall}{precision \ + \ recall}$
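A minimal sketch with made-up labels, computing all three metrics by hand and cross-checking against scikit-learn's built-in functions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels where fraudulent (1) is the positive class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Cell counts for these labels: TP = 2, FP = 1, FN = 2
precision = 2 / (2 + 1)  # TP / (TP + FP)
recall = 2 / (2 + 2)     # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
# scikit-learn produces the same values
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```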

Confusion matrix in scikit-learn

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))
[[1106   11]
 [ 183   34]]

Classification report in scikit-learn

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.86      0.99      0.92      1117
           1       0.76      0.16      0.26       217

    accuracy                           0.85      1334
   macro avg       0.81      0.57      0.59      1334
weighted avg       0.84      0.85      0.81      1334
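As a sanity check, the class-1 row of the report can be re-derived from the confusion matrix printed earlier (TN=1106, FP=11, FN=183, TP=34):

```python
# Cell counts taken from the printed confusion matrix
tn, fp, fn, tp = 1106, 11, 183, 34

precision = tp / (tp + fp)  # 34 / 45
recall = tp / (tp + fn)     # 34 / 217
f1 = 2 * precision * recall / (precision + recall)

# Matches the class-1 row of the classification report
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.76 0.16 0.26
```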

Let's practice!

