Evaluation metrics for text classification

Deep Learning for Text with PyTorch

Shubham Jain

Instructor

Why evaluation metrics matter

Spotlight on Book Reviews:

  • Imagine a model that assesses the sentiment of book reviews
  • The model claims a best-selling novel is poorly reviewed. Do we accept this?
  • Use evaluation metrics

Book review


Evaluating RNN models

# Initialize model, criterion, and optimizer
rnn_model = RNNModel(input_size, hidden_size, num_layers, num_classes)
...
# Model training
for epoch in range(10): 
    outputs = rnn_model(X_train)
    ...
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

# Evaluate on the test set
outputs = rnn_model(X_test)
_, predicted = torch.max(outputs, 1)
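For reference, a minimal sketch of what the elided setup and training steps could look like; the cross-entropy loss, Adam optimizer, and y_train labels here are assumptions, not shown on the slide:

import torch
import torch.nn as nn

# Assumed choices: cross-entropy loss and Adam (the slide elides these)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn_model.parameters(), lr=0.001)

for epoch in range(10):
    optimizer.zero_grad()               # reset accumulated gradients
    outputs = rnn_model(X_train)        # forward pass
    loss = criterion(outputs, y_train)  # y_train: assumed training labels
    loss.backward()                     # backpropagation
    optimizer.step()                    # weight update
    print(f'Epoch: {epoch+1}, Loss: {loss.item()}')

# Switch off training-only behavior and gradient tracking before scoring
rnn_model.eval()
with torch.no_grad():
    outputs = rnn_model(X_test)
    _, predicted = torch.max(outputs, 1)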

Accuracy

  • The ratio of correct predictions to the total predictions
import torch
from torchmetrics import Accuracy

actual = torch.tensor([0, 1, 1, 0, 1, 0])
predicted = torch.tensor([0, 0, 1, 0, 1, 1])
accuracy = Accuracy(task="binary", num_classes=2)
acc = accuracy(predicted, actual)
print(f"Accuracy: {acc}")
Accuracy: 0.6666666666666666
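As a sanity check, the same number can be computed directly in plain PyTorch; acc_manual is a hypothetical name added for this illustration:

# Accuracy by hand: fraction of predictions that match the labels
acc_manual = (predicted == actual).float().mean()
print(f"Manual accuracy: {acc_manual}")  # 4 of 6 correct = 0.6667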

Beyond accuracy

  • 10,000 reviews: 9,800 are positive
    • A model that always predicts positive: 98% accuracy
      • Yet it fails to classify negative reviews (see the sketch after this list)

 

  • Precision: confidence in labeling a review as negative
  • Recall: how well the model spots negative reviews
  • F1 Score: balance between precision and recall
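A quick sketch makes the 9,800-positive example concrete; the tensors below are constructed for illustration, and the negative class is scored by flipping the labels:

import torch
from torchmetrics import Accuracy, Recall

# 10,000 reviews: 9,800 positive (1), 200 negative (0)
labels = torch.cat([torch.ones(9800), torch.zeros(200)]).long()
always_positive = torch.ones(10000).long()  # model that always predicts positive

accuracy = Accuracy(task="binary")
print(accuracy(always_positive, labels))  # 0.98 -- looks impressive

# Treat 'negative' as the class of interest by flipping both tensors
recall_negative = Recall(task="binary")
print(recall_negative(1 - always_positive, 1 - labels))  # 0.0 -- misses every negative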

Precision and Recall

  • Precision: correctly predicted positive observations / total predicted positives
  • Recall: correctly predicted positive observations / all observations in the positive class
from torchmetrics import Precision, Recall

precision = Precision(task="binary", num_classes=2)
recall = Recall(task="binary", num_classes=2)
prec = precision(predicted, actual)
rec = recall(predicted, actual)
print(f"Precision: {prec}")
print(f"Recall: {rec}")
Precision: 0.6666666666666666
Recall: 0.6666666666666666
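Tracing the confusion counts by hand shows where both numbers come from (a verification snippet added for illustration, using the tensors defined earlier):

# Confusion counts for the six examples above
tp = ((predicted == 1) & (actual == 1)).sum()  # 2 (indices 2 and 4)
fp = ((predicted == 1) & (actual == 0)).sum()  # 1 (index 5)
fn = ((predicted == 0) & (actual == 1)).sum()  # 1 (index 1)
print(tp / (tp + fp))  # precision = 2/3
print(tp / (tp + fn))  # recall = 2/3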

Precision and Recall

Precision: 0.6666666666666666
Recall: 0.6666666666666666
  • Precision: 66.67% of the reviews predicted as positive were truly positive
  • Recall: the model captured 66.67% of the actual positive reviews

F1 score

  • Harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall)
  • Better measure for imbalanced classes
from torchmetrics import F1Score
f1 = F1Score(task="binary", num_classes=2)
f1_score = f1(predicted, actual)
print(f"F1 Score: {f1_score}")
F1 Score: 0.6666666666666666
  • F1 Score of 1 = perfect precision and recall
  • F1 Score of 0 = worst performance
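Plugging the earlier precision and recall into the harmonic-mean formula reproduces the score by hand (an added check, reusing prec and rec from the previous slide):

# F1 = 2 * (precision * recall) / (precision + recall)
#    = 2 * (2/3 * 2/3) / (2/3 + 2/3) = 2/3
f1_manual = 2 * (prec * rec) / (prec + rec)
print(f"Manual F1: {f1_manual}")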

Considerations

  • Multiclass scores may be identical across metrics (see the sketch below)

    • Can indicate good model performance
  • Always consider the problem when interpreting results!
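One common way identical scores arise: with micro averaging in a single-label multiclass task, accuracy, precision, recall, and F1 coincide by construction. A brief illustration with made-up three-class tensors:

import torch
from torchmetrics import Accuracy, Precision, Recall, F1Score

# Hypothetical 3-class predictions: 5 of 6 correct
labels = torch.tensor([0, 1, 2, 2, 1, 0])
preds = torch.tensor([0, 1, 2, 1, 1, 0])

for Metric in (Accuracy, Precision, Recall, F1Score):
    metric = Metric(task="multiclass", num_classes=3, average="micro")
    print(Metric.__name__, metric(preds, labels))  # all four print 0.8333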


Let's practice!
