Evaluating image classifiers

Intermediate Deep Learning with PyTorch

Michal Oleszak

Machine Learning Engineer

Data augmentation at test time

Data augmentation for training data:

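from torchvision import transforms
from torchvision.datasets import ImageFolder

train_transforms = transforms.Compose([
    # Random augmentations, applied to training images only
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(45),
    transforms.RandomAutocontrast(),
    # Convert to a tensor and resize to 64x64
    transforms.ToTensor(),
    transforms.Resize((64, 64)),
])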

dataset_train = ImageFolder(
  "clouds_train", 
  transform=train_transforms,
)

Data augmentation for test data:

test_transforms = transforms.Compose([
    #
    # NO DATA AUGMENTATION AT TEST TIME
    #
    transforms.ToTensor(),
    transforms.Resize((64, 64)),
])

dataset_test = ImageFolder(
  "clouds_test", 
  transform=test_transforms,
)

Precision & Recall: binary classification

In binary classification:

Precision: Fraction of positive predictions that are correct
Recall: Fraction of all positive examples that are correctly predicted

A 2 by 2 confusion matrix with each of the four fields marked in a different color; next to it, formulas for recall and precision are expressed in terms of the color codes.
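
The color-coded formulas from the figure are not recoverable here, so the block below writes out the standard definitions instead. The symbols TP, FP, and FN (true positives, false positives, false negatives) are an assumed notation, not the slide's own.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP},
\qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
```

Both share the numerator TP. Precision divides by everything predicted positive, recall by everything that is actually positive.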

Precision & Recall: multi-class classification

In multi-class classification: separate precision and recall for each class

Precision: Fraction of cumulus-predictions that were correct
Recall: Fraction of all cumulus examples correctly predicted
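
A minimal sketch of the per-class (one-vs-rest) calculation described above, in plain Python. The helper name `precision_recall_for_class` and the labels and predictions are made up for illustration and are not taken from the cloud dataset.

```python
def precision_recall_for_class(labels, preds, cls):
    """One-vs-rest precision and recall for a single class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == cls and p == cls)  # correctly predicted cls
    fp = sum(1 for y, p in pairs if y != cls and p == cls)  # wrongly predicted cls
    fn = sum(1 for y, p in pairs if y == cls and p != cls)  # missed cls examples
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up true labels and predictions (cu = cumulus, cs = clear sky, st = stratiform)
labels = ["cu", "cu", "cu", "cu", "cs", "cs", "st", "st"]
preds = ["cu", "cu", "cu", "cs", "cu", "cs", "cu", "st"]

cumulus_precision, cumulus_recall = precision_recall_for_class(labels, preds, "cu")
print(cumulus_precision, cumulus_recall)
```

In this toy data, 3 of the 5 cumulus predictions are correct and 3 of the 4 true cumulus examples are found, so the two scores differ.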

Cumulus cloud picture

Averaging multi-class metrics

With 7 classes, we have 7 precision and 7 recall scores
We can analyze them per-class, or aggregate:
- Micro average: computed globally, pooling the counts from all classes
- Macro average: unweighted mean of per-class metrics
- Weighted average: mean of per-class metrics, weighted by each class's number of examples
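
To make the three averages concrete, here is a small plain-Python sketch for recall. The class names and counts are hypothetical and chosen to be imbalanced. This is not how torchmetrics is implemented internally.

```python
# Hypothetical per-class counts: class name -> (true positives, false negatives)
counts = {"a": (90, 10), "b": (5, 5), "c": (2, 8)}

# Per-class recall and number of true examples (support) per class
per_class = {k: tp / (tp + fn) for k, (tp, fn) in counts.items()}
support = {k: tp + fn for k, (tp, fn) in counts.items()}
total = sum(support.values())

# Micro: pool the counts across classes, then compute one global recall
micro = sum(tp for tp, _ in counts.values()) / total

# Macro: plain mean of the per-class recalls, every class counts equally
macro = sum(per_class.values()) / len(per_class)

# Weighted: mean of the per-class recalls, weighted by class support
weighted = sum(per_class[k] * support[k] for k in counts) / total

print(micro, macro, weighted)
```

The macro average is pulled down by the two small, poorly predicted classes, while micro is dominated by the large class. For recall specifically, the weighted average coincides with the micro average, because the weights are the same supports that form the per-class denominators. The same does not hold for precision.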

Averaging multi-class metrics

from torchmetrics import Recall

recall_per_class = Recall(task="multiclass", num_classes=7, average=None)
recall_micro = Recall(task="multiclass", num_classes=7, average="micro")
recall_macro = Recall(task="multiclass", num_classes=7, average="macro")
recall_weighted = Recall(task="multiclass", num_classes=7, average="weighted")

When to use each:

Micro: Imbalanced datasets, where each example should count equally
Macro: Performance on small classes matters as much as on large ones
Weighted: Errors in larger classes should count as more important

Evaluation loop

import torch
from torchmetrics import Precision, Recall

metric_precision = Precision(
  task="multiclass", num_classes=7, average="macro"
)
metric_recall = Recall(
  task="multiclass", num_classes=7, average="macro"
)

net.eval()
with torch.no_grad():
    for images, labels in dataloader_test:

        outputs = net(images)
        _, preds = torch.max(outputs, 1)
        metric_precision(preds, labels)
        metric_recall(preds, labels)

precision = metric_precision.compute()
recall = metric_recall.compute()

Import and define the precision and recall metrics
Iterate over test batches with gradient calculation disabled
For each test batch, get the model outputs, take the most likely class, and pass the predictions to the metric objects along with the labels
Compute the metrics once all batches have been processed
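
The steps above rely on the metric objects accumulating state across batches, with the final value computed once after the loop. The sketch below mimics that update-then-compute pattern in plain Python. `RunningMacroRecall` is a hypothetical stand-in, not torchmetrics' implementation, and the batches are made up.

```python
class RunningMacroRecall:
    """Hypothetical stand-in mimicking the update-then-compute pattern."""

    def __init__(self, num_classes):
        self.tp = [0] * num_classes  # correct predictions per true class
        self.fn = [0] * num_classes  # missed examples per true class

    def update(self, preds, labels):
        # Accumulate counts from one batch; nothing is averaged yet
        for p, y in zip(preds, labels):
            if p == y:
                self.tp[y] += 1
            else:
                self.fn[y] += 1

    def compute(self):
        # Simplification: classes with no true examples are skipped
        recalls = [
            tp / (tp + fn) for tp, fn in zip(self.tp, self.fn) if tp + fn > 0
        ]
        return sum(recalls) / len(recalls)


# Made-up batches of (predicted class indices, true class indices)
batches = [
    ([0, 1, 1], [0, 1, 2]),
    ([2, 2, 0], [2, 2, 1]),
]

metric = RunningMacroRecall(num_classes=3)
for preds, labels in batches:
    metric.update(preds, labels)

macro_recall = metric.compute()
print(macro_recall)
```

Accumulating counts and dividing once at the end avoids the bias that averaging per-batch scores would introduce when batches differ in size or class mix.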

print(f"Precision: {precision}")
print(f"Recall: {recall}")

Precision: 0.7284010648727417
Recall: 0.763038694858551

Analyzing performance per class

metric_recall = Recall(
  task="multiclass", num_classes=7, average=None
)
net.eval()
with torch.no_grad():
    for images, labels in dataloader_test:
        outputs = net(images)
        _, preds = torch.max(outputs, 1)
        metric_recall(preds, labels)
recall = metric_recall.compute()

print(recall)

tensor([0.6364, 1.0000, 0.9091, 0.7917, 
        0.5049, 0.9500, 0.5493],
       dtype=torch.float32)

Compute metric with average=None
This gives one score per class
Dataset's .class_to_idx attribute maps class names to indices

dataset_test.class_to_idx

{'cirriform clouds': 0,
 'clear sky': 1,
 'cumulonimbus clouds': 2,
 'cumulus clouds': 3,
 'high cumuliform clouds': 4,
 'stratiform clouds': 5,
 'stratocumulus clouds': 6}

Analyzing performance per class

{
  k: recall[v].item() 
  for k, v 
  in dataset_test.class_to_idx.items()
}

{'cirriform clouds': 0.6363636255264282,
 'clear sky': 1.0,
 'cumulonimbus clouds': 0.9090909361839294,
 'cumulus clouds': 0.7916666865348816,
 'high cumuliform clouds': 0.5048543810844421,
 'stratiform clouds': 0.949999988079071,
 'stratocumulus clouds': 0.5492957830429077}

k = class name, e.g. cirriform clouds
v = class index, e.g. 0
recall[v] = tensor(0.6364, dtype=torch.float32)
recall[v].item() = 0.6364
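
Once the scores are keyed by class name, a natural next step is to rank the classes to see where the model struggles. The sketch below uses the recall values printed above, rounded to four decimals. The variable names `recall_by_class`, `ranked`, and `weakest_class` are introduced here for illustration.

```python
# Per-class recall values copied from the output above (rounded to 4 decimals)
recall_by_class = {
    "cirriform clouds": 0.6364,
    "clear sky": 1.0,
    "cumulonimbus clouds": 0.9091,
    "cumulus clouds": 0.7917,
    "high cumuliform clouds": 0.5049,
    "stratiform clouds": 0.9500,
    "stratocumulus clouds": 0.5493,
}

# Sort ascending so the hardest classes come first
ranked = sorted(recall_by_class.items(), key=lambda item: item[1])
for name, score in ranked:
    print(f"{name:<24} {score:.4f}")

weakest_class = ranked[0][0]
```

With these numbers, high cumuliform and stratocumulus clouds have the lowest recall, while clear sky is always recognized.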

Let's practice!
