The evaluate library

Introduction to LLMs in Python

Jasmin Ludolf

Senior Data Science Content Developer, DataCamp

The evaluate library

import evaluate

accuracy = evaluate.load("accuracy")
print(accuracy.description)
Accuracy is the proportion of correct
predictions among the total number of cases
processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
TP: True positive
TN: True negative
FP: False positive
FN: False negative
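
As a quick sanity check of the formula, here is a small hand-computed example (the confusion-matrix counts are made up purely for illustration):

# Hypothetical counts, only to illustrate the accuracy formula
tp, tn, fp, fn = 40, 45, 5, 10
print((tp + tn) / (tp + tn + fp + fn))  # (40 + 45) / 100 = 0.85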

 

  • Metric: evaluate model performance based on ground truth

 

  • Comparison: compare two models

 

  • Measurement: gain insight into dataset properties (see the loading sketch below)
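
Comparisons and measurements are loaded with the same evaluate.load() call; a minimal sketch, assuming the "mcnemar" comparison and "word_length" measurement modules are available in your version of evaluate:

import evaluate

# Comparison: contrast two sets of predictions against the same references
mcnemar = evaluate.load("mcnemar", module_type="comparison")

# Measurement: describe properties of the data itself
word_length = evaluate.load("word_length", module_type="measurement")
print(word_length.compute(data=["hello world", "evaluate this sentence"]))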

Features attribute

print(accuracy.features)
{'predictions': Value(dtype='int32', id=None),
 'references': Value(dtype='int32', id=None)}

Inspecting the inputs required by a metric

  • 'predictions': model outputs
  • 'references': ground truth
  • .features: shows the data type each input expects, e.g. 'int32' for class labels or 'float32' for continuous values
f1 = evaluate.load("f1")
print(f1.features)
{'predictions': Value(dtype='int32', id=None),
 'references': Value(dtype='int32', id=None)}
pearson_corr = evaluate.load("pearsonr")
print(pearson_corr.features)
{'predictions': Value(dtype='float32', id=None),
 'references': Value(dtype='float32', id=None)}
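
Because pearsonr expects floats, it can score continuous predictions directly; a small illustrative call (the values are made up):

print(pearson_corr.compute(references=[1.0, 2.0, 3.0, 4.0],
                           predictions=[0.9, 2.1, 2.8, 4.2]))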

LLM tasks and metrics

Figure: evaluation metrics for language tasks

Classification metrics

accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")
from transformers import pipeline

# model and tokenizer: the fine-tuned model and its tokenizer
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# evaluation_text: a list of held-out texts to classify
predictions = classifier(evaluation_text)

# Map the pipeline's string labels to integer class labels
predicted_labels = [1 if pred["label"] == "POSITIVE" else 0 for pred in predictions]

Metric outputs

real_labels = [0,1,0,1,1]
predicted_labels = [0,0,0,1,1]

print(accuracy.compute(references=real_labels, predictions=predicted_labels))
print(precision.compute(references=real_labels, predictions=predicted_labels))
print(recall.compute(references=real_labels, predictions=predicted_labels))
print(f1.compute(references=real_labels, predictions=predicted_labels))
{'accuracy': 0.8}
{'precision': 1.0}
{'recall': 0.6666666666666666}
{'f1': 0.8}
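
These labels are binary; for multi-class labels, precision, recall, and F1 also need an averaging strategy. A minimal sketch, assuming the average argument mirrors scikit-learn's options ("macro", "micro", "weighted"):

multi_real = [0, 1, 2, 2, 1]
multi_pred = [0, 2, 2, 2, 1]
print(f1.compute(references=multi_real, predictions=multi_pred, average="macro"))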

Evaluating our fine-tuned model

# Load the saved model and tokenizer with
# .from_pretrained("my_finetuned_files")
import torch

new_data = ["This movie was disappointing!", "This is the best movie ever!"]

new_input = tokenizer(new_data, return_tensors="pt", padding=True,
                      truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**new_input)

predicted = torch.argmax(outputs.logits, dim=1).tolist()
real = [0,1]
print(accuracy.compute(references=real,
                       predictions=predicted))
print(precision.compute(references=real,
                        predictions=predicted))
print(recall.compute(references=real,
                     predictions=predicted))
print(f1.compute(references=real, 
                 predictions=predicted))
{'accuracy': 1.0}
{'precision': 1.0}
{'recall': 1.0}
{'f1': 1.0}
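
The four metrics can also be bundled so a single call returns all of them; a minimal sketch, assuming evaluate.combine() is available in your version of the library:

clf_metrics = evaluate.combine(["accuracy", "precision", "recall", "f1"])
print(clf_metrics.compute(references=real, predictions=predicted))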

Choosing the right metric

 

  • Be aware: each metric brings its own insights, but each also has limitations

 

  • Be comprehensive: use a combination of metrics (and domain-specific KPIs where possible)



Let's practice!

