Evaluation metrics for text generation

Deep Learning for Text with PyTorch

Shubham Jain

Instructor

Evaluating text generation

  • Text generation tasks create human-like text
  • Standard classification metrics such as accuracy and F1 fall short for these tasks
  • We need metrics that evaluate the quality of generated text against reference text
  • Two widely used options: BLEU and ROUGE

[Image: DALL·E chatbot for text generation]


BLEU (Bilingual Evaluation Understudy)

  • Compares the generated text and the reference text
  • Checks for the occurrence of n-grams
  • In the sentence "The cat is on the mat"
    • 1-grams (unigrams): ["the", "cat", "is", "on", "the", "mat"]
    • 2-grams (bigrams): ["the cat", "cat is", "is on", "on the", "the mat"]
    • and so on for n-grams
  • Scores range from 0 (no match) to 1.0 (a perfect match)
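As a quick illustration (not part of the course code), the n-grams above can be extracted with a few lines of plain Python:

```python
def ngrams(sentence, n):
    """Return the n-grams of a sentence as space-joined strings."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cat is on the mat"
print(ngrams(sentence, 1))  # ['the', 'cat', 'is', 'on', 'the', 'mat']
print(ngrams(sentence, 2))  # ['the cat', 'cat is', 'is on', 'on the', 'the mat']
```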

Calculating BLEU score with PyTorch

from torchmetrics.text import BLEUScore

generated_text = ['the cat is on the mat']
real_text = [['there is a cat on the mat', 'a cat is on the mat']]

bleu = BLEUScore()
bleu_metric = bleu(generated_text, real_text)
print("BLEU Score:", bleu_metric)
BLEU Score: tensor(0.7598)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • Compares a generated text to a reference text in two ways
  • ROUGE-N: Considers overlapping n-grams (N=1 for unigrams, 2 for bigrams, etc.) in both texts
  • ROUGE-L: Looks at the longest common subsequence (LCS) between the texts
  • ROUGE Metrics:
    • F-measure: Harmonic mean of precision and recall
    • Precision: Proportion of n-grams in the generated text that appear in the reference text
    • Recall: Proportion of n-grams in the reference text that appear in the generated text
  • 'rouge1', 'rouge2', and 'rougeL' prefixes refer to 1-gram, 2-gram, or LCS, respectively
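To make the precision/recall definitions concrete, ROUGE-1 for the example on the next slide can be computed by hand. This is a simplified sketch: it lowercases, strips punctuation, and counts unigram matches, ignoring torchmetrics' internal tokenization details:

```python
generated = "Hello, how are you doing?"
reference = "Hello, how are you?"

def tokens(text):
    # Simple normalization: lowercase and strip punctuation
    return text.lower().replace(",", "").replace("?", "").split()

gen, ref = tokens(generated), tokens(reference)

precision = sum(1 for t in gen if t in ref) / len(gen)  # 4 / 5 = 0.8
recall = sum(1 for t in ref if t in gen) / len(ref)     # 4 / 4 = 1.0
fmeasure = 2 * precision * recall / (precision + recall)
print(precision, recall, round(fmeasure, 4))  # 0.8 1.0 0.8889
```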

Calculating ROUGE score with PyTorch

from torchmetrics.text import ROUGEScore

generated_text = 'Hello, how are you doing?'
real_text = "Hello, how are you?"

rouge = ROUGEScore()
rouge_score = rouge([generated_text], [[real_text]])
print("ROUGE Score:", rouge_score)

ROUGE score: output

ROUGE Score: {'rouge1_fmeasure': tensor(0.8889),
              'rouge1_precision': tensor(0.8000),
              'rouge1_recall': tensor(1.),
              'rouge2_fmeasure': tensor(0.8571),
              'rouge2_precision': tensor(0.7500),
              'rouge2_recall': tensor(1.),
              'rougeL_fmeasure': tensor(0.8889),
              'rougeL_precision': tensor(0.8000),
              'rougeL_recall': tensor(1.),
              'rougeLsum_fmeasure': tensor(0.8889),
              'rougeLsum_precision': tensor(0.8000),
              'rougeLsum_recall': tensor(1.)}
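The rougeL value of 0.8889 can also be reproduced by hand from the longest common subsequence. A minimal sketch, assuming the same simple tokenization as in the ROUGE-1 walkthrough:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

gen = "hello how are you doing".split()  # generated text, punctuation stripped
ref = "hello how are you".split()        # reference text

lcs = lcs_len(gen, ref)                              # 4
precision, recall = lcs / len(gen), lcs / len(ref)   # 0.8, 1.0
print(2 * precision * recall / (precision + recall)) # ~0.8889
```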

Considerations and limitations

  • BLEU and ROUGE evaluate word overlap, not semantic understanding
  • Scores are sensitive to the length of the generated text
  • The quality of the reference text affects the scores
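For instance, a paraphrase that preserves the meaning but swaps the words scores poorly on word overlap. An illustrative, hypothetical example in plain Python:

```python
reference = "the cat is on the mat".split()
paraphrase = "the feline sits on the rug".split()  # same meaning, different words

# Unigram precision: fraction of generated words found in the reference
matches = sum(1 for w in paraphrase if w in reference)
print(matches / len(paraphrase))  # 0.5, despite the identical meaning
```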

Let's practice!

