Evaluation metrics for text generation

Deep Learning for Text with PyTorch

Shubham Jain

Instructor

Evaluating text generation

  • Text generation tasks create human-like text
  • Standard classification metrics such as accuracy and F1 fall short for these tasks
  • We need metrics that evaluate the quality of generated text against reference text
  • Two widely used options: BLEU and ROUGE

[Image: DALL·E chatbot for text generation]


BLEU (Bilingual Evaluation Understudy)

  • Compares the generated text and the reference text
  • Checks for the occurrence of n-grams
  • In the sentence "The cat is on the mat"
    • 1-grams (unigrams): ["the", "cat", "is", "on", "the", "mat"]
    • 2-grams (bigrams): ["the cat", "cat is", "is on", "on the", "the mat"]
    • and so on for n-grams
  • Scores range from 0 (no match) to 1.0 (a perfect match)
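As a quick illustration (not part of the course code), the n-grams above can be extracted with a few lines of plain Python:

```python
def ngrams(sentence, n):
    """Return the n-grams of a sentence as space-joined strings."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cat is on the mat"
print(ngrams(sentence, 1))  # ['the', 'cat', 'is', 'on', 'the', 'mat']
print(ngrams(sentence, 2))  # ['the cat', 'cat is', 'is on', 'on the', 'the mat']
```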

Calculating BLEU score with PyTorch

from torchmetrics.text import BLEUScore

generated_text = ['the cat is on the mat']
real_text = [['there is a cat on the mat', 'a cat is on the mat']]

bleu = BLEUScore()
bleu_metric = bleu(generated_text, real_text)
print("BLEU Score:", bleu_metric)
BLEU Score: tensor(0.7598)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • Compares a generated text to a reference text in two ways
  • ROUGE-N: Considers overlapping n-grams (N=1 for unigrams, 2 for bigrams, etc.) in both texts
  • ROUGE-L: Looks at the longest common subsequence (LCS) between the texts
  • ROUGE Metrics:
    • F-measure: Harmonic mean of precision and recall
    • Precision: Proportion of n-grams in the generated text that appear in the reference text
    • Recall: Proportion of n-grams in the reference text that appear in the generated text
  • 'rouge1', 'rouge2', and 'rougeL' prefixes refer to 1-gram, 2-gram, or LCS, respectively
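To make the precision/recall definitions concrete, ROUGE-1 for the example on the next slide can be computed by hand. This is a simplified sketch: it lowercases, strips punctuation, and counts unigram matches, ignoring torchmetrics' internal tokenization details:

```python
generated = "Hello, how are you doing?"
reference = "Hello, how are you?"

def tokens(text):
    # Simple normalization: lowercase and strip punctuation
    return text.lower().replace(",", "").replace("?", "").split()

gen, ref = tokens(generated), tokens(reference)

precision = sum(1 for t in gen if t in ref) / len(gen)  # 4 / 5 = 0.8
recall = sum(1 for t in ref if t in gen) / len(ref)     # 4 / 4 = 1.0
fmeasure = 2 * precision * recall / (precision + recall)
print(precision, recall, round(fmeasure, 4))  # 0.8 1.0 0.8889
```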

Calculating ROUGE score with PyTorch

from torchmetrics.text import ROUGEScore

generated_text = 'Hello, how are you doing?'
real_text = "Hello, how are you?"

rouge = ROUGEScore()
rouge_score = rouge([generated_text], [[real_text]])
print("ROUGE Score:", rouge_score)

ROUGE score: output

ROUGE Score: {'rouge1_fmeasure': tensor(0.8889),
              'rouge1_precision': tensor(0.8000),
              'rouge1_recall': tensor(1.),
              'rouge2_fmeasure': tensor(0.8571),
              'rouge2_precision': tensor(0.7500),
              'rouge2_recall': tensor(1.),
              'rougeL_fmeasure': tensor(0.8889),
              'rougeL_precision': tensor(0.8000),
              'rougeL_recall': tensor(1.),
              'rougeLsum_fmeasure': tensor(0.8889),
              'rougeLsum_precision': tensor(0.8000),
              'rougeLsum_recall': tensor(1.)}
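The rougeL value of 0.8889 can also be reproduced by hand from the longest common subsequence. A minimal sketch, assuming the same simple tokenization as in the ROUGE-1 walkthrough:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

gen = "hello how are you doing".split()  # generated text, punctuation stripped
ref = "hello how are you".split()        # reference text

lcs = lcs_len(gen, ref)                              # 4
precision, recall = lcs / len(gen), lcs / len(ref)   # 0.8, 1.0
print(2 * precision * recall / (precision + recall)) # ~0.8889
```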

Considerations and limitations

  • BLEU and ROUGE evaluate word overlap, not semantic understanding
  • Scores are sensitive to the length of the generated text
  • The quality of the reference text affects the scores
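For instance, a paraphrase that preserves the meaning but swaps the words scores poorly on word overlap. An illustrative, hypothetical example in plain Python:

```python
reference = "the cat is on the mat".split()
paraphrase = "the feline sits on the rug".split()  # same meaning, different words

# Unigram precision: fraction of generated words found in the reference
matches = sum(1 for w in paraphrase if w in reference)
print(matches / len(paraphrase))  # 0.5, despite the identical meaning
```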

Let's practice!

