Model evaluation

Generative AI Concepts

Daniel Tedesco

Data Lead, Google

Why evaluate anyway?

Assess performance and effectiveness of a model:

  • Measure progress
  • Rigorous model comparison
  • Benchmark against human performance

Evaluating generative AIs

Quantitative Metrics

Numbers, representing quantitative metrics

  • Discriminative model evaluation metrics
  • Generative model-specific metrics

Human-centric Metrics

Conversation bubbles, representing human-centric metrics

  • Human performance comparison
  • Intelligent evaluation

Discriminative model evaluation techniques

Measure performance on well-defined tasks

Pros:

  • Widely accepted and understood
  • Easy to calculate and compare

Cons:

  • Do not capture subjective nature of generated content

A dartboard with darts near the bullseye.
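Because discriminative metrics are easy to calculate, they can be sketched in a few lines. Below is a minimal illustration of accuracy, precision, recall, and F1 on a hypothetical set of binary labels (the labels themselves are invented for demonstration):

```python
# Hypothetical ground-truth and predicted binary labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Count true positives, false positives, and false negatives.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```

These numbers are objective and comparable across models, which is exactly the "pro" above; what they cannot tell you is whether a generated sentence or image is any good.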


Generative model-specific metrics

Customized for particular generative tasks

Pros:

  • Nuanced criteria, like realism, diversity, and novelty
  • Many well-known metrics

Cons:

  • Cannot capture many subjective elements
  • Often do not generalize

Illustrations of cows.
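One well-known family of generative-text metrics is n-gram overlap, the idea behind BLEU. The sketch below computes modified unigram precision, BLEU's simplest building block, on invented example sentences:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference,
    with each word's credit clipped by its count in the reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / sum(cand.values())

score = unigram_precision("the cat sat on the mat",
                          "the cat is on the mat")
print(score)  # 5 of 6 candidate words match -> 0.8333...
```

A high score here means surface overlap with a reference, not quality; that gap is why such metrics often fail to generalize beyond the task they were designed for.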


Human performance comparison


Pros:

  • Benchmarks against human abilities
  • Demonstrates practical applicability

Cons:

  • Unfair comparison

An AI competing against a human.


Award-winning AIs

Human Competitions

Award-winning AI-generated art piece, featured with its winning blue ribbon

Human Standardized Tests

A chart showing GPT-4's performance on several human standardized tests. It outperforms most students in well-known tests such as the Uniform Bar Exam and GRE.

1 https://twitter.com/colostatefair/status/1565486317839863809, OpenAI

The gold standard

Intelligent evaluation by humans or other AIs

Pros:

  • Captures subjective aspects

Cons:

  • Slow, costly, and difficult to standardize
  • Subject to human biases and irregularity
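The "difficult to standardize" con above is often quantified with inter-annotator agreement. A common statistic for two raters is Cohen's kappa, sketched here on hypothetical ratings: it measures how much raters agree beyond what chance alone would produce.

```python
def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: probability of agreeing by chance,
    # given each rater's label frequencies.
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical quality ratings of six generated outputs by two evaluators.
a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(cohens_kappa(a, b))  # ~0.33: only modest agreement beyond chance
```

A kappa near 1 means raters are consistent; values like the one above signal the biases and irregularity that make human evaluation hard to standardize.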

Turing's classic test


  • Proposed by computer scientist Alan Turing
  • Human evaluator judges AI-generated content
  • Passes if evaluator cannot distinguish AI from human
  • But human behavior is not always the right standard

A depiction of the Turing test setup, with a human evaluator, a computer screen displaying AI-generated content, and a human-generated content, illustrating the process of distinguishing between the two.


Let's practice!
