Testing

LLMOps Concepts

Max Knobbout, PhD

Applied Scientist, Uber

LLM lifecycle: Testing

Overview of the LLM application lifecycle phases

Why do we need to test?

A playful image of two cartoon characters holding a thumbs up and a thumbs down sign

  • LLMs make mistakes
  • Testing is vital for assessing the application's readiness for deployment
  • We will address evaluating the output

Traditional ML versus LLM application testing

Traditional supervised machine learning:

  • Need labeled train and test data
  • Metrics focusing on accuracy or closeness to target

Picture of train and test set for traditional ML

LLM applications:

  • Need test data, not necessarily labeled
  • Quality of output using a variety of metrics

Picture of train and test set for LLM applications

Step 1: Building a test set

Playful image of cartoon characters collecting data

  • The test set should already be built by this point
  • Test data must closely resemble real-world scenarios
  • Various tools can help us in this process
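
A test set for an LLM application can be as simple as a list of prompts, some with a reference answer and some without. The sketch below is one possible layout, not a prescribed format: the `EvalExample` class, its field names, and the example prompts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalExample:
    """One test-set entry; all field names here are illustrative."""
    prompt: str
    reference: Optional[str] = None  # None when no labeled answer exists
    source: str = "synthetic"        # e.g. "production_log" or "synthetic"


# Hypothetical examples; real test data should mirror real-world usage
test_set = [
    EvalExample("What is your refund policy?", source="production_log"),
    EvalExample("Classify the sentiment: 'Great service!'", reference="positive"),
    EvalExample("Summarize my last three orders.", source="production_log"),
]

# Labeled and unlabeled examples usually call for different metrics
labeled = [ex for ex in test_set if ex.reference is not None]
unlabeled = [ex for ex in test_set if ex.reference is None]
print(f"{len(labeled)} labeled, {len(unlabeled)} unlabeled examples")
```

Keeping track of where each example came from makes it easier to check how much of the test set reflects real usage.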

Step 2: Choosing our metric

If there is a correct answer...

  • ... use machine learning metrics. Example:
    • Accuracy

Flowchart pointing to "Use ML metrics"
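
When each test prompt has a single correct answer, accuracy is the share of outputs that match it. A minimal sketch, assuming exact match after trimming whitespace and lowercasing (the labels and outputs are made up for illustration):

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match the target after light normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one example")
    matches = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, targets)
    )
    return matches / len(targets)


# Hypothetical sentiment-classification outputs from an LLM application
predictions = ["positive", "Negative ", "neutral", "positive"]
targets = ["positive", "negative", "negative", "positive"]
score = accuracy(predictions, targets)
print(f"accuracy = {score:.2f}")  # 3 of 4 match -> 0.75
```

The normalization step matters for LLM output, which often differs from the label only in casing or stray whitespace.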

Step 2: Choosing our metric

If there is a reference answer...

  • ... use statistical methods.
  • ... use model-based methods. Example:
    • LLM judges

Flowchart pointing to "Use text comparison metrics"
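
One simple statistical way to compare an output with a reference answer is word overlap. The sketch below computes a unigram-overlap F1 score; it is an illustrative stand-in for established text comparison metrics, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference text."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts min(count) times
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and model output
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
f1 = token_f1(candidate, reference)
print(f"token F1 = {f1:.3f}")  # 5 of 6 tokens overlap -> 0.833
```

Overlap scores reward shared wording rather than shared meaning, which is one reason to also consider model-based methods such as LLM judges.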

Step 2: Choosing our metric

If we have access to human feedback...

  • ... let humans rate the text. Examples:
    • Rate quality
    • Rate relevance
    • Rate coherence
  • ... use a model-based approach. Examples:
    • Predict rating based on past feedback
    • Ask LLM judge if feedback was incorporated

Flowchart pointing to "Use feedback score metrics"
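
Human ratings become a feedback score once they are aggregated. A minimal sketch, assuming reviewers score each response from 1 to 5 on the three criteria above (the ratings are invented for illustration):

```python
from statistics import mean

# Hypothetical 1-5 ratings from human reviewers for three responses
ratings = [
    {"quality": 4, "relevance": 5, "coherence": 4},
    {"quality": 2, "relevance": 3, "coherence": 4},
    {"quality": 5, "relevance": 4, "coherence": 5},
]


def average_ratings(ratings):
    """Mean score per criterion across all rated responses."""
    criteria = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in criteria}


feedback_scores = average_ratings(ratings)
for criterion, value in feedback_scores.items():
    print(f"{criterion}: {value:.2f}")
```

Reporting each criterion separately keeps a weak spot, such as low relevance, from being hidden inside a single overall average.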

Step 2: Choosing our metric

If there's no human feedback...

  • ... use unsupervised metrics. Examples:
    • Coherence
    • Fluency
    • Diversity

Flowchart pointing to "Use unsupervised metrics"
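
Unsupervised metrics need only the generated text. As one example for diversity, the sketch below computes a distinct-n ratio: the number of unique n-grams divided by the total number of n-grams across a set of outputs. Whitespace tokenization and the sample outputs are assumptions made for illustration.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a set of outputs."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Hypothetical model outputs for two similar prompts
outputs = ["the weather is nice today", "the weather is bad today"]
diversity = distinct_n(outputs, n=2)
print(f"distinct-2 = {diversity:.2f}")  # 6 unique of 8 bigrams -> 0.75
```

A value near 1 means the outputs rarely repeat phrasing; a value near 0 means they are largely repetitive.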

Step 3: Define optional secondary metrics

Output characteristics:

  • 🎭 Bias
  • ☠ Toxicity
  • 🤝 Helpfulness

Operational characteristics:

  • ⏱ Latency
  • 💰 Total incurred cost
  • 💻 Memory usage
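
Operational characteristics such as latency and cost can be recorded while the test set runs. A rough sketch: `fake_llm` is a hypothetical stand-in for a real model call, the price is an invented number, and counting whitespace-separated words is only a crude substitute for a real tokenizer.

```python
import time


def measure_latency(fn, prompt):
    """Call fn(prompt) and return its output with elapsed wall-clock seconds."""
    start = time.perf_counter()
    output = fn(prompt)
    return output, time.perf_counter() - start


def estimate_cost(prompt, output, price_per_1k_tokens):
    """Approximate cost from a whitespace word count (not a real tokenizer)."""
    n_tokens = len(prompt.split()) + len(output.split())
    return n_tokens / 1000 * price_per_1k_tokens


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call."""
    return "stub answer to: " + prompt


output, latency = measure_latency(fake_llm, "hello world")
cost = estimate_cost("hello world", output, price_per_1k_tokens=0.5)
print(f"latency = {latency:.6f}s, estimated cost = {cost:.4f}")
```

Logging these values per test example makes it possible to see whether a quality gain comes with slower or more expensive responses.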

The development cycle

Development cycle where we added the activity of fine-tuning

The development cycle

Development cycle where we added the activity of testing

The development cycle

Development cycle where we added the activity of deploying

Let's practice!
