Introduction to RAG evaluation

Retrieval Augmented Generation (RAG) with LangChain

Meri Nova

Machine Learning Engineer

Types of RAG evaluation

A RAG workflow highlighting the processes that can be evaluated: retrieval, LLM hallucination, relevance of the answer to the input question, and comparison against a reference answer.

¹ Image credit: LangSmith

Output accuracy: string evaluation

query = "What are the main components of RAG architecture?"
predicted_answer = "Training and encoding"
ref_answer = "Retrieval and Generation"

Output accuracy: string evaluation

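from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Grading prompt: the LLM sees the question, the reference answer, and the predicted answer
prompt_template = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:{query}
Here is the real answer:{answer}
You are grading the following predicted answer:{result}
Respond with CORRECT or INCORRECT:
Grade:"""

prompt = PromptTemplate(
    input_variables=["query", "answer", "result"],
    template=prompt_template
)

# temperature=0 keeps the grader's verdicts as repeatable as possible
eval_llm = ChatOpenAI(temperature=0, model="gpt-4o-mini", openai_api_key='...')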

Output accuracy: string evaluation

from langsmith.evaluation import LangChainStringEvaluator

qa_evaluator = LangChainStringEvaluator(
    "qa",
    config={
        "llm": eval_llm,
        "prompt": prompt  # the custom grading PromptTemplate
    }
)


score = qa_evaluator.evaluator.evaluate_strings(
    prediction=predicted_answer,
    reference=ref_answer,
    input=query
)
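To make the argument mapping concrete, here is a minimal sketch in plain Python, with no LangChain import and no API call. It assumes the evaluator fills the template's `{query}`, `{answer}`, and `{result}` placeholders from the `input`, `reference`, and `prediction` arguments respectively. `render_grading_prompt` is a hypothetical helper written for illustration, not part of any library.

```python
# Hypothetical illustration: render the grading prompt with str.format to show
# which argument is assumed to fill which placeholder.
PROMPT_TEMPLATE = """You are an expert professor specialized in grading students' answers to questions.
You are grading the following question:{query}
Here is the real answer:{answer}
You are grading the following predicted answer:{result}
Respond with CORRECT or INCORRECT:
Grade:"""


def render_grading_prompt(question: str, reference: str, prediction: str) -> str:
    """Return the text the grading LLM would receive (assumed mapping)."""
    return PROMPT_TEMPLATE.format(
        query=question,      # evaluate_strings(input=...)
        answer=reference,    # evaluate_strings(reference=...)
        result=prediction,   # evaluate_strings(prediction=...)
    )


rendered = render_grading_prompt(
    question="What are the main components of RAG architecture?",
    reference="Retrieval and Generation",
    prediction="Training and encoding",
)
print(rendered)
```

Printing the rendered prompt is a quick way to check that the reference and the prediction have not been swapped before spending tokens on an evaluation run.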

Output accuracy: string evaluation

print(f"Score: {score}")

Score: {'reasoning': 'INCORRECT', 'value': 'INCORRECT', 'score': 0}
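The `value` and `score` fields suggest that the grader's text reply is mapped to a number. The sketch below shows one way such a mapping could work; `parse_grade` is a hypothetical function written for this example and is not the parser LangChain or LangSmith actually uses. It compares whole tokens because "INCORRECT" contains "CORRECT" as a substring.

```python
from typing import Optional


def parse_grade(reply: str) -> dict:
    """Map a grader reply to a verdict and a numeric score (illustrative only)."""
    tokens = reply.strip().upper().split()
    # Look at the last token and drop trailing punctuation such as "." or ":".
    verdict = tokens[-1].strip(".:!") if tokens else ""
    score: Optional[int]
    if verdict == "CORRECT":
        score = 1
    elif verdict == "INCORRECT":
        score = 0
    else:
        score = None  # the reply did not contain a recognizable verdict
    return {"value": verdict if score is not None else None, "score": score}


print(parse_grade("INCORRECT"))
print(parse_grade("Grade: CORRECT"))
```

Returning `None` for an unrecognized reply, rather than guessing, keeps malformed grader output from silently counting as a pass or a fail.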


Ragas framework

A table comparing generation and retrieval metrics.

¹ Image credit: Ragas

Faithfulness

Does the generated output faithfully reflect the context?

$$ \text{Faithfulness} = \frac{\text{No. of claims that can be inferred from the context}}{\text{Total no. of claims}} $$

Normalized to [0, 1]
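As a worked instance of the ratio, the sketch below computes faithfulness from a list of per-claim verdicts. In Ragas the claims are extracted and checked by an LLM. Here the verdicts are hand-written, hypothetical values, and `faithfulness_score` is an illustrative helper rather than a Ragas function.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that can be inferred from the context."""
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    # True counts as 1 and False as 0, so sum() gives the number of supported claims.
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: the answer makes 4 claims, 3 of them supported by the context.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 0.75
```

A score of 1.0 means every claim is supported by the retrieved context, and 0.0 means none is.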

Evaluating faithfulness

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import faithfulness


llm = ChatOpenAI(model="gpt-4o-mini", api_key="...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key="...")


faithfulness_chain = EvaluatorChain(
    metric=faithfulness,
    llm=llm,
    embeddings=embeddings
)

Evaluating faithfulness

eval_result = faithfulness_chain({
  "question": "How does the RAG model improve question answering with LLMs?",
  "answer": "The RAG model improves question answering by combining the retrieval of documents...",
  "contexts": [
    "The RAG model integrates document retrieval with LLMs by first retrieving relevant passages...",
    "By incorporating retrieval mechanisms, RAG leverages external knowledge sources, allowing the...",
  ]
})


print(eval_result)

'faithfulness': 1.0

Context precision

Are the retrieved documents relevant to the query?
Normalized to [0, 1] → 1 = highly relevant

from ragas.metrics import context_precision

llm = ChatOpenAI(model="gpt-4o-mini", api_key="...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key="...")

context_precision_chain = EvaluatorChain(
    metric=context_precision,
    llm=llm,
    embeddings=embeddings
)

Evaluating context precision

eval_result = context_precision_chain({
  "question": "How does the RAG model improve question answering with large language models?",
  "ground_truth": "The RAG model improves question answering by combining the retrieval of...",
  "contexts": [
    "The RAG model integrates document retrieval with LLMs by first retrieving...",
    "By incorporating retrieval mechanisms, RAG leverages external knowledge sources...",
  ]
})


print(f"Context Precision: {eval_result['context_precision']}")

Context Precision: 0.99999999995
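To see how a score like this can arise, the sketch below computes a rank-aware context precision from per-document relevance verdicts: the mean of precision@k taken over the ranks k that hold a relevant document. This follows the definition given in the Ragas documentation as best understood here. `context_precision_from_verdicts` is a hypothetical helper and the verdicts are hand-written, whereas Ragas obtains them from an LLM. The printed value just below 1 would be consistent with a small constant added to the denominator to avoid division by zero, but that is an assumption the slides do not confirm.

```python
def context_precision_from_verdicts(relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k whose retrieved document is relevant."""
    total_relevant = sum(relevant)
    if total_relevant == 0:
        return 0.0  # no relevant document was retrieved
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k, counted only at relevant ranks
    return weighted_sum / total_relevant


# Hypothetical verdicts, in retrieval order.
print(context_precision_from_verdicts([True, True]))   # both contexts relevant
print(context_precision_from_verdicts([False, True]))  # relevant context ranked second
```

Because precision@k is taken at each relevant rank, the metric rewards retrievers that place relevant documents first: the same relevant document scores lower when it is ranked below an irrelevant one.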

Let's practice!
