Three Important Things
The paper proposes three metrics for RAG evaluation:
1. Faithfulness
First, decompose the answer into smaller statements:
Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]
Then verify whether each of the statements is supported by the retrieved context:
Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.
statement: [statement 1]
...
statement: [statement n]
The final faithfulness score is the number of supported statements |V| divided by the total number of statements |S|:
\[F = \frac{|V|}{|S|}\]
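To make the two-step pipeline concrete, here is a minimal sketch of the faithfulness computation. It assumes a generic llm callable (prompt string in, completion string out) and uses very naive parsing of the statements and Yes/No verdicts; the prompt strings are the ones quoted above.

```python
from typing import Callable

# Placeholder LLM interface: a prompt string goes in, the model's text
# completion comes out. Any chat/completion API can be wrapped to fit this.
LLM = Callable[[str], str]

STATEMENT_PROMPT = (
    "Given a question and answer, create one or more statements from each "
    "sentence in the given answer.\nquestion: {question}\nanswer: {answer}"
)

VERIFY_PROMPT = (
    "Consider the given context and following statements, then determine "
    "whether they are supported by the information present in the context. "
    "Provide a brief explanation for each statement before arriving at the "
    "verdict (Yes/No). Provide a final verdict for each statement in order at "
    "the end in the given format. Do not deviate from the specified format.\n"
    "context: {context}\n{statements}"
)

def faithfulness(question: str, answer: str, context: str, llm: LLM) -> float:
    # Step 1: decompose the answer into statements (assume one per output line).
    decomposition = llm(STATEMENT_PROMPT.format(question=question, answer=answer))
    statements = [s.strip() for s in decomposition.splitlines() if s.strip()]
    if not statements:
        return 0.0

    # Step 2: ask the verifier whether each statement is supported by the context.
    statement_block = "\n".join(f"statement: {s}" for s in statements)
    verdicts = llm(VERIFY_PROMPT.format(context=context, statements=statement_block))

    # Step 3: F = |V| / |S|, supported statements over total statements.
    # Naively count lines ending in "Yes"; a real parser would follow the
    # exact verdict format requested in the prompt.
    supported = sum(
        1 for line in verdicts.splitlines() if line.strip().lower().endswith("yes")
    )
    return supported / len(statements)
```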
2. Answer Relevance
This metric assesses whether the answer actually addresses the question, but does not consider the factuality of the answer.
This is done by generating questions using the answer:
Generate a question for the given answer.
answer: [answer]
and then taking embeddings of these questions.
The answer relevance score is the average cosine similarity between the embedding of the actual question and the embeddings of the n generated questions:
\[\mathrm{AR}=\frac{1}{n} \sum_{i=1}^n \operatorname{sim}\left(q, q_i\right)\]
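Under the same assumptions (a placeholder llm callable, plus an embed callable that maps text to a vector with any sentence-embedding model), the answer relevance score can be sketched as:

```python
from typing import Callable

import numpy as np

def answer_relevance(
    question: str,
    answer: str,
    llm: Callable[[str], str],
    embed: Callable[[str], np.ndarray],
    n: int = 3,
) -> float:
    # Generate n candidate questions from the answer alone.
    prompt = f"Generate a question for the given answer.\nanswer: {answer}"
    generated = [llm(prompt) for _ in range(n)]

    # AR = (1/n) * sum_i sim(q, q_i), where sim is the cosine similarity
    # between the original question's embedding and each generated question's.
    q_vec = embed(question)
    sims = []
    for gq in generated:
        g_vec = embed(gq)
        sims.append(
            float(np.dot(q_vec, g_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(g_vec)))
        )
    return float(np.mean(sims))
```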
3. Context Relevance
This metric measures how much of the retrieved context is actually relevant to answering the question, and hence penalizes the retrieval of extraneous information.
Provided with a question and the retrieved context, the following prompt is used to extract the relevant statements:
Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information". While extracting candidate sentences you're not allowed to make any changes to sentences from given context.
The context relevance score is then computed as:
\[\mathrm{CR}=\frac{\text{number of extracted sentences}}{\text{total number of sentences in } c(q)}\]
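And a sketch of context relevance with the same placeholder llm callable; the regex sentence splitter here is only illustrative, and a real implementation should tokenize the extraction output and the original context the same way:

```python
import re
from typing import Callable, List

CONTEXT_PROMPT = (
    "Please extract relevant sentences from the provided context that can "
    "potentially help answer the following question. If no relevant sentences "
    "are found, or if you believe the question cannot be answered from the "
    'given context, return the phrase "Insufficient Information". While '
    "extracting candidate sentences you're not allowed to make any changes to "
    "sentences from given context.\nquestion: {question}\ncontext: {context}"
)

def sentences(text: str) -> List[str]:
    # Crude sentence splitter on terminal punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def context_relevance(question: str, context: str, llm: Callable[[str], str]) -> float:
    extracted = llm(CONTEXT_PROMPT.format(question=question, context=context))
    if "insufficient information" in extracted.lower():
        return 0.0

    # CR = extracted sentences / total sentences in the retrieved context c(q).
    total = len(sentences(context))
    return len(sentences(extracted)) / total if total else 0.0
```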
Most Glaring Deficiency
The paper presents very little experimental data. Some users on Reddit have also complained that it does not work well in practice: reasonable answers are scored very poorly, and it fails to differentiate between good and bad responses.
Conclusions for Future Work
Despite its limitations, RAGAS is still quite influential in the sphere of LLM metrics, and some of these metrics can probably be used to evaluate RAG systems, with a YMMV caveat.