Three Important Things
1. Iterative Retrieval-Generation Synergy
Instead of retrieving context and generating a response just once, RAG can perform this process iteratively, giving the LLM the opportunity to retrieve more relevant information in the next iteration. This helps because there may be a semantic gap between the original question and the context needed to answer it.
For instance, in one example that requires multi-hop reasoning, the model first retrieves that Jesse Hogan was the player who won the award, but hallucinates the wrong height.
During the second retrieval, it retrieves the right context with the actual height and can then craft the final answer.
They call their technique ITER-RETGEN. It works as follows:
- Start with user question \(q\)
- Query initial paragraphs \(D_{q}\)
- Get answer generation \(y_1\)
- Query new context given the question and the first generation, \(D_{y_1 \| q}\)
- Get answer generation \(y_2\)
- …and so on, until we have all \(T\) iterations.
- Return \(y_T\) as the final response
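Concretely, the loop can be sketched in a few lines of Python. This is only a minimal sketch: `retrieve` and `generate` are hypothetical stand-ins for any retriever and LLM call, not the paper's actual implementation.

```python
from typing import Callable, List

def iter_retgen(
    question: str,
    retrieve: Callable[[str], List[str]],       # query -> top-k paragraphs (any retriever)
    generate: Callable[[str, List[str]], str],  # (question, context) -> LLM answer
    T: int = 2,
) -> str:
    """Sketch of ITER-RETGEN: alternate retrieval and generation for T iterations."""
    query = question  # the first iteration retrieves with the question alone
    answer = ""
    for _ in range(T):
        paragraphs = retrieve(query)
        answer = generate(question, paragraphs)
        # Next query: previous generation concatenated with the question (y_t || q),
        # so retrieval can bridge the semantic gap exposed by the draft answer.
        query = f"{answer} {question}"
    return answer
```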
2. Evaluation Format
They used the following prompt to determine whether the RAG answer is correct. They eschewed exact match (EM) metrics because EM significantly understates the performance of the system and is not sensitive to actual improvements in it. They called this method of evaluation Acc\(^\dagger\).
In the following task, you are given a Question, a model Prediction for the Question, and a Ground-truth Answer to the Question. You should decide whether the model Prediction implies the Ground-truth Answer.
Question
{question}
Prediction
{model output}
Ground-truth Answer
{answer}
Does the Prediction imply the Ground-truth Answer? Output Yes or No:
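A rough sketch of how this check could be wired up: `call_llm` is a placeholder for whatever evaluator model is used, and the `{model output}` placeholder is renamed to `{prediction}` so it works with Python string formatting.

```python
from typing import Callable

EVAL_PROMPT = """In the following task, you are given a Question, a model Prediction for the Question, and a Ground-truth Answer to the Question. You should decide whether the model Prediction implies the Ground-truth Answer.

Question
{question}

Prediction
{prediction}

Ground-truth Answer
{answer}

Does the Prediction imply the Ground-truth Answer? Output Yes or No:"""


def acc_dagger(question: str, prediction: str, answer: str,
               call_llm: Callable[[str], str]) -> bool:
    """Return True if the evaluator LLM judges that the prediction implies the answer."""
    prompt = EVAL_PROMPT.format(question=question, prediction=prediction, answer=answer)
    return call_llm(prompt).strip().lower().startswith("yes")
```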
3. Results
They found that the second iteration brought the largest performance boost, and more iterations generally performed better. Even with only two iterations (ITER-RETGEN 2), the method was competitive with other SOTA methods like ReAct, Self-Ask, and DSP.
They also validated that their Acc\(^\dagger\) metric was more accurate than EM by manually inspecting samples and finding that, in the overwhelming majority of cases where Acc\(^\dagger\) and EM disagree, Acc\(^\dagger\) was correct.
Most Glaring Deficiency
In practice, since tasks vary in difficulty, it may be interesting to explore taking an adaptive number of iterations based on whether the model already has enough knowledge to answer the question.
In addition, there may also be value in keeping previously retrieved context across iterations, which could be explored.
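As a toy sketch of what both ideas might look like (not something the paper evaluates), one could retain all retrieved paragraphs and add a self-check that stops the loop early, again using hypothetical `retrieve`/`generate` callables:

```python
from typing import Callable, List

def adaptive_iter_retgen(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    max_iters: int = 5,
) -> str:
    """Variant sketch: keep all retrieved context and stop once evidence seems sufficient."""
    query = question
    context: List[str] = []  # retain paragraphs from every iteration
    answer = ""
    for _ in range(max_iters):
        context.extend(retrieve(query))
        answer = generate(question, context)
        # Hypothetical self-check: ask the model whether the accumulated context
        # already suffices; stop early if it answers yes.
        check = generate(
            f"Can the question '{question}' be fully answered from the context? Answer Yes or No.",
            context,
        )
        if check.strip().lower().startswith("yes"):
            break
        query = f"{answer} {question}"  # otherwise, re-query as in ITER-RETGEN
    return answer
```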
Conclusions for Future Work
I suspect this paradigm works because it can be viewed as a form of advanced query rewriting, where the query ends up closely resembling the target data because it now includes similarly relevant data chunks.