Two Important Things

1. Interleaved Retrieval guided by Chain-of-Thought

The paper’s insight is to use CoT to guide retrieval, and then use the retrieved content to guide the next step of CoT.

This is done as follows:

  • Generate one sentence of CoT
  • Use that CoT sentence as the query to retrieve an additional piece of context
  • With the new context added, repeat the previous steps until an answer is produced or the maximum number of steps is reached

The retrieved context is ordered randomly at each step. Since the LLM may output multiple sentences of CoT each time, only the first newly generated sentence is kept and the rest are dropped. A sketch of the full loop is given below.
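Here is a minimal Python sketch of that loop (not the authors' implementation); retrieve, generate, and build_prompt are hypothetical stand-ins for a retriever call, an LLM call, and the prompt assembly matching the template shown further below.

import random

# Minimal sketch of the interleaved retrieve-then-reason loop.
# `retrieve`, `generate`, and `build_prompt` are hypothetical stand-ins.
def ircot(question, retrieve, generate, build_prompt, k=4, max_steps=8):
    paragraphs = list(retrieve(question, k))  # initial retrieval uses the question itself
    cot_sentences = []

    for _ in range(max_steps):
        # The working context is ordered randomly before each prompt, as noted above.
        random.shuffle(paragraphs)
        prompt = build_prompt(paragraphs, question, cot_sentences)
        output = generate(prompt)

        # The model may emit several CoT sentences; keep only the first new one.
        first = output.strip().split(". ")[0].strip()
        next_sentence = first if first.endswith(".") else first + "."
        cot_sentences.append(next_sentence)

        # Stop once the trace states the answer (simple termination heuristic).
        if "answer is" in next_sentence.lower():
            break

        # Use the newest CoT sentence as the next retrieval query and add
        # any paragraphs not already in the working set.
        for p in retrieve(next_sentence, k):
            if p not in paragraphs:
                paragraphs.append(p)

    return cot_sentences, paragraphs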

Here’s the overall structure of the prompt:

Wikipedia Title: <Page Title>
<Paragraph Text>
...
Wikipedia Title: <Page Title>
<Paragraph Text>
Q: <Question>
A: <CoT-Sent-1> ... <CoT-Sent-n>
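
A hypothetical build_prompt helper matching this template might look like the sketch below; representing each retrieved paragraph as a dict with "title" and "text" keys is an assumption for illustration, not the paper's interface.

# Hypothetical prompt assembly matching the template above. Assumes each
# retrieved paragraph is a dict with "title" and "text" keys.
def build_prompt(paragraphs, question, cot_sentences):
    blocks = [f"Wikipedia Title: {p['title']}\n{p['text']}" for p in paragraphs]
    context = "\n\n".join(blocks)
    partial_cot = " ".join(cot_sentences)
    return f"{context}\n\nQ: {question}\nA: {partial_cot}"

Because the partial CoT is appended after "A:", the model simply continues the trace, which is what lets the loop above peel off one new sentence per step.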

2. Results

Unsurprisingly, IRCoT outperforms both one-step retrieval and no-retrieval baselines.

They also found that IRCoT’s CoT traces contain fewer factual errors, and that the method remains effective with smaller models (0.2B to 11B parameters).

Most Glaring Deficiency

The novelty is marginal given earlier techniques such as self-ask.

Conclusions for Future Work

An LLM could be used to drive querying of RAG datastores, guided by CoT, to resolve complex queries.