Dense X Retrieval: What Retrieval Granularity Should We Use?

2023

Tong Chen, Hongwei Wang, Sihao Chen, and 5 more authors

Dec 2023

Paper Abstract

Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our results reveal that proposition-based retrieval significantly outperforms traditional passage or sentence-based methods in dense retrieval. Moreover, retrieval by proposition also enhances the performance of downstream QA tasks, since the retrieved texts are more condensed with question-relevant information, reducing the need for lengthy input tokens and minimizing the inclusion of extraneous, irrelevant information.

@article{2312.06648v2,
  author = {Chen, Tong and Wang, Hongwei and Chen, Sihao and Yu, Wenhao and Ma, Kaixin and Zhao, Xinran and Zhang, Hongming and Yu, Dong},
  title = {Dense X Retrieval: What Retrieval Granularity Should We Use?},
  eprint = {2312.06648v2},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  year = {2023},
  month = dec,
  url = {http://arxiv.org/abs/2312.06648v2},
  file = {2312.06648v2.pdf},
  eprintnover = {2312.06648}
}

Three Important Things

1. What Retrieval Granularity Should We Use?

The paper finds that choosing the right retrieval granularity matters for the downstream performance of RAG systems.

For instance, one can consider granularities at the passage or sentence level. This paper proposes going further by using a proposition as a novel retrieval unit, where a proposition is defined as an atomic factoid that is contextualized and self-contained.
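To make the three granularities concrete, here is a minimal sketch of how a corpus might be split into passage and sentence units. The function names, the fixed 100-word window, and the regex sentence splitter are assumptions for illustration, not the paper's preprocessing. Propositions cannot be produced by a simple rule like this, so the example only shows a hand-written toy pair to illustrate what "contextualized and self-contained" means.

```python
import re


def split_passages(text: str, words_per_passage: int = 100) -> list[str]:
    """Split text into consecutive fixed-size word windows (assumed 100 words)."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Toy text, not taken from the paper or from FactoidWiki.
toy_text = "The tower was built in 1900. It is 50 meters tall."

sentence_units = split_sentences(toy_text)

# Hand-written propositions for the toy text (hypothetical illustration):
# each one states a single fact and resolves the pronoun "It",
# so it can be understood without the surrounding passage.
proposition_units = [
    "The tower was built in 1900.",
    "The tower is 50 meters tall.",
]
```

Note how the second sentence unit ("It is 50 meters tall.") is ambiguous on its own, while the corresponding proposition is not. That is the property a proposition index is meant to provide.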

2. FactoidWiki

To test how their proposition-level approach compares to the baselines of 100-word passages and sentences, they built the FactoidWiki dataset. How they built it is interesting because it suggests how one might build a similar proposition-granularity dataset from one's own data.

To do so, they fine-tuned a Flan-T5-large model on demonstrations of generating propositions from a passage. These demonstrations were generated by GPT-4 from a set of 42k passages, using a prompt given in the paper.

3. Results

They used the Recall@k metric to evaluate retrieval performance, where Recall@k is the percentage of questions for which the right answer appears in one of the top \(k\) returned entries.
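As a rough illustration of the metric, the sketch below computes Recall@k over a small set of questions. It assumes an answer counts as "found" when any gold answer string occurs, case-insensitively, as a substring of a retrieved unit; the paper's exact matching procedure may differ, and the function name and toy data are hypothetical.

```python
def recall_at_k(retrieved: list[list[str]], answers: list[list[str]], k: int) -> float:
    """Fraction of questions whose top-k retrieved units contain a gold answer.

    retrieved[i] is the ranked list of units returned for question i,
    answers[i] is the list of acceptable answer strings for question i.
    Matching is a case-insensitive substring check (an assumption).
    """
    if not retrieved:
        return 0.0
    hits = 0
    for units, golds in zip(retrieved, answers):
        top_k = units[:k]
        if any(g.lower() in u.lower() for u in top_k for g in golds):
            hits += 1
    return hits / len(retrieved)


# Toy ranked results for three questions (made up for illustration).
toy_retrieved = [
    ["Paris is the capital of France.", "Lyon is a city in France."],
    ["Berlin is a large city.", "The Rhine is a river."],
    ["Rome is in Italy.", "Madrid is the capital of Spain."],
]
toy_answers = [["Paris"], ["Munich"], ["Madrid"]]

recall_1 = recall_at_k(toy_retrieved, toy_answers, k=1)  # only question 1 hits
recall_2 = recall_at_k(toy_retrieved, toy_answers, k=2)  # questions 1 and 3 hit
```

Because the metric counts retrieved units rather than tokens, the same \(k\) covers much less text at proposition granularity than at passage granularity, which is worth keeping in mind when comparing numbers across unit types.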

Using embeddings at the proposition level provided a clear win on the non-finetuned retrievers, but in many cases performed worse than the passage and sentence counterparts when the retrievers were fine-tuned. They speculate that this is due to fine-tuning being performed on traditional passage-query pairs.

In a separate experiment, they found that proposition-level retrieval helps most when the target entity is not popular, and that the performance gap narrows as the target entity becomes more common.

Most Glaring Deficiency

It would have been interesting to see how supervised dense retrievers fine-tuned on proposition-level retrieval would perform.

Conclusions for Future Work

Retrieval by proposition wins out because it is self-contained and only contains the necessary relevant context. Similarly, we can think about improving the performance of related systems by only including the necessary and sufficient information.