LLMs perform well on IR tasks, yet they are not widely used because of their high cost and latency. For instance, re-ranking 1000 documents of 250 tokens each would have cost 15 USD in Feb 2022.
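As a rough sanity check of that figure (assuming the GPT-3 Davinci list price of about 0.06 USD per 1,000 tokens at the time; the exact pricing is my assumption, not stated here):

\[1000 \text{ documents} \times 250\ \frac{\text{tokens}}{\text{document}} \times \frac{0.06\ \text{USD}}{1000\ \text{tokens}} = 15\ \text{USD}\]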

This paper proposes a way to use LLMs in IR: have them generate synthetic data that can then be used to fine-tune IR models.

They call their technique InPars: an LLM is prompted to generate queries for a given document, using examples from the MS MARCO dataset. MS MARCO is a large dataset of anonymized Bing search queries comprising 500k pairs of queries and relevant documents.

The vanilla approach is just few-shot prompting, while the GBQ (Good-Bad Question) approach works as follows:

  1. The original query from MS MARCO serves as the “bad question”
  2. An annotator manually writes a more complicated, more contextually aware question, which serves as the “good question”

Note that since the same few-shot examples can be used throughout to generate synthetic examples, this annotation is a one-time cost.
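Here is a minimal sketch of how the vanilla few-shot prompt might be assembled. The `Document:` / `Relevant Query:` labels, the placeholder example pairs, and the `complete` function are my own illustrative assumptions, not the paper's exact template; for GBQ, the example query would simply be swapped for the annotator-written good question.

```python
# Sketch of vanilla few-shot query generation (not the authors' exact prompt template).
# `few_shot_pairs` holds (document, query) examples taken from MS MARCO.
few_shot_pairs = [
    ("<document text 1>", "<query 1>"),  # placeholder few-shot example
    ("<document text 2>", "<query 2>"),  # placeholder few-shot example
    ("<document text 3>", "<query 3>"),  # placeholder few-shot example
]

def build_prompt(target_document: str) -> str:
    """Concatenate the few-shot (document, query) examples, then the target document."""
    blocks = [
        f"Document: {doc}\nRelevant Query: {query}" for doc, query in few_shot_pairs
    ]
    # The model is asked to continue after "Relevant Query:" for the new document.
    blocks.append(f"Document: {target_document}\nRelevant Query:")
    return "\n\n".join(blocks)

# Hypothetical usage, given some LLM completion function `complete(prompt) -> str`:
# synthetic_query = complete(build_prompt(new_document)).strip()
```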

The GBQ approach is cited in some other IR papers on generating synthetic data.

Once many synthetic queries have been generated, the authors filter them down to just the top \(k\) queries by query likelihood, i.e. the mean log-probability the generating model assigned to the query tokens \(q_i\) given the prompt \(t\) and the document \(d\):

\[p_q=\frac{1}{|q|} \sum_{i=1}^{|q|} \log p\left(q_i \mid t, d, q_{<i}\right)\]
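A sketch of this filtering step, assuming each generated query comes with per-token log-probabilities from the generating model; the scoring mirrors the formula above, but the function and variable names are mine, not the paper's.

```python
from typing import List, Tuple

def query_score(token_logprobs: List[float]) -> float:
    """Mean token log-probability of a generated query, i.e. p_q above."""
    return sum(token_logprobs) / len(token_logprobs)

def filter_top_k(
    generated: List[Tuple[str, str, List[float]]],  # (document, query, per-token log-probs)
    k: int,
) -> List[Tuple[str, str]]:
    """Keep the k (document, query) pairs whose queries the model was most confident about."""
    scored = [(query_score(logprobs), doc, query) for doc, query, logprobs in generated]
    scored.sort(key=lambda item: item[0], reverse=True)  # highest mean log-prob first
    return [(doc, query) for _, doc, query in scored[:k]]

# Hypothetical usage: keep the top pairs out of the full set of generations,
# then use them as positive examples when fine-tuning a retrieval/re-ranking model.
# training_pairs = filter_top_k(generated_queries, k=10_000)
```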

Most Glaring Deficiency

In their results table, it is unclear whether they used the vanilla or the GBQ-generated queries. The criteria for the “good questions” in the GBQ prompts also feel pretty subjective, and there is no rigorous comparison between the vanilla and GBQ approaches; they only mention that “Preliminary experiments demonstrate that GBQ leads to better performance than the Vanilla prompt when used in combination with in-domain input documents”.

Conclusions for Future Work

Probably a pretty obvious paradigm these days, but training on synthetic data generated by LLMs can give good results, since you can control the structure and type of data the LLM generates to get exactly what you want.