LLMs perform well on IR tasks, yet they are not widely used because of their high cost and latency. For instance, re-ranking 1000 documents of 250 tokens each would have cost 15 USD in Feb 2022.
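As a rough sanity check of that figure (assuming the GPT-3 Davinci list price of about 0.06 USD per 1,000 tokens at the time; the exact pricing is my assumption, not stated here):

\[1000 \text{ documents} \times 250\ \frac{\text{tokens}}{\text{document}} \times \frac{0.06\ \text{USD}}{1000\ \text{tokens}} = 15\ \text{USD}\]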

This paper proposes a way to use LLMs in IR: have them generate synthetic data that can then be used to fine-tune IR models.

They call their technique InPars: an LLM is prompted to generate queries for a given document, using examples from the MS MARCO dataset. MS MARCO is a large dataset of anonymized Bing search queries comprising 500k pairs of queries and relevant documents.

The vanilla approach is just few-shot prompting, while the GBQ (Good-Bad Question) approach works as follows:

  1. The original query from MS MARCO serves as the “bad question”
  2. An annotator manually writes a more complicated, more contextually aware question, which serves as the “good question”

Note that since the same few-shot examples can be used throughout to generate synthetic examples, this annotation is a one-time cost.
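Here is a minimal sketch of how the vanilla few-shot prompt might be assembled. The `Document:` / `Relevant Query:` labels, the placeholder example pairs, and the `complete` function are my own illustrative assumptions, not the paper's exact template; for GBQ, the example query would simply be swapped for the annotator-written good question.

```python
# Sketch of vanilla few-shot query generation (not the authors' exact prompt template).
# `few_shot_pairs` holds (document, query) examples taken from MS MARCO.
few_shot_pairs = [
    ("<document text 1>", "<query 1>"),  # placeholder few-shot example
    ("<document text 2>", "<query 2>"),  # placeholder few-shot example
    ("<document text 3>", "<query 3>"),  # placeholder few-shot example
]

def build_prompt(target_document: str) -> str:
    """Concatenate the few-shot (document, query) examples, then the target document."""
    blocks = [
        f"Document: {doc}\nRelevant Query: {query}" for doc, query in few_shot_pairs
    ]
    # The model is asked to continue after "Relevant Query:" for the new document.
    blocks.append(f"Document: {target_document}\nRelevant Query:")
    return "\n\n".join(blocks)

# Hypothetical usage, given some LLM completion function `complete(prompt) -> str`:
# synthetic_query = complete(build_prompt(new_document)).strip()
```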

The GBQ approach is cited in some other IR papers on generating synthetic data.

Once many synthetic queries have been generated, the authors filter them down to just the top \(k\) queries by query likelihood, i.e. the mean log-probability the generating model assigned to the query tokens \(q_i\) given the prompt \(t\) and the document \(d\):

\[p_q=\frac{1}{|q|} \sum_{i=1}^{|q|} \log p\left(q_i \mid t, d, q_{<i}\right)\]
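A sketch of this filtering step, assuming each generated query comes with per-token log-probabilities from the generating model; the scoring mirrors the formula above, but the function and variable names are mine, not the paper's.

```python
from typing import List, Tuple

def query_score(token_logprobs: List[float]) -> float:
    """Mean token log-probability of a generated query, i.e. p_q above."""
    return sum(token_logprobs) / len(token_logprobs)

def filter_top_k(
    generated: List[Tuple[str, str, List[float]]],  # (document, query, per-token log-probs)
    k: int,
) -> List[Tuple[str, str]]:
    """Keep the k (document, query) pairs whose queries the model was most confident about."""
    scored = [(query_score(logprobs), doc, query) for doc, query, logprobs in generated]
    scored.sort(key=lambda item: item[0], reverse=True)  # highest mean log-prob first
    return [(doc, query) for _, doc, query in scored[:k]]

# Hypothetical usage: keep the top pairs out of the full set of generations,
# then use them as positive examples when fine-tuning a retrieval/re-ranking model.
# training_pairs = filter_top_k(generated_queries, k=10_000)
```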

Most Glaring Deficiency

In their results table, it is unclear whether they used the vanilla or the GBQ-generated queries. The criteria for the “good questions” in the GBQ prompts also feel pretty subjective, and there is no rigorous comparison between the vanilla and GBQ approaches; they only mention that “Preliminary experiments demonstrate that GBQ leads to better performance than the Vanilla prompt when used in combination with in-domain input documents”.

Conclusions for Future Work

Probably a pretty obvious paradigm these days, but training on synthetic data generated by LLMs can give good results, since you can control the structure and type of data the LLM generates to get exactly what you want.