Three Important Things
1. Existing Retrievers Suck
Existing dense retrieval techniques rely on supervised labels pairing queries with relevant documents. With too few labels, they perform worse than BM25 on out-of-domain examples.
One might instead take a dense retriever trained on a large retrieval dataset like MS MARCO and apply it zero-shot to new domains, but this too is frequently outperformed by BM25.
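For reference, the BM25 baseline that keeps beating these models is only a few lines, e.g. with the rank_bm25 package (a sketch with made-up toy documents, not the paper's setup):

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

corpus = [
    "dense retrievers map queries and documents to vectors",
    "bm25 scores documents by term frequency and rarity",
    "contrastive learning builds positive and negative pairs",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])  # index the tokenized corpus
print(bm25.get_scores("bm25 term frequency".split()))  # one score per document
```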
Is there any hope for a dense retriever, trained in an unsupervised fashion, that can match the performance of BM25?
The paper gives an affirmative answer, introducing Contriever, an unsupervised technique built on contrastive learning.
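As a quick taste before the details, the released checkpoint (facebook/contriever on the HuggingFace Hub) embeds text by mean-pooling the Transformer's token representations. A minimal sketch of scoring a query against a document:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # (B, T, d) token embeddings
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1) ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling -> (B, d)

q, d = embed(["where was marie curie born?", "maria sklodowska was born in warsaw"])
print(q @ d)  # relevance score s(q, d) as a dot product
```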
2. Contriever
In an unsupervised context, the only assumption made is that each document is unique in some manner.
Given the relevance score \(s(q, d)=\left\langle f_\theta(q), f_\theta(d)\right\rangle\) for a query \(q\) and document \(d\) where \(f_\theta\) is typically a Transformer-based model, the contrastive InfoNCE loss is given as:
\[\mathcal{L}\left(q, k_{+}\right)=-\log \frac{\exp \left(s\left(q, k_{+}\right) / \tau\right)}{\exp \left(s\left(q, k_{+}\right) / \tau\right)+\sum_{i=1}^K \exp \left(s\left(q, k_i^{-}\right) / \tau\right)}\]
This is the negative log-softmax of the positive pair's score against the scores of \(K\) negatives \(k_i^-\), with \(\tau\) acting as a temperature parameter. Minimizing the loss thus encourages the model to assign high scores to relevant query-document pairs and low scores to irrelevant ones.
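In code, the loss for a single query with one positive and K sampled negatives might look like this (a minimal PyTorch sketch, not the authors' implementation; the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_negs, tau=0.05):
    """q: (d,) query embedding, k_pos: (d,) positive document embedding,
    k_negs: (K, d) negative document embeddings; tau is the temperature."""
    pos = (q @ k_pos).unsqueeze(0) / tau     # (1,) score of the positive pair
    neg = (k_negs @ q) / tau                 # (K,) scores of the negatives
    logits = torch.cat([pos, neg])           # positive sits at index 0
    return -F.log_softmax(logits, dim=0)[0]  # negative log-softmax of the positive
```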
3. Building positive & negative pairs
To create positive pairs, they considered the inverse cloze task, which uses a span of the document as the query and the complement of that span as the key.
They also used independent cropping, which samples two independent spans from the same document and treats them as a query-key pair.
They also added random word deletion, replacement, and masking for data augmentation.
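A minimal sketch of independent cropping with random word deletion (the span length and deletion rate are illustrative, not the paper's exact settings):

```python
import random

def crop(tokens, span_len=64):
    """Sample one contiguous span from the document's tokens."""
    start = random.randint(0, max(0, len(tokens) - span_len))
    return tokens[start:start + span_len]

def positive_pair(tokens, span_len=64, p_delete=0.1):
    """Two independently sampled spans of the same document form a
    (query, key) positive pair, each augmented by random word deletion."""
    query = [t for t in crop(tokens, span_len) if random.random() > p_delete]
    key = [t for t in crop(tokens, span_len) if random.random() > p_delete]
    return query, key
```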
To create negative pairs, a simple approach is to treat all other documents in the batch as negatives for a given query, known as “in-batch negatives”. However, this requires large batch sizes to work well.
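With in-batch negatives, the loss reduces to a cross-entropy over the batch's query-key similarity matrix, with positives on the diagonal (again a sketch, reusing the illustrative temperature):

```python
import torch
import torch.nn.functional as F

def in_batch_loss(q_emb, k_emb, tau=0.05):
    """q_emb, k_emb: (B, d) embeddings; row i of k_emb is the positive
    for query i, and every other row in the batch serves as a negative."""
    logits = q_emb @ k_emb.T / tau                     # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)             # positives on the diagonal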
To address this, one method is to re-use samples from previous batches as negatives, allowing for more diversity with smaller batch sizes. They noted that this creates an asymmetry in training, since the sets of queries and keys differ, and during backpropagation the loss only flows through the queries while the stored keys are held constant. This could lead to training instability.
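A sketch of the idea: keys from past batches live in a fixed-size FIFO queue and are stored detached, which is exactly the asymmetry described above (the queue size is illustrative):

```python
import torch
import torch.nn.functional as F

def loss_with_queue(q_emb, k_emb, queue, max_size=65536, tau=0.05):
    """queue: (Q, d) key embeddings saved from previous batches."""
    all_keys = torch.cat([k_emb, queue])               # current + stored keys
    logits = q_emb @ all_keys.T / tau                  # (B, B + Q)
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    loss = F.cross_entropy(logits, labels)
    # Enqueue the newest keys without gradients; drop the oldest ones.
    queue = torch.cat([k_emb.detach(), queue])[:max_size]
    return loss, queue
```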
To alleviate this instability, they use a technique called MoCo, which keeps two separate networks for the queries and keys. The query network is trained with SGD as usual, but the key network is updated as an exponential moving average of the query network's parameters:
\[\theta_k \leftarrow m \theta_k+(1-m) \theta_q\]
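In PyTorch this EMA update is a few lines (m = 0.999 is the classic MoCo default, used here only for illustration):

```python
import torch

@torch.no_grad()
def momentum_update(query_net, key_net, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q."""
    for p_q, p_k in zip(query_net.parameters(), key_net.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1 - m)
```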
4. Results
Contriever is pretty competitive with BM25 on the unsupervised benchmarks, and both are generally much better than other unsupervised dense methods. It was surprising to me just how strong BM25 is as a baseline.
In the supervised setting, they’re also SOTA on some datasets, though I haven’t had time to look into any of the other methods besides ColBERT. “+CE” means that a cross-encoder was added for re-ranking, which helps with nDCG scores since that metric is sensitive to ranking position.
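For context, a cross-encoder scores each (query, document) pair jointly rather than embedding them separately. A sketch using an off-the-shelf model from sentence-transformers (the model name is a generic choice, not necessarily the one used in the paper):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "where was marie curie born?"
candidates = ["maria sklodowska was born in warsaw", "bm25 is a lexical method"]
scores = reranker.predict([(query, doc) for doc in candidates])
# Re-order the retriever's top-k candidates by the cross-encoder's scores
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```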
Most Glaring Deficiency
It would be interesting to know whether any kind of “scaling laws” emerge as the unsupervised training data is scaled up.
Conclusions for Future Work
Contriever is one of the most competitive & popular baselines for retrievers, and shows how unsupervised techniques have broad appeal.
Not really the contribution of this paper, but I also thought the method of using momentum updates for the key network in MoCo was really cool.