Three Important Things
1. Existing Neural Information Retrieval Techniques
Neural information retrieval techniques use neural networks instead of hand-crafted scoring functions (e.g., BM25) to score the similarity between queries and documents. This paper introduces ColBERT, which improves on BERT-based retrieval techniques by being far more computationally efficient.

Let’s first survey the landscape of retrieval techniques, with respect to the figure above:
- (a) Representation-based Similarity: this is probably the most well-known approach, where the document chunks are embedded offline through some deep neural network, and the query is embedded similarly but online. Pairwise similarity scores are computed between the query and all documents, and the top-scoring ones are returned.
- (b) Query-Document Interaction: in this approach, word- and phrase-level relationships (using n-grams) are computed between all words in the query and document as a form of feature engineering, and these relationships are assembled into an interaction matrix that is fed as input to a deep neural network.
- (c) All-to-all Interaction: this can be viewed as a generalization of (b), where all pairwise interactions, both within and across the query and document, are considered. This is achieved via self-attention, with BERT being used in practice since it is bidirectional and can therefore model all pairwise interactions, rather than just causal relationships.
- (d) Late Interaction: this is the approach introduced by the paper, where the document embeddings are computed offline with BERT and the query embeddings are computed online. The MaxSim operator (explained later) is then applied between the query and document embeddings. This architecture is visualized below:

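As a concrete point of comparison, here is a minimal NumPy sketch of approach (a), representation-based similarity. The random-projection "encoder", the dimensions, and the corpus are all made up for illustration; a real system would use a learned encoder such as BERT's [CLS] embedding.

```python
import numpy as np

# Toy stand-in for a neural encoder: a fixed random projection,
# so the example stays self-contained and runnable.
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))

def embed(vecs):
    """Project raw feature vectors to unit-norm embeddings."""
    e = vecs @ W
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

# Offline: embed the document collection once.
docs = rng.normal(size=(1000, 128))
doc_emb = embed(docs)                  # shape (1000, 64)

# Online: embed the query and score it against every document.
query = rng.normal(size=(1, 128))
q_emb = embed(query)                   # shape (1, 64)

scores = (q_emb @ doc_emb.T).ravel()   # dot product = cosine (unit norms)
top5 = np.argsort(-scores)[:5]         # indices of the 5 best documents
print(top5)
```

The key property is that only the single query embedding is computed at query time; everything about the documents is precomputed.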
2. ColBERT Query and Document Encoder
Let’s understand how the query and document encoders work in detail. To distinguish between queries and documents when they are passed as input into BERT, the authors prepend a special token [Q] before queries and [D] before documents. [MASK] tokens are used to pad queries up to a fixed maximum length (the authors call this query augmentation). Next, in both cases we pass the inputs through BERT, before applying them to a fully connected layer (without an activation) that maps each token embedding down to a smaller dimension. I’m not sure why they chose the notation they did for this layer. Finally, the output embeddings are L2-normalized, so that the dot product between any two of them corresponds to their cosine similarity.
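The input preparation above can be sketched as follows. The token strings, the padding length, and the helper names here are my own placeholders; the real encoder uses BERT's tokenizer and vocabulary.

```python
# Hedged sketch of ColBERT-style input preparation: [Q]/[D] marker
# tokens, plus [MASK] padding for queries ("query augmentation").

N_q = 8  # fixed query length to pad up to (value here is arbitrary)

def prepare_query(tokens):
    # Prepend [CLS] and the query marker [Q], then pad with [MASK]
    # tokens up to N_q (or truncate if the query is too long).
    seq = ["[CLS]", "[Q]"] + tokens
    return seq[:N_q] + ["[MASK]"] * max(0, N_q - len(seq))

def prepare_document(tokens):
    # Documents get the [D] marker and no mask padding.
    return ["[CLS]", "[D]"] + tokens

print(prepare_query(["neural", "retrieval"]))
# → ['[CLS]', '[Q]', 'neural', 'retrieval',
#    '[MASK]', '[MASK]', '[MASK]', '[MASK]']
```

The mask padding is a soft form of query expansion: BERT produces contextual embeddings for the [MASK] positions too, and those embeddings also participate in scoring.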
3. ColBERT Relevancy Scores
With all the embedding steps behind us, how do we compute a score between a query q and a document d? ColBERT uses the MaxSim operator: S(q, d) = Σ_i max_j E_{q_i} · E_{d_j}, where E_q and E_d denote the bags of query and document token embeddings. This is saying that the score between q and d is computed by matching each query token embedding against its most similar document token embedding, and summing these maximum similarities over all query tokens.
Their formulation is somewhat reminiscent of Gaussian complexity, where we try to understand how expressive a function class is via an expected supremum, over the class, of each function’s correlation with random noise; the sum-of-maxima structure of MaxSim has a similar flavor.
Finally, to train ColBERT, they used triples ⟨q, d⁺, d⁻⟩ consisting of a query, a relevant document, and an irrelevant document, optimizing a pairwise softmax cross-entropy loss over the scores of d⁺ and d⁻.
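Putting the scoring and training pieces together, here is a hedged NumPy sketch of the MaxSim score and the pairwise softmax cross-entropy over one ⟨q, d⁺, d⁻⟩ triple. The embedding sizes are arbitrary and the embeddings are random placeholders rather than real BERT outputs.

```python
import numpy as np

def maxsim_score(E_q, E_d):
    """ColBERT relevance: for each query token embedding, take the max
    similarity over document token embeddings, then sum over query tokens."""
    sim = E_q @ E_d.T          # (n_q, n_d) pairwise dot products
    return sim.max(axis=1).sum()

rng = np.random.default_rng(0)
normalize = lambda e: e / np.linalg.norm(e, axis=-1, keepdims=True)

E_q  = normalize(rng.normal(size=(8, 32)))    # query token embeddings
E_dp = normalize(rng.normal(size=(120, 32)))  # positive document d+
E_dn = normalize(rng.normal(size=(90, 32)))   # negative document d-

s_pos = maxsim_score(E_q, E_dp)
s_neg = maxsim_score(E_q, E_dn)

# Pairwise softmax cross-entropy over the two scores, as when training
# on <q, d+, d-> triples; in a real trainer this would be backpropagated.
loss = -np.log(np.exp(s_pos) / (np.exp(s_pos) + np.exp(s_neg)))
print(s_pos, s_neg, loss)
```

Because MaxSim involves only dot products, maxima, and a sum, the document embeddings can be precomputed and indexed offline, which is exactly what makes late interaction cheap at query time.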
Most Glaring Deficiency
At the end of the day, I did not find this paper especially novel: viewed through the lens of the representation-based similarity approach (which already makes use of a form of late interaction, since document embeddings are computed offline), their main technical contribution is just how the relevancy scores are computed.
Indeed, this technique could also have been extended to representation-based similarity more broadly, so I feel like they missed a chance to solve a more general problem that seemed quite obvious.
However, I’ll admit I may have been unable to fully appreciate the merits of this paper, as I’m not very well-versed in the information retrieval literature.
Conclusions for Future Work
The technique of late interaction can be used to optimize systems: it offers opportunities for offline pre-computation, and it may reduce how much data must be processed at once (an issue for superlinearly-scaling architectures like Transformers), at the cost of potentially slightly worse performance.