BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

2024

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu

Feb 2024

Paper Abstract

In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.

@article{2402.03216v4,
  author = {Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
  title = {BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
    Text Embeddings Through Self-Knowledge Distillation},
  eprint = {2402.03216v4},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  year = {2024},
  month = feb,
  url = {http://arxiv.org/abs/2402.03216v4},
  file = {2402.03216v4.pdf},
  eprintnover = {2402.03216}
}

Three Important Things

1. Foo

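Unsupervised data curation: text pairs are mined from structure already present in the corpora, such as title-body, title-abstract, and instruction-output pairs.

Synthetic data: paragraphs are chosen from documents, GPT-3.5 generates questions about them, and the resulting pairs are added to the fine-tuning data.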
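The two data sources above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the field names (`title`, `body`, `abstract`), the blank-line paragraph splitting, the choice to pair each question with the whole document, and the `generate_question` callable (a stand-in for a GPT-3.5 call) are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (query-like text, passage-like text)


def structural_pairs(docs: List[Dict[str, str]]) -> List[Pair]:
    """Pair each document's title with its body and abstract, when present.

    The keys "title", "body" and "abstract" are hypothetical field names.
    """
    pairs: List[Pair] = []
    for doc in docs:
        title = doc.get("title", "").strip()
        if not title:
            continue  # no title means no pair can be formed
        for field in ("body", "abstract"):
            text = doc.get(field, "").strip()
            if text:
                pairs.append((title, text))
    return pairs


def synthetic_pairs(
    documents: List[str],
    generate_question: Callable[[str], str],
    rng: random.Random,
) -> List[Pair]:
    """Pick one paragraph per document and ask a generator for a question.

    `generate_question` is a hypothetical stand-in for an LLM call.
    Pairing the question with the full document is an assumption.
    """
    pairs: List[Pair] = []
    for document in documents:
        paragraphs = [p for p in document.split("\n\n") if p.strip()]
        if not paragraphs:
            continue
        paragraph = rng.choice(paragraphs)
        pairs.append((generate_question(paragraph), document))
    return pairs
```

Passing the generator in as a callable keeps the sketch runnable without any external service; a real setup would wrap whichever LLM client is in use.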

The embedding model supports all three common retrieval functions:

Dense retrieval
Lexical (sparse) retrieval - not in the tf-idf/BM25 sense. Instead, the text encoder output \(H_q[i]\) at each position \(i\) is passed through a linear projection \(W_{lex}\) and a ReLU to get a weight for the term \(t\) at that position: \(w_{q_t} \leftarrow \textrm{ReLU}(W^T_{lex} H_q[i])\). If a term appears more than once, its maximum weight is kept. Passage term weights \(w_{p_t}\) are computed the same way, and the relevance score sums the products over the terms shared by the query and passage: \(s_{lex} \leftarrow \sum_{t \in q \cap p} w_{q_t} \cdot w_{p_t}\)
Multi-vector retrieval - i.e., ColBERT-style late interaction over per-token embeddings
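To make the three scoring modes concrete, here is a minimal NumPy sketch. It assumes the encoder's hidden states are already available as arrays. The function names are hypothetical, `w_lex` is treated as a plain vector of shape `(d,)`, and the dense and multi-vector scores follow the usual normalised inner product and ColBERT-style max-sim formulations with any extra projection layers left out. It is an illustration of the ideas, not the FlagEmbedding implementation.

```python
import numpy as np


def dense_score(q_cls: np.ndarray, p_cls: np.ndarray) -> float:
    """Inner product of L2-normalised sentence-level embeddings."""
    q = q_cls / np.linalg.norm(q_cls)
    p = p_cls / np.linalg.norm(p_cls)
    return float(q @ p)


def term_weights(token_ids, hidden: np.ndarray, w_lex: np.ndarray) -> dict:
    """Map each token id to ReLU(w_lex^T h_i), keeping the max over duplicates.

    hidden has shape (seq_len, d); w_lex has shape (d,) (an assumption).
    """
    weights = np.maximum(hidden @ w_lex, 0.0)  # ReLU, shape (seq_len,)
    out: dict = {}
    for tid, w in zip(token_ids, weights):
        out[tid] = max(out.get(tid, 0.0), float(w))
    return out


def lexical_score(q_weights: dict, p_weights: dict) -> float:
    """Sum of weight products over terms present in both query and passage."""
    shared = q_weights.keys() & p_weights.keys()
    return float(sum(q_weights[t] * p_weights[t] for t in shared))


def multi_vector_score(q_vecs: np.ndarray, p_vecs: np.ndarray) -> float:
    """Late interaction: for each query token take its best-matching passage
    token (cosine similarity), then average over query tokens."""
    q = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    p = p_vecs / np.linalg.norm(p_vecs, axis=1, keepdims=True)
    sim = q @ p.T  # shape (n_query_tokens, n_passage_tokens)
    return float(sim.max(axis=1).mean())
```

Because the lexical score only touches terms shared by the query and passage, the per-term weights can be stored in an inverted index much like classic sparse retrieval, even though the weights themselves are learned.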
