Three Important Things

1. Bidirectional and Auto-Regressive Transformers

BART's architecture pairs a bidirectional encoder (similar to BERT) with an autoregressive decoder (similar to GPT), which gives the model its name.
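
As a concrete illustration of this encoder-decoder setup, here is a minimal sketch using the Hugging Face transformers library and the facebook/bart-base checkpoint (both of which are my assumptions, not details from the paper): the encoder reads a corrupted input bidirectionally, and the decoder regenerates the original text autoregressively.

```python
# Minimal sketch: BART as an encoder-decoder in Hugging Face transformers.
# The checkpoint name and the <mask> string are assumptions, not from the paper.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# The bidirectional encoder reads the (corrupted) input in full ...
inputs = tokenizer("BART is a <mask> model.", return_tensors="pt")

# ... and the autoregressive decoder regenerates the text token by token.
output_ids = model.generate(**inputs, max_length=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```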

2. Corrupting Text with Arbitrary Noising Functions

BART generalizes the masked language modeling (MLM) corruption scheme introduced in BERT with the following noising transformations, applied during pre-training:

  1. Token Masking - as in BERT, random tokens are sampled and replaced with the [MASK] token
  2. Token Deletion - random tokens are deleted from the input, and the model must decide which positions are missing
  3. Text Infilling - a generalization of token masking, where spans of text with lengths drawn from a Poisson distribution (λ = 3, so spans may be 0-length) are each replaced with a single [MASK] token; the model must therefore predict how many tokens are missing from each span.
  4. Sentence Permutation - sentences, delineated by full stops, are shuffled into a random order, and the model must reconstruct the original order
  5. Document Rotation - a token is chosen uniformly at random and the document is rotated to begin with it, so the model must identify the original start of the document.

These corruption schemes can also be composed with one another during pre-training; a simplified sketch of each scheme is given below.
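
The sketch below illustrates the five corruption schemes operating on plain lists of tokens and sentences. It is not the authors' pre-training code; the function names, corruption rates, and use of NumPy's random generator are my own assumptions.

```python
import numpy as np

MASK = "[MASK]"
rng = np.random.default_rng(0)

def token_masking(tokens, p=0.15):
    # BERT-style: replace randomly sampled tokens with [MASK].
    return [MASK if rng.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.15):
    # Delete random tokens; the model must also decide which positions are missing.
    return [t for t in tokens if rng.random() >= p]

def text_infilling(tokens, p=0.15, lam=3.0):
    # Replace spans with lengths drawn from Poisson(lam) (possibly 0-length)
    # by a single [MASK]; the model must predict how many tokens each span hid.
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < p:
            span = rng.poisson(lam)
            out.append(MASK)            # one [MASK] regardless of span length
            if span == 0:
                out.append(tokens[i])   # 0-length span: [MASK] inserted, token kept
            i += max(span, 1)
        else:
            out.append(tokens[i])
            i += 1
    return out

def sentence_permutation(sentences):
    # Shuffle full-stop-delimited sentences into a random order.
    return [sentences[j] for j in rng.permutation(len(sentences))]

def document_rotation(tokens):
    # Rotate the document so a uniformly chosen token becomes the new start.
    k = int(rng.integers(len(tokens)))
    return tokens[k:] + tokens[:k]

tokens = "the quick brown fox jumps over the lazy dog .".split()
print(text_infilling(tokens))
```

Composing these functions, for example applying sentence_permutation to a document and then text_infilling to its tokens, corresponds to the compound corruption setting discussed later in the review.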

3. Text Infilling Produces the Best Results

When each of the pre-training objectives above is used in isolation to pre-train a BART-base model, text infilling yields the strongest overall performance.

Most Glaring Deficiency

In general, I thought the paper did not introduce many novel ideas, though its experimental results are still useful for guiding incremental performance gains in large language models.

It would also have been interesting to see BART's performance with all of the transformations applied, or at least with more combinations of them: the only compound transformation evaluated was text infilling combined with sentence shuffling, which produced the best performance on a third of the evaluated tasks. This suggests there may still be room for further gains from combining more of the proposed pre-training tasks.

Conclusions for Future Work

Token masking, and more generally text infilling, is essential to the performance of language models on downstream language tasks.