ASQA: Factoid Questions Meet Long-Form Answers

Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang

Apr 2022

Paper Abstract

An abundance of datasets and availability of reliable evaluation metrics have resulted in strong progress in factoid question answering (QA). This progress, however, does not easily transfer to the task of long-form QA, where the goal is to answer questions that require in-depth explanations. The hurdles include (i) a lack of high-quality data, and (ii) the absence of a well-defined notion of the answer’s quality. In this work, we address these problems by (i) releasing a novel dataset and a task that we call ASQA (Answer Summaries for Questions which are Ambiguous); and (ii) proposing a reliable metric for measuring performance on ASQA. Our task focuses on factoid questions that are ambiguous, that is, have different correct answers depending on interpretation. Answers to ambiguous questions should synthesize factual information from multiple sources into a long-form summary that resolves the ambiguity. In contrast to existing long-form QA tasks (such as ELI5), ASQA admits a clear notion of correctness: a user faced with a good summary should be able to answer different interpretations of the original ambiguous question. We use this notion of correctness to define an automated metric of performance for ASQA. Our analysis demonstrates an agreement between this metric and human judgments, and reveals a considerable gap between human performance and strong baselines.

@article{2204.06092v2,
  author = {Stelmakh, Ivan and Luan, Yi and Dhingra, Bhuwan and Chang, Ming-Wei},
  title = {ASQA: Factoid Questions Meet Long-Form Answers},
  eprint = {2204.06092v2},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  year = {2022},
  month = apr,
  url = {http://arxiv.org/abs/2204.06092v2},
  file = {2204.06092v2.pdf},
  eprintnover = {2204.06092}
}

Two Important Things

1. Answer Summaries for Questions which are Ambiguous (ASQA)

Producing a long-form answer from a set of relevant documents is essentially query-based multi-document summarization. On top of that, the question itself may be ambiguous.

A good answer should address the ambiguity in the question and synthesize all the valid short answers into one coherent long answer. For instance, the question "Who was the ruler of France in 1830?" is ambiguous, since France had two rulers that year: Charles X until the July Revolution, and Louis Philippe I after it.

They use the following desiderata for a good long answer:

Completeness. The long answer should contain all valid short answers.
Comprehensiveness. The long answer should address the source of the initial ambiguity and explain how the provided short answers relate to one another.
Fluency. The long answer should be coherent and fluent.
Attributability. The long answer should be grounded in the provided documents.
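
Of the four criteria, completeness is the easiest to approximate mechanically. The sketch below is only an illustration of the idea and not the paper's metric: it assumes a naive case-insensitive substring match, and the `AmbiguousExample` container and its field names are hypothetical rather than the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class AmbiguousExample:
    """Hypothetical container; field names are illustrative, not the ASQA schema."""
    question: str
    short_answers: list[str]
    long_answer: str


def completeness(example: AmbiguousExample) -> float:
    """Fraction of short answers found verbatim (case-insensitive) in the long answer."""
    if not example.short_answers:
        return 0.0
    text = example.long_answer.lower()
    hits = sum(1 for ans in example.short_answers if ans.lower() in text)
    return hits / len(example.short_answers)


example = AmbiguousExample(
    question="Who was the ruler of France in 1830?",
    short_answers=["Charles X", "Louis Philippe I"],
    long_answer=(
        "Charles X ruled France until the July Revolution of 1830, "
        "after which Louis Philippe I took the throne."
    ),
)
print(completeness(example))
```

A substring check like this misses paraphrases and alternative spellings, which is one reason a learned reader model is a more robust way to test whether an answer is recoverable.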

They hired crowdworkers to annotate the AmbigQA dataset, producing the ASQA dataset, whose long answers respect the above guidelines.

2. Evaluation Pipeline

They also came up with a new metric, the DR score, for evaluating responses on their new ASQA dataset.

The DR score combines the ROUGE score (a measure of similarity between two texts based on overlapping n-grams) with what they call disambiguation accuracy.
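
As a rough sketch of how the two parts could fit together: my understanding is that the paper takes the geometric mean of the disambiguation score and ROUGE-L, so a response has to do well on both to score well. The ROUGE-L below is a simplified version that assumes whitespace tokenization and no stemming, so it will not match an official ROUGE implementation exactly. The function names are my own, and the disambiguation score is simply passed in as a number.

```python
import math


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1: whitespace tokens, lowercased, no stemming."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def dr_score(disambig_f1: float, rouge_l: float) -> float:
    """Geometric mean of the two components (assumed combination rule)."""
    return math.sqrt(disambig_f1 * rouge_l)
```

With a geometric mean, a fluent answer that resolves none of the interpretations still scores zero, because either component being zero zeroes the product.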

Disambiguation accuracy helps to address some of the shortcomings of ROUGE (e.g., it penalizes responses that are semantically similar to the reference but worded differently). It works by using an encoder model (RoBERTa) fine-tuned on the SQuADv2 question answering dataset to produce a short answer from the long answer, and then computing the F1 score over the tokens that the produced and reference short answers share (after normalization, such as removing punctuation).
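
The token-level F1 step can be sketched as follows. This assumes the usual SQuAD-style normalization (lowercasing, stripping punctuation, dropping English articles) and counts repeated tokens as a multiset; the exact normalization used in the paper may differ in its details.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """F1 over the tokens shared by a predicted and a reference short answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("King Charles X", "Charles X"))
```

Because the comparison is on short answers rather than on the full long answer, differently worded summaries can still get full credit as long as the same short answer can be read off them.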

Overall workflow: [figure omitted]

Most Glaring Deficiency

I was really confused about how they could use RoBERTa to generate text, since it is an encoder-only model. Maybe they meant a different model when they referred to the SQuADv2 model.
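
One possible resolution (an assumption, not something confirmed above): SQuAD-style question answering is usually extractive, so an encoder-only model does not need to generate text. It scores every token as a potential start or end of the answer, and the answer is the highest-scoring span copied out of the passage. A toy sketch with made-up scores standing in for the model's outputs:

```python
def best_span(
    start_scores: list[float], end_scores: list[float], max_len: int = 10
) -> tuple[int, int]:
    """Return (start, end) token indices maximizing start score + end score."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider spans that end at or after the start, up to max_len tokens.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Made-up per-token scores; a real model would produce these from the passage.
tokens = ["Charles", "X", "ruled", "until", "1830"]
start_scores = [2.0, 0.1, -1.0, -1.0, 0.0]
end_scores = [0.2, 2.5, -1.0, -1.0, 0.3]

i, j = best_span(start_scores, end_scores)
print(" ".join(tokens[i : j + 1]))
```

Under this reading, the long answer plays the role of the passage, and the "generated" short answer is just a span selected from it.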

Conclusions for Future Work

A good dataset should address the limitations of the datasets that came before it and, with that, the limitations of the evaluation techniques used with them.