Long-Form QA beyond ELI5: A Task Generating Elaborate Answers to Open-Ended Questions

Every day we turn on our laptops and smartphones, open our web browser, navigate to our favorite search engine, and search for the answers to our questions. This routine frames our personal and professional lives. Modern search engines provide website recommendations in lieu of straightforward answers. But that’s about to change.

The prototypes of next-generation search engines are already here; they are just hidden from the public and currently confined to the research labs of companies such as OpenAI and DeepMind. Just a few days apart in December 2021, both OpenAI and DeepMind released previews of these systems. OpenAI showcased a system that could synthesize original answers to web search-like questions while citing web references supporting the answers. DeepMind released a Retrieval-Enhanced Transformer (RETRO) specializing in knowledge-intensive open-ended question answering.

While these are exciting developments, systems based on so-called long-form question answering (LFQA) are still not ready for widespread adoption. We hope today’s release of our LFQA dataset is another step forward in the right direction. Along with the LFQA dataset, we are releasing a new version of the BART-based Seq2Seq model for abstractive answer generation and matching DPR-based encoders for questions and context passages.

Let’s do a deep dive into the dataset, an important piece of the puzzle behind LFQA systems — and our work to improve it. We’ll cover the initial stumbling blocks and our attempts to move forward and resolve some of the obstacles impeding progress in this exciting field.

Background

We have witnessed astonishing advancements in Natural Language Processing (NLP) in the last few years. Question Answering (QA) systems built on top of recent language models (BERT, RoBERTa, etc.) can answer factoid-based questions with relative ease and excellent precision. These QA systems take a question, find the relevant document passages, and extract the most probable answer by selecting the most likely token span within those passages.

Recently, researchers have developed more sophisticated QA systems that can answer open-ended questions requiring paragraph-long answers. These systems are usually classified as Long-Form Question Answering (LFQA) systems. They function by querying large document stores for relevant information and subsequently using the retrieved documents to generate accurate, multi-sentence answers. The documents related to a given query, colloquially called context passages, are not used merely as source tokens for extracted answers, but instead provide a larger context for the synthesis of original, abstractive long-form answers. LFQA systems usually consist of three components:

A document store including content passages for a variety of topics
Retriever models to encode documents/questions such that it is possible to query the document store
A Seq2Seq language model capable of generating paragraph-long answers when given a question and context passages retrieved from the document store
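To make the division of labor concrete, here is a minimal sketch of how the three components above could fit together. It is not the implementation used in this project: the class and function names (`Passage`, `DocumentStore`, `answer`) are hypothetical, the encoder and generator are passed in as plain callables, and the nearest-neighbor lookup is a brute-force dot product rather than a real index.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Passage:
    text: str
    embedding: Sequence[float]  # assumed to be produced offline by a context encoder


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class DocumentStore:
    """Component 1: holds passages and returns the best matches for a query vector."""

    def __init__(self, passages: List[Passage]):
        self.passages = passages

    def query(self, question_embedding: Sequence[float], top_k: int = 3) -> List[Passage]:
        ranked = sorted(
            self.passages,
            key=lambda p: dot(question_embedding, p.embedding),
            reverse=True,
        )
        return ranked[:top_k]


def answer(
    question: str,
    store: DocumentStore,
    encode_question: Callable[[str], Sequence[float]],  # component 2: retriever encoder
    generate: Callable[[str, List[str]], str],          # component 3: Seq2Seq generator
    top_k: int = 3,
) -> str:
    contexts = store.query(encode_question(question), top_k)
    return generate(question, [p.text for p in contexts])


# Toy stand-ins for the real models, only to show the data flow.
store = DocumentStore([
    Passage("Contrails form when jet exhaust vapor condenses.", [1.0, 0.0]),
    Passage("The Pope is the Bishop of Rome.", [0.0, 1.0]),
])
toy_encoder = lambda q: [1.0, 0.0] if "jet" in q else [0.0, 1.0]
toy_generator = lambda q, ctx: " ".join(ctx)
print(answer("What causes the trail behind jets?", store, toy_encoder, toy_generator, top_k=1))
```

In a real system, the toy encoder would be replaced by a trained question encoder and the toy generator by a Seq2Seq model conditioned on the question and the retrieved passages.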

Motivation

Each of these components presents challenges, and recent work has made progress on them. In 2019, Angela Fan et al. formally introduced the task of long-form question answering in their seminal research paper, “ELI5: Long Form Question Answering”. The paper introduced a dataset of 270K question/answer pairs from the Reddit forum “Explain Like I’m Five” (ELI5), an online community dedicated to answering complex questions using language comprehensible to a five-year-old. The ELI5 dataset consists of diverse questions that require paragraph-long answers. Although the ELI5 answer-generating Seq2Seq model produced relatively understandable answers, human evaluators preferred gold responses in over 86% of cases, leaving considerable room for improvement.

More recently, in their paper “Hurdles to Progress in Long-form Question Answering”, Krishna et al. described how they improved on the original LFQA task results with a REALM-initialized, BERT-based retriever and a Seq2Seq answer-generating model based on the Routing Transformer. In addition to achieving state-of-the-art results on the KILT benchmark for the LFQA task, Krishna et al. identified four still-outstanding issues with LFQA:

The ELI5 dataset has significant train/validation/test dataset overlaps
Answer generations are not grounded in retrievals (context passages)
LFQA task metrics need improvement
Human evaluation is challenging but necessary for evaluating LFQA

Each of these issues is non-trivial and equally important to improving the LFQA task as a whole.

LFQA dataset — reducing ELI5 train/validation/test overlaps

With an eye on advancing and improving LFQA, we created a train/validation/test dataset that has reduced overlap compared to the ELI5 dataset. We suspected that reduced overlap might improve the grounding of answers in their respective retrievals.

We compared questions in the train, validation, and test sets of the HuggingFace (HF) ELI5 dataset, using the Sentence-BERT (SBERT) semantic search utility to gauge semantic similarity. More precisely, we compared top-K similarity scores (for K = 1, 2, 3) of the dataset questions and confirmed the overlap results reported by Krishna et al.
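The top-K comparison can be illustrated with a small, dependency-free sketch. It assumes question embeddings have already been computed (for example by an SBERT model) and are available as plain lists of floats; the function names `cosine` and `top_k_siblings` are hypothetical, and the brute-force search stands in for a dedicated semantic search utility.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_siblings(
    query_emb: Sequence[float],
    train_embs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (train index, similarity) for the k most similar train questions."""
    scored = [(i, cosine(query_emb, emb)) for i, emb in enumerate(train_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D embeddings; real sentence embeddings have hundreds of dimensions.
train_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
validation_question = [1.0, 0.1]
for index, score in top_k_siblings(validation_question, train_embs, k=2):
    print(index, round(score, 3))
```

The top-1 score for each validation or test question is the quantity of interest for overlap analysis: it measures how close the nearest train "sibling" is.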

Using human judgement, we found questions above the cosine similarity threshold of 0.8 to be very similar, and those above 0.84 to be almost paraphrases of each other. Below are a few selected example questions from the KILT ELI5 validation set, the most similar “sibling” question in the train set, and their respective cosine similarity.

ELI5 question examples — image by author

Krishna et al. conducted a human study and found that “81% of validation set questions have at least one paraphrase in the training set, while all annotated questions have at least one topically similar question in the training set, which indicates substantial training/validation overlap.”

Using SBERT, we searched for the most semantically similar question in the train set for each question in the validation and test sets. We then plotted the probability distribution of semantic similarity between train/validation and train/test questions in the datasets. As shown in the overlap probability graph below, there is indeed a significant probability of a validation question having a very similar question “sibling” in the train set. However, our estimate is somewhat smaller than the 81% reported by Krishna et al. and closer to 60–65%.

ELI5 train/validation and train/test overlap (50 bins for 0–1 similarity range) — image by author

Although somewhat smaller than train/validation overlap, the probability of significant semantic similarity between test and train set questions is also high.
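The overlap plots above can be reproduced in spirit with a simple normalized histogram. The sketch below assumes we already have, for every validation (or test) question, the similarity of its closest train question; the function names and the sample values are illustrative only and are not taken from the actual datasets.

```python
from typing import List, Sequence


def similarity_histogram(max_sims: Sequence[float], bins: int = 50) -> List[float]:
    """Fraction of questions whose best train-set similarity falls in each bin of [0, 1]."""
    counts = [0] * bins
    for s in max_sims:
        s = min(max(s, 0.0), 1.0)             # clamp slightly out-of-range scores
        index = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[index] += 1
    total = len(max_sims)
    return [c / total for c in counts] if total else [0.0] * bins


def share_above(max_sims: Sequence[float], threshold: float) -> float:
    """Fraction of questions with a train sibling at or above the threshold."""
    return sum(s >= threshold for s in max_sims) / len(max_sims)


# Made-up scores for four validation questions.
max_sims = [0.9, 0.85, 0.5, 0.3]
hist = similarity_histogram(max_sims, bins=50)
print(share_above(max_sims, 0.8))
```

Summing the histogram mass above a chosen similarity threshold gives an overlap estimate of the kind discussed above; which threshold counts as "overlap" remains a judgement call.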

The new LFQA dataset contains question/answer pairs from three Reddit forums: r/explainlikeimfive, r/AskHistorians, and r/askscience. We set a goal to create a new LFQA dataset with train/validation/test splits such that:

all three subreddits are equally represented in all splits
train/validation/test overlap is minimized while maintaining the size of the dataset compared to the old ELI5 dataset
the validation/test splits contain all the ELI5 test/validation examples from the KILT version of the dataset
validation and test splits include questions with high Reddit scores (votes)

Since the question embeddings are not evenly distributed in space, we resorted to hierarchical clustering. Two questions with a 0.9 dot product in a dense part of the question space will likely need different answers, while in contrast, questions with a 0.8 dot product in a sparse part of the space are probably the same question.

After hierarchical question clustering, we trialled a heuristic of selecting only leaf nodes, but that approach resulted in a dataset with too few data points. As some of the non-leaf nodes are pretty close to each other, we added an adaptive stopping criterion that descends the tree only until it reaches a certain depth, a joining similarity threshold, or a given number of items in the subtree. As we’ll see below, a dot product similarity threshold of 0.83 resulted in an appropriately sized dataset with a smaller train/validation/test overlap. As the similarity probability graphs below show, the new LFQA dataset (right) has a significantly reduced train/validation overlap at higher similarities compared to the ELI5 dataset (left). Only a small portion of questions reach the maximum similarity of 0.83; most of the overlapping questions have a similarity of 0.75 or below.

ELI5 vs LFQA train/validation overlap (50 bins for 0–1 similarity range) — image by author

Similarly, train/test overlap has been reduced as well.

ELI5 vs LFQA train/test overlap (50 bins for 0–1 similarity range) — image by author

Below are a few selected example questions from the LFQA validation set, their most similar “sibling” question in the train set, and their respective cosine similarity. No train/validation or train/test question pair has a similarity above 0.83.

LFQA question examples — image by author
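The idea of splitting by clusters rather than by individual questions can be sketched as follows. This is a deliberately simplified stand-in for the procedure described above, not the method actually used: it groups questions by single-linkage clustering at a fixed similarity threshold (equivalent to cutting a single-linkage dendrogram at that similarity) and omits the depth and subtree-size criteria. All names are hypothetical, and embeddings are assumed to be precomputed.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_by_threshold(embs: Sequence[Sequence[float]], threshold: float = 0.83) -> List[List[int]]:
    """Group questions so that any pair at or above the threshold shares a cluster."""
    parent = list(range(len(embs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    # Brute-force O(n^2) comparison; fine for a sketch, too slow for 200K+ questions.
    for i in range(len(embs)):
        for j in range(i + 1, len(embs)):
            if cosine(embs[i], embs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(embs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(
    clusters: List[List[int]], eval_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Assign whole clusters to train or evaluation so siblings never straddle splits."""
    rng = random.Random(seed)
    shuffled = clusters[:]
    rng.shuffle(shuffled)
    total = sum(len(c) for c in shuffled)
    train: List[int] = []
    evaluation: List[int] = []
    for cluster in shuffled:
        target = evaluation if len(evaluation) < eval_fraction * total else train
        target.extend(cluster)
    return train, evaluation


embs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99], [0.7, 0.7]]
train_ids, eval_ids = split_clusters(cluster_by_threshold(embs, threshold=0.83))
print(sorted(train_ids), sorted(eval_ids))
```

Because every pair at or above the threshold ends up in the same cluster, and clusters are never divided, no train/evaluation pair in this sketch can reach the threshold.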

LFQA DPR-based retriever

To generate answers to the questions asked, the abstractive answer generation model relies on retrievers to find relevant context passages. The LFQA retriever consists of DPR-based question and context encoders.

We trained our DPR-based retriever using FAIR’s dpr-scale in two stages. In the first stage, we used the PAQ-based pretrained checkpoint and fine-tuned the retriever on the question-answer pairs from the LFQA dataset. As dpr-scale requires a DPR-formatted training set input with positive, negative, and hard negative samples, we created a training file in which the positive is the answer to the question, the negatives are answers unrelated to the question, and the hard negatives are chosen from answers to questions whose cosine similarity to the given question is between 0.55 and 0.65.

In the second stage, we created a new DPR training set using positives, negatives, and hard negatives from the Wikipedia/Faiss index created in the first stage instead of from LFQA dataset answers. More precisely, for each dataset question, we queried the first-stage Wikipedia/Faiss index with topk=50 and subsequently used the SBERT cross-encoder to score the question/answer (passage) pairs. The passage with the highest cross-encoder score was selected as the positive, the bottom seven passages were selected as hard negatives, and the negative samples were again answers unrelated to the given dataset question. After creating a DPR-formatted training file with Wikipedia-sourced positive, negative, and hard negative passages, we trained DPR-based question/passage encoders using dpr-scale.

The new LFQA DPR-based retriever slightly underperforms the REALM-based retriever used by Krishna et al., with a KILT benchmark performance of 11.2 for R-precision and 19.5 for Recall@5.
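The sample selection in both stages can be sketched in a few lines. The code below is an illustration under stated assumptions, not the project's actual training-file generator: the function names are hypothetical, cross-encoder scores and question similarities are assumed to be precomputed, and the output keys (`positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`) follow the JSON layout commonly used for DPR training data, which may differ from what dpr-scale expects.

```python
from typing import Dict, List, Sequence, Tuple


def first_stage_hard_negatives(
    question_sims: Sequence[float],
    answers: Sequence[str],
    low: float = 0.55,
    high: float = 0.65,
) -> List[str]:
    """Stage 1: keep answers whose questions are moderately similar to the target question."""
    return [a for s, a in zip(question_sims, answers) if low <= s <= high]


def build_dpr_example(
    question: str,
    scored_passages: Sequence[Tuple[str, float]],  # (passage text, cross-encoder score)
    unrelated_passages: Sequence[str],
    num_hard_negatives: int = 7,
) -> Dict[str, object]:
    """Stage 2: top-scoring passage is the positive, lowest-scoring ones are hard negatives."""
    ranked = sorted(scored_passages, key=lambda p: p[1], reverse=True)
    positive = ranked[0][0]
    # Skip the positive itself, then take the lowest-scoring retrieved passages.
    hard = [text for text, _ in ranked[1:][-num_hard_negatives:]]
    return {
        "question": question,
        "positive_ctxs": [{"text": positive}],
        "negative_ctxs": [{"text": t} for t in unrelated_passages],
        "hard_negative_ctxs": [{"text": t} for t in hard],
    }


# Fifty fake retrieved passages with increasing scores, mimicking topk=50.
retrieved = [(f"passage {i}", float(i)) for i in range(50)]
example = build_dpr_example("Why is the sky blue?", retrieved, ["an unrelated answer"])
print(example["positive_ctxs"][0]["text"], len(example["hard_negative_ctxs"]))
```

Taking hard negatives from the bottom of the retrieved list means they are still topically related (they were retrieved for the question) yet judged least relevant by the cross-encoder.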

Demo

To mark the release of the new LFQA dataset, LFQA BART-based Seq2Seq model, and DPR-based retriever models, we are also releasing a HuggingFace Spaces demo of the Wikipedia Assistant at: https://huggingface.co/spaces/lfqa/lfqa

To find question-relevant Wikipedia context passages, we deployed a context server on the AWS cloud using the cortex.dev ML deployment platform. The context server contains an encoded version of the Wikipedia snapshot loaded into the FAISS index. Users can ask long-form, open-ended questions related to science, history, and other diverse topics found on Wikipedia. Given an arbitrary user question, the DPR retriever retrieves relevant documents from the context server and passes them to the BART LFQA model, which generates an abstractive answer. The answer is then passed to the text-to-speech model, which generates an audio response. The demo, along with the model training code and utilities, is available at the aforementioned HuggingFace Spaces location.

The LFQA dataset we are releasing today has 226147 train, 3020 validation, and 10000 test samples. The LFQA dataset, with a significantly reduced train/validation/test overlap, could be a stepping stone to resolve the remaining hurdles in Long-Form Question Answering.

The LFQA dataset, DPR-based retrievers, and new LFQA BART-based model for abstractive answer generation are all available on the HuggingFace hub at https://huggingface.co/vblagoje

We want to thank Yacine Jernite for his valuable feedback throughout the project. Branden Chan, Pablo Rodriguez Bertorello, Suchismit Mahapatra and Christina Cressler provided feedback on early drafts of this article.

BibTeX Citation

@article{blagojevic2022lfqa,
  author={Vladimir Blagojevic},
  title={Long-Form QA beyond ELI5: an updated dataset and approach},
  year={2022},
  url={towardsdatascience.com/long-form-qa-beyond-eli5-an-updated-dataset-and-approach-319cb841aabb},
}

Appendix

Example questions from the validation set and their generated answers

What causes the trail behind jets at high altitude?

It’s called a contrail, and it’s caused by the condensation of water vapor in the exhaust of a jet engine. The water vapor condenses on the exhaust plume of the jet, and when it condenses, it makes a trail behind the jet.

Why can I fall asleep in noisy environments (school lectures, public transport, cinemas, etc) but an even lesser amount of noise can disturb my sleep when I’m in bed?

I think it has to do with the way your brain works. When you’re in a quiet environment, your brain doesn’t have to process as much information as it does when you’re awake, so you don’t need as much processing power to fall asleep. This is why you can fall asleep in noisy environments (school lectures, public transport, cinemas etc.) but an even lesser amount of noise can disturb your sleep.

What is happening in Mali and what do Germany and France have to do with it?

Mali is a country in the Sahel region of Africa. It has a large population of people who are Muslim. In the past few years, there have been a number of terrorist attacks in Mali. The government of Mali has been unable to stop these attacks, and has been forced to resort to military action to stop them. Germany and France have decided to send troops to Mali to help the Mali government fight the terrorists. This is the first deployment of EU troops in Africa.

What does the Pope actually do?

The Pope is the head of the Roman Catholic Church. He is also the leader of the Catholic Church as a whole. The Pope is elected by the College of Cardinals, a group of cardinals who are chosen by the bishops of the Church. The College is made up of bishops from all over the world, and the Pope is chosen by them to be the Bishop of Rome.
