IMPORTANCE OF THE SINGLE-SPAN TASK FORMULATION TO EXTRACTIVE QUESTION-ANSWERING

Ensemble ALBERT and RoBERTa for Span Prediction in Question Answering

2021

Retrieving relevant answers from heterogeneous data formats, for given questions, is a challenging problem. The process of pinpointing relevant information suitable to answer a question is further compounded in large document collections containing documents of substantial length. This paper presents the models designed as part of our submission to the DialDoc21 Shared Task (Document-grounded Dialogue and Conversational Question Answering) for span prediction in question answering. The proposed models leverage the superior predictive power of pretrained transformer models like RoBERTa, ALBERT and ELECTRA to identify the most relevant information in an associated passage for the next agent turn. To further enhance the performance, the models were fine-tuned on different span-selection-based question answering datasets such as SQuAD2.0 and the Natural Questions (NQ) corpus. We also explored ensemble techniques for combining multiple models to achieve enhanced performance for the task. O...
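
A simple way to picture the ensembling step described above (an illustrative sketch only, with hypothetical per-token logits rather than the authors' exact aggregation) is to average the start and end logits produced by the member models and then decode a single best span:

```python
import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Pick the (start, end) pair with the highest combined score,
    subject to start <= end and a maximum span length."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Hypothetical per-token logits from three fine-tuned readers
# (e.g. RoBERTa, ALBERT, ELECTRA), aligned to the same tokenization.
model_starts = [np.random.randn(128) for _ in range(3)]
model_ends = [np.random.randn(128) for _ in range(3)]

# Logit-averaging ensemble: average member logits, then decode once.
avg_start = np.mean(model_starts, axis=0)
avg_end = np.mean(model_ends, axis=0)
print(best_span(avg_start, avg_end))
```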

Question Answering Using Hierarchical Attention on Top of BERT Features

Proceedings of the 2nd Workshop on Machine Reading for Question Answering

Machine Comprehension (MC) tests the ability of the machine to answer a question about a given passage. It requires modeling complex interactions between the passage and the question. Recently, attention mechanisms have been successfully extended to machine comprehension. In this work, the question and passage are encoded using BERT language embeddings to better capture the respective representations at a semantic level. Then, attention and fusion are conducted horizontally and vertically across layers at different levels of granularity between question and paragraph. Our experiments were performed on the datasets provided in the MRQA 2019 shared task.

Span Selection Pre-training for Question Answering

arXiv (Cornell University), 2019

BERT (Bidirectional Encoder Representations from Transformers) and related pre-trained Transformers have provided large gains across many language understanding tasks, achieving a new state-of-the-art (SOTA). BERT is pretrained on two auxiliary tasks: Masked Language Model and Next Sentence Prediction. In this paper we introduce a new pre-training task inspired by reading comprehension to better align the pre-training from memorization to understanding. Span Selection Pre-Training (SSPT) poses cloze-like training instances, but rather than draw the answer from the model's parameters, it is selected from a relevant passage. We find significant and consistent improvements over both BERT BASE and BERT LARGE on multiple Machine Reading Comprehension (MRC) datasets. Specifically, our proposed model has strong empirical evidence as it obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT LARGE by 3 F1 points on short answer prediction. We also show significant impact in HotpotQA, improving answer prediction F1 by 4 points and supporting fact prediction F1 by 1 point and outperforming the previous best system. Moreover, we show that our pre-training approach is particularly effective when training data is limited, improving the learning curve by a large amount.
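
The cloze-style instances described here can be pictured roughly as follows; this is a simplified sketch assuming exact string matching and character offsets, not the authors' actual SSPT construction pipeline:

```python
# Minimal sketch of a span-selection pre-training instance, assuming the
# answer text appears verbatim in both the query sentence and the passage.
def make_sspt_instance(query_sentence, span, passage):
    """Blank out `span` in the query and record where the same text
    occurs in the relevant passage as the extraction target."""
    assert span in query_sentence and span in passage
    cloze_query = query_sentence.replace(span, "[BLANK]", 1)
    start = passage.index(span)
    return {
        "query": cloze_query,
        "passage": passage,
        "answer_start": start,
        "answer_end": start + len(span),
    }

instance = make_sspt_instance(
    query_sentence="The Eiffel Tower was completed in 1889.",
    span="1889",
    passage="Construction of the Eiffel Tower finished in 1889 for the World's Fair.",
)
print(instance["query"])  # "The Eiffel Tower was completed in [BLANK]."
print(instance["passage"][instance["answer_start"]:instance["answer_end"]])  # "1889"
```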

NLQuAD: A Non-Factoid Long Question Answering Data Set

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We introduce NLQuAD, the first data set with baseline methods for non-factoid long question answering, a task requiring document-level language understanding. In contrast to existing span detection question answering data sets, NLQuAD has non-factoid questions that are not answerable by a short span of text and instead demand multiple-sentence descriptive answers and opinions. We show the limitation of the F1 score for evaluation of long answers and introduce Intersection over Union (IoU), which measures position-sensitive overlap between the predicted and the target answer spans. To establish baseline performances, we compare BERT, RoBERTa, and Longformer models. Experimental results and human evaluations show that Longformer outperforms the other architectures, but results are still far behind a human upper bound, leaving substantial room for improvements. NLQuAD's samples exceed the input limitation of most pretrained Transformer-based models, encouraging future research on long sequence language models.
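
The position-sensitive IoU proposed here can be computed directly from answer-span offsets; a minimal sketch (the offset representation and end-exclusive convention are assumptions for illustration):

```python
def span_iou(pred, gold):
    """Intersection over Union of two answer spans given as
    (start, end) offsets with end exclusive."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction that covers only part of a long gold answer is penalized
# for the missing positions, unlike a bag-of-words F1 score.
print(span_iou((10, 40), (10, 100)))   # 0.333...
print(span_iou((10, 100), (10, 100)))  # 1.0
```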

Hurdles to Progress in Long-form Question Answering

ArXiv, 2021

The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To demonstrate these challenges, we first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset. While our system tops the public leaderboard, a detailed analysis reveals several troubling trends: (1) our system’s generated answers are not actually grounded in the documents that it retrieves; (2) ELI5 contains significant train / validation overlap, as at least 81% of ELI5 validation questions occur in paraphrased form in the training set; (3) ROUGE-L is not an informative metric of generated answer qua...

Rethinking the objectives of extractive question answering

2020

This paper describes two generally applicable approaches towards the significant improvement of the performance of state-of-the-art extractive question answering (EQA) systems. Firstly, contrary to a common belief, it demonstrates that using the objective with an independence assumption for the span probability, P(a_s, a_e) = P(a_s)P(a_e) for a span starting at position a_s and ending at position a_e, may have adverse effects. Therefore we propose a new compound objective that models the joint probability P(a_s, a_e) directly, while still keeping the objective with the independence assumption as an auxiliary objective. Our second approach shows the beneficial effect of the distantly semi-supervised shared-normalization objective known from (Clark and Gardner, 2017). We show that normalizing over a set of documents similar to the golden passage, and marginalizing over all ground-truth answer string positions, leads to improved results from smaller statistical models. Our results are supported...
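
The contrast between the two objectives can be sketched as follows; the logits are hypothetical, and in the paper's compound objective the pair score comes from a dedicated head rather than the simple sum used here:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Hypothetical start/end logits over a 6-token passage.
start_logits = np.array([0.2, 2.0, 0.1, 1.5, 0.0, -1.0])
end_logits   = np.array([0.0, 0.5, 2.2, 0.3, 1.8, -0.5])

# Independence assumption: P(a_s, a_e) = P(a_s) * P(a_e),
# each factor normalized separately over positions.
p_start, p_end = softmax(start_logits), softmax(end_logits)
indep = np.outer(p_start, p_end)

# Compound-style objective: normalize jointly over all valid (s, e)
# pairs, so probability mass is assigned to spans rather than to start
# and end positions in isolation (and s <= e can be enforced directly).
pair_scores = start_logits[:, None] + end_logits[None, :]
valid = np.triu(np.ones_like(pair_scores, dtype=bool))
joint = np.where(valid, np.exp(pair_scores), 0.0)
joint /= joint.sum()

print(np.unravel_index(indep.argmax(), indep.shape))
print(np.unravel_index(joint.argmax(), joint.shape))
```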

FAT ALBERT: Finding Answers in Large Texts using Semantic Similarity Attention Layer based on BERT

2020

Machine-based text comprehension has always been a significant research field in natural language processing. Once a full understanding of the text context and semantics is achieved, a deep learning model can be trained to solve a large subset of tasks, e.g. text summarization, classification and question answering. In this paper we focus on the question answering problem, specifically the multiple choice type of questions. We develop a model based on BERT, a state-of-the-art transformer network. Moreover, we address BERT's limited capacity for long texts by extracting the highest-influence sentences through a semantic similarity model. Evaluations of our proposed model demonstrate that it outperforms the leading models in the MovieQA challenge and we are currently ranked first on the leaderboard with a test accuracy of 87.79%. Finally, we discuss the model shortcomings and suggest possible improvements to overcome these limitations.
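
The sentence-selection idea, keeping only the passage sentences most relevant to the question before running the reader, can be sketched with any similarity measure; here TF-IDF cosine similarity stands in for the paper's BERT-based semantic similarity layer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_sentences(question, sentences, k=3):
    """Rank passage sentences by similarity to the question and keep
    the k highest-scoring ones, preserving their original order."""
    vec = TfidfVectorizer().fit(sentences + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(sentences))[0]
    keep = sorted(sorted(range(len(sentences)), key=lambda i: -sims[i])[:k])
    return [sentences[i] for i in keep]

sentences = [
    "The movie opens in a small coastal town.",
    "The detective interviews the lighthouse keeper.",
    "A storm delays the ferry for two days.",
    "The keeper admits he saw the boat leave at midnight.",
]
print(top_sentences("Who saw the boat leave?", sentences, k=2))
```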

QuesBELM: A BERT based Ensemble Language Model for Natural Questions

2020 5th International Conference on Computing, Communication and Security (ICCCS), 2020

A core goal in artificial intelligence is to build systems that can read the web, and then answer complex questions related to random searches about any topic. These question-answering (QA) systems could have a big impact on the way that we access information. In this paper, we addressed the task of question-answering (QA) systems on Google's Natural Questions (NQ) dataset containing real user questions issued to Google search and the answers found from Wikipedia by annotators. In our work, we systematically compare the performance of powerful variant models of Transformer architectures (BERT-base, BERT-large-WWM and ALBERT-XXL) over the Natural Questions dataset. We also propose a state-of-the-art BERT based ensemble language model, QuesBELM. QuesBELM leverages the power of existing BERT variants combined together to build a more accurate stacking ensemble model for question answering (QA) systems. The model integrates top-K predictions from single language models to determine the best an...
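
One simple way to integrate top-K predictions from several readers (a hedged sketch, not the exact QuesBELM stacking procedure) is confidence-weighted voting over normalized answer strings:

```python
from collections import defaultdict

def ensemble_topk(per_model_topk):
    """Combine top-K (answer, score) lists from several models by
    summing scores for identical answer strings and re-ranking."""
    totals = defaultdict(float)
    for topk in per_model_topk:
        for answer, score in topk:
            totals[answer.strip().lower()] += score
    return max(totals.items(), key=lambda kv: kv[1])

# Hypothetical top-3 candidates from three fine-tuned readers.
predictions = [
    [("Ada Lovelace", 0.62), ("Lovelace", 0.21), ("Charles Babbage", 0.10)],
    [("Ada Lovelace", 0.48), ("Charles Babbage", 0.30), ("Ada", 0.12)],
    [("Charles Babbage", 0.41), ("Ada Lovelace", 0.39), ("Babbage", 0.09)],
]
print(ensemble_topk(predictions))  # ('ada lovelace', 1.49)
```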

CNN for Text-Based Multiple Choice Question Answering

Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018

The task of Question Answering is at the very core of machine comprehension. In this paper, we propose a Convolutional Neural Network (CNN) model for text-based multiple choice question answering where questions are based on a particular article. Given an article and a multiple choice question, our model assigns a score to each question-option tuple and chooses the final option accordingly. We test our model on the Textbook Question Answering (TQA) and SciQ datasets. Our model outperforms several LSTM-based baseline models on the two datasets.

Training Question Answering Models From Synthetic Data

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models and explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Removing access to real Wikipedia data, we synthesize questions and answers from a synthetic corpus generated by an 8.3 billion parameter GPT-2 model. With no access to human supervision and only access to other models, we are able to train state-of-the-art question answering networks on entirely model-generated data that achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8 absolute gain on EM score compared to prior work using synthetic data. Consistent with prior work (Alberti et al., 2019a; Dong et al., 2019), we use a 3-step modeling pipeline consisting of unconditional answer extraction from text, question generation, and question filtration. Our approach for training question generators on labeled data uses pretrained GPT-2 decoder models and a next-token-prediction language modeling objective, trained using a concatenation of context, answer, and question tokens. As demonstrated in sections 5.1 and 6.1, pretraining large generative transformer models up to 8.3B parameters improves the quality of generated questions. Additionally, we propose an overgenerate and filter approach to further improve question filtration. The quality of questions produced by this pipeline can be assessed quantitatively by finetuning QA models and evaluating results on the SQuAD dataset. We demonstrate generated questions to be comparable to supervised training with real data. For answerable SQuAD1.1
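
The filtration step in such pipelines is often implemented as roundtrip consistency: keep a generated question only if a QA model, reading the same passage, recovers the original answer. The sketch below uses toy stand-ins for the question generator and the reader; it illustrates the idea, not the paper's exact filtering criteria:

```python
def normalize(text):
    return " ".join(text.lower().split())

def filter_questions(passage, answer_span, generate_questions, answer_fn, n=10):
    """Overgenerate n candidate questions for one answer span and keep
    only those whose predicted answer matches the original span
    (roundtrip consistency)."""
    kept = []
    for question in generate_questions(passage, answer_span, n):
        if normalize(answer_fn(passage, question)) == normalize(answer_span):
            kept.append(question)
    return kept

# Toy stand-ins so the sketch runs; a real system would plug in a
# GPT-2-style question generator and a fine-tuned extractive QA model.
def toy_generator(passage, span, n):
    return [f"In what year was the tower finished? (candidate {i})" for i in range(n)]

def toy_qa(passage, question):
    return "1889"  # pretend the reader always extracts this span

print(len(filter_questions("The tower was finished in 1889.", "1889",
                           toy_generator, toy_qa, n=5)))  # 5
```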