ECG-QALM: Entity-Controlled Synthetic Text Generation using Contextual Q&A for NER
Related papers
ECG-QALM: Entity-Controlled Synthetic Text Generation using Contextual Q&A for NER
Findings of the Association for Computational Linguistics: ACL 2023
State-of-the-art Named Entity Recognition (NER) methods require high-quality labeled datasets. Issues such as scarcity of labeled data, under-representation of entities, and privacy concerns around training on sensitive data can be significant barriers. Generating synthetic data to train models is a promising way to mitigate these problems. We propose ECG-QALM, a contextual question-and-answering approach that uses pre-trained language models to synthetically generate entity-controlled text. The generated text is then used to augment small labeled datasets for downstream NER tasks. We evaluate our method on two publicly available datasets. We find ECG-QALM is capable of producing full text samples in which desired entities appear in a controllable way, while retaining sentence coherence closest to the real-world data. Evaluations on NER tasks show significant improvements (75%-140%) in low-labeled data regimes.
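The paper's generation pipeline is not reproduced here, but the entity-control requirement, that each requested entity actually appears in a generated sample, can be sketched as a prompt-construction step plus a validation step. The function names and the Q&A prompt template below are illustrative assumptions, not the authors' implementation:

```python
def build_qa_prompt(entities):
    """Build a hypothetical Q&A-style prompt asking a language model to
    produce text mentioning each (surface form, entity type) pair."""
    lines = [f"Q: Write text mentioning '{surface}' as a {etype}. A:"
             for surface, etype in entities]
    return "\n".join(lines)

def entities_satisfied(generated_text, entities):
    """Check the entity-control constraint: every requested surface form
    must appear verbatim in the generated sample."""
    return all(surface in generated_text for surface, _ in entities)
```

In such a scheme, a language model would be called on the prompt, and only samples passing `entities_satisfied` would be retained for augmenting the NER training set.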
Context-NER : Contextual Phrase Generation at Scale
arXiv (Cornell University), 2021
NLP research has long focused on NER and how to efficiently extract entities from a sentence. However, generating the relevant context of entities in a sentence has remained under-explored. In this work we introduce the task CONTEXT-NER, in which the relevant context of an entity has to be generated. The generated context may not appear verbatim as a substring of the sentence. We also introduce the EDGAR10-Q dataset for this task, a corpus covering 1,500 publicly traded companies. It is a manually created, complex corpus and one of the largest in terms of number of sentences and entities (1M sentences and 2.8M entities). We introduce a baseline approach that leverages phrase generation algorithms and uses the pre-trained BERT model to obtain a 33% ROUGE-L score. We also perform a one-shot evaluation with GPT-3 and obtain a 39% score, signifying the hardness and future scope of this task. We hope that the addition of this dataset and our study will pave the way for further research in this domain.
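ROUGE-L, the metric quoted above, scores a generated context against a reference by their longest common subsequence (LCS) of tokens. A minimal self-contained implementation of the standard F-measure form is sketched below; the length-weighting parameter `beta` is an assumption, as the exact value used by the paper is not stated here:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences,
    via the classic dynamic-programming table."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def rouge_l_f1(candidate, reference, beta=1.2):
    """ROUGE-L F-measure over whitespace tokens: LCS-based precision and
    recall combined with a recall-favouring weight beta."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return ((1 + beta ** 2) * p * rec) / (rec + beta ** 2 * p)
```

An identical candidate and reference score 1.0; disjoint token sets score 0.0.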
Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus
Proceedings of The Web Conference 2020, 2020
The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question-answering and machine reading comprehension tasks, and helps a chatbot keep a conversation flowing with a human. Existing question generation models are ineffective at generating a large amount of high-quality question-answer pairs from unstructured text, since given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which aims at automatically generating high-quality and diverse question-answer pairs from an unlabeled text corpus at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples from the text multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions by leveraging the extracted assistive information; and iii) a neural quality controller, which removes low-quality generated data based on text entailment. We compare our question generation models with existing approaches and rely on voluntary human evaluation to assess the quality of the generated question-answer pairs. The evaluation results suggest that our system dramatically outperforms state-of-the-art neural question generation models in terms of generation quality, while remaining scalable. With models trained on a relatively small amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.
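The third stage of a pipeline like this, filtering generated pairs by an entailment-based quality score, reduces to a threshold filter once a scoring model has been run over the candidates. A minimal sketch follows; the `QAPair` fields and the 0.5 threshold are illustrative assumptions, not values from the paper:

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    entailment_score: float  # assumed output of a text-entailment model in [0, 1]

def quality_filter(pairs, threshold=0.5):
    """Keep only question-answer pairs whose answer is judged entailed by
    the source passage with a score at or above the threshold."""
    return [p for p in pairs if p.entailment_score >= threshold]
```

In practice the entailment scores would come from a trained entailment model run on (passage, question-answer) inputs; the filter itself is this simple.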
Named entity recognition for question answering
2006
Current text-based question answering (QA) systems usually contain a named entity recogniser (NER) as a core component. Named entity recognition has traditionally been developed as a component for information extraction systems, and current techniques are focused on that end use. However, no formal assessment has been done of the characteristics of a NER within the task of question answering. In this paper we present a NER that aims at higher recall by allowing multiple entity labels per string. The NER is embedded in a question answering system, and the overall QA system performance is compared to that of one with a traditional variant of the NER that allows only single entity labels. It is shown that the noise introduced by the additional labels is offset by the higher recall gained, giving the QA system a better chance of finding the answer.
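The recall trade-off described above can be made concrete: with multiple candidate labels per string, a gold entity counts as recalled if its label appears anywhere in the candidate set for its span. A minimal sketch of that recall computation, with illustrative data shapes rather than the paper's actual evaluation code:

```python
def multi_label_recall(gold, predictions):
    """gold: list of (span, label) pairs; predictions: dict mapping a span to
    the set of candidate entity labels the recogniser assigned it. A gold
    entity is recalled if its label is among the candidates for its span."""
    if not gold:
        return 0.0
    hits = sum(1 for span, label in gold
               if label in predictions.get(span, set()))
    return hits / len(gold)
```

Allowing larger candidate sets can only raise this recall, at the cost of more noise for the downstream answer-selection stage to filter out.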
Controllable Question Generation via Sequence-to-Sequence Neural Model with Auxiliary Information
2020 International Joint Conference on Neural Networks (IJCNN)
Automatic question generation (QG) has found applications in the education sector and in enhancing human-machine interactions in chatbots. Existing neural QG models can be categorized into answer-unaware and answer-aware models. One of the main challenges faced by existing neural QG models is the degradation in performance due to the issue of one-to-many mapping, where, given a passage, both the answer (query interest/question intent) and auxiliary information (context information present in the question) can result in different questions being generated. We propose a controllable question generation model (CQG) that employs an attentive sequence-to-sequence (seq2seq) based generative model with a copying mechanism. The proposed CQG also incorporates query interest and auxiliary information as controllers to address the one-to-many mapping problem in QG. Two variants of embedding strategies are designed for CQG to achieve good performance. To verify its performance, an automatic labeling scheme for harvesting auxiliary information is first developed. A QG dataset is also annotated with auxiliary information from a reading comprehension dataset. Performance evaluation shows that the proposed model not only outperforms existing QG models, but also has the potential to generate multiple relevant questions from a single passage.
Evaluating the state-of-the-art of End-to-End Natural Language Generation: The E2E NLG challenge
Computer Speech & Language, 2019
This paper provides a comprehensive analysis of the first shared task on End-to-End Natural Language Generation (NLG) and identifies avenues for future research based on the results. This shared task aimed to assess whether recent end-to-end NLG systems can generate more complex output by learning from datasets containing higher lexical richness, syntactic complexity and diverse discourse phenomena. Introducing novel automatic and human metrics, we compare 62 systems submitted by 17 institutions, covering a wide range of approaches, including machine learning architectures (the majority implementing sequence-to-sequence models, seq2seq) as well as systems based on grammatical rules and templates. Seq2seq-based systems demonstrated great potential for NLG in the challenge. We find that seq2seq systems generally score high in terms of word-overlap metrics and human evaluations of naturalness, with the winning Slug system (Juraska et al., 2018) being seq2seq-based. However, vanilla seq2seq models often fail to correctly express a given meaning representation if they lack a strong semantic control mechanism applied during decoding. Moreover, seq2seq models can be outperformed by hand-engineered systems in terms of overall quality, as well as complexity, length and diversity of outputs. This research has influenced, inspired and motivated a number of recent studies beyond the original competition, which we also summarise as part of this paper.
2021
We introduce a synthetic dialogue generation framework, Velocidapter, which addresses the corpus availability problem for dialogue comprehension. Velocidapter augments datasets by simulating synthetic conversations for a task-oriented dialogue domain, requiring a small amount of bootstrapping work for each new domain. We evaluate the efficacy of our framework on a task-oriented dialogue comprehension dataset, MRCWOZ, which we curate by annotating questions for slots in the restaurant, taxi, and hotel domains of the MultiWOZ 2.2 dataset (Zang et al., 2020). We run experiments within a low-resource setting, where we pretrain a model on SQuAD, fine-tuning it either on a small amount of original data or on the synthetic data generated by our framework. Velocidapter shows significant improvements using both the transformer-based BERTBase and BiDAF as base models. We further show that the framework is easy to use by novice users and conclude that Velocidapter can greatly help training over task-ori...
Transformer-based End-to-End Question Generation
ArXiv, 2020
Question Generation (QG) is an important task in Natural Language Processing (NLP) that involves generating questions automatically when given a context paragraph. While many techniques exist for the task of QG, they employ complex model architectures, extensive features, and additional mechanisms to boost model performance. In this work, we show that transformer-based finetuning techniques can be used to create robust question generation systems using only a single pretrained language model, without the use of additional mechanisms, answer metadata, and extensive features. Our best model outperforms previous more complex RNN-based Seq2Seq models, with an 8.62 and a 14.27 increase in METEOR and ROUGE_L scores, respectively. We show that it also performs on par with Seq2Seq models that employ answer-awareness and other special mechanisms, despite being only a single-model system. We analyze how various factors affect the model's performance, such as input data formatting, the len...
A Named Entity Recogniser for Question Answering
2000
Named Entity Recognisers (NERs) are typically used by question answering (QA) systems as means to preselect answer candidates. However, there has not been much work on the formal assessment of the use of NERs for QA nor on their optimal parameters. In this paper we investigate the main characteristics of a NER for QA. The results show
A Named Entity Recogniser for Question Answering | Macquarie University ResearchOnline
2007
Named Entity Recognisers (NERs) are typically used by question answering (QA) systems as means to preselect answer candidates. However, there has not been much work on the formal assessment of the use of NERs for QA nor on their optimal parameters. In this paper we investigate the main characteristics of a NER for QA. The results show that it is important to maintain high recall to retain all possible answers on the one hand, while high precision is essential during the final answer selection phase. We present an NER ...