Selecting Better Samples from Pre-trained LLMs: A Case Study on Question Generation

Vocabulary Matters: A Simple yet Effective Approach to Paragraph-level Question Generation

2020

Question generation (QG) has recently attracted considerable attention. Most current neural models take only one or two sentences as input and perform poorly when multiple sentences or complete paragraphs are given. However, in real-world scenarios it is very important to be able to generate high-quality questions from complete paragraphs. In this paper, we present a simple yet effective technique for answer-aware question generation from paragraphs. We augment a basic sequence-to-sequence QG model with a dynamic, paragraph-specific dictionary and copy attention that is persistent across the corpus, without requiring features generated by sophisticated NLP pipelines or handcrafted rules. Our evaluation on SQuAD shows that our model significantly outperforms current state-of-the-art systems for question generation from paragraphs in both automatic and human evaluation. We achieve a 6-point improvement over the best system on BLEU-4, from 16.38 to 22.62.
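
The core of the approach, as described, is blending the decoder's usual generation distribution with a copy distribution restricted to the paragraph's own tokens. Below is a minimal pointer-generator-style sketch of that blending step; the variable names and the exact gating are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mix_generate_and_copy(p_vocab, copy_attn, p_gen, paragraph_token_ids):
    """Blend the generation distribution over the full vocabulary with a copy
    distribution over the paragraph's own tokens (the dynamic,
    paragraph-specific dictionary). p_gen in [0, 1] gates between the two."""
    p_copy = np.zeros_like(p_vocab)
    # scatter the attention mass onto the vocabulary ids of the paragraph tokens
    np.add.at(p_copy, paragraph_token_ids, copy_attn)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

# e.g. a 10-word vocabulary and a 3-token paragraph with ids [4, 7, 4]:
# p_vocab is a length-10 distribution, copy_attn a length-3 attention vector.
```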

Mind the Gap: Learning to Choose Gaps for Question Generation

The 2012 Conference of the North …, 2012

Not all learning takes place in an educational setting: more and more self-motivated learners are turning to on-line text to learn about new topics. Our goal is to provide such learners with the well-known benefits of testing by automatically generating quiz questions for online text. Prior work on question generation has focused on the grammaticality of generated questions and generating effective multiple-choice distractors for individual question targets, both key parts of this problem. Our work focuses on the complementary aspect of determining what part of a sentence we should be asking about in the first place; we call this "gap selection." We address this problem by asking human judges about the quality of questions generated from a Wikipedia-based corpus, and then training a model to effectively replicate these judgments. Our data shows that good gaps are of variable length and span all semantic roles, i.e., nouns as well as verbs, and that a majority of good questions do not focus on named entities. Our resulting system can generate fill-in-the-blank (cloze) questions from generic source materials.
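
As a concrete illustration of the output format only, the sketch below turns a chosen gap into a fill-in-the-blank question; the paper's actual contribution is the learned model that scores which candidate gap to pick, which is not shown here.

```python
def make_cloze(sentence, gap_start, gap_end, blank="_____"):
    """Turn a sentence into a fill-in-the-blank (cloze) question by removing
    the selected gap span; the removed text becomes the answer key."""
    answer = sentence[gap_start:gap_end]
    question = sentence[:gap_start] + blank + sentence[gap_end:]
    return question, answer

# make_cloze("The mitochondrion is the powerhouse of the cell.", 4, 17)
# -> ("The _____ is the powerhouse of the cell.", "mitochondrion")
```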

Scalable Educational Question Generation with Pre-trained Language Models

arXiv, 2023

The automatic generation of educational questions will play a key role in scaling online education, enabling self-assessment at scale as a global population navigates personalised learning journeys. We develop EduQG, a novel educational question generation model built by adapting a large language model. Our extensive experiments demonstrate that EduQG can produce superior educational questions by further pre-training and fine-tuning a pre-trained language model on scientific text and science question data.
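
A minimal sketch of the two-stage adaptation the abstract describes: continued pre-training on scientific text followed by fine-tuning on question data. GPT-2 as the base model and the plain-text file names are stand-ins for illustration; this is not the authors' EduQG code or data.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def lm_dataset(path):
    """Tokenise a plain-text file for causal language modelling."""
    ds = load_dataset("text", data_files=path)["train"]
    return ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=512),
                  batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tok, mlm=False)

# Stage 1: continued pre-training on scientific text.
# Stage 2: fine-tuning on question data (e.g. "context ... question ..." lines).
for stage, path in [("pretrain", "science_corpus.txt"),
                    ("finetune", "science_questions.txt")]:
    Trainer(model=model,
            args=TrainingArguments(output_dir=f"eduqg_{stage}",
                                   num_train_epochs=1,
                                   per_device_train_batch_size=4),
            train_dataset=lm_dataset(path),
            data_collator=collator).train()
```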

Question-Worthy Sentence Selection for Question Generation

Advances in Artificial Intelligence, 2020

The problem of automatic question generation from text is of increasing importance due to its many useful applications. While deep neural networks have achieved success in generating questions from text paragraphs, they mainly take the whole paragraph as input, assuming all sentences are question-worthy. However, a text paragraph often contains only a few important sentences that are worth asking questions about. To that end, we present a feature-based sentence selection method for identifying question-worthy sentences. Such sentences are then used by a sequence-to-sequence (i.e., seq2seq) model to generate questions. Our experiments show that these features significantly improve the questions generated by seq2seq models.
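
A hedged sketch of what feature-based sentence selection can look like. The features below (length, position, counts of numbers, entities, and verbs) and the logistic-regression classifier are illustrative stand-ins, not necessarily the paper's feature set or model.

```python
import spacy                                   # assumes en_core_web_sm is installed
from sklearn.linear_model import LogisticRegression

nlp = spacy.load("en_core_web_sm")

def sentence_features(sent, index, n_sents):
    """Simple surface and shallow-semantic features for one sentence."""
    doc = nlp(sent)
    return [
        len(doc),                               # token count
        index / max(n_sents - 1, 1),            # relative position in the paragraph
        sum(tok.like_num for tok in doc),       # numbers often anchor questions
        len(doc.ents),                          # named-entity count
        sum(tok.pos_ == "VERB" for tok in doc), # verb count
    ]

# X: feature vectors for labelled sentences, y: 1 if question-worthy else 0
# clf = LogisticRegression().fit(X, y)
# worthy = [s for i, s in enumerate(sents)
#           if clf.predict([sentence_features(s, i, len(sents))])[0] == 1]
```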

Simplifying Paragraph-level Question Generation via Transformer Language Models

2021

Question generation (QG) is a natural language generation task where a model is trained to ask questions corresponding to some input text. Most recent approaches frame QG as a sequence-to-sequence problem and rely on additional features and mechanisms to increase performance; however, these often increase model complexity, and can rely on auxiliary data unavailable in practical use. A single Transformer-based unidirectional language model leveraging transfer learning can be used to produce high quality questions while disposing of additional task-specific complexity. Our QG model, finetuned from GPT-2 Small, outperforms several paragraph-level QG baselines on the SQuAD dataset by 0.95 METEOR points. Human evaluators rated questions as easy to answer, relevant to their context paragraph, and corresponding well to natural human speech. Also introduced is a new set of baseline scores on the RACE dataset, which has not previously been used for QG tasks. Further experimentation with vary...

Towards Automatic Generation of Questions from Long Answers

2020

Automatic question generation (AQG) has broad applicability in domains such as tutoring systems, conversational agents, healthcare literacy, and information retrieval. Existing efforts at AQG have been limited to short answer lengths of up to two or three sentences. However, several real-world applications require question generation from answers that span several sentences. Therefore, we propose a novel evaluation benchmark to assess the performance of existing AQG systems for long-text answers. We leverage the large-scale open-source Google Natural Questions dataset to create the aforementioned long-answer AQG benchmark. We empirically demonstrate that the performance of existing AQG methods significantly degrades as the length of the answer increases. Transformer-based methods outperform other existing AQG methods on long answers in terms of automatic as well as human evaluation. However, we still observe degradation in the performance of our best performing models with increasin...

On the Evaluation of Answer-Agnostic Paragraph-level Multi-Question Generation

arXiv, 2022

We study the task of predicting a set of salient questions from a given paragraph without any prior knowledge of the precise answer. We make two main contributions. First, we propose a new method to evaluate a set of predicted questions against the set of references by using the Hungarian algorithm to assign predicted questions to references before scoring the assigned pairs. We show that our proposed evaluation strategy has better theoretical and practical properties compared to prior methods because it can properly account for the coverage of references. Second, we compare different strategies to utilize a pre-trained seq2seq model to generate and select a set of questions related to a given paragraph. The code is available.
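
The evaluation idea is concrete enough to sketch: build a pairwise score matrix between predicted and reference questions, find the optimal one-to-one assignment with the Hungarian algorithm, and score only the assigned pairs. Sentence-level BLEU is used here purely as an example pair scorer, and normalising by the number of references (to reward coverage) is left out for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def assigned_score(predictions, references):
    """Average pair score under the optimal one-to-one assignment."""
    smooth = SmoothingFunction().method1
    score = np.zeros((len(predictions), len(references)))
    for i, pred in enumerate(predictions):
        for j, ref in enumerate(references):
            score[i, j] = sentence_bleu([ref.split()], pred.split(),
                                        smoothing_function=smooth)
    # Hungarian algorithm maximises the total score of the assignment
    rows, cols = linear_sum_assignment(-score)
    return score[rows, cols].mean()
```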

Transformer-based End-to-End Question Generation

arXiv, 2020

Question Generation (QG) is an important task in Natural Language Processing (NLP) that involves generating questions automatically when given a context paragraph. While many techniques exist for the task of QG, they employ complex model architectures, extensive features, and additional mechanisms to boost model performance. In this work, we show that transformer-based finetuning techniques can be used to create robust question generation systems using only a single pretrained language model, without the use of additional mechanisms, answer metadata, and extensive features. Our best model outperforms previous more complex RNN-based Seq2Seq models, with an 8.62 and a 14.27 increase in METEOR and ROUGE_L scores, respectively. We show that it also performs on par with Seq2Seq models that employ answer-awareness and other special mechanisms, despite being only a single-model system. We analyze how various factors affect the model's performance, such as input data formatting, the len...
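
A hedged sketch of the single-model setup: the context paragraph is serialised into a prompt and the finetuned language model continues it with a question. The "context: ... question:" format and the plain GPT-2 checkpoint are assumptions for illustration, not the paper's exact formatting or weights.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in for a finetuned QG checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_question(paragraph, max_new_tokens=40):
    """Serialise the paragraph into a prompt and let the LM continue it."""
    prompt = f"context: {paragraph} question:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         do_sample=True, top_p=0.9,
                         pad_token_id=tok.eos_token_id)
    # keep only the newly generated tokens after the prompt
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
```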

Evaluation of Question Generation Needs More References

arXiv, 2023

Question generation (QG) is the task of generating a valid and fluent question based on a given context and target answer. Depending on the purpose, even given the same context, instructors can ask questions about different concepts, and even the same concept can be phrased in different ways. However, QG evaluation usually depends on a single reference-based similarity metric, such as an n-gram-based or learned metric, which is not sufficient to fully evaluate the potential of QG methods. To this end, we propose paraphrasing the reference question for more robust QG evaluation. Using large language models such as GPT-3, we create semantically and syntactically diverse questions, then adopt a simple aggregation of the popular evaluation metrics as the final score. Through our experiments, we find that using multiple (pseudo) references is more effective for QG evaluation, showing a higher correlation with human judgments than evaluation with a single reference.
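
A small sketch of scoring against multiple (pseudo) references: compute a per-reference metric and aggregate. Taking the maximum and using BLEU are illustrative choices; the paper aggregates several popular metrics, and the paraphrase step (e.g. via GPT-3) is assumed to have already produced the extra references.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def multi_reference_bleu(candidate, references):
    """Best BLEU over the original reference and its paraphrases."""
    smooth = SmoothingFunction().method1
    return max(sentence_bleu([ref.split()], candidate.split(),
                             smoothing_function=smooth)
               for ref in references)

# references = [original_question] + paraphrases_of(original_question)  # hypothetical helper
# score = multi_reference_bleu(generated_question, references)
```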

A minimally supervised approach for question generation: what can we learn from a single seed?

In this paper, we investigate how many quality natural language questions can be generated from a single question/answer pair (a seed). In our approach we learn patterns that relate the various levels of linguistic information in the question/answer seed to the same levels of information in text. These patterns contain lexical, syntactic, and semantic information, and when they are matched against a target document, new question/answer pairs can be generated. Here, we focus specifically on the task of generating questions. Several works, for instance in Question Answering, explore the rewriting of questions to create (usually lexical) patterns; instead, we use several levels of linguistic information: lexical, syntactic, and semantic (through the use of named entities). Also, such patterns are commonly hand-crafted, as opposed to our strategy, where patterns are automatically learned from a single seed. Preliminary results show that with the single question/answer seed pair "When was Leonardo da Vinci born?"/1452, we manage to generate several questions (from documents related to 25 personalities), of which 80% were evaluated as plausible.
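
An intentionally simplified, regex-only illustration of the seed idea: derive a surface pattern from text that matches the seed question/answer pair and instantiate it on new sentences. The actual approach learns patterns over lexical, syntactic, and named-entity layers rather than a single regular expression.

```python
import re

# Seed: "When was Leonardo da Vinci born?" / "1452"
# Surface pattern induced from text matching the seed: "<PERSON> was born in <YEAR>"
pattern = re.compile(r"([A-Z][\w.]*(?: [A-Z][\w.]*)+) was born in (\d{4})")

def generate_questions(text):
    """Instantiate the learned pattern on new text to yield question/answer pairs."""
    return [(f"When was {person} born?", year)
            for person, year in pattern.findall(text)]

# generate_questions("Marie Curie was born in 1867 in Warsaw.")
# -> [("When was Marie Curie born?", "1867")]
```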