Textual properties and task based evaluation: investigating the role of surface properties, structure and content

Why We Need New Evaluation Metrics for NLG

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: we investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system level and can support system development by finding cases where a system performs poorly.
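As a rough illustration of the kind of analysis this paper performs (not the authors' own code), the sketch below scores a few hypothetical system outputs with BLEU and checks how well the metric tracks hypothetical human judgements at segment level. It assumes NLTK and SciPy are available; all outputs, references, and ratings are invented.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical system outputs, references, and human quality ratings (1-5).
outputs = [
    "X is a moderately priced restaurant in X.",
    "X is restaurant moderate price in X.",
    "there is a moderately priced restaurant called X in X.",
    "X serves food.",
]
reference = "X is a moderately priced restaurant in X."
human_scores = [5.0, 2.0, 4.0, 1.5]

smooth = SmoothingFunction().method1
bleu = [
    sentence_bleu([reference.split()], out.split(), smoothing_function=smooth)
    for out in outputs
]

# Segment-level correlation between the metric and the human judgements;
# the paper finds such correlations to be weak for word-based metrics.
rho, _ = spearmanr(bleu, human_scores)
print("BLEU per output:", [round(b, 3) for b in bleu])
print(f"Spearman correlation with human scores: {rho:.2f}")
```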

Introducing Shared Task Evaluation to NLG

Shared Task Evaluation Challenges (STECs) have only recently begun in the field of NLG. The TUNA STECs, which focused on Referring Expression Generation (REG), have been part of this development since its inception. This chapter looks back on the experience of organising the three TUNA Challenges, which came to an end in 2009.

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

ArXiv, 2021

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose Ch...
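A minimal, hypothetical sketch of the per-criterion analysis argued for above: rather than correlating a metric with a single overall-quality score, correlate it with human scores on each criterion separately. It assumes SciPy; every number and metric name below is invented for illustration.

```python
from scipy.stats import pearsonr

# Hypothetical human scores on separate criteria for five system outputs.
human = {
    "fluency":  [4.5, 3.0, 4.0, 2.5, 5.0],
    "coverage": [2.0, 4.5, 3.0, 4.0, 2.5],
    "overall":  [4.0, 3.5, 3.5, 3.0, 4.5],
}
# Hypothetical automatic metric scores for the same five outputs.
metrics = {
    "word_overlap_metric": [0.62, 0.41, 0.55, 0.30, 0.70],
    "coverage_heuristic":  [0.30, 0.80, 0.50, 0.75, 0.35],
}

# No single metric is expected to correlate well with every criterion.
for metric_name, m in metrics.items():
    for criterion, h in human.items():
        r, _ = pearsonr(m, h)
        print(f"{metric_name:>20} vs {criterion:<8}: r = {r:+.2f}")
```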

Validating the web-based evaluation of NLG systems

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers on - ACL-IJCNLP '09, 2009

The GIVE Challenge is a recent shared task in which NLG systems are evaluated over the Internet. In this paper, we validate this novel NLG evaluation methodology by comparing the Internet-based results with results we collected in a lab experiment. We find that the results delivered by both methods are consistent, but the Internet-based approach offers the statistical power necessary for more fine-grained evaluations and is cheaper to carry out.
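A hypothetical illustration (not the GIVE Challenge data) of the two claims in this abstract: per-system results from lab and web evaluations can be checked for consistency via correlation, and the larger web sample yields more statistical power to detect the same difference between systems. It assumes SciPy; all counts and rates are invented.

```python
from scipy.stats import pearsonr, chi2_contingency

# Hypothetical task-success rates for five NLG systems under both protocols.
lab_success = [0.70, 0.55, 0.62, 0.48, 0.66]
web_success = [0.72, 0.52, 0.60, 0.45, 0.68]
r, _ = pearsonr(lab_success, web_success)
print(f"Lab vs. web consistency: r = {r:.2f}")

def compare(successes_a, n_a, successes_b, n_b):
    """Chi-square test on two systems' success counts; returns the p-value."""
    table = [[successes_a, n_a - successes_a],
             [successes_b, n_b - successes_b]]
    _, p, _, _ = chi2_contingency(table)
    return p

# Same underlying difference (60% vs 50% success), different sample sizes:
# the larger web-scale sample detects it, the lab-scale sample does not.
print(f"Lab-sized samples  (n=30 each):  p = {compare(18, 30, 15, 30):.3f}")
print(f"Web-sized samples  (n=500 each): p = {compare(300, 500, 250, 500):.3f}")
```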

Refocusing on Relevance: Personalization in NLG

ArXiv, 2021

Many NLG tasks such as summarization, dialogue response, or open-domain question answering focus primarily on a source text in order to generate a target response. This standard approach falls short, however, when a user's intent or context of work is not easily recoverable from that source text alone, a scenario that we argue is more the rule than the exception. In this work, we argue that NLG systems in general should place much greater emphasis on making use of additional context, and suggest that relevance (as used in Information Retrieval) be thought of as a crucial tool for designing user-oriented text-generating tasks. We further discuss possible harms and hazards around such personalization, and argue that value-sensitive design represents a crucial path forward through these challenges.

Exploratory analysis on the natural language processing models for task specific purposes

Bulletin of Electrical Engineering and Informatics

Natural language processing (NLP) is a technology that has become widespread in the area of human language understanding and analysis. A range of text processing tasks such as summarisation, semantic analysis, classification, question-answering, and natural language inference are commonly performed using it. Choosing the right model for a given task, however, remains a difficult and often limiting decision. This study therefore examines which modern NLP models are better suited to the tasks above, comparing them on datasets such as SQuAD and GLUE. The BERT, RoBERTa, distilBERT, BART, ALBERT, and text-to-text transfer transformer (T5) models are compared, with the aim of understanding each model's underlying architecture, its effect on the use case, and where it falls short. We observed that RoBERTa was more effective than ALBERT, distilBERT, and BERT on tasks related to semantic analysis, natural language inference, and question-answering, which we attribute to RoBERTa's dynamic masking. For summarisation, although BART and T5 have very similar architectures, BART performed slightly better than T5.
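As a sketch of how such a side-by-side comparison can be set up in practice (assuming the Hugging Face transformers library; the checkpoint names are illustrative public models, not necessarily the ones used in this study), two question-answering models can be run on the same input and their answers and confidence scores compared:

```python
from transformers import pipeline

context = ("Natural language processing (NLP) covers tasks such as summarisation, "
           "classification, question answering and natural language inference.")
question = "Which tasks does NLP cover?"

# Illustrative RoBERTa- and distilBERT-based QA checkpoints fine-tuned on SQuAD.
for checkpoint in ["deepset/roberta-base-squad2",
                   "distilbert-base-cased-distilled-squad"]:
    qa = pipeline("question-answering", model=checkpoint)
    result = qa(question=question, context=context)
    print(f"{checkpoint}: '{result['answer']}' (score={result['score']:.3f})")
```

The extracted answers would then be scored against SQuAD-style gold answers (exact match and F1) to obtain the kind of comparison reported in the study.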

Integrated NLP evaluation system for pluggable evaluation metrics with extensive interoperable toolkit

Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing - SETQA-NLP '09, 2009

To understand the key characteristics of NLP tools, evaluation and comparison against different tools is important. As NLP applications tend to consist of multiple semi-independent sub-components, it is not always enough to evaluate complete systems; a fine-grained evaluation of the underlying components is also often worthwhile. Standardization of NLP components and resources is not only significant for reusability, but also in that it allows the comparison of individual components in terms of reliability and robustness across a wider range of target domains. But as many evaluation metrics exist even within a single domain, any system seeking to aid inter-domain evaluation needs not just predefined metrics, but must also support pluggable user-defined metrics. Such a system would of course need to be based on an open standard to allow a large number of components to be compared, and would ideally include visualization of the differences between components. We have developed a pluggable evaluation system based on the UIMA framework, which provides visualization useful in error analysis. It is a single integrated system which includes a large ready-to-use, fully interoperable library of NLP tools.
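The system described above is built on the UIMA framework (Java); the Python sketch below only illustrates the "pluggable metric" idea itself, not that system's API: predefined and user-defined metrics share one interface and are registered by name, so new metrics can be dropped in without changing the evaluation driver. All names in it are invented for illustration.

```python
from typing import Callable, Dict, List

MetricFn = Callable[[List[str], List[str]], float]
METRICS: Dict[str, MetricFn] = {}

def register_metric(name: str):
    """Decorator that plugs a (possibly user-defined) metric into the registry."""
    def wrap(fn: MetricFn) -> MetricFn:
        METRICS[name] = fn
        return fn
    return wrap

@register_metric("accuracy")
def accuracy(gold: List[str], pred: List[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

@register_metric("error_rate")  # a user-defined metric added without touching the driver
def error_rate(gold: List[str], pred: List[str]) -> float:
    return 1.0 - accuracy(gold, pred)

def evaluate(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """The evaluation driver stays the same no matter which metrics are registered."""
    return {name: fn(gold, pred) for name, fn in METRICS.items()}

# Toy POS-tagging comparison of a component's output against gold annotations.
print(evaluate(["NN", "VB", "DT"], ["NN", "VB", "JJ"]))
```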

On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?

Proceedings of the 8th International Conference on Agents and Artificial Intelligence, 2016

The paper discusses problems in state-of-the-art evaluation methods used in natural language processing (NLP). Usually, some form of gold-standard data is used to evaluate various NLP tasks, ranging from morphological annotation to semantic analysis. We discuss the problems and validity of this type of evaluation for various tasks and illustrate them with examples. We then propose using application-driven evaluations wherever possible. Although this is more expensive, more complicated and less precise, it is the only way to find out whether a particular tool is useful at all.