On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?

Workshop on the evaluation of natural language processing systems

1990

In the past few years, the computational linguistics research community has begun to wrestle with the problem of how to evaluate its progress in developing natural language processing systems. With the exception of natural language interfaces, there are few working systems in existence, and they tend to focus on very different tasks using equally different techniques.

Evaluation of natural language processing systems: Issues and approaches

Proceedings of the IEEE, 2000

This paper encompasses two main topics: a broad and general analysis of the issue of performance evaluation of NLP systems and a report on a specific approach developed by the authors and experimented on a sample test case. More precisely, it first presents a brief survey of the major works in the area of NLP systems evaluation. Then, after introducing the notion of the life cycle of an NLP system, it focuses on the concept of performance evaluation and analyzes the scope and the major problems of the investigation. The tools generally used within computer science to assess the quality of a software system are briefly reviewed, and their applicability to the task of evaluation of NLP systems is discussed. Particular attention is devoted to the concepts of efficiency, correctness, reliability, and adequacy, and how all of them basically fail in capturing the peculiar features of performance evaluation of an NLP system is discussed. Two main approaches to performance evaluation are later introduced, namely black-box and model-based, and their most important characteristics are presented. Finally, a specific model for performance evaluation proposed by the authors is illustrated, and the results of an experiment with a sample application are reported. The paper concludes with a discussion on research perspectives, open problems, and the importance of performance evaluation to industrial applications.

Principles of Evaluation in Natural Language Processing

In this special issue of TAL, we look at the fundamental principles underlying evaluation in natural language processing. We adopt a global point of view that goes beyond the horizon of a single evaluation campaign or a particular protocol. After a brief review of history and terminology, we will address the topic of a gold standard for natural language processing, of annotation quality, of the amount of data, of the difference between technology evaluation and usage evaluation, of dialog systems, and of standards, before concluding with a short discussion of the articles in this special issue and some prospective remarks.

DiET-Diagnostic and Evaluation Tools for natural language processing applications

Proceedings of the first …, 1998

The DiET project is developing a comprehensive environment for the construction, annotation and maintenance of structured reference data for the diagnosis and evaluation of NLP applications. The target user groups are developers, professional evaluators and consultants of language technology. The system, implemented in a configurable, open client/server architecture, lets the user construct and annotate data by freely choosing annotation types from a given set that comes with all the necessary functions for editing, displaying and storing such annotations. The project will also result in a substantial amount of structured test data, representing linguistic phenomena on the levels of morphology, syntax and discourse and annotated with information covering different linguistic and application-specific aspects to make the data as transparent as possible and to support optimal access and retrieval. For the application to new domains, the user is also given various means for customisation. Through a process of corpus profiling, links can be established between the structured test items in the database and related phenomena occurring in domain-specific corpora. Lexical replacement functions allow the user to adapt the vocabulary of the test items to a specific domain and terminology. Finally, the tools and database allow the user to set up evaluation scenarios and to record the results of test cycles.

Evaluating natural language systems

Proceedings of the 12th conference on Computational linguistics, 1988

This paper reports progress in the development of evaluation methodologies for natural language systems. Without a common classification of the problems in natural language understanding, authors have no way to specify clearly what their systems do, potential users have no way to compare different systems, and researchers have no way to judge the advantages or disadvantages of different approaches to developing systems.

An evaluation of natural language processing methodologies

Proceedings / AMIA ... Annual Symposium. AMIA Symposium, 1998

Medical language processing (MLP) systems that codify information in textual patient reports have been developed to help solve the data entry problem. Some systems have been evaluated in order to assess performance, but there has been little evaluation of the underlying technology. The different MLP systems use various methodologies, but the methods have not been compared, although evaluations of MLP methodologies would be extremely beneficial to the field. This paper describes a study that evaluates different techniques. To accomplish this task, an existing MLP system, MedLEE, was modified and results from a previous study were used. Based on confidence intervals and differences in sensitivity and specificity between each technique and all the others combined, the results showed that the two methods based on obtaining the largest well-formed segment within a sentence had significantly higher sensitivity than the others by 5% and 6%. The method based on recognizin...
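
For readers unfamiliar with the measures named in the abstract above, the following is a minimal sketch of how sensitivity, specificity, and a confidence interval for each might be computed from a confusion matrix. It is not the study's procedure: the abstract does not say which interval method was used, so the Wilson score interval here is an assumption, and the function names and counts are purely illustrative.

```python
import math


def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true findings the system detected
    specificity = tn / (tn + fp)  # share of absent findings correctly left out
    return sensitivity, specificity


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Assumption: chosen for illustration; the abstract does not name a method.
    """
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts, not taken from the study.
tp, fp, tn, fn = 80, 10, 90, 20
sens, spec = sensitivity_specificity(tp, fp, tn, fn)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)
print(f"sensitivity={sens:.2f}, CI=({sens_ci[0]:.3f}, {sens_ci[1]:.3f})")
print(f"specificity={spec:.2f}, CI=({spec_ci[0]:.3f}, {spec_ci[1]:.3f})")
```

Under this kind of analysis, two techniques would typically be judged to differ when the interval around the difference in their sensitivities (or specificities) excludes zero.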