Integrated NLP Evaluation System for Pluggable Evaluation Metrics with Extensive Interoperable Toolkit
Related papers
Why We Need New Evaluation Metrics for NLG
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017
The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
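As a concrete illustration of how metric-human agreement of this kind is typically measured, here is a minimal sketch: it scores a few hypothetical outputs with sentence-level BLEU (via NLTK) and correlates them with made-up human ratings using Spearman's rho (via SciPy). The data and the choice of BLEU and Spearman are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch (illustrative, not the paper's pipeline): compute a sentence-level
# automatic metric for each system output and correlate it with human ratings.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical per-item data: references, system outputs, and mean human ratings.
references = [
    ["X is a moderately priced restaurant in X ."],
    ["X serves Italian food in the city centre ."],
    ["X is a cheap pub near X ."],
]
outputs = [
    "X is a restaurant in X with moderate prices .",
    "X serves Italian food in the centre .",
    "X is an expensive hotel near X .",
]
human_scores = [4.2, 4.8, 2.1]

smooth = SmoothingFunction().method1
metric_scores = [
    sentence_bleu([r.split() for r in refs], out.split(), smoothing_function=smooth)
    for refs, out in zip(references, outputs)
]

# Spearman's rho measures how well the metric's ranking agrees with the human
# ranking; weak correlations on a real sample mirror the paper's finding.
rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```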
NLPStatTest: A Toolkit for Comparing NLP System Performance
2020
Statistical significance testing centered on p-values is commonly used to compare NLP system performance, but p-values alone are insufficient because statistical significance differs from practical significance. The latter can be measured by estimating effect size. In this paper, we propose a three-stage procedure for comparing NLP system performance and provide a toolkit, NLPStatTest, that automates the process. Users can upload NLP system evaluation scores and the toolkit will analyze these scores, run appropriate significance tests, estimate effect size, and conduct power analysis to estimate Type II error. The toolkit provides a convenient and systematic way to compare NLP system performance that goes beyond statistical significance testing.
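A minimal sketch of the kind of analysis the abstract describes, not NLPStatTest's actual API: given hypothetical per-example scores for two systems, it runs a paired significance test (SciPy), estimates Cohen's d as an effect size, and computes a post-hoc power estimate (statsmodels). The numbers and the specific test choices are illustrative assumptions.

```python
# Sketch of a three-stage comparison: significance test, effect size, power analysis.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon
from statsmodels.stats.power import TTestPower

# Hypothetical per-example evaluation scores (e.g., F1 per test document).
system_a = np.array([0.71, 0.68, 0.74, 0.66, 0.72, 0.69, 0.75, 0.70])
system_b = np.array([0.69, 0.67, 0.71, 0.66, 0.70, 0.68, 0.72, 0.69])
diff = system_a - system_b

# Stage 1: statistical significance (paired t-test; Wilcoxon as a non-parametric check).
t_stat, p_t = ttest_rel(system_a, system_b)
w_stat, p_w = wilcoxon(system_a, system_b)

# Stage 2: practical significance via effect size (Cohen's d for paired samples).
cohens_d = diff.mean() / diff.std(ddof=1)

# Stage 3: power analysis to gauge the risk of a Type II error at this sample size.
power = TTestPower().power(effect_size=cohens_d, nobs=len(diff), alpha=0.05)

print(f"paired t-test p={p_t:.4f}, Wilcoxon p={p_w:.4f}")
print(f"Cohen's d={cohens_d:.2f}, estimated power={power:.2f} "
      f"(Type II error ~ {1 - power:.2f})")
```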
Validating the web-based evaluation of NLG systems
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers on - ACL-IJCNLP '09, 2009
The GIVE Challenge is a recent shared task in which NLG systems are evaluated over the Internet. In this paper, we validate this novel NLG evaluation methodology by comparing the Internet-based results with results we collected in a lab experiment. We find that the results delivered by both methods are consistent, but the Internet-based approach offers the statistical power necessary for more fine-grained evaluations and is cheaper to carry out.
2015
Dashboard is a tool for integration, validation, and visualization of Natural Language Processing (NLP) systems. It provides infrastructural facilities with which individual NLP modules may be evaluated and refined, and multiple NLP modules may be combined to build a large end-user NLP system. It helps the system integration team integrate and validate NLP systems. The tool provides a visualization interface that helps developers profile the time and memory usage of each module. It helps researchers evaluate and compare their module with earlier versions of the same module. The tool promotes reuse of existing modules to build new NLP systems. Dashboard supports execution of modules that are distributed on heterogeneous platforms. It provides a powerful notation to specify runtime properties of NLP modules. It provides an easy-to-use graphical interface developed using Eclipse RCP. Users can choose an I/O perspective (view) that allows better visualization of inte...
Proceedings of the 8th International Conference on Agents and Artificial Intelligence, 2016
The paper discusses problems in state-of-the-art evaluation methods used in natural language processing (NLP). Usually, some form of gold-standard data is used for the evaluation of various NLP tasks, ranging from morphological annotation to semantic analysis. We discuss the problems and validity of this type of evaluation for various tasks and illustrate the problems with examples. We then propose using application-driven evaluations wherever possible. Although this is more expensive, more complicated and less precise, it is the only way to find out whether a particular tool is useful at all.
D2.5.1 Quantitative Evaluation Tools and Corpora V1
sekt-project.com
This deliverable covers the description and production of a semantically annotated corpus. This is available within the Sekt consortium as training and test data for the machine learning algorithms for semantic annotation and as a gold standard for the evaluation of techniques.
What is SemEval evaluating? A Systematic Analysis of Evaluation Campaigns in NLP
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, 2021
SemEval is the primary venue in the NLP community for the proposal of new challenges and for the systematic empirical evaluation of NLP systems. This paper provides a systematic quantitative analysis of SemEval, aiming to reveal the patterns behind its contributions. By understanding the distribution of task types, metrics, architectures, participation and citations over time, we aim to answer the question of what is being evaluated by SemEval.
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
ArXiv, 2021
Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data-to-text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose Ch...
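The perturbation idea can be illustrated with a small sketch, assuming NLTK's sentence-level BLEU as a stand-in metric and a hand-written drop_token helper (both hypothetical choices, not the authors' CheckLists framework): corrupt an output in a way that targets one criterion and verify that the metric's score moves in the expected direction.

```python
# Sketch of a single perturbation check: a criterion-targeted corruption of a
# system output should lower the score of a metric sensitive to that criterion.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
reference = "X is a moderately priced restaurant in X .".split()
output = "X is a moderately priced restaurant in X .".split()

def drop_token(tokens, index):
    """Coverage-style perturbation: remove one content-bearing token."""
    return tokens[:index] + tokens[index + 1:]

perturbed = drop_token(output, output.index("restaurant"))

score_before = sentence_bleu([reference], output, smoothing_function=smooth)
score_after = sentence_bleu([reference], perturbed, smoothing_function=smooth)

# A checklist aggregates many such templated perturbations per criterion and
# reports how often the metric responds as expected.
print(f"original={score_before:.3f}, perturbed={score_after:.3f}, "
      f"passes check: {score_after < score_before}")
```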