Textual properties and task based evaluation: investigating the role of surface properties, structure and content

Why We Need New Evaluation Metrics for NLG

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: we investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system level and can support system development by finding cases where a system performs poorly.
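As a rough illustration of the kind of analysis this paper performs (not the authors' own code), the sketch below scores a few hypothetical system outputs with BLEU and checks how well the metric tracks hypothetical human judgements at segment level. It assumes NLTK and SciPy are available; all outputs, references, and ratings are invented.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical system outputs, references, and human quality ratings (1-5).
outputs = [
    "X is a moderately priced restaurant in X.",
    "X is restaurant moderate price in X.",
    "there is a moderately priced restaurant called X in X.",
    "X serves food.",
]
reference = "X is a moderately priced restaurant in X."
human_scores = [5.0, 2.0, 4.0, 1.5]

smooth = SmoothingFunction().method1
bleu = [
    sentence_bleu([reference.split()], out.split(), smoothing_function=smooth)
    for out in outputs
]

# Segment-level correlation between the metric and the human judgements;
# the paper finds such correlations to be weak for word-based metrics.
rho, _ = spearmanr(bleu, human_scores)
print("BLEU per output:", [round(b, 3) for b in bleu])
print(f"Spearman correlation with human scores: {rho:.2f}")
```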

Introducing Shared Task Evaluation to NLG

Shared Task Evaluation Challenges (STECs) have only recently begun in the field of NLG. The TUNA STECs, which focused on Referring Expression Generation (REG), have been part of this development since its inception. This chapter looks back on the experience of organising the three TUNA Challenges, which came to an end in 2009.

Perturbation CheckLists for Evaluating NLG Evaluation Metrics

ArXiv, 2021

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherency, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans for a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that there is no single metric which correlates well with human scores on all desirable criteria, for most NLG tasks. Given this situation, we propose Ch...
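A minimal, hypothetical sketch of the per-criterion analysis argued for above: rather than correlating a metric with a single overall-quality score, correlate it with human scores on each criterion separately. It assumes SciPy; every number and metric name below is invented for illustration.

```python
from scipy.stats import pearsonr

# Hypothetical human scores on separate criteria for five system outputs.
human = {
    "fluency":  [4.5, 3.0, 4.0, 2.5, 5.0],
    "coverage": [2.0, 4.5, 3.0, 4.0, 2.5],
    "overall":  [4.0, 3.5, 3.5, 3.0, 4.5],
}
# Hypothetical automatic metric scores for the same five outputs.
metrics = {
    "word_overlap_metric": [0.62, 0.41, 0.55, 0.30, 0.70],
    "coverage_heuristic":  [0.30, 0.80, 0.50, 0.75, 0.35],
}

# No single metric is expected to correlate well with every criterion.
for metric_name, m in metrics.items():
    for criterion, h in human.items():
        r, _ = pearsonr(m, h)
        print(f"{metric_name:>20} vs {criterion:<8}: r = {r:+.2f}")
```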

Validating the web-based evaluation of NLG systems

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers on - ACL-IJCNLP '09, 2009

The GIVE Challenge is a recent shared task in which NLG systems are evaluated over the Internet. In this paper, we validate this novel NLG evaluation methodology by comparing the Internet-based results with results we collected in a lab experiment. We find that the results delivered by both methods are consistent, but the Internet-based approach offers the statistical power necessary for more fine-grained evaluations and is cheaper to carry out.
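A hypothetical illustration (not the GIVE Challenge data) of the two claims in this abstract: per-system results from lab and web evaluations can be checked for consistency via correlation, and the larger web sample yields more statistical power to detect the same difference between systems. It assumes SciPy; all counts and rates are invented.

```python
from scipy.stats import pearsonr, chi2_contingency

# Hypothetical task-success rates for five NLG systems under both protocols.
lab_success = [0.70, 0.55, 0.62, 0.48, 0.66]
web_success = [0.72, 0.52, 0.60, 0.45, 0.68]
r, _ = pearsonr(lab_success, web_success)
print(f"Lab vs. web consistency: r = {r:.2f}")

def compare(successes_a, n_a, successes_b, n_b):
    """Chi-square test on two systems' success counts; returns the p-value."""
    table = [[successes_a, n_a - successes_a],
             [successes_b, n_b - successes_b]]
    _, p, _, _ = chi2_contingency(table)
    return p

# Same underlying difference (60% vs 50% success), different sample sizes:
# the larger web-scale sample detects it, the lab-scale sample does not.
print(f"Lab-sized samples  (n=30 each):  p = {compare(18, 30, 15, 30):.3f}")
print(f"Web-sized samples  (n=500 each): p = {compare(300, 500, 250, 500):.3f}")
```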

Refocusing on Relevance: Personalization in NLG

ArXiv, 2021

Many NLG tasks such as summarization, dialogue response, or open-domain question answering focus primarily on a source text in order to generate a target response. This standard approach falls short, however, when a user's intent or context of work is not easily recoverable from that source text alone, a scenario that we argue is more the rule than the exception. In this work, we argue that NLG systems in general should place much greater emphasis on making use of additional context, and suggest that relevance (as used in Information Retrieval) be thought of as a crucial tool for designing user-oriented text-generating tasks. We further discuss possible harms and hazards around such personalization, and argue that value-sensitive design represents a crucial path forward through these challenges.

Exploratory analysis on the natural language processing models for task specific purposes

Bulletin of Electrical Engineering and Informatics

Natural language processing (NLP) is a technology that has become widespread in the area of human language understanding and analysis. A range of text processing tasks such as summarisation, semantic analysis, classification, question-answering, and natural language inference are commonly performed using it. Choosing the right model for a given task, however, remains a difficult and often limiting decision. This study therefore examines which modern NLP models are better suited to the tasks above, comparing them on datasets such as SQuAD and GLUE. The BERT, RoBERTa, distilBERT, BART, ALBERT, and text-to-text transfer transformer (T5) models are compared, with the aim of understanding each model's underlying architecture, its effect on the use case, and where it falls short. We observed that RoBERTa was more effective than ALBERT, distilBERT, and BERT on tasks related to semantic analysis, natural language inference, and question-answering, which we attribute to RoBERTa's dynamic masking. For summarisation, although BART and T5 have very similar architectures, BART performed slightly better than T5.
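As a sketch of how such a side-by-side comparison can be set up in practice (assuming the Hugging Face transformers library; the checkpoint names are illustrative public models, not necessarily the ones used in this study), two question-answering models can be run on the same input and their answers and confidence scores compared:

```python
from transformers import pipeline

context = ("Natural language processing (NLP) covers tasks such as summarisation, "
           "classification, question answering and natural language inference.")
question = "Which tasks does NLP cover?"

# Illustrative RoBERTa- and distilBERT-based QA checkpoints fine-tuned on SQuAD.
for checkpoint in ["deepset/roberta-base-squad2",
                   "distilbert-base-cased-distilled-squad"]:
    qa = pipeline("question-answering", model=checkpoint)
    result = qa(question=question, context=context)
    print(f"{checkpoint}: '{result['answer']}' (score={result['score']:.3f})")
```

The extracted answers would then be scored against SQuAD-style gold answers (exact match and F1) to obtain the kind of comparison reported in the study.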

Integrated NLP evaluation system for pluggable evaluation metrics with extensive interoperable toolkit

Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing - SETQA-NLP '09, 2009

To understand the key characteristics of NLP tools, evaluation and comparison against different tools is important. As NLP applications tend to consist of multiple semi-independent sub-components, it is not always enough to evaluate complete systems; a fine-grained evaluation of the underlying components is also often worthwhile. Standardization of NLP components and resources is not only significant for reusability, but also in that it allows the comparison of individual components in terms of reliability and robustness across a wider range of target domains. But as many evaluation metrics exist even within a single domain, any system seeking to aid inter-domain evaluation needs not just predefined metrics, but must also support pluggable user-defined metrics. Such a system would of course need to be based on an open standard to allow a large number of components to be compared, and would ideally include visualization of the differences between components. We have developed a pluggable evaluation system based on the UIMA framework, which provides visualization useful in error analysis. It is a single integrated system which includes a large ready-to-use, fully interoperable library of NLP tools.
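The system described above is built on the UIMA framework (Java); the Python sketch below only illustrates the "pluggable metric" idea itself, not that system's API: predefined and user-defined metrics share one interface and are registered by name, so new metrics can be dropped in without changing the evaluation driver. All names in it are invented for illustration.

```python
from typing import Callable, Dict, List

MetricFn = Callable[[List[str], List[str]], float]
METRICS: Dict[str, MetricFn] = {}

def register_metric(name: str):
    """Decorator that plugs a (possibly user-defined) metric into the registry."""
    def wrap(fn: MetricFn) -> MetricFn:
        METRICS[name] = fn
        return fn
    return wrap

@register_metric("accuracy")
def accuracy(gold: List[str], pred: List[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

@register_metric("error_rate")  # a user-defined metric added without touching the driver
def error_rate(gold: List[str], pred: List[str]) -> float:
    return 1.0 - accuracy(gold, pred)

def evaluate(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """The evaluation driver stays the same no matter which metrics are registered."""
    return {name: fn(gold, pred) for name, fn in METRICS.items()}

# Toy POS-tagging comparison of a component's output against gold annotations.
print(evaluate(["NN", "VB", "DT"], ["NN", "VB", "JJ"]))
```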

On Evaluation of Natural Language Processing Tasks - Is Gold Standard Evaluation Methodology a Good Solution?

Proceedings of the 8th International Conference on Agents and Artificial Intelligence, 2016

The paper discusses problems in state-of-the-art evaluation methods used in natural language processing (NLP). Usually, some form of gold-standard data is used to evaluate various NLP tasks, ranging from morphological annotation to semantic analysis. We discuss the problems and validity of this type of evaluation for various tasks and illustrate them with examples. We then propose using application-driven evaluations wherever possible. Although this is more expensive, more complicated and less precise, it is the only way to find out whether a particular tool is useful at all.