Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool

Extending Automatic Discourse Segmentation for Texts in Spanish to Catalan

2016

At present, automatic discourse analysis is a relevant research topic in the field of NLP. However, discourse is one of the most difficult phenomena to process. Although discourse parsers have already been developed for several languages, no such tool exists for Catalan. To implement this kind of parser, the first step is to develop a discourse segmenter. In this article we present the first discourse segmenter for texts in Catalan. This segmenter is based on Rhetorical Structure Theory (RST) for Spanish, and uses lexical and syntactic information to translate rules valid for Spanish into rules for Catalan. We have evaluated the system using a gold standard corpus of manually segmented texts, and the results are promising.

The RST Basque TreeBank: an online search interface to check rhetorical relations

This paper introduces the first Basque discourse TreeBank annotated with rhetorical relations following Rhetorical Structure Theory. We report the main features of the corpus, such as the annotation criteria, inter-annotator agreement and harmonization procedure. We describe an online search system to check the annotation of discourse relations.

Features for automatic discourse analysis of paragraphs

2008

In this paper, we investigate which information is useful for the detection of rhetorical (RST) relations between (Multi-)Sentential Discourse Units ((M-)SDUs), that is, text spans consisting of one or more sentences, within the same paragraph. In order to do so, we simplified the task of discourse parsing to a decision problem in which we decided whether an (M-)SDU is rhetorically related to either a preceding or a following (M-)SDU. Employing the RST Treebank, we offered this choice to machine learning algorithms together with syntactic, lexical, referential, discourse and surface features. Next, the features were ranked on the basis of (1) models established by the classification algorithms and (2) feature selection metrics. Highly ranked features that predict the presence of a rhetorical relation are syntactic similarity, word overlap, word similarity, continuous punctuation and many reference features. Other features are used to introduce new topics or arguments: time references, proper nouns, definite articles and the word "further".
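As a rough illustration of one of the highly ranked features named in this abstract, word overlap, the following is a minimal sketch and not the authors' implementation. The tokenizer, the Jaccard formulation, and the function names `word_overlap` and `attach_direction` are all assumptions made for this example. The paper itself fed many features to machine learning algorithms rather than applying a single hand-written rule.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_overlap(span_a, span_b):
    """Jaccard overlap between the word types of two text spans (assumed feature definition)."""
    a, b = set(tokenize(span_a)), set(tokenize(span_b))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def attach_direction(prev_span, current_span, next_span):
    """Toy decision rule (hypothetical): attach the current span to the neighbour
    it shares more vocabulary with. Ties go to the preceding span."""
    overlap_prev = word_overlap(prev_span, current_span)
    overlap_next = word_overlap(current_span, next_span)
    return "preceding" if overlap_prev >= overlap_next else "following"


if __name__ == "__main__":
    print(word_overlap("the cat sat", "the cat ran"))
    print(attach_direction("The parser failed.", "The parser was slow.", "We went home."))
```

In a learned model this score would be just one column in a feature vector next to the syntactic, referential and surface features the abstract lists.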

Discourse Processing for Text Analysis: Recent Successes, Current Challenges

2019

Computational discourse processing has come a long way in the 10 years since I spoke at ACL’2009 on Discourse: Early problems, current successes, future challenges. Much of this progress can be attributed to the vast amounts of textual data that have become available and to a concomitant weakening of theoretical commitments, so as to be able to use the data in information extraction, sentiment analysis, question answering, etc. Along with weakened commitments to the demands of particular theories, has been a greater willingness to consider what can be learned from textual data and from various forms of annotation, in English and in other languages as well. This paper briefly summarizes (1) changing assumptions about discourse structure; (2) recent work on lexico-syntactic grounding of low-level discourse structure and frameworks for higher-level discourse structure that recognize differences in genre; and (3) suggestions for addressing some of the challenges still facing us. For mor...

DiZer 2.0 - An Adaptable On-line Discourse Parser

2011

This paper presents DiZer 2.0, an adaptable on-line discourse parser. It is an evolution of DiZer, the first version of the system for the Brazilian Portuguese language. It keeps the same analysis method following the Rhetorical Structure Theory, but builds on it by allowing any user to run it on the web and, if necessary, to build their own parser by incorporating discourse knowledge of the desired language and text type/genre. Besides presenting the system's main points, this paper also shows a case study in which the system is adapted for parsing the Spanish language.

X-tractor: A tool for extracting discourse markers

Discourse Markers (DMs) are among the most popular clues for capturing discourse structure for NLP applications. However, they suffer from inconsistency and uneven coverage. In this paper we present X-TRACTOR, a language-independent system for automatically extracting DMs from plain text. Seeking low processing cost and wide applicability, we have tried to remain independent of any handcrafted resources, including annotated corpora or NLP tools. Results of an application to Spanish indicate that this system succeeds in finding new DMs in a corpus and ranking them according to their likelihood as DMs. Moreover, due to its modular architecture, X-TRACTOR reveals the specific contribution of each of a number of parameters to characterising DMs. Therefore, this tool can be used not only for obtaining DM lexicons for heterogeneous purposes, but also for empirically delimiting the concept of DM.

Automatic Discourse Parsing of Arabic Text: Building Discourse Structure of Text

2018

Discourse parsing of Arabic texts is an important task in natural language processing (NLP) and plays a critical role in discourse analysis. It is considered a modern scientific concern. Its importance lies in its ability to determine the semantic and rhetorical meaning between discourse units through a coherent structure. Discourse structure analysis can benefit a variety of NLP applications such as question answering, machine translation, text categorization, etc. The rhetorical analysis is based on three pillars. The first pillar consists of segmenting the text into discourse units. The second pillar is to look for structural links (attachments) between different discourse units. The third pillar connects these units to each other via discourse relations. In this context, our task of automatic discourse parsing of Arabic text falls within the second pillar of rhetorical analysis. This view of rhetorical analysis is based on the Segmented Discourse Representation Th...

A syntactic and lexical-based discourse segmenter

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, 2009

We present a syntactic and lexically based discourse segmenter (SLSeg) that is designed to avoid the common problem of over-segmenting text. Segmentation is the first step in a discourse parser, a system that constructs discourse trees from elementary discourse units. We compare SLSeg to a probabilistic segmenter, showing that a conservative approach increases precision at the expense of recall, while retaining a high F-score across both formal and informal texts.
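The precision/recall trade-off mentioned in this abstract can be made concrete with a small sketch. It assumes, for illustration only, that a segmentation is represented as a set of token offsets where a boundary is placed. The function name `boundary_scores` and the example offsets are hypothetical and are not taken from the SLSeg evaluation.

```python
def boundary_scores(gold, predicted):
    """Return (precision, recall, F1) for predicted segment boundaries.

    Both arguments are iterables of boundary positions (assumed to be token offsets).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {3, 7, 12, 18}                        # made-up gold boundaries
    conservative = {3, 12}                       # few boundaries, all correct
    aggressive = {3, 5, 7, 9, 12, 15, 18, 20}    # many boundaries, half spurious

    print(boundary_scores(gold, conservative))
    print(boundary_scores(gold, aggressive))
```

With these made-up offsets, the conservative output scores perfect precision and lower recall, while the over-segmenting output shows the reverse pattern. This is the kind of contrast the abstract describes, though the figures here say nothing about the actual systems.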