Automatic Discourse Segmentation: an evaluation in French

Automated Discourse Segmentation by Syntactic Information and Cue Phrases

This paper presents an approach to automatically segmenting written English text into Elementary Discourse Units (EDUs) using syntactic information and cue phrases. The system takes documents with syntactic information as input and generates EDUs together with their nucleus/satellite roles. Experiments show that this approach gives promising results compared with existing research in discourse segmentation.
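As a rough illustration of the cue-phrase side of such a system (the syntactic component is omitted), a segmenter can open a new candidate EDU at each occurrence of a discourse cue. The cue inventory and function below are hypothetical, a minimal sketch rather than the paper's implementation:

```python
import re

# Hypothetical cue-phrase inventory; real systems derive cues from
# syntactic parses and curated lexicons.
CUE_PHRASES = ["because", "although", "while", "after", "which"]

def segment_into_edus(sentence):
    """Split a sentence into candidate EDUs at cue-phrase positions."""
    pattern = r"\b(" + "|".join(CUE_PHRASES) + r")\b"
    edus, start = [], 0
    for m in re.finditer(pattern, sentence):
        if m.start() > start:
            edus.append(sentence[start:m.start()].strip())
        start = m.start()
    edus.append(sentence[start:].strip())
    return [e for e in edus if e]

print(segment_into_edus("He left early because he was tired."))
# → ['He left early', 'because he was tired.']
```

A real system would additionally consult the parse tree to decide whether a cue actually introduces a clause, and would assign nucleus/satellite roles to the resulting units.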

Extending Automatic Discourse Segmentation for Texts in Spanish to Catalan

2016

At present, automatic discourse analysis is a relevant research topic in NLP. However, discourse is one of the most difficult phenomena to process. Although discourse parsers have already been developed for several languages, no such tool exists for Catalan. The first step in implementing this kind of parser is to develop a discourse segmenter. In this article we present the first discourse segmenter for texts in Catalan. It builds on a Rhetorical Structure Theory (RST) segmenter for Spanish, using lexical and syntactic information to adapt rules valid for Spanish into rules for Catalan. We have evaluated the system against a gold-standard corpus of manually segmented texts, and the results are promising.

Cross-lingual and cross-domain discourse segmentation of entire documents

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017

Discourse segmentation is a crucial step in building end-to-end discourse parsers. However, discourse segmenters only exist for a few languages and domains. Typically they only detect intra-sentential segment boundaries, assuming gold standard sentence and token segmentation, and relying on high-quality syntactic parses and rich heuristics that are not generally available across languages and domains. In this paper, we propose statistical discourse segmenters for five languages and three domains that do not rely on gold preannotations. We also consider the problem of learning discourse segmenters when no labeled data is available for a language. Our fully supervised system obtains 89.5% F1 for English newswire, with slight drops in performance on other domains, and we report supervised and unsupervised (cross-lingual) results for five languages in total.

A Realistic Success Criterion for Discourse Segmentation

2003

This study introduces an evaluation method for discourse segmentation that is more realistic than the existing one. Discourse segmentation is believed to be a fuzzy task [Pas96]: human subjects may mark different discourse boundaries yet still show high agreement overall. In the existing method, a threshold value is calculated; sentences marked by at least that many subjects are accepted as real boundaries, and all other marks are discarded. Furthermore, an automatically discovered boundary that is misplaced is treated as an outright failure, regardless of its proximity to the human-marked boundaries. The proposed method overcomes these shortcomings: it credits the fuzziness of the human subjects' decisions and tolerates misplacements in the automated discovery. It is tunable from crisp/harsh to fuzzy/tolerant in its handling of both the human decisions and the automated output.
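The tunable crisp-to-fuzzy idea can be sketched as a scoring function that credits a system boundary by the fraction of subjects who marked a nearby position, with the credit decaying over distance. The formula and names below are an illustrative assumption, not the paper's exact metric:

```python
def fuzzy_boundary_score(system, human_votes, n_subjects, tolerance=1):
    """Score system boundaries against graded human judgments.

    human_votes maps a sentence index to the number of subjects who
    marked a boundary there. tolerance=0 reproduces crisp matching;
    larger values credit near-misses with linearly decaying weight.
    (A sketch of the paper's idea, not its exact formula.)
    """
    credit = 0.0
    for b in system:
        # take the best-weighted human mark within the tolerance window
        best = 0.0
        for pos, votes in human_votes.items():
            d = abs(pos - b)
            if d <= tolerance:
                weight = (votes / n_subjects) * (1 - d / (tolerance + 1))
                best = max(best, weight)
        credit += best
    return credit / len(system) if system else 0.0

# 4 of 5 subjects marked sentence 3; the system guessed sentence 4:
# one sentence off, so credit is (4/5) * (1 - 1/2) = 0.4.
print(fuzzy_boundary_score([4], {3: 4}, 5, tolerance=1))
```

Sliding `tolerance` down to 0 and counting only unanimous marks recovers the harsh behaviour of the existing method; raising it makes the evaluation progressively more tolerant.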

A syntactic and lexical-based discourse segmenter

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers on - ACL-IJCNLP '09, 2009

We present a syntactic and lexically based discourse segmenter (SLSeg) that is designed to avoid the common problem of over-segmenting text. Segmentation is the first step in a discourse parser, a system that constructs discourse trees from elementary discourse units. We compare SLSeg to a probabilistic segmenter, showing that a conservative approach increases precision at the expense of recall, while retaining a high F-score across both formal and informal texts.

Agglomerative Sentence Clustering Approach for Discourse Segmentation

International Journal of Engineering Research and Technology (IJERT), 2013

https://www.ijert.org/agglomerative-sentence-clustering-approach-for-discourse-segmentation
https://www.ijert.org/research/agglomerative-sentence-clustering-approach-for-discourse-segmentation-IJERTV2IS120789.pdf

Automatic recognition of discourse that describes a situation or a set of entities is important for natural language tasks such as summarization and information retrieval. A text can be viewed as a collection of discourses, each describing a set of nouns. This paper presents an agglomerative sentence clustering approach to text segmentation based on nouns, considering both cohesion and coherence relationships among sentences. Nouns are the best representatives of a sentence in discourse, and we find that clustering sentences by these nouns yields a better segmentation strategy. The output of the clustering process is further refined with named entities and WordNet for better accuracy.
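A minimal sketch of the clustering idea, assuming nouns have already been extracted per sentence: adjacent sentences are greedily merged when their noun sets overlap sufficiently, and each resulting cluster becomes one segment. The Jaccard measure and threshold are illustrative choices, not the paper's:

```python
def noun_overlap(a, b):
    """Jaccard overlap between two sentences' noun sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def segment_by_noun_clustering(sent_nouns, threshold=0.2):
    """Greedy agglomerative pass over pre-extracted noun sets.

    Merge each sentence into the current cluster when its nouns
    overlap the cluster's accumulated nouns above the threshold;
    otherwise start a new segment. (Toy sketch; the paper further
    refines clusters with named entities and WordNet.)
    """
    segments = [[0]]
    cluster_nouns = set(sent_nouns[0])
    for i in range(1, len(sent_nouns)):
        if noun_overlap(cluster_nouns, sent_nouns[i]) >= threshold:
            segments[-1].append(i)
            cluster_nouns |= sent_nouns[i]
        else:
            segments.append([i])
            cluster_nouns = set(sent_nouns[i])
    return segments

nouns = [{"court", "judge"}, {"judge", "verdict"}, {"match", "goal"}]
print(segment_by_noun_clustering(nouns))  # → [[0, 1], [2]]
```

The first two sentences share the noun "judge" and fuse into one segment; the third introduces disjoint nouns and opens a new one, which is the cohesion signal the abstract describes.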

Clause-based Discourse Segmentation of Arabic Texts

This paper describes a rule-based approach to segmenting Arabic texts into clauses. Our method relies on an extensive analysis of a large set of lexical cues as well as punctuation marks, carried out on two corpus genres: news articles and elementary school textbooks. We propose a three-step segmentation algorithm: first using only punctuation marks, then relying only on lexical cues, and finally combining both typographic and lexical cues. The results were compared with manual segmentations produced by experts.
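The three-step design can be sketched as a segmenter with switchable punctuation and lexical-cue rules: enabling only one switch corresponds to the first two steps, and enabling both to the third. The cue list below is illustrative, not the paper's corpus-derived inventory:

```python
import re

# Illustrative rules; the paper's cue set comes from corpus analysis.
PUNCTUATION = r"[.!?;:،؟]"           # includes Arabic comma and question mark
LEXICAL_CUES = ["لأن", "لكن", "ثم"]   # e.g. "because", "but", "then"

def segment(text, use_punct=True, use_cues=True):
    """Split text into clauses at punctuation and/or cue positions."""
    boundaries = set()
    if use_punct:
        boundaries |= {m.end() for m in re.finditer(PUNCTUATION, text)}
    if use_cues:
        boundaries |= {m.start() for cue in LEXICAL_CUES
                       for m in re.finditer(re.escape(cue), text)}
    clauses, start = [], 0
    for b in sorted(boundaries):
        if b > start:
            clauses.append(text[start:b].strip())
            start = b
    if start < len(text):
        clauses.append(text[start:].strip())
    return [c for c in clauses if c]

# "He went, but returned." splits after the comma and before "but":
print(segment("ذهب، لكن عاد."))
```

Each run configuration yields one variant of the algorithm, which is how the paper can compare the three steps against the expert segmentations.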

Multi-lingual and Cross-genre Discourse Unit Segmentation

Proceedings of the Workshop on Discourse Relation Parsing and Treebanking, 2019

We describe a series of experiments on data sets from different languages and genres, annotated for coherence relations according to different theoretical frameworks. Specifically, we investigate the feasibility of a unified (theory-neutral) approach to discourse segmentation, the process that divides a text into the minimal discourse units involved in some coherence relation. We apply a Random Forest and an LSTM-based approach to all data sets, and we improve over a simple baseline that assumes sentence- or clause-like segmentation. Performance, however, varies considerably depending on language and, more importantly, genre, with F-scores ranging from 73.00 to 94.47.

Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool

PLOS ONE

Lately, discourse structure has received considerable attention due to the benefits its application offers in several NLP tasks, such as opinion mining, summarization, question answering, and text simplification. When automatically analyzing texts, discourse parsers typically perform two tasks: i) identifying basic discourse units (text segmentation); ii) linking discourse units by means of discourse relations, building structures such as trees or graphs. The resulting discourse structures are, in general terms, accurate for intra-sentence discourse relations; however, they fail to capture the correct inter-sentence relations. Detecting the main discourse unit (the Central Unit) helps discourse analyzers (and also manual annotation) improve their rhetorical labeling. Bearing this in mind, we set out to build the first two steps of a discourse parser following a top-down strategy: i) finding discourse units; ii) detecting the Central Unit. The final step, assigning rhetorical relations, remains to be addressed in the immediate future. In accordance with this strategy, this paper presents a tool consisting of a discourse segmenter and an automatic Central Unit detector.