Automated Discourse Segmentation by Syntactic Information and Cue Phrases
A syntactic and lexical-based discourse segmenter
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (ACL-IJCNLP '09), 2009
We present a syntactic and lexically based discourse segmenter (SLSeg) that is designed to avoid the common problem of over-segmenting text. Segmentation is the first step in a discourse parser, a system that constructs discourse trees from elementary discourse units. We compare SLSeg to a probabilistic segmenter, showing that a conservative approach increases precision at the expense of recall, while retaining a high F-score across both formal and informal texts.
Automatic Discourse Segmentation: an evaluation in French
ArXiv, 2020
In this article, we describe several discourse segmentation methods as well as a preliminary evaluation of segmentation quality. Although our experiments were carried out on documents in French, we have developed three discourse segmentation models based solely on resources that are available in several languages: marker lists and statistical POS tagging. We have also carried out automatic evaluations of these systems against the Annodis corpus, which is a manually annotated reference. The results obtained are very encouraging.
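As a rough illustration of what a marker-list segmenter might look like, the sketch below opens a new segment whenever a token appears in a list of discourse markers. It is not the models evaluated in the article: the function name `segment_by_markers`, the small `MARKERS` set, and the whitespace tokenization are all assumptions made for the example, and POS information is left out entirely.

```python
# Minimal sketch of marker-based segmentation (illustrative only).
# MARKERS is a tiny hypothetical list of single-word French discourse markers;
# a real marker list would be much larger and include multi-word expressions.
MARKERS = {"mais", "car", "donc", "puisque", "cependant"}


def segment_by_markers(tokens, markers=MARKERS):
    """Split a token list into segments, starting a new segment at each marker."""
    segments = []
    current = []
    for tok in tokens:
        # A marker opens a new segment, unless the current segment is still empty
        # (for example, when the text itself begins with a marker).
        if tok.lower() in markers and current:
            segments.append(current)
            current = []
        current.append(tok)
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    sentence = "Il pleut donc je reste car il fait froid".split()
    for segment in segment_by_markers(sentence):
        print(" ".join(segment))
```

A purely lexical rule like this tends to over-segment, since many markers also have non-discourse uses; that is presumably one reason to combine marker lists with POS information.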
Thai Elementary Discourse Unit Segmentation by Discourse Segmentation Cues and Syntactic Information
2005
Elementary discourse unit (EDU) segmentation is an important process, since it separates full text into minimal discourse units that serve as input to many applications, such as text summarization and discourse parsing. In this paper, we present a hybrid approach to Thai EDU segmentation that combines decision-tree learning and rules. An important problem in this process is EDU boundary ambiguity, because Thai does not have punctuation marks or special symbols to signal EDU boundaries, and an embedded EDU usually occurs in the middle of another EDU. The precision and recall of the system are 0.80 and 0.81, respectively.
A Realistic Success Criterion for Discourse Segmentation
2003
In this study, we introduce an evaluation method for discourse segmentation that is more realistic than the existing one. It is believed that discourse segmentation is a fuzzy task [Pas96]. Human subjects may agree on different discourse boundaries, with high agreement among them. In the existing method, a threshold value is calculated; sentences marked by that many subjects are taken as real boundaries, and other marks are not considered. Furthermore, automatically discovered boundaries that are misplaced are treated as strict failures, disregarding their proximity to the boundaries found by humans. The proposed method overcomes these shortcomings: it credits the fuzziness of the human subjects' decisions and tolerates misplacements in the automated discovery. The proposed method is tunable from crisp/harsh to fuzzy/tolerant, both in how human decisions are handled and in how automatically discovered boundaries are handled.
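To make the idea of tolerating misplaced boundaries concrete, here is a minimal sketch of a window-based scoring scheme. It is not the method proposed in the study, whose details are not given in the abstract. The function name `tolerant_prf`, the `tolerance` parameter, and the greedy one-to-one matching are assumptions for illustration, and the sketch does not model the fuzziness of human judgments.

```python
# Illustrative sketch: precision/recall/F1 where a predicted boundary counts as
# correct if it lies within `tolerance` positions of an unmatched reference boundary.
# tolerance=0 reproduces strict (crisp) scoring; larger values are more lenient.
def tolerant_prf(predicted, reference, tolerance=1):
    """Return (precision, recall, f1) under window-tolerant boundary matching."""
    pred = sorted(set(predicted))
    unused = sorted(set(reference))
    n_ref = len(unused)
    matched = 0
    for p in pred:
        # Reference boundaries still available and close enough to p.
        candidates = [r for r in unused if abs(r - p) <= tolerance]
        if candidates:
            nearest = min(candidates, key=lambda r: abs(r - p))
            unused.remove(nearest)  # each reference boundary is matched at most once
            matched += 1
    precision = matched / len(pred) if pred else 0.0
    recall = matched / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = [3, 8, 15]   # sentence indices proposed by a segmenter
    reference = [4, 8, 20]   # sentence indices marked by annotators
    print(tolerant_prf(predicted, reference, tolerance=0))
    print(tolerant_prf(predicted, reference, tolerance=1))
```

With `tolerance=0`, only the boundary at position 8 counts as correct; with `tolerance=1`, the prediction at 3 is also credited against the reference boundary at 4.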
Extending Automatic Discourse Segmentation for Texts in Spanish to Catalan
2016
At present, automatic discourse analysis is a relevant research topic in the field of NLP. However, discourse is one of the most difficult phenomena to process. Although discourse parsers have already been developed for several languages, no such tool exists for Catalan. In order to implement this kind of parser, the first step is to develop a discourse segmenter. In this article we present the first discourse segmenter for texts in Catalan. This segmenter is based on Rhetorical Structure Theory (RST) for Spanish, and uses lexical and syntactic information to translate rules valid for Spanish into rules for Catalan. We have evaluated the system using a gold standard corpus of manually segmented texts, and the results are promising.
Cross-lingual and cross-domain discourse segmentation of entire documents
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017
Discourse segmentation is a crucial step in building end-to-end discourse parsers. However, discourse segmenters only exist for a few languages and domains. Typically they only detect intra-sentential segment boundaries, assuming gold standard sentence and token segmentation, and relying on high-quality syntactic parses and rich heuristics that are not generally available across languages and domains. In this paper, we propose statistical discourse segmenters for five languages and three domains that do not rely on gold preannotations. We also consider the problem of learning discourse segmenters when no labeled data is available for a language. Our fully supervised system obtains 89.5% F1 for English newswire, with slight drops in performance on other domains, and we report supervised and unsupervised (cross-lingual) results for five languages in total.
Features for automatic discourse analysis of paragraphs
2008
In this paper, we investigate which information is useful for the detection of rhetorical (RST) relations between (Multi-)Sentential Discourse Units ((M-)SDUs), that is, text spans consisting of one or more sentences, within the same paragraph. In order to do so, we simplified the task of discourse parsing to a decision problem in which we decided whether an (M-)SDU is rhetorically related to either a preceding or a following (M-)SDU. Employing the RST Treebank, we offered this choice to machine learning algorithms together with syntactic, lexical, referential, discourse and surface features. Next, the features were ranked on the basis of (1) models established by the classification algorithms and (2) feature selection metrics. Highly ranked features that predict the presence of a rhetorical relation are syntactic similarity, word overlap, word similarity, continuous punctuation and many reference features. Other features are used to introduce new topics or arguments: time references, proper nouns, definite articles and the word "further".
Multi-lingual and Cross-genre Discourse Unit Segmentation
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019
We describe a series of experiments applied to data sets from different languages and genres annotated for coherence relations according to different theoretical frameworks. Specifically, we investigate the feasibility of a unified (theory-neutral) approach toward discourse segmentation, a process that divides a text into minimal discourse units that are involved in some coherence relation. We apply a RandomForest and an LSTM-based approach to all data sets, and we improve over a simple baseline that assumes sentence or clause-like segmentation. Performance, however, varies considerably depending on language and, more importantly, genre, with F-scores ranging from 73.00 to 94.47.
The automatic identification of discourse units in Dutch text
2013
The identification of discourse units is an essential step in discourse parsing, the automatic construction of a discourse structure from a text. We present a rule-based algorithm to identify elementary discourse units (EDUs) in Dutch written text. Contrary to approaches that focus on the determination of segment boundaries, we identify complete discourse units, which is especially helpful for the recognition of interrupted EDUs that contain embedded discourse units. We use syntactic and lexical information to decompose sentences into EDUs. Experimental results show that our algorithm for EDU identification performs well on texts of various genres.