Automated Discourse Segmentation by Syntactic Information and Cue Phrases
Related papers
A syntactic and lexical-based discourse segmenter
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers on - ACL-IJCNLP '09, 2009
We present a syntactic and lexically based discourse segmenter (SLSeg) that is designed to avoid the common problem of over-segmenting text. Segmentation is the first step in a discourse parser, a system that constructs discourse trees from elementary discourse units. We compare SLSeg to a probabilistic segmenter, showing that a conservative approach increases precision at the expense of recall, while retaining a high F-score across both formal and informal texts.
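The precision/recall trade-off mentioned in this abstract can be illustrated with a small sketch. It assumes boundaries are represented as sets of token positions and scored by exact match; the function name `boundary_prf` and the toy boundary sets are hypothetical and are not taken from the SLSeg paper.

```python
def boundary_prf(predicted, gold):
    """Return (precision, recall, F1) over sets of boundary positions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): token positions where a segment boundary falls.
gold = {3, 7, 12, 18}
conservative = {3, 12}                      # few boundaries, all correct
aggressive = {3, 5, 7, 9, 12, 15, 18, 20}   # every gold boundary plus four spurious ones

print(boundary_prf(conservative, gold))  # high precision, lower recall
print(boundary_prf(aggressive, gold))    # high recall, lower precision
```

In this toy case the conservative segmenter trades recall for precision, while the over-segmenting one does the opposite.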
Automatic Discourse Segmentation: an evaluation in French
ArXiv, 2020
In this article, we describe some discourse segmentation methods as well as a preliminary evaluation of the segmentation quality. Although our experiments were carried out on documents in French, we have developed three discourse segmentation models based solely on resources available simultaneously in several languages: marker lists and statistical POS tagging. We have also carried out automatic evaluations of these systems against the Annodis corpus, which is a manually annotated reference. The results obtained are very encouraging.
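A marker-list segmenter of the kind this abstract alludes to might look roughly like the sketch below. It is not the authors' system: the English marker list, the function name `segment_by_markers`, and the rule "start a new segment before every marker" are all assumptions made for illustration.

```python
import re

# Hypothetical marker list; a real system would load a curated, language-specific one.
MARKERS = ["because", "although", "but", "however", "so that"]


def segment_by_markers(sentence, markers=MARKERS):
    """Split a sentence in front of each discourse marker (naive rule)."""
    # Longest markers first so multiword markers win over shorter ones.
    ordered = sorted(markers, key=len, reverse=True)
    alternatives = "|".join(re.escape(m) for m in ordered)
    # Split on whitespace that is immediately followed by a whole-word marker.
    pattern = re.compile(r"\s+(?=(?:%s)\b)" % alternatives, flags=re.IGNORECASE)
    pieces = (piece.strip(" ,") for piece in pattern.split(sentence))
    return [piece for piece in pieces if piece]


print(segment_by_markers("We stayed inside because it rained, but the game went on."))
```

A rule this naive over-segments whenever a marker is used non-discursively, which is one reason such lists are usually combined with POS information.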
Agglomerative Sentence Clustering Approach for Discourse Segmentation
International Journal of Engineering Research and Technology (IJERT), 2013
https://www.ijert.org/agglomerative-sentence-clustering-approach-for-discourse-segmentation https://www.ijert.org/research/agglomerative-sentence-clustering-approach-for-discourse-segmentation-IJERTV2IS120789.pdf Automatic recognition of discourse that describes a situation or a set of entities is important for natural language tasks such as summarization and information retrieval. A text can be viewed as a collection of discourses that describe a set of nouns. This paper presents an agglomerative sentence clustering approach for text segmentation based on nouns. This method considers both cohesion and coherence relationships among sentences. The nouns are the best representatives of the sentence in discourse. We find that clustering the sentences by considering these nouns gives a better segmentation strategy. The output of the clustering process is further refined with named entities and WordNet for better accuracy.
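The general idea of agglomerative, noun-based sentence clustering can be sketched as follows. This is an illustration under several assumptions, not the paper's algorithm: nouns are taken as already extracted, only adjacent clusters are merged, similarity is Jaccard overlap, and the stopping threshold of 0.2 is arbitrary. The named-entity and WordNet refinement is omitted.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_adjacent(noun_sets, threshold=0.2):
    """Greedily merge the most similar pair of adjacent clusters.

    noun_sets: one set of nouns per sentence, in document order.
    Returns a list of segments, each a list of sentence indices.
    """
    clusters = [([i], set(nouns)) for i, nouns in enumerate(noun_sets)]
    while len(clusters) > 1:
        scores = [
            jaccard(clusters[i][1], clusters[i + 1][1])
            for i in range(len(clusters) - 1)
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break  # no adjacent pair is similar enough to merge
        left, right = clusters[best], clusters[best + 1]
        clusters[best:best + 2] = [(left[0] + right[0], left[1] | right[1])]
    return [indices for indices, _ in clusters]


# Hypothetical pre-extracted nouns for four sentences.
noun_sets = [
    {"bank", "loan"},
    {"loan", "interest"},
    {"river", "boat"},
    {"boat", "fishing"},
]
print(cluster_adjacent(noun_sets))
```

Restricting merges to adjacent clusters keeps each segment contiguous, which suits segmentation; a general clustering that allowed any pair to merge would not guarantee that.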
Thai Elementary Discourse Unit Segmentation by Discourse Segmentation Cues and Syntactic Information
2005
Elementary discourse unit (EDU) segmentation is an important process, since it separates full text into minimal discourse units that are used as input to many applications such as text summarization and discourse parsing. In this paper, we present a hybrid approach to Thai EDU segmentation that uses decision-tree learning and rules. In addition, the main problem in this process is EDU boundary ambiguity, because Thai has no punctuation marks or special symbols to signal EDU boundaries and embedded EDUs usually occur in the middle of another EDU. The precision and recall of the system are 0.80 and 0.81, respectively.
A Realistic Success Criterion for Discourse Segmentation
2003
In this study, a more realistic evaluation method for discourse segmentation than the existing one is introduced. Discourse segmentation is believed to be a fuzzy task [Pas96]. Human subjects may agree on different discourse boundaries, with high agreement among them. In the existing method, a threshold value is calculated; sentences marked by at least that many subjects are taken as real boundaries, and the other marks are not considered. Furthermore, automatically discovered boundaries that are misplaced are treated as strict failures, disregarding their proximity to the boundaries found by humans. The proposed method overcomes these shortcomings: it credits the fuzziness of the human subjects' decisions and tolerates misplacements by the automated discovery. It is tunable from crisp/harsh to fuzzy/tolerant in its handling of both human decisions and automated discovery.
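One way to picture a tolerant, vote-weighted criterion is the sketch below. It is only an illustration and does not reproduce the formula proposed in the paper: the linear distance decay, the `tolerance` parameter, and the function name `soft_precision` are assumptions.

```python
def soft_precision(predicted, gold_votes, tolerance=1):
    """Average credit earned by predicted boundaries.

    predicted:  iterable of predicted boundary positions (sentence indices).
    gold_votes: dict mapping a position to the fraction of annotators who
                marked it (the "fuzzy" human decision).
    tolerance:  maximum distance at which a misplaced boundary still earns
                partial credit; 0 gives the crisp, exact-match behaviour.
    """
    predicted = list(predicted)
    if not predicted:
        return 0.0
    total = 0.0
    for p in predicted:
        best = 0.0
        for g, vote in gold_votes.items():
            distance = abs(p - g)
            if distance <= tolerance:
                # Credit shrinks linearly with distance and scales with the vote.
                best = max(best, vote * (1 - distance / (tolerance + 1)))
        total += best
    return total / len(predicted)


# Hypothetical data: all annotators marked position 4, half marked position 9.
gold_votes = {4: 1.0, 9: 0.5}
predicted = [4, 10]
print(soft_precision(predicted, gold_votes, tolerance=1))  # near miss earns partial credit
print(soft_precision(predicted, gold_votes, tolerance=0))  # near miss earns nothing
```

Raising `tolerance` moves the score from the crisp/harsh end toward the fuzzy/tolerant end. Note that this sketch does not enforce one-to-one matching between predicted and human boundaries.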
Extending Automatic Discourse Segmentation for Texts in Spanish to Catalan
2016
At present, automatic discourse analysis is a relevant research topic in the field of NLP. However, discourse is one of the most difficult phenomena to process. Although discourse parsers have already been developed for several languages, no such tool exists for Catalan. In order to implement this kind of parser, the first step is to develop a discourse segmenter. In this article we present the first discourse segmenter for texts in Catalan. This segmenter is based on Rhetorical Structure Theory (RST) for Spanish, and uses lexical and syntactic information to translate rules valid for Spanish into rules for Catalan. We have evaluated the system using a gold standard corpus of manually segmented texts, and the results are promising.
Cross-lingual and cross-domain discourse segmentation of entire documents
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017
Discourse segmentation is a crucial step in building end-to-end discourse parsers. However, discourse segmenters only exist for a few languages and domains. Typically they only detect intra-sentential segment boundaries, assuming gold standard sentence and token segmentation, and relying on high-quality syntactic parses and rich heuristics that are not generally available across languages and domains. In this paper, we propose statistical discourse segmenters for five languages and three domains that do not rely on gold preannotations. We also consider the problem of learning discourse segmenters when no labeled data is available for a language. Our fully supervised system obtains 89.5% F1 for English newswire, with slight drops in performance on other domains, and we report supervised and unsupervised (cross-lingual) results for five languages in total.
Features for automatic discourse analysis of paragraphs
2008
In this paper, we investigate which information is useful for the detection of rhetorical (RST) relations between (Multi-)Sentential Discourse Units ((M-)SDUs), text spans consisting of one or more sentences, within the same paragraph. In order to do so, we simplified the task of discourse parsing to a decision problem in which we decided whether an (M-)SDU is rhetorically related to either a preceding or a following (M-)SDU. Employing the RST Treebank, we offered this choice to machine learning algorithms together with syntactic, lexical, referential, discourse and surface features. Next, the features were ranked on the basis of (1) models established by the classification algorithms and (2) feature selection metrics. Highly ranked features that predict the presence of a rhetorical relation are syntactic similarity, word overlap, word similarity, continuous punctuation and many reference features. Other features are used to introduce new topics or arguments: time references, proper nouns, definite articles and the word "further".
Multi-lingual and Cross-genre Discourse Unit Segmentation
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019
We describe a series of experiments applied to data sets from different languages and genres annotated for coherence relations according to different theoretical frameworks. Specifically, we investigate the feasibility of a unified (theory-neutral) approach toward discourse segmentation, a process that divides a text into minimal discourse units that are involved in some coherence relation. We apply a RandomForest and an LSTM-based approach to all data sets, and we improve over a simple baseline assuming simple sentence or clause-like segmentation. Performance, however, varies considerably depending on language and, more importantly, genre, with F-scores ranging from 73.00 to 94.47.