Language-independent Techniques for Automated Text Summarization
Related papers
A new approach to improving multilingual summarization using a genetic algorithm
Proceedings of the 48th Annual Meeting …, 2010
Automated summarization methods can be defined as "language-independent" if they are not based on any language-specific knowledge. Such methods can be used for multilingual summarization, defined by Mani as "processing several languages, with summary in the same language as input." In this paper, we introduce MUSE, a language-independent approach for extractive summarization based on the linear optimization of several sentence ranking measures using a genetic algorithm. We tested our methodology on two languages, English and Hebrew, and evaluated its performance with ROUGE-1 Recall vs. state-of-the-art extractive summarization approaches. Our results show that MUSE performs better than the best known multilingual approach (TextRank) in both languages. Moreover, our experimental results on a bilingual (English and Hebrew) document collection suggest that MUSE does not need to be retrained on each language and that the same model can be used across at least two different languages.
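The core idea of this approach is easy to illustrate: rank sentences by a weighted linear combination of several scoring measures, and let a genetic algorithm search for the weights that maximize ROUGE-1 Recall against a reference summary. Below is a minimal, self-contained sketch in that spirit; the three toy scoring measures, the miniature corpus, and all GA settings (population size, truncation selection, one-point crossover, mutation rate) are illustrative assumptions, not the measures or parameters used in MUSE.

```python
import random
import re

def position_score(i, n):
    # Statistical cue: earlier sentences tend to matter more in news text.
    return 1.0 - i / n

def length_score(sent):
    # Mild preference for longer, presumably more informative sentences.
    return len(sent.split()) / 25.0

def keyword_score(sent, keywords):
    # Overlap of the sentence with a set of document keywords.
    words = set(re.findall(r"[a-z]+", sent.lower()))
    return len(words & keywords) / max(len(keywords), 1)

def rank_and_extract(sents, weights, keywords, k=2):
    # Linear combination of the measures; keep the top-k sentences
    # and restore document order.
    n = len(sents)
    scored = [(weights[0] * position_score(i, n)
               + weights[1] * length_score(s)
               + weights[2] * keyword_score(s, keywords), i)
              for i, s in enumerate(sents)]
    top = sorted(sorted(scored, reverse=True)[:k], key=lambda t: t[1])
    return " ".join(sents[i] for _, i in top)

def rouge1_recall(candidate, reference):
    # Fraction of reference unigram tokens covered by the candidate.
    cand = set(re.findall(r"[a-z]+", candidate.lower()))
    ref = re.findall(r"[a-z]+", reference.lower())
    return sum(w in cand for w in ref) / len(ref)

def genetic_search(sents, reference, keywords, pop_size=20, generations=50):
    def fitness(w):
        return rouge1_recall(rank_and_extract(sents, w, keywords), reference)
    pop = [[random.random() for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, 3)        # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:           # mutate one weight
                child[random.randrange(3)] = random.random()
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

sents = ["The committee approved the new budget on Monday.",
         "Reporters asked about the timeline.",
         "The budget allocates funds for schools and roads.",
         "A follow-up meeting is planned."]
reference = "The committee approved a budget funding schools and roads."
keywords = {"budget", "committee", "schools", "roads"}

best = genetic_search(sents, reference, keywords)
print("learned weights:", [round(w, 2) for w in best])
print(rank_and_extract(sents, best, keywords))
```

On real data the fitness would be averaged over a training corpus of document/summary pairs; that is what lets the learned weights generalize to unseen documents and, per the abstract, transfer across languages.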
2010
The trend toward the growing multilinguality of the Internet requires text summarization techniques that work equally well in multiple languages. Only some of the automated summarization methods proposed in the literature, however, can be defined as "language-independent", as they are not based on any morphological analysis of the summarized text. In this paper, we perform an in-depth comparative analysis of language-independent sentence scoring methods for extractive single-document summarization. We evaluate 15 published summarization methods proposed in the literature and 16 methods introduced in (Litvak et al., 2010). The evaluation is performed on English and Hebrew corpora. The results suggest that the performance ranking of the compared methods is quite similar in both languages. The top ten bilingual scoring methods include six methods introduced in (Litvak et al., 2010).
This document overviews the strategy, effort and aftermath of the MultiLing 2013 multilingual summarization data collection. We describe how the Data Contributors of MultiLing collected and generated a multilingual multi-document summarization corpus in 10 different languages: Arabic, Chinese, Czech, English, French, Greek, Hebrew, Hindi, Romanian and Spanish. We discuss the rationale behind the main decisions of the collection, the methodology used to generate the multilingual corpus, as well as the challenges and problems faced per language. This paper overviews the work on the Czech, Hebrew and Spanish languages.
A multi-document summarization system based on statistics and linguistic treatment
Expert Systems with Applications, 2014
The massive quantity of data available on the Internet today has reached such a volume that it has become humanly infeasible to efficiently sieve useful information from it. One solution to this problem is offered by text summarization techniques. Text summarization, the process of automatically creating a shorter version of one or more text documents, is an important way of finding relevant information in large text libraries or on the Internet. This paper presents a multi-document summarization system that concisely extracts the main aspects of a set of documents while trying to avoid the typical problems of this type of summarization: information redundancy and diversity. This purpose is achieved through a new sentence clustering algorithm based on a graph model that makes use of statistical similarities and linguistic treatment. The DUC 2002 dataset was used to assess the performance of the proposed system, which surpassed the DUC competitors by a 50% margin of F-measure in the best case.
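As a rough illustration of the clustering idea, the sketch below builds a sentence graph in which an edge connects any pair of sentences whose word-overlap similarity crosses a threshold, treats each connected component as a cluster of near-duplicates, and emits one representative sentence per cluster. The Jaccard similarity, the 0.3 threshold, and the longest-sentence representative rule are illustrative assumptions; the paper's algorithm combines statistical similarities with linguistic treatment.

```python
import re
from itertools import combinations

def tokens(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower()))

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def cluster_sentences(sents, threshold=0.3):
    # Union-find over sentence indices: any two sentences joined by an
    # edge (similarity >= threshold) end up in the same component.
    parent = list(range(len(sents)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j in combinations(range(len(sents)), 2):
        if jaccard(sents[i], sents[j]) >= threshold:
            parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(sents)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def summarize(sents, threshold=0.3):
    # One representative per cluster: the longest sentence, as a crude
    # proxy for the most informative one; restore document order.
    picks = [max(c, key=lambda i: len(sents[i].split()))
             for c in cluster_sentences(sents, threshold)]
    return [sents[i] for i in sorted(picks)]

docs = ["The storm hit the coast on Friday.",
        "A powerful storm struck the coast Friday morning.",
        "Thousands were left without electricity.",
        "Power outages affected thousands of residents."]
print(summarize(docs))  # the two storm sentences collapse into one
```

Choosing one representative per component directly addresses redundancy, while keeping every component in the output preserves diversity across the document set.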
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '99, 1999
Human-quality text summarization systems are difficult to design, and even more difficult to evaluate, in part because documents can differ along several dimensions, such as length, writing style and lexical usage. Nevertheless, certain cues can often help suggest the selection of sentences for inclusion in a summary. This paper presents our analysis of news-article summaries generated by sentence selection. Sentences are ranked for potential inclusion in the summary using a weighted combination of statistical and linguistic features. The statistical features were adapted from standard IR methods. The potential linguistic ones were derived from an analysis of news-wire summaries. To evaluate these features we use a normalized version of precision-recall curves, with a baseline of random sentence selection, and analyze the properties of such a baseline. We illustrate our discussion with empirical results showing the importance of corpus-dependent baseline summarization standards, compression ratios and carefully crafted long queries.
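A compressed sketch of that ranking scheme follows: each sentence receives a score that is a weighted combination of one statistical feature (here, the average TF-IDF weight of its terms, with document frequency counted over the sentences themselves) and one linguistic cue feature (presence of cue phrases). The cue-phrase list, the feature choices, and the weights are illustrative assumptions rather than the features derived in the paper's analysis of news-wire summaries.

```python
import math
import re

CUE_PHRASES = ("in conclusion", "the most important", "according to")

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def tfidf_feature(sentence, doc_sents):
    # Statistical feature adapted from IR-style weighting: average
    # TF-IDF of the sentence's terms. Sentences stand in for documents
    # when counting document frequency in this toy setting.
    terms = tokenize(sentence)
    if not terms:
        return 0.0
    n = len(doc_sents)
    def idf(t):
        df = sum(t in tokenize(s) for s in doc_sents)
        return math.log((n + 1) / (df + 1))
    tf = {t: terms.count(t) for t in set(terms)}
    return sum(tf[t] * idf(t) for t in tf) / len(terms)

def cue_feature(sentence):
    # Linguistic cue: 1.0 if the sentence contains a cue phrase.
    s = sentence.lower()
    return float(any(p in s for p in CUE_PHRASES))

def rank(doc_sents, w_stat=0.7, w_ling=0.3):
    # Weighted combination of the two features, highest score first.
    scored = [(w_stat * tfidf_feature(s, doc_sents)
               + w_ling * cue_feature(s), i)
              for i, s in enumerate(doc_sents)]
    return [doc_sents[i] for _, i in sorted(scored, reverse=True)]

doc = ["The central bank raised interest rates again.",
       "Analysts were divided on the decision.",
       "In conclusion, the most important effect is on mortgage costs."]
for s in rank(doc):
    print(s)
```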
The Text Analysis Conference MultiLing Pilot of 2011 posed a multilingual summarization task to the summarization community, aiming to quantify and measure the performance of multilingual, multi-document summarization systems. The task was to create a 240-250 word summary from 10 news texts describing a given topic. The texts of each topic were provided in seven languages (Arabic, Czech, English, French, Greek, Hebrew, Hindi) and each participant generated summaries for at least 2 languages. The evaluation of the summaries was performed using automatic (AutoSummENG, ROUGE) and manual processes (Overall Responsiveness score). Eight systems participated, some of which provided summaries across all languages. This paper provides a brief description of the collection of the data, the evaluation methodology, the problems and challenges faced, and an overview of participation and corresponding results.
2002
We describe our work on the development of Language and Evaluation Resources for the evaluation of summaries in English and Chinese. The language resources include a parallel corpus of English and Chinese texts which are translations of each other, a set of queries in both languages, clusters of documents relevant to each query, sentence relevance measures for each sentence in the document clusters, and manual multi-document summaries at different compression rates. The evaluation resources consist of metrics for measuring the content of automatic summaries against reference summaries. The framework can be used in the evaluation of extractive, non-extractive, single-document and multi-document summarization. We focus on the resources developed that are made available to the research community.
EASY-M: Evaluation System for Multilingual Summarizers
Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources associated with RANLP 2019, 2019
Automatic text summarization aims at producing a shorter version of a document (or a document set). Evaluation of summarization quality is a challenging task. Because human evaluations are expensive and evaluators often disagree with each other, many researchers prefer to evaluate their systems automatically, with the help of software tools. Such a tool usually requires a point of reference in the form of one or more human-written summaries for each text in the corpus. A system-generated summary is then compared to one or more human-written summaries according to selected measures (also called metrics). However, a single metric cannot reflect all quality-related aspects of a summary. In this paper we present the EvAluation SYstem for Multilingual Summarization (EASY-M), which enables the evaluation of system-generated summaries in 23 languages with several quality measures, based on comparison with their human-generated counterparts. The system also provides comparative results with two built-in baselines. The EASY-M system is freely available to the NLP community.
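For concreteness, here is a minimal sketch of the kind of reference-based measure such a system reports: ROUGE-1 recall, precision, and F1 of a system summary against one or more human-written summaries, taking the best score over the references. Real ROUGE implementations add stemming, stopword handling, and other options omitted here; the best-match policy over multiple references is also an assumption, not necessarily what EASY-M does.

```python
import re
from collections import Counter

def unigrams(text):
    # Multiset of lowercased word tokens.
    return Counter(re.findall(r"\w+", text.lower()))

def rouge1(system, reference):
    # Clipped unigram overlap: Counter intersection takes the minimum
    # count per token, as in standard ROUGE-1 counting.
    sys_c, ref_c = unigrams(system), unigrams(reference)
    overlap = sum((sys_c & ref_c).values())
    recall = overlap / max(sum(ref_c.values()), 1)
    precision = overlap / max(sum(sys_c.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1

def rouge1_multi(system, references):
    # Best (recall, precision, F1) over all human references, by F1.
    return max((rouge1(system, r) for r in references),
               key=lambda t: t[2])

system = "the storm cut power to thousands along the coast"
references = ["A storm knocked out power for thousands on the coast.",
              "Thousands lost electricity after the coastal storm."]
print(rouge1_multi(system, references))
```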