A multi-document summarization system based on statistics and linguistic treatment
Related papers
A graph-based approach towards automatic text summarization
Due to the exponential increase in the number of electronic documents and easy access to information on the Internet, the need for text summarization has become obvious. An ideal summary contains the important parts of the original document, eliminates redundant information, and can be generated from single or multiple documents. Several online text summarizers exist, but they have limited accessibility and generate somewhat incoherent summaries. We have proposed a Graph-based Automatic Summarizer (GAUTOSUMM), which consists of a pre-processing module, control features and a post-processing module. For evaluation, two datasets, Opinosis and DUC 2007, are used, and the generated summaries are evaluated using ROUGE metrics. The results show that GAUTOSUMM outperforms the online text summarizers in eight out of ten topics, both in terms of summary quality and time performance. A user interface has also been built to collect the original text and the desired number of sentences in the summary.
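The abstract does not give GAUTOSUMM's internals, but the generic graph-based approach it belongs to can be sketched: build a graph whose nodes are sentences, weight edges by sentence similarity, and rank nodes with PageRank-style power iteration. The similarity measure, damping factor and iteration count below are illustrative assumptions, not the paper's exact choices.

```python
import math
import re
from collections import Counter

def sentence_similarity(a, b):
    """Cosine similarity over bag-of-words term counts."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    num = sum(va[w] * vb[w] for w in set(va) & set(vb))
    den = (math.sqrt(sum(c * c for c in va.values()))
           * math.sqrt(sum(c * c for c in vb.values())))
    return num / den if den else 0.0

def rank_sentences(sentences, damping=0.85, iters=50):
    """Power-iteration PageRank over the sentence-similarity graph."""
    n = len(sentences)
    sim = [[sentence_similarity(s, t) if i != j else 0.0
            for j, t in enumerate(sentences)] for i, s in enumerate(sentences)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        # Each sentence distributes its score proportionally to edge weights.
        new = []
        for j in range(n):
            rank = sum(scores[i] * sim[i][j] / (sum(sim[i]) or 1.0)
                       for i in range(n))
            new.append((1 - damping) / n + damping * rank)
        scores = new
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat near the door.",
    "Stock prices fell sharply on Monday.",
]
order = rank_sentences(docs)  # indices, most central sentence first
```

The two mutually similar sentences reinforce each other and rank above the off-topic one; a summary then takes the top-ranked sentences up to the requested length.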
Expert Systems with Applications, 2009
The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is the process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization produces a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic. In our study we focus on sentence-based extractive document summarization. We propose a generic document summarization method based on sentence clustering. The proposed approach continues the sentence-clustering-based extractive summarization methods proposed in Alguliev [Alguliev, R. M., Aliguliyev, R. M., & Bagirov, A. M. (2005). Global optimization in the summarization of text documents. Automatic Control and Computer Sciences, 39, 42-47], Aliguliyev [Aliguliyev, R. M. (2006). A novel partitioning-based clustering method and generic document summarization. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2006 Workshops) (WI-IATW'06), 18-22 December (pp. 626-629), Hong Kong, China], Alguliev and Alyguliev [Alguliev, R. M., & Alyguliev, R. M. (2007). Summarization of text-based documents with a determination of latent topical sections and information-rich sentences. Automatic Control and Computer Sciences, 41, 132-140], and Aliguliyev [Aliguliyev, R. M. (2007). Automatic document summarization by sentence extraction. Journal of Computational Technologies, 12, 5-15]. The purpose of the present paper is to show that the summarization result depends not only on the optimized function but also on the similarity measure.
The experimental results on open benchmark datasets from DUC01 and DUC02 show that our proposed approach can improve performance compared to state-of-the-art summarization approaches.
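The sentence-clustering idea above can be illustrated with a minimal sketch: group similar sentences into clusters, then take one representative per cluster so the summary covers each topic once. The greedy single-pass clustering, the similarity threshold, and the representative-selection rule here are simplifying assumptions, not the cited papers' optimization-based formulation.

```python
import math
import re
from collections import Counter

def bow(s):
    """Bag-of-words term counts for a sentence."""
    return Counter(re.findall(r"\w+", s.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cluster_and_summarize(sentences, threshold=0.3):
    """Greedy single-pass clustering: a sentence joins the first cluster
    whose seed it resembles, else it seeds a new cluster. Each cluster
    contributes its longest member (most tokens) as its representative."""
    vecs = [bow(s) for s in sentences]
    clusters = []  # list of (seed_vector, member_indices)
    for i, v in enumerate(vecs):
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [max(members, key=lambda i: sum(vecs[i].values()))
            for _, members in clusters]
```

With two sentences about cats and one about stocks, the cat sentences fall into one cluster and only one of them enters the summary, which is exactly the redundancy reduction clustering buys.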
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '99, 1999
Human-quality text summarization systems are difficult to design, and even more difficult to evaluate, in part because documents can differ along several dimensions, such as length, writing style and lexical usage. Nevertheless, certain cues can often help suggest the selection of sentences for inclusion in a summary. This paper presents our analysis of news-article summaries generated by sentence selection. Sentences are ranked for potential inclusion in the summary using a weighted combination of statistical and linguistic features. The statistical features were adapted from standard IR methods. The potential linguistic ones were derived from an analysis of news-wire summaries. To evaluate these features we use a normalized version of precision-recall curves, with a baseline of random sentence selection, as well as analyze the properties of such a baseline. We illustrate our discussions with empirical results showing the importance of corpus-dependent baseline summarization standards, compression ratios and carefully crafted long queries.
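A weighted combination of statistical and linguistic features, as described above, can be sketched as a linear score per sentence. The specific features (average term weight, sentence position, cue phrases) and the weights are illustrative assumptions standing in for the paper's IR-derived and news-wire-derived features.

```python
def score_sentence(sent, position, total, term_weights,
                   cue_phrases=("in conclusion", "in summary", "significantly"),
                   w_stat=0.6, w_pos=0.3, w_cue=0.1):
    """Linear combination of one statistical feature (mean term weight),
    one positional feature, and one linguistic cue-phrase feature.
    Feature choices and weights are illustrative, not the paper's own."""
    words = [w.strip(".,") for w in sent.lower().split()]
    # Statistical: average corpus-derived weight of the sentence's terms.
    stat = sum(term_weights.get(w, 0.0) for w in words) / max(len(words), 1)
    # Positional: earlier sentences in a news article score higher.
    pos = 1.0 - position / max(total - 1, 1)
    # Linguistic: bonus if the sentence contains a cue phrase.
    cue = 1.0 if any(p in sent.lower() for p in cue_phrases) else 0.0
    return w_stat * stat + w_pos * pos + w_cue * cue
```

Sentences are then sorted by this score and the top fraction is extracted; the paper's point is that the weighting, the compression ratio and the baseline all affect how good that extract looks under precision-recall evaluation.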
A Review of Text Summarization
2015
The excessive use of the Internet and online technologies has caused rapid growth of electronic data. When data is accessed from such a huge repository of e-documents, hundreds or thousands of documents are retrieved. For a user, it is impossible to read all the retrieved documents. Also, these documents contain redundant information. This problem is termed Information Overload. Text summarization addresses this problem by producing a summary of the related documents. Text summarization is one of the typical tasks of text mining and is among the most attractive research areas nowadays. This paper gives a review of text summarization and defines the criteria for summary generation.
A Systematic Survey on Multi-document Text Summarization
International Journal of Advanced Trends in Computer Science and Engineering, 2021
Automatic text summarization is a technique for generating a short and accurate summary of a longer text document. Text summarization can be classified based on the number of input documents (single-document and multi-document summarization) and based on the characteristics of the summary generated (extractive and abstractive summarization). Multi-document summarization is an automatic process of creating a relevant, informative and concise summary from a cluster of related documents. This paper presents a detailed survey of the existing literature on the various approaches to text summarization. A few of the most popular approaches, such as graph-based, cluster-based and deep learning-based summarization techniques, are discussed here along with the evaluation metrics, which can provide insight to future researchers.
AN AUTOMATIC TEXT SUMMARIZATION USING LEXICAL COHESION AND CORRELATION OF SENTENCES
Due to the substantial increase in the amount of information on the Internet, it has become extremely difficult to search for the relevant documents needed by users. To solve this problem, text summarization is used, which produces a summary of documents such that the summary contains the important content of the document. This paper proposes a better approach to text summarization using lexical chaining and correlation of sentences. Lexical chains are created using WordNet. The score of each lexical chain is calculated based on keyword strength, TF-IDF and other features. The concept of lexical chains helps to analyze the document semantically, and the concept of correlation of sentences helps to consider the relation of a sentence with its preceding or succeeding sentence. This improves the quality of the summary generated.
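The lexical-chain idea can be sketched without a full WordNet dependency: words that map to the same concept join one chain, and a chain's score grows with its length and its members' term frequency. The toy synonym table below stands in for WordNet synsets, and the scoring formula is a deliberate simplification of the paper's keyword-strength and TF-IDF features.

```python
# Toy synonym groups standing in for WordNet synsets (an assumption
# for this sketch; the paper uses WordNet itself).
SYNSETS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "engine": "machine", "motor": "machine",
}

def build_chains(words):
    """Group words mapping to the same (toy) concept into a lexical chain."""
    chains = {}
    for w in words:
        concept = SYNSETS.get(w.lower())
        if concept:
            chains.setdefault(concept, []).append(w.lower())
    return chains

def chain_score(chain, tf):
    """Chain strength: chain length weighted by member term frequency,
    a simplification of keyword-strength plus TF-IDF scoring."""
    return len(chain) * sum(tf.get(w, 0) for w in chain)
```

Sentences containing members of the strongest chains are then preferred for extraction, which is how lexical chains inject a semantic signal that raw term counts miss.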
A Review on Text Summarization Techniques
Journal of scientific research, 2020
In recent years, an enormous amount of text data from diversified sources has emerged day by day. This huge amount of data carries essential information and knowledge that needs to be effectively summarized to be useful. Hence, the main contribution of this paper is twofold. We first introduce some concepts related to extractive text summarization and then provide a systematic analysis of various text summarization techniques. In particular, some challenges in extractive summarization of single as well as multiple documents are introduced. Problems focusing on textual assessment and similarity measurement between text documents are addressed. The challenges discussed are generic and applicable to every possible scenario in text summarization. Then, existing state-of-the-art extractive summarization techniques that focus on the identified challenges are discussed.
Corpus-based web document summarization using statistical and linguistic approach
2010
Single-document summarization generates a summary by extracting the representative sentences from the document. In this paper, we present a novel technique for summarization of domain-specific text from a single web document that uses statistical and linguistic analysis of the text in a reference corpus and the web document. The proposed summarizer uses a combinational function of Sentence Weight and Subject Weight to determine the rank of a sentence, where Sentence Weight is a function of the number of terms and the number of words in a sentence and their term frequency in the corpus, and Subject Weight is a function of the same quantities computed over a subject and the corpus. Thirty percent of the ranked sentences are taken as the summary of the web document. We generated three web-document summaries using our technique and compared each of them with summaries developed manually by 16 different human subjects. Results showed that 68 percent of the summaries produced by our approach satisfy the manual summaries.
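The abstract names the inputs to Sentence Weight (number of terms, number of words, corpus term frequency) but the exact formula is elided, so the sketch below uses a hypothetical stand-in: term density scaled by summed corpus frequency. The 30-percent cutoff is from the paper; everything else in the formula is an assumption for illustration.

```python
def sentence_weight(sentence, corpus_tf):
    """Hypothetical form of the paper's Sentence Weight: term density
    times summed corpus term frequency. The paper's exact formula is
    not given in the abstract; this is an illustrative stand-in."""
    words = [w.strip(".,") for w in sentence.lower().split()]
    terms = [w for w in words if w in corpus_tf]
    return (len(terms) / max(len(words), 1)) * sum(corpus_tf[t] for t in terms)

def summarize(sentences, corpus_tf, ratio=0.30):
    """Keep the top 30 percent of sentences by weight, as in the paper,
    and return them in original document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_weight(sentences[i], corpus_tf),
                    reverse=True)
    keep = max(1, round(ratio * len(sentences)))
    return sorted(ranked[:keep])
```

Restoring document order after ranking matters for readability: the extract should read in the sequence the author wrote, not in score order.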
Expert Systems with Applications, 2019
Nowadays an abundant amount of information is available on the Internet, which makes it difficult for users to locate desired information. Automatic methods are needed to efficiently sieve and scavenge useful information from the Internet. Text summarization is identified and accepted as one of the solutions for finding desired content in one or more documents. The objective of the proposed multi-document summarization is to achieve good content coverage with information diversity. The proposed statistical-feature-based model utilizes a fuzzy model to deal with the imprecision and uncertainty of feature weights. Redundancy removal using cosine similarity is presented as an enrichment to the proposed work. The proposed approach is compared with DUC (Document Understanding Conference) participant systems and other summarization systems such as TexLexAn, ItemSum, Yago Summarizer, MSSF and PatSum using the ROUGE measure on the DUC 2004 dataset. The experimental results show that our proposed work achieves a significant performance improvement over the other summarizers.
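The cosine-similarity redundancy removal mentioned above can be sketched directly: walk candidate sentences in rank order and admit one only if it is not too similar to anything already admitted. The 0.7 threshold and bag-of-words vectors are illustrative assumptions, not the paper's parameters.

```python
import math
import re
from collections import Counter

def vec(s):
    """Bag-of-words term-count vector for a sentence."""
    return Counter(re.findall(r"\w+", s.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def drop_redundant(ranked_sentences, threshold=0.7):
    """Admit sentences in rank order, keeping one only if its cosine
    similarity to every already-kept sentence stays below the threshold."""
    kept, kept_vecs = [], []
    for s in ranked_sentences:
        v = vec(s)
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(s)
            kept_vecs.append(v)
    return kept
```

Because the input is already ranked, the filter keeps the higher-scoring member of any near-duplicate pair, which is how the summary gains diversity without sacrificing content coverage.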