The two-stage unsupervised approach to multidocument summarization
Related papers
Integrating Document Clustering and Multidocument Summarization
ACM Transactions on Knowledge Discovery from Data, 2011
Document understanding techniques such as document clustering and multidocument summarization have been receiving much attention recently. Current document clustering methods usually represent the given collection of documents as a document-term matrix and then conduct the clustering process. Although many of these clustering methods can group the documents effectively, it is still hard for people to capture the meaning of the documents since there is no satisfactory interpretation for each document cluster. A straightforward solution is to first cluster the documents and then summarize each document cluster using summarization methods. However, most current summarization methods are based solely on the sentence-term matrix and ignore the context dependence of the sentences. As a result, the generated summaries lack guidance from the document clusters. In this article, we propose a new language model to simultaneously cluster and summarize documents by making use of both the document-term and sentence-term matrices. By utilizing the mutual influence of document clustering and summarization, our method yields (1) a better document clustering method with more meaningful interpretation and (2) an effective document summarization method with guidance from document clustering. Experimental results on various document datasets show the effectiveness of our proposed method and the high interpretability of the generated summaries.
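As a rough illustration of the "straightforward solution" this abstract describes (cluster first, then summarize each cluster), the following minimal sketch may help. It is not the joint language model the article proposes. The single-pass clustering, the similarity threshold, the tiny stop-word list, and all function names are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the article).
STOP_WORDS = {"the", "a", "an", "in", "of", "and"}

def terms(text):
    """Return a bag of lowercase word tokens with stop words removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_documents(docs, threshold=0.3):
    """Single-pass clustering: join the most similar cluster or start a new one.

    The threshold is a hypothetical value picked for this example.
    """
    clusters = []  # each cluster: {"docs": [indices], "centroid": Counter}
    for i, doc in enumerate(docs):
        vec = terms(doc)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["docs"].append(i)
            best["centroid"] += vec  # accumulate term counts of the cluster
        else:
            clusters.append({"docs": [i], "centroid": vec})
    return clusters

def summarize_cluster(docs, cluster):
    """Pick the sentence whose terms are, on average, most frequent in the cluster."""
    sentences = [s.strip() for i in cluster["docs"]
                 for s in re.split(r"(?<=[.!?])\s+", docs[i]) if s.strip()]

    def score(sentence):
        bag = terms(sentence)
        total = sum(bag.values())
        return sum(cluster["centroid"][t] * n for t, n in bag.items()) / total if total else 0.0

    return max(sentences, key=score)

docs = [
    "Heavy rain flooded the city. Rain closed many roads in the city.",
    "The city roads stayed closed after the rain.",
    "The team won the final match. Fans celebrated the match win.",
]
clusters = cluster_documents(docs)
summaries = [summarize_cluster(docs, c) for c in clusters]
```

Here the summarizer only sees term counts within its own cluster, which is the kind of one-way pipeline the abstract contrasts with its joint approach.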
A multi-document summarization system based on statistics and linguistic treatment
Expert Systems with Applications, 2014
The quantity of data available on the Internet today is so large that it has become humanly infeasible to sift useful information from it efficiently. One solution to this problem is offered by text summarization techniques. Text summarization, the process of automatically creating a shorter version of one or more text documents, is an important way of finding relevant information in large text libraries or on the Internet. This paper presents a multi-document summarization system that concisely extracts the main aspects of a set of documents, trying to avoid the typical problems of this type of summarization: information redundancy and diversity. This is achieved through a new sentence clustering algorithm based on a graph model that makes use of statistical similarities and linguistic treatment. The DUC 2002 dataset was used to assess the performance of the proposed system, which surpassed DUC competitors by a 50% F-measure margin in the best case.
Unstructured Text Documents Summarization With Multi-Stage Clustering
IEEE Access
In natural language processing, text summarization is an important application used to extract desired information by reducing large text. Existing studies use keyword-based algorithms for grouping text, which do not capture the documents' actual theme. Our proposed dynamic corpus creation mechanism combines metadata with summarized extracted text. The proposed approach analyzes the mesh of multiple unstructured documents and generates a linked set of multiple weighted nodes by applying multistage clustering. We have generated adjacency graphs to link the clusters of various collections of documents. This approach comprises ten steps: pre-processing, making multiple corpuses, first-stage clustering, creating sub-corpuses, interlinking sub-corpuses, creating a page rank keyword dictionary for each sub-corpus, second-stage clustering, path creation among clusters of sub-corpuses, and text processing by forward and backward propagation for results generation. The outcome of this technique consists of sub-corpuses interlinked through clusters. We have applied our approach to a news dataset; this interlinked corpus processing follows step-by-step clustering to search the most relevant parts of the corpus at lower cost and in less time, with improved content detection. We applied six different metadata processing combinations over multiple text queries to compare results during our experimentation. The comparison results of text satisfaction show that page rank keywords give 38% related text, single-stage clustering gives 46%, two-stage clustering gives 54%, and the proposed technique gives 67% associated text. Furthermore, this approach covers the relevant data across a range from most to least relevant content. It provides a systematic query-relevant corpus processing mechanism, which automatically selects the most relevant sub-corpus through dynamic path selection.
We used the SHAP model to evaluate the proposed technique, and our evaluation results showed that the proposed mechanism improved text processing. Moreover, combining text summarization features showed satisfactory results compared with the summaries generated by general models of abstractive and extractive summarization. INDEX TERMS Cosine similarity, page rank keywords, k-means, word2vec, summarized parallel corpus.
Multi Document Summarization: Approaches and Future Scope
2015
With the rapid growth of the World Wide Web, the amount of available information has grown beyond our imagination. Many techniques have been presented to help users find the desired information in a large data set quickly and accurately. Multi-document summarization is an effective one. The techniques used in summarization are feature based, graph based, cluster based, knowledge based, component based and CST based. All of these methods are outlined in detail. The methods are then compared and future work is discussed.
Experiments in multidocument summarization
Proceedings of the Second International Conference on Human Language Technology Research, 2002
This paper describes a multidocument summarizer built upon research into the detection of new information. The summarizer uses several new strategies to select interesting and informative sentences, including an innovative measure of importance derived from the analysis of a large corpus. The system also computes concept frequencies rather than word frequencies as an additional measure of importance. It merges these strategies with a number of familiar summarization heuristics to rank sentences. The initial version of the summarizer performed successfully in the evaluation reported at the Document Understanding Conference last year, although the system addressed only the content of the summary and not the presentation. We also discuss here the procedures we are developing to improve the presentation and readability of the summaries.
A Survey of Multi-Document Summarization.
International Journal of Engineering Sciences & Research Technology, 2013
This paper describes a survey of multi-document summarization, which aims at extracting information from a set of documents written about the same topic and helps users familiarize themselves with the information content of a large cluster of documents. There are several strategies for selecting interesting and informative sentences from a set of documents. Generic and Topi Summarization are the two main strategies. This paper goes through different approaches in each strategy.
Automatic multi document summarization approaches
Journal of Computer Science, 2012
Problem statement: Text summarization can take different forms, ranging from an indicative summary that identifies the topics of the document to an informative summary that gives a concise description of the original document, providing an idea of what the whole content of the document is about. Approach: Single-document summaries seem to capture both kinds of information well, but this has not been the case for multi-document summaries, where the overall comprehensive quality of the informative summary is often lacking. It is found that most of the existing methods tend to focus on sentence scoring, and less consideration is given to the contextual information content in multiple documents. Results: In this study, a survey of multi-document summarization approaches is presented. We focus on four well-known approaches to multi-document summarization, namely the feature-based method, cluster-based method, graph-based method and knowledge-based method. The general ideas behind these methods are described. Conclusion: Besides the general idea and concept, we discuss the benefits and limitations of these methods. With the aim of enhancing multi-document summarization, specifically of news documents, a novel type of approach is outlined for future development, taking into account the generic components of a news story in order to generate a better summary.
A Systematic Survey on Multi-document Text Summarization
International Journal of Advanced Trends in Computer Science and Engineering, 2021
Automatic text summarization is a technique for generating a short and accurate summary of a longer text document. Text summarization can be classified based on the number of input documents (single-document and multi-document summarization) and based on the characteristics of the summary generated (extractive and abstractive summarization). Multi-document summarization is an automatic process of creating a relevant, informative and concise summary from a cluster of related documents. This paper presents a detailed survey of the existing literature on the various approaches to text summarization. A few of the most popular approaches, such as graph-based, cluster-based and deep learning-based summarization techniques, are discussed here along with the evaluation metrics, which can provide insight for future researchers.
2019
Huge volumes of textual information are produced every single day. In order to organize and understand such large datasets, summarization techniques have become popular in recent years. These techniques aim at finding relevant, concise and non-redundant content in such big data. While network methods have been adopted to model texts in some scenarios, a systematic evaluation of multilayer network models in the multi-document summarization task has been limited to a few studies. Here, we evaluate the performance of a multilayer-based method to select the most relevant sentences in the context of an extractive multi-document summarization (MDS) task. In the adopted model, nodes represent sentences and edges are created based on the number of shared words between sentences. Unlike previous studies in multi-document summarization, we make a distinction between edges linking sentences from different documents (inter-layer) and those connecting sentences from the same document (intra-layer). As a proof of principle, our results reveal that such a discrimination between intra- and inter-layer edges in a multilayered representation is able to improve the quality of the generated summaries. This piece of information could be used to improve current statistical methods and related textual models.
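The multilayer representation described in this abstract (sentences as nodes, edge weights given by the number of shared words, with intra-layer and inter-layer edges kept apart) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the tokenizer does no stop-word removal or lemmatization, and ranking by weighted degree with an `inter_weight` factor is a hypothetical choice for illustration, not the measure or parameter used by the authors.

```python
import re
from itertools import combinations

def tokenize(sentence):
    """Lowercase a sentence and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def build_multilayer_edges(documents):
    """Return (intra, inter) edge dicts keyed by pairs of (doc, sentence) nodes.

    Each document is treated as one layer; an edge's weight is the number
    of words shared by its two sentences.
    """
    nodes = [((d, i), tokenize(s))
             for d, doc in enumerate(documents)
             for i, s in enumerate(doc)]
    intra, inter = {}, {}
    for (a, words_a), (b, words_b) in combinations(nodes, 2):
        shared = len(words_a & words_b)
        if shared == 0:
            continue  # no shared words, no edge
        target = intra if a[0] == b[0] else inter
        target[(a, b)] = shared
    return intra, inter

def rank_sentences(documents, inter_weight=2.0):
    """Rank nodes by weighted degree, scaling inter-layer edges.

    inter_weight is a hypothetical tuning knob, not a value from the paper.
    """
    intra, inter = build_multilayer_edges(documents)
    strength = {}
    for edges, scale in ((intra, 1.0), (inter, inter_weight)):
        for (a, b), w in edges.items():
            strength[a] = strength.get(a, 0.0) + scale * w
            strength[b] = strength.get(b, 0.0) + scale * w
    # Highest strength first; ties broken by node id for determinism.
    return sorted(strength, key=lambda n: (-strength[n], n))

docs = [
    ["The storm hit the coast on Monday.", "Schools were closed."],
    ["A storm hit the coast and flooded roads.", "Roads remain closed."],
]
intra, inter = build_multilayer_edges(docs)
ranking = rank_sentences(docs)
```

Keeping the two edge types in separate dictionaries makes it easy to weight them differently or to compare rankings with and without the distinction.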
An Abstract Study on Non-Identical Multi-Document Summarization Approaches
2018
The volume of data on the World Wide Web is currently growing exponentially. Extracting valid and valuable information from such enormous volumes has therefore become a challenging problem. Recently, text summarization has been recognized as one of the solutions for extracting relevant information from large documents. Based on the number of documents considered, the summarization task is classified as single-document or multi-document summarization. Compared with single-document summarization, multi-document summarization makes it more challenging for researchers to produce an accurate summary from multiple documents. In this paper, we begin with an introduction to multi-document summarization and then examine and analyze the different approaches that fall under it. The paper also details the advantages and issues of the current techniques. This should be particularly useful for researchers working in the field of text mining. Using this information, researchers can build new or hybrid approaches for multi-document summarization.