Document Clustering and Text Summarization

Text Summarization using Clustering Technique

A summarization system reduces a text document to a new form that conveys the key meaning of the original text. Because of information overload, access to sound, well-constructed summaries is necessary. Text summarization is among the most challenging tasks in information retrieval. Data reduction helps a user find the required information quickly, without spending time and effort reading the whole document collection. This paper presents a combined approach to document and sentence clustering as an extractive summarization technique.

Unstructured Text Documents Summarization With Multi-Stage Clustering

IEEE Access

In natural language processing, text summarization is an important application used to extract desired information by reducing large texts. Existing studies use keyword-based algorithms for grouping text, which do not capture the documents' actual theme. Our proposed dynamic corpus creation mechanism combines metadata with summarized extracted text. The proposed approach analyzes the mesh of multiple unstructured documents and generates a linked set of multiple weighted nodes by applying multistage clustering. We have generated adjacency graphs to link the clusters of various collections of documents. This approach comprises ten steps: pre-processing, making multiple corpuses, first-stage clustering, creating sub-corpuses, interlinking sub-corpuses, creating a page rank keyword dictionary for each sub-corpus, second-stage clustering, path creation among clusters of sub-corpuses, text processing by forward and backward propagation for results generation. The outcome of this technique consists of sub-corpuses interlinked through clusters. We have applied our approach to a News dataset; this interlinked corpus processing follows step-by-step clustering to search the most relevant parts of the corpus with lower cost and time and improved content detection. During our experiments, we applied six different metadata processing combinations over multiple text queries to compare results. The comparison of text satisfaction shows that Page-Rank keywords give 38% related text, single-stage clustering gives 46%, two-stage clustering gives 54%, and the proposed technique gives 67% associated text. Furthermore, this approach searches the relevant data across a range from most to least relevant content. It provides a systematic, query-relevant corpus processing mechanism that automatically selects the most relevant sub-corpus through dynamic path selection.
We used the SHAP model to evaluate the proposed technique, and our evaluation results proved that the proposed mechanism improved text processing. Moreover, combining text summarization features showed satisfactory results compared to the summaries generated by general abstractive and extractive summarization models.
Index Terms: Cosine similarity, page rank keywords, k-means, word2vec, summarized parallel corpus.
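The index terms above mention cosine similarity as a way to relate text to a query. The following is a minimal sketch of that idea only, assuming plain word-count vectors; the function names (term_counts, cosine_similarity, rank_by_query) are hypothetical, and this is not the paper's multistage pipeline, which is not specified in enough detail here to reproduce.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_query(query, passages):
    """Return the passages sorted from most to least similar to the query."""
    q = term_counts(query)
    return sorted(
        passages,
        key=lambda p: cosine_similarity(q, term_counts(p)),
        reverse=True,
    )
```

Raw counts are used here for brevity; a real system would more likely weight terms (for example with TF-IDF) or use embeddings such as word2vec, which the index terms also list.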

Analysis of Clustering Techniques for Query Dependent Text Document Summarization

IJESIT, 2013

The World Wide Web is a huge collection of data in different file formats. With the coming of the information revolution, electronic documents are becoming a principal medium of business and academic information. To use these online documents effectively, it is crucial to be able to extract their gist. It is not the case that a particular clustering algorithm is best suited for clustering documents of different file formats. Having a text summarization system would thus be immensely useful in serving this need. To generate a summary, we have to identify the most important pieces of information in the document, omit irrelevant information, minimize details, and assemble the result into a compact, coherent report. A particular clustering algorithm is best suited for query-dependent text document summarization. Since every document can be converted into text, this strategy is very useful for end users. The conclusion is drawn by using and comparing two clustering algorithms, namely the Nearest Neighbor and Agglomerative Hierarchical Clustering algorithms.
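For readers unfamiliar with agglomerative hierarchical clustering, one of the two algorithms compared above, here is a minimal sketch. It assumes single linkage and a caller-supplied distance function; the abstract does not state which linkage or distance the authors used, so both are assumptions, and the function name agglomerative is hypothetical.

```python
def agglomerative(points, num_clusters, distance):
    """Repeatedly merge the two closest clusters until num_clusters remain.

    Single linkage is assumed: the distance between two clusters is the
    smallest distance between any pair of their members.
    Returns a list of clusters, each a sorted list of indices into points.
    """
    num_clusters = max(1, num_clusters)
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    distance(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

For documents, points would typically be term vectors and distance something like one minus cosine similarity. This naive version recomputes all pairwise distances on every merge, which is fine for illustration but slow for large collections.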

Public Domain [Unpublished] :: Document Clustering, Summarization, and Visualization

CSE573 :: Group13 :: Term Project :: Final Report, 2022

The main idea of this project is to explore various document clustering, summarization, and visualization techniques that can be used in two ways, summarization of the cluster topic and clustering of summaries, and to evaluate their performance. The 20 Newsgroups dataset by Ken Lang has around 20,000 documents that are partitioned (nearly) evenly across 20 different categories. The first step in the project was to pre-process the documents, which included stop-word removal, data cleanup, and lemmatization. Next, document clustering consists of two stages: creating embedded vectors from text and clustering those vectors. Vector creation is done using a topic modeling technique called Top2Vec, and clustering using Bisecting K-means. Cluster summarization is implemented with a pretrained transformer, Bart-Large-CNN, an abstractive summarization model. Later, we used techniques and tools such as t-SNE, Plotly, and Dash to visualize the clustering results.

GENERATING AUTO TEXT SUMMARIZATION FROM DOCUMENT USING CLUSTERING

TJPRC, 2014

Auto text summarization is a method of reducing the size of a text document with a software program in order to generate a summary that retains the most important points of the original document. Interest in automatic summarization has increased because of information overload and the tremendous growth in the quantity of data. A coherent summary can be produced by taking into account variables such as length, writing style, and syntax. Google is one good example of the use of summarization technology. Two approaches, extraction and abstraction, are used for automatic text summarization. Extraction methods work by selecting a subset of existing sentences, phrases, or words from the original document to form the summary. In contrast, abstractive methods build an internal semantic representation of the content and then use natural language generation techniques to create a summary that is closer to what a human might generate. Such a summary may contain words not explicitly present in the original text.
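The extraction approach described above can be illustrated with a small sketch. It assumes a simple word-frequency score per sentence, a tiny hand-picked stop-word list, and a regular-expression sentence splitter; these choices and the names split_sentences and extractive_summary are illustrative assumptions, not the method of the paper.

```python
import re
from collections import Counter

# A deliberately small, illustrative stop-word list (an assumption).
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "it", "that",
    "this", "for", "on", "with", "as", "are", "was", "be", "by",
}


def content_words(text):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Select the highest-scoring sentences and return them in original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average document-level frequency of the sentence's content words.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]
```

Because the summary is assembled only from sentences already in the input, it cannot contain words absent from the original, which is the contrast with abstractive methods drawn in the abstract.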

Subject Review: Text Clustering Algorithms

International Journal of Engineering Research and Advanced Technology, 2020

Clustering algorithms have been attracting attention in recent times, owing to the huge volume of available datasets and the growth of parallelized computing architectures. The goal of a clustering algorithm is to divide the dataset into clusters such that objects within the same cluster are similar to each other and differ from objects in other clusters. Clustering algorithms play an important role in information retrieval, indexing, and text summarization. This paper gives a brief overview of several clustering algorithms.

Integrating Document Clustering and Multidocument Summarization

ACM Transactions on Knowledge Discovery from Data, 2011

Document understanding techniques such as document clustering and multidocument summarization have been receiving much attention recently. Current document clustering methods usually represent the given collection of documents as a document-term matrix and then conduct the clustering process. Although many of these clustering methods can group the documents effectively, it is still hard for people to capture the meaning of the documents since there is no satisfactory interpretation for each document cluster. A straightforward solution is to first cluster the documents and then summarize each document cluster using summarization methods. However, most of the current summarization methods are solely based on the sentence-term matrix and ignore the context dependence of the sentences. As a result, the generated summaries lack guidance from the document clusters. In this article, we propose a new language model to simultaneously cluster and summarize documents by making use of both the document-term and sentence-term matrices. By utilizing the mutual influence of document clustering and summarization, our method yields: (1) a better document clustering method with more meaningful interpretation; and (2) an effective document summarization method with guidance from document clustering. Experimental results on various document datasets show the effectiveness of our proposed method and the high interpretability of the generated summaries.
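To make the two input representations named above concrete, the sketch below builds a document-term matrix and a sentence-term matrix over a shared vocabulary. It covers only the data representation, assuming raw term counts and documents already split into sentences; the article's language model itself is not reproduced, and the names tokenize and build_matrices are hypothetical.

```python
import re


def tokenize(text):
    """Lowercased alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_matrices(documents):
    """Build count matrices from documents given as lists of sentence strings.

    Returns (vocabulary, document_term, sentence_term, sentence_owner), where
    document_term[d][k] and sentence_term[s][k] count term vocabulary[k], and
    sentence_owner[s] is the index of the document containing sentence s.
    """
    vocabulary = sorted({t for doc in documents for s in doc for t in tokenize(s)})
    index = {term: k for k, term in enumerate(vocabulary)}
    document_term, sentence_term, sentence_owner = [], [], []
    for d, doc in enumerate(documents):
        doc_row = [0] * len(vocabulary)
        for sentence in doc:
            sent_row = [0] * len(vocabulary)
            for t in tokenize(sentence):
                sent_row[index[t]] += 1
                doc_row[index[t]] += 1
            sentence_term.append(sent_row)
            sentence_owner.append(d)
        document_term.append(doc_row)
    return vocabulary, document_term, sentence_term, sentence_owner
```

Each document row is the sum of the rows of its own sentences, and sentence_owner records that link, which is the kind of coupling between the two matrices that a joint clustering and summarization model can exploit.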

“Analysis of Text Document Summarization Using Nearest Neighbour and Agglomerative Clustering”

2018

The World Wide Web is a huge collection of data in different file formats. With the coming of the information revolution, electronic documents are becoming a principal medium of business and academic information. To use these online documents effectively, it is crucial to be able to extract their gist. It is not the case that a particular clustering algorithm is best suited for clustering documents of different file formats. Having a text summarization system would thus be immensely useful in serving this need. To generate a summary, we have to identify the most important pieces of information in the document, omit irrelevant information, minimize details, and assemble the result into a compact, coherent report. A particular clustering algorithm is best suited for query-dependent text document summarization. Since every document can be converted into text, this strategy is very useful for end users. The conclusion is drawn by using an...

Text Summarization using Clustering Technique and SVM Technique

Text summarization is one of the problems in natural language processing. This system produces a single summarized document from multiple related documents. The summarizer provides an accurate result for the input query in the form of a precise text document by analyzing the text from various text document clusters. Two methodologies, clustering and Support Vector Machines (SVM), are used to solve this NLP problem. Present text summarizer systems use either the SVM or the clustering technique. In this work we propose a hybrid approach that cascades both techniques to get an improved summary of data on related documents. We preprocess the documents to obtain tokens after stemming and stop-word removal. The hybrid approach helps in summarizing the text documents efficiently by avoiding redundancy among the words in the document and ensures the highest relevance to the input query. The guiding factor of our results is the ratio of input to output sentences after summarization.

CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

iaeme

Text summarization addresses both the problem of selecting the most important portions of text and the problem of generating coherent summaries. Information retrieval plays an important role in searching the Internet, where a huge amount of data is present. There has been a great amount of work on query-independent summarization of documents. Most systems are limited to query processing based on an information retrieval system, in which matching the query against a set of text records is the core of the system. Retrieving the relevant natural language text document without ambiguity is more challenging. Most of today's search engines rely on keyword-based techniques, which have some disadvantages. We exploit NLP techniques to support a range of NL queries and snippets over an existing keyword-based search. This paper describes a simple system for choosing phrases from a document as key phrases. The system is made up of modules for splitting, tokenization, part-of-speech tagging, chunking, and parsing. We also use a new association algorithm to generate an accurate distance matrix and then combine or divide existing groups, creating a hierarchical cluster of related data that reflects the order in which groups are merged or divided. Before clustering, we also handle text ambiguity in order to generate an effective summary.