Comparative Study on Text Document Clustering Algorithms Based on Latent Semantic Indexing

Using Latent Semantic Indexing for Document Clustering

2010

Documents with varied contents are easily obtained from URLs associated with their titles. However, a title may not describe the content of its document and may serve only to attract readers to buy and read it. Clustering documents by category is therefore important for helping users retrieve the information they need. Document clustering is a data mining task: using a similarity measure over the documents' characteristics, documents can be grouped by category or topic. Because every substantial word is represented in the vector space model, the document representation is high-dimensional. This is one of the problems in document clustering that degrades cluster quality as measured by F-measure, entropy and accuracy. In the categorical domain, much research has been conducted to reduce the dimension of the term-document matrix representation, for example by using a keyword base. However, the results show low accuracy in vari...

Review on Text Clustering Using Statistical and Semantic Data

Eighth Sense Research Group

ABSTRACT The explosive growth of information stored in unstructured text has created great demand for new and powerful tools, such as text mining, to acquire useful information. Document clustering is one of its most powerful methods, by which document retrieval, organization and summarization can be achieved. Text documents are unstructured databases that contain collections of raw data. Clustering techniques are used to group text documents according to their similarity. Because the amount of unstructured data is huge and the features of the data are semantically correlated, such data is difficult to handle. A large number of feature selection methods are used to improve the efficiency and accuracy of the clustering process. Feature selection eliminates redundant and irrelevant items from the contents of the text documents. Statistical methods were used in the text clustering and feature selection algorithm. A semantic clustering and feature selection method was proposed to improve the clustering and feature selection mechanism using the semantic relations of the text documents. Keywords: Clustering, CHIR, CHIRSIM, K-means algorithm

A solution of semantic clustering of text documents

Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters. Measuring the similarity and dissimilarity of two documents is not always a clear-cut problem, and it depends on the topical affiliation of the documents. For example, when clustering research papers, two documents are regarded as similar if they share similar topics. When clustering is applied to web sites, we are usually more interested in clustering the component pages according to the type of information presented on each page. A variety of similarity or distance measures have been proposed and widely applied, such as cosine similarity, the Pearson correlation coefficient and Euclidean distance. This paper deals with semantic clustering of text documents written in the Serbian language. The aim is to prepare documents of different formats for clustering, to find keywords in the set of documents, to cluster documents based on those keywords, and to find the most appropriate document for a given question.
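The three measures named in this abstract can be sketched in a few lines of plain Python. This is only an illustration on toy term-count vectors; the function names and the example vectors (doc_a, doc_b, doc_c) are assumptions made here and do not come from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical term-count vectors over a shared four-term vocabulary.
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same term proportions as doc_a, but a longer document
doc_c = [0, 0, 3, 1]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
print(euclidean_distance(doc_a, doc_b))
print(pearson_correlation(doc_a, doc_b))
```

In this toy case doc_a and doc_b point in the same direction, so cosine similarity treats them as essentially identical even though their Euclidean distance is nonzero. That insensitivity to document length is one common reason cosine similarity is preferred for text.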

COMPARISON OF LATENT SEMANTIC ANALYSIS AND PROBABILISTIC LATENT SEMANTIC ANALYSIS FOR DOCUMENTS CLUSTERING

In this paper we compare the usefulness of statistical dimensionality reduction techniques for improving the clustering of documents in Polish. We start with partitional and agglomerative algorithms applied to the Vector Space Model. Then we investigate two transformations: Latent Semantic Analysis and Probabilistic Latent Semantic Analysis. The results obtained show the advantage of the Latent Semantic Analysis technique over the probabilistic model. We also analyse the time and memory consumption of these transformations and present runtime details for an IBM BladeCenter HS21 machine.

Document Representation and Dimension Reduction for Text Clustering

2007 IEEE 23rd International Conference on Data Engineering Workshop, 2007

Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted, in which three different document representation methods for text are used, together with three Dimension Reduction Techniques (DRT), in the context of the text clustering problem. Several standard benchmark datasets are used. The three Document representation methods considered are based on the vector space model, and they include word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on Document Frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For word and N-gram representation, ICA generally gives better results compared with LSI. Experiments also show that the word representation gives better clustering results compared to term and N-gram representation. Finally, for the N-gram representation, it is demonstrated that a profile length (before dimensionality reduction) of 2000 is sufficient to capture the information and, in most cases, a 4-gram representation gives better performance than 3-gram representation.
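As a rough illustration of two ingredients mentioned in this abstract, the sketch below builds character N-gram features and keeps only the features with the highest Document Frequency (DF). It is a minimal sketch under assumptions: the function names, the lowercasing step, the tie-breaking rule and the toy documents are all hypothetical, and the paper's actual preprocessing and profile construction may differ.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def select_by_document_frequency(docs_features, top_k):
    """Keep the top_k features that occur in the largest number of documents."""
    df = Counter()
    for feats in docs_features:
        df.update(set(feats))  # count each feature at most once per document
    # Sort by descending DF, then alphabetically so that ties are deterministic.
    ranked = sorted(df, key=lambda f: (-df[f], f))
    return ranked[:top_k]

def to_vectors(docs_features, vocabulary):
    """Represent each document as n-gram counts over the selected vocabulary."""
    vectors = []
    for feats in docs_features:
        counts = Counter(feats)
        vectors.append([counts[f] for f in vocabulary])
    return vectors

# Toy documents (hypothetical), using 4-grams as in the abstract's comparison.
docs = ["clustering", "cluster", "latent"]
features = [char_ngrams(d, 4) for d in docs]
vocabulary = select_by_document_frequency(features, top_k=4)
vectors = to_vectors(features, vocabulary)
print(vocabulary)
print(vectors)
```

The resulting vectors could then be passed to any k-means implementation. DF-based selection needs no class labels, which is why it suits the unsupervised clustering setting.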

Review on Semantic Document Clustering

In the age of information technology, the volume of textual documents, both online and offline, is growing rapidly. These documents range from product information to company profiles. Many sources generate valuable information as text, in medical reports, economic analyses, scientific journals, news, blogs and so on. Maintaining and accessing these documents is very difficult without proper classification, and proper document classification can overcome these problems. Only a few documents are classified; the rest all need classification, and for them only unsupervised methods apply. In this context clustering is the only solution. Traditional clustering techniques and textual clustering differ in some respects. Relations between words are very important for clustering. Semantic clustering has proven to be a more appropriate clustering technique for texts. This review paper presents information on clustering techniques, from general clustering to semantic document clustering, and on the advantages and disadvantages of various clustering methods.

High performance in minimizing of term-document matrix representation for document clustering

2009 Innovative Technologies in Intelligent Systems and Industrial Applications, 2009

Document clustering usually involves a high-dimensional term space, which makes it difficult to organize data into a small number of meaningful clusters. Clustering based on similar terms without considering content or meaning is often unsatisfactory, as it ignores the relationship between important terms that do not co-occur literally. In this paper, we propose to integrate the Latent Semantic Indexing (LSI) concept into our document clustering. This involves the use of Singular Value Decomposition (SVD), which creates a new abstract representation and provides a way of finding patterns in the matrix representation of the document collection, so that similar terms and documents can be identified. Using various numbers of patterns (ranks) of the SVD, the proposed method is applied to cluster documents with the Fuzzy C-Means algorithm. The experimental results show that document clustering performs better when the LSI method is applied.
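The core LSI step described here, projecting documents onto the top singular directions of the term-document matrix, can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names, the toy matrix and the choice of rank 2 are assumptions, and the Fuzzy C-Means clustering step is left out.

```python
import numpy as np

def lsi_document_vectors(term_doc, rank):
    """Project documents into a rank-dimensional latent space via truncated SVD.

    term_doc has one row per term and one column per document. The result has
    one row per document, holding its coordinates diag(s_k) @ Vt_k.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:rank]) @ Vt[:rank, :]).T

def cosine(a, b):
    """Cosine similarity between two NumPy vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy collection: rows are terms, columns are four documents.
term_doc = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)

raw_sim = cosine(term_doc[:, 0], term_doc[:, 1])  # 0.5: only "engine" is shared
docs_lsi = lsi_document_vectors(term_doc, rank=2)
lsi_sim = cosine(docs_lsi[0], docs_lsi[1])        # close to 1.0 in the latent space
cross_sim = cosine(docs_lsi[0], docs_lsi[2])      # close to 0.0: unrelated topics

print(raw_sim, lsi_sim, cross_sim)
```

In this toy example documents 0 and 1 use "car" and "automobile" respectively and never share those words, yet the rank-2 projection places them in the same direction because both co-occur with "engine". This is the effect the abstract refers to when it mentions terms that do not co-occur literally.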

Clustering and Classification of Text Documents Using Improved Similarity Measure

Dimensionality reduction is very challenging and important in text mining. We need to know which features to retain and which to discard, as this helps reduce the processing overhead when performing text classification and text clustering. Another concern in text clustering and text classification is the similarity measure chosen to determine the degree of similarity between any two text documents. In this paper, we work towards text clustering and text classification by addressing dimensionality reduction using SVD, followed by the use of the proposed similarity measure, which is an improved version of our previous measure [25, 31]. The proposed measure is used for both supervised and unsupervised learning. The proposed distance measure overcomes the disadvantages of the existing measures [10].

Semantic based Document Clustering: A Detailed Review

International Journal of Computer Applications, 2012

Document clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm in which clustering methods try to identify inherent groupings of text documents, producing a set of clusters that exhibit high intra-cluster similarity and low inter-cluster similarity. The importance of document clustering stems from the massive volumes of textual documents being created. Although numerous document clustering methods have been studied extensively in recent years, several challenges remain in increasing clustering quality. In particular, most current document clustering algorithms do not consider semantic relationships, which produces unsatisfactory clustering results. Over the last three to four years, efforts have been made to apply semantics to document clustering. Here, an exhaustive and detailed review of more than thirty semantics-driven document clustering methods is presented. After an introduction to document clustering and its basic requirements for improvement, traditional algorithms are reviewed. Semantic similarity measures are also explained. The article then discusses algorithms that make a semantic interpretation of documents for clustering. The semantic approach applied, datasets used, evaluation parameters applied, limitations and future work of all these approaches are presented in tabular format for easy and quick interpretation.