Self-Organizing Map vs Initial Centroid Selection Optimization to Enhance K-Means with Genetic Algorithm to Cluster Transcribed Broadcast News Documents
Related papers
Novel similarity-based clustering algorithm for grouping broadcast news
Proceedings of SPIE, 2002
The goal of this paper is to introduce a novel clustering algorithm designed for grouping transcribed textual documents obtained from audio and video segments. Since audio transcripts normally contain many errors, one of the major challenges at the text processing stage is to reduce the negative impact of errors introduced at the speech recognition stage. Other difficulties come from the nature of conversational speech. We describe the main difficulties of spoken documents and suggest an approach that limits their negative effects. We also present a clustering algorithm that groups transcripts on the basis of the informative closeness of documents. To carry out such partitioning, we give an intuitive definition of the "informative field of a transcript" and use it in our algorithm. To assess the informative closeness of transcripts, we apply the Chi-square similarity measure, which is also described in the paper. Our experiments with the Chi-square similarity measure showed its robustness and high efficacy. In particular, a performance analysis comparing Chi-square with three other similarity measures (Cosine, Dice, and Jaccard) showed that Chi-square is more robust to the specific features of spoken documents.
Document Clustering using Self-Organizing Maps
MENDEL
Cluster analysis of textual documents is a common technique for better filtering, navigation, understanding and comprehension of a large document collection. Document clustering is an autonomous method that separates a large heterogeneous document collection into smaller, more homogeneous sub-collections called clusters. A self-organizing map (SOM) is a type of artificial neural network (ANN) that can be used to perform autonomous self-organization of a high-dimensional feature space into low-dimensional projections called maps. It is considered a good method for clustering, as both require unsupervised processing. In this paper, we propose a multi-layer, multi-feature SOM to cluster documents. The paper implements a SOM using four layers containing lexical terms, phrases and sequences in the bottom layers respectively and combining all at the top layers. The documents are processed to extract these features to feed the SOM. The internal weights and interconnections between these layer...
2002
The goal of the paper is to present a novel Chi-square similarity measure and assess its performance through comparison with well-known similarity measures such as Cosine, Dice, and Jaccard. The Chi-square similarity measure has been designed to withstand the imperfections of transcribed spoken documents. The major difference between our similarity measure and the others is that, in addition to searching for co-occurring words in documents, we also match the informative closeness of common words. We assume that co-occurring words that have been employed to convey the same information should have compatible significance in the documents being matched. To test this, we apply the Chi-square method. Experimental results obtained on an archive of transcribed news broadcasts demonstrate the high efficacy of the proposed methodology.
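For orientation, the three baseline measures named in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: documents are assumed to be whitespace-tokenized bags of words, Cosine is computed on raw term frequencies, and Dice and Jaccard are computed on sets of distinct terms. The Chi-square measure itself is not specified in the abstract, so it is deliberately left out. The function names are illustrative only.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dice(a: Counter, b: Counter) -> float:
    """Dice coefficient over the sets of distinct terms."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if (sa or sb) else 0.0


def jaccard(a: Counter, b: Counter) -> float:
    """Jaccard coefficient over the sets of distinct terms."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0


# Two toy "transcripts" (made-up text, assumed whitespace tokenization).
doc1 = Counter("the senate passed the budget bill".split())
doc2 = Counter("the senate rejected the tax bill".split())

print(cosine(doc1, doc2), dice(doc1, doc2), jaccard(doc1, doc2))
```

All three measures only reward shared words; none of them looks at how significant a shared word is in each document, which is the gap the abstract says the Chi-square measure targets.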
Document Clustering Using Self-Organizing Maps : A Multi-Features Layered Approach
2017
Cluster analysis of textual documents is a common technique for better filtering, navigation, understanding and comprehension of a large document collection. Document clustering is an autonomous method that separates a large heterogeneous document collection into smaller, more homogeneous sub-collections called clusters. A self-organizing map (SOM) is a type of artificial neural network (ANN) that can be used to perform autonomous self-organization of a high-dimensional feature space into low-dimensional projections called maps. It is considered a good method for clustering, as both require unsupervised processing. In this paper, we propose a multi-layer, multi-feature SOM to cluster documents. The paper implements a SOM using four layers containing lexical terms, phrases and sequences in the bottom layers respectively and combining all at the top layers. The documents are processed to extract these features to feed the SOM. The internal weights and interconnections between t...
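The basic mechanism of a self-organizing map can be sketched as follows. This is a plain single-layer SOM with a Gaussian neighbourhood and linearly decaying learning rate, written only to illustrate how a high-dimensional feature vector is projected onto a small grid; it is not the multi-layer, multi-feature architecture described in the abstract. All names and hyperparameters (`train_som`, `lr0`, the decay schedule) are assumptions chosen for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - xi) ** 2 for w, xi in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map on a list of equal-length feature vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0  # initial neighbourhood radius (assumed)
    # One randomly initialised weight vector per map node.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                 # learning rate shrinks over time
        sigma = sigma0 * decay + 1e-3    # so does the neighbourhood radius
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                # Pull the node towards x, scaled by its grid distance to the BMU.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Toy document vectors (made-up values, e.g. normalised term weights).
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
som = train_som(docs, rows=2, cols=2)
for d in docs:
    print(d, "->", best_matching_unit(som, d))
```

After training, each document is assigned to the grid node returned by `best_matching_unit`, and documents mapped to the same or neighbouring nodes can be read as a cluster.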
Weighted segmental k-means initialization for som-based speaker clustering
Ninth Annual Conference …, 2008
A new approach for the initial assignment of data in a speaker clustering application is presented. This approach employs the Weighted Segmental K-Means clustering algorithm prior to competitive-based learning. The clustering system relies on Self-Organizing Maps (SOM) for speaker modeling and likelihood estimation. Performance is evaluated on 108 two-speaker conversations taken from the LDC CALLHOME American English Speech corpus using the NIST criterion and shows an improvement of approximately 48% in Cluster Error Rate (CER) relative to the randomly initialized clustering system. The number of iterations was also reduced significantly, which contributes to both the speed and efficiency of the clustering system.
Systematic Selection of Initial Centroid for K-Means Document Clustering System
2016
As the number of electronic documents generated from worldwide sources increases, it is hard to manually organize, analyze and present these documents efficiently. Document clustering is one of the traditional data mining techniques and an unsupervised learning paradigm. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize and organize information. The K-Means algorithm is the most commonly used partitional clustering algorithm because it can be easily implemented and is the most efficient one in terms of execution time. However, the major problem with this algorithm is that it is sensitive to the selection of initial centroids and may converge to local optima. The algorithm takes the initial cluster centres arbitrarily so it does not always guarantee good clustering resu...
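The sensitivity to initial centroids mentioned in this abstract is easy to reproduce on a toy example. The sketch below is a plain Lloyd-style K-Means that takes the starting centroids as an argument; it is not the systematic selection method proposed in the paper (the abstract is cut off before describing it). The data, the function names and the two starting configurations are assumptions made up for the illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Run Lloyd's algorithm from the given initial centroids.

    Returns the final centroids and the sum of squared errors (SSE).
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


# Three obvious groups on a line: {0, 1}, {10, 11}, {20, 21}.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]

good = kmeans(points, [(0.0,), (10.0,), (20.0,)])
bad = kmeans(points, [(0.0,), (1.0,), (10.0,)])

print(good)  # centroids 0.5, 10.5, 20.5 with SSE 1.5
print(bad)   # centroids 0.0, 1.0, 15.5 with SSE 101.0
```

Both runs converge, but the second one stops in a local optimum: two centroids split the first group while a single centroid has to cover the other two groups. This is the behaviour that initial centroid selection schemes try to avoid.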
Evaluation and comparison of concept based and n-grams based text clustering using SOM
INFOCOMP Journal of …, 2008
With the large and rapidly growing number of documents available in digital form (Internet, libraries, CD-ROM…), the automatic classification of texts has become a significant research field and a fundamental task in document processing. This paper deals with the unsupervised classification of textual documents, also called text clustering, using Kohonen's Self-Organizing Maps in two new settings: a conceptual representation of texts and a representation based on n-grams, instead of a representation based on words. The effects of these combinations are examined in several experiments using four similarity measures. The Reuters-21578 corpus is used for evaluation. The evaluation was done using the F-measure and entropy.
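As a rough illustration of what an n-gram representation looks like in contrast to a word-based one, the sketch below builds a character n-gram frequency profile for a text. The abstract does not say how the n-grams were extracted, so the choices here (character rather than word n-grams, lowercasing, whitespace collapsing, `n=3` as the default) are assumptions, and `char_ngrams` is a hypothetical helper name.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after simple normalisation."""
    # Lowercase and collapse runs of whitespace into single spaces.
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


profile = char_ngrams("Text  text")
print(profile.most_common(2))
```

A profile like this needs no word segmentation or stemming, and each document becomes a sparse vector over n-grams that can be fed to a SOM or compared with any vector similarity measure.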
A comparative study of clustering techniques for non-segmented language documents
Rangsit University, 2017
Document clustering has become an important area of study due to the rapid increase in the number of electronic documents. It can be employed to group and categorize documents, as well as to provide a useful summary of the categories for browsing purposes. Many clustering techniques have been developed for grouping and clustering documents in both segmented and non-segmented languages, such as English and some Asian languages, respectively. However, document clustering can be a complicated task for many Asian languages such as Chinese, Japanese, Korean and Thai, because these languages are written without explicit word boundary delimiters such as white space. The aim of this paper is to provide a comprehensive and comparative study of non-segmented document clustering techniques using the self-organizing map (SOM) and k-means, as they are two classic and well-known methods in the area of text clustering. To illustrate these two methods, experimental and comparative studies on clustering non-segmented documents using SOM and k-means are presented in this paper. The keyword extraction is first applied to search for the member of occurrences. These members are then used as an input for the next clustering process. The experimental results show that the k-means technique is simple and has a low computation cost. Meanwhile, SOM is relatively complex, but its clustering results are more visual and easier to comprehend. Consequently, the k-means technique has become a well-known text clustering method and is used in many fields due to its straightforwardness, while SOM performs well at detecting noisy documents, making it more suitable for applications such as navigation of document collections and multi-document summarization.
Document Clustering using Improved K-means Algorithm
Clustering is an efficient technique that organizes a large quantity of unordered text documents into a small number of significant and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. It is widely studied because of its broad application in areas such as web mining, search engines, and information extraction. It clusters documents based on various similarity measures. The existing K-means document clustering algorithm is based on random center generation, so the clusters generated differ on every run. In this paper, an Improved Document Clustering algorithm is given which generates a number of clusters for any set of text documents based on fixed center generation, collects only the exclusive words from the different documents in the dataset, and uses the cosine similarity measure to place similar documents in the proper clusters. Experimental results showed that the accuracy of the proposed algorithm is higher than that of the existing algorithm in terms of F-Measure, Recall, Precision and time complexity.