An effective web document clustering for information retrieval
Related papers
A fuzzy-based algorithm for web document clustering
2004
Most existing methods of document clustering are based on a model that assumes a fixed-size vector representation of key terms or key phrases within each document. This assumption is not realistic in large and diverse document collections such as the World Wide Web. We propose a new fuzzy-based document clustering method (FDCM) to cluster documents that are represented by variable-length vectors. Each vector element consists of two fields.
A new approach for fuzzy clustering of web documents
2004
Most existing methods of document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of key terms or key phrases. In large and diverse document collections such as the World Wide Web, this approach suffers from a tremendous computational overload, since the constant size of the term vector equals the total number of key terms in all documents. We propose a new fuzzy-based approach to clustering documents that are represented by vectors of variable size. Each entry in a vector consists of two fields. The first field is the name of a key phrase in the document and the second denotes an importance weight associated with this key phrase within the particular document. We describe the proposed approach in detail and show how it is implemented in a real-world application from the area of web monitoring.
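The variable-size representation described in this abstract can be illustrated with a minimal Python sketch. It assumes each document is stored as a mapping from key phrase to importance weight. The phrases, the weights, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper, whose own similarity and clustering procedure is not given in the abstract.

```python
from math import sqrt

# Each document is a variable-length mapping: key phrase -> importance weight.
# The phrases and weights below are invented for illustration only.
doc_a = {"fuzzy clustering": 0.8, "web documents": 0.6}
doc_b = {"web documents": 0.9, "information retrieval": 0.4, "search engine": 0.2}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse phrase-weight vectors.

    Only phrases present in both documents contribute to the dot product,
    so no fixed-size global vocabulary vector is ever built.
    """
    shared = set(a) & set(b)
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(doc_a, doc_b)
```

The cost of each comparison depends on the number of phrases in the two documents involved, not on the size of the vocabulary across the whole collection.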
2013 Joint IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS), 2013
The clustering of web search results has become a very interesting research area among academic and scientific communities involved in information retrieval. Web search result clustering systems, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review, while reducing the time spent reviewing them. Several algorithms for web document clustering already exist, but results show there is room for more to be done. This paper introduces a new description-centric algorithm for clustering web results called IFCWR. IFCWR initially selects a maximum estimated number of clusters using Forgy's strategy, then iteratively merges clusters until results cannot be improved. Every merge operation implies the execution of Fuzzy C-Means for clustering web search results and the calculation of the Bayesian Information Criterion for automatically evaluating the best solution and number of clusters. IFCWR was compared against other established web document clustering algorithms, among them Suffix Tree Clustering and Lingo. The comparison was executed on the AMBIENT and MORESQUE datasets, using precision, recall, F-measure, SSLk and other metrics. Results show a considerable improvement in clustering quality and performance.
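Two of the building blocks named in this abstract, Forgy-style initialisation and Fuzzy C-Means, can be sketched in plain Python. This is a generic textbook version under stated assumptions and is not the IFCWR algorithm. It omits the cluster merging and the Bayesian Information Criterion step. The function name, the fuzzifier `m=2.0`, the fixed iteration count and the Euclidean distance are all illustrative choices.

```python
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=50, seed=0):
    """Generic Fuzzy C-Means sketch (not the IFCWR implementation).

    points: list of equal-length numeric tuples.
    Returns (centers, memberships); memberships[j][i] is the degree to
    which point j belongs to cluster i, and each row sums to 1.
    """
    rng = random.Random(seed)
    # Forgy-style initialisation: use randomly chosen data points as centres.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ji = 1 / sum_k (d_ji / d_jk)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
                     for c in centers]
            if any(d == 0.0 for d in dists):
                # The point sits exactly on one or more centres:
                # split full membership among those centres.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((di / dk) ** exponent for dk in dists)
                       for di in dists]
            memberships.append(row)
        # Centre update: mean of the points weighted by u^m.
        for i in range(n_clusters):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            if total > 0:
                centers[i] = [
                    sum(w * p[d] for w, p in zip(weights, points)) / total
                    for d in range(dim)
                ]
    return centers, memberships


# Toy two-dimensional data, invented for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, memberships = fuzzy_c_means(points, n_clusters=2)
```

Unlike K-means, every point keeps a graded membership in every cluster, which is what allows a search result to belong partly to several topics.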
Frequent Term Based Text Document Clustering - A Novel Approach
Clustering is one of the long-established ways to ensure that documents are retrieved quickly and according to the requirement. Clustering keeps similar documents together so that they can be retrieved easily. The measure of the relation between two documents is called the similarity index.
Web Document Clustering using Proposed Similarity Measure
2014
Recent advances in data warehousing and data mining research have produced various types of information sources. Web documents are among the most useful information resources of this era, and using them efficiently is important for knowledge discovery. Documents that provide related information should be grouped into one cluster, but finding the similarity between documents is a tedious task. Various similarity measures have been introduced to solve problems related to clustering. This paper proposes a new similarity measure intended to give better clustering results. Previous research does not consider both the present and the absent features of documents; the proposed similarity measure takes both into account. Improving the similarity measure in turn supports the mining technique.
Advanced Data Clustering Methods of Mining Web Documents
Issues in Informing Science and Information Technology, 2006
The aim of this paper is to evaluate, propose and improve the use of advanced web data clustering techniques, allowing data analysts to conduct more efficient execution of large-scale web data searches. Increasing the efficiency of this search process requires a detailed knowledge of abstract categories, pattern matching techniques, and their relationship to search engine speed. In this paper we compare several alternative advanced techniques of data clustering in creation of abstract categories for these algorithms. These algorithms will be submitted to a side-by-side speed test to determine the effectiveness of their design. In effect this paper serves to evaluate and improve upon the effectiveness of current web data search clustering techniques.
2010
This paper introduces a new description-centric algorithm for web document clustering based on the hybridization of the Global-Best Harmony Search with the K-means algorithm, Frequent Term Sets and the Bayesian Information Criterion. The new algorithm defines the number of clusters automatically. The Global-Best Harmony Search provides a global strategy for a search in the solution space, based on the Harmony Search and the concept of swarm intelligence. The K-means algorithm is used to find the optimum value in a local search space. The Bayesian Information Criterion is used as a fitness function, while FP-Growth is used to reduce the high dimensionality of the vocabulary. The resulting algorithm, called IGBHSK, was tested with data sets based on Reuters-21578 and DMOZ, obtaining promising results (better precision than a Singular Value Decomposition algorithm). It was then also evaluated by a group of users.
Challenging Issues and Similarity Measures for Web Document Clustering
The Web contains a large number of documents available in electronic form. These documents come in various forms, and the information in them is not organized. The lack of organization of materials on the WWW motivates people to manage this huge amount of information automatically. Text mining refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. A text mining framework comprises Information Retrieval, Information Extraction, Information Mining and Interpretation. During Information Retrieval, many web documents are retrieved. How can similar documents be found among those retrieved? This paper deals with the challenging issues and similarity measures for web document clustering.