A Consistent Web Documents Based Text Clustering Using Concept Based Mining Model

Efficient Conceptual Rule Mining on Text Clusters in Web Documents

International Journal of Computer Applications, 2012

Text mining is a modern computational approach that attempts to discover new, previously unknown information by applying techniques from natural language processing and data mining. Clustering, one of the conventional data mining techniques, is an unsupervised learning paradigm in which clustering techniques attempt to identify intrinsic groupings of text documents, so that a set of clusters is formed in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity. Most current document clustering methods are based on the Vector Space Model (VSM), which is a widely used data representation for text classification and clustering. Moreover, weighting these features accurately also affects the result of the clustering algorithm substantially. The previous work applied conceptual text clustering to web documents containing various markup language formats associated with the documents (term extraction mode). In this work, we present conceptual rule mining, in which rules are generated for the sentence meaning and related sentences in the document. Weights are assigned to the sentences that contribute more to the topic of the document. Conditional probability is evaluated for the sentence weights. A probability ratio is identified for the sentence similarity, from which the unique sentence meanings contributing to the document topic are listed. Experiments are conducted with web documents extracted from research repositories to evaluate the efficiency of the proposed conceptual rule mining on text clusters in web documents, and the results are compared with an existing model for concept-based clustering and classification in terms of topic-related rules, weights of the influential sentences, and topic sensitivity.
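
As a rough illustration of the Vector Space Model and of the "high intra-cluster, low inter-cluster similarity" goal described in the abstract above, the following minimal sketch represents each document as a bag-of-words term-frequency vector and compares documents with cosine similarity. The toy documents, the whitespace tokenizer, and the function names are assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents: d1 and d2 share a topic, d3 does not.
docs = {
    "d1": "text clustering groups similar documents",
    "d2": "document clustering groups similar text",
    "d3": "protein folding in yeast cells",
}
vectors = {name: tf_vector(text) for name, text in docs.items()}

sim_same_topic = cosine(vectors["d1"], vectors["d2"])
sim_other_topic = cosine(vectors["d1"], vectors["d3"])
print(sim_same_topic, sim_other_topic)
```

A good clustering would place d1 and d2 together, because their similarity is much higher than the similarity of either to d3. Note that "documents" and "document" are treated as different terms here; a real pipeline would usually add stemming and a weighting scheme such as TF-IDF.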

Concept Based Mining Model for Text Clustering

2013

The common techniques in text mining are based on the statistical analysis of a term, either a word or a phrase. Statistical analysis of term frequency captures the importance of the term within a document only. Two terms can have the same frequency in their documents, but one term may contribute more to the meaning of its sentences than the other. A new concept-based mining model that analyzes terms at the sentence, document, and corpus levels is introduced. The concept-based mining model can effectively discriminate between terms that are unimportant with respect to sentence semantics and terms that hold the concepts representing the sentence meaning. The proposed model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure for calculating the similarity between documents.
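
The observation that two terms can share a document frequency yet differ at the sentence level can be made concrete with a small sketch. The code below is only an illustration of counting at the three levels named in the abstract (sentence, document, corpus); it is not the authors' concept-based measure. The toy corpus, the assumption that concept terms have already been extracted per sentence, and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of sentences, and each
# sentence is a list of already-extracted concept terms (assumed input).
corpus = {
    "doc_a": [["clustering", "documents"], ["clustering", "quality"]],
    "doc_b": [["documents", "web"], ["mining", "mining"]],
}


def document_counts(sentences):
    """Document level: how often each term occurs in the whole document."""
    total = Counter()
    for sentence in sentences:
        total.update(sentence)
    return total


def sentence_spread(term, sentences):
    """Sentence level: in how many sentences of the document the term occurs."""
    return sum(1 for sentence in sentences if term in sentence)


def corpus_document_frequency(all_docs):
    """Corpus level: in how many documents each term occurs."""
    df = Counter()
    for sentences in all_docs.values():
        df.update({term for sentence in sentences for term in sentence})
    return df


tf_a = document_counts(corpus["doc_a"])
tf_b = document_counts(corpus["doc_b"])
df = corpus_document_frequency(corpus)

# "clustering" and "mining" have the same document-level frequency,
# but "clustering" is spread over more sentences of its document.
print(tf_a["clustering"], sentence_spread("clustering", corpus["doc_a"]))
print(tf_b["mining"], sentence_spread("mining", corpus["doc_b"]))
print(df["documents"])
```

Plain term frequency cannot tell "clustering" and "mining" apart in this example, while the sentence-level count can, which is the kind of distinction the concept-based model is motivated by.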

Enhancing text clustering using concept-based mining model

2006

Most text mining techniques are based on word and/or phrase analysis of the text. The statistical analysis of term (word or phrase) frequency captures the importance of the term within a document. However, to achieve a more accurate analysis, the underlying mining technique should identify terms that capture the semantics of the text, from which the importance of a term in a sentence and in the document can be derived. A new concept-based mining model is introduced that relies on the analysis of both the sentence and the document, rather than the traditional analysis of the document dataset only. The proposed mining model consists of a concept-based analysis of terms and a concept-based similarity measure. A term that contributes to the sentence semantics is analyzed with respect to its importance at the sentence and document levels. The model can efficiently find significant matching terms, either words or phrases, in the documents according to the semantics of the text. The similarity between documents relies on a new concept-based similarity measure that is applied to the matching terms between documents. Experiments using the proposed concept-based term analysis and similarity measure in text clustering are conducted. Experimental results demonstrate that the newly developed concept-based mining model substantially enhances the clustering quality of sets of documents.

An Efficient Concept-Based Mining Model for Enhancing Text Clustering

The common techniques in text mining are based on the statistical analysis of a term, either a word or a phrase. Text is represented by the words it mentions, and thematic similarity is based on the proportion of words that texts have in common. The complex is constructed using groups of co-occurring words (term associations) identified using traditional data mining methods. Disjoint subsections of the complex (connected components) represent general concepts within the documents' concept space. A new concept-based mining model, composed of four components, is proposed to improve text clustering quality. By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.

Improving Text Clustering Quality by Concept Mining

2013

In text mining, most techniques depend on the statistical analysis of terms. Statistical analysis captures the importance of terms within a document only. In contrast, this concept-based mining model analyzes terms at the sentence, document, and corpus levels. The mining model consists of sentence-based, document-based, and corpus-based concept analysis and a concept-based similarity measure. Experimental results show that text clustering quality is enhanced by using the sentence, document, corpus, and combined approaches to concept analysis.

Concept-based knowledge discovery in texts extracted from the web

Sigkdd Explorations, 2000

This paper presents an approach for knowledge discovery in texts extracted from the Web. Instead of analyzing words or attribute values, the approach is based on concepts, which are extracted from texts to be used as characteristics in the mining process. Statistical techniques are applied to concepts in order to find interesting patterns in concept distributions or associations. In this way, users can perform discovery at a high level, since concepts describe real-world events, objects, thoughts, etc. To identify concepts in texts, a categorization algorithm is used in association with a prior classification task for concept definitions. Two experiments are presented: one for political analysis and the other for competitive intelligence. At the end, the approach is discussed, examining its problems and advantages in the Web context.

A Survey Paper on Concept Mining in Text Documents

Concept mining has become an important research area. Concept mining is used to search for or extract the concepts embedded in a text document. The concept-based approach searches for informative terms based on their meaning rather than on the presence of a keyword in the text.

Design and Develop Semantic Textual Document Clustering Model

The use of textual documents is steadily increasing across the internet, email, web pages, reports, journals, and articles, and they are stored in electronic database formats. It is challenging to find and access these documents without proper classification mechanisms. To overcome such difficulties, we propose and develop a semantic document clustering model. The document pre-processing steps and semantic information from WordNet help us obtain the semantic relations from raw text. Bearing in mind the limitations of traditional clustering algorithms on natural language, we perform semantic clustering with COBWEB conceptual clustering. Clustering quality and high accuracy were among the most important aims of our research, and we chose F-Measure evaluation to assess the purity of the clustering. However, many challenges remain, such as word-level issues, high dimensionality, extracting the core semantics from texts, and assigning adequate descriptions to the generated clusters. With the help of the WordNet database, we address those issues. In this paper, we propose a framework and describe its development and evaluation.
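
The abstract above mentions F-Measure evaluation without stating the formula. The sketch below implements one commonly used definition of the clustering F-measure: for each true class, take the best F1 score over all clusters, then average these scores weighted by class size. This choice of definition is an assumption, as are the function name and the toy labels; the paper may use a different variant.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted best-match F-measure for a flat clustering."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # overlap[(class, cluster)] = number of items with that class and cluster
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            f1 = 2 * precision * recall / (precision + recall)
            best = max(best, f1)
        total += (n_i / n) * best
    return total


# Hypothetical gold classes and cluster assignments for four documents.
perfect = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 1, 1])
imperfect = clustering_f_measure(["a", "a", "a", "b"], [0, 0, 1, 1])
print(perfect, imperfect)
```

A score of 1.0 means every class is reproduced exactly by some cluster; misplacing one document, as in the second call, lowers the score.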

Document Clustering based on Semantic Notions

2017

ing for each document. There are several possible extensions to this work:

- The proposed document clustering approach has many practical applications. One direction is to apply this technique to a specific application area, along with application-specific optimizations, to see the outcome. For example, web search results can be clustered using this approach, and snippets can be generated for each cluster to assess their quality.
- In the proposed approach, each term, whether it comes from a lexical chain or from topic maps, has an equal effect on the similarity calculation for a pair of documents. One possible direction is to introduce discriminative feature weighting for the features in this approach. Discriminative feature weighting has shown encouraging results for both text clustering and classification tasks.
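
To illustrate the difference between giving every term an equal effect and weighting terms by how discriminative they are, the sketch below compares a uniform weighted-overlap similarity with one that uses an IDF-style weight. The IDF-style weight is a stand-in chosen for this example and is not the discriminative weighting the authors have in mind; the feature sets and function names are likewise hypothetical.

```python
import math


def weighted_overlap(features_a, features_b, weight):
    """Weighted Jaccard-style similarity between two feature sets."""
    shared = features_a & features_b
    union = features_a | features_b
    denom = sum(weight(f) for f in union)
    if denom == 0:
        return 0.0
    return sum(weight(f) for f in shared) / denom


# Hypothetical feature sets (e.g. terms from lexical chains or topic maps).
docs = {
    "d1": {"clustering", "document", "topic"},
    "d2": {"clustering", "document", "snippet"},
    "d3": {"document", "lexical", "chain"},
}

# Document frequency of each feature across the toy collection.
n_docs = len(docs)
doc_freq = {}
for features in docs.values():
    for f in features:
        doc_freq[f] = doc_freq.get(f, 0) + 1


def uniform(feature):
    """Every feature has the same effect on the similarity."""
    return 1.0


def idf_like(feature):
    """Rarer features get a larger weight (an assumed, IDF-style choice)."""
    return math.log(1 + n_docs / doc_freq[feature])


sim_uniform = weighted_overlap(docs["d1"], docs["d2"], uniform)
sim_weighted = weighted_overlap(docs["d1"], docs["d2"], idf_like)
print(sim_uniform, sim_weighted)
```

With uniform weights, the two shared features count as much as the two unshared ones. With the IDF-style weights, the shared features are the common ones and therefore count for less, so the similarity drops.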