Text mining with conceptual graphs (original) (raw)

Text mining at detail level using conceptual graphs

… Structures: Integration and …, 2002

Text mining is defined as knowledge discovery in large text collections. It detects interesting patterns such as clusters, associations, deviations, similarities, and differences in sets of texts. Current text mining methods use simplistic representations of text contents, such as keyword vectors, which imply serious limitations on the kind and meaningfulness of possible discoveries. We show how to do some typical mining tasks using conceptual graphs as formal but meaningful representation of texts. Our methods involve qualitative and quantitative comparison of conceptual graphs, conceptual clustering, building a conceptual hierarchy, and application of data mining techniques to this hierarchy in order to detect interesting associations and deviations. Our experiments show that, despite widespread misbelief, detailed meaningful mining with conceptual graphs is computationally affordable. * Work done under partial support of CONACyT, CGEPI-IPN, and SNI, Mexico. patterns, i.e., those distinguishing not only entities (topics) but also actions, attributes and their relations.

Concept Based Mining Model for Text Clustering

2013

The common techniques in text mining are based on the statistical analysis of a term either word or phrase. Statistical analysis of a term frequency captures the importance of the term within a document only. Two terms can have the same frequency in their documents, but one term contributes more to the meaning of its sentences than the other term. A new concept-based mining model that analyzes terms in the sentence, document level and corpus level is introduced. The concept based mining model can effectively discriminate between non important terms with respect to sentence semantics and terms which hold the concepts that represent the sentence meaning. The proposed model consists of sentence-based concept analysis, document-based concept analysis, corpus based concept analysis and concept-based similarity measure in calculating the similarity between documents.

An Efficient Concept-Based Mining Model for Enhancing Text Clustering

The common techniques in text mining are based on the statistical analysis of a term, either word or phrase.Text is represented by the words it mentions, and thematic similarity is based on the proportion of words that texts have in common. The complex is constructed using groups of co-occurring words (term associations) identified using traditional data mining methods. Disjoint subsections of the complex (connect components) represent general concepts within the documents' concept space. A new concept-based mining model composed of four components, is proposed to improve the text clustering quality. By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.

Enhancing text clustering using concept-based mining model

2006

Most of text mining techniques are based on word and/or phrase analysis of the text. The statistical analysis of a term (word or phrase) frequency captures the importance of the term within a document. However, to achieve a more accurate analysis, the underlying mining technique should indicate terms that capture the semantics of the text from which the importance of a term in a sentence and in the document can be derived. A new concept-based mining model that relies on the analysis of both the sentence and the document, rather than, the traditional analysis of the document dataset only is introduced. The proposed mining model consists of a concept-based analysis of terms and a concept-based similarity measure. The term which contributes to the sentence semantics is analyzed with respect to its importance at the sentence and document levels. The model can efficiently find significant matching terms, either words or phrases, of the documents according to the semantics of the text. The similarity between documents relies on a new concept-based similarity measure which is applied to the matching terms between documents. Experiments using the proposed concept-based term analysis and similarity measure in text clustering are conducted. Experimental results demonstrate that the newly developed concept-based mining model enhances the clustering quality of sets of documents substantially.

Concept Mining using Conceptual Ontological Graph (COG)

Concept mining (CM) is the area of exploring and finding links, associations, relationships, and patterns among huge collections of information. In this paper, we propose concept-based text representation, with an emphasis on using the proposed representation in different application s such as information retrieval, text summarization, and question answering. This work presents a new paradigm for concept mining by extracting the concept-based information from a raw text. At the text representation level, we introduce a sentence based conceptual ontological representation that builds concept-based representations for the whole document. A new concept-based similarity measure is proposed to measure the similarity of texts based on their meaning. The proposed approach is domain independent and it could be applied to general domain applications. The proposed approach has been applied to the domain of information retrieval and preliminary results are promising, and give an affirmation for proceeding in the right directions of this research.

A Consistent Web Documents Based Text Clustering Using Concept Based Mining Model

2012

Text mining is a growing innovative field that endeavors to collect significant information from natural language processing term. It might be insecurely distinguished as the course of examining texts to extract information that is practical for particular purposes. In this case, the mining model can detain provisions that identify the concepts of the sentence or document, which tends to detect the subject of the document. In an existing work, the concept-based mining model is used only for normal text documents clustering and clustered the text parts of the documents and efficiently discover noteworthy identical concepts among documents, according to the semantics of the sentences. But the downside of the work is that the existing work cannot be linked to web documents clustering and the text classification for the documents is an unreliable one. To make the text clustering more consistent, in our work, we plan to present a Conceptual Rule Mining On Text clusters to evaluate the mo...

Efficient Conceptual Rule Mining on Text Clusters in Web Documents

International Journal of Computer Applications, 2012

Text mining is a modern and computational approach attempts to determine new, formerly unidentified information by pertaining techniques from normal language processing and data mining. Clustering, one of the conventional data mining techniques is an unsubstantiated learning pattern where clustering techniques attempt to recognize intrinsic groupings of the text documents, so that a set of clusters is formed in which clusters reveal high intra-cluster comparison and low inter-cluster similarity. Most current document clustering methods are based on the Vector Space Model (VSM), which is a widely used data representation for text classification and clustering. Moreover, weighting these features accurately also affects the result of the clustering algorithm substantially. The previous work described the conceptual text clustering to web documents, containing various mark up language formats associated with the documents (term extraction mode). In this work, we are going to present a Conceptual rule mining which is generated for the sentence meaning and related sentences in the document. Weights are appropriated for the sentences having higher contribution to the topic of the document. Conditional probability is evaluated for the sentence weights. Probability ratio is identified for the sentence similarity from which unique sentence meaning contributing to the document topic are listed. Experiments are conducted with the web documents extracted from the research repositories to evaluate the efficiency of the proposed efficient conceptual rule mining on text clusters in web documents and compared with an existing Model for Concept Based Clustering and Classification in terms of Topic related rules, Weights of the influential sentence, Topic Sensitivity..

Graph-based hierarchical conceptual clustering

The Journal of Machine Learning …, 2002

Hierarchical conceptual clustering has proven to be a useful, although under-explored, data mining technique. A graph-based representation of structural information combined with a substructure discovery technique has been shown to be successful in knowledge discovery. The SUBDUE substructure discovery system provides one such combination of approaches. This work presents SUBDUE and the development of its clustering functionalities. Several examples are used to illustrate the validity of the approach both in structured and unstructured domains, as well as to compare SUBDUE to the Cobweb clustering algorithm. We also develop a new metric for comparing structurally-defined clusterings. Results show that SUBDUE successfully discovers hierarchical clusterings in both structured and unstructured data.

A ConceptLink Graph for Text Structure Mining

Australasian Computer Science Conference, 2009

Most text mining methods are based on representing doc- uments using a vector space model, commonly known as a bag of word model, where each document is modeled as a linear vector representing the occurrence of independent words in the text corpus. It is well known that using this vector-based representation, important information, such as semantic relationship among concepts, is

Mining Conceptual Graphs for Knowledge Acquisition

2008

This work addresses the use of computational linguistic analysis techniques for conceptual graphs learning from unstructured texts. A technique including both content mining and interpretation, as well as clustering and data cleaning, is introduced. Our proposal exploits sentence structure in order to generate concept hypothese, rank them according to plausibility and select the most credible ones. It enables the knowledge acquisition task to be performed without supervision, minimizing the possibility of failing to retrieve information contained in the document, in order to extract non-taxonomic relations.

A Frequent Concepts Based Document Clustering Algorithm

International Journal of Computer Applications, 2010

This paper presents a novel technique of document clustering based on frequent concepts. The proposed technique, FCDC (Frequent Concepts based document clustering), a clustering algorithm works with frequent concepts rather than frequent items used in traditional text mining techniques. Many well known clustering algorithms deal with documents as bag of words and ignore the important relationships between words like synonyms. the proposed FCDC algorithm utilizes the semantic relationship between words to create concepts. It exploits the WordNet ontology in turn to create low dimensional feature vector which allows us to develop a efficient clustering algorithm. It uses a hierarchical approach to cluster text documents having common concepts. FCDC found more accurate, scalable and effective when compared with existing clustering algorithms like Bisecting K-means , UPGMA and FIHC.

2013

Automatic lexicon generation is a useful task in learning text fragment patterns. In our previous work we have focused on text fragment pattern learning through the fuzzy grammar method which inputs include a predefined lexicon and text fragments that represents the expression of the grammar class to be learned. However, the bottleneck of the success of the fuzzy grammar creation and in common with other text learner often lies in the knowledge acquisition phase; due to the labour intensive text annotation which also demands skills and background knowledge of the text. For this reason, a semi-automated technique called automatic Terminal Grammar Recommender (TGR) is devised to identify conceptually related lexicons in the texts and their related to create terminal grammars by mining associations of words contexts. The approach recognizes that there is a degree of local structure within such text and the technique exploits the local structure without the large computational overhead of deeper analysis. Result from the comparison of the associative words detected by TGR with the definition of a content category tool called General Inquirer on the data from European Central Bank data is reported. Our findings show that our proposed method has managed to reduce the manual effort of identifying conceptually similar lexicons to form terminal grammars. The average of matched generated terminal grammar clusters compared to General Inquirer is 54.85% which indicates that at least half the expensive effort to construct conceptually related lexicon is saved. This hint the potential of word context association mining in automated conceptual lexicon generation.

Automatic structuring of knowledge bases by conceptual clustering

IEEE Transactions on Knowledge and Data Engineering, 1995

An important structuring mechanism for knowledge bases is building an inheritance hierarchy of classes based on the content of their knowledge objects. This hierarchy facilitates group-related processing tasks such as answering set queries, discriminating between objects, finding similarities among objects, etc. Building this hierarchy is a difficult task for the knowledge engineer. Conceptual clustering may be used to automate or assist the engineer in the creation of such a classification structure. This article introduces a new conceptual clustering method which addresses the problem of clustering large amounts of structured objects. The conditions under which the method is applicable are discussed.

GDClust: A Graph-Based Document Clustering Technique

Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), 2007

This paper introduces a new technique of document clustering based on frequent senses. The proposed system, GDClust (Graph-Based Document Clustering) works with frequent senses rather than frequent keywords used in traditional text mining techniques. GDClust presents text documents as hierarchical document-graphs and utilizes an Apriori paradigm to find the frequent subgraphs, which reflect frequent senses. Discovered frequent subgraphs are then utilized to generate sense-based document clusters. We propose a novel multilevel Gaussian minimum support approach for candidate subgraph generation. GDClust utilizes English language ontology to construct document-graphs and exploits graph-based data mining technique for sense discovery and clustering. It is an automated system and requires minimal human interaction for the clustering purpose.

A Survey Paper on Concept Mining in Text Documents

Concept Mining has become an important research area. Concept Mining is used to search or extract the concepts embedded in the text document. Concept based approach search for the informative terms based on their meaning rather than on the presence of the keyword in the text.

Improving Text Clustering Quality by Concept Mining

2013

In text mining most techniques depends on statistical analysis of terms. Statistical analysis trances important terms within document only. However this concept based mining model analyses terms in sentence, document and corpus level. This mining model consist of sentence based concept analysis, document based and corpus based concept analysis and concept based similarity measure. Experimental result enhances text clustering quality by using sentence, document, corpus and combined approach of concept analysis.

Clustering Data Text Based on Semantic

2017

Clustering is one of the most important data mining techniques which categorize a large number of unordered text documents into meaningful and coherent clusters. Most of text clustering algorithms do not consider the semantic relationships between words and do not have the ability to recognize and use the semantic concepts.In this paper, a new algorithm has been presented to cluster texts based on meanings of the words. First, a new method has been presented to find semantic relationship between words based on Wordnet ontology then, text data is clustered using the proposed method and hierarchical clustering algorithm. Documents are preprocessed, converted to vector space model, and then are clustered using the proposed algorithm semantically. The experimental results show that the quality and accuracy of the proposed algorithm are more reliable than the existing hierarchical clustering algorithms.

Design and Develop Semantic Textual Document Clustering Model

The utilization of textual documents is spontaneously increasing over the internet, email, web pages, reports, journals, articles and they stored in the electronic database format. It is challenging to find and access these documents without proper classification mechanisms. To overcome such difficulties we proposed a semantic document clustering model and develop this model. The document pre-processing steps, semantic information from WordNet help us to be bioavailable the semantic relation from raw text. By reminding the limitation of traditional clustering algorithms on the natural language, we consider semantic clustering by COBWEB conceptual clustering. Clustering quality and high accuracy were one of the most important aims of our research, and we chose F-Measure evaluation for ensuring the purity of clustering. However, there still exist many challenges, like the word, high spatial property, extracting core linguistics from texts, and assignment adequate description for the generated clusters. By the help of Word Net database, we eliminate those issues. In this research paper, there have a proposed framework and describe our development evaluation with evaluation.

An Improved Hierarchical Technique for Document Clustering

2015

Data mining is the process of non-trivial discovery from implied, previously unknown, and potentially useful information from data in large databases. Hence it is a core element in knowledge discovery, often used synonymously. Clustering, one of technique for data mining used for grouping similar terms together. Earlier statistical analysis used in text mining depends on term frequency. Then, new concept based text mining model was introduced which analyses terms. Clustering of document is useful for the purpose of document organization, summarization, and information retrieval in an efficient way. Initially, clustering is applied for enhancing the information retrieval techniques. Of late, clustering techniques have been applied in the areas which involve browsing the gathered data or in categorizing the outcome provided by the search engines for the reply to the query raised by the users. In this paper, we are providing a comprehensive survey over the document clustering.

Text Schema Mining Using Graphs and Formal Concept Analysis

2002

This paper presents an investigation into finding and evaluating schemata through formal concept analysis. Schemata are used in conceptual authoring support to provide proven building blocks of text structures. As still only few schemata are available, ways to mine them from structures of existing texts seem worthwhile. The general process begins with the structure of a text as a graph, transforms this into a formal context and examines the formal concept lattice for this context. Especially formal concepts with large extents may be candidates for schemata. Three alternative kinds of transformations are presented: Wille’s Natural transformation produces contexts mainly based on type and connection information, Schema-derived transformations derive of attributes that identify partial or complete instances from a set of schemata, Informal: Starting from a set of schemata, manually formulate conditions that may be present in the instance graph and contribute to the presence of such schemata. We have regarded document structures consisting of a hierarchy of sections and subsections, which may import and export topics. The topics are interconnected in a conceptual graph called the topic map. Results of processing two such structures with the natural transformation and an informal one are reported. Some notes on the implementation in the Chasid prototype are given.