Erasmus Mundus Masters Program in Language and Communication Technology A graph model for text analysis and text mining

TM-SGTD: Text Mining Based Semantic Graph for Text Document Approach for Text Representation

International Journal of Engineering and Technology, 2017

Text representation is an essential step in text mining tasks. To represent textual information more expressively, a Text Mining based Semantic Graph approach is proposed, which incorporates more semantic and ordering information among terms as well as the structural information of the text. Such a model can be constructed by extracting representative terms from texts together with their mutual semantic relationships. The proposed work is implemented in the Java and Python environments. WordNet is used to show the relationships among word nodes, and the Gephi tool is used to construct the semantic graph more effectively. The performance of the approach is also compared with the traditional approach, taking memory consumption and time consumption as the standard parameters. The experimental results show better performance of the proposed text representation model in terms of time and space complexity.

Keywords: SVM, Semantic Graphs, POS Tagging, WordNet, Text Representation, Text Mining, Graph Model, Semantic Networks

I. INTRODUCTION

Advances in digital technology and the World Wide Web have led to an increase in digital documents that are used for various purposes such as publishing and digital libraries. This phenomenon raises awareness of the need for effective techniques that help with the search and retrieval of text. Nowadays, by using digital and computational techniques, we can store, manage and retrieve information automatically without any printed or hard copy of a document. In addition, automated text analysis or text mining plays an important role in various applications such as medical science, library management, social media and others. Typical tasks involved in these two areas include text classification, information extraction, document summarization, text pattern mining etc. [1].
Nowadays text is the most common form of storing information. The representation of a document is an important step in the text mining process. Hence, the challenging task is to find an appropriate representation of the textual information that is capable of capturing the semantic information of the text [2]. In this work, we developed a graph-based document model which leverages valuable knowledge about relations between entities. The work is thus intended to deliver a mechanism for constructing a semantic graph of text documents.

A. Semantic Graph

The data structure we focus on is the semantic graph. Semantic graphs are appropriate for representing semantic information, i.e., they carry semantic information on their nodes and edges. A semantic graph is a type of linkage between different objects where nodes represent objects (e.g., persons, papers, organizations, etc.) and links (or edges) represent binary relationships between those objects (e.g., friend, citation, authorship, etc.). A semantic graph is a powerful representation structure which can encode semantic relationships between different types of objects. The edge relation information tells us how two different object nodes are connected to each other and what that connection means. These graphs encode relationships as a typed link between a pair of typed nodes. These semantically structured graphs are also called relational data graphs or attributed relational graphs. Indeed, semantic graphs are very similar to the semantic networks and multi-relational networks (MRNs) used in artificial intelligence and knowledge representation [3].
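The description of a semantic graph as typed links between pairs of typed nodes can be illustrated with a small sketch. This is not the paper's implementation: the class name `SemanticGraph`, its methods, and the sample nodes are all hypothetical, chosen only to mirror the node types (person, paper) and relation types (friend, authorship) mentioned in the text.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Illustrative container: typed nodes joined by typed, directed edges."""
    node_types: dict = field(default_factory=dict)  # node name -> type label
    edges: dict = field(default_factory=lambda: defaultdict(list))  # source -> [(relation, target)]

    def add_node(self, name, node_type):
        self.node_types[name] = node_type

    def add_edge(self, source, relation, target):
        # Both endpoints must already exist as typed nodes.
        if source not in self.node_types or target not in self.node_types:
            raise KeyError("add both nodes before linking them")
        self.edges[source].append((relation, target))

    def neighbors(self, source, relation=None):
        """Targets linked from `source`, optionally filtered by relation type."""
        return [t for r, t in self.edges.get(source, []) if relation is None or r == relation]


# Hypothetical sample data, following the object and relation types named in the text.
graph = SemanticGraph()
graph.add_node("alice", "person")
graph.add_node("bob", "person")
graph.add_node("paper_1", "paper")
graph.add_edge("alice", "authorship", "paper_1")
graph.add_edge("alice", "friend", "bob")

print(graph.neighbors("alice", relation="authorship"))
```

Storing the relation label on each edge is what distinguishes this structure from a plain adjacency list: the same pair of nodes can be linked by several relations, and queries can filter by relation type.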

An Efficient Semantic Graph-Based Approach for Text Representation

2018

Text document representation is one of the main issues in text analysis areas such as topic extraction and text similarity. The standard Bag-of-Words representation does not deal with relationships between words. In order to overcome this limitation, we introduce a new approach based on the joint use of a co-occurrence graph and WordNet, a semantic network of the English language. To do this, a word sense disambiguation algorithm is used to establish semantic links between terms given the surrounding context. Experiments on standard datasets show good performance of the proposed approach. KEYWORDS: Text representation, WordNet, graph, word sense disambiguation, semantics.

Text Analysis Using Different Graph-Based Representations

Computación y Sistemas, 2018

This paper presents an overview of different graph-based representations proposed to solve text classification tasks. The core of this manuscript is to highlight the importance of enriched/non-enriched co-occurrence graphs as an alternative to traditional feature representation models like the vector representation, which most of the time cannot map all the richness of text documents that come from the web (social media, blogs, personal web pages, news, etc.). For each text classification task, the type of graph created as well as the benefits of using it are presented and discussed. Specifically, the type of features/patterns extracted, the implemented classification/similarity methods and the results obtained on datasets are explained. The theoretical and practical implications of using co-occurrence graphs are also discussed, pointing out the contributions and challenges of modeling text documents as graphs.
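A non-enriched co-occurrence graph of the kind this overview discusses can be sketched in a few lines. The function name `cooccurrence_graph`, the sliding-window definition of co-occurrence, and the sample sentence are assumptions made for illustration; the surveyed papers may define co-occurrence differently (for example per sentence) and may enrich nodes or edges with further labels.

```python
from collections import Counter


def cooccurrence_graph(tokens, window=2):
    """Count undirected co-occurrences of distinct tokens within `window` positions."""
    edges = Counter()
    for i, left in enumerate(tokens):
        # Pair each token with the next `window` tokens to its right.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Sort the pair so (a, b) and (b, a) share one undirected edge.
                edges[tuple(sorted((left, right)))] += 1
    return edges


tokens = "graphs model text and graphs enrich text".split()
graph = cooccurrence_graph(tokens, window=2)

for (a, b), weight in graph.most_common(3):
    print(a, b, weight)
```

Edge weights here are raw counts; unlike a bag-of-words vector, the result records which words appear near each other.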

Graph-based Semantical Extractive Text Analysis

arXiv (Cornell University), 2022

In the past few decades, there has been an explosion in the amount of available data produced from various sources on different topics. The availability of this enormous volume of data requires us to adopt effective computational tools to explore it. This has led to intense and growing interest in the research community in developing computational methods for processing text data. One line of study focuses on condensing text so that we can gain a high-level understanding in less time. The two important tasks for this are keyword extraction and text summarization. In keyword extraction, we are interested in finding the most important words in a text, which familiarizes us with its general topic. In text summarization, we are interested in producing a short text that includes the important information in the document. The TextRank algorithm, an unsupervised learning method that extends PageRank (the base algorithm the Google search engine uses for searching and ranking pages), has shown its efficacy in large-scale text mining, especially for text summarization and keyword extraction. This algorithm can automatically extract the important parts of a text (keywords or sentences) and return them as the result. However, it neglects the semantic similarity between the different parts. In this work, we improve the results of the TextRank algorithm by incorporating the semantic similarity between parts of the text. Aside from keyword extraction and text summarization, we develop a topic clustering algorithm based on our framework, which can be used on its own or as part of generating the summary to overcome coverage problems.
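For readers unfamiliar with the baseline, the following is a minimal sketch of plain TextRank keyword extraction: a PageRank-style iteration over an unweighted word co-occurrence graph. It does not include the semantic-similarity extension this abstract proposes. The function name, the window size, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on an undirected co-occurrence graph."""
    # Build the graph: words are linked if they appear within `window` positions.
    neighbors = defaultdict(set)
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                neighbors[left].add(right)
                neighbors[right].add(left)

    # Each word repeatedly receives an equal share of every neighbor's score.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)


tokens = "graph text graph mining graph ranking graph summary".split()
print(textrank_keywords(tokens, window=1)[0])
```

In practice the tokens would first be filtered (for example to nouns and adjectives) and the iteration would stop at a convergence threshold rather than after a fixed number of rounds.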

Towards graphical models for text processing

Knowledge and Information Systems, 2012

The rapid proliferation of the World Wide Web has increased the importance and prevalence of text as a medium for dissemination of information. A variety of text mining and management algorithms have been developed in recent years such as clustering, classification, indexing, and similarity search. Almost all these applications use the well-known vector-space model for text representation and analysis. While the vector-space model has proven itself to be an effective and efficient representation for mining purposes, it does not preserve information about the ordering of the words in the representation. In this paper, we will introduce the concept of distance graph representations of text data. Such representations preserve information about the relative ordering and distance between the words in the graphs and provide a much richer representation in terms of sentence structure of the underlying data. Recent advances in graph mining and hardware capabilities of modern computers enable us to process more complex representations of text. We will see that such an approach has clear advantages from a qualitative perspective. This approach enables knowledge discovery from text which is not possible with the use of a pure vector-space representation, because it loses much less information about the ordering of the underlying words. Furthermore, this representation does not require the development of new mining and management techniques. This is because the technique can also be converted into a structural version of the vector-space representation, which allows the use of all existing tools for text. In addition, existing techniques for graph and XML data can be directly leveraged with this new representation. Thus, a much wider spectrum of algorithms is available for processing this representation. We will apply this technique to a variety of mining and management applications and show its advantages and richness in exploring the structure of the underlying text documents.
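The idea of a graph that preserves word ordering and distance can be made concrete with a short sketch. The abstract does not spell out the construction, so the definition used here is an assumption: a directed edge (a, b) counts how often word a precedes word b by at most `order` positions, with order 0 reducing to self-loops that carry plain word frequencies. The function name `distance_graph` is hypothetical.

```python
from collections import Counter


def distance_graph(tokens, order=1):
    """Directed edge (a, b) counts how often `a` precedes `b` by at most `order` positions.

    Distance 0 is included, so every occurrence of a word adds to its self-loop;
    with order=0 the graph holds only self-loops, i.e. word frequencies.
    """
    edges = Counter()
    for i, source in enumerate(tokens):
        for target in tokens[i : i + order + 1]:
            edges[(source, target)] += 1
    return edges


first = "dog bites man".split()
second = "man bites dog".split()

# Order 0 cannot tell the two sentences apart, just like a bag of words.
print(distance_graph(first, order=0) == distance_graph(second, order=0))
# Order 1 keeps the direction of each adjacent pair, so the graphs differ.
print(distance_graph(first, order=1) == distance_graph(second, order=1))
```

Because the edges can be flattened into (source, target) features with counts, such a graph can also be fed to tools that expect a vector-space input, which is in the spirit of the conversion the abstract mentions.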

Text classification using graph mining-based feature extraction

2010

A graph-based approach to document classification is described in this paper. The graph representation offers the advantage that it allows for a much more expressive document encoding than the more standard bag of words/phrases approach, and consequently gives an improved classification accuracy.

Text Classification of English News Articles using Graph Mining Techniques

Proceedings of the 14th International Conference on Agents and Artificial Intelligence, 2022

Several techniques, such as text classification, can be used in natural language processing systems to understand text documents. Text classification is considered a classical problem with several purposes, ranging from automated text classification to sentiment analysis. This research considers a graph mining technique for the text classification of English news articles. In the proposed model, every text is represented by a graph that encodes relations among its words. A word's significance to a text is represented by the graph-theoretical degree of the corresponding vertex. The proposed weighting scheme can capture the links between words that co-occur in a text, producing feature vectors that can enhance the classification of English news articles. Experiments were conducted by applying the proposed classification algorithms to well-known text datasets. The findings suggest that the proposed graph mining technique for text classification is as accurate as other techniques when appropriate parameters are used.
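Degree-based term weighting, as described in this abstract, can be sketched as follows. The details are assumptions made for illustration, not the paper's scheme: the graph is an undirected co-occurrence graph built with a sliding window, the weight of a word is simply its vertex degree, and the names `degree_weights` and `feature_vector`, the sample headline and the vocabulary are hypothetical.

```python
def degree_weights(tokens, window=2):
    """Weight each word by its degree in an undirected co-occurrence graph."""
    neighbors = {}
    for i, left in enumerate(tokens):
        neighbors.setdefault(left, set())
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                neighbors[left].add(right)
                neighbors.setdefault(right, set()).add(left)
    # Degree = number of distinct words a word co-occurs with.
    return {word: len(adjacent) for word, adjacent in neighbors.items()}


def feature_vector(tokens, vocabulary, window=2):
    """Project the degree weights onto a fixed vocabulary; absent terms get 0."""
    weights = degree_weights(tokens, window)
    return [weights.get(term, 0) for term in vocabulary]


headline = "markets fall as oil prices fall".split()
vocabulary = ["fall", "oil", "sports"]
print(feature_vector(headline, vocabulary, window=1))
```

The resulting vectors can be passed to any standard classifier in place of term-frequency vectors; a word that co-occurs with many different words receives a higher weight than one that merely repeats.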

Automatic Linguistic Pattern Identification Based on Graph Text Representation

Research in Computing Science

This paper presents a model of text representation based on graphs. The model is applied to the particular case study of authorship attribution. The experiments were performed using a corpus made up of 500 documents written by 10 different authors (50 documents per author). The results highlight the benefit of using text features at different levels of language description in tasks associated with the automatic processing of information. In particular, we obtained an accuracy of 57% for the authorship attribution task.

Using shallow semantic analysis and graph modelling for document classification

International Journal of Data Mining, Modelling and Management, 2013

Using a graph-based approach driven by shallow semantic analysis to model text content makes it possible to extract additional information about the meaning of the text. This paper discusses two novel algorithms that are based on this idea. They are compared against the 'legacy' bag-of-words and Schenker et al. approaches in an NN document classification task.

Text Classification Using Graph-Encoded Linguistic Elements

Proceedings of the Eighteenth International Florida Artificial Intelligence Research Society Conference (FLAIRS-2005), 2005

Inspired by the goal of classifying text more accurately, we describe an effort to map tokens and their characteristic linguistic elements into a graph and use that expressive representation to classify text phrases. We outperform the bag-of-words approach by exploiting word order and the semantic and syntactic characteristics within the phrases. In this study, we map tagged corpora into a placeholder graph structure and classify the phrases within, using the cross-dimensional linguistic characteristics of each token. Finally, we present heuristics for use in applying this method to other corpora.