Automatic Text Ontological Representation and Classification via Fundamental to Specific Conceptual Elements (TOR-FUSE)

Semantic Ontology-Based Approach to Enhance Text Classification

2021

Text classification is the process of assigning pre-defined classes to free text. It has been one of the most researched areas in machine learning, with applications such as sentiment analysis, topic labeling, language detection and spam filtering. The efficiency of text classification improves when some relation or pattern in the data is given or known, which an ontology can provide. It further helps in reducing the size of the dataset. An ontology is a collection of data items that helps in storing and representing data in a way that preserves the patterns in it and the semantic relationships among its items. We have attempted to verify the improvement provided by the use of ontology in classification algorithms. The code prepared in this research and the method developed are fairly generic and could be extended to any ontology-based text classification system. In this paper, we present an enhanced architecture that can use ontology to provide an effective tex...

Impact of an ontology for automatic text classification

Annals of Library and Information Studies, 2015

The concept of ontologies has been widely used in various applications, including email filtering and electronic news classification. It can also be used for the classification of digital documents in a library. Improving the accuracy of classification is the main purpose of using ontologies for classification. Documents may be difficult to understand due to the vague terms used in the text. However, since ontologies represent the semantic relationships of the terms, they can be used to correctly identify the subject of a document. This study attempted to improve the classification accuracy of an automatic text classification system by using an ontology. Classification results given by the automatic system with and without the ontology integrated were used to evaluate the impact of the ontology on automatic classification. Results showed that 32.76% more documents and 25% more subjects were correctly classified by the ontology-based system than by the system prior to the use of the ontology.

Ontology-driven Conceptual Document Classification

Document classification based on the lexical-semantic network WordNet is presented. Two types of document classification in Serbian have been experimented with: classification based on chosen concepts from the Serbian WordNet (SWN) and proper names-based classification. Conceptual document classification criteria are constructed from hierarchies rooted in a set of chosen concepts (first case) or from hierarchies rooted in some of the proper names' hypernyms (second case). A classifier of the first type is trained and then tested on an indexed and already classified Ebart corpus of Serbian newspapers (476917 articles). Precision, recall and F-measure show that this type of classification is promising although incomplete, due mainly to SWN incompleteness. In the context of proper names-based classification, a proper names ontology based on the SWN is presented in the paper. A distance-based similarity measure is defined, based on Euclidean and Manhattan distances. Classification of a subset of the Contemporary Serbian Language Corpus is presented.

Semantic text classification: A survey of past and recent advances

Information Processing and Management, 2018

Automatic text classification is the task of organizing documents into predetermined classes, generally using machine learning algorithms. It is one of the most important methods to organize and make use of the gigantic amounts of information that exist in unstructured textual format. Text classification is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words where the words (in other words, terms) are cut from their finer context, i.e. their location in a sentence or in a document. Only the broader context of the document is used, with some type of term frequency information in the vector space. Consequently, the semantics of words that can be inferred from the finer context of their location in a sentence and their relations with neighboring words are usually ignored. However, the meaning of words and the semantic connections between words, documents and even classes are obviously important, since methods that capture semantics generally reach better classification performance. Several surveys have been published to analyze diverse approaches to the traditional text classification methods. Most of these surveys cover the application of different semantic term relatedness methods in text classification up to a certain degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. In order to fill this gap, we undertake a comprehensive discussion of semantic text classification vs. traditional text classification. This survey explores the past and recent advancements in semantic text classification and attempts to organize existing approaches under five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning based approaches, word/character sequence enhanced approaches and linguistic enriched approaches. Furthermore, this survey highlights the advantages of semantic text classification algorithms over the traditional text classification algorithms.
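
The loss of finer context in the bag-of-words representation described above can be illustrated with a minimal sketch. The whitespace tokenizer and the two example sentences are assumptions made for illustration; they do not come from the survey.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count terms; word order is discarded."""
    return Counter(text.lower().split())


# Two sentences with opposite meanings but identical term counts.
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)
print(same_bag)  # True: the representation cannot tell the sentences apart
```

Because both sentences map to the same counts, any classifier that sees only these vectors has to treat them identically, which is the gap that semantic approaches try to close.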

A novel approach for ontology based dimensionality reduction for web text document classification

2017

Dimensionality reduction of the feature vector plays a vital role in enhancing text processing capabilities; it aims to reduce the size of the feature vector used in mining tasks (classification, clustering, etc.). This paper proposes an efficient approach to reducing the size of the feature vector for the web text document classification process. The approach is based on the WordNet ontology and uses its hierarchical structure to eliminate words from the generated feature vector that have no relation with any of the WordNet lexical categories; this reduces the feature vector size without losing information on the text. For mining tasks, the Vector Space Model (VSM) is used to represent text documents and Term Frequency Inverse Document Frequency (TFIDF) is used as the term weighting method. The proposed ontology-based approach was evaluated against the Principal Component Analysis (PCA) approach in several experiments. The experimental results reveal the effectiveness of our proposed approach against other traditional approaches in achieving better classification accuracy, F-measure, precision, and recall.
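
The general idea of pruning the feature vector with a lexical resource before TFIDF weighting can be sketched as follows. This is only an illustration under stated assumptions: `TOY_LEXICON` is a hypothetical stand-in for a WordNet lookup, the documents are invented, and the weighting is the plain tf × log(N/df) form; it is not the paper's actual procedure.

```python
import math
from collections import Counter

# Hypothetical stand-in for "words that relate to a WordNet lexical category".
TOY_LEXICON = {"market", "stock", "price", "bank", "rise"}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, kept deliberately simple)."""
    return text.lower().split()


def filter_by_lexicon(tokens, lexicon):
    """Drop every token that the lexicon does not know."""
    return [t for t in tokens if t in lexicon]


def tfidf_vectors(docs_tokens):
    """Return (vocabulary, vectors) using weight = tf * log(N / df)."""
    n_docs = len(docs_tokens)
    vocab = sorted({t for doc in docs_tokens for t in doc})
    doc_freq = Counter(t for doc in docs_tokens for t in set(doc))
    vectors = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        vectors.append([term_freq[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, vectors


docs = ["stock price rise xyzzy", "bank stock qwerty market"]
raw = [tokenize(d) for d in docs]
filtered = [filter_by_lexicon(d, TOY_LEXICON) for d in raw]

vocab_raw, _ = tfidf_vectors(raw)
vocab_small, vectors_small = tfidf_vectors(filtered)
print(len(vocab_raw), len(vocab_small))  # feature count before and after filtering
```

In this toy run the unknown tokens are removed and the vocabulary shrinks, while the weights of the remaining terms are computed exactly as before. A real system would replace the set lookup with a query against WordNet.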

Towards Ontology-Based web text Document Classification

Journal of Engineering Science and Military Technologies

Data on the web is generally stored in structured, semi-structured and unstructured formats; according to the survey, most of an organization's information is stored in unstructured textual form. Consequently, categorizing this huge number of unstructured web text documents has become one of the most important tasks when dealing with the web. Categorization (classification) of web text documents aims to assign one or more class labels (categories) to unlabeled documents; the assignment depends mainly on the contents of the document itself, with the help of one or more machine learning techniques. Different learning algorithms have been applied to the content of text documents for the classification process. The experiments in this paper use a subset of the Reuters-21578 dataset to highlight the weaknesses and limitations of traditional techniques for feature generation and dimensionality reduction, showing the classification accuracy and F-measure obtained when applying different classification algorithms.

Lexical Ontology Layer – A Bridge between Text and Concepts

Lecture Notes in Computer Science, 2012

Intelligent methods for automatic text processing require linking between lexical resources (texts) and ontologies that define semantics. However, one of the main problems is that, when ontologies are built, most of the effort goes into constructing the conceptual part, whereas the lexical aspects of ontologies are usually neglected. Therefore, when analyzing texts, it is usually difficult to map words to concepts from the ontology. One usually has to consider various linguistic relationships, such as homonymy, synonymy, etc. However, they are not clearly reflected in the conceptual part. We propose LEXO, a special lexical layer intended as a bridge between text and the conceptual core of the ontology. LEXO is dedicated to storing linguistic relationships along with textual evidence for the relationships (as discovered in the text mining process). In addition, we present an algorithm based on LEXO for determining the meaning of a given term in an analyzed text.

Document classification utilising ontologies and relations between documents

Proceedings of the Eighth Workshop on Mining and Learning with Graphs - MLG '10, 2010

Two major types of relational information can be utilized in automatic document classification as background information: relations between terms, such as ontologies, and relations between documents, such as web links or citations in articles. We introduce a model where a traditional bag-of-words type classifier is gradually extended to utilize both of these information types. The experiments with data from the Finnish National Archive show that classification accuracy improves from 70% to 74% when the General Finnish Ontology YSO is used as background information, without using relations between documents.

TEXT CLASSIFICATION USING SEMANTIC NETWORKS

2011

In the age of information overflow, we face the challenge of categorizing the digital information we come across on a daily basis, in order to apply different operations and priorities to different types of information and to use it more efficiently. This gives rise to the challenge of automatic text classification. The problem of text classification can be defined as assigning one or more categories to a certain text, based on its contents. There are many different approaches to solving this problem, including latent semantic analysis (LSA), statistical text analysis, etc.

Role of semantic indexing for text classification

The Vector Space Model (VSM) of text representation suffers from a number of limitations for text classification. Firstly, the VSM is based on the Bag-Of-Words (BOW) assumption where terms from the indexing vocabulary are treated independently of one another. However, the expressiveness of natural language means that lexically different terms often have related or even identical meanings. Thus, failure to take into account the semantic relatedness between terms means that document similarity is not properly captured in the VSM. To address this problem, semantic indexing approaches have been proposed for modelling the semantic relatedness between terms in document representations. Accordingly, in this thesis, we empirically review the impact of semantic indexing on text classification. This empirical review allows us to answer one important question: how beneficial is semantic indexing to text classification performance? We also carry out a detailed analysis of the semantic indexing process which allows us to identify reasons why semantic indexing may lead to poor text classification performance. Based on our findings, we propose a semantic indexing framework called Relevance Weighted Semantic Indexing (RWSI) that addresses the limitations identified in our analysis. RWSI uses relevance weights of terms to improve the semantic indexing of documents. A second problem with the VSM is the lack of supervision in the process of creating document representations. This arises from the fact that the VSM was originally designed for unsupervised document retrieval. An important feature of effective document representations is the ability to discriminate between relevant and non-relevant documents. For text classification, relevance information is explicitly available in the form of document class labels. Thus, more effective document vectors can be derived in a supervised manner by taking advantage of available class knowledge. 
Accordingly, we investigate approaches for utilising class knowledge for supervised indexing of documents. Firstly, we demonstrate how the RWSI framework can be utilised for assigning supervised weights to terms for supervised document indexing. Secondly, we present an approach called Supervised Sub-Spacing (S3) for supervised semantic indexing of documents. A further limitation of the standard VSM is that an indexing vocabulary that consists only of terms from the document collection is used for document representation. This is based on the assumption that terms alone are sufficient to model the meaning of text documents. However, for certain classification tasks, terms are insufficient to adequately model the semantics needed for accurate document classification. A solution is to index documents using semantically rich concepts. Accordingly, we present an event extraction framework called Rule-Based Event Extractor (RUBEE) for identifying and utilising event information for concept-based indexing of incident reports. We also demonstrate how certain attributes of these events, e.g. negation, can be taken into consideration to distinguish between documents that describe the occurrence of an event and those that mention the non-occurrence of that event.
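
The first limitation described in this abstract, that lexically different but related terms contribute nothing to document similarity under the BOW assumption, can be made concrete with a small sketch. It uses a generic term-term relatedness matrix to smooth document vectors; the vocabulary and the relatedness values are hypothetical, and this is not the RWSI or S3 method from the thesis.

```python
import math

VOCAB = ["car", "automobile", "banana"]

# Hypothetical term-term relatedness (1.0 on the diagonal, symmetric).
RELATEDNESS = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_index(vec, relatedness):
    """Spread each term's weight onto related terms: d' = d . R."""
    size = len(vec)
    return [sum(vec[i] * relatedness[i][j] for i in range(size)) for j in range(size)]


doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"

plain = cosine(doc_a, doc_b)
enriched = cosine(semantic_index(doc_a, RELATEDNESS), semantic_index(doc_b, RELATEDNESS))
print(plain, round(enriched, 3))
```

With plain term vectors the two documents share no dimension, so their similarity is zero; after smoothing with the relatedness matrix they become close. Where the relatedness values come from (an ontology, corpus statistics, or class labels) is the design choice that distinguishes the approaches discussed above.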