A network approach to dimensionality reduction in Text Mining

Dimensionality Reduction Techniques for Text Mining

Collaborative Filtering Using Data Mining and Analysis

Sentiment analysis is an emerging field concerned with the analysis and understanding of human emotions expressed in text. It is the process of determining the attitude, opinion, or emotion a person expresses about a specific topic, based on natural language processing. The proliferation of social media such as blogs, Twitter, Facebook, and LinkedIn has fuelled interest in sentiment analysis. Since real-time data is dynamic, the main focus of the chapter is to extract different categories of features and to analyze which category of attribute performs best. The primary investigation is classifying documents into positive and negative categories with a low misclassification rate. The feature selection approaches employed include TF-IDF, WET, Chi-Square, and mRMR, evaluated on benchmark datasets spanning diverse domains.

A Dimensionality Reduction Approach for Semantic Document Classification

2011

The curse of dimensionality is a well-recognized problem in the field of document filtering. In particular, this concerns methods where vector space models are used to describe the document-concept space. When performing content classification across a variety of topics, the number of distinct concepts (dimensions) rapidly explodes, rendering many techniques inapplicable. Furthermore, the extent of information represented by each concept may vary significantly. In this paper, we present a dimensionality reduction approach which approximates the user's preferences in the form of a value function and leads to a quick and efficient filtering procedure. The proposed system requires the user to provide preference information in the form of a training set in order to generate a search rule. Each document in the training set is profiled into a vector of concepts. The profiling is accomplished by using Wikipedia articles to define the semantic information contained in words, which allows them to be treated as concepts. Once the set of concepts contained in the training set is known, a modified Wilks' lambda approach is used for dimensionality reduction while ensuring minimal loss of semantic information.
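The paper uses a modified Wilks' lambda; the sketch below shows only the classical, per-feature (univariate) statistic, the ratio of within-class to total sum of squares, on synthetic stand-in data. Small values indicate concepts that separate the classes well, so keeping the smallest-lambda concepts is one simple reduction rule.

```python
import numpy as np

# Per-feature Wilks' lambda: SS_within / SS_total, in (0, 1].
# Synthetic stand-in for document-concept profiles: only feature 0
# differs between the two classes (its class means are 3 sigma apart).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 3)),               # class 0
    rng.normal([3.0, 0.0, 0.0], 1.0, size=(20, 3)),   # class 1
])
y = np.array([0] * 20 + [1] * 20)

def wilks_lambda(X, y):
    overall = X.mean(axis=0)
    ss_total = ((X - overall) ** 2).sum(axis=0)
    ss_within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return ss_within / ss_total  # one lambda per feature

lam = wilks_lambda(X, y)
keep = np.argsort(lam)[:1]  # indices of the most discriminative concept(s)
```

Because SS_total = SS_within + SS_between, lambda is always in (0, 1]: noise features sit near 1, discriminative features near 0.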

Text Mining: Approaches and Applications

The field of text mining seeks to extract useful information from unstructured textual data through the identification and exploration of interesting patterns. The techniques employed usually do not involve deep linguistic analysis or parsing, but rely on simple "bag-of-words" text representations based on vector space. Several approaches to the identification of patterns are discussed, including dimensionality reduction, automated classification, and clustering. Pattern exploration is illustrated through two applications from our recent work: a classification-based Web meta-search engine and visualization of coauthorship relationships automatically extracted from a semi-structured collection of documents describing researchers in the region of Vojvodina. Finally, preliminary results concerning the application of dimensionality reduction techniques to problems in sentiment classification are presented.

A novel approach for ontology based dimensionality reduction for web text document classification

2017

Dimensionality reduction of the feature vector plays a vital role in enhancing text processing capabilities; it aims at reducing the size of the feature vector used in mining tasks (classification, clustering, etc.). This paper proposes an efficient approach for reducing the size of the feature vector in web text document classification. The approach is based on the WordNet ontology, exploiting its hierarchical structure to eliminate from the generated feature vector any words that have no relation to any of WordNet's lexical categories; this reduces the feature vector size without losing information about the text. For the mining tasks, the Vector Space Model (VSM) is used to represent text documents and Term Frequency Inverse Document Frequency (TF-IDF) is used as the term weighting method. The proposed ontology-based approach was evaluated against Principal Component Analysis (PCA) in several experiments. The experimental results reveal the effectiveness of the proposed approach over traditional approaches in achieving better classification accuracy, F-measure, precision, and recall.
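The pruning idea, dropping any term whose lexical category is outside an allowed set, can be sketched as below. The `lexname` dictionary is a hand-made stand-in for a real WordNet lookup (with NLTK one would query a synset's `lexname()`), used only to keep the example self-contained; the terms and allowed categories are invented.

```python
# Ontology-based pruning sketch: keep only terms whose WordNet lexical
# category is in an allowed set. `lexname` is a toy stand-in for a real
# WordNet lookup (e.g. nltk wordnet synset.lexname()).
lexname = {
    "dog": "noun.animal",
    "run": "verb.motion",
    "house": "noun.artifact",
    "happy": "adj.all",
}
allowed_categories = {"noun.animal", "noun.artifact", "verb.motion"}

def prune(feature_vector):
    """Drop terms with no (allowed) lexical category."""
    return [t for t in feature_vector
            if lexname.get(t) in allowed_categories]

features = ["dog", "zxqw", "house", "happy", "run"]
reduced = prune(features)
```

Note that both out-of-vocabulary tokens ("zxqw") and terms in disallowed categories ("happy") are removed, which is the source of the size reduction the paper describes.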

ESSENTIAL ANALYSIS ON MACHINE LEARNING BASED TEXT MINING

IAEME PUBLICATION, 2021

Over the years, large amounts of text data have been generated by different web sources, both online and offline. Much of this data is in an incoherent, unstructured format, and is therefore hard to process with available computing machinery. A large number of unknown objects can be examined with an unsupervised classification method. Text categorization involves learning methodology applied in areas such as language identification, information retrieval, opinion mining, spam filtering, e-mail routing, etc. Text categorization can also be considered a mechanism for labeling documents drawn from a natural corpus. Text classification by the various mechanisms of machine learning faces the challenge of the feature vector's high dimensionality. The latent semantic indexing method can address this problem by replacing individual words with statistically derived conceptual indices. We propose a two-stage feature selection method with the aim of improving the accuracy and efficiency of categorization. First, we apply a new selection method to reduce the dimension of the terms, and then build a new semantic space between terms based on latent semantic indexing. We find that our two-stage feature selection method works better for certain applications involving the categorization of the spam database.
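The two-stage shape (term-level selection, then a latent semantic space) can be sketched with scikit-learn. The selection stage here is plain chi-square rather than the paper's new method, and the tiny spam/ham corpus is invented:

```python
# Two-stage sketch: (1) term selection shrinks the vocabulary,
# (2) LSI via truncated SVD builds a small latent semantic space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = [
    "win free prize money now",
    "meeting agenda attached for review",
    "free money claim your prize",
    "please review the attached report",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=6),        # stage 1: keep 6 informative terms
    TruncatedSVD(n_components=2),  # stage 2: 2-dimensional LSI space
)
Z = pipeline.fit_transform(docs, labels)  # documents in the latent space
```

Doing the selection first is what makes the SVD cheap: the decomposition runs on the pruned vocabulary rather than the full term space.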

Document Representation and Dimension Reduction for Text Clustering

2007 IEEE 23rd International Conference on Data Engineering Workshop, 2007

Increasingly large text datasets and the high dimensionality associated with natural language create a great challenge in text mining. In this research, a systematic study is conducted in which three different document representation methods are used together with three dimension reduction techniques (DRTs), in the context of the text clustering problem. Several standard benchmark datasets are used. The three document representation methods considered are based on the vector space model: word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on document frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For word and N-gram representations, ICA generally gives better results than LSI. Experiments also show that the word representation gives better clustering results than the term and N-gram representations. Finally, for the N-gram representation, it is demonstrated that a profile length (before dimensionality reduction) of 2000 is sufficient to capture the information and, in most cases, a 4-gram representation gives better performance than a 3-gram representation.
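One cell of the study's grid (word-level vector space model, LSI reduction, then k-means) can be sketched with scikit-learn; the four-document, two-topic corpus is invented and far smaller than the benchmark datasets used in the paper:

```python
# Word representation -> TF-IDF -> LSI (truncated SVD) -> k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "stock market trading shares",
    "market shares stock prices",
    "football match goal score",
    "goal score football league",
]

X = TfidfVectorizer().fit_transform(docs)
X_lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_lsi)
labels = km.labels_  # the finance and sports documents form two clusters
```

Swapping `TruncatedSVD` for `FastICA`, or for a document-frequency threshold on the vocabulary, reproduces the other reduction arms of the comparison.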

Erasmus Mundus Masters Program in Language and Communication Technology: A graph model for text analysis and text mining

Automated text analysis and text mining methods have received a great deal of attention because of the remarkable increase in digital documents. Typical tasks in these two areas include text classification, information extraction, document summarization, text pattern mining, etc. Most of them are based on text representation models which are used to represent text content. The traditional text representation method, the Vector Space Model, has several noticeable weaknesses with respect to capturing text structure and the semantic information of text content. Recently, graph-based models have emerged as alternatives to the Vector Space Model for text representation. However, it is still difficult to incorporate semantic information into these graph-based models. In this thesis, we propose the FrameNet-based Graph Model for Text (FGMT), a new graph model that captures structural and shallow semantic information of text by using the FrameNet resource. Moreover, we introduce a hybrid model based on FGMT which is better adapted to text classification. The experimental results show a significant improvement in classification using our models versus a typical Vector Space Model.

Lifting the Curse: Exploring Dimensionality Reduction on Text Clustering Applications

2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA), 2022

Nowadays, huge amounts of text are being generated on the Web by a vast number of applications. Examples of such applications include instant messengers, social networks, e-mail clients, news portals, blog communities, commercial platforms, and so forth. The requirement for effectively identifying documents of similar content in these services has rendered text clustering one of the most prominent problems of the machine learning discipline. Nevertheless, the high dimensionality and natural sparseness of text introduce significant challenges that threaten the feasibility of even the most successful algorithms. Consequently, the role of dimensionality reduction techniques becomes crucial for this particular problem. Motivated by these challenges, in this article we investigate the impact of dimensionality reduction on the performance of text clustering algorithms. More specifically, we experimentally analyze its effects on the effectiveness and running times of eight clustering algorithms using six high-dimensional text datasets. The results indicate that, in most cases, dimensionality reduction can significantly improve algorithm execution times while sacrificing only small amounts of clustering quality.
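The trade-off the article measures can be sketched as follows, assuming scikit-learn: cluster once on a raw high-dimensional matrix, once after SVD reduction, and compare wall time. The random matrix below is only a stand-in for a real document-term matrix, so the timings are indicative, not results.

```python
# Timing sketch: k-means on raw vs. SVD-reduced data. The random matrix
# mimics a sparse document-term matrix (about 5% nonzero entries).
import time
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((300, 2000)) * (rng.random((300, 2000)) < 0.05)

t0 = time.perf_counter()
labels_raw = KMeans(n_clusters=5, n_init=5, random_state=0).fit_predict(X)
t_raw = time.perf_counter() - t0

X_red = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)
t0 = time.perf_counter()
labels_red = KMeans(n_clusters=5, n_init=5, random_state=0).fit_predict(X_red)
t_red = time.perf_counter() - t0
```

On realistic corpora the reduced run is typically much faster, since k-means cost scales with the number of dimensions; the quality comparison (the other half of the study) would then be made between `labels_raw` and `labels_red` against ground-truth categories.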

A novel text mining approach using a bipartite graph projected onto two dimensions

Trap@NCI, 2018

The collection of text data is exploding. From blogs to news reports to helpdesk tickets, there seems to be a never-ending supply of writing. The owners of these data seek methods to group texts and look for clusters of topics. Because of the size of the data, solutions that scale on clustered computing platforms are ideal. The traditional term-vector approach can suffer from the curse of dimensionality. Simple solutions are preferable to complex ones, because it is often necessary to explain the model to business users or even regulators. This paper demonstrates a method of keyword mining using the graph-of-words technique, and classification by projecting the bipartite graph of terms and documents onto two dimensions. The method can be scaled using a cluster computing technology such as Apache Spark, and the results are easily surfaced to users.
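One simple way to realize such a projection (a sketch of the general idea, not the paper's Spark pipeline or its graph-of-words keyword mining) is a rank-2 SVD of the term-document biadjacency matrix, which places terms and documents in the same two-dimensional space. The tiny count matrix below is invented:

```python
import numpy as np

# Project a term-document bipartite graph onto two dimensions via
# rank-2 SVD of its biadjacency (term x document) count matrix.
terms = ["price", "market", "goal", "match"]
# columns: doc0, doc1 (finance), doc2, doc3 (sports)
A = np.array([
    [2, 1, 0, 0],   # price
    [1, 2, 0, 0],   # market
    [0, 0, 3, 1],   # goal
    [0, 0, 1, 3],   # match
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_xy = U[:, :2] * s[:2]   # 2-D coordinates for terms
doc_xy = Vt[:2].T * s[:2]    # 2-D coordinates for documents
```

Terms and documents from the same topic land near each other in the plane, which is what makes the 2-D picture easy to surface to non-technical users; at scale the same decomposition can be computed with Spark MLlib's distributed SVD.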