Text Classification Research Papers - Academia.edu

- Machine Learning, Quality Control, Feature Selection, OPERATING SYSTEM
- by Machteld Van Den Bogaerd
- •
- Artificial Intelligence, Machine Learning, Data Mining, Content Analysis
- by Robert Mckay
- •
- Ontology, Principal Component Analysis, Comparative Study, Feature Selection

Textual analysis has become one of the most important tasks owing to the rapid increase in the number of texts continuously generated in forms such as social media posts and chats, emails, articles, and news. Managing these texts requires efficient and effective methods that can handle the linguistic issues arising from the complexity of natural languages. In recent years, researchers have widely investigated the exploitation of semantic features from lexical sources to deal with the issues of “synonymy and ambiguity” in social media tasks such as document clustering. The main challenges of exploiting lexical knowledge sources such as WordNet 3.1 in these tasks are how to integrate the various types of semantic relations to capture additional semantic evidence, and how to address the high dimensionality of current semantic representation approaches. This paper proposes a feature weighting scheme for a new semantic feature-based method that combines four relation types: “Synonymy, Hypernym, non-taxonomy, and Glosses”. On this basis, the research proposes a new knowledge-based semantic representation approach for text mining that can handle both the linguistic issues and the high dimensionality issue. The proposed approach consists of two main components: a feature-based method for incorporating the relations in the lexical sources, and a topic-based reduction method to overcome the high dimensionality issue. The proposed approach will be evaluated using WordNet 3.1 in text clustering and text classification.

- by Ali M. Hasan
- •
- Data Mining, Sentiment Analysis, Software Components, Text Mining
- by Franca Debole
- •
- Machine Learning, Text Classification, Supervised Learning, Text Categorization

Most research in text classification to date has used a bag of words representation in which each feature corresponds to a single word. This paper examines some alternative ways to represent text based on syntactic and semantic...

- by Sam Scott
- •
- Natural Language Processing, Representations, Learning, Text Classification

The 2013 curriculum is a new curriculum in the Indonesian education system, enacted by the government to replace the KTSP curriculum. Its implementation over the last few years has sparked various opinions among students, teachers, and the general public, especially on Twitter. In this study, a sentiment analysis of the 2013 curriculum is conducted. An ensemble of several feature sets was used, including textual features, Twitter-specific features, lexicon-based features, Part-of-Speech (POS) features, and Bag of Words (BOW) features, for sentiment classification with the K-Nearest Neighbor method. The experimental results showed that the ensemble features give better sentiment classification performance than any individual feature set. The best accuracy using ensemble features is 96%, obtained with k=5.

Text classification is used to group documents of similar types. It can be performed under supervision, i.e. it is a supervised learning technique. Text classification is a process in which documents are sorted automatically into different classes drawn from a predefined set. The main issue is that large volumes of information lack organization, which makes them difficult to manage. Text classification is identified as one of the key methods for recognizing such types of digital information. It has various applications, such as information retrieval, natural language processing, automatic indexing, text filtering, and image processing. It is also used to process big data and to predict the class labels of newly added data, and it is used in academia and industry to classify unstructured data. There are various text classification approaches, such as decision trees, SVM, and Naïve Bayes. In this survey paper, we analyse these text classification techniques. Each has its own set of advantages, which makes it suitable for many classification jobs. We also analyse evaluation parameters such as F-measure, G-measure, and accuracy used in various research works.

- by IJFRCSCE Journal
- •
- Artificial Intelligence, Machine Learning, Data Mining, Neural Network

With the growing volume of electronic documents used in many applications, a fast and accurate text classification method is very important. Arabic text classification is one of the most challenging topics, probably because Arabic words have unlimited variation in meaning, in addition to problems that are specific to the Arabic language. Many studies have shown that the Naive Bayes (NB) classifier is relatively robust, easy to implement, fast, and accurate in many different fields, including text classification. However, non-linear classification problems and strong violations of the independence assumption can lead to very poor performance of the NB classifier. In this paper, first, we pre-process the Arabic documents to tokenize only the Arabic words. Second, we convert those words into vectors using the term frequency-inverse document frequency (TF-IDF) technique. Third, we propose an efficient approach based on the Kernel Naive Bayes (KNB) classifier to solve the non-linearity problem of Arabic text classification. Finally, experimental results and a performance evaluation on our collected Arabic topic mining corpus are presented, showing the effectiveness of the proposed KNB classifier against other baseline classifiers.
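The TF-IDF step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the weighting variant (length-normalized term frequency times an unsmoothed logarithmic inverse document frequency), the function name `tf_idf`, and the toy English tokens are all assumptions made for illustration, and the Kernel Naive Bayes classifier itself is not shown.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical, already tokenized documents.
docs = [
    ["economy", "market", "market"],
    ["sport", "match", "market"],
    ["sport", "goal"],
]
vectors = tf_idf(docs)
```

A term that occurs in every document gets weight zero under this variant, which is why smoothed forms of the inverse document frequency are often preferred in practice.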

- by Paulo Quaresma
- •
- Information Retrieval, Artificial Intelligence, Portuguese, Statistical Analysis

'El Diario de Juárez' is a local newspaper in a city of 1.5 million Spanish-speaking inhabitants; citizens read its texts on both a website and an RSS (Really Simple Syndication) service. This research applies natural-language-processing and machine-learning algorithms to the news items provided by the RSS service in order to classify them according to whether or not they are about a traffic incident, with the final intention of notifying citizens where such accidents occur. The classification process explores the bag-of-words technique with five learners (Classification and Regression Tree (CART), Naïve Bayes, kNN, Random Forest, and Support Vector Machine (SVM)) on a class-imbalanced benchmark; this challenging issue is dealt with via five sampling algorithms: synthetic minority oversampling technique (SMOTE), borderline SMOTE, adaptive synthetic sampling, random oversampling, and random undersampling. Our final classifier reaches a sensitivity of 0.86 and an area under the precision-recall curve of 0.86, which is an acceptable performance considering the complexity of analyzing unstructured texts in Spanish.
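Of the five sampling algorithms listed in this abstract, random oversampling is the simplest to show. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `random_oversample`, the fixed seed, and the toy headlines and labels are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw with replacement to fill the gap to the majority size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical headlines: one traffic item against three others.
texts = ["crash on main avenue", "city council meets",
         "new park opens", "festival this weekend"]
labels = ["traffic", "other", "other", "other"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
```

Resampling of this kind is normally applied to the training split only, so that duplicated items do not leak into the evaluation data.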

In recent years, considerable attention has been given to mining the huge amount of publicly available data to gain situational awareness, which may help prevent or reduce the effects of a disaster by enabling the correct responses. In this study, an effective Convolutional Neural Network (CNN) tweet classification system that fully supports the Turkish language has been developed.
In addition, the first-ever Turkish tweet dataset for crisis response has been created. This dataset has been carefully preprocessed, annotated, and organized, and is suitable for use with all the well-known natural language processing tools. Furthermore, the performance of some well-known machine learning algorithms, i.e., K-Nearest Neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM), was investigated. The performance of the ensemble systems Random Forest (RF), AdaBoost Classifier (AdaBoost), and GradientBoosting Classifier (GBC) on text (tweet) classification was also observed.
A wide range of experiments was performed to investigate the performance of the developed system. The developed approach achieved very good performance, robustness, and stability when processing both Turkish and English.
Key Words: Crises Management Systems; Tweet Classification; Turkish language; Convolutional Neural Networks; Natural Language Processing.

- by Merve Işık
- •
- Artificial Intelligence, Data Mining, Turkish Language, Text Classification
- by Ismail Fahmi
- •
- Digital Library, Text Classification, learning algorithm
- by Eibe Frank and +1
- •
- Text Classification, High Dimensional Data, Cross Validation, High Dimensionality

The majority of the state-of-the-art text categorization algorithms are supervised and therefore require prior training. Besides the rigor involved in developing training datasets and the requirement for repetition of training for different texts, working with multilingual texts poses additional unique challenges. One of these challenges is that the developer must handle the many different languages involved. Term expansion such as query expansion has been applied in numerous applications; however, a major drawback of most of these applications is that the actual meaning of terms is not usually taken into consideration. Considering the semantics of terms is necessary because of the polysemous nature of most natural language words. In this paper, as a specific contribution to the document index approach for text categorization, we present a joint multilingual/cross-lingual text categorization algorithm (JointMC) based on semantic term expansion of class topic terms through an optimized knowledge-based word sense disambiguation. The lexical knowledge in BabelNet is used for the word sense disambiguation and expansion of the topics' terms. The categorization algorithm computes the distributed semantic similarity between the expanded class topics and the text documents in the test corpus. We evaluate our categorization algorithm using a multilabel text categorization problem. The multilabel categorization task uses the JRC-Acquis dataset, which is based on subject domain classification of the European Commission's EuroVoc microthesaurus. We compare the performance of the classifier with a model of it using the original class topics. Furthermore, we compare the performance of our classifier with two state-of-the-art supervised algorithms (each for multilingual and cross-lingual tasks) using the same dataset.
Empirical results obtained on five experimental languages show that categorization with expanded topics shows a very wide performance margin when compared to usage of the original topics. Our algorithm outperforms the existing supervised technique, which used the same dataset. Cross-language categorization surprisingly shows similar performance and is marginally better for some of the languages.

- by Peter Reutemann
- •
- Semi-supervised Learning, Knowledge, Text Classification, Scaling up
- by Ivo Rakovac
- •
- Text Classification
- by Simon Tong
- •
- Machine Learning, Active Learning, Support Vector Machines, Text Classification

Text classification is the process of assigning texts to categories on the basis of their content. It is a fundamental task of Natural Language Processing (NLP) with varied applications such as sentiment analysis, spam detection, topic labelling, and intent labelling. The first step of a classifier is feature extraction, i.e. converting words and phrases into vectors that record the frequency of each word from a predefined dictionary of words. Various machine learning algorithms can be used for classification. In this paper, we implement best-first, information gain, and gain ratio feature selection on classifiers such as Naive Bayes, Bagging, Random Forest, and Naive Bayes Multinomial. We find and compare the Accuracy, Training Time, Testing Time, Mean Absolute Error, and Recall of each feature selection method for each classifier. This helps to determine which classifier and feature selection method are best suited to text classification.
Index Words: Naive Bayes (NB), Naive Bayes Multinomial (MN), Information Gain (IG), Gain Ratio (GR), Gini Index (GI), Odds Ratio (OR), Chi-Square (CHI), Term Frequency (TF), Document Frequency (DF), Distinguishing Feature Selector (DFS), Area Under Curve (AUC), Mean Absolute Error (MAE), Natural Language Processing (NLP), Machine Learning (ML), Bag of Words (BOW), Customer Relationship Management (CRM)
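Information gain, one of the feature selection criteria named in this abstract, can be sketched for a binary "term present / term absent" feature as follows. This is an illustrative sketch rather than the paper's implementation; the helper names, the set-of-words document representation, and the toy spam/ham data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs.

    `docs` is a list of sets of words, aligned with `labels`.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Entropy of each partition, weighted by the partition's share.
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_term, without_term) if part)
    return entropy(labels) - conditional

# Hypothetical toy corpus.
docs = [{"free", "offer"}, {"free", "win"},
        {"meeting", "notes"}, {"project", "notes"}]
labels = ["spam", "spam", "ham", "ham"]
ig_free = information_gain(docs, labels, "free")
ig_win = information_gain(docs, labels, "win")
```

Here "free" separates the two classes perfectly and so receives the maximum gain of one bit, while "win" appears in a single document and scores lower. Ranking terms by this score and keeping the top ones is the usual way the criterion is used for feature selection.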

- by S M Kamruzzaman
- •
- Information Retrieval, Genetic Algorithm, Text Classification, Hybrid Learning

The increasing use of methods in natural language processing (NLP) that are based on huge corpora requires that the lexical, morpho-syntactic, and syntactic homogeneity of texts be mastered. We have developed a methodology and associated tools for text calibration or "profiling" within the ELRA benchmark called "Contribution to the construction of contemporary French corpora", based on multivariate analysis of linguistic features. We have integrated these tools within a modular architecture based on a generic model that allows, on the one hand, flexible annotation of the corpus with the output of NLP and statistical tools and, on the other hand, retracing the results of these tools through the annotation layers back to the primary textual data. This allows us to justify our interpretations.

This work combines Natural Language Processing with neural network methods such as the Convolutional Neural Network (CNN), a deep learning method that carries out a repetitive learning process to obtain the best representation of each word in the text. A CNN works by finding the pattern of a word among other words in the input matrix. The learning process in the several convolution layers is carried out in parallel and in sequence. Thus, each word is independent of other words around it. Twitter is a data source that interests researchers as a research object. However, the text in tweets contains much informal language, abbreviations, and everyday language, so the information in it is more difficult to identify than in formal text. In this research, the Natural Language Processing method is implemented using the CNN algorithm to classify information related to the emergency-response phase. The classification model was trained using two datasets, namely a crawled dataset of 1967 texts and a dataset of tweet texts from Twitter totalling 853 sentences, and tested using 89 different tweets. After 3 iterations with 10 training epochs per iteration, an accuracy of 98% and a loss of 4% were obtained. Thus, it can be concluded that the algorithm functions optimally in identifying information.

- by Journal of Software Engineering & Intelligent Systems and +1
- •
- Natural Language Processing, Machine Learning, Twitter, Text Classification
- by Harry Zhang
- •
- Text Classification, Scaling up, Time Complexity, Naive Bayes

Text Mining is the automatic discovery of new, previously unknown information, by automatic analysis of various textual resources. Text mining starts by extracting facts and events from textual sources and then enables forming new hypotheses that are further explored by traditional Data Mining and data analysis methods. In this chapter we will define text mining and describe the three main approaches for performing information extraction. In addition, we will describe how we can visually display and analyze the outcome of the information extraction process.

- by Ronen Feldman
- •
- Data Mining, Text Mining, Information Extraction, Text Classification
- by pinki meher
- •
- Information Security, Machine Learning, Data Mining, Computer Security

Unsolicited communications currently account for over sixty percent of all sent e-mail, with projections reaching the mid-eighties. While much spam is innocuous, a portion is engineered by criminals to prey upon, or scam, unsuspecting people. The senders of scam spam attempt to mask their messages as non-spam and con through a range of tactics, including pyramid schemes, securities fraud, and identity theft via phishing mechanisms (e.g. faux PayPal or AOL websites). To lessen the suspicion of fraudulent activities, scam messages sent by the same individual, or collaborating group, alter the text of their messages and assume an endless number of pseudonyms with an equal number of different stories. In this paper, we introduce ScamSlam, a software system designed to learn the underlying number of criminal cells perpetrating a particular type of scam, as well as to identify which scam spam messages were written by which cell. The system consists of two main components; 1) a filtering mec...

- by Edoardo Airoldi
- •
- Information Retrieval, SPAM, Text Classification, Text Analysis

Several text mining techniques have been proposed to deal with the huge number of textual documents that are now available and published. The main ones are classification techniques, which assign pre-defined labels to new documents, and clustering techniques, which separate texts into clusters. The techniques proposed in the literature are usually applied to only a few textual collections, which is not sufficient to indicate how good a technique is or which characteristics of the collections make a technique obtain better results than others. Besides, new techniques are compared with traditional algorithms over a small range of parameters, which makes the comparison unfair. This technical report addresses this gap by i) providing a characterization of 45 text collections; ii) providing classification results using traditional algorithms such as Naïve Bayes, Multinomial Naïve Bayes, C4.5, k-Nearest Neighbors, and Support Vector Machines over a wide range of parameter values; and iii) providing clustering results using traditional hierarchical algorithms such as Bisecting k-Means and Unweighted Pair Group Method with Arithmetic Mean. We also make all the collections used in this technical report available in an on-line repository.

- by Rafael Rossi
- •
- Text Mining, Benchmarking, Text Classification, Text Categorization

We present a system that gathers and analyzes online discussion as it relates to consumer products. Weblogs and online message boards provide forums that record the voice of the public. Woven into this discussion is a wide range of opinion and commentary about consumer products. Given its volume, format and content, the appropriate approach to understanding this data is large-scale web and text data mining. By using a wide variety of state-of-the-art techniques including crawling, wrapping, text classification and computational linguistics, online discussion is gathered and annotated within a framework that provides for interactive analysis that yields marketing intelligence for our customers.

Text classification is a very important research area in machine learning. Artificial Intelligence is reshaping text classification techniques to better acquire knowledge. Despite the growth and spread of AI in text mining research for languages such as English, Japanese, and Chinese, its role with respect to Myanmar text is not yet well understood. The aim of this paper is a comparative study of machine learning algorithms, namely Naïve Bayes (NB), k-nearest neighbours (KNN), and support vector machine (SVM), for Myanmar language news classification. No comparative study of machine learning algorithms exists for Myanmar news. Each news item is classified into one of four categories (Political, Business, Entertainment, and Sport). The dataset consists of 12,000 documents belonging to the 4 categories. Well-known algorithms are applied to the Myanmar language news dataset collected from websites. The goal of text classification is to classify documents into a certain number of pre-defined categories. The news corpus is used for training and testing the classifier. The chi-square feature selection method achieves comparable performance across a number of classifiers. The experimental results also show that the support vector machine achieves better accuracy than the other classification algorithms employed in this research. Because the Myanmar language is complex, it is all the more important to study and understand the nature of the data before proceeding to mining.

Generating accurate and timely internal and external audit reports may seem difficult for some auditors due to limited time or expertise in matching the correct clauses of the standard with the textual statement of findings. To overcome this gap, this paper presents the design of text classification models using support vector machine (SVM) and long short-term memory (LSTM) neural network in order to automatically classify audit findings and standard requirements according to text patterns. Specifically, the study explored the optimization of datasets, holdout percentage and vocabulary of learned words called NumWords, then analyzed their capability to predict training accuracy and timeliness performance of the proposed text classification models. The study found that SVM (96.74%) and LSTM (97.54%) were on par with each other in terms of the best training accuracy, although SVM (67.96±17.93 seconds [s]) was found to be significantly faster than LSTM (136.67±96.42 s) for any dataset size. The study proposed optimization formulas that highlight dataset and holdout as predictors of accuracy, and dataset and NumWords as predictors of timeliness. In terms of actual implementation, both classification models were able to accurately classify 20 out of 20 sample audit findings at 1 and 3 s, respectively. Hence, the choice between the two algorithms depends on the dataset size, learned words, holdout percentage, and workstation speed. This paper is part of a series that explores the use of Artificial Intelligence (AI) techniques in optimizing the performance of QMS in the context of a state university.

- by Ralph Sherwin Corpuz
- •
- Optimization, Text Classification, Support Vector Machines (SVMs), LSTM

Student obligations involve writing a large number of homework assignments and term papers, which are usually submitted in electronic form. Checking papers for plagiarism is not an easy task: their sheer quantity prevents teachers and professors from checking all of them by hand. Therefore, there is a need for a system that performs this task automatically. This paper describes the principles behind one such system. Text contained in papers written by students, as well as in documents found on the internet, is converted into n-gram models, which are stored and later compared with newly generated ones. A potential application of this system is at the Faculty of Electronic Engineering in Niš, Department of Computer Science, where it can be used to check student papers written in Serbian.
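The n-gram comparison described in this abstract can be sketched as below. The abstract does not say whether the n-grams are over words or characters, nor which similarity measure is used, so word trigrams and Jaccard similarity are assumptions here, as are the function names and the sample sentences.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical stored document and new submission.
stored = "the quick brown fox jumps over the lazy dog"
submitted = "a quick brown fox jumps over a sleeping cat"
score = jaccard(word_ngrams(stored), word_ngrams(submitted))
```

The two sentences share three of eleven distinct trigrams, giving a score of 3/11. A real system would compare each submission against many stored models and flag pairs whose score exceeds a chosen threshold.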

As most information (over 80%) is stored as text, text mining is believed to have high commercial potential. Knowledge may be discovered from many sources of information; yet unstructured texts remain the largest readily available source of knowledge. Text classification assigns documents to predefined categories. In this paper we give an introduction to text classification, describe the text classification process, and present an overview of the classifiers, and we compare some existing classifiers on the basis of a few criteria such as time complexity, principle, and performance.

- by Izzat M Alsmadi
- •
- Information Retrieval, Digital Library, Arabic Language, Text Classification
- by Casey Whitelaw
- •
- Text Classification

Social media has opened new avenues and opportunities for financial banking institutions to improve the quality of their products and services and to understand and adapt to their customers' needs. By directly analyzing customer feedback, financial banking institutions can provide personalized products and services tailored to their customers' needs. This paper presents a research framework for the creation of a financial banking dataset to be used for sentiment classification with various machine learning methods and techniques. The dataset contains 2234 financial banking comments collected from Romanian financial banking social media via web scraping.

This monograph is devoted to a topical line of research: the study of problems of stylistics using the methods of mathematical linguistics. The book analyzes the role and place of stylometry in philological research and discusses its epistemological principles and tasks, as well as its interaction with other measuring disciplines (scientometrics, the metrics of art, psychometrics, etc.). The book is intended for a wide range of specialists in textual criticism, source studies, documentation studies, stylistics, quantitative linguistics, and automatic text processing.

- by Gregory Martynenko
- •
- Russian Literature, Interdisciplinarity, Stylistics, Stylometrics

With the rise of weblogs and the increasing tendency of online publications to turn to message-board style reader feedback venues, informal political discourse is becoming an important feature of the intellectual landscape of the Internet, creating a challenging and worthwhile area for experimentation in techniques for sentiment analysis. We describe preliminary statistical tests on a new dataset of political discussion group postings which indicate that posts made in direct response to other posts in a thread have a strong tendency to represent an opposing political viewpoint to the original post. We conclude that traditional text classification methods will be inadequate to the task of sentiment analysis in this domain, and that progress is to be made by exploiting information about how posters interact with each other.

- by Tony Mullen
- •
- Sentiment Analysis, Text Classification, Statistical Test
- by Behnam Sadeghi
- •
- Philology, Bioinformatics, Computer Science, Algorithms
- by Mohammed Al-Kabi
- •
- Information Retrieval, Digital Library, Arabic Language, Text Classification

The massive amount of semi-structured data contained in text documents makes classifying them manually a very difficult task. Automatic text classification is the process of classifying documents, based on their contents, into a predefined set of categories. This paper compares the performance of well-known text classification techniques, including genetic algorithm, k-nearest neighbor, decision tree, support vector machine, and Naïve Bayes. A light stemmer and the Chi method were implemented as preprocessing and feature selection techniques. The effectiveness of the classifiers is evaluated in terms of the macro-average F1 measure. To evaluate the five classification techniques, a text corpus was collected. Results showed that the support vector machine and Naïve Bayes classifiers outperform the other classifiers in terms of classification accuracy.
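The macro-average F1 measure used for evaluation in this abstract is the unweighted mean of the per-class F1 scores. A minimal sketch, with a hypothetical function name and toy labels:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class
        # is never predicted and never present.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold labels and predictions.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
result = macro_f1(y_true, y_pred)
```

In this example class "a" scores 0.8 and class "b" scores 2/3, so the macro average is 11/15, slightly below the accuracy of 0.75. Because every class counts equally, the measure is sensitive to poor performance on rare categories.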

This paper describes two supervised machine learning techniques for classifying text data (tweets), based on the Naive Bayes classifier and logistic regression. A bag-of-words method is used to create features. The goal of the project is to help enrich existing Census data, which include only the home and work locations of city residents, with additional destination points classified as leisure activities on the basis of text data from Twitter. This project is a proof of concept and provides an example of an algorithm that can be used to train models for predicting the activity type of tweets.
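A bag-of-words Naive Bayes classifier of the kind this abstract describes can be sketched in a few lines. This is an assumption-laden illustration, not the project's code: it uses the multinomial variant with add-one (Laplace) smoothing, ignores words unseen in training, and the class name, labels, and sample tweets are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-tokenized text."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # skip out-of-vocabulary words
                    count = self.word_counts[label][word]
                    # Add-one smoothed log likelihood of the word.
                    score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training tweets.
tweets = ["great coffee at the cafe", "watching a movie tonight",
          "stuck in a meeting", "late for work again"]
labels = ["leisure", "leisure", "other", "other"]
model = NaiveBayes().fit(tweets, labels)
prediction = model.predict("movie and coffee")
```

Scores are summed in log space to avoid numerical underflow on longer texts.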

- by Ekaterina Levitskaya
- •
- Text Classification, Naive Bayes Classifier

In recent years, there has been exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods in order to classify texts accurately in many applications. Many machine learning approaches have achieved outstanding results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is given. This overview covers different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their application to real-world problems are discussed.

- by Kamran Kowsari
- •
- Machine Learning, Text Mining, Text Classification, Textual analysis

Text mining is drawing enormous attention in this era because a huge amount of text data is being generated, and this data must be managed carefully to extract the maximum benefit from it. Text classification is an essential sub-part of text mining in which related text data is assigned to a particular predefined category. In our study, we discuss different classifier techniques that have been popular in recent years. A comparison between different classifiers such as SVM, Naïve Bayes, and Neural Networks is presented in tabular form in this paper.

- by Aanchal Sharma
- •
- Artificial Intelligence, Information Security, Machine Learning, Data Mining

The aim of sentiment analysis is to automatically extract the opinions from a given text and decide its sentiment. In this paper, we introduce the first publicly available Twitter dataset on Sunnah and Shia (SSTD), as part of religious hate speech, which is a sub-problem of general hate speech. We further provide a detailed review of the data collection process and our annotation guidelines, so that reliable dataset annotation is ensured. We employed many stand-alone classification algorithms on the Twitter hate speech dataset, including Random Forest, Complement NB, DecisionTree, and SVM, and two deep learning methods, CNN and RNN. We further study the influence of word embedding dimensions for FastText and word2vec. In all our experiments, the classification algorithms are trained using a random split of the data (66% for training and 34% for testing). The two subsets were obtained by stratified sampling of the original dataset. CNN-FastText achieves the highest F-Measure (52.0%), followed by CNN-Word2vec (49.0%), showing that neural models with FastText word embeddings outperform classical feature-based models.

The main goal of this dissertation is to put different text classification tasks in the same frame by mapping the input data into a common vector space of linguistic attributes. Subsequently, several classification problems of great importance for natural language processing are solved by applying the appropriate classification algorithms.
The dissertation deals with the problem of validating bilingual translation pairs, where the final goal is to construct a classifier that provides a substitute for human evaluation and that decides whether a pair is a proper translation between the languages in question by applying a variety of linguistic information and methods.
In dictionaries it is useful to have a sentence that demonstrates the use of a particular dictionary entry. This task is called the classification of good dictionary examples. In this thesis, a method is developed that automatically estimates whether an example is good or bad for a specific dictionary entry.
Two cases of short message classification are also discussed in this dissertation. In the first case, the classes are the authors of the messages, and the task is to assign each message to its author from that fixed set. This task is called authorship identification. The other classification of short messages considered is called opinion mining, or sentiment analysis. Starting from the assumption that a short message carries a positive or negative attitude about a thing, or is purely informative, the classes can be positive, negative, and neutral.
These tasks are of great importance in the field of natural language processing, and the proposed solutions are language-independent and based on machine learning methods: support vector machines, decision trees, and gradient boosting. For all of these tasks, the effectiveness of the proposed methods is demonstrated for the Serbian language.