Short Text Classification Research Papers

The amount of text data in the world and in our lives seems to be ever increasing, with no end in sight. Text data mining is defined as the process of deriving high-quality information from text. It has been applied in different fields, including pattern mining, opinion mining, and web mining. In this work, text data mining is built around global stemming of the different forms of Arabic words. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, generally a written word form. We use REP-Tree to improve text representation. In addition, we test new combinations of weighting schemes on Arabic text data for classification purposes. The WEKA workbench is used for processing. Results on a data set from the BBC Arabic website show the efficiency and accuracy of REP-Tree in Arabic text classification.

In recent years, there has been exponential growth in the number of complex documents and texts. Classifying texts accurately in many applications requires a deeper understanding of machine learning methods. Understanding the rapidly growing volume of short text is extremely important. Short text differs from traditional documents in its length. With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has been extensively applied in many areas. Numerous studies focus on short text mining. Classifying short text is a challenge because of its inherent characteristics, such as sparseness, large scale, immediacy, and non-standardization. With the rapid development of the web, Web users and Web services are generating more and more short text, including tweets, search snippets, product reviews, and so on. There is an urgent demand to understand short text. For instance, a good understanding of tweets can help advertisers place relevant advertisements alongside the tweets, which generates revenue without hurting user experience. Short text classification is one of the important tasks in Natural Language Processing (NLP). Unlike paragraphs or documents, short texts are more ambiguous. They do not have enough contextual information, which poses a challenge for classification. We retrieve knowledge from an external knowledge source to enhance the semantic representation of short texts. We take conceptual information as a form of data and incorporate it into deep neural networks. Here we study the different methods available for text classification and categorisation.

Many e-health services are available to users today, but they often suffer from a lack of personalization. In this paper, we present a system to generate personalized health recommendations from various providers, based on classification of health-related calendar events on the user’s smartphone. Due to privacy constraints, such personal data often cannot be uploaded to external servers, hence the classification and personalization have to run on the client device. We use a server to train our model to classify calendar events using SVM and fastText, while the prediction is run on the client device using the trained model. The class labels from the classified calendar events, weighted in order of recency, are used to build a vector, which we treat as a representation of user interest while personalizing the recommendations. This vector is used to re-rank health-related recommendations obtained from third-party providers based on relevance. We describe the implementation details of our system and some tests of its accuracy and of its ability to provide relevant health-related recommendations. While we used the calendar app to classify events, our system can also be extended to other apps such as messaging.
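The recency-weighted interest vector and the re-ranking step described in this abstract can be illustrated with a small Python sketch. The abstract does not give the weighting formula, so the exponential decay, the `decay` parameter, and the function names `interest_vector` and `rerank` are assumptions made purely for illustration; the class labels are taken as already predicted by some on-device classifier.

```python
from collections import defaultdict

def interest_vector(labels, decay=0.5):
    """Build a normalized interest vector from predicted class labels.

    `labels` is ordered most recent first. The exponential decay is an
    assumed weighting; the abstract only says "weighted in order of recency".
    """
    weights = defaultdict(float)
    for age, label in enumerate(labels):
        weights[label] += decay ** age  # older events contribute less
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()} if total else {}

def rerank(recommendations, interests):
    """Sort (title, label) pairs by the user's interest in each label."""
    return sorted(
        recommendations,
        key=lambda rec: interests.get(rec[1], 0.0),
        reverse=True,
    )

# Hypothetical labels of classified calendar events, most recent first.
interests = interest_vector(["dental", "fitness", "dental"])
recs = [("Yoga class", "fitness"), ("Teeth cleaning", "dental"), ("Sleep tips", "sleep")]
ranked = rerank(recs, interests)
```

Because only labels and weights are needed at prediction time, a vector like this could plausibly be computed on the client without uploading the calendar events themselves.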

Classification of short text messages has become more and more relevant in recent years, as billions of users communicate with other people through online social networks. Understanding message content can have a huge impact on many data analysis processes, ranging from the study of online social behavior to targeted advertising, to security and privacy purposes.
In this paper, we propose a new unsupervised knowledge-based classifier for short text messages, where each category is represented by an ego-network.
A short text is classified into a category depending on how far its words are from the ego of that category. We show how this technique can be used in both single-label and multi-label classification, and how it outperforms the state of the art in short text message classification.
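The idea of classifying by distance from a category's ego can be sketched in Python as follows. The abstract does not say how the ego-networks are built or which distance is used, so this sketch assumes hop distance found by breadth-first search, a mean over the words of the text, and a fixed `missing_penalty` for words absent from a network; the toy networks and all function names are hypothetical.

```python
from collections import deque

def distances_from_ego(graph, ego):
    """Breadth-first search: hop distance from the ego to each reachable word."""
    dist = {ego: 0}
    queue = deque([ego])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def classify(text, ego_networks, missing_penalty=5):
    """Return the category whose ego is closest, on average, to the words."""
    words = text.lower().split()
    scores = {}
    for category, graph in ego_networks.items():
        dist = distances_from_ego(graph, category)
        # Words not in the network get an assumed fixed penalty distance.
        total = sum(dist.get(word, missing_penalty) for word in words)
        scores[category] = total / max(len(words), 1)
    return min(scores, key=scores.get), scores

# Toy ego-networks: each category name is the ego of its own adjacency list.
ego_networks = {
    "sport": {"sport": ["football", "match"], "football": ["goal"]},
    "music": {"music": ["concert", "guitar"], "concert": ["ticket"]},
}
label, scores = classify("football match goal", ego_networks)
```

A multi-label variant could return every category whose mean distance falls below a chosen threshold instead of only the minimum.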

Big, fine-grained enterprise registration data that includes time and location information enables us to quantitatively analyze, visualize, and understand the patterns of industries at multiple scales across time and space. However, data quality issues such as incompleteness and ambiguity hinder such analysis and application. These issues become more challenging when the volume of data is immense and constantly growing. High Performance Computing (HPC) frameworks can tackle big data computational issues, but few studies have systematically investigated imputation methods for enterprise registration data in this type of computing environment. In this paper, we propose a big data imputation workflow, based on Apache Spark and a bare-metal computing cluster, to impute enterprise registration data. We integrated external data sources, employed Natural Language Processing (NLP), and compared several machine-learning methods to address the incompleteness and ambiguity problems found in enterprise registration data. Experimental results illustrate the feasibility, efficiency, and scalability of the proposed HPC-based imputation framework, which also provides a reference for processing other big georeferenced text data. Using these imputation results, we visualize and briefly discuss the spatiotemporal distribution of industries in China, demonstrating the potential applications of such data when quality issues are resolved.

Abstract: One fundamental issue in today’s Online Social Networks (OSNs) is giving users the ability to control the messages posted on their own private space, so that unwanted content is not displayed. Up to now, OSNs have provided little support for this requirement. To fill the gap, in this paper we propose a system that allows OSN users to have direct control over the messages posted on their walls. This is achieved through a flexible rule-based system, which allows users to customize the filtering criteria to be applied to their walls, and a Machine Learning-based soft classifier that automatically labels messages in support of content-based filtering.
Keywords: Online social networks, information filtering, short text classification, policy based personalization.
Title: A System to Filter Unwanted Messages from OSN User Walls
Author: K.R DEEPTI
International Journal of Computer Science and Information Technology Research
ISSN 2348-120X (online), ISSN 2348-1196 (print)
Research Publish Journals

Author(s): ROSARIO, RYAN ROBERT | Advisor(s): Wu, Yingnian | Abstract: Text classification typically performs best with large training sets, but short texts are very common on the World Wide Web. Can we use resampling and data augmentation to construct larger texts using similar terms? Several current methods exist for working with short text that rely on using external data and contexts, or workarounds. Our focus is to test a new preprocessing approach that uses resampling, inspired by the bootstrap, combined with data augmentation, by treating each short text as a population and sampling similar words from a semantic space to create a longer text. We use blog post titles collected from the Technorati blog aggregator as experimental data with each title appearing in one of ten categories. We first test how well the raw short texts are classified using a variant of SVM designed specifically for short texts as well as a supervised topic model and an SVM model that uses semantic vecto...
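The resampling-plus-augmentation step described in this abstract can be illustrated with a rough Python sketch. It treats the tokens of a short text as a population, draws from them with replacement, and substitutes a word sampled from a table of similar words. The `similar_words` table stands in for the semantic space, and the function name, `target_len`, and `seed` are all assumptions; the abstract does not specify how similar words are chosen.

```python
import random

def augment(text, similar_words, target_len=12, seed=0):
    """Lengthen a short text by bootstrap-style resampling of its own words.

    Each draw picks a word from the original text (with replacement) and
    appends one of its assumed neighbors from `similar_words`, or the word
    itself when no neighbor is listed.
    """
    rng = random.Random(seed)
    tokens = text.lower().split()
    if not tokens:
        return []
    augmented = list(tokens)  # keep the original words first
    while len(augmented) < target_len:
        word = rng.choice(tokens)
        candidates = similar_words.get(word, [])
        augmented.append(rng.choice(candidates) if candidates else word)
    return augmented

# Hypothetical similarity table standing in for a learned semantic space.
similar_words = {
    "python": ["programming", "code"],
    "tips": ["advice", "tricks"],
}
longer = augment("Python tips", similar_words)
```

The lengthened token list could then be passed to any standard bag-of-words classifier in place of the raw title.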