Short Text Classification Research Papers
2025, TELKOMNIKA Telecommunication Computing Electronics and Control
Even though it is considered a more traditional method compared to modern algorithms, term frequency-inverse document frequency (TF-IDF) nevertheless produces good results in a range of text mining tasks. This study assesses the effectiveness of several TF-IDF modifications for short text classification. Imbalanced datasets are another issue addressed in this research. To rectify the imbalance issue, we integrate standard, log-scaled, and Boolean TF-IDF in short text classification with undersampling and oversampling methods. Precision, recall, and F-measure metrics are used to evaluate each experiment. The best result is obtained when applying Boolean TF-IDF with the oversampling method. Oversampling methods outperform the undersampling methods in every experiment, although in some cases the undersampling experiments are competitive. Additionally, our study reveals that employing a modified TF-IDF, such as the Boolean or log-scaled version, benefits classification performance more than relying solely on the standard TF-IDF approach, particularly when handling imbalanced datasets.
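The three weighting schemes this abstract compares can be sketched in a few lines. This is a toy illustration in pure Python with made-up tokens, not the authors' implementation:

```python
import math
from collections import Counter

def tf_weight(count, scheme):
    """Term-frequency component under the three compared schemes."""
    if scheme == "standard":                 # raw term count
        return float(count)
    if scheme == "log":                      # log-scaled: 1 + ln(tf)
        return 1.0 + math.log(count) if count > 0 else 0.0
    if scheme == "boolean":                  # presence/absence only
        return 1.0 if count > 0 else 0.0
    raise ValueError(f"unknown scheme: {scheme}")

def tfidf(docs, scheme):
    """TF-IDF vectors (as dicts) for tokenized docs under a TF scheme."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf_weight(c, scheme) * idf[t] for t, c in Counter(doc).items()}
            for doc in docs]
```

Under the Boolean scheme, repeated terms in a short text contribute no extra weight, which is one reason it can behave differently on imbalanced short-text data.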
2025, PLOS ONE
To analyse large numbers of texts, social science researchers are increasingly confronting the challenge of text classification. When manual labeling is not possible and researchers have to find automatized ways to classify texts, computer science provides a useful toolbox of machine-learning methods whose performance remains understudied in the social sciences. In this article, we compare the performance of the most widely used text classifiers by applying them to a typical research scenario in social science research: a relatively small labeled dataset with infrequent occurrence of categories of interest, which is a part of a large unlabeled dataset. As an example case, we look at Twitter communication regarding climate change, a topic of increasing scholarly interest in interdisciplinary social science research. Using a novel dataset including 5,750 tweets from various international organizations regarding the highly ambiguous concept of climate change, we evaluate the performance of methods in automatically classifying tweets based on whether they are about climate change or not. In this context, we highlight two main findings. First, supervised machine-learning methods perform better than state-of-the-art lexicons, in particular as class balance increases. Second, traditional machine-learning methods, such as logistic regression and random forest, perform similarly to sophisticated deep-learning methods, whilst requiring much less training time and computational resources. The results have important implications for the analysis of short texts in social science research.
2024
Knowledge discovery is the process of discovering useful knowledge from a collection of data. This widely used data mining technique includes data preparation and selection, data cleansing, incorporating prior knowledge about the data sets, and interpreting accurate solutions from the observed results. Text mining is a subdomain of knowledge discovery applied to text data. The presented study provides a broad understanding of text mining and its applications across different real-time domains. Text mining includes the processes of text classification and text clustering; cluster analysis, in turn, is performed on unlabelled and unstructured data. In this paper, I present a study of various research papers that explore Text Clustering approaches in various genres.
2024
One principal issue in today's Online Social Networks (OSNs) is to give users the ability to control the messages and images posted on their own private space, so that unwanted content is not displayed. Up to now, OSNs have provided little support for this requirement. This is accomplished through a flexible rule-based framework and a Machine Learning based soft classifier that automatically labels messages in support of content-based filtering. In this paper, we also propose a novel approach to a Content-Based Image Retrieval (CBIR) system, based on a Genetic Algorithm, to filter unwanted images.
2024
The younger generation mostly uses social networking sites. The online social network (OSN) helps an individual connect with friends, family and society to collect and share information with others. Nowadays, OSNs face the problem of people posting indecent messages on an individual's wall, which annoys other people who see them. OSNs provide little support to prevent unwanted messages, so the proposed system allows OSN users to have direct control over the messages posted on their walls. This is achieved through a filtered wall (FW) able to filter unwanted messages from OSN user walls. The proposed system provides security to online social networks.
2024
Statistical analysis of parliamentary roll call votes is an important topic in political science, as it reveals the ideological positions of members of parliament and factions. However, these positions depend on the issues debated and voted upon, as well as on attitudes towards the governing coalition. Therefore, analysis of carefully selected sets of roll call votes provides deeper knowledge about the behavior of members of parliament. However, since these votes number in the thousands, automatic text classifiers have to be employed to classify roll call votes according to their topic. In this paper we present results of ongoing research on thematic classification of roll call votes of the Lithuanian Parliament. This paper is also part of a larger project aiming to develop infrastructure for monitoring and analyzing roll call voting in the Lithuanian Parliament.
2024, International Journal of Computer Science and Information Technology
The amount of text data in the world and in our lives seems ever increasing, with no end in sight. Text Data Mining is defined as the process of deriving high-quality information from text. It has been applied in different fields, including pattern mining, opinion mining, and web mining. Here, Text Data Mining is based on stemming the different forms of Arabic words. Stemming is defined as the method of reducing inflected (or sometimes derived) words to their word stem, base, or root form. We use the REP-Tree to improve text representation and, in addition, test new combinations of weighting schemes applied to Arabic text data for classification purposes. For processing, the WEKA workbench is used. The results on a dataset from the BBC-Arabic website also show the efficiency and accuracy of REP-Tree in Arabic text classification.
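The stemming step this abstract relies on can be sketched as a toy light stemmer. The affix lists below are illustrative assumptions and far smaller than what a production Arabic stemmer, or this paper's pipeline, would use:

```python
# Toy light stemmer: strips a handful of common Arabic affixes while
# keeping at least a 3-letter core.  Real stemmers use far richer rules.
PREFIXES = ["وال", "بال", "كال", "فال", "ال"]   # longest listed first
SUFFIXES = ["ات", "ون", "ين", "ها", "ية"]

def light_stem(word):
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word
```

Reducing surface forms to a shared stem shrinks the feature space before the weighting schemes and REP-Tree classifier are applied.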
2024, International Journal of Innovative Research in Computer and Communication Engineering
One of the basic problems in Online Social Networks (OSNs) is providing users the ability to control the text (messages) posted on their own profile, so as to prevent unwanted content from being displayed. In this paper we propose a system that allows OSN users to have direct control over the messages posted on their private walls. This is achieved through a filtering system that allows users to apply their own filtering criteria, thereby enabling content-based filtering alongside filtering based upon relationship types.
2024, 2015 10th International Conference on Malicious and Unwanted Software (MALWARE)
This study presents a malware classification system designed to classify malicious processes at run-time on production hosts. The system monitors process-level system call activity and uses information extracted from system call traces as inputs to the classifier. The system is advantageous because it does not require the use of specialized analysis environments. Instead, a 'lightweight' service application monitors process execution and classifies new malware samples based on their behavioral similarity to known malware. This study compares the effectiveness of multiple feature sets, ground truth labeling schemes, and machine learning algorithms for malware classification. The accuracy of the classification system is evaluated against process-level system call traces of recently discovered malware samples collected from production environments. Experimental results indicate that accurate classification results can be achieved using relatively short system call traces and simple representations.
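The idea of turning short system-call traces into simple representations can be sketched as n-gram bags compared by overlap. The syscall names and the Jaccard measure below are illustrative assumptions, not the paper's exact feature sets or classifiers:

```python
from collections import Counter

def syscall_ngrams(trace, n=3):
    """Bag of n-grams over a process's system-call trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))

def similarity(a, b):
    """Jaccard similarity between two traces' n-gram sets, a crude
    stand-in for 'behavioral similarity to known malware'."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0
```

A new process could then be assigned the label of its most similar known sample, e.g. by nearest-neighbour lookup over a library of labeled traces.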
2024, International Journal of Advanced Research in Computer and Communication Engineering
In this era of the internet, Online Social Networks (OSNs) are platforms for building social relations among people who share interests, activities, backgrounds or real-life connections. OSNs have gained ubiquitous status, and this has led to the security issue of unwanted messages being posted on user walls. Therefore, in order to make the OSN user wall a secure wall, we introduce a flexible rule-based system which allows users to control the messages posted on their walls and to customise the filtering criteria applied to them. This system exploits a machine learning based soft classifier for automatically labelling messages in support of content-based filtering.
2024
In the present-day scenario, online social networks (OSN) are very popular and among the most interactive media to share, communicate and exchange different kinds of information, such as text, images, audio and video. OSN users are given the privilege of direct control over what is posted or commented on their walls, with the help of information filtering. This can be achieved through a text pattern matching system that allows users to filter their open space (wall) and to add new words to be treated as unwanted. To provide a healthy environment for users to interact, posted text is pattern-matched against a blacklisted vocabulary. If the posted content is verified as ethical and not spam, it is allowed to appear on the wall; otherwise the posted text is blurred or encoded with special symbols, or its words are added to the blacklist. This can be achieved through a flexible rule-based system, which allows the users to customize...
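The pattern-matching step against a blacklisted vocabulary, with offending words encoded as special symbols, can be sketched as follows. This is a minimal illustration; the word regex and the masking choice are assumptions, not the proposed system:

```python
import re

def filter_message(message, blacklist):
    """Mask blacklisted words with '*' characters, leaving other text intact."""
    def mask(m):
        word = m.group(0)
        return "*" * len(word) if word.lower() in blacklist else word
    return re.sub(r"[A-Za-z']+", mask, message)
```

A user-customizable rule layer would then decide per-wall which blacklist applies and whether to mask, reject, or quarantine the post.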
2024, International Journal for Research in Applied Science and Engineering Technology
The Braille system has been employed by the visually impaired for reading and writing. Due to the limited availability and high cost of Braille textbooks, efficient usage of the books becomes a necessity. With the introduction and popularization of text-to-speech converters, a marked rise in literacy rates is being seen among the visually impaired. Also, since Braille is not well known to the masses, communication between the visually impaired and the outside world becomes an arduous task. A lot of research has been carried out on the conversion of English text to Braille, but little of it concentrates on the reverse, i.e., the conversion of Braille to English. This paper proposes a technique to convert a scanned Braille document to text, which can then be read aloud through the computer. The Braille documents are preprocessed to enhance the dots and reduce the noise. The Braille cells are segmented, and the dots from each cell are extracted and mapped to the appropriate alphabets of the language. Deciphering the Braille images requires image classification, for which there are two basic approaches: supervised and unsupervised learning. We implement two supervised classification models, namely Support Vector Machine (SVM) and Convolutional Neural Network (CNN), to evaluate and compare their accuracies. Finally, we use a speech synthesizer to convert the text into spoken speech.
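Once the dots of a segmented cell are extracted, mapping them to letters is a lookup over the standard 6-dot patterns. The sketch below covers only the first five letters and sits downstream of the SVM/CNN dot-detection step the paper evaluates:

```python
# Standard 6-dot Braille numbering: dots 1-2-3 down the left column,
# 4-5-6 down the right.  Only the first few letters are mapped here.
BRAILLE = {
    frozenset({1}): "a",
    frozenset({1, 2}): "b",
    frozenset({1, 4}): "c",
    frozenset({1, 4, 5}): "d",
    frozenset({1, 5}): "e",
}

def decode_cells(cells):
    """Map each segmented cell (set of detected raised dots) to a letter;
    unknown patterns come back as '?'."""
    return "".join(BRAILLE.get(frozenset(c), "?") for c in cells)
```

The decoded string can then be handed to any speech synthesizer for reading aloud.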
2024, Anais do Encontro Nacional de Inteligência Artificial e Computacional (ENIAC)
Classifying sentences in industrial, technical or scientific reports can enhance text mining and information retrieval tasks with useful machine-readable metadata. This paper describes a search engine that employs sentence classification so as to search for abstracts from scholarly papers in Petroleum Engineering. The sentences were classified into four classes, based on the popular IMRAD categories. We produced a dataset containing more than 2,200 manually labeled sentences from 278 scholarly articles in the field of Petroleum Engineering in order to be used as training and testing data. The classifier with best results was logistic regression, with an accuracy of 86.4%. The information retrieval system built on top of the classification system yielded a mAP of 0.80.
2024, ProQuest LLC eBooks
An example of the bootstrap sampling step treating the corpus as a population and the documents as the sampling units.
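The bootstrap step this caption describes, with the corpus as the population and documents as the sampling units, can be sketched as follows; the fixed seed is an illustrative assumption for reproducibility:

```python
import random

def bootstrap_sample(documents, seed=0):
    """Draw |corpus| documents with replacement, treating the corpus as
    the population and each document as a sampling unit."""
    rng = random.Random(seed)
    return [rng.choice(documents) for _ in documents]
```

Repeating this draw many times yields resampled corpora from which sampling variability of a corpus statistic can be estimated.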
2024
We analyze methods for selecting topics in news articles to explain stock returns. We find, through empirical and theoretical results, that supervised Latent Dirichlet Allocation (sLDA) implemented through Gibbs sampling in a stochastic EM algorithm will often overfit returns to the detriment of the topic model. We obtain better out-of-sample performance through a random search of plain LDA models. A branching procedure that reinforces effective topic assignments often performs best. We test these methods on an archive of over 90,000 news articles about S&P 500 firms.
2024, International Journal of Information System Modeling and Design
Extracting knowledge from unstructured text and then classifying it has gained importance since the data explosion on the web. Traditional text classification approaches are becoming ubiquitous, but combining semantic knowledge representation with statistical techniques can be more promising. The developed method fabricates neural networks to expedite and improve the simulation of ontology-based classification. This paper compares the results of ontology-based text classification against traditional classification based on an artificial neural network (ANN), using parameters such as accuracy, precision, etc. The experimental analysis shows that the proposed findings are substantially better than conventional text classification. The authors also ran tests comparing the proposed research model with one of the latest studies, with the proposed model achieving higher accuracy and F1 score across experiments performed with different numbers of hidden layers and neurons.
2023, PLOS ONE
To analyse large numbers of texts, social science researchers are increasingly confronting the challenge of text classification. When manual labeling is not possible and researchers have to find automatized ways to classify texts, computer science provides a useful toolbox of machine-learning methods whose performance remains understudied in the social sciences. In this article, we compare the performance of the most widely used text classifiers by applying them to a typical research scenario in social science research: a relatively small labeled dataset with infrequent occurrence of categories of interest, which is a part of a large unlabeled dataset. As an example case, we look at Twitter communication regarding climate change, a topic of increasing scholarly interest in interdisciplinary social science research. Using a novel dataset including 5,750 tweets from various international organizations regarding the highly ambiguous concept of climate change, we evaluate the performance of methods in automatically classifying tweets based on whether they are about climate change or not. In this context, we highlight two main findings. First, supervised machine-learning methods perform better than state-of-the-art lexicons, in particular as class balance increases. Second, traditional machine-learning methods, such as logistic regression and random forest, perform similarly to sophisticated deep-learning methods, whilst requiring much less training time and computational resources. The results have important implications for the analysis of short texts in social science research.
2023, International Journal for Research in Applied Science and Engineering Technology
Computer vision and machine learning are two young and promising technologies. Paired together, they have been used for inspection, identification, object recognition and more. Agricultural markets suffer heavy losses caused by diseases, which could easily be prevented if only farmers and cold storage owners knew exactly which disease the produce is suffering from. With the increasing reach of smartphones, digital cameras have become available to the masses. We aim to develop algorithms that can detect and estimate diseases in vegetables and fruits. Such algorithms can be accommodated in smartphones and can potentially change the way agricultural produce is currently inspected. Here we have tried to identify the key algorithms from recent work that could help us develop more advanced algorithms.
2023, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 1 - EMNLP '09
A significant portion of the world's text is tagged by readers on social bookmarking websites. Credit attribution is an inherent problem in these corpora because most pages have multiple tags, but the tags do not always apply with equal specificity across the whole document. Solving the credit attribution problem requires associating each word in a document with the most appropriate tags and vice versa. This paper introduces Labeled LDA, a topic model that constrains Latent Dirichlet Allocation by defining a one-to-one correspondence between LDA's latent topics and user tags. This allows Labeled LDA to directly learn word-tag correspondences. We demonstrate Labeled LDA's improved expressiveness over traditional LDA with visualizations of a corpus of tagged web pages from del.icio.us. Labeled LDA outperforms SVMs by more than 3 to 1 when extracting tag-specific document snippets. As a multi-label text classifier, our model is competitive with a discriminative baseline on a variety of datasets.
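The core constraint of Labeled LDA, restricting each document's topics to its own tag set, can be sketched as a single collapsed-Gibbs draw. The count dictionaries and hyperparameters below are hypothetical placeholders for the sampler's running state, not the paper's implementation:

```python
import random

def sample_label_topic(word, doc_labels, n_dz, n_zw, n_z,
                       alpha=0.5, beta=0.01, vocab=1000, rng=None):
    """One collapsed-Gibbs draw for one word occurrence.  Plain LDA would
    sample over every topic; Labeled LDA restricts the draw to the
    document's own label set (one latent topic per user tag).
    n_dz: this document's topic counts; n_zw: topic-word counts;
    n_z: topic totals -- hypothetical running tallies of the sampler."""
    rng = rng or random.Random(0)
    weights = [
        (n_dz.get(z, 0) + alpha)
        * (n_zw.get((z, word), 0) + beta) / (n_z.get(z, 0) + beta * vocab)
        for z in doc_labels
    ]
    r = rng.random() * sum(weights)
    for z, w in zip(doc_labels, weights):
        r -= w
        if r <= 0:
            return z
    return doc_labels[-1]
```

Because the draw never leaves `doc_labels`, every word is credited to one of the document's own tags, which is exactly the credit-attribution behaviour the abstract describes.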
2023, International Journal of Computer Science and Information Technology
The amount of text data in the world and in our lives seems ever increasing, with no end in sight. Text Data Mining is defined as the process of deriving high-quality information from text. It has been applied in different fields, including pattern mining, opinion mining, and web mining. Here, Text Data Mining is based on stemming the different forms of Arabic words. Stemming is defined as the method of reducing inflected (or sometimes derived) words to their word stem, base, or root form. We use the REP-Tree to improve text representation and, in addition, test new combinations of weighting schemes applied to Arabic text data for classification purposes. For processing, the WEKA workbench is used. The results on a dataset from the BBC-Arabic website also show the efficiency and accuracy of REP-Tree in Arabic text classification.
2023
Electronic health records (EHRs) contain important clinical information about patients. Some of these data are in the form of free text and require preprocessing to be usable in automated systems. Efficient and effective use of this data could be vital to the speed and quality of health care. As a case study, we analyzed classification of CT imaging reports into binary categories. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. Topic modeling of the corpora provides interpretable themes that exist in these reports. Representing reports according to their topic distributions is more compact than a bag-of-words representation and can be processed faster than raw text in subsequent automated processes. A binary topic model was also built as an unsupervised classification approach, with the assumption that each topic corresponds to a class. And, finally, an aggregate topic classifier was built where reports are c...
2023, Bonfring International Journal of Data Mining
Conventional schemes for document classification need labeled data to build consistent and precise classifiers. On the other hand, labeled data are rarely available, and normally too expensive to obtain. Given a learning task for which training data are not available, abundant labeled data may exist for a different but related domain. One would like to make use of the related labeled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable efficient learning strategies when auxiliary data obey a different probability distribution. A co-clustering based classification scheme has been proposed earlier to deal with cross-domain text classification. Here, the idea underlying this approach is extended by making the latent semantic relationship between the two domains explicit. This objective is achieved with the use of Wikipedia. Consequently, the pathway that permits propagating labels between the two domains captures not only common words but also semantic concepts in accordance with the content of documents. Results empirically demonstrate the efficacy of the semantic-based approach to cross-domain classification using a variety of real data.
2023, IEEE Transactions on Knowledge and Data Engineering
One fundamental issue in today's Online Social Networks (OSNs) is to give users the ability to control the messages posted on their own private space, to avoid unwanted content being displayed. Up to now, OSNs have provided little support for this requirement. To fill the gap, in this paper we propose a system allowing OSN users to have direct control over the messages posted on their walls. This is achieved through a flexible rule-based system, which allows users to customize the filtering criteria to be applied to their walls, and a Machine Learning based soft classifier that automatically labels messages in support of content-based filtering.
2023
Web page recommendations are generated using navigational history from web server log files. The Semantic Variable Length Markov Chain Model (SVLMC) is a web page recommendation system that generates recommendations by combining a higher-order Markov model with rich semantic data. The problems of state space complexity and time complexity in SVLMC were resolved by the Semantic Variable Length confidence pruned Markov Chain Model (SVLCPMC) and Support vector machine based SVLCPMC (SSVLCPMC) methods, respectively. Recommendation accuracy was further improved by quickest change detection using the Kullback-Leibler Divergence method. In this paper, socio-semantic information is included with the similarity score, which improves recommendation accuracy. Social information from social websites such as Twitter is considered for web page recommendation. Initially, a number of web pages are collected and the similarity between web pages is computed by comparing their semantic informa...
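The Kullback-Leibler divergence named above has a simple discrete form; this is a minimal sketch of the quantity only, not the paper's detection statistic or thresholds:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions over the same
    support; quickest change detection flags a shift in behaviour when
    such a divergence-based statistic crosses a threshold."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note the asymmetry: D_KL(P || Q) and D_KL(Q || P) generally differ, and Q must assign nonzero probability wherever P does.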
2023, Proceedings of the Twelfth Conference on Computational Natural Language Learning - CoNLL '08
Detecting the semantic coherence of a document is a challenging task and has several applications such as in text segmentation and categorization. This paper is an attempt to distinguish between a 'semantically coherent' true document and a 'randomly generated' false document through topic detection in the framework of latent Dirichlet analysis. Based on the premise that a true document contains only a few topics and a false document is made up of many topics, it is asserted that the entropy of the topic distribution will be lower for a true document than that for a false document. This hypothesis is tested on several false document sets generated by various methods and is found to be useful for fake content detection applications.
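The entropy criterion described above is straightforward to compute from a document's topic distribution; a minimal sketch (the distributions are toy values, not the paper's data):

```python
import math

def topic_entropy(dist):
    """Shannon entropy (in nats) of a topic-probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# A "true" document concentrates mass on a few topics;
# a "randomly generated" one spreads mass over many topics.
true_doc  = [0.85, 0.10, 0.05]
false_doc = [0.25, 0.25, 0.25, 0.25]

assert topic_entropy(true_doc) < topic_entropy(false_doc)
```

The decision rule in the paper amounts to thresholding this quantity: low topic entropy suggests a coherent document, high entropy a fake one.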
2023, International Journal of Information System Modeling and Design
Extracting knowledge from unstructured text and then classifying it has gained importance since the data explosion on the web. Traditional text classification approaches are becoming ubiquitous, but a hybrid of semantic knowledge representation with statistical techniques can be more promising. The developed method employs neural networks to expedite and improve the simulation of ontology-based classification. This paper compares the results of ontology-based text classification with traditional classification based on an artificial neural network (ANN), using parameters such as accuracy, precision, etc. The experimental analysis shows that the proposed findings are substantially better than those of conventional text classification. The authors also ran tests comparing the results of the proposed research model with one of the latest studies, yielding better accuracy and F1 score...
2023, International Journal of Computer Science and Information Technology
The amount of text data in the world, and in our lives, seems ever increasing, with no end in sight. Text data mining is defined as the process of deriving high-quality information from text. It has been applied in different fields, including pattern mining, opinion mining, and web mining. This work is based on stemming the different forms of Arabic words. Stemming is defined as the method of reducing inflected (or sometimes derived) words to their word stem, base, or root form. We use the REP-Tree to improve text representation, and we test new combinations of weighting schemes applied to Arabic text data for classification purposes. For processing, the WEKA workbench is used. The results on a dataset from the BBC-Arabic website show the efficiency and accuracy of REP-Tree in Arabic text classification.
2023
When relevance feedback, one of the most popular information retrieval models, is used in an information retrieval system, related words are extracted based on the first retrieval result. These words are then added to the original query, and retrieval is performed again using the updated query. In general, retrieval performance with such query expansion falls compared with the performance of the original query. The cause is that there are few synonyms in the thesaurus, and although some synonyms are added to the query, the same documents are retrieved as a result. In this paper, to solve this problem with related words, we propose latent context relevance, which takes into account the relevance between the query and each index word in the document set.
2023, International Journal of Digital Earth
Understanding and detecting the intended meaning in social media is challenging because social media messages contain varieties of noise and chaos that are irrelevant to the themes of interest. For example, conventional supervised classification approaches would produce inconsistent solutions to detecting and clarifying whether any given Twitter message is really about a wildfire event. Consequently, a renovated workflow was designed and implemented. The workflow consists of four sequential procedures: (1) apply latent semantic analysis and cosine similarity calculation to examine the similarity between Twitter messages; (2) apply Affinity Propagation to identify exemplars of Twitter messages; (3) apply the cosine similarity calculation again to automatically match the exemplars to known training results; and (4) apply accumulative exemplars to classify Twitter messages using a support vector machine approach. The overall correction ratio was over 90% when a series of ongoing and historical wildfire events were examined.
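Steps (1) and (3) of the workflow rest on cosine similarity between message vectors; a minimal term-frequency version is sketched below (the paper applies the measure to LSA-reduced vectors, and the tweets here are invented examples):

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two token lists via term-frequency vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

t1 = "wildfire spreading near the ridge".split()
t2 = "wildfire near ridge evacuations".split()
t3 = "great coffee this morning".split()

# The on-topic pair scores higher than the off-topic pair.
assert cosine_sim(t1, t2) > cosine_sim(t1, t3)
```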
2022
The bag-of-words representation of documents is often unsatisfactory, as it ignores relationships between important terms that do not co-occur literally. Improvements might be achieved by expanding the vocabulary with other relevant words, such as synonyms.
2022, Proceedings of the 31st Annual ACM Symposium on Applied Computing
Classifying tweets is an intrinsically hard task: tweets are short messages, which makes traditional bag-of-words approaches ineffective. In fact, bag-of-words approaches ignore relationships between important terms that do not co-occur literally. In this paper we resort to word-word co-occurrence information from a large corpus to expand the vocabulary of another corpus consisting of tweets. Our results show that we are able to reduce the number of erroneous classifications by 14% using co-occurrence information. CCS Concepts: • Information systems → Data mining; Web searching and information discovery; Social networks; • Applied computing → Document management and text processing.
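The idea of expanding a short text's vocabulary with co-occurring terms from a large background corpus can be sketched as follows (the toy corpus and the top-k expansion policy are illustrative; the paper's corpus and weighting are not reproduced here):

```python
from collections import defaultdict, Counter

def cooccurrence(corpus):
    """Sentence-level word-word co-occurrence counts from a background corpus."""
    co = defaultdict(Counter)
    for sent in corpus:
        toks = set(sent.split())
        for w in toks:
            for v in toks:
                if v != w:
                    co[w][v] += 1
    return co

def expand(text, co, k=2):
    """Append the top-k co-occurring terms for each word of a short text."""
    toks = text.split()
    extra = []
    for w in toks:
        extra += [v for v, _ in co[w].most_common(k) if v not in toks]
    return toks + extra

background = [
    "climate change causes warming",
    "climate change drives emissions",
    "warming oceans rise",
]
co = cooccurrence(background)
expanded = expand("climate policy", co)
assert "change" in expanded   # the tweet's vocabulary grew via co-occurrence
```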
2022
Topic modeling is a technique for reducing the dimensionality of large corpora of text. Latent Dirichlet allocation (LDA), the most prevalent form of topic modeling, improved upon earlier methods by introducing Bayesian iterative updates, providing a sound theoretical basis for modeling by iteration. Yet a piece of the modeling puzzle remains unsolved: the number of topics to model, K, is an as-yet unanswered question. This number of topics may also be called the dimensionality of the model. Integrally related is the puzzle of how to determine when a model has been best fit. Presented here are a brief history of the development of topic modeling, from its inception preceding LDA to the present, and a comparison of methods for determining what is a best-fit topic model, in pursuit of the most appropriate K.
2022, IET Networks
Web behaviour analysis of a collective user has provided a powerful means of studying collective user interests on the Internet. However, existing research merely analyses the behaviour of a single user who accesses multiple applications, or of multiple users who access one application. The authors propose a web behaviour classification model for a collective user, in which the title fields in HTTP flows are extracted from mirrored network traffic captured over a given period of time. The title fields, considered short abstracts of the whole web pages browsed by the users, are vectorized with natural language processing technologies. Specifically, the Latent Dirichlet allocation (LDA) algorithm is used to calculate the topic distribution probability matrix. Afterward, multi-class classifiers are trained and tested on the manually labelled probability distribution matrix output by the LDA algorithm to classify user behaviour topics. The experiments demonstrate that the highest classification accuracy of the model reaches 81.2%, obtained by combining the LDA algorithm with Random Forest classifiers.
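The paper feeds LDA's document-topic probability matrix into Random Forest classifiers; as a dependency-free stand-in for that final stage, the sketch below classifies precomputed topic-distribution vectors with a nearest-centroid rule instead (the vectors and labels are toy values, and nearest-centroid is explicitly not the paper's classifier):

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest_centroid_predict(train, labels, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    cents = {lab: centroid([r for r, l in zip(train, labels) if l == lab])
             for lab in set(labels)}
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(cents, key=lambda lab: dist(cents[lab], x))

# Toy document-topic vectors for two behaviour classes.
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y     = ["news", "news", "video", "video"]
assert nearest_centroid_predict(theta, y, [0.85, 0.15]) == "news"
```

The shape of the pipeline is the point: documents become low-dimensional topic-probability vectors, and any multi-class classifier can then be trained on them.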
2022, Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Topic models provide insights into document collections, and their supervised extensions also capture associated document-level metadata such as sentiment. However, inferring such models from data is often slow and cannot scale to big data. We build upon the "anchor" method for learning topic models to capture the relationship between metadata and latent topics by extending the vector-space representation of word co-occurrence to include metadata-specific dimensions. These additional dimensions reveal new anchor words that reflect specific combinations of metadata and topic. We show that these new latent representations predict sentiment as accurately as supervised topic models, and we find these representations more quickly without sacrificing interpretability.
2022, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Supervised models in NLP rely on large collections of text that closely resemble the intended testing setting. Unfortunately, matching text is often not available in sufficient quantity, and moreover, within any domain of text, data is often highly heterogeneous. In this paper we propose a method to distill the important domain signal as part of a multi-domain learning system, using a latent variable model in which parts of a neural model are stochastically gated based on the inferred domain. We compare the use of discrete versus continuous latent variables, operating in a domain-supervised or a domain semi-supervised setting, where the domain is known only for a subset of training inputs. We show that our model leads to substantial performance improvements over competitive benchmark domain adaptation methods, including methods using adversarial learning.
2022
Semantics derived from textual data provide representations for machine learning algorithms. These representations are an interpretable form of a high-dimensional sparse matrix that is given as input to the machine learning algorithms. Since learning methods are broadly classified as parametric and non-parametric, in this paper we examine the effects of both types of algorithm on high-dimensional sparse matrix representations. To derive the representations from the text data, we consider the TF-IDF representation, with justification given in the paper. We form representations of 50, 100, 500, 1000, and 5000 dimensions, over which we perform classification using Linear Discriminant Analysis and Naive Bayes as parametric learning methods, and Decision Trees and Support Vector Machines as non-parametric learning methods. We later provide the metrics on every single dimension of the representation and the effect of every single a...
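Varying the representation dimensionality as described (50 up to 5000 features) amounts to capping the TF-IDF vocabulary; a minimal sketch (selecting terms by document frequency and using the plain log IDF are assumptions, not the paper's exact setup):

```python
import math
from collections import Counter

def tfidf_matrix(docs, max_features):
    """TF-IDF vectors restricted to the max_features most frequent terms,
    i.e. a representation of controllable dimensionality."""
    df = Counter()
    for d in docs:
        df.update(set(d.split()))          # document frequency per term
    vocab = [w for w, _ in df.most_common(max_features)]
    n = len(docs)
    idf = {w: math.log(n / df[w]) for w in vocab}
    rows = []
    for d in docs:
        tf = Counter(d.split())
        rows.append([tf[w] * idf[w] for w in vocab])
    return vocab, rows

docs = ["spam offer now", "meeting agenda now", "spam offer deal"]
vocab, X = tfidf_matrix(docs, max_features=4)
assert len(vocab) == 4 and all(len(r) == 4 for r in X)
```

Changing `max_features` from 50 to 5000 is then a one-argument change, which is how a study like this one can sweep representation dimensions.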
2022, Bonfring International Journal of Data Mining
Conventional approaches to document classification need labeled data to build consistent and precise classifiers. However, labeled data are rarely available and are normally too expensive to obtain. Given a learning task for which training data are not available, abundant labeled data may exist for a different but related domain. One would like to make use of the related labeled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable efficient learning strategies when auxiliary data obey a different probability distribution. A co-clustering based classification scheme has been proposed earlier to deal with cross-domain text classification. Here, the idea underlying this approach is extended by making the latent semantic relationship between the two domains explicit. This objective is achieved with the use of Wikipedia. Consequently, the pathway that permits propagating labels between the two domains captures not only common words but also semantic concepts in accordance with the content of documents. Results empirically demonstrate the efficacy of the semantic-based approach to cross-domain classification on a variety of real data.
2022
Today, online social media are a most valuable and essential part of human life. People communicate online with their family, society, and friends to exchange several types of information, including text, images, audio, and video data. Therefore, users need control over the content published on their walls, so that they can avoid undesirable content being displayed there. Currently, however, social networks provide this service only to a very small extent. To provide this service, we propose user-defined filtering rules and a machine learning categorization technique in this paper. Filtering rules let users personalize the filtering norms applied to the contents published on their walls. Automatic categorization of messages into the proposed categories, based on their content, is possible through the machine learning technique. The proposed system can also restrict undesired images published on an online social network (OSN) user's private space by u...
2022
A social network is a set of people, organizations, or other social entities connected by a set of social relationships such as friendship, co-working, or information exchange. Online Social Networks (OSNs) usually do not support message filtering for the user. To solve this issue, we propose a system that allows OSN users to have direct control over the messages posted on their walls. Users can control the unwanted messages posted on their own private space, avoid having unwanted messages displayed, and block friends from their friends list, using filtering rules, content-based filtering, and short text classification.
2022, 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
Inferring locations from user texts on social media platforms is a non-trivial and challenging problem relating to public safety. We propose a novel non-uniform grid-based approach for location inference from Twitter messages using Quadtree spatial partitions. The proposed algorithm uses natural language processing (NLP) for semantic understanding and incorporates Cosine similarity and Jaccard similarity measures for feature vector extraction and dimensionality reduction. We chose Twitter as our experimental social media platform due to its popularity and effectiveness for the dissemination of news and stories about recent events happening around the world. Our approach is the first of its kind to make location inference from tweets using Quadtree spatial partitions and NLP, in hybrid word-vector representations. The proposed algorithm achieved significant classification accuracy and outperformed state-of-the-art grid-based content-only location inference methods by up to 24% in correctly predicting tweet locations within a 161 km radius and by 300 km in median error distance on benchmark datasets.
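A quadtree partition assigns each coordinate a path of quadrant choices, so nearby points share coarse cells; a minimal uniform-split sketch (the paper's non-uniform, data-driven splitting policy is not reproduced here):

```python
def quadtree_cell(lat, lon, depth, bounds=(-90.0, 90.0, -180.0, 180.0)):
    """Path of quadrant indices locating (lat, lon) in a quadtree of the
    given depth over a (south, north, west, east) bounding box."""
    s, n, w, e = bounds
    path = []
    for _ in range(depth):
        mid_lat, mid_lon = (s + n) / 2, (w + e) / 2
        q = 0
        if lat >= mid_lat:          # northern half
            s = mid_lat
        else:                        # southern half
            n = mid_lat
            q += 2
        if lon >= mid_lon:           # eastern half
            w = mid_lon
            q += 1
        else:                        # western half
            e = mid_lon
        path.append(q)
    return tuple(path)

# Two nearby points in London share a depth-3 cell; Sydney does not.
assert quadtree_cell(51.5, -0.1, 3) == quadtree_cell(51.6, -0.2, 3)
assert quadtree_cell(51.5, -0.1, 3) != quadtree_cell(-33.9, 151.2, 3)
```

Treating the cell path as a class label turns location inference into a classification problem over grid cells, which is the general shape of grid-based approaches like this one.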
2022
Statistical analysis of parliamentary roll call votes is an important topic in political science, as it reveals the ideological positions of members of parliament and factions. However, these positions depend on the issues debated and voted upon, as well as on attitudes towards the governing coalition. Therefore, analysis of carefully selected sets of roll call votes provides deeper knowledge about the behavior of members of parliament. However, in order to classify roll call votes according to their topic, automatic text classifiers have to be employed, as these votes number in the thousands. In this paper we present results of ongoing research on the thematic classification of roll call votes of the Lithuanian Parliament. This paper is also part of a larger project aiming to develop infrastructure for monitoring and analyzing roll call voting in the Lithuanian Parliament.
2022, ICST Transactions on Scalable Information Systems
Topic modelling is the new revolution in text mining. It is a statistical technique for revealing the underlying semantic structure in a large collection of documents. After analysing approximately 300 research articles on topic modelling, a comprehensive survey of topic modelling is presented in this paper. It includes a classification hierarchy, topic modelling methods, posterior inference techniques, the different evolution models of latent Dirichlet allocation (LDA), and its applications in different areas of technology, including scientific literature, bioinformatics, software engineering, and social network analysis. Quantitative evaluation of topic modelling techniques is also presented in detail for a better understanding of the concept of topic modelling. The paper concludes with a detailed discussion of the challenges of topic modelling, which will give researchers insight for good research.
2022
The use of semantic models is relevant in automated learning systems for tasks such as extracting knowledge from texts, information retrieval, abstracting, checking the correctness of vocabulary terms and definitions, and the automatic generation of associative links in hypertext databases. No less important is the development of new tools and instruments to automate semantic analysis. Such methods of analysis make it possible to collect basic information about a particular topic and about the focus and mood of texts, which further simplifies automated work with them, such as cataloguing, search, and comparison. The objectives of this study are: developing the LSA method with support for processing Ukrainian-language texts, justifying the choice of technologies for implementing the methods and tools of semantic analysis, and studying the effectiveness of the developed method and software.
2022
In recent years, Online Social Networks have become an important part of daily life for many. One fundamental issue with today's user walls is giving users the ability to control the messages posted on their own private space, to prevent unwanted content from being displayed. Up to now, user walls have provided little support for this requirement. To fill the gap, I propose a system allowing users to have direct control over the messages posted on their walls. This is achieved through a flexible rule-based system that allows users to customize the filtering criteria applied to their walls, and a machine-learning-based soft classifier that automatically labels messages in support of content-based filtering.
2022
This paper proposes a system that implements a content-based message filtering service for Online Social Networks (OSNs). Our system allows OSN users to have direct control over the messages posted on their walls. This is done through a rule-based system that allows a user to customize the filtering criteria applied to their walls, and a Machine Learning based classifier that automatically produces membership labels in support of our content-based filtering mechanism. Keywords: Online Social Networks; Information Filtering; Short Text Classification; Natural Language
2022, 2017 IEEE International Conference on Big Data (Big Data)
In traditional text classification, classes are mutually exclusive, i.e. it is not possible to have one text or text fragment classified into more than one class. In multi-label classification, on the other hand, an individual text may belong to several classes simultaneously. This type of classification is required by a large number of current applications, such as big data classification and image and video annotation. Supervised learning is the most used type of machine learning for the classification task. It requires large quantities of labeled data and the intervention of a human tagger in the creation of the training sets. When the data sets become very large or heavily noisy, this operation can be tedious, error-prone, and time-consuming. In this case, semi-supervised learning, which requires only a few labels, is a better choice. In this paper, we study and evaluate several methods to address the problem of multi-label classification using semi-supervised learning and data from social networks. First, we propose a linguistic pre-processing involving tokenisation, recognition of named entities, and hashtag segmentation in order to decrease the noise in this type of massive and unstructured real data, and we then perform word sense disambiguation using WordNet. Second, several experiments related to multi-label classification and semi-supervised learning are carried out on these data sets and compared to each other. This paper proposes a method for combining semi-supervised methods with a graph method for the extraction of subjects in social networks using a multi-label classification approach. Experiments show that the proposed model increases the precision of the classification by 4 percentage points compared to a baseline.
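Hashtag segmentation, one of the pre-processing steps mentioned, can be approximated for CamelCase tags with a short regular expression (a single heuristic only; the paper's full segmenter, named-entity recognition, and WordNet disambiguation are not reproduced):

```python
import re

def segment_hashtag(tag):
    """Split a CamelCase hashtag into lowercase words plus digit runs."""
    body = tag.lstrip("#")
    # Capitalized words, all-caps runs, or digit runs.
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", body)
    return [p.lower() for p in parts]

assert segment_hashtag("#ClimateChange") == ["climate", "change"]
assert segment_hashtag("#COP26") == ["cop", "26"]
```

Recovering "climate change" from "#ClimateChange" is exactly the kind of noise reduction that makes short, unstructured social-media text easier to classify.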
2022, arXiv: Machine Learning
The histogram method is a powerful non-parametric approach to estimating the probability density function of a continuous variable. But the construction of a histogram, compared with parametric approaches, demands a large number of observations to capture the underlying density function. Thus it is not suitable for analyzing a sparse data set, a collection of units each with a small amount of data. In this paper, by employing a probabilistic topic model, we develop a novel Bayesian approach to alleviating the sparsity problem in conventional histogram estimation. Our method estimates a unit's density function as a mixture of basis histograms, in which the number of bins for each basis, as well as their heights, is determined automatically. The estimation procedure is performed using fast and easy-to-implement collapsed Gibbs sampling. We apply the proposed method to synthetic data, showing that it performs well.
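The conventional histogram estimator that the paper improves upon can be written in a few lines; a minimal sketch (fixed, equal-width bins and toy data, in contrast to the paper's automatically determined bins):

```python
def histogram_density(data, bins, lo, hi):
    """Histogram estimate of a pdf: bin counts normalised so the area is 1."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        i = min(int((x - lo) / width), bins - 1)   # clamp x == hi into last bin
        counts[i] += 1
    n = len(data)
    return [c / (n * width) for c in counts]

data = [0.1, 0.2, 0.25, 0.6, 0.9]
dens = histogram_density(data, bins=4, lo=0.0, hi=1.0)
area = sum(d * 0.25 for d in dens)
assert abs(area - 1.0) < 1e-9    # the estimated density integrates to one
```

With only five observations the estimate is very rough, which illustrates the sparsity problem the paper's mixture-of-basis-histograms approach is designed to alleviate.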
2022
Topic modeling has emerged as a popular learning technique not only for mining text representations, but also for modeling authors' interests and influence, as well as for predicting linkage among documents or authors. However, few existing topic models distinguish and make use of prior knowledge regarding the different importance of documents (or authors) over topics. In this paper, we focus on the ability of topic models to model author interests and influence. We introduce a pair-wise learning-to-rank algorithm into the topic modeling process, with the hypothesis that investigating and exploiting prior knowledge of authors' different importance over topics can help achieve more accurate and cohesive topic modeling results. Moreover, the framework integrating a learning-to-rank mechanism with topic modeling can help facilitate ranking of new authors. In this paper, we apply this integrated model to two applications: the task of predicting future award wi...