Automatic Classification of Unstructured Blog Text

Automatic Text Classification Of News Blog using Machine Learning

In recent years, the tremendous growth of information has made text classification a necessity. In this project, the data is classified into groups according to its content. This is done by training a machine learning model on a set of full-text documents. This paper illustrates the process of automatic text classification. We vectorize the training data using a count vectorizer. TF-IDF (Term Frequency-Inverse Document Frequency) is then used to normalize the data. Finally, a Stochastic Gradient Descent (SGD) classifier is used to classify the data.
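
The three steps named in this abstract (count vectors, TF-IDF normalization, a classifier trained with stochastic gradient descent) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the smoothed IDF formula, the logistic loss, the learning rate, and the four toy documents are all assumptions, and the helper names (`count_vectorize`, `tfidf`, `train_sgd`, `predict`) are hypothetical.

```python
import math
import random
from collections import Counter


def count_vectorize(docs):
    """Return one term-count mapping per document (whitespace tokens)."""
    return [Counter(doc.lower().split()) for doc in docs]


def tfidf(counts):
    """Reweight raw counts by a smoothed IDF, then L2-normalize each document."""
    n_docs = len(counts)
    df = Counter(term for c in counts for term in c)
    # Assumed smoothing: idf = ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for c in counts:
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors, idf


def train_sgd(vectors, labels, epochs=50, lr=0.5, seed=0):
    """Fit a binary linear model with plain SGD on the logistic loss (labels 0/1)."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = bias + sum(weights.get(t, 0.0) * v for t, v in vectors[i].items())
            err = 1.0 / (1.0 + math.exp(-z)) - labels[i]
            for t, v in vectors[i].items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
            bias -= lr * err
    return weights, bias


def predict(doc, idf, weights, bias):
    """Vectorize an unseen document with the fitted IDF and return class 0 or 1."""
    counts = Counter(doc.lower().split())
    vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    z = bias + sum(weights.get(t, 0.0) * v / norm for t, v in vec.items())
    return 1 if z > 0 else 0


# Toy training set (invented for illustration): 0 = sports, 1 = politics.
docs = [
    "team wins the football match",
    "striker scores late goal in match",
    "parliament passes the budget bill",
    "minister defends new tax bill",
]
labels = [0, 0, 1, 1]

vectors, idf = tfidf(count_vectorize(docs))
weights, bias = train_sgd(vectors, labels)
print(predict("goal in the football match", idf, weights, bias))
print(predict("parliament debates tax budget", idf, weights, bias))
```

Terms unseen during training (such as "debates" above) are simply ignored at prediction time, which is the usual behaviour of a fixed-vocabulary vectorizer.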

Blog classification: Adding linguistic knowledge to improve the k-nn algorithm

Intelligent Information Processing IV, 2008

Blogs are interactive, regularly updated websites that can be seen as diaries. These websites are composed of articles on distinct topics. It is therefore necessary to develop Information Retrieval approaches for this new source of web knowledge. The first important step of this process is the categorization of the articles. This paper compares several methods that use linguistic knowledge with the k-NN algorithm for automatic categorization of weblog articles.
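
For orientation, a plain k-NN categorizer over bag-of-words vectors might look like the sketch below. It is only a baseline under stated assumptions: cosine similarity, whitespace tokenization, k = 3, and the toy articles are all invented here, and the linguistic-knowledge variants that the paper compares are not reproduced.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens mapped to their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training, k=3):
    """Return the majority label among the k most similar training articles."""
    query = bag_of_words(text)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, bag_of_words(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled articles, invented for illustration.
training = [
    ("new phone camera review", "tech"),
    ("laptop battery and screen review", "tech"),
    ("phone software update released", "tech"),
    ("easy pasta recipe with tomato", "food"),
    ("baking bread recipe at home", "food"),
    ("tomato soup with fresh bread", "food"),
]

print(knn_classify("review of the new laptop screen", training))
print(knn_classify("fresh tomato pasta recipe", training))
```

Linguistic knowledge would typically enter at the `bag_of_words` step, for example by keeping only certain word classes, but which variants the paper evaluates is not stated in the abstract.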

Enhancing Concept Based Modeling Approach for Blog Classification

Blogs are user-generated content that discusses various topics. Over the past 10 years, social web content has grown at a fast pace, and research projects are finding ways to channel this information using text classification techniques. Existing classification techniques follow only Boolean (or crisp) logic. This paper extends our previous work with a framework in which fuzzy clustering is optimized with fuzzy similarity to perform blog classification. Wikipedia, a knowledge base widely accepted by the research community, was used for feature selection and classification. Our experimental results show that the proposed framework significantly improves precision and recall in classifying blogs.

Enhancing Automatic Blog Classification Using Concept-Category Vectorization

Knowledge Engineering …, 2011

Y. Wang and T. Li (Eds.): Knowledge Engineering and Management, AISC 123, pp. 487-497. springerlink.com

An in-depth exploration of Bangla blog post classification

Bulletin of Electrical Engineering and Informatics, 2021

Bangla blogs are increasing rapidly in the information era, and consequently they have diverse layouts and categorizations. In this situation, automated blog post classification is a comparatively efficient way to organize Bangla blog posts in a standard way, so that users can easily find the articles that interest them. In this research, nine supervised learning models are utilized and compared for the classification of Bangla blog posts: support vector machine (SVM), multinomial naïve Bayes (MNB), multi-layer perceptron (MLP), k-nearest neighbours (k-NN), stochastic gradient descent (SGD), decision tree, perceptron, ridge classifier, and random forest. Moreover, to evaluate performance in predicting blog posts against eight categories, three feature extraction techniques are applied, namely unigram TF-IDF (term frequency-inverse document frequency), bigram TF-IDF, and trigram TF-IDF. The majority of the classifiers show above 80% accuracy. Other performance evaluation metrics also show good results when the selected classifiers are compared.
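
The difference between the unigram, bigram, and trigram feature sets comes down to which token sequences are counted before TF-IDF weighting. The sketch below shows only that extraction step. The helper names are hypothetical, the English sentence is a stand-in for a Bangla post, and whether the paper's "bigram" setting also includes unigrams is not stated in the abstract, so each n is treated separately here.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count the word n-grams of a whitespace-tokenized text."""
    return Counter(word_ngrams(text.split(), n))


# Stand-in sentence, invented for illustration.
post = "the match ended in a draw"

for n, name in ((1, "unigram"), (2, "bigram"), (3, "trigram")):
    features = ngram_counts(post, n)
    print(name, len(features), sorted(features)[:3])
```

Higher-order n-grams capture more local word order but produce a larger and sparser feature space, which is one reason such studies compare the three settings rather than assume one is best.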

Blog Classification Using K-Means

Proceedings of the 11th International Conference on Enterprise Information, 2009

With the recent exponential growth of blogs, a vast amount of important data has appeared on them. However, the dynamic, autonomous, and personal features of blogs make blog pages quite different from general web pages in many respects. As a result, they cause many problems that general search engines cannot handle properly. One problem we focus on in this study is that blog pages are inherently poorly organized and heavily duplicated. This means that blog search engines can only provide poorly organized and duplicated results. To solve this problem, we propose a blog classification method using K-means and present a blog search result reorganization approach based on this method. In this study, we first review the current status and performance of blogs and blog search engines. Second, we adopt the K-means algorithm as a base algorithm and devise a blog title classification method to reorganize the blog titles returned by a search engine. Finally, by implementing a prototype system of our algorithm, we evaluate its effectiveness and present a conclusion and directions for future work. We expect that this algorithm can improve the usability of current blog search engines.
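
As a rough illustration of grouping search-result titles with K-means, the sketch below clusters normalized bag-of-words title vectors. It is an assumption-laden toy, not the prototype described in the abstract: the vectorization, the fixed initial centroids (`init_indices`), and the four titles are all invented, and K-means itself is an unsupervised clustering step even though the abstract calls the method classification.

```python
import math
from collections import Counter


def vectorize(titles):
    """Turn titles into L2-normalized dense count vectors over a shared vocabulary."""
    vocab = sorted({w for t in titles for w in t.lower().split()})
    rows = []
    for title in titles:
        counts = Counter(title.lower().split())
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return rows


def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, init_indices, max_iter=20):
    """Lloyd's algorithm; initial centroids are copies of the rows at init_indices."""
    centroids = [list(rows[i]) for i in init_indices]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: sq_dist(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # converged: no title changed cluster
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [r for r, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical search-result titles, invented for illustration.
titles = [
    "python tutorial for beginners",
    "advanced python tutorial",
    "chocolate cake recipe",
    "easy cake recipe for beginners",
]

clusters = kmeans(vectorize(titles), init_indices=[0, 2])
for cluster_id, title in sorted(zip(clusters, titles)):
    print(cluster_id, title)
```

Fixing the initial centroids keeps the toy run reproducible; a real system would more likely use several random or spread-out initializations and keep the best result.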

A comparative study of machine learning techniques in blog comments spam filtering

2010

In this paper we compare four machine learning techniques for blog comment spam filtering: Naïve Bayes, k-nearest neighbor, neural networks, and support vector machines. For this comparative study we used a blog comment corpus that has been affected by spam, which is our case study in this work. We classify the comments of this corpus, which has 50 pages and 1024 blog comments, into spam and non-spam. The percentage of spam in this corpus is 67%.
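
Of the four techniques, Naïve Bayes is the simplest to show end to end. The following sketch is a multinomial Naïve Bayes filter with add-one smoothing, written under assumptions: the tokenization, the smoothing, and the five toy comments are invented here and are unrelated to the 1024-comment corpus used in the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify_nb(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical comments, invented for illustration.
samples = [
    ("buy cheap pills online now", "spam"),
    ("cheap watches click here", "spam"),
    ("click here to win money now", "spam"),
    ("great post thanks for sharing", "non-spam"),
    ("i disagree with your second point", "non-spam"),
]

model = train_nb(samples)
print(classify_nb("cheap pills click here", model))
print(classify_nb("thanks for the great post", model))
```

With a class balance like the 67% spam share reported above, the log prior already favours the spam label, so accuracy alone can flatter a filter; per-class precision and recall give a clearer picture.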

Discovery of Potential Topics from Blog Articles by Machine Learning

This paper presents a method for discovering potential topics in the blogosphere. We define a potential topic as an unpopular phrase that has the potential to become a hot topic. To discover potential topics, this method builds a classifier that detects the potential of a topic from topic frequency transitions in blog articles. First, the method extracts candidate potential topics from categorized blog articles, because categorization enables us to extract specialists. To extract potential topics from the candidates, a classifier for detecting potential topics is built from topic frequency transition data. For this learning, we propose two types of learning methods: supervised learning and semi-supervised learning. Although supervised learning provides more precise results, it requires an enormous amount of labeled data, and creating labeled data is costly and difficult. On the other hand, semi-supervised learning can build a classifier from a small amount of labeled data and a large amount of unlabeled data. Experimental results with real blog data show the effectiveness of the proposed method.

Weblog and short text feature extraction and impact on categorisation

Journal of Intelligent & Fuzzy Systems

The characterisation and categorisation of weblogs and other short texts have become an important research theme in the areas of topic/trend detection and pattern recognition, amongst others. The value of analysing and characterising short text is to understand and identify the features that distinguish them, thereby improving the input to the classification process. In this research work, we analyse a large number of text features and establish which combinations are useful for discriminating between the different genres of short text. Having identified the most promising features, we then confirm our findings by performing the categorisation task using three approaches: the Gaussian and SVM classifiers and the K-means clustering algorithm. Several hundred combinations of features were analysed in order to identify the best combinations, and the results confirmed the observations made. The novel aspect of our work is the detection of the best combination of individual metrics that are identified as potential features for the categorisation process.