Enhancing Concept Based Modeling Approach for Blog Classification

Enhancing Automatic Blog Classification Using Concept-Category Vectorization

Knowledge Engineering …, 2011

Y. Wang and T. Li (Eds.): Knowledge Engineering and Management, AISC 123, pp. 487-497. springerlink.com

Blog classification: Adding linguistic knowledge to improve the k-nn algorithm

Intelligent Information Processing IV, 2008

Blogs are interactive, regularly updated websites that can be seen as diaries. They are composed of articles on distinct topics, so Information Retrieval approaches need to be developed for this new form of web knowledge. The first important step in this process is the categorization of the articles. This paper compares several methods that combine linguistic knowledge with the k-NN algorithm for the automatic categorization of weblog articles.
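To make the k-NN baseline concrete, here is a minimal sketch that assumes plain bag-of-words vectors and cosine similarity. It does not include the linguistic knowledge the paper adds, and the function names and toy articles are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(text, labeled_articles, k=3):
    """Return the majority label among the k most similar training articles."""
    query = bag_of_words(text)
    ranked = sorted(
        labeled_articles,
        key=lambda item: cosine(query, bag_of_words(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy training set: (article text, category).
TRAIN = [
    ("the team won the basketball game", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the new phone has a faster chip", "tech"),
    ("the laptop ships with a new chip", "tech"),
]

print(knn_classify("a faster chip in the phone", TRAIN, k=3))
```

Linguistic knowledge would typically enter in `bag_of_words`, for example by keeping only certain parts of speech before counting.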

Blog Classification Using K-Means

Proceedings of the 11th International Conference on Enterprise Information, 2009

With the recent exponential growth of blogs, a vast amount of important data has appeared on them. However, the dynamic, autonomous, and personal nature of blogs makes blog pages quite different from general web pages in many respects, which causes problems that general search engines cannot handle properly. The problem this study focuses on is that blog pages are inherently poorly organized and heavily duplicated, so blog search engines can only return poorly organized, duplicated results. To solve this problem, we propose a blog classification method using K-means and present an approach for reorganizing blog search results based on this method. First, we review the current status and performance of blogs and blog search engines. Second, we adopt the K-means algorithm as a base algorithm and devise a blog title classification method to reorganize the blog titles returned by a search engine. Finally, we implement a prototype system, evaluate the algorithm's effectiveness, and present conclusions and directions for future work. We expect this algorithm to improve the usability of current blog search engines.
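As an illustration of grouping search-result titles with K-means, the sketch below assumes word-count vectors, squared Euclidean distance, and caller-supplied initial centroids so the run is deterministic. These choices, the function names, and the sample titles are assumptions for illustration, not the paper's prototype.

```python
from collections import Counter

def vectorize(titles):
    """Turn each title into a word-count vector over a shared vocabulary."""
    vocab = sorted({w for t in titles for w in t.lower().split()})
    vectors = []
    for title in titles:
        counts = Counter(title.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(vectors, initial_centroids, iterations=10):
    """Plain K-means: assign to nearest centroid, then recompute the means."""
    centroids = [list(c) for c in initial_centroids]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical titles returned by a blog search engine.
TITLES = [
    "python tutorial for beginners",
    "travel guide paris",
    "python tutorial advanced",
    "paris travel tips",
]
VECTORS = vectorize(TITLES)
print(k_means(VECTORS, [VECTORS[0], VECTORS[1]]))
```

Titles that land in the same cluster can then be shown together, which is one way to reorganize a flat, duplicated result list.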

Automatic Classification of Unstructured Blog Text

Journal of Intelligent Learning Systems and Applications, 2013

Automatic classification of blog entries is generally treated as a semi-supervised machine learning task, in which the blog entries are automatically assigned to one of a set of pre-defined classes based on features extracted from their textual content. This paper attempts automatic classification of unstructured blog entries through pre-processing steps (tokenization, stop-word elimination, and stemming), statistical techniques for feature set extraction, and feature set enhancement using semantic resources, followed by modeling with two alternative machine learning models: the naïve Bayesian model and the artificial neural network (ANN) model. Empirical evaluations indicate that this multi-step approach achieves good overall classification accuracy on unstructured blog text datasets with both models. However, the naïve Bayesian model clearly outperforms the ANN-based model when only a smaller feature set is available, which is usually the case when a blog topic is recent and the number of available training datasets is restricted.
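The pre-processing and naïve Bayesian steps can be sketched as follows. This is a simplified illustration under stated assumptions: the stop-word list is a tiny hand-picked subset, `crude_stem` is a stand-in for a real stemmer, Laplace smoothing is assumed, and the semantic feature enhancement and ANN model are omitted.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # illustrative subset

def crude_stem(token):
    """Strip a few common suffixes; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, and stem."""
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]

class NaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.term_counts[label].update(preprocess(text))
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, text):
        # Terms never seen in training are ignored.
        terms = [t for t in preprocess(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in terms)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical toy blog entries.
model = NaiveBayes().fit(
    ["the team is winning games", "players scored goals",
     "new phones and laptops", "the chip is faster"],
    ["sports", "sports", "tech", "tech"],
)
print(model.predict("faster phones"))
```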

Categorization of Blogs through Similarity Analysis

2007 IEEE Intelligence and Security Informatics, 2007

We describe a new model for evaluating similarities among a large number of web logs, and compare several algorithms using the model. Possible uses of this include isolating and tracking like-minded networks for surveillance and improved categorization. Our model consists of similarity analysis combined with clustering. Experimental results show that our algorithm is able to separate blogs into categories, consistently achieving over 90% success rate. Keywords: social networking, clustering, weblog.

[Excerpt from the paper body:] ... the set E of undirected edges. An edge e_ij between vertices v_i and v_j represents a URL link that both blogs point to. In Figure 2.1a, we see two blogs A and B, each with two outbound links. An example scenario is where A and B represent two sports blogs, based on a news article from ESPN regarding a recent basketball game. Figure 2.1a shows how this scenario is represented in our model.
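The shared-link idea, where two blogs are connected when they point to the same URL, can be sketched as below. The Jaccard score is an assumed similarity measure chosen for illustration, the URLs are made up, and the clustering stage of the paper is not shown.

```python
from itertools import combinations

def shared_link_edges(outbound):
    """Map each pair of blogs to the set of URLs that both link to."""
    edges = {}
    for a, b in combinations(sorted(outbound), 2):
        common = outbound[a] & outbound[b]
        if common:
            edges[(a, b)] = common
    return edges

def jaccard(links_a, links_b):
    """Shared links divided by all distinct links of the two blogs."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0

# Hypothetical outbound links: A and B both cite the same game recap.
OUTBOUND = {
    "A": {"espn.example/game-recap", "a.example/about"},
    "B": {"espn.example/game-recap", "b.example/archive"},
    "C": {"c.example/recipes"},
}

print(shared_link_edges(OUTBOUND))
print(jaccard(OUTBOUND["A"], OUTBOUND["B"]))
```

A clustering step could then group blogs by these pairwise scores.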

A Technical Study and Analysis on Fuzzy Similarity Based Models For Text Classification

International Journal of Data Mining & Knowledge Management Process, 2012

As technology advances, efficient and effective text document classification has become a challenging and much-needed capability for assigning text documents to mutually exclusive categories. Fuzzy similarity provides a way to measure how similar the features of different documents are. This paper gives a technical review of various fuzzy similarity based models. The models are discussed and compared to clarify when each is useful and necessary, and the different methodologies built on fuzzy similarity are surveyed. The review shows how text and web documents can be categorized efficiently into different categories. Experimental results for these models are also discussed, and technical comparisons of each model's parameters are shown in the form of a 3-D chart. This study and technical review provide a strong base for research on fuzzy similarity based text document categorization.

Mining Wikipedia Knowledge to improve document indexing and classification

2010

Weblogs are an important source of information that requires automatic techniques to categorize them into "topic-based" content, to facilitate their future browsing and retrieval. In this paper we propose a new tf.idf measure and illustrate its effectiveness. The proposed Conf.idf and Catf.idf measures are based solely on a terms-to-concepts-to-categories (TCONCAT) mapping method that utilizes Wikipedia. Wikipedia is treated as a large-scale Web encyclopaedia with a huge number of high-quality articles and categorical indexes. Using this knowledge base, our proposed framework solves the weblog classification problem in two stages. The first stage finds the terms that belong to a unique concept (article) and disambiguates the terms that belong to more than one concept. The second stage determines the categories to which the concepts found belong. Experimental results confirm that the proposed system can efficiently distinguish weblogs that belong to more than one category, and that it performs better than traditional statistical Natural Language Processing (NLP) approaches.
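One way to picture a concept-level weight is to map terms to concepts first and then apply the standard tf.idf formula to the concept counts. The sketch below assumes exactly that; the paper's actual Conf.idf and Catf.idf definitions may differ, and `TERM_TO_CONCEPT` is a hypothetical hand-made stand-in for a Wikipedia lookup with no disambiguation step.

```python
import math
from collections import Counter

# Hypothetical term-to-concept mapping (stand-in for a Wikipedia lookup).
TERM_TO_CONCEPT = {
    "nba": "Basketball",
    "dunk": "Basketball",
    "hoops": "Basketball",
    "cpu": "Computer hardware",
    "gpu": "Computer hardware",
}

def to_concepts(tokens):
    """Replace each known term with its concept; drop unknown terms."""
    return [TERM_TO_CONCEPT[t] for t in tokens if t in TERM_TO_CONCEPT]

def tf_idf(docs):
    """Standard tf.idf over lists of items (terms or concepts)."""
    n = len(docs)
    df = Counter(item for doc in docs for item in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            item: (count / len(doc)) * math.log(n / df[item])
            for item, count in tf.items()
        })
    return weights

# Hypothetical tokenized weblog posts.
POSTS = [["nba", "dunk", "cpu"], ["hoops", "nba"], ["gpu", "cpu"]]
CONCEPT_WEIGHTS = tf_idf([to_concepts(p) for p in POSTS])
print(CONCEPT_WEIGHTS[1])
```

Different surface terms such as "nba" and "hoops" collapse into one concept, so a post's weight reflects the topic rather than the exact wording. A category-level variant would add a second mapping from concepts to categories before weighting.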

Classifying unlabeled short texts using a fuzzy declarative approach

Language Resources and Evaluation, 2012

Web 2.0 provides user-friendly tools that allow people to create and publish content online. User-generated content often takes the form of short texts (e.g., blog posts, news feeds, snippets). This has motivated increasing interest in the analysis of short texts and, specifically, in their categorisation. Text categorisation is the task of classifying documents into a certain number of predefined categories. Traditional text classification techniques are mainly based on statistical analysis of word frequency and have proved inadequate for short texts, where words occur too rarely. In addition, the classic approach to text categorisation is based on a learning process that requires a large number of labeled training texts to achieve accurate performance. However, labeled documents might not be available, whereas unlabeled documents can be collected easily. This paper presents an approach to text categorisation that does not need a preclassified set of training documents. The proposed method only requires the category names as user input. Each category is defined by means of an ontology of terms modelled by a set of what we call proximity equations. Hence, our method is not based on category occurrence frequency; it depends on the definition of each category and on how well the text fits that definition. The proposed approach is therefore appropriate for short text classification, where the frequency of occurrence of a category is very small or even zero. Another feature of our method is that the classification process relies on the flexible matching and knowledge representation capabilities of Bousi∼Prolog, an extension of the standard Prolog language. This declarative approach provides a text classifier that is quick and easy to build, and a classification process that is easy for the user to understand. The experimental results showed that the proposed method achieved reasonably useful performance.
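The idea of classifying without training data, using only category definitions, can be illustrated in Python rather than Bousi∼Prolog. In this sketch each category is a set of terms with an assumed proximity degree in [0, 1], and a text is scored by summing the degrees of the terms it contains. The table, the scoring rule, and the function names are hypothetical and are not the paper's proximity equations.

```python
# Hypothetical proximity degrees between terms and category names.
CATEGORY_TERMS = {
    "sports": {"sports": 1.0, "football": 0.9, "match": 0.7, "team": 0.6},
    "politics": {"politics": 1.0, "election": 0.9, "senate": 0.8, "vote": 0.8},
}

def category_scores(text):
    """Sum the proximity degrees of the text's tokens for each category."""
    tokens = text.lower().split()
    return {
        category: sum(terms.get(token, 0.0) for token in tokens)
        for category, terms in CATEGORY_TERMS.items()
    }

def classify(text):
    """Return the best-matching category, or None if nothing matches."""
    scores = category_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify("the team lost the match"))
```

Because the score depends on how close the text's terms are to the category definition, a short text can be classified even when the category name itself never appears in it.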

Weblog extraction with fuzzy classification methods

2009

This paper uses folksonomies and fuzzy clustering algorithms to establish term-relevant related results. It proposes a meta search engine that can search for vaguely associated terms and aggregate them into several meaningful cluster categories. The potential of fuzzy weblog extraction is illustrated with a simple example, and the added value and possible future studies are discussed in the conclusion.
