Improving KNN-based e-mail classification into folders generating class-balanced datasets

2008

In this paper we deal with an e-mail classification problem known as e-mail foldering, which consists of classifying incoming mail into the different folders previously created by the user. This task has received less attention in the literature than spam filtering and is quite complex due to the (usually large) cardinality (number of folders) and lack of balance (documents per class) of the class variable. On the other hand, proximity-based algorithms have been used in a wide range of fields for decades. One of the main drawbacks of these classifiers, known as lazy classifiers, is their computational load, since they must compute the distance from a new sample to each point in the vector space to decide which class it belongs to. This is why most of the techniques developed for these classifiers consist of editing and condensing the training set. In this work we approach the problem of e-mail classification into folders. It is suggested...
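The computational load of a lazy classifier mentioned in this abstract can be illustrated with a minimal brute-force k-nearest-neighbour sketch. It is not the paper's method: the function name `knn_predict`, the two-dimensional feature vectors, and the folder labels are all hypothetical, chosen only to show that every prediction visits every training point.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    training points. `train` is a list of (vector, label) pairs."""
    # Lazy classification: no model is built in advance, so each prediction
    # computes the distance from the query to every training point.
    distances = [(math.dist(vec, query), label) for vec, label in train]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 2-D feature vectors with folder labels.
train = [
    ((1.0, 0.0), "work"),
    ((0.9, 0.1), "work"),
    ((0.0, 1.0), "personal"),
    ((0.1, 0.9), "personal"),
    ((0.2, 0.8), "personal"),
]

print(knn_predict(train, (0.8, 0.2), k=3))
```

Because the cost of one prediction grows linearly with the size of the training set, shrinking that set (editing and condensing) directly reduces prediction time.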

Learning to classify e-mail

Information Sciences, 2007

In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. Firstly, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks as it runs fast on large and high-dimensional databases, is easy to tune and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naïve Bayes. We introduce a new accurate feature selector with linear time complexity. Secondly, we examine the applicability of the semi-supervised co-training paradigm for spam e-mail filtering by employing random forests, support vector machines, decision trees and naïve Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples to an accuracy rate only 5% lower than that of a classifier trained on all labelled examples. We investigate the performance of co-training with one natural feature split and show that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits.

Incremental E-Mail Classification and Rule Suggestion Using Simple Term Statistics

2014

In this paper, we present and use a method for e-mail categorization based on simple term statistics updated incrementally. We apply simple term statistics to two different tasks. The first task is to predict folders for classification of e-mails when large numbers of messages are required to remain unclassified. The second task is to support users who define rule bases for the same classification task, by suggesting suitable keywords for constructing Ripple Down Rule bases in this scenario. For both tasks, the results are compared with a number of standard machine learning algorithms. The comparison shows that the simple term statistics method achieves a higher level of accuracy than other machine learning methods when taking computation time into account.

A Review of Text Classification Approaches for E-mail Management

ijetch.org

The continuing explosive growth of textual content within the World Wide Web has given rise to the need for sophisticated Text Classification (TC) techniques that combine efficiency with high quality of results. E-mail filtering and email organization is an ...

E-mail classification by decision forests

2003

We investigate the use of decision forests for automated e-mail filing into folders and junk e-mail filtering. The experiments show that decision forests offer the following advantages: (i) ability to deal with the large dimensionality of feature vectors in text categorization, (ii) improved accuracy of the ensemble over the individual classifiers and favourable comparison with a number of other highly accurate classifiers including neural networks and boosted decision trees, and (iii) acceptable computational expenses.

Phrases and feature selection in e-mail classification

2004

In this paper we study the effectiveness of using a phrase-based representation in e-mail classification, and the effect this approach has on a number of machine learning algorithms. We also evaluate various feature selection methods and reduction levels for the bag-of-words representation on several learning algorithms and corpora. The results show that the phrase-based representation and feature selection methods can be used to increase the performance of e-mail classifiers.

Towards an adaptive mail classifier

Proc. of Italian Association …, 2002

We introduce a technique based on data mining algorithms for classifying incoming messages, as a basis for an overall architecture for maintenance and management of e-mail messages. We exploit clustering techniques for grouping structured and unstructured information extracted from e-mail messages in an unsupervised way, and exploit the resulting algorithm in the process of folder creation (and maintenance) and e-mail redirection. Some initial experimental results show the effectiveness of the technique, both from an efficiency and a quality-of-results viewpoint.

E-mail categorization using partially related training examples

Proceedings of the 5th Information Interaction in Context Symposium on - IIiX '14, 2014

Automatic e-mail categorization with traditional classification methods requires labelling of training data. In a real-life setting, this labelling disrupts the user's workflow. We argue that it might be helpful to use documents, which are generally well structured in directories on the file system, as training data for supervised e-mail categorization, thereby reducing the labelling effort required from users. Previous work demonstrated that the characteristics of documents and e-mail messages are too different to use organized documents as training examples for e-mail categorization using traditional supervised classification methods.

An Approach to Email Classification Using Bayesian Theorem

Global journal of computer science and technology, 2012

Email classifiers based on Bayes' theorem have been very effective in spam filtering due to their strong categorization ability and high precision. This paper proposes an algorithm for email classification based on Bayes' theorem. The purpose is to automatically classify mails into predefined categories. The algorithm assigns an incoming mail to its appropriate category by examining its textual contents. The experimental results indicate that the proposed algorithm is a reasonable and effective method for email classification.
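As a rough illustration of assigning a message to a predefined category from its text using Bayes' theorem, the sketch below implements a standard multinomial naive Bayes classifier with add-one smoothing. This is an assumed, generic technique and not necessarily the algorithm proposed in the paper; the function names `train_nb` and `classify_nb`, the whitespace tokenization, and the toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Collect per-category document and word counts.
    `docs` is a list of (text, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, category in docs:
        class_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify_nb(model, text):
    """Return the category with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_docs in class_counts.items():
        # Log prior: P(category)
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Log likelihood with add-one (Laplace) smoothing
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical toy training messages.
docs = [
    ("project meeting agenda", "work"),
    ("quarterly budget report", "work"),
    ("family dinner sunday", "personal"),
    ("holiday photos family", "personal"),
]
model = train_nb(docs)
print(classify_nb(model, "budget meeting tomorrow"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing keeps a single unseen word from zeroing out a category.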