The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining
Related papers
A meta-heuristic approach for improving the accuracy in some classification algorithms
Computers & Operations Research, 2011
Current classification algorithms usually do not try to achieve a balance between fitting and generalization when they infer models from training data. Furthermore, current algorithms ignore the fact that there may be different penalty costs for the false-positive, false-negative, and unclassifiable types. Thus, their performance may not be optimal or may even be coincidental. This paper proposes a meta-heuristic approach, called the Convexity Based Algorithm (CBA), to address these issues. The new approach aims at optimally balancing the data fitting and generalization behaviors of models when some traditional classification approaches are used. The CBA first defines the total misclassification cost (TC) as a weighted function of the three penalty costs and the corresponding error rates mentioned above. Next, it partitions the training data into regions. This is done according to some convexity properties derivable from the training data and the traditional classification method to be used in conjunction with the CBA. The CBA then uses a genetic approach to determine the optimal levels of fitting and generalization. The TC is used as the fitness function in this genetic approach. Twelve real-life datasets from a wide spectrum of domains were used to better understand the effectiveness of the proposed approach. The computational results indicate that the CBA may potentially fill in a critical gap in the use of current or future classification algorithms.
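The total misclassification cost described in this abstract can be illustrated with a small sketch. The abstract only says that TC is a weighted function of the three penalty costs and error rates; the linear weighted sum below, the function name, and the example counts and costs are assumptions made for illustration, not the paper's exact formulation.

```python
def total_misclassification_cost(n_fp, n_fn, n_uc, n_total, c_fp, c_fn, c_uc):
    """Weighted sum of the three error rates (assumed linear form of TC)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    rate_fp = n_fp / n_total  # false-positive rate
    rate_fn = n_fn / n_total  # false-negative rate
    rate_uc = n_uc / n_total  # unclassifiable rate
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc


# Two hypothetical models, each evaluated on 100 test points.
# Model A makes 12 errors, model B only 10, so accuracy alone prefers B.
tc_a = total_misclassification_cost(10, 2, 0, 100, c_fp=1, c_fn=20, c_uc=3)
tc_b = total_misclassification_cost(4, 6, 0, 100, c_fp=1, c_fn=20, c_uc=3)

print(f"TC(A) = {tc_a:.2f}, TC(B) = {tc_b:.2f}")
```

With these made-up costs, a false negative is twenty times as expensive as a false positive, so model A has the lower total cost even though it makes more errors overall. This is the kind of situation in which unequal penalty costs change which model is preferred.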
Selection of Accurate and Robust Classification Model for Binary Classification Problems
Communications in Computer and Information Science, 2009
In this paper we investigate the trade-off in selecting an accurate, robust, and cost-effective classification model for binary classification problems. Through empirical observation, we present an evaluation of one-class and two-class classification models. We experimented with four two-class and one-class classifier models on five UCI datasets. We evaluated the classification models with Receiver Operating Characteristic (ROC) curves, cross-validation error, and the pair-wise Q statistic. Our finding is that, given a large amount of relevant training data, two-class classifiers perform better than one-class classifiers on binary classification problems. This is due to the ability of a two-class classifier to use negative data samples in its decision. When sufficient training data is not available, the one-class classification model performs better.
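The pair-wise Q statistic mentioned in this abstract is not defined there. A common definition is Yule's Q computed from the joint correctness of two classifiers, and the sketch below assumes that definition; the function name and the example vectors are hypothetical.

```python
def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-sample correctness flags."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same samples")
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1  # both correct
        elif not a and not b:
            n00 += 1  # both wrong
        elif a:
            n10 += 1  # only the first classifier correct
        else:
            n01 += 1  # only the second classifier correct
    numerator = n11 * n00 - n01 * n10
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q is undefined for these counts")
    return numerator / denominator


# Hypothetical correctness flags (1 = correct) on six test samples.
q = q_statistic([1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 1, 1])
print(q)
```

Under this definition Q ranges from -1 to 1. Values near 1 indicate that the two classifiers tend to be right and wrong on the same samples, while values near 0 or below indicate that they err on different samples.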
2020
Medical data mining has recently become one of the most popular topics in the data mining community. This is due to the societal importance of the field and also the particular computational challenges posed in this domain of data mining. However, current medical data mining approaches oftentimes use identical costs or just ignore them for the different cases of classification errors. Thus, their outcome may be unexpected. This paper applies a new meta-heuristic approach, called the Homogeneity-Based Algorithm (or HBA), for optimizing the classification accuracy when analyzing some medical datasets. The HBA first expresses the objective as an optimization problem in terms of the error rates and the associated penalty costs. These costs may be dramatically different in medical applications as the implications of having a false-positive and a false-negative case may be tremendously different. When the HBA is combined with traditional classification algorithms, it enhan...
PERFORMANCE ANALYSIS OF LEARNING AND CLASSIFICATION
There are different learning and classification algorithms that are used to learn patterns and categorize data according to the frequency and relationship between attributes in a given dataset. The desired result has always been higher accuracy in predicting future values or events from the given dataset. These algorithms are crucial in the field of knowledge discovery and data mining and have been extensively researched and improved for accuracy.
Choosing Classification Algorithms and Its Optimum Parameters based on Data Set Characteristics
Choosing a correct classification algorithm for a given data set is an important task, given the many existing classifiers. A method of recommending a suitable algorithm and its optimum parameters for a given data set is proposed. First, six different types of measures are computed for each data set to represent its characteristics. Then, the performance and optimum parameters for a given algorithm are computed using the grid search method. Afterwards, one model is built to predict the variance of classifiers for a given data set, and another model is built to predict the most suitable algorithm. The proposed method tries to predict the optimum parameters for a given algorithm based on knowledge learned from historical data sets. To evaluate the performance of the proposed method, extensive experiments with four different types of algorithms are conducted on the UCI data sets. The results indicate that the proposed method is effective.
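The grid search step mentioned in this abstract can be sketched in a few lines. The abstract does not say which parameters or scoring function were used, so `toy_score`, the parameter names `c` and `gamma`, and the grid values below are hypothetical stand-ins for a real validation score.

```python
from itertools import product


def grid_search(score_fn, param_grid):
    """Exhaustively score every parameter combination and keep the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(c, gamma):
    # Hypothetical validation score that peaks at c=10, gamma=0.1.
    return -((c - 10) ** 2) - (100 * (gamma - 0.1)) ** 2


best_params, best_score = grid_search(
    toy_score, {"c": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
)
print(best_params, best_score)
```

In practice `score_fn` would train a classifier with the given parameters and return a cross-validated score. The cost grows with the product of the grid sizes, which is one reason a method that predicts good parameters from data set characteristics could be attractive.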
A niching genetic programming-based multi-objective algorithm for hybrid data classification
This paper introduces a multi-objective algorithm based on genetic programming to extract classification rules in databases composed of hybrid data, i.e., regular (e.g. numerical, logical, and textual) and non-regular (e.g. geographical) attributes. This algorithm employs a niche technique combined with a population archive in order to identify the rules that are more suitable for classifying items amongst classes of a given data set. The algorithm is implemented in such a way that the user can choose the function set that is more adequate for a given application. This feature makes the proposed approach virtually applicable to any kind of data set classification problem. In addition, the classification problem is modeled as a multi-objective one, in which the maximization of the accuracy and the minimization of the classifier complexity are the objective functions. A set of different classification problems, with considerably different data sets and domains, has been considered: wines, patients with hepatitis, incipient faults in power transformers, and level of development of cities. In this last data set, some of the attributes are geographical, and they are expressed as points, lines, or polygons. The effectiveness of the algorithm has been compared with three other methods widely employed for classification: Decision Tree (C4.5), Support Vector Machine (SVM), and Radial Basis Function (RBF). Statistical comparisons have been conducted employing one-way ANOVA and Tukey's tests, in order to provide a reliable comparison of the methods. The results show that the proposed algorithm achieved better classification effectiveness in all tested instances, which suggests that it is suitable for a considerable range of classification applications.
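The two objectives named in this abstract, maximizing accuracy and minimizing classifier complexity, lead naturally to a set of non-dominated trade-offs. The sketch below shows only that Pareto filtering step, not the paper's genetic programming or niching procedure. The candidate rules and their accuracy and complexity values are invented for illustration.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated by any other candidate.

    Each candidate is (name, accuracy, complexity); accuracy is maximized
    and complexity is minimized.
    """
    front = []
    for name, acc, size in candidates:
        dominated = any(
            (a >= acc and s <= size) and (a > acc or s < size)
            for _, a, s in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical classification rules: (name, accuracy, number of nodes).
rules = [
    ("r1", 0.90, 12),
    ("r2", 0.88, 5),
    ("r3", 0.85, 7),
    ("r4", 0.90, 15),
    ("r5", 0.70, 2),
]
front = pareto_front(rules)
print(front)
```

Here `r3` is dropped because `r2` is both more accurate and simpler, and `r4` is dropped because `r1` matches its accuracy with fewer nodes. The remaining rules each represent a different balance between fitting the data and keeping the model simple.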
Misclassification Penalties in Associative Classification
International journal of engineering research and technology, 2012
Data mining is a method for finding a small amount of useful data in a very large amount of data. There are two classical techniques in the field of data mining, namely associative rule mining and classical rule mining. To gain the advantages of both, a new approach was developed by combining the two methods. This new approach is called associative classification. It has given significant improvements, such as better accuracy, over conventional classification systems, e.g. C4.5. Many methods have been developed for associative classification over time, such as CBA, CMAR, CPAR, Hyper Heuristic, and CARGBA. However, the effect of misclassification penalties on classification has not been examined. Of all the available associative classification methods, CPAR has the highest accuracy. This work studies the effect of misclassification penalties on the classification process of the associative classification method CPAR. From the many methods available, bagging is selected for the misclassification penalty effect. A new approach, namely M-CPPAR (Modified CPAR), is proposed in this work. The study concludes that if the misclassification penalty effect is considered during the classification process, the accuracy of CPAR can be improved.
Comparative Study of Advanced Classification Methods
Shruti A, B. I. Khodanpur, “Comparative Study of Advanced Classification Methods”, March 15 Volume 3 Issue 3, International Journal on Recent and Innovation Trends in Computing and Communication (IJRITCC), ISSN: 2321-8169, PP: 1216 - 1220, DOI: 10.17762/ijritcc2321-8169.150371
Algorithm Tuning from Comparative Analysis of Classification Algorithms
International Journal of Scientific and Research Publications (IJSRP), 2018
Machine learning is an emerging research area for solving various problems, and classification is one of the main problems in the field. This paper describes various supervised machine learning (ML) classification techniques, compares various supervised learning algorithms, and determines the most efficient classification algorithm for the dataset. The wine-quality-white dataset is taken from the UCI Machine Learning Repository. Six different machine learning algorithms are considered: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Gaussian Naïve Bayes (NB), and Support Vector Machine (SVM). When the number of neighbors for KNN is tuned, the best configuration is K = 1.
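Tuning the number of neighbors for KNN can be sketched with leave-one-out validation on a toy data set. This is not the paper's experiment: the points, labels, and candidate values of K below are invented, and the result on this toy data says nothing about the wine-quality-white result reported above. The sketch only shows the tuning procedure and how one mislabeled point affects small K.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    neighbors = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, point, k) == label:
            hits += 1
    return hits / len(data)


# Two well-separated clusters plus one deliberately mislabeled point.
data = [
    ((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"), ((1, 1), "A"),
    ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"), ((6, 6), "B"),
    ((0.5, 0.5), "B"),  # noisy label inside cluster A
]

scores = {k: loo_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the single noisy point is the nearest neighbor of every point in cluster A, so K = 1 misclassifies all of them, while K = 3 outvotes the noise. Which K is best depends on the data, so a validated score rather than training accuracy should drive the choice.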
Assessing and improving classification rules
2021
The last few years have witnessed a resurgence of research effort aimed at developing improved techniques for supervised classification problems. In large part, this resurgence of interest has been stimulated by the novelty of multi-layer feedforward neural networks (Hertz et al, 1991; Ripley, 1996) and similar complex and flexible models such as MARS (Friedman, 1991), projection pursuit regression (Friedman and Stuetzle, 1981), and additive models in general (Hastie and Tibshirani, 1990). The flexibility of these models is in striking contrast to the simplicity of models such as simple linear discriminant analysis, perceptrons, and logistic discriminant analysis, which assume highly restricted forms of decision surface. The merit of the flexibility of neural networks is countered by the danger that they will overfit the design data. This relationship between model flexibility and the danger of overfitting has long been understood within the statistical community. For example, i...