Ethem Alpaydin - Profile on Academia.edu

Papers by Ethem Alpaydin

An Incremental Framework Based on Cross-Validation for Estimating the Architecture of a Multilayer Perceptron

International Journal of Pattern Recognition and Artificial Intelligence, 2009

We define the problem of optimizing the architecture of a multilayer perceptron (MLP) as a state space search and propose the MOST (Multiple Operators using Statistical Tests) framework that incrementally modifies the structure and checks for improvement using cross-validation. We consider five variants that implement forward/backward search, using single/multiple operators, and searching depth-first/breadth-first. On 44 classification and 30 regression datasets, we exhaustively search for the optimal and evaluate the goodness based on: (1) Order, the accuracy with respect to the optimal, and (2) Rank, the computational complexity. We check for the effect of two resampling methods (5 × 2, 10-fold cv), four statistical tests (5 × 2 cv t, 10-fold cv t, Wilcoxon, sign) and two corrections for multiple comparisons (Bonferroni, Holm). We also compare with Dynamic Node Creation (DNC) and Cascade Correlation (CC). Our results show that: (1) On most datasets, networks with few hidden units are optimal, (2) forward searching finds simpler architectures, (3) variants using single node additions (deletions) generally stop early and get stuck in simple (complex) networks, (4) choosing the best of multiple operators finds networks closer to the optimal, (5) MOST variants generally find simpler networks with error rates lower than or comparable to those of DNC and CC.
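
As a rough illustration of the kind of forward search the abstract describes, the sketch below greedily adds hidden units and keeps a candidate architecture only if its cross-validated accuracy improves significantly. This is a minimal sketch, not the MOST implementation: the scikit-learn MLP, the plain paired t-test (instead of the paper's 5 × 2 cv tests), and all helper names are illustrative assumptions.

```python
# Hedged sketch of a forward architecture search with cross-validation,
# loosely in the spirit of the MOST framework described above.
# MLPClassifier stands in for the paper's MLP; the acceptance test is a
# simple paired t-test, not the paper's resampling/test combinations.
from scipy.stats import ttest_rel
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def forward_search(X, y, max_hidden=20, step=1, alpha=0.05, cv=10):
    best_h = 1
    best_scores = cross_val_score(
        MLPClassifier(hidden_layer_sizes=(best_h,), max_iter=500), X, y, cv=cv)
    while best_h + step <= max_hidden:
        cand_h = best_h + step
        cand_scores = cross_val_score(
            MLPClassifier(hidden_layer_sizes=(cand_h,), max_iter=500), X, y, cv=cv)
        # Accept the larger network only if the improvement is significant.
        t, p = ttest_rel(cand_scores, best_scores)
        if cand_scores.mean() > best_scores.mean() and p < alpha:
            best_h, best_scores = cand_h, cand_scores
        else:
            break  # stop when adding units no longer helps significantly
    return best_h, best_scores.mean()
```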

MultiStage Cascading of Multiple Classifiers: One Man's Noise is Another Man's Data

For building implementable and industry-valuable classification solutions, machine learning methods must focus not only on accuracy but also on computational and space complexity. We discuss a multistage method, namely cascading, where there is a sequence of classifiers ordered in terms of increasing complexity and specificity, such that early classifiers are simple and general whereas later ones are more complex and specific, being localized on patterns rejected by the previous classifiers. We present the technique and its ...
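
A minimal sketch of the cascading idea at prediction time, assuming each stage exposes a `predict_proba`-style confidence; the stage models, thresholds, and names are illustrative placeholders, not the paper's classifiers.

```python
# Hedged sketch of cascade prediction: cheap, general classifiers answer first,
# and only patterns they are not confident about are passed on to later,
# more complex and more specific stages.
import numpy as np

def cascade_predict(stages, thresholds, x):
    """stages: fitted classifiers with predict_proba, ordered from simplest
    to most complex; thresholds: confidence required to stop at each stage."""
    x = np.asarray(x).reshape(1, -1)
    for clf, tau in zip(stages[:-1], thresholds):
        proba = clf.predict_proba(x)[0]
        if proba.max() >= tau:            # confident enough: stop early
            return clf.classes_[proba.argmax()]
    # the last, most complex stage decides on the patterns rejected so far
    return stages[-1].predict(x)[0]
```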

Cascading Multiple Classifiers And Representations For Optical And Pen-Based Handwritten Digit Recognition

We discuss a multistage method, cascading, where there is a sequence of classifiers ordered in terms of complexity (of the classifier or the representation) and specificity, in that early classifiers are simple and general and later ones are more complex and local. For building portable, low-cost handwriting recognizers, memory and computational requirements are as critical as accuracy, and our proposed method, cascading, is a way to gain from having multiple classifiers without losing much in cost. ...

Distributed and local neural classifiers for phoneme recognition

Pattern Recognition Letters, 1994

The comparative performances of distributed and local neural networks for the speech recognition problem are investigated. We consider a feed-forward network with one or more hidden layers. Depending on the response characteristics of the hidden units, we name the network distributed or local. If the hidden units use the sigmoid non-linearity, then they have a global response and we call such networks distributed. If each hidden unit responds only to inputs in a certain local region of the input space, then the network is ...
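
In symbols (our notation, not the paper's), the distinction is between a hidden unit that responds globally through a sigmoid of a projection and one that responds only near a centre, e.g. a Gaussian radial basis unit:

```latex
% Distributed (sigmoid) hidden unit: global response along a hyperplane direction
h_j(\boldsymbol{x}) = \frac{1}{1 + \exp\!\left[-\left(\boldsymbol{w}_j^{\top}\boldsymbol{x} + w_{j0}\right)\right]}

% Local (radial basis) hidden unit: response confined to a region around centre m_j
h_j(\boldsymbol{x}) = \exp\!\left(-\frac{\lVert \boldsymbol{x} - \boldsymbol{m}_j \rVert^{2}}{2\sigma_j^{2}}\right)
```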

Localized multiple kernel learning

Recently, instead of selecting a single kernel, multiple kernel learning (MKL) has been proposed, which uses a convex combination of kernels where the weight of each kernel is optimized during training. However, MKL assigns the same weight to a kernel over the whole input space. In this paper, we develop a localized multiple kernel learning (LMKL) algorithm using a gating model for selecting the appropriate kernel function locally. The localizing gating model and the kernel-based classifier are coupled and their optimization is done in a joint manner. Empirical results on ten benchmark and two bioinformatics data sets validate the applicability of our approach. LMKL achieves statistically similar accuracy results compared with MKL while storing fewer support vectors. LMKL can also combine multiple copies of the same kernel function localized in different parts of the input space. For example, LMKL with multiple linear kernels gives better accuracy results than using a single linear kernel on bioinformatics data sets.
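
The locally combined kernel the abstract refers to can be written as follows; this is a reconstruction in the usual LMKL notation, assuming a softmax gating model on the input:

```latex
% Gating model: softmax over the input assigns a local weight to each of the P kernels
\eta_m(\boldsymbol{x}) \;=\; \frac{\exp\!\left(\langle \boldsymbol{v}_m, \boldsymbol{x} \rangle + v_{m0}\right)}
                                  {\sum_{h=1}^{P} \exp\!\left(\langle \boldsymbol{v}_h, \boldsymbol{x} \rangle + v_{h0}\right)}

% Locally combined kernel: the kernel weights depend on where the two inputs lie
k_{\eta}(\boldsymbol{x}_i, \boldsymbol{x}_j) \;=\; \sum_{m=1}^{P} \eta_m(\boldsymbol{x}_i)\, k_m(\boldsymbol{x}_i, \boldsymbol{x}_j)\, \eta_m(\boldsymbol{x}_j)
```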

Omnivariate decision trees

IEEE Transactions on Neural Networks, 2001

Univariate decision trees at each decision node consider the value of only one feature, leading to axis-aligned splits. In a linear multivariate decision tree, each decision node divides the input space into two with a hyperplane. In a nonlinear multivariate tree, a multilayer perceptron at each node divides the input space arbitrarily, at the expense of increased complexity and higher risk of overfitting. We propose omnivariate trees where the decision node may be univariate, linear, or nonlinear depending on the outcome of comparative statistical tests on accuracy, thus automatically matching the complexity of the node to the subproblem defined by the data reaching that node. Such an architecture frees the designer from choosing the appropriate node type, doing model selection automatically at each node. Our simulation results indicate that such a decision tree induction method generalizes better than trees with the same type of node everywhere and induces small trees.
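
A hedged sketch of the per-node model selection described above: at a node, candidate split models of increasing complexity are compared by cross-validation, and a more complex model is kept only if it is significantly more accurate. The candidate models, the plain t-test, and the helper names are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of omnivariate node selection: compare univariate, linear, and
# nonlinear split models on the data reaching a node, keep the simplest adequate one.
from scipy.stats import ttest_rel
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier        # univariate (axis-aligned) stump
from sklearn.linear_model import LogisticRegression    # linear multivariate split
from sklearn.neural_network import MLPClassifier       # nonlinear multivariate split

def choose_node_model(X, y, alpha=0.05, cv=5):
    candidates = [  # ordered from simplest to most complex
        ("univariate", DecisionTreeClassifier(max_depth=1)),
        ("linear", LogisticRegression(max_iter=1000)),
        ("nonlinear", MLPClassifier(hidden_layer_sizes=(8,), max_iter=500)),
    ]
    best_name, best_model = candidates[0]
    best_scores = cross_val_score(best_model, X, y, cv=cv)
    for name, model in candidates[1:]:
        scores = cross_val_score(model, X, y, cv=cv)
        t, p = ttest_rel(scores, best_scores)
        # switch only if the more complex model is significantly more accurate
        if scores.mean() > best_scores.mean() and p < alpha:
            best_name, best_model, best_scores = name, model, scores
    return best_name, best_model
```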

Soft vector quantization and the EM algorithm

Neural Networks, 1998

The relation between hard c-means (HCM), fuzzy c-means (FCM), fuzzy learning vector quantization (FLVQ), the soft competition scheme (SCS), and probabilistic Gaussian mixtures (GM) has been pointed out recently. We extend this relation to their training, showing that the learning rules these models use to estimate the cluster centers can be seen as approximations to the expectation-maximization (EM) method as applied to Gaussian mixtures. HCM and unsupervised LVQ use 1-of-c type competition. In FCM and FLVQ, membership is the −2/(m − 1)th power of the distance. In SCS and GM, a Gaussian function is used. If the Gaussian membership function is used, the weighted within-groups sum of squared errors used as the fuzzy objective function corresponds to the maximum likelihood estimate in Gaussian mixtures with equal priors and covariances. The fuzzy c-means alternating optimization procedure (FCM-AO) proposed to optimize the former is then equivalent to batch EM, and SCS's update rule is a variant of the online version of EM. The advantages of the probabilistic framework are: (i) we no longer have spurious spread parameters that need fine tuning, such as m in fuzzy vector quantization or b in SCS; instead we have a variance term that has a sound interpretation and that can be estimated from the sample; (ii) EM guarantees that the likelihood does not decrease, thus it converges to the nearest local optimum; (iii) EM also allows us to estimate the underlying distance norm and the cluster priors, which we could not with the other approaches. We compare Gaussian mixtures trained with EM against LVQ (HCM), SCS, and FLVQ on the IRIS dataset and see that the Gaussian mixture is more accurate because it can take the covariance information into account. We finally note that vector quantization is generally an intermediate step before finding a final output, for which supervision may be possible. Thus, instead of an uncoupled approach where an unsupervised method is used first to find the cluster parameters, followed by supervised training of the mapping based on the memberships, we advocate a coupled approach where the cluster parameters and the mapping are trained together in a supervised manner. The uncoupled approach ignores the error at the outputs, which may not be ideal.
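
The correspondence the abstract draws can be summarised (in our notation, as a reconstruction) by comparing the FCM membership with the posterior "responsibility" computed in the E-step of EM for a Gaussian mixture; in both cases the centres are then re-estimated as membership-weighted means (FCM additionally raises the memberships to the power m in its weights).

```latex
% FCM membership of input x_t in cluster i (fuzzifier m > 1)
u_{it} \;=\; \frac{\lVert \boldsymbol{x}_t - \boldsymbol{m}_i \rVert^{-2/(m-1)}}
                  {\sum_{j=1}^{c} \lVert \boldsymbol{x}_t - \boldsymbol{m}_j \rVert^{-2/(m-1)}}

% EM responsibility of component i for x_t under a Gaussian mixture
h_{it} \;=\; \frac{\pi_i \,\mathcal{N}\!\left(\boldsymbol{x}_t \mid \boldsymbol{m}_i, \boldsymbol{\Sigma}_i\right)}
                  {\sum_{j=1}^{c} \pi_j \,\mathcal{N}\!\left(\boldsymbol{x}_t \mid \boldsymbol{m}_j, \boldsymbol{\Sigma}_j\right)}

% M-step re-estimation of the centres as responsibility-weighted means
\boldsymbol{m}_i \;=\; \frac{\sum_t h_{it}\, \boldsymbol{x}_t}{\sum_t h_{it}}
```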

Learning Error-Correcting Output Codes from Data

A polychotomizer, which assigns the input to one of K ≥ 3 classes, is constructed using a set of dichotomizers, which assign the input to one of two classes. What defines the classes in terms of the dichotomizers is the binary decomposition matrix of size K × L, where each of the K classes is written as an error-correcting output code (ECOC), i.e., an array of the responses of binary decisions made by L dichotomizers. We use linear dichotomizers and, by combining them suitably, we build nonlinear polychotomizers, thereby reducing complex decisions into a group of simpler decisions. We propose a new method to learn the error-correcting codes from data based on soft weight sharing, which forces parameters to take one of a set of values (here two: −1/+1). Simulation results on eight datasets indicate that, compared with a linear one-per-class polychotomizer and ECOC proper, these methods generate more accurate classifiers, using fewer dichotomizers than pairwise classifiers.
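
A minimal sketch of ECOC decoding as described: each class is coded by a row of a ±1 matrix, the L dichotomizers produce a vector of binary decisions, and the class whose code row is closest (here in the Hamming sense) is chosen. The code matrix and classifiers are illustrative placeholders.

```python
# Hedged sketch of error-correcting output code (ECOC) decoding.
# `code_matrix` is a K x L array of -1/+1 entries (one row per class);
# `dichotomizers` is a list of L fitted binary classifiers returning -1/+1.
import numpy as np

def ecoc_predict(code_matrix, dichotomizers, x):
    x = np.asarray(x).reshape(1, -1)
    # Collect the L binary decisions into a code word for this input.
    word = np.array([clf.predict(x)[0] for clf in dichotomizers])
    # Pick the class whose code row disagrees with the observed word the least.
    hamming = (code_matrix != word).sum(axis=1)
    return int(hamming.argmin())
```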

Techniques for Combining Multiple Learners

Learners based on different paradigms can be combined for improved accuracy. Each learning method assumes a certain model that comes with a set of assumptions which may lead to error if the assumptions do not hold. Learning is an ill-posed problem and with finite data each algorithm converges to a different solution and fails under different circumstances. Our previous experience with statistical and neural classifiers was that classifiers based on these paradigms do generalize differently, fail on different patterns ...

Constructive feedforward ART clustering networks. I

IEEE Transactions on Neural Networks, 2002

Part I of this paper defines the class of constructive unsupervised on-line learning simplified adaptive resonance theory (SART) clustering networks. Proposed instances of class SART are the symmetric Fuzzy ART (S-Fuzzy ART) and the Gaussian ART (GART) network. In Part II of our work, a third network belonging to class SART, termed fully self-organizing SART (FOSART), is presented and discussed. FOSART is a constructive, soft-to-hard competitive, topology-preserving, minimum-distance-to-means clustering algorithm capable of: 1) generating processing units and lateral connections on an example-driven basis and 2) removing processing units and lateral connections on a minibatch basis. FOSART is compared with Fuzzy ART, S-Fuzzy ART, GART and other well-known clustering techniques (e.g., neural gas and self-organizing map) in several unsupervised learning tasks, such as vector quantization, perceptual grouping and 3-D surface reconstruction. These experiments prove that when compared with other unsupervised learning networks, FOSART provides an interesting balance between easy user interaction, performance accuracy, efficiency, robustness, and flexibility. Index Terms: absolute and relative membership function, adaptive resonance theory, clustering, Delaunay triangulation, soft-to-hard competitive learning, topology-preserving mapping, Voronoi partition.

Methods of Combining Multiple Classifiers Based on Different Representations for Pen-based Handwritten Digit Recognition

Pen-based handwriting recognition has enormous practical utility. It is different from optical recognition in that the input is a temporal signal of pen movements as opposed to a static spatial pattern. We examine various ways of combining multiple learners which are trained with different representations of the same input signal: dynamic (pen movements) and static (final 2D image). We notice that the classifiers based on different representations fail for different patterns and investigate ways to combine the two representations. We benchmark ...

Selective Attention for Handwritten Digit Recognition

Completely parallel object recognition is NP-complete. Achieving a recognizer with feasible complexity requires a compromise between parallel and sequential processing where a system selectively focuses on parts of a given image, one after another. Successive fixations are generated to sample the image and these samples are processed and abstracted to generate a temporal context in which results are integrated over time. A computational model based on a partially recurrent feedforward network is proposed and made credible by ...

Incremental Mixtures of Factor Analysers

A mixture of factor analyzers is a semiparametric density estimator that performs clustering and dimensionality reduction in each cluster (component) simultaneously. It performs nonlinear dimensionality reduction by modeling the density as a mixture of local linear models. The approach can be used for classification by modeling each class-conditional density using a mixture model; the complete data is then a mixture of mixtures. We propose an incremental mixture of factor analyzers algorithm where the number of components (local models) in the mixture and the number of factors in each component (local dimensionality) are determined adaptively. Our results on different pattern classification tasks prove the utility of our approach and indicate that our algorithms find a good trade-off between model complexity and accuracy.
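
Concretely (in our notation, not taken from the paper), each component k is a factor analyser, i.e. a local linear model with a low-dimensional latent factor, and the overall density is their mixture:

```latex
% Component k: loading matrix Lambda_k, mean mu_k, diagonal noise covariance Psi_k
\boldsymbol{x} \;=\; \boldsymbol{\Lambda}_k \boldsymbol{z} + \boldsymbol{\mu}_k + \boldsymbol{\varepsilon},
\qquad \boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}),\quad
\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Psi}_k)

% Mixture of factor analysers: clustering and local dimensionality reduction at once
p(\boldsymbol{x}) \;=\; \sum_{k=1}^{K} \pi_k\,
   \mathcal{N}\!\left(\boldsymbol{x} \mid \boldsymbol{\mu}_k,\;
   \boldsymbol{\Lambda}_k \boldsymbol{\Lambda}_k^{\top} + \boldsymbol{\Psi}_k\right)
```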

Linear Discriminant Trees

Voting over Multiple Condensed Nearest Neighbors

Artificial Intelligence Review, 1997

Lazy learning methods like the k-nearest neighbor classifier require storing the whole training set and may be too costly when this set is large. The condensed nearest neighbor classifier incrementally stores a subset of the sample, thus decreasing storage and computation requirements. We propose to train multiple such subsets and take a vote over them, thus combining predictions from a set of concept descriptions. We investigate two voting schemes: simple voting where voters have equal weight and weighted voting where weights depend on classifiers' confidences in their predictions. We consider ways to form such subsets for improved performance: When the training set is small, voting improves performance considerably. If the training set is not small, then voters converge to similar solutions and we do not gain anything by voting. To alleviate this, when the training set is of intermediate size, we use bootstrapping to generate smaller training sets over which we train the voters. When the training set is large, we partition it into smaller, mutually exclusive subsets and then train the voters. Simulation results on six datasets are reported with good results. We give a review of methods for combining multiple learners. The idea of taking a vote over multiple learners can be applied with any type of learning scheme.
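
A hedged sketch combining the two ingredients in the abstract: Hart-style condensing to obtain a small stored subset, and simple (unweighted) voting over several such subsets trained on bootstrap samples. The 1-NN base learner, the helper names, and the exact condensing loop are assumptions for illustration, not the paper's procedure.

```python
# Hedged sketch: condensed 1-NN subsets trained on bootstrap samples,
# combined by simple voting. X and y are assumed to be NumPy arrays.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condense(X, y):
    """Hart-style condensing: add a point only if the current subset misclassifies it."""
    keep = [0]
    changed = True
    while changed:
        changed = False
        nn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
        for i in range(len(X)):
            if i not in keep and nn.predict(X[i:i + 1])[0] != y[i]:
                keep.append(i)
                nn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
                changed = True
    return X[keep], y[keep]

def train_voters(X, y, n_voters=5, seed=0):
    rng = np.random.default_rng(seed)
    voters = []
    for _ in range(n_voters):
        idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample
        Xc, yc = condense(X[idx], y[idx])
        voters.append(KNeighborsClassifier(n_neighbors=1).fit(Xc, yc))
    return voters

def vote_predict(voters, x):
    """Simple voting: every condensed 1-NN voter has equal weight."""
    preds = [v.predict(np.asarray(x).reshape(1, -1))[0] for v in voters]
    vals, counts = np.unique(preds, return_counts=True)
    return vals[counts.argmax()]
```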

Local linear perceptrons for classification

IEEE Transactions on Neural Networks, 1996

Combined 5 x 2 cv F Test for Comparing Supervised Classification Learning Algorithms

Neural Computation, 1999

Dietterich (1998) reviews five statistical tests and proposes the 5 × 2 cv t test for determining whether there is a significant difference between the error rates of two classifiers. In our experiments, we noticed that the 5 × 2 cv t test result may vary depending on factors that should not affect the test, and we propose a variant, the combined 5 × 2 cv F test, that combines multiple statistics to get a more robust test. Simulation results show that this combined version of the test has lower type I error and higher power than 5 × 2 cv proper.
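
The combined statistic (reconstructed here from the standard description of the test, in our notation) pools the ten observed error-rate differences p_i^(j) and the five per-replication variance estimates s_i^2 of the 5 × 2 cross-validation into a single approximately F-distributed quantity:

```latex
% p_i^(j): difference in error rates of the two algorithms on fold j of replication i
% s_i^2 = (p_i^(1) - \bar{p}_i)^2 + (p_i^(2) - \bar{p}_i)^2: variance estimate of replication i
f \;=\; \frac{\displaystyle\sum_{i=1}^{5} \sum_{j=1}^{2} \left(p_i^{(j)}\right)^{2}}
             {\displaystyle 2 \sum_{i=1}^{5} s_i^{2}}
\;\sim\; F_{10,\,5} \quad \text{(approximately, under the null hypothesis of equal error rates)}
```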

Support Vector Machines for Multi-class Classification

Support vector machines (SVMs) are primarily designed for 2-class classification problems. Although it is mentioned in several papers that a combination of K SVMs can be used to solve a K-class classification problem, such a procedure requires some care. In this paper, the scaling problem of different SVMs is highlighted. Various normalization methods are proposed to cope with this problem and their efficiencies are measured empirically. This simple way of using SVMs to learn a K-class classification problem consists of choosing the maximum of the outputs of K SVMs solving a one-per-class decomposition of the general problem. In the second part of this paper, more sophisticated techniques are suggested. On the one hand, a stacking of the K SVMs with other classification techniques is proposed. On the other hand, the one-per-class decomposition scheme is replaced by more elaborate schemes based on error-correcting codes. An incremental algorithm for the elaboration of pertinent decomposition schemes is mentioned, which exploits the properties of SVMs for efficient computation.
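
A minimal sketch of the simple one-per-class scheme mentioned above: train K binary SVMs (class k vs. rest), rescale their raw outputs so they are comparable, and predict the class with the largest output. The z-score rescaling used here is one illustrative choice under stated assumptions, not the normalization methods studied in the paper.

```python
# Hedged sketch of one-per-class SVM combination with a simple output rescaling.
import numpy as np
from sklearn.svm import SVC

def fit_one_per_class(X, y):
    classes = np.unique(y)
    models, mus, sigmas = [], [], []
    for k in classes:
        svm = SVC(kernel="rbf").fit(X, (y == k).astype(int))
        d = svm.decision_function(X)          # raw margins on the training set
        models.append(svm)
        mus.append(d.mean())
        sigmas.append(d.std() + 1e-12)
    return classes, models, np.array(mus), np.array(sigmas)

def predict_one_per_class(classes, models, mus, sigmas, x):
    x = np.asarray(x).reshape(1, -1)
    d = np.array([m.decision_function(x)[0] for m in models])
    z = (d - mus) / sigmas                    # rescale so the K outputs are comparable
    return classes[z.argmax()]
```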

Introduction to Machine Learning

The goal of machine learning is to program computers to use example data or past experience to solve a given problem. Many successful applications of machine learning already exist, including systems that analyze past sales data to predict customer behavior, recognize faces or speech, optimize robot behavior so that a task can be completed using minimum resources, and extract knowledge from bioinformatics data. Introduction to Machine Learning is a comprehensive textbook on the subject, covering a broad array of ...
