Carla Brodley | Tufts University

Papers by Carla Brodley

This paper presents a new approach to identifying and eliminating mislabeled training instances. The goal of this technique is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using n-fold cross-validation, the training data is passed through the filter. Only instances that the filter classifies correctly are passed to the final learning algorithm. We present an empirical evaluation of the approach for the task of automated land cover mapping from remotely sensed data. Labeling error arises in these data from a multitude of sources including lack of consistency in the vegetation classification used, variable measurement techniques, and variation in the spatial sampling resolution. Our evaluation shows that for noise levels of less than 40%, filtering results in higher predictive accuracy than not filtering, and for levels of class noise less than or equal to 20% filtering allows the base-line accuracy to be retained. Our empirical results suggest that the ensemble filter approach is an effective method for identifying labeling errors, and further, that the approach will significantly benefit ongoing research to develop accurate and robust remote sensing-based methods to map land cover at global scales.
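The filtering idea in this abstract can be illustrated with a minimal sketch. The abstract does not say which base classifiers or which voting rule the paper uses, so everything here is an assumption made for illustration: the base learners (1-NN, 3-NN, nearest centroid), the majority-vote rule, the round-robin fold assignment, and all function names (`knn_trainer`, `centroid_trainer`, `ensemble_filter`).

```python
import math
from collections import Counter


def knn_trainer(k):
    """Return a trainer that builds a k-nearest-neighbour predictor (illustrative base learner)."""
    def train(X, y):
        def predict(x):
            nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))[:k]
            return Counter(y[i] for i in nearest).most_common(1)[0][0]
        return predict
    return train


def centroid_trainer(X, y):
    """Build a nearest-centroid predictor (illustrative base learner)."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    centroids = {
        label: [sum(col) / len(points) for col in zip(*points)]
        for label, points in groups.items()
    }

    def predict(x):
        return min(centroids, key=lambda label: math.dist(centroids[label], x))
    return predict


def ensemble_filter(X, y, trainers, n_folds=5):
    """Return indices of instances that a majority of base classifiers label correctly.

    Each instance is judged only by models trained on the other folds.
    The majority-vote rule is an assumption of this sketch.
    """
    keep = []
    for fold in range(n_folds):
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        models = [trainer(X_train, y_train) for trainer in trainers]
        for i in test_idx:
            correct_votes = sum(model(X[i]) == y[i] for model in models)
            if correct_votes > len(models) / 2:
                keep.append(i)
    return sorted(keep)
```

The retained indices would then be used to build the training set for the final learning algorithm. A stricter consensus rule (drop an instance only when every base classifier disagrees with its label) is an equally plausible variant.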

Abstract. Execution speed of programs on modern computer architectures is sensitive, by a factor of two or more, to the order in which instructions are presented to the processor. To realize potential execution efficiency, it is now customary for an optimizing compiler to employ a heuristic algorithm for instruction scheduling. These algorithms are currently hand-crafted. We show how to cast the local instruction scheduling problem as a machine learning task, so that one obtains a heuristic scheduling algorithm automatically.

Abstract Execution speed of programs on modern computer architectures is sensitive, by a factor of two or more, to the order in which instructions are presented to the processor. To realize potential execution efficiency, it is now customary for an optimizing compiler to employ a heuristic algorithm for instruction scheduling. These algorithms are painstakingly hand-crafted, which is expensive and time-consuming.

Abstract The anomaly detection problem has been widely studied in the computer security literature. In this paper we present a machine learning approach to anomaly detection. Our system builds user profiles based on command sequences and compares current input sequences to the profile using a similarity measure. The system must learn to classify current behavior as consistent or anomalous with past behavior using only positive examples of the account's valid user.
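As a rough illustration of profile-based sequence comparison, the sketch below builds a profile from fixed-length windows of a user's command history and scores a new window against it. The abstract does not define the similarity measure, the window length, or the decision threshold, so the run-weighted positional match used here and all names (`windows`, `similarity`, `build_profile`, `is_anomalous`) are assumptions for illustration only.

```python
def windows(sequence, length):
    """Split a command sequence into overlapping fixed-length windows."""
    return [tuple(sequence[i:i + length]) for i in range(len(sequence) - length + 1)]


def similarity(a, b):
    """Score two equal-length windows; consecutive matches earn increasing credit.

    This particular measure is an assumption of the sketch.
    """
    score, run = 0, 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        score += run
    return score


def build_profile(history, length):
    """The profile is the set of windows seen in the valid user's history (positive examples only)."""
    return set(windows(history, length))


def profile_similarity(window, profile):
    """Similarity of a window to the profile: its best match against any stored window."""
    return max(similarity(window, stored) for stored in profile)


def is_anomalous(window, profile, threshold):
    """Flag a window whose best match falls below a (hypothetical) threshold."""
    return profile_similarity(window, profile) < threshold
```

Only the valid user's history is needed to build the profile, which matches the positive-examples-only setting described in the abstract. In practice the threshold would have to be tuned per user.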

Abstract We present an approach to user re-authentication based on the data collected from the computer's mouse device. Our underlying hypothesis is that one can successfully model user behavior on the basis of user-invoked mouse movements. Our implemented system raises an alarm when the current behavior of user X deviates sufficiently from the learned "normal" behavior of user X. We apply a supervised learning method to discriminate among k users.

Abstract This article presents an algorithm for inducing multiclass decision trees with multivariate tests at internal decision nodes. Each test is constructed by training a linear machine and eliminating variables in a controlled manner. Empirical results demonstrate that the algorithm builds small accurate trees across a variety of tasks.

ABSTRACT For many feature selection problems, a human defines the features that are potentially useful, and then a subset is chosen from the original pool of features using an automated feature selection algorithm. In contrast to supervised learning, class information is not available to guide the feature search for unsupervised learning tasks.

We describe a novel framework for class noise mitigation that assigns a vector of class membership probabilities to each training instance, and uses the confidence on the current label as a weight during training. The probability vector should be calculated such that clean instances have a high confidence on their current label, while mislabeled instances have a low confidence on their current label and a high confidence on their correct label. Past research focuses on techniques that either discard or correct instances.
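The weighting scheme described here can be sketched as follows. The abstract does not say how the probability vector is computed or which learner consumes the weights, so this sketch makes two loud assumptions: the membership probabilities are estimated from the labels of each instance's k nearest neighbours, and the weighted learner is a simple weighted nearest-centroid model. All function names are hypothetical.

```python
import math
from collections import Counter


def membership_probabilities(X, y, k=3):
    """Estimate a class-probability vector per instance from its k nearest neighbours.

    The k-NN estimate is an assumption of this sketch, not the paper's method.
    """
    classes = sorted(set(y))
    probabilities = []
    for i, xi in enumerate(X):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[j], xi),
        )[:k]
        counts = Counter(y[j] for j in neighbours)
        probabilities.append({c: counts[c] / len(neighbours) for c in classes})
    return probabilities


def label_weights(probabilities, y):
    """The training weight of an instance is the confidence on its current label."""
    return [p[label] for p, label in zip(probabilities, y)]


def weighted_centroids(X, y, weights):
    """Fit per-class centroids where each instance contributes in proportion to its weight."""
    centroids = {}
    for label in set(y):
        members = [(x, w) for x, yi, w in zip(X, y, weights) if yi == label]
        total = sum(w for _, w in members)
        if total == 0:
            continue  # no confident instance carries this label
        dims = len(members[0][0])
        centroids[label] = tuple(
            sum(w * x[d] for x, w in members) / total for d in range(dims)
        )
    return centroids
```

Unlike discarding or relabeling, a likely mislabeled instance stays in the training set here but contributes little or nothing to the model, while its probability vector still records which label looks correct.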

Abstract We describe the task of user-oriented anomaly detection for computer security. In this domain the goal is to develop a model of a computer user's normal behavioral patterns and to detect anomalous conditions as deviations from expected behaviors. We present an instance-based learning (IBL) system for profiling users and examine some domain constraints with respect to our approach.

Abstract The anomaly-detection problem can be formulated as one of learning to characterize the behaviors of an individual, system, or network in terms of temporal sequences of discrete data. We present an approach on the basis of instance-based learning (IBL) techniques.

Abstract The problem of how to learn from examples has been studied throughout the history of machine learning, and many successful learning algorithms have been developed. A problem that has received less attention is how to select which algorithm to use for a given learning task. The ability of a chosen algorithm to induce a good generalization depends on how appropriate the model class underlying the algorithm is for the given task.

The July 2005 announcement by computer security researcher Michael Lynn at the Black Hat security conference of a software flaw in Cisco Systems routers grabbed media attention worldwide. The flaw was an instance of a buffer overflow, a security vulnerability that has been discussed for 40 years yet remains one of the most frequently reported types of remote attack against computer systems.

ABSTRACT We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria, and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns found.

Abstract This paper studies cluster ensembles for high dimensional data clustering. We examine three different approaches to constructing cluster ensembles. To address high dimensionality, we focus on ensemble construction methods that build on two popular dimension reduction techniques, random projection and principal component analysis (PCA).

Abstract We present a new approach to semi-supervised anomaly detection. Given a set of training examples believed to come from the same distribution or class, the task is to learn a model that will be able to distinguish examples in the future that do not belong to the same class. Traditional approaches typically compare the position of a new data point to the set of "normal" training data points in a chosen representation of the feature space.

Abstract Many machine learning researchers view the task of inductive generalization as beginning after the data is collected, assuming that the useful features have been identified and that representative data has been collected. This assumption has led researchers to focus, with considerable success, on algorithm development. As a result, little attention has been paid to applying machine learning algorithms.

Abstract For some supervised learning tasks, researchers can control the data generation process. In such cases, it would be beneficial to have feedback during learning to guide future data collection. Our research is motivated by a real-world problem: discrimination of vapors with an “artificial nose”. The nose's accuracy is vital, because it will be deployed to detect harmful gases in critical situations, such as an airport or a subway.

As the number of sequenced genomes has grown, we have become increasingly aware of the impact of horizontal gene transfer on our understanding of genome evolution. Methods for detecting horizontal gene transfer from sequence abound. Among the most accurate are methods based on phylogenetic tree inference, but even these can perform poorly in some cases, such as when multiple trees fit the data equally well. In addition, they tend to be computationally intensive, making them poorly suited to genomic-scale applications.

It is often difficult to come up with a well-principled approach to the selection of low-level features for characterizing images for content-based retrieval. This is particularly true for medical imagery, where gross characterizations on the basis of color and other global properties do not work.

Learning to Schedule Straight-Line Code. Eliot Moss, Paul Utgoff, John Cavazos, Doina Precup, Darko Stefanović (Dept. of Comp. Sci., Univ. of Mass., Amherst, MA 01003); Carla Brodley, David Scheeff (Sch. of Elec. and Comp. Eng., Purdue University, W. Lafayette, IN 47907). Abstract Program execution speed on modern computers is sensitive, by a factor of two or more, to the order in which instructions are presented to the processor.