Subgroup Discovery with Proper Scoring Rules

ROC analysis of example weighting in subgroup discovery algorithms

Abstract. This paper presents two new example weighting schemes for subgroup discovery. The proposed schemes are applicable to any subgroup discovery algorithm that uses the weighted covering approach to discover interesting subgroups in data. To show ...

Analysis of Example Weighting in Subgroup Discovery by Comparison of Three Algorithms on a Real-life Data Set

This paper investigates the implications of example weighting in subgroup discovery by comparing three state-of-the-art subgroup discovery algorithms, APRIORI-SD, CN2-SD, and SubgroupMiner, on a real-life data set. While both APRIORI-SD and CN2-SD use example weighting in the process of subgroup discovery, SubgroupMiner does not. Moreover, APRIORI-SD uses example weighting in the post-processing step of selecting the 'best' rules, while CN2-SD uses example weighting during rule induction. The results of applying the three subgroup discovery algorithms to a real-life data set, the UK Traffic challenge data set, are presented in the form of ROC curves, showing that APRIORI-SD slightly outperforms CN2-SD; both APRIORI-SD and CN2-SD are good at finding small and highly accurate subgroups (describing minority classes), while SubgroupMiner found larger and less accurate subgroups (describing the majority class). We show by using ROC analysis that these results are not surp...
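
Below is a minimal sketch of the weighted covering idea behind example weighting in CN2-SD-style algorithms: once a rule is induced, the weights of the examples it covers are decreased so that subsequent rules focus on examples not yet covered. The additive (1/(k+1)) and multiplicative (gamma^k) schemes shown here are the two standard variants; the function names and the sample counts are illustrative, not taken from the papers above.

```python
# Minimal sketch of the weighted covering idea used by CN2-SD-style algorithms:
# each time a rule covers an example, the example's weight is reduced so that
# later rules are pushed towards examples that are not yet well covered.
# The two classic schemes are additive weights w(e) = 1/(k+1) and
# multiplicative weights w(e) = gamma^k, where k is the number of already
# induced rules covering example e.

def additive_weight(times_covered: int) -> float:
    """Additive scheme: 1, 1/2, 1/3, ... as the example gets covered again."""
    return 1.0 / (times_covered + 1)

def multiplicative_weight(times_covered: int, gamma: float = 0.7) -> float:
    """Multiplicative scheme: 1, gamma, gamma^2, ... (0 < gamma < 1)."""
    return gamma ** times_covered

def reweight(cover_counts, scheme=additive_weight):
    """Map each example's cover count to its current weight."""
    return [scheme(k) for k in cover_counts]

if __name__ == "__main__":
    # Three examples: never covered, covered once, covered three times.
    counts = [0, 1, 3]
    print(reweight(counts))                         # [1.0, 0.5, 0.25]
    print(reweight(counts, multiplicative_weight))  # [1.0, 0.7, 0.343]
```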

Robust subgroup discovery

2021

We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, and that includes traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, as finding optimal subgroup lists is NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration; this is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions, plus a multiple hypothesis testing penalty. We empirically show on 54 datasets that SSD++ outperforms previous subgroup set discovery methods in terms of quality and subgroup list size.
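
As a rough illustration of the subgroup-list model class described above, here is a minimal sketch under two assumptions stated in the abstract: the first matching subgroup supplies the local target distribution, and uncovered instances fall back to the dataset marginal. The MDL-based gain used by SSD++ is not reproduced; the class names and the toy attributes are illustrative.

```python
# Minimal sketch of the subgroup-list model class: an ordered set of subgroups,
# read like a rule list. The first subgroup whose description covers an instance
# supplies the local target distribution; instances covered by no subgroup fall
# back to the dataset's marginal ("default") distribution.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Subgroup:
    description: Callable[[dict], bool]    # predicate over an instance's attributes
    target_distribution: Dict[str, float]  # class probabilities inside the subgroup

class SubgroupList:
    def __init__(self, subgroups: List[Subgroup], default: Dict[str, float]):
        self.subgroups = subgroups
        self.default = default  # dataset marginal target distribution

    def predict_distribution(self, instance: dict) -> Dict[str, float]:
        for sg in self.subgroups:
            if sg.description(instance):
                return sg.target_distribution
        return self.default

# Illustrative use with a hypothetical two-class target.
model = SubgroupList(
    subgroups=[
        Subgroup(lambda x: x["age"] >= 60 and x["smoker"], {"ill": 0.8, "healthy": 0.2}),
        Subgroup(lambda x: x["age"] < 30, {"ill": 0.05, "healthy": 0.95}),
    ],
    default={"ill": 0.3, "healthy": 0.7},
)
print(model.predict_distribution({"age": 65, "smoker": True}))  # first subgroup matches
```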

Subgroup discovery in data sets with multi-dimensional responses

2011

Most present subgroup discovery approaches aim at finding subsets of attribute-value data with an unusual distribution of a single output variable. In general, real-life problems may be described by richer, multi-dimensional descriptions of the outcome. The discovery task in such domains is to find subsets of data instances with a similar outcome description that are separable from the rest of the instances in the input space. We have developed a technique that directly addresses this problem and uses a combination of agglomerative clustering, to find subgroup candidates in the space of output attributes, and predictive modeling, to score and describe these candidates in the input attribute space. Experiments with the proposed method on synthetic data sets and on a real social survey data set demonstrate its ability to discover relevant and interesting subgroups from data with multi-dimensional responses.
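
A rough sketch of this two-step idea, using scikit-learn components as stand-ins; the synthetic data, the clustering settings, and the shallow decision tree used to describe candidates are illustrative assumptions, not the authors' exact setup.

```python
# (1) Cluster instances in the space of output attributes to obtain subgroup
#     candidates; (2) fit an interpretable classifier on the input attributes
#     to check whether each candidate is separable and to describe it.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                   # input attributes
Y = np.column_stack([X[:, 0] + rng.normal(scale=0.1, size=200),
                     (X[:, 0] > 0).astype(float)])              # multi-dimensional response

# Step 1: candidate subgroups = clusters in the output (response) space.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(Y)

# Step 2: describe each candidate in the input space with a shallow tree and
# use its training accuracy as a crude separability score.
for cluster in np.unique(labels):
    member = (labels == cluster).astype(int)
    tree = DecisionTreeClassifier(max_depth=2).fit(X, member)
    print(f"cluster {cluster}: separability score = {tree.score(X, member):.2f}")
    print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
```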

Discovering outstanding subgroup lists for numeric targets using MDL

ECML PKDD 2020: Machine Learning and Knowledge Discovery in Databases, 2020

The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations, though, as they typically rely heavily on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that, in addition, it allows subgroup quality to be traded off against the complexity of the subgroup. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.
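
To make "dispersion-aware" concrete, here is a toy score, not the paper's MDL encoding, that rewards a subgroup whose numeric target values have a strongly deviating mean and a small spread. The scaling by subgroup size and the synthetic data are illustrative assumptions.

```python
# Toy illustration of a dispersion-aware quality for a numeric target: a
# subgroup is interesting when its mean deviates strongly from the overall
# mean AND its values have small spread. This z-like score is only a stand-in
# for the MDL criterion used by the paper.

import numpy as np

def toy_dispersion_aware_quality(subgroup_values, all_values, eps=1e-9):
    n = len(subgroup_values)
    mean_shift = abs(np.mean(subgroup_values) - np.mean(all_values))
    spread = np.std(subgroup_values) + eps
    return np.sqrt(n) * mean_shift / spread

population = np.random.default_rng(1).normal(loc=50, scale=10, size=1000)
tight_high = np.random.default_rng(2).normal(loc=70, scale=2, size=50)   # deviating mean, small spread
loose_high = np.random.default_rng(3).normal(loc=70, scale=15, size=50)  # deviating mean, large spread
print(toy_dispersion_aware_quality(tight_high, population))  # large score
print(toy_dispersion_aware_quality(loose_high, population))  # noticeably smaller
```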

An overview on subgroup discovery: Foundations and applications

2010

Subgroup discovery is a data mining technique which extracts interesting rules with respect to a target variable. An important characteristic of this task is the combination of predictive and descriptive induction. An overview of the task of subgroup discovery is presented. This review focuses on the foundations, algorithms, and advanced studies, together with the applications of subgroup discovery presented throughout the specialised bibliography.

Apriori-SD: Adapting Association Rule Learning to Subgroup Discovery

Applied Artificial Intelligence, 2006

This paper presents a subgroup discovery algorithm, APRIORI-SD, developed by adapting association rule learning to subgroup discovery. The paper contributes to subgroup discovery, to a better understanding of the weighted covering algorithm, and to the properties of the weighted relative accuracy heuristic by analyzing their performance in ROC space. An experimental comparison with the rule learners CN2, RIPPER, and APRIORI-C on UCI data sets demonstrates that APRIORI-SD produces substantially smaller rulesets, where individual rules have higher coverage and significance. APRIORI-SD is also compared to the subgroup discovery algorithms CN2-SD and SubgroupMiner. The comparisons performed on U.K. traffic accident data show that APRIORI-SD is a competitive subgroup discovery algorithm.

Standard rule learning algorithms are designed to construct classification and prediction rules (Michalski et al. 1986; Clark and Niblett 1989; Cohen 1995). In addition to this area of machine learning, referred to as supervised learning or predictive induction, developments in descriptive induction have recently gained much attention, in particular association rule learning (Agrawal et al. 1993), subgroup discovery (Wrobel 1997; 2001), and other approaches to non-classificatory induction. This paper considers the task of subgroup discovery, defined as follows (Wrobel 1997; 2001): given a population of individuals and a specific property of the individuals that we are interested in, find population subgroups that are statistically "most interesting", e.g., are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest.
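
For reference, here is a minimal sketch of the weighted relative accuracy (WRAcc) heuristic mentioned above, WRAcc(Cond -> Class) = p(Cond) * (p(Class|Cond) - p(Class)). The counts are made up, and APRIORI-SD's use of example weights is ignored in this sketch.

```python
# WRAcc trades off rule coverage (generality) against the gain in class
# probability over the default class probability.

def wracc(n_cond_and_class: int, n_cond: int, n_class: int, n_total: int) -> float:
    p_cond = n_cond / n_total
    p_class = n_class / n_total
    p_class_given_cond = n_cond_and_class / n_cond if n_cond else 0.0
    return p_cond * (p_class_given_cond - p_class)

# A rule covering 40 examples, 30 of them of the target class, in a dataset of
# 200 examples of which 80 belong to the target class.
print(wracc(n_cond_and_class=30, n_cond=40, n_class=80, n_total=200))
# 0.2 * (0.75 - 0.4) = 0.07
```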

A Tool for Interactive Subgroup Discovery Using Distribution Rules

Lecture Notes in Computer Science, 2007

We describe an approach and a tool for the discovery of subgroups within the framework of distribution rule mining. Distribution rules are a kind of association rule particularly suited to the exploratory study of numerical variables of interest. Because it is an exploratory technique, a distribution rule mining process typically produces a very large number of patterns. Exploring such results is thus a complex task that limits the use of the technique. To overcome this shortcoming we developed a tool, written in Java, which supports subgroup discovery in a post-processing step. The tool engages the analyst in an interactive process of subgroup discovery by means of a graphical interface with well-defined statistical grounds, where domain knowledge can be used during the identification of such subgroups within the population. We show a case study analyzing the results of students in a large-scale university admission examination.
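
As a rough sketch of the distribution-rule idea, a candidate subgroup is kept only if the distribution of the numerical property of interest inside it differs significantly from the population distribution. A two-sample Kolmogorov-Smirnov test from SciPy stands in here for the tool's statistical machinery; the data and the threshold are purely illustrative.

```python
# Compare the distribution of the property of interest inside the covered
# subgroup with its distribution over the whole population.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
population_scores = rng.normal(loc=60, scale=12, size=5000)  # e.g. exam scores
subgroup_scores = rng.normal(loc=68, scale=8, size=300)      # candidate subgroup

result = ks_2samp(subgroup_scores, population_scores)
if result.pvalue < 0.05:
    print(f"keep rule: distributions differ (KS={result.statistic:.3f}, p={result.pvalue:.2e})")
else:
    print("discard rule: no significant difference from the population")
```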

Mining atypical groups for a target quantitative attribute

2008 IEEE Conference on Cybernetics and Intelligent Systems, 2008

An important task in data analysis is the understanding of unexpected or atypical behaviors in a group of individuals. Which categories of individuals earn the highest salaries or, on the contrary, which ones earn the lowest salaries? We address the problem of mining atypical groups with respect to a target quantitative attribute, such as the attribute "salary", and in particular for the high and low values of a user-defined interval. Our search therefore focuses on conjunctions of attributes whose distribution differs significantly from that of the learning set for the high and low values of the target attribute's interval. Such atypical groups can be found by adapting an existing measure, the intensity of inclination. This measure frees us from the transformation step for quantitative attributes, that is to say, discretization followed by a complete disjunctive coding. Thus, we propose an algorithm for mining such groups that uses pruning rules to reduce the complexity of the problem. This algorithm has been developed and integrated into the WEKA software for knowledge extraction. Finally, we give an example of data extraction from the American census database IPUMS.
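
The following simplified stand-in does not implement the intensity-of-inclination measure; the salaries, interval bounds, and group definition are made up. It only illustrates the underlying idea: a group defined by a conjunction of attributes is flagged as atypical when the share of its members falling below or above a user-defined interval of the target attribute deviates markedly from the population share.

```python
# Compare the low/high tail shares of a candidate group against the population
# for a user-defined interval of the target attribute (here, salary).

def tail_shares(values, low, high):
    n = len(values)
    below = sum(v < low for v in values) / n
    above = sum(v > high for v in values) / n
    return below, above

population_salaries = [25, 30, 32, 35, 40, 42, 45, 50, 55, 60, 70, 90, 120]
group_salaries = [70, 85, 90, 110, 120]  # e.g. "education = PhD AND sector = finance"

pop_low, pop_high = tail_shares(population_salaries, low=30, high=60)
grp_low, grp_high = tail_shares(group_salaries, low=30, high=60)
print(f"population: {pop_low:.2f} below / {pop_high:.2f} above the interval")
print(f"group:      {grp_low:.2f} below / {grp_high:.2f} above the interval")
# A much larger 'above' share for the group flags it as atypical on the high side.
```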