A Tool for Interactive Subgroup Discovery Using Distribution Rules
Related papers
Visual Interactive Subgroup Discovery with Numerical Properties of Interest
Lecture Notes in Computer Science, 2006
Subgroup discovery consists in finding subsets of individuals from a given population which have distinctive collective properties with regard to one or more properties of interest. The interest of a subgroup can be objectively assessed using appropriate statistics, but it can also be evaluated by a data analyst or domain expert. In this paper we propose an approach to subgroup discovery via distribution rules (a kind of association rules with a probability distribution on the consequent) for numerical properties of interest. The objective interest of the subgroups is measured through statistical goodness of fit tests. The subjective interest of the subgroups can be assessed by the data analyst through a visual interactive subgroup browsing procedure.
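As a rough illustration of the goodness-of-fit idea in the abstract above, the following sketch compares the distribution of a numerical property inside a subgroup with its distribution in the whole population. It assumes a two-sample Kolmogorov-Smirnov statistic as the measure of objective interest; the abstract does not name the test, and the function name `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical cumulative distribution functions (ECDFs)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # ECDFs are step functions that only jump at observed values,
    # so it is enough to check the gap at each observed value.
    for x in set(a) | set(b):
        ecdf_a = bisect_right(a, x) / len(a)
        ecdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Hypothetical data: the property of interest for the whole population
# and for a candidate subgroup concentrated in the upper half.
population = [1, 2, 3, 4, 5, 6, 7, 8]
subgroup = [5, 6, 7, 8]
print(ks_statistic(population, subgroup))
```

A larger statistic means the subgroup's distribution deviates more from the population's; a full test would also turn the statistic into a p-value, which this sketch leaves out.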
This paper investigates the implications of example weighting in subgroup discovery by comparing three state-of-the-art subgroup discovery algorithms, APRIORI-SD, CN2-SD, and SubgroupMiner on a real-life data set. While both APRIORI-SD and CN2-SD use example weighting in the process of subgroup discovery, SubgroupMiner does not. Moreover, APRIORI-SD uses example weighting in the post-processing step of selecting the 'best' rules, while CN2-SD uses example weighting during rule induction. The results of the application of the three subgroup discovery algorithms on a real-life data set, the UK Traffic challenge data set, are presented in the form of ROC curves showing that APRIORI-SD slightly outperforms CN2-SD; both APRIORI-SD and CN2-SD are good in finding small and highly accurate subgroups (describing minority classes), while SubgroupMiner found larger and less accurate subgroups (describing the majority class). We show by using ROC analysis that these results are not surp...
An overview on subgroup discovery: Foundations and applications
2010
Subgroup discovery is a data mining technique which extracts interesting rules with respect to a target variable. An important characteristic of this task is the combination of predictive and descriptive induction. An overview related to the task of subgroup discovery is presented. This review focuses on the foundations, algorithms, and advanced studies together with the applications of subgroup discovery presented throughout the specialised bibliography.
REVIEW ON DATA MINING TECHNIQUES FOR SUBGROUP DISCOVERY
Subgroup discovery is a data mining technique that extracts interesting rules with respect to a target variable. A key feature of this method is the combination of predictive and descriptive induction. This survey highlights the foundations, algorithms, and advanced studies together with the applications of subgroup discovery. This paper presents a novel data mining system for the investigation and extraction of knowledge from information created by electricity meters. Although they are a rich source of data for energy utilization analysis, power meters deliver a voluminous, fast-paced, transient stream of information that traditional methodologies cannot fully address. To overcome these issues, a data mining framework must incorporate functionality for interim summarization and incremental analysis using intelligent procedures. Subgroups that are large and whose patterns are unusual have to be discovered, and their models have to be generated first. Many algorithms have been used to address a wide range of data mining problems. This paper gives a survey on subgroup discovery patterns from smart electricity meter data.
Apriori-SD: Adapting Association Rule Learning to Subgroup Discovery
Applied Artificial Intelligence, 2006
This paper presents a subgroup discovery algorithm APRIORI-SD, developed by adapting association rule learning to subgroup discovery. The paper contributes to subgroup discovery, to a better understanding of the weighted covering algorithm, and the properties of the weighted relative accuracy heuristic by analyzing their performance in the ROC space. An experimental comparison with rule learners CN2, RIPPER, and APRIORI-C on UCI data sets demonstrates that APRIORI-SD produces substantially smaller rulesets, where individual rules have higher coverage and significance. APRIORI-SD is also compared to subgroup discovery algorithms CN2-SD and SubgroupMiner. The comparisons performed on U.K. traffic accident data show that APRIORI-SD is a competitive subgroup discovery algorithm. Standard rule learning algorithms are designed to construct classification and prediction rules (Michalski et al. 1986; Clark and Niblett 1989; Cohen 1995). In addition to this area of machine learning, referred to as supervised learning or predictive induction, developments in descriptive induction have recently gained much attention, in particular association rule learning (Agrawal et al. 1993), subgroup discovery (Wrobel 1997; 2001), and other approaches to non-classificatory induction. This paper considers the task of subgroup discovery defined as follows (Wrobel 1997; 2001): given a population of individuals and a specific property of the individuals that we are interested in, find population subgroups that are statistically "most interesting," e.g., are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest.
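The weighted relative accuracy heuristic mentioned in the abstract above trades off subgroup size against distributional unusualness. A minimal sketch follows, assuming the commonly used definition WRAcc(Cond -> Class) = p(Cond) * (p(Class | Cond) - p(Class)) and unweighted examples; the function name `wracc` and the toy data are hypothetical, and this is not the APRIORI-SD implementation.

```python
def wracc(covered, positive):
    """Weighted relative accuracy of a rule Cond -> Class.

    covered[i]  is True if example i satisfies the rule condition.
    positive[i] is True if example i belongs to the target class.
    """
    n = len(covered)
    if n == 0 or n != len(positive):
        raise ValueError("inputs must be non-empty and of equal length")
    n_cond = sum(1 for c in covered if c)
    if n_cond == 0:
        return 0.0  # a rule that covers nothing carries no information
    n_class = sum(1 for p in positive if p)
    n_both = sum(1 for c, p in zip(covered, positive) if c and p)
    coverage = n_cond / n                # p(Cond): generality of the rule
    gain = n_both / n_cond - n_class / n # p(Class | Cond) - p(Class)
    return coverage * gain


# Hypothetical toy data: 10 examples, the rule covers 4 of them,
# and 3 of the covered examples are in the target class.
covered = [True, True, True, True, False, False, False, False, False, False]
positive = [True, True, True, False, True, False, False, False, False, False]
print(wracc(covered, positive))
```

In this toy case the rule covers 40% of the examples and raises the target rate from 0.4 to 0.75, so the score is positive; a rule whose covered examples match the overall class rate scores zero regardless of its size.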
Subgroup Discovery with Proper Scoring Rules
Machine Learning and Knowledge Discovery in Databases
Subgroup Discovery is the process of finding and describing sufficiently large subsets of a given population that have unusual distributional characteristics with regard to some target attribute. Such subgroups can be used as a statistical summary which improves on the default summary of stating the overall distribution in the population. A natural way to evaluate such summaries is to quantify the difference between predicted and empirical distribution of the target. In this paper we propose to use proper scoring rules, a well-known family of evaluation measures for assessing the goodness of probability estimators, to obtain theoretically well-founded evaluation measures for subgroup discovery. From this perspective, one subgroup is better than another if it has lower divergence of target probability estimates from the actual labels on average. We demonstrate empirically on both synthetic and real-world data that this leads to higher quality statistical summaries than the existing methods based on measures such as Weighted Relative Accuracy.
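To make the idea of scoring a subgroup as a statistical summary more concrete, the sketch below uses the Brier score, one well-known proper scoring rule, on a binary target. It is an illustration under stated assumptions rather than the paper's method: the summary is assumed to predict the subgroup's positive rate for members and the remaining examples' positive rate for non-members, and the names `brier_score` and `summary_score` and the toy data are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def summary_score(in_subgroup, labels):
    """Brier score of a summary that predicts one rate inside the subgroup
    and another rate outside it (an assumption made for this sketch)."""
    overall = sum(labels) / len(labels)
    inside = [y for s, y in zip(in_subgroup, labels) if s]
    outside = [y for s, y in zip(in_subgroup, labels) if not s]
    # Fall back to the overall rate when one side is empty.
    rate_in = sum(inside) / len(inside) if inside else overall
    rate_out = sum(outside) / len(outside) if outside else overall
    probs = [rate_in if s else rate_out for s in in_subgroup]
    return brier_score(probs, labels)


# Hypothetical toy data: 8 examples, 3 positives, all inside the subgroup.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
subgroup = [True, True, True, True, False, False, False, False]
default = summary_score([False] * len(labels), labels)  # overall rate only
with_subgroup = summary_score(subgroup, labels)
print(default, with_subgroup)
```

A lower score means the summary's probability estimates sit closer to the actual labels, so in this toy case the subgroup summary improves on the default of stating only the overall rate.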
2011 15th International Conference on Information Visualisation, 2011
Discovering interesting patterns in datasets is a very important data mining task. Subgroup patterns are local findings identifying the subgroups of a population with some unusual, unexpected, or deviating distribution of a target attribute. However, this pattern discovery task poses several compelling challenges. First, computational data mining techniques can generally only discover and extract pre-defined patterns. Second, since the extracted patterns are typically multi-dimensional arbitrary-shaped regions, they are very difficult to convey in an easily interpretable manner. Finally, in order to assist analysts in exploring their discoveries and understanding the relationships among patterns, as well as connections between patterns and the underlying data instances, an integrated visualization system is greatly needed. In this paper, we present a novel subgroup pattern extraction and visualization system, called the Nugget Browser, that takes advantage of both data mining methods and interactive visual exploration. The system accepts analysts' mining queries interactively, converts the query results into an understandable form, builds visual representations, and supports navigation and exploration for further analyses.
Rule induction for subgroup discovery: A case study in mining UK traffic accident data
… of the international multi-conference on …, 2002
Rule learning is typically used in solving classification and prediction tasks. However, learning of classification rules can also be adapted to subgroup discovery. Such an adaptation has already been done for the CN2 rule learning algorithm. In previous work this new algorithm, called CN2-SD, has been described in detail and applied to the well-known UCI data sets showing its
Decision Support Through Subgroup Discovery: Three Case Studies and the Lessons Learned
Machine Learning, 2004
This paper presents ways to use subgroup discovery to generate actionable knowledge for decision support. Actionable knowledge is explicit symbolic knowledge, typically presented in the form of rules, that allows the decision maker to recognize some important relations and to perform an appropriate action, such as targeting a direct marketing campaign, or planning a population screening campaign aimed at detecting individuals with high disease risk. Different subgroup discovery approaches are outlined, and their advantages over using standard classification rule learning are discussed. Three case studies, a medical and two marketing ones, are used to present the lessons learned in solving problems requiring actionable knowledge generation for decision support.
Mining atypical groups for a target quantitative attribute
2008 IEEE Conference on Cybernetics and Intelligent Systems, 2008
An important task in data analysis is the understanding of unexpected or atypical behaviors in a group of individuals. Which categories of individuals earn the highest salaries or, on the contrary, which ones earn the lowest salaries? We present the problem of how atypical groups can be mined with respect to a target quantitative attribute, for instance the attribute "salary", and in particular for the high and low values of a user-defined interval. Our search therefore focuses on conjunctions of attributes whose distribution differs significantly from the learning set for the interval's high and low values of the target attribute. Such atypical groups can be found by adapting an existing measure, the intensity of inclination. This measure frees us from the transformation step of quantitative attributes, that is to say the step of discretization followed by a complete disjunctive coding. Thus, we propose an algorithm for mining such groups using pruning rules in order to reduce the complexity of the problem. This algorithm has been developed and integrated into the WEKA software for knowledge extraction. Finally we give an example of data extraction from the American census database IPUMS.