Classification of Partially Observed Data with Association Trees

Classification Based on Predictive Association Rules of Incomplete Data

IEICE Transactions on Information and Systems, 2012

Classification based on predictive association rules (CPAR) is a widely used associative classification method. Despite its efficiency, CPAR's results are affected by missing values in the data set, so it is not always possible to analyze the classification results correctly. In this letter, we improve CPAR to deal with the problem of missing data. The effectiveness of the proposed method is demonstrated using various classification examples.

Proactive Data Mining with Decision Trees

SpringerBriefs in Electrical and Computer Engineering, 2014


A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees

Journal of Behavioral Data Science, 2021

Decision tree (DT) learning is a machine learning technique that searches the predictor space for the variable and splitting value that yield the best prediction when the data are split into two nodes. The algorithm repeats its search within each partition of the data until a stopping rule ends the search. Missing data can be problematic in DTs because an observation with a missing value cannot be placed into a node based on the chosen splitting variable. Moreover, missing data can alter the variable selection process itself. Simple missing data approaches (e.g., listwise deletion, majority rule, and surrogate splits) have been implemented in DT algorithms; however, more sophisticated missing data techniques have not been thoroughly examined. We propose a modified multiple imputation approach to handling missing data in DTs, and compare this approach with simple missing data approaches as ...
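
The greedy split search this abstract summarizes can be sketched in a few lines. This is a minimal illustration of the general idea, not the paper's algorithm; the function names, the misclassification criterion, and the toy data are all assumptions.

```python
# Sketch of the greedy split search described above: for each predictor
# and candidate threshold, split the data in two and keep the split that
# minimizes the weighted misclassification rate. Names are illustrative.

def misclassification(labels):
    """Fraction of labels that disagree with the majority label."""
    if not labels:
        return 0.0
    majority = max(set(labels), key=labels.count)
    return sum(1 for y in labels if y != majority) / len(labels)

def best_split(rows, labels):
    """Search every (variable, value) pair for the lowest weighted error."""
    n, n_vars = len(rows), len(rows[0])
    best = None  # (error, variable index, threshold)
    for j in range(n_vars):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            if not left or not right:
                continue  # a split must send data to both nodes
            err = (len(left) * misclassification(left)
                   + len(right) * misclassification(right)) / n
            if best is None or err < best[0]:
                best = (err, j, threshold)
    return best

rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (4.0, 2.0)]
labels = [0, 0, 1, 1]
print(best_split(rows, labels))  # -> (0.0, 0, 2.0): a perfect split
```

A full tree learner would recurse on each resulting partition until a stopping rule fires; the missing-data problem the abstract raises appears exactly at the `r[j] <= threshold` test, which is undefined when `r[j]` is missing.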

An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data

Journal of Machine Learning Research, 2010

There are many different methods used by classification tree algorithms when missing data occur in the predictors, but few studies have been done comparing their appropriateness and performance. This paper provides both analytic and Monte Carlo evidence regarding the effectiveness of six popular missing data methods for classification trees applied to binary response data. We show that in the ...

Learning Decision Tree Classifiers From Attribute Value Taxonomies and Partially Specified Data

MACHINE LEARNING-INTERNATIONAL …, 2003

We consider the problem of learning to classify partially specified instances i.e., instances that are described in terms of attribute values at different levels of precision, using user-supplied attribute value taxonomies (AVT). We formalize the problem of learning from AVT and data and present an AVT-guided decision tree learning algorithm (AVT-DTL) to learn classification rules at multiple levels of abstraction. The proposed approach generalizes existing techniques for dealing with missing values to handle instances with partially missing values. We present experimental results that demonstrate that AVT-DTL is able to effectively learn robust high accuracy classifiers from partially specified examples. Our experiments also demonstrate that the use of AVT-DTL outperforms standard decision tree algorithm (C4.5 and its variants) when applied to data with missing attribute values; and produces substantially more compact decision trees than those obtained by standard approach.
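
The notion of a "partially specified" value can be made concrete with a small sketch. This is an illustration of matching values against an attribute value taxonomy, not the AVT-DTL algorithm itself; the taxonomy contents and function names are invented for the example.

```python
# Illustrative sketch: an attribute value taxonomy as a child -> parent
# map, and a test for whether a partially specified observation (e.g.
# "fruit" instead of "gala") is consistent with a split value.

AVT = {
    "gala": "apple", "fuji": "apple",    # leaves
    "apple": "fruit", "banana": "fruit", # intermediate levels
    "fruit": "food",                     # root
}

def ancestors(value, taxonomy):
    """Chain from a value up to the taxonomy root, inclusive."""
    chain = [value]
    while value in taxonomy:
        value = taxonomy[value]
        chain.append(value)
    return chain

def consistent(observed, required, taxonomy):
    """An observation matches a split value when one lies on the other's
    path to the root, i.e. one abstracts the other."""
    return (required in ancestors(observed, taxonomy)
            or observed in ancestors(required, taxonomy))

print(consistent("gala", "apple", AVT))    # True: gala is a kind of apple
print(consistent("fruit", "gala", AVT))    # True: 'fruit' abstracts 'gala'
print(consistent("banana", "apple", AVT))  # False: different branches
```

A fully missing value is the limiting case of this scheme: it behaves like an observation at the taxonomy root, consistent with every split value, which is how partial specification generalizes the usual missing-value treatments.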

Partial Classification using Association Rules

Many real-life problems require a partial classification of the data. We use the term "partial classification" to describe the discovery of models that show characteristics of the data classes, but may not cover all classes and all examples of any given class. Complete classification may be infeasible or undesirable when there are a very large number of class attributes, most attribute values are missing, or the class distribution is highly skewed and the user is interested in understanding the low-frequency class. We show how association rules can be used for partial classification in such domains, and present two case studies: reducing telecommunications order failures and detecting redundant medical tests.
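
The core idea, mining rules that characterize one target class without trying to cover every example, can be sketched briefly. This is a toy illustration, not the paper's method; the thresholds, data, and function name are assumptions, and real miners use Apriori-style pruning rather than brute-force enumeration.

```python
from itertools import combinations

# Sketch of partial classification: enumerate one- and two-item rules
# "itemset -> target class" and keep only those above minimum support
# and confidence. Data and thresholds are illustrative.

def class_rules(transactions, labels, target, min_support=0.2, min_conf=0.8):
    """Return {itemset: confidence} for rules itemset -> target."""
    n = len(transactions)
    rules = {}
    items = sorted({i for t in transactions for i in t})
    for size in (1, 2):
        for itemset in combinations(items, size):
            covered = [y for t, y in zip(transactions, labels)
                       if set(itemset) <= t]
            if len(covered) / n < min_support:
                continue
            conf = sum(1 for y in covered if y == target) / len(covered)
            if conf >= min_conf:
                rules[itemset] = conf
    return rules

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}, {"c"}]
labels = ["fail", "ok", "fail", "ok", "ok"]
print(class_rules(transactions, labels, "fail"))  # -> {('a', 'b'): 1.0}
```

Note that the mined rule set characterizes the "fail" class without covering the "ok" examples at all, which is exactly the partial-classification setting the abstract describes.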

Proactive data mining using decision trees

2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 2012

Most of the existing data mining algorithms are 'passive'. That is, they produce models which can describe patterns, but leave the decision on how to react to these patterns in the hands of the user. In contrast, in this work we describe a proactive approach to data mining, and describe an implementation of that approach, using decision trees. We show that the proactive role requires the algorithms to consider additional domain knowledge, which is exogenous to the training set. We also suggest a novel splitting criterion, termed maximal utility, which is driven by the proactive agenda.

Addressing the problem of missing data in decision tree modeling

Journal of Applied Statistics, 2017

Tree-based models (TBMs) can handle missing data using the surrogate approach (SUR). The aim of this study is to compare the performance of statistical imputation against that of SUR in TBMs. Employing empirical data, a TBM was constructed. Thereafter, 10%, 20%, and 40% of the values of the variable appearing as the first split were deleted and imputed, with and without the outcome variable in the imputation model (IMP+ and IMP−). This was repeated one thousand times. An absolute relative bias above 0.10 was defined as severe (SARB). Subsequently, in a series of simulations, the following parameters were varied: the degree of correlation among variables, the number of variables truly associated with the outcome, and the missing rate. At a 10% missing rate, the proportion of times SARB was observed in either SUR or IMP− was two times higher than in IMP+ (28% versus 13%). When the missing rate was increased to 20%, all these proportions approximately doubled. Irrespective of the missing rate, IMP+ was about 65% less likely to produce SARB than SUR. Results of IMP− and SUR were comparable up to a 20% missing rate. At a higher missing rate, IMP− was 76% more likely to produce SARB estimates. Statistical imputation of missing data, with the outcome variable included in the imputation model, is recommended, even in the context of TBMs.
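
The IMP+/IMP− distinction, whether the outcome variable informs the imputation of a predictor, can be shown with a deliberately simple contrast. This toy uses plain means rather than the study's actual imputation models; all names and data are illustrative.

```python
from statistics import mean

# Toy contrast between imputing a predictor without the outcome
# (overall mean, in the spirit of IMP-) and with it (outcome-group
# mean, in the spirit of IMP+). Not the study's actual procedure.

def impute_without_outcome(x):
    """Replace missing values with the overall mean of observed x."""
    m = mean(v for v in x if v is not None)
    return [m if v is None else v for v in x]

def impute_with_outcome(x, y):
    """Replace missing values with the mean of x within the outcome group."""
    groups = {}
    for xi, yi in zip(x, y):
        if xi is not None:
            groups.setdefault(yi, []).append(xi)
    return [mean(groups[yi]) if xi is None else xi for xi, yi in zip(x, y)]

x = [1.0, 2.0, None, 10.0, 11.0, None]
y = [0, 0, 0, 1, 1, 1]
print(impute_without_outcome(x))  # missing values become the overall mean 6.0
print(impute_with_outcome(x, y))  # group means: 1.5 for y=0, 10.5 for y=1
```

The toy makes the study's intuition visible: when the predictor is associated with the outcome, ignoring the outcome pulls imputed values toward the grand mean and biases the subsequent split, whereas conditioning on the outcome preserves the group structure.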

Using association rules for better treatment of missing values

2009

The quality of training data for knowledge discovery in databases (KDD) and data mining depends on many factors, but handling missing values is considered a crucial factor in overall data quality. Real-world datasets contain missing values due to human error, operational error, hardware malfunction, and many other causes. The quality of the knowledge extracted, and of learning and decision problems, depends directly on the quality of the training data. Considering the importance of handling missing values in KDD and data mining tasks, in this paper we propose a novel Hybrid Missing values Imputation Technique (HMiT) that combines association rule mining with a k-nearest neighbor approach. To check the effectiveness of HMiT, we also present detailed experimental results on real-world datasets. Our results suggest that HMiT is not only better in terms of accuracy but also takes less processing time compared to the current best missing value imputation technique based on the k-nearest neighbor approach, which demonstrates the effectiveness of our imputation technique.
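
The hybrid structure, try an association rule first and fall back to a nearest-neighbor value when no rule fires, can be sketched as below. This is an illustration in the spirit of the abstract, not the HMiT algorithm: the hand-coded rule, records, and one-neighbor distance measure are all assumptions.

```python
# Hedged sketch of hybrid imputation: a matching association rule, if
# any, supplies the missing value; otherwise the value is copied from
# the nearest complete record. Rules and data are illustrative.

RULES = [
    # (antecedent attribute values, target attribute, imputed value)
    ({"smoker": "yes"}, "risk", "high"),
]

def rule_impute(record, target):
    """Return a rule-based value for `target`, or None if no rule fires."""
    for antecedent, attr, value in RULES:
        if attr == target and all(record.get(k) == v
                                  for k, v in antecedent.items()):
            return value
    return None

def knn_impute(record, target, complete, key):
    """Fallback: copy `target` from the nearest complete record on `key`."""
    nearest = min(complete, key=lambda r: abs(r[key] - record[key]))
    return nearest[target]

complete = [{"age": 30, "risk": "low"}, {"age": 60, "risk": "high"}]

r1 = {"smoker": "yes", "age": 40}
r2 = {"smoker": "no", "age": 33}
print(rule_impute(r1, "risk") or knn_impute(r1, "risk", complete, "age"))  # high (rule)
print(rule_impute(r2, "risk") or knn_impute(r2, "risk", complete, "age"))  # low (kNN)
```

The speed claim in the abstract follows from this structure: a rule lookup is cheap, so the expensive nearest-neighbor search runs only for records no mined rule covers.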