PROMISSING: Pruning Missing Values in Neural Networks
Related papers
Applied Medical Informatics, 2021
Background: Not all datasets are created equal. In some happy scenarios the researcher has the luxury of curating the dataset and ensuring all the desired fields are filled. However, especially when retrieving data from large EMR databases in the wild, missing values are the norm (e.g., only some patients have blood sugar readings, only a portion of patients had transaminases determined, etc.). Yet the vast majority of machine learning models do not support missing values; hence, traditionally, we call on imputation methods for rescue. But when a significant portion of values is missing, imputation methods may insert incorrect data. Also, dropping cases with missing values is not feasible when working with EMR data (too many instances are removed). Aim: To implement a missing-values-proof ensemble modeling method without sacrificing predictive power. Materials and Methods: Using a realistic synthetic patient generator (Synthea), we generated large FHIR datasets under various settings. We implemented a Cartesian genetic programming model to develop an automated acceptance test that, in turn, extracted missing-values-free subsets from the training partition of the original dataset. One of the core functions of the acceptance test is to ensure that potential missing-value patterns do not bias the model training step. We trained one or more models on each missing-values-free training subset. The resulting models are integrated into a global ensemble model. Results: Our approach yielded superior model performance metrics without the need for missing-value imputation. Conclusions: Machine learning on datasets containing missing values is feasible when employing specialized training-dataset generation pipelines. Removing missing-value imputation from the workflow eliminates potentially incorrect data insertion, resulting in more robust models.
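The core idea of training imputation-free models on complete-case subsets and ensembling them can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the acceptance test based on Cartesian genetic programming is replaced by simple grouping on missingness patterns, and `fit` is a caller-supplied training function (here, a trivial mean predictor is assumed for the demonstration).

```python
from collections import defaultdict

def missingness_pattern(row, features):
    """Frozenset of the features actually observed in this row."""
    return frozenset(f for f in features if row.get(f) is not None)

def train_pattern_ensemble(rows, features, target, fit):
    """Group rows by missingness pattern and fit one model per
    complete-case subset; no imputation is ever performed."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[missingness_pattern(row, features)].append(row)
    # one model per pattern, trained only on the features that pattern observes
    return {pat: fit(subset, sorted(pat), target) for pat, subset in subsets.items()}

def predict_ensemble(models, row, features):
    """Average the predictions of every model whose required
    features are all observed in the incoming row."""
    observed = missingness_pattern(row, features)
    preds = [m(row) for pat, m in models.items() if pat <= observed]
    return sum(preds) / len(preds) if preds else None
```

At prediction time, an incomplete row is simply routed to the subset models it can satisfy, so missing values never have to be filled in.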
A survey on missing data in machine learning
Journal of Big Data
Machine learning has been the cornerstone of analysing and extracting information from data, and a problem of missing values is often encountered. Missing values occur under various mechanisms: missing completely at random, missing at random, or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. In the literature there have been several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, particularly focusing on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing-value imputation techniques, how they perform, their limitations, and the kind of data they are most suitable for. We propose and evaluat...
Handling Missing Values via a Neural Selective Input Model
Neural Network World, 2012
Missing data represent a ubiquitous problem with numerous and diverse causes. Handling Missing Values (MVs) properly is a crucial issue, in particular in Machine Learning (ML) and pattern recognition. To date, the only option available for standard Neural Networks (NNs) to handle this problem has been to rely on pre-processing techniques such as imputation for estimating the missing data values, which has considerably limited the scope of their application. To circumvent this limitation we propose a Neural Selective Input Model (NSIM) that accommodates different transparent and bound models, while providing support for NNs to handle MVs directly. By embedding the mechanisms to support MVs we can obtain better models that reflect the uncertainty caused by unknown values. Experiments on several UCI datasets with different distributions and proportions of MVs show that the NSIM approach is very robust and yields good to excellent results. Furthermore, the NSIM performs better than the state-of-the-art imputation techniques either with a higher prevalence of MVs in a large number of features or with a significant proportion of MVs, while delivering competitive performance in the remaining cases. We demonstrate the usefulness and validity of the NSIM, making this a first-class method for dealing with this problem.
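A common way to let a standard network condition on missing values directly, rather than on imputed guesses, is a value-plus-mask encoding. Note this is a generic stand-in for illustration, not the paper's exact NSIM formulation (which selectively adapts the input layer per missingness pattern):

```python
def selective_input(x):
    """Encode a possibly-incomplete input vector as (values, presence mask).

    Missing entries (None) become 0.0 in the value part and 0.0 in the
    mask part, so downstream weights can distinguish 'observed zero'
    from 'unobserved' without any imputation step.
    """
    values = [0.0 if v is None else float(v) for v in x]
    mask = [0.0 if v is None else 1.0 for v in x]
    return values + mask
```

A network trained on this doubled representation can learn pattern-specific behaviour, reflecting the uncertainty caused by unknown values instead of treating imputed entries as real observations.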
Improving deep learning performance with missing values via deletion and compensation
Neural Computing and Applications, 2019
Missing values in a dataset are one of the most common difficulties in real applications. Many different techniques based on machine learning have been proposed in the literature to face this problem. In this work, the great representation capability of stacked denoising auto-encoders is used to obtain a new method of imputing missing values based on two ideas: deletion and compensation. This method improves imputation performance by artificially deleting values in the input features and using them as targets in the training process. Nevertheless, although the deletion of samples is demonstrated to be effective, it may cause an imbalance between the distributions of the training and the test sets. In order to solve this issue, a compensation mechanism is proposed based on a slight modification of the error function to be optimized. Experiments over several datasets show that deletion and compensation yield improvements not only in imputation but also in classification, in comparison with other classical techniques.
Machine Learning Algorithms for Handling Missing Healthcare Data
The imputation of missing data in healthcare records is a critical task for ensuring the integrity and utility of medical datasets. Traditional methods often rely on simplistic approaches, leading to potential biases and inaccuracies in downstream analyses. In this paper, we propose a novel machine learning approach for imputing missing healthcare data, aiming to enhance the accuracy and robustness of imputation while preserving the underlying patterns and relationships in the data. Our approach leverages advanced machine learning techniques, including ensemble and deep learning architectures, to effectively capture complex dependencies and correlations within healthcare datasets. We demonstrate the efficacy of our method through comprehensive experiments on real-world healthcare datasets, showcasing superior imputation performance compared to conventional techniques. Furthermore, we discuss the interpretability and scalability aspects of our approach, highlighting its potential for practical deployment in healthcare analytics and decision support systems. Our proposed machine learning approach offers a promising solution for addressing missing data challenges in healthcare, paving the way for more accurate and reliable data-driven insights in medical research and practice.
Informatics in Medicine Unlocked, 2021
Recently, numerous studies have been conducted on Missing Value Imputation (MVI) as the primary solution scheme for datasets containing one or more missing attribute values. The incorporation of MVI reinforces the performance of Machine Learning (ML) models, which necessitates a systematic review of the MVI methodologies employed for different tasks and datasets. Such a review can guide beginners toward composing an effective ML-based decision-making system in various fields of application. This article aims to conduct a rigorous review and analysis of the state-of-the-art MVI methods in the literature published in the last decade. Altogether, 191 articles, published from 2010 to August 2021, are selected for review using the well-known Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) technique. We summarize those articles with relevant definitions, theories, and analyses to provide essential information for building a precise decision-making framework. In addition, the evaluation metrics employed for MVI methods and ML-based classification models are also discussed and explored. Remarkably, the trends for the MVI method and its evaluation are also scrutinized from the last twelve years' data. Finally, several ML-based pipelines, in which MVI schemes are incorporated for performance enhancement, are investigated and reviewed across many different datasets. In the end, informative observations and recommendations are addressed for future research directions and trends in related fields of interest.
The impact of imputation quality on machine learning classifiers for datasets with missing values
Communications Medicine
Background Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier’s performance. Methods We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of t...
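The sliced Wasserstein distance underlying the discrepancy scores mentioned above can be computed by Monte-Carlo projection. A minimal sketch for equal-sized point clouds (the paper's exact score construction may differ; the number of projections and the use of W1 here are assumptions):

```python
import math
import random

def sliced_wasserstein(X, Y, n_projections=50, rng=None):
    """Monte-Carlo sliced 1-Wasserstein distance between two equal-sized
    point clouds, usable as a discrepancy score between imputed and
    reference data distributions."""
    rng = rng or random.Random(0)
    d = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        # random direction on the unit sphere
        v = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        v = [c / norm for c in v]
        # project both clouds to 1-D; W1 between equal-sized 1-D samples
        # is the mean absolute difference of the sorted projections
        px = sorted(sum(c * xi for c, xi in zip(v, x)) for x in X)
        py = sorted(sum(c * yi for c, yi in zip(v, y)) for y in Y)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections
```

Identical clouds score zero, and the score grows as the imputed distribution drifts from the reference, which is what makes it attractive for assessing imputation quality beyond per-value error.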
ExtraImpute: A Novel Machine Learning Method for Missing Data Imputation
Journal of Advances in Information Technology
Missing values are a common occurrence in healthcare datasets. Their existence usually leads to undesirable results when conducting data analysis using machine learning methods. Recently, researchers have proposed several imputation approaches to deal with missing values in real-world datasets. Moreover, data imputation helps build high-performance machine learning models to discover patterns in healthcare data, providing top-notch insights for higher-quality decision-making. In this paper, we propose a new imputation approach, named ExtraImpute, using Extremely Randomized Trees (Extra Trees), an ensemble machine learning method, to tackle numerical missing values in the healthcare context. The proposed method can impute both continuous and discrete data features. It imputes each missing value in a feature by predicting its value from the other observed values in the dataset. To evaluate the efficiency of our algorithm, several experiments are conducted on five different benchmark healthcare datasets and compared to other commonly used imputation methods, viz. missForest, KNNImpute, Multivariate Imputation by Chained Equations (MICE), and SoftImpute. The results were validated using Root Mean Square Error (RMSE) and Coefficient of Determination (R²) scores. From these results, it was observed that our proposed algorithm outperforms existing imputation techniques.
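The predict-each-missing-value-from-the-others scheme can be approximated with off-the-shelf components. A sketch, assuming scikit-learn is available and using its `IterativeImputer` with an `ExtraTreesRegressor` estimator as a stand-in for ExtraImpute (the paper's own algorithm and hyperparameters are not reproduced here):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def extra_trees_impute(X, random_state=0):
    """Impute each missing value by regressing its feature on the other
    features with Extremely Randomized Trees, iterating until the
    round-robin estimates stabilise."""
    imputer = IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=50, random_state=random_state),
        random_state=random_state,
    )
    return imputer.fit_transform(X)
```

Because tree predictions are averages of observed training targets, imputed values stay inside the observed range of each feature, which avoids the out-of-range artifacts that linear extrapolation can produce.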
Influence of Missing Values on Artificial Neural Network Performance
Studies in health technology and informatics, 2001
The problem of databases containing missing values is a common one in the medical environment. Researchers must find a way to incorporate the incomplete data into the data set to use those cases in their experiments. Artificial neural networks (ANNs) cannot interpret missing values, and when a database is highly skewed, ANNs have difficulty identifying the factors leading to a rare outcome. This study investigates the impact on ANN performance, when predicting neonatal mortality, of increasing the number of cases with missing values in the data sets. Although previous work using the Canadian Neonatal Intensive Care Unit (NICU) Network's database showed that the ANN could not correctly classify any patients who died when the missing values were replaced with normal or mean values, this problem did not arise as expected in this study. Instead, the ANN consistently performed better than the constant predictor (which classifies all cases as belonging to the outcome with the highest traini...
A review of challenges and solutions for using machine learning approaches for missing data
2024
Missing data poses significant challenges to the reliability of statistical analyses and predictive modeling across diverse research fields. This paper provides an in-depth review of both traditional and machine learning imputation techniques, enabling researchers to navigate the complexities of missing data with greater efficacy. We evaluate simple imputation methods, such as mean, median, and mode, and delve into more sophisticated strategies including regression-based, hot- and cold-deck, and probabilistic models such as Gaussian Mixture Models and K-Nearest Neighbors. Furthermore, the paper explores cutting-edge machine learning approaches like Random Forest, Multiple Imputation by Chained Equations, and deep learning models such as autoencoders and Generative Adversarial Networks. Our comprehensive analysis highlights the effectiveness of each method, tailored to the various missing data mechanisms (MCAR, MAR, and NMAR), providing actionable insights for researchers to enhance data integrity and improve the outcomes of their studies.
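The simple baselines the review starts from (mean, median, and mode imputation) fit in a few lines. A minimal single-column sketch using Python's standard `statistics` module:

```python
from statistics import mean, median, mode

def simple_impute(values, strategy="mean"):
    """Fill None entries in one feature column with a summary statistic
    of the observed values: a reasonable baseline under MCAR, but one
    that shrinks variance and distorts correlations with other features."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]
```

The median variant is the usual choice for skewed numeric features, and the mode variant is the natural analogue for categorical ones; the more sophisticated methods surveyed above exist precisely because these baselines ignore inter-feature structure.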