Gradient Boosting Research Papers - Academia.edu
Understanding what aspects of the urban environment are associated with better socioeconomic/liveability outcomes is a long-standing research topic. Several quantitative studies have investigated such relationships. However, most such works analysed single correlations, thus failing to obtain a more complete picture of how the urban environment can help explain the observed phenomena. More recently, multivariate models have been suggested. However, they use a limited set of metrics, propose a coarse spatial unit of analysis, and assume linearity and independence among regressors. In this paper, we propose a quantitative methodology to study the relationship between a more comprehensive set of metrics of the urban environment and the valorisation of street segments, one that handles non-linearity and possible interactions among variables through the use of Machine Learning (ML). The proposed methodology was tested on the French Riviera, and outputs show a moderate predictive capacity (i.e., adjusted R² = 0.75) and insightful explanations of the nuanced relationships between selected features of the urban environment and street values. These findings are clearly location-specific; however, the methodology is replicable and can thus inspire future research of this kind in different geographic contexts.
In the cyber era, Machine Learning (ML) has provided us with solutions to these problems through the implementation of Gradient Boosting Machines (GBM). We have ample algorithms to choose from for gradient boosting on our training data, but we still encounter issues like poor accuracy, high loss, and large variance in the results. Here, we are going to introduce you to a state-of-the-art machine learning algorithm, XGBoost, built by Tianqi Chen, that will not only overcome these issues but also perform exceptionally well on regression and classification problems. This blog will help you discover the insights, techniques, and skills with XGBoost that you can then bring to your machine learning projects.
eXtreme Gradient Boosting (XGBoost) is a scalable and improved version of the gradient boosting algorithm (terminology alert) designed for efficiency, computational speed, and model performance. It is an open-source library and a part of the Distributed Machine Learning Community. XGBoost is a perfect blend of software and hardware capabilities designed to enhance existing boosting techniques with accuracy in the shortest amount of time. For an objective benchmark comparison of XGBoost against other gradient boosting and random forest implementations, each trained with 500 trees, see the study performed by Szilard Pafka.
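A quick way to see XGBoost in action is its scikit-learn-style interface. The snippet below is an illustrative sketch on a toy dataset, not Pafka's benchmark code; the hyperparameter values are arbitrary choices for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(
    n_estimators=300,   # number of boosting rounds
    learning_rate=0.1,  # shrinkage applied to each new tree
    max_depth=4,        # depth of each weak learner
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```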
Nanopores in graphene, a 2D material, are currently being explored for various applications, such as gas separation, water desalination, and DNA sequencing. The shapes and sizes of nanopores play a major role in determining the performance of devices made out of graphene. However, given an arbitrary nanopore shape, anticipating its creation probability and formation time is a challenging inverse problem, solving which could help develop theoretical models for nanoporous graphene and guide experiments in tailoring pore sizes/shapes. In this work, we develop a machine learning framework to predict these target variables, i.e., formation probabilities and times, based on data generated using kinetic Monte Carlo simulations and chemical graph theory. Thereby, we enable the rapid quantification of the ease of formation of a given nanopore shape in graphene via silicon-catalyzed electron-beam etching and provide an experimental handle to realize it, in practice. We use structural features...
Gradient Boosting Decision Trees (GBDT) algorithms have proven to be among the best algorithms in machine learning. XGBoost, the most popular GBDT algorithm, has won many competitions on websites like Kaggle. However, XGBoost is not the only GBDT algorithm with state-of-the-art performance. Other GBDT algorithms, such as LightGBM and CatBoost, have advantages over XGBoost and are sometimes even more powerful. This paper aims to compare the performance of the CPU implementations of the top three gradient boosting algorithms. We start by explaining how the three algorithms work and the similarities between their hyperparameters. Then we use a variety of criteria to evaluate their performance, divided into four categories: accuracy, speed, reliability, and ease of use. The performance of the three algorithms has been tested on five classification and regression problems. Our findings show that the LightGBM algorithm has the best performance of the three, with a balanced combination of accuracy, speed, reliability, and ease of use, followed by XGBoost with the histogram method; CatBoost came last, with slow and inconsistent performance.
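The following is a minimal sketch of the kind of comparison the paper describes, timing the CPU implementations of the three libraries on a synthetic dataset; the data, tree counts, and settings are placeholders, not the paper's actual benchmark.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    # tree_method="hist" selects the histogram method mentioned above
    "XGBoost (hist)": XGBClassifier(n_estimators=200, tree_method="hist"),
    "LightGBM": LGBMClassifier(n_estimators=200),
    "CatBoost": CatBoostClassifier(n_estimators=200, verbose=0),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy={acc:.4f}, fit time={elapsed:.1f}s")
```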
Background and objective: Currently, diabetes is one of the leading causes of death in the world. Owing to several factors, the diagnosis of this disease is complex and prone to human error. This study aimed to analyze the risk of having diabetes based on laboratory information, lifestyle, and family history with the help of machine learning algorithms. When the model is trained properly, people can assess their risk of having diabetes.
Material and Methods: To classify patients, eight different machine learning algorithms (Logistic Regression, Nearest Neighbor, Decision Tree, Random Forest, Support Vector Machine, Naive Bayes, Neural Network, and Gradient Boosting) were implemented in Python and evaluated by accuracy, sensitivity, specificity, and ROC curve parameters.
Results: The model based on the gradient boosting algorithm showed the best performance, with a prediction accuracy of 95.50%.
Conclusion: In the future, this model can be used for diagnosing diabetes. This study provides a basis for further research and for developing models based on other machine learning algorithms.
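The metrics named above follow directly from a confusion matrix. A minimal sketch, with a synthetic dataset standing in for the clinical data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for laboratory, lifestyle, and family-history features
X, y = make_classification(n_samples=1000, weights=[0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier().fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()

print("accuracy:   ", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))  # true positive rate
print("specificity:", tn / (tn + fp))  # true negative rate
print("ROC AUC:    ", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```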
Cervical cancer is a tumor that grows in the cervix, the lowest part of the uterus, which attaches to the top of the vagina. Cervical cancer currently ranks fourth among the causes of death in women. One way to detect whether someone has cervical cancer is to perform a biopsy. In this study, several risk-factor variables for cervical cancer are used to predict whether or not a biopsy should be performed. The best method for this classification task is random forest with cross-validation and without resampling; if resampling is applied, the best method is Logistic Regression.
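A minimal sketch of the comparison described above, with a synthetic imbalanced dataset standing in for the cervical-cancer data and class weighting as a rough stand-in for resampling:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.94], random_state=0)

# Random forest with 10-fold cross-validation, no resampling
rf = RandomForestClassifier(random_state=0)
print("random forest:", cross_val_score(rf, X, y, cv=10).mean())

# Logistic regression with class weighting (a rough analogue of resampling)
lr = LogisticRegression(max_iter=1000, class_weight="balanced")
print("logistic regression:", cross_val_score(lr, X, y, cv=10).mean())
```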
For financial institutions, credit risk assessment is key to their business success. As the financial crisis has shown, it is also vital for the economy as a whole, and thus the performance of the prediction models attracts high attention from regulators globally. Though substantial developments have been made in the area of machine learning recently, most banks still work with traditional methods like logistic regression. Since data is expensive to obtain, most publications that applied machine learning techniques to credit scoring have used datasets with fewer than 1,000 observations. A glance at Table 1 reveals another common shortcoming: the share of defaults is mostly unrealistically high, which bypasses the problems that accompany heavily imbalanced data. We address these two shortcomings by using a real-world dataset from the Siemens Bank GmbH with 183,081 observations and a default rate of 1.39%, which is within a typical range for a bank's credit portfolio. Our results give a clear indication that state-of-the-art techniques, especially boosting algorithms, significantly outperform traditional risk models. We first briefly introduce six different machine learning techniques, then proceed to describe the data preprocessing and the experimental procedure, and then show our results and support our findings through statistical tests.
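Since the Siemens Bank data is proprietary, the sketch below uses a synthetic dataset with a comparable class imbalance to illustrate the core issue: a boosting model should be told to re-weight the rare default class, and ROC AUC is a more informative yardstick than raw accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# ~1.4% positives, mimicking a realistic default rate
X, y = make_classification(n_samples=100_000, weights=[0.986], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

ratio = (y_tr == 0).sum() / (y_tr == 1).sum()  # negatives per positive
boost = XGBClassifier(scale_pos_weight=ratio).fit(X_tr, y_tr)
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

for name, m in [("boosting", boost), ("logistic regression", logit)]:
    print(name, "AUC:", roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]))
```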
Monitoring the condition of rolling element bearings and diagnosing their faults are cumbersome jobs. Fortunately, we have machines to do the burdensome task for us. Contemporary developments in the field of machine learning allow us not only to extract features from fault signals accurately but also to analyze them and predict future bearing faults, with high accuracy, in a systematic manner. Utilizing an ensemble learning method named Gradient Boosting (GB), our paper proposes a technique to predict future fault classes based on the data obtained from analyzing the recorded fault data. To demonstrate the cogency of the method, we applied it to the rolling element bearing (REB) fault data provided by the Case Western Reserve University (CWRU) Lab. Employing this supervised learning algorithm after preprocessing the fault signals using real cepstrum analysis, we can detect and classify different types of bearing faults with a staggering 99.58% accuracy.
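The real cepstrum used for preprocessing is the inverse Fourier transform of the log magnitude spectrum. A minimal sketch, with random signals standing in for the CWRU recordings (which must be obtained separately):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def real_cepstrum(signal):
    """Inverse FFT of the log magnitude spectrum (real part)."""
    spectrum = np.fft.fft(signal)
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real

# Hypothetical stand-in: one row per vibration segment
rng = np.random.default_rng(0)
segments = rng.normal(size=(200, 1024))
labels = rng.integers(0, 4, size=200)  # e.g. four fault classes

# Keep the low-quefrency coefficients as features
features = np.array([real_cepstrum(s)[:64] for s in segments])
clf = GradientBoostingClassifier().fit(features, labels)
```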
Ensemble learning methods have received remarkable attention in recent years and have led to considerable advancement in the performance of regression and classification models. Bagging and boosting are among the most popular ensemble learning techniques proposed to reduce the prediction error of learning machines. In this study, bagging and gradient boosting algorithms are incorporated into the model creation process for daily streamflow prediction. This paper compares two tree-based ensembles (bagged regression trees, BRT, and gradient boosted regression trees, GBRT) and two artificial neural network ensembles (bagged artificial neural networks, BANN, and gradient boosted artificial neural networks, GBANN). The proposed ensembles are benchmarked against a conventional ANN (multilayer perceptron, MLP). The coefficient of determination, mean absolute error, and root mean squared error measures are used to evaluate prediction performance. The results obtained in this study indicate that ensemble learning models yield better prediction accuracy than a conventional ANN model. Moreover, ANN ensembles are superior to tree-based ensembles.
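A minimal sketch of the two tree-based ensembles (BRT and GBRT) with the three reported measures; a synthetic regression problem stands in for the daily streamflow data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "BRT (bagged trees)": BaggingRegressor(DecisionTreeRegressor(), n_estimators=100),
    "GBRT (boosted trees)": GradientBoostingRegressor(n_estimators=100),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print(f"{name}: R2={r2_score(y_te, pred):.3f}, "
          f"MAE={mean_absolute_error(y_te, pred):.2f}, RMSE={rmse:.2f}")
```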
The use of association rules extracted from daily geophysical measures allows for the detection of previously unknown connections between events, including emergency conditions. While these rules state that a given symbol occurs whenever a second one is present, their classification performance may vary on test data. We propose building strong classifiers out of simpler association rules; their use shows promising results in terms of accuracy.
In this paper, we compared the predictive capabilities of six different machine learning algorithms (linear regression, artificial neural network, random forest, extreme gradient boosting, light gradient boosting, and natural gradient boosting) and demonstrated that a hybrid light gradient boosting and natural gradient boosting model provides the most desirable construction cost estimates in terms of accuracy metrics, uncertainty estimates, and training speed. We also present a game theory-based model interpretation technique to evaluate the average marginal contribution of each feature value, across all possible combinations of features, to the model predictions. The comparison between the predicted cost and the actual cost confirms good alignment, with R² ≈ 0.99, RMSE ≈ 0.5, and MBE ≈ -0.009. In addition, the proposed hybrid model can provide uncertainty estimates through probabilistic predictions for real-valued outputs. This probabilistic prediction approach produces a holistic probability distribution over the entire outcome space to quantify the uncertainties related to construction cost predictions.
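A hedged sketch of the two ingredients described above, assuming the ngboost and shap packages: NGBoost returns a full predictive distribution per output, and SHAP computes the game-theoretic per-feature contributions. A toy dataset replaces the construction cost data, and this is not the paper's hybrid model itself.

```python
import shap
from lightgbm import LGBMRegressor
from ngboost import NGBRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)

# Probabilistic estimate: a mean and standard deviation per prediction
ngb = NGBRegressor(verbose=False).fit(X, y)
dist = ngb.pred_dist(X[:3])
print("means:", dist.loc, "std devs:", dist.scale)

# Shapley values: average marginal contribution of each feature
lgbm = LGBMRegressor().fit(X, y)
shap_values = shap.TreeExplainer(lgbm).shap_values(X)
print("contributions for first sample:", shap_values[0])
```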
Aging is one of the chief biomedical problems of the 21st century. After decades of basic research in biogerontology (the science of aging), the aging process still remains an enigma. Although hundreds of "theories" of aging have been formulated, and many fundamental insights have been gained into age-related changes and into genetic as well as environmental interventions that change the pace of aging, the actual why and how of aging remain enigmatic. In the post-genomic era there is an exponential increase in data, and it is a challenge to utilize all of this information and derive meaningful knowledge about biological phenomena from it. No individual scientist, group, or consortium can keep up even within their own field; all are overwhelmed by the explosion of data. Machine learning applied to biological data has the potential to solve this and to cause a paradigm shift from hypothesis-driven research (which predominates in biological research, including biogerontology) to data-driven research.
This dissertation addresses this problem. In particular, it proposes and executes the use of machine learning on existing data to predict drivers of aging (and therefore helps to distinguish causes from consequences), interventions to counteract aging, and specific hypotheses that fill research gaps requiring experimental validation.
The objective of this project is therefore to build computational models based on data relevant to the phenomenon of aging and to predict as many of its aspects and dimensions as possible (thus elucidating their relations to each other). For converting between and sorting within dimensions relevant to aging, different machine learning models are evaluated. Once models are built, it can be determined how much they explain different aspects of aging. Those models will also be capable of specifying which features are most relevant for prediction (in both classification and regression). It is possible to train models that incorporate age-related changes based on transcriptomic, proteomic, metabolomic, epigenomic, and morphological data and their combinations. Machine learning is further used to convert between and within them.
This work focuses on three types of predictors, with which discoveries are subsequently made. The first model (lifespan predictor) is trained to predict lifespan based on genotype, environment, and combinations thereof; it is useful for predicting lifespan-extending interventions at the population level. The second model (age predictor) is trained to predict age given features measured on individuals; this is useful for identifying biomarkers of aging and for determining the effects of interventions at the level of individuals. The third model predicts functions/regulations of biological entities with regard to the aging process based on heterogeneous data such as ontologies, diverse omics including time-series gene expression profiles (which can be visualized as plots), and linked data. It is used to understand the role of genes and proteins, and perhaps of other entities such as small molecules including lipids and other metabolites. Functions of proteins that are still unknown, especially those involved in yeast lipid metabolism and its regulation, can be predicted.
For this purpose we primarily use yeast as a model organism, as well as data on humans. Other biomedical model organisms might be added if found beneficial.
The novel aspects of this research are, for instance, that 1) aging is investigated systematically in an unbiased, data-driven approach, 2) lifespan is predicted as continuous values, 3) age is predicted by combining multiple omics data, and 4) functions and regulations of biological entities like genes are predicted with high confidence from heterogeneous data sources.
This thesis discovered that genetics is the most important feature for lifespan determination. According to the best-performing models, phenotypic features related to lipids and membranes, such as vacuolar morphology and autophagy activity, are important for lifespan determination. An age predictor based on transcriptomics and proteomics can determine age highly accurately; its selected features are associated with both translation and lipid metabolism. Among the top selected features are transcripts of genes whose deletion results in abnormal vacuolar morphology, as well as targets of Opi1. Opi1 itself and its regulators were found to be differentially regulated post-transcriptionally or post-translationally. Lastly, a function predictor for genes was created that achieved exceptional accuracy in classifying aging genes. It learned, for instance, that piecemeal autophagy of the nucleus is strongly predictive for aging-suppressor genes, while cytoplasmic translation is strongly predictive for gerontogenes.
This paper examines the role and efficiency of non-convex loss functions for binary classification problems. In particular, we investigate how to design a simple and effective boosting algorithm that is robust to outliers in the data. The effect of a particular non-convex loss on prediction accuracy depends on the diminishing tail properties of the gradient of the loss (the ability of the loss to adapt efficiently to outlying data), the local convexity properties of the loss, and the proportion of contaminated data. In order to use these properties efficiently, we propose a new family of non-convex losses named γ-robust losses. Moreover, we present a new boosting framework, Arch Boost, designed to augment existing work such that the corresponding classification algorithms are significantly more adaptable to unknown data contamination. Together with the Arch Boost framework, the non-convex losses lead to a new class of boosting algorithms, named adaptive robust boosting (ARB). Furthermore, we present theoretical examples that demonstrate the robustness properties of the proposed algorithms. In particular, we develop a new breakdown point analysis and a new influence function analysis that demonstrate gains in robustness. Moreover, we present new theoretical results, based only on local curvatures, which may be used to establish statistical and optimization properties of the proposed Arch Boosting algorithms with highly non-convex loss functions. Extensive numerical calculations are used to illustrate these theoretical properties and reveal advantages over existing boosting methods when the data exhibit a number of outliers.
Stochastic time series analysis of high-frequency stock market data is a very challenging task for analysts due to the lack of efficient tools and techniques for big data analytics. This has opened the door of opportunity for developers and researchers to build intelligent and machine learning based tools and techniques for data analytics. This paper proposes an ensemble for stock market data prediction using three of the most prominent machine learning techniques. The stock market dataset has a raw size of 39,364 KB with all attributes and a processed size of 11,826 KB, comprising 872,435 instances. The proposed work implements an ensemble model comprising Deep Learning, Gradient Boosting Machine (GBM), and distributed Random Forest techniques. The performance of the ensemble model is compared with each of the individual methods, i.e. deep learning, Gradient Boosting Machine (GBM), and Random Forest. The ensemble model performs better a...
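The abstract does not name the software stack, so the sketch below uses scikit-learn stand-ins for the three components (an MLP for deep learning, gradient boosting, and a random forest) combined by simple averaging; it is illustrative, not the paper's implementation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=2000, random_state=0)

ensemble = VotingRegressor([
    ("deep_learning", make_pipeline(StandardScaler(),
                                    MLPRegressor(hidden_layer_sizes=(64, 32),
                                                 max_iter=500))),
    ("gbm", GradientBoostingRegressor()),
    ("random_forest", RandomForestRegressor()),
])
ensemble.fit(X, y)
print("ensemble R^2:", ensemble.score(X, y))
```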
Physical activity is essential for physical and mental health, and its absence is highly associated with severe health conditions and disorders. Therefore, tracking activities of daily living can help promote quality of life. Wearable sensors in this regard can provide a reliable and economical means of tracking such activities, and such sensors are readily available in smartphones and watches. This study is the first of its kind to develop a wearable sensor-based physical activity classification system using a special class of supervised machine learning approaches called boosting algorithms. The study presents a performance analysis of several boosting algorithms (extreme gradient boosting [XGB], light gradient boosting machine [LGBM], gradient boosting [GB], cat boosting [CB] and AdaBoost) in a fair and unbiased manner, using a uniform dataset, feature set, feature selection method, performance metric and cross-validation technique. The study utilizes the Smartphone-based datase...
This work describes investigations and results obtained using different gradient boosting algorithms (e.g., XGBoost, LightGBM, CatBoost), feature engineering techniques (e.g., Categorical Encoding, Boruta-SHAP, Pseudo-Labels), and other...
This paper aims to explore models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimization are simultaneously considered during model training. The five most commonly used FS methods, including weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information, are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and the Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effects of the different FS and hyper-parameter optimization methods on model performance are investigated with the Wilcoxon Signed Rank Test. The performance of XGBoost is compared to the traditionally utilized logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from 10-fold cross validation. Results show that hierarchical clustering is the optimal FS method for LR, while weight by Chi-square achieves the best performance in XGBoost. Both TPE and RS optimization in XGBoost outperform LR significantly. TPE optimization shows superiority over RS, since it results in a significantly higher accuracy and a marginally higher AUC, recall, and F1 score. Furthermore, XGBoost with TPE tuning shows lower variability than the RS method. Finally, the ranking of feature importance based on XGBoost enhances the model interpretation. Therefore, XGBoost with Bayesian TPE hyper-parameter optimization serves as an operative yet powerful approach for business risk modeling.
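A minimal sketch of the two tuning strategies, using the hyperopt package (one common TPE implementation; the paper's exact setup may differ) to tune two XGBoost hyper-parameters by cross-validated accuracy:

```python
from hyperopt import fmin, hp, rand, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, random_state=0)

space = {
    "max_depth": hp.choice("max_depth", [3, 4, 5, 6]),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
}

def objective(params):
    model = XGBClassifier(n_estimators=100, **params)
    return -cross_val_score(model, X, y, cv=5).mean()  # minimize negative accuracy

# Same search space and budget for both algorithms, as in a fair comparison
for name, algo in [("random search", rand.suggest), ("TPE", tpe.suggest)]:
    best = fmin(objective, space, algo=algo, max_evals=20)
    print(name, "best:", best)
```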
In this thesis we provide a unifying framework for two decades of work in an area of Machine Learning known as cost-sensitive Boosting algorithms. This area is concerned with the fact that most real-world prediction problems are asymmetric, in the sense that different types of errors incur different costs. Adaptive Boosting (AdaBoost) is one of the most well-studied and utilised algorithms in the field of Machine Learning, with a rich theoretical depth as well as practical uptake across numerous industries. However, its inability to handle asymmetric tasks has been the subject of much criticism. As a result, numerous cost-sensitive modifications of the original algorithm have been proposed. Each of these has its own motivations, and its own claims to superiority. With a thorough analysis of the literature from 1997 to 2016, we find 15 distinct cost-sensitive Boosting variants, discounting minor variations. We critique the literature using four powerful theoretical frameworks: Bayesian decision theory, the functional gradient descent view, margin theory, and probabilistic modelling. From each framework, we derive a set of properties which must be obeyed by boosting algorithms. We find that only 3 of the published AdaBoost variants are consistent with the rules of all the frameworks, and even they require their outputs to be calibrated to achieve this. Experiments on 18 datasets, across 21 degrees of cost asymmetry, all support the hypothesis, showing that once calibrated, the three variants perform equivalently, outperforming all others. Our final recommendation, based on theoretical soundness, simplicity, flexibility and performance, is to use the original AdaBoost algorithm, albeit with a shifted decision threshold and calibrated probability estimates. The conclusion is that novel cost-sensitive boosting algorithms are unnecessary if proper calibration is applied to the original algorithm.
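The final recommendation translates into very little code: calibrate AdaBoost's scores, then shift the decision threshold to the Bayes-optimal value for the given cost ratio. A minimal sketch with hypothetical costs:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Calibrate AdaBoost's margin scores into proper probability estimates
clf = CalibratedClassifierCV(AdaBoostClassifier(), cv=5).fit(X_tr, y_tr)

# Bayes decision theory: predict positive when p >= c_fp / (c_fp + c_fn)
c_fp, c_fn = 1.0, 5.0  # hypothetical costs of false positives / negatives
threshold = c_fp / (c_fp + c_fn)
y_pred = (clf.predict_proba(X_te)[:, 1] >= threshold).astype(int)
```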
Probability estimates generated by boosting ensembles are poorly calibrated because of the margin-maximizing nature of the algorithm. The outputs of the ensemble need to be properly calibrated before they can be used as probability estimates. In this work, we demonstrate that online boosting is also prone to producing distorted probability estimates. In batch learning, calibration is achieved by reserving part of the training data for training the calibrator function. In the online setting, a decision needs to be made on each round: shall the new example(s) be used to update the parameters of the ensemble or those of the calibrator? We proceed to resolve this decision with the aid of bandit optimization algorithms. We demonstrate superior performance over uncalibrated and naively-calibrated online boosting ensembles in terms of probability estimation. Our proposed mechanism can be easily adapted to other tasks (e.g. cost-sensitive classification) and is robust to the choice of hyper...
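A toy sketch of the per-round decision described above, under loose assumptions: a linear online learner stands in for the boosting ensemble, a one-parameter-pair sigmoid (Platt-style) for the calibrator, and an epsilon-greedy bandit picks which of the two each new example should update, rewarded by the log-loss improvement it produces.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

def calibrated_loss(s, yi, a, b):
    p = 1 / (1 + np.exp(-(a * s + b)))  # calibrated probability
    return -(yi * np.log(p + 1e-12) + (1 - yi) * np.log(1 - p + 1e-12))

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=3000, random_state=0)

ensemble = SGDClassifier(loss="log_loss")  # stand-in for online boosting
ensemble.partial_fit(X[:10], y[:10], classes=[0, 1])
a, b = 1.0, 0.0                            # calibrator parameters
eps, reward, counts = 0.1, np.zeros(2), np.ones(2)

for xi, yi in zip(X[10:], y[10:]):
    s = ensemble.decision_function([xi])[0]
    before = calibrated_loss(s, yi, a, b)
    arm = rng.integers(2) if rng.random() < eps else int(np.argmax(reward / counts))
    if arm == 0:                           # spend the example on the ensemble
        ensemble.partial_fit([xi], [yi])
        s = ensemble.decision_function([xi])[0]
    else:                                  # spend it on the calibrator
        p = 1 / (1 + np.exp(-(a * s + b)))
        a -= 0.1 * (p - yi) * s            # SGD step on the log-loss
        b -= 0.1 * (p - yi)
    reward[arm] += before - calibrated_loss(s, yi, a, b)  # observed improvement
    counts[arm] += 1
```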
Frontotemporal dementia (FTD) is a heterogeneous neurodegenerative disorder characterized by frontal and temporal lobe atrophy, typically manifesting with behavioural or language impairment. Because of its heterogeneity and the lack of available diagnostic laboratory tests, there can be a substantial delay in diagnosis. Cell-free, circulating microRNAs are increasingly investigated as biomarkers for neurodegeneration, but their value in FTD is not yet established. In this study, we investigate microRNAs as biomarkers for FTD diagnosis. We performed next generation small RNA sequencing on cell-free plasma from 52 FTD cases and 21 controls. The analysis revealed the diagnostic importance of 20 circulating endogenous miRNAs in distinguishing FTD cases from controls. The study was repeated in an independent second cohort of 117 FTD cases and 35 controls. The combinatorial microRNA signature from the first cohort precisely diagnosed FTD samples in the second cohort. To further increase the generalizability of the prediction, we implemented machine learning techniques on a merged dataset of the two cohorts, which resulted in comparable or improved classification precision with a smaller panel of miRNA classifiers. In addition, there are intriguing molecular commonalities with the cell-free miRNA signature in ALS, a motor neuron disease that resides on a pathological continuum with FTD. However, the signature that describes the ALS-FTD spectrum is not shared with blood miRNA profiles of patients with multiple sclerosis. Thus, microRNAs are promising FTD biomarkers that might enable earlier detection of FTD and improve the accurate identification of patients for clinical trials.
The RAND Database of Worldwide Terrorism Incidents (RDWTI) seeks to index information about all terrorist incidents that occur and are mentioned in worldwide news media, providing a useful resource for policy researchers and decision makers. We examined automated classification methods that could be used to identify news articles about terrorist incidents, thus enabling analysts to read a smaller number of news articles and maintain the database with less effort and cost. The support vector machine (SVM) and Lasso methods were only modestly successful, but a classifier based on the gradient boosting method (GBM) appeared to be very successful, correctly ranking 80% of the relevant articles at the “top of the pile” for examination by a human analyst.
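A minimal sketch in the spirit of that classifier: TF-IDF text features feed a gradient boosting model whose predicted probabilities rank articles for human review. The toy articles below are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "bomb attack reported in crowded market",
    "armed group claims responsibility for blast",
    "quarterly earnings beat analyst forecasts",
    "city council approves new park budget",
]
labels = [1, 1, 0, 0]  # 1 = describes a terrorist incident

vec = TfidfVectorizer()
X = vec.fit_transform(articles).toarray()  # GBM expects a dense array
gbm = GradientBoostingClassifier().fit(X, labels)

scores = gbm.predict_proba(X)[:, 1]        # relevance scores
ranking = np.argsort(scores)[::-1]         # "top of the pile" first
print([articles[i] for i in ranking])
```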
A healthcare monitoring system needs the support of recent technologies such as artificial intelligence (AI), machine learning (ML), and big data, especially during the COVID-19 pandemic. This global pandemic has already taken millions of lives. Both infected and uninfected people have generated big data which AI and ML can use to combat and detect COVID-19 at an early stage. Motivated by this, an improved ML framework for the early detection of this disease is proposed in this paper. The state-of-the-art Harris hawks optimization (HHO) algorithm with an improved objective function is proposed and applied to optimize the hyperparameters of the ML algorithms, namely HHO-based eXtreme gradient boosting (HHOXGB), light gradient boosting (HHOLGB), categorical boosting (HHOCAT), random forest (HHORF) and support vector classifier (HHOSVC). An ensemble technique was applied to these optimized ML models to improve the prediction performance. Our proposed method was applied to publicly avai...
Gradient boosting is a machine learning method that builds one strong classifier from many weak classifiers. In this work, an algorithm based on gradient boosting is presented that detects event-related potentials in single electroencephalogram (EEG) trials. The algorithm is used to detect the P300 in the human EEG and to build a brain-computer interface (BCI), specifically a spelling device. Important features of the method described here are its high classification accuracy and its conceptual simplicity. The algorithm was ...
The basic objective of the current project is to analyse the arrival delays of flights using data mining and supervised machine learning algorithms: Random Forest, Support Vector Machine (SVM), Linear Regression, Bagging Trees, AdaBoost and Gradient Boosting Classifier (GBC), and to compare their performances to obtain the best performing classifier. Next, the best performing classifier is used to predict whether a flight will be delayed well before it is announced on the boards.