Sven Serneels - Academia.edu
Papers by Sven Serneels
arXiv (Cornell University), Feb 9, 2023
Non-fungible tokens (NFT) have recently emerged as a novel blockchain-hosted financial asset class that has attracted major transaction volumes. Investment decisions rely on data, and on adequate preprocessing and analytics applied to them. Owing both to the non-fungible nature of the tokens and to a blockchain being the primary data source, NFT transaction data pose several challenges not commonly encountered in traditional financial data. Using data that consist of the transaction history of eight highly valued NFT collections, a selection of such challenges is illustrated: price differentiation by token traits, the possible existence of lateral swaps and wash trades in the transaction history, and severe volatility. While this paper merely scratches the surface of how data analytics can be applied in this context, the data and challenges laid out here may present opportunities for future research on the topic.
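One of the challenges named above, lateral swaps, can be sketched in a few lines: a token that changes hands in both directions between the same pair of wallets is a candidate for closer inspection. The data and field names below are hypothetical illustrations, not the schema or detection rule used in the paper.

```python
# Sketch: flagging potential lateral swaps (a token moving back and forth
# between the same pair of wallets) in an NFT transaction history.
from collections import defaultdict

def find_lateral_swaps(transactions):
    """Return (token_id, wallet_pair) keys traded in both directions."""
    directed = defaultdict(set)
    for tx in transactions:
        pair = frozenset((tx["seller"], tx["buyer"]))
        directed[(tx["token_id"], pair)].add((tx["seller"], tx["buyer"]))
    # A key observed with two distinct (seller, buyer) orderings means the
    # same token travelled in both directions between the same two wallets.
    return [key for key, dirs in directed.items() if len(dirs) > 1]

txs = [
    {"token_id": 1, "seller": "0xA", "buyer": "0xB"},
    {"token_id": 1, "seller": "0xB", "buyer": "0xA"},  # back again: suspicious
    {"token_id": 2, "seller": "0xA", "buyer": "0xC"},
]
print(find_lateral_swaps(txs))  # one flagged pair: token 1 between 0xA and 0xB
```

A real analysis would of course also weigh prices and timing; this only shows how the transaction graph exposes the pattern.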
Computer Aided Chemical Engineering
SoftwareX
The direpack package aims to establish a set of modern statistical dimension reduction techniques in the Python universe as a single, consistent package. The dimension reduction methods included fall into three categories: projection pursuit based dimension reduction, sufficient dimension reduction, and robust M estimators for dimension reduction. As a corollary, regularized regression estimators based on these reduced dimension spaces are provided as well, ranging from classical principal component regression up to sparse partial robust M regression. The package also contains a set of classical and robust preprocessing utilities, including generalized spatial signs, as well as dedicated plotting and cross-validation functionality. Finally, direpack has been written consistently with the scikit-learn API, such that the estimators can seamlessly be included in (statistical and/or machine) learning pipelines in that framework.
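The classical end of the range mentioned above, principal component regression, can be written down compactly: project the centered predictors onto their leading principal components and regress the response on the scores. This is a minimal NumPy sketch of that classical method, not direpack's implementation.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Classical principal component regression: regress y on the scores of
    the leading principal components of X, then map the coefficients back
    to the original variables."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Principal directions via SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T              # loadings, shape (p, k)
    T = Xc @ V                           # scores,   shape (n, k)
    gamma, *_ = np.linalg.lstsq(T, yc, rcond=None)
    beta = V @ gamma                     # coefficients in original variables
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0])
y = X @ beta_true + 0.01 * rng.normal(size=100)
beta = pcr_fit(X, y, n_components=5)
print(np.round(beta, 2))  # close to beta_true when all components are kept
```

With all components retained, PCR coincides with ordinary least squares; the interest lies in choosing fewer components, which the package's cross-validation utilities are meant to support.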
Advances in Data Analysis and Classification, 2021
As its name suggests, sufficient dimension reduction (SDR) aims to estimate from data a subspace that contains all information sufficient to explain a dependent variable. Ample approaches to SDR exist, some of the most recent of which rely on minimal to no model assumptions. These are defined through an optimization criterion that maximizes a nonparametric measure of association. The original estimators are nonsparse, which means that all variables contribute to the model. However, many practical applications call for an SDR technique that is sparse and as such intrinsically performs sufficient variable selection (SVS). This paper examines how such a sparse SDR estimator can be constructed. Three variants are investigated, each based on a different measure of association: distance covariance, martingale difference divergence, and ball covariance. A simulation study shows that each of these estimators can achieve correct variable selection in highly nonlinear contexts, yet they are sensitive to outliers and computationally intensive. The study sheds light on the subtle differences between the methods. Two examples illustrate how these new estimators can be applied in practice, with a slight preference for the option based on martingale difference divergence in the bioinformatics example.
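Of the three association measures named above, distance covariance is the easiest to write down: double-center the pairwise distance matrices of the two samples and average their elementwise product. The sketch below implements the squared sample version (Székely et al.) in plain NumPy, as an illustration of the kind of nonparametric criterion these SDR estimators maximize.

```python
import numpy as np

def dcov2(x, y):
    """Squared sample distance covariance: the mean of the elementwise
    product of the double-centered pairwise distance matrices."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    a = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    b = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
    # Double centering: subtract row and column means, add the grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(1)
x = rng.normal(size=200)
# A purely nonlinear link (zero Pearson correlation) still registers:
print(dcov2(x, x**2) > dcov2(x, rng.normal(size=200)))
```

This ability to detect dependence beyond linear correlation is exactly what lets the resulting SDR estimators succeed in highly nonlinear contexts.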
Computational Statistics & Data Analysis, 2020
The cellwise robust M regression (CRM) estimator is introduced as the first estimator of its kind that intrinsically yields both a map of cellwise outliers consistent with the linear model and a vector of regression coefficients that is robust against vertical outliers and leverage points. As a by-product, the method yields a weighted and imputed data set that contains estimates of what the values in cellwise outliers would need to be if they had fit the model. The method is shown to be as robust as its casewise counterpart, MM regression, while discarding less information than any casewise robust estimator. Its predictive power can therefore be expected to be at least as good as that of casewise alternatives. These results are corroborated in a simulation study. Moreover, while the simulations show that predictive performance is at least on par with casewise methods, if not better, an application to a data set of Swiss nutrient compositions shows that in individual cases CRM can achieve much higher predictive accuracy than MM regression.
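The cellwise idea itself, flagging individual cells rather than whole cases, can be illustrated with a deliberately naive device: a robust z-score per cell based on columnwise median and MAD. This is only a toy stand-in; CRM derives its outlier map from the regression model, not from marginal z-scores.

```python
import numpy as np

def cellwise_flags(X, cutoff=3.0):
    """Toy cellwise outlier map: flag individual cells whose robust z-score
    (median/MAD per column) exceeds a cutoff. Illustrates the cellwise
    paradigm only; CRM's actual map is model-based."""
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)  # consistency factor
    return np.abs(X - med) / mad > cutoff

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 4))
X[3, 1] = 25.0                      # one contaminated cell, not a whole case
flags = cellwise_flags(X)
print(flags[3])                     # the contaminated cell is flagged
```

A casewise estimator would downweight all of case 3; the cellwise map isolates the single bad entry, which is why less information is discarded.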
Statistics and Computing, 2018
Outlier detection is an inevitable step in most statistical data analyses. However, merely detecting an outlying case does not always answer all scientific questions associated with that data point. Outlier detection techniques, classical and robust alike, will typically flag the entire case as outlying, or attribute a specific case weight to the entire case. In practice, particularly in high-dimensional data, the outlier will most likely not be outlying along all of its variables, but only along a subset of them. If so, the scientific question of why the case has been flagged as an outlier becomes of interest. In this article, a fast and efficient method is proposed to detect the variables that contribute most to an outlier's outlyingness, thereby helping the analyst understand why an outlier lies out. The approach pursued in this work is to estimate the univariate direction of maximal outlyingness. It is shown that the problem of estimating that direction can be rewritten as the normed solution of a classical least squares regression problem. Identifying the subset of variables contributing most to outlyingness can thus be achieved by solving the associated least squares problem in a sparse manner. From a practical perspective, sparse partial least squares (SPLS) regression, preferably via the fast sparse NIPALS (SNIPLS) algorithm, is suggested to tackle that problem. The proposed methodology is shown to perform well both on simulated data and in real-life examples.
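In the classical, non-sparse case the direction of maximal outlyingness has a closed form: for an outlier x0 it is proportional to Σ⁻¹(x0 − μ), normalized to unit length. The sketch below computes that classical direction in NumPy; the paper's contribution is to estimate it sparsely (via SPLS/SNIPLS) and robustly, which this toy version does not attempt.

```python
import numpy as np

def max_outlyingness_direction(X, x0):
    """Classical direction along which x0 is most outlying relative to X:
    v proportional to inv(Sigma) @ (x0 - mu), normed. The paper replaces this
    dense solution by a sparse least squares estimate."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    v = np.linalg.solve(Sigma, x0 - mu)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
x0 = np.array([0.0, 6.0, 0.0, 0.0])   # outlying only in the second variable
v = max_outlyingness_direction(X, x0)
print(np.argmax(np.abs(v)))  # → 1: the second coordinate dominates v
```

A sparse estimate of v would set the other coordinates exactly to zero, directly naming the variables responsible for the outlyingness.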
Journal of Futures Markets
Analytica Chimica Acta, Jan 9, 2007
The aim of this study is to show the usefulness of robust multiple regression techniques implemented in the expectation maximization framework to successfully model data containing missing elements and outlying objects. In particular, results from a comparative study of partial least squares and partial robust M-regression models implemented in the expectation maximization algorithm are presented. The performance of the proposed approaches is illustrated on simulated data with and without outliers, containing different percentages of missing elements, and on a real data set. The obtained results suggest that the proposed methodology can be used to construct satisfactory regression models in terms of their trimmed root mean squared errors.
Journal of Chemical Information and Modeling, May 1, 2006
The spatial sign is a multivariate extension of the concept of sign. Recently, multivariate estimators of covariance structures based on spatial signs have been examined by various authors. These new estimators are found to be robust to outlying observations. From a computational point of view, estimators based on spatial signs are very easy to implement, as they boil down to a transformation of the data to their spatial signs, from which the classical estimator is then computed. Hence, one can also consider the transformation to spatial signs a preprocessing technique, which ensures that the calibration procedure as a whole is robust. In this paper, we examine the special case of spatial sign preprocessing in combination with partial least squares regression, as the latter technique is frequently applied in the context of chemical data analysis. In a simulation study, we compare the performance of the spatial sign transformation to nontransformed data as well as to two robust counterparts of partial least squares regression. It turns out that the spatial sign transform is fairly efficient but has some undesirable bias properties. The method is applied to a recently published data set in the field of quantitative structure-activity relationships, where it performs as well as the previously described best linear model for these data.
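The transformation described above is indeed a one-liner: center each observation and scale it to unit Euclidean norm, so that gross outliers are pulled onto the unit sphere before the classical estimator is computed. This sketch uses the columnwise median as a simple robust location estimate, which is an assumption of the sketch rather than the paper's prescribed centering.

```python
import numpy as np

def spatial_sign(X, center=None):
    """Spatial sign preprocessing: center each observation, then scale it to
    unit Euclidean norm. The classical estimator (e.g. PLS) is afterwards
    computed on the transformed data. Median centering is an assumption of
    this sketch."""
    X = np.asarray(X, dtype=float)
    if center is None:
        center = np.median(X, axis=0)
    Z = X - center
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0           # leave points exactly at the center as-is
    return Z / norms

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
X[0] = [100.0, 100.0, 100.0]          # gross outlier
S = spatial_sign(X)
print(np.allclose(np.linalg.norm(S, axis=1), 1.0))  # every row has unit norm
```

After the transform, the outlier in row 0 carries the same norm as every other observation, which is what makes the subsequent calibration robust, at the cost of the bias properties discussed in the paper.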
Journal of Chemometrics, 2016
In this paper, we compute the influence function for partial least squares regression. To that end, we design two alternative algorithms, according to the PLS algorithm used: one computation of the influence function is based on the Helland PLS algorithm, whilst the other is compatible with SIMPLS. The calculation of the influence function leads to new influence diagnostic plots for PLS. An alternative to the well-known Cook distance plot is proposed, as well as a sample-specific variant. Moreover, a novel estimate of prediction variance is deduced, the validity of which is corroborated by a Monte Carlo simulation.
SSRN Electronic Journal, 2000
Sparse partial robust M regression is introduced as a new regression method. It is the first dimension reduction and regression algorithm that yields sparse estimates with an interpretability akin to partial least squares, robust with respect to both vertical outliers and leverage points. A simulation study underpins these claims. Real data examples illustrate the validity of the approach.
Studies in Classification, Data Analysis, and Knowledge Organization, 2006
The PLS approach is a widely used technique to estimate path models relating various blocks of variables measured on the same population. It is frequently applied in the social sciences and in economics. In such applications, deviations from normality and outliers may occur, leading to a loss of efficiency or even biased results. In the current paper, a robust path model estimation technique is proposed: the partial robust M (PRM) approach. Its benefits are illustrated in an example.
Studies in Classification, Data Analysis, and Knowledge Organization, 2006
Projection pursuit was originally introduced to identify structures in multivariate data clouds (Huber, 1985). The idea of projecting data onto a low-dimensional subspace can also be applied to multivariate statistical methods. Robustness can then be achieved by applying robust estimators in the lower-dimensional space. Robust estimation in high dimensions can thus be avoided, which usually results in faster computation. Moreover, flat data sets, where the number of variables is much higher than the number of observations, can more easily be analyzed in a robust way. We focus on the projection pursuit approach to robust continuum regression (Serneels et al., 2005). A new algorithm is introduced and compared with the reference algorithm as well as with classical continuum regression.
Comprehensive Chemometrics, 2009
This chapter presents an introduction to robust statistics with applications of a chemometric nature. Following a description of the basic ideas and concepts behind robust statistics, including how robust estimators can be conceived, the chapter builds up to the construction (and use) of robust alternatives to some multivariate analysis methods frequently used in chemometrics, such as principal component analysis and partial least squares. The chapter then provides insight into how these robust methods can be used or extended for classification. To conclude, the issue of validation of the results is addressed: it is shown how uncertainty statements associated with robust estimates can be obtained.