Μιχαήλ Τσαγρής | University of Crete

Papers by Μιχαήλ Τσαγρής

Progress

Conditional Independence Test for Categorical Data Using Poisson Log-Linear Model

Journal of Data Science, 2021

We demonstrate how to test for conditional independence of two variables with categorical data using Poisson log-linear models. The size of the conditioning set of variables can vary from 0 (simple independence) up to many variables. We also provide a function in R for performing the test. Instead of calculating all possible tables with a for loop, we perform the test using the log-linear models, thus speeding up the process. Time comparison simulation studies are presented.
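
For intuition, the computational trick can be illustrated with two nested Poisson log-linear models fitted on the flattened contingency table; the sketch below is mine, not the paper's R function, and handles a single conditioning variable.

```r
# Hedged sketch: test x independent of y given z, for factors, via nested
# Poisson log-linear models; ci.test() and all names are illustrative.
ci.test <- function(x, y, z) {
  tab <- as.data.frame(table(x, y, z))
  m0 <- glm(Freq ~ x * z + y * z, data = tab, family = poisson)        # [xz][yz]
  m1 <- glm(Freq ~ x * z + y * z + x:y, data = tab, family = poisson)  # adds [xy]
  stat <- m0$deviance - m1$deviance            # likelihood-ratio statistic
  dof  <- m0$df.residual - m1$df.residual
  c(stat = stat, pvalue = pchisq(stat, dof, lower.tail = FALSE))
}

set.seed(1)
z <- factor(sample(1:3, 500, replace = TRUE))
x <- factor(sample(1:2, 500, replace = TRUE))
y <- factor(sample(1:4, 500, replace = TRUE))
ci.test(x, y, z)  # x and y are independent given z by construction
```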

Forward Regression in R: From The Extreme Slow to the Extreme Fast

Journal of Data Science, 2021

Forward regression has been criticised heavily, among other reasons for its speed and its stopping criteria. The main focus of this paper is on demonstrating how to make it efficient using R. Our method works for continuous predictor variables only, as the use of the partial correlation plays the most important role.
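
For intuition, a naive residual-based version of such a forward search is sketched below; the paper's fast implementation avoids refitting models at every step, so treat this only as an illustration of the partial-correlation criterion (all names are mine).

```r
# Minimal sketch of forward selection driven by partial correlation,
# tested with the Fisher z-transform; not the authors' optimized code.
forward.pcor <- function(y, x, alpha = 0.05) {
  n <- length(y); sel <- integer(0); cand <- seq_len(ncol(x))
  repeat {
    ry <- if (length(sel)) resid(lm(y ~ x[, sel])) else y - mean(y)
    pc <- sapply(cand, function(j) {   # partial correlation given selected set
      rj <- if (length(sel)) resid(lm(x[, j] ~ x[, sel])) else x[, j] - mean(x[, j])
      cor(ry, rj)
    })
    j  <- which.max(abs(pc))
    zt <- atanh(pc[j]) * sqrt(n - length(sel) - 3)   # Fisher z statistic
    if (2 * pnorm(-abs(zt)) > alpha) break           # best candidate not significant
    sel <- c(sel, cand[j]); cand <- cand[-j]
    if (!length(cand)) break
  }
  sel
}

set.seed(1)
x <- matrix(rnorm(200 * 20), 200, 20)
y <- x[, 2] + 0.5 * x[, 5] + rnorm(200)
forward.pcor(y, x)  # typically selects columns 2 and 5
```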

Extremely efficient permutation and bootstrap hypothesis tests using R

Re-sampling based statistical tests are known to be computationally heavy, but reliable when small sample sizes are available. Despite their nice theoretical properties, not much effort has been put into making them efficient. In this paper we treat the cases of the Pearson correlation coefficient and the two independent samples t-test. We propose a highly computationally efficient method for calculating permutation based p-values in these two cases. The method is general and can be applied or adapted to other similar two-sample mean or two-mean-vector cases.
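
The flavour of the speed-up can be conveyed by computing all permuted correlations with one matrix multiplication instead of a loop; this is a simplified sketch of the general idea, not the paper's code.

```r
# Permutation p-value for the Pearson correlation, loop-free.
perm.cor <- function(x, y, R = 1999) {
  n  <- length(x)
  xs <- scale(x); ys <- scale(y)
  r0 <- sum(xs * ys) / (n - 1)                       # observed correlation
  P  <- replicate(R, sample(n))                      # n x R permutation indices
  rp <- as.vector(crossprod(matrix(ys[P], n), xs)) / (n - 1)  # all permuted r's
  (sum(abs(rp) >= abs(r0)) + 1) / (R + 1)
}

set.seed(1)
x <- rnorm(50); y <- 0.3 * x + rnorm(50)
perm.cor(x, y)
```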

Hypothesis testing procedures for two sample means with applications to gene expression data

In Bioinformatics, the number of available variables for a few tens of subjects is usually in the order of tens of thousands. An example is gene expression data, where usually two groups of subjects exist: cases and controls, or subjects with and without a disease. The detection of differentially expressed genes between the two groups takes place using many two independent samples (Welch) t-tests, one test for each variable (probeset). Motivated by this, the present research examines the empirical and exponential empirical likelihood, asymptotically, and provides some useful results revealing their relationship with the James and Welch t-tests. By exploiting this relationship, a simple calibration based on the t distribution, applicable to both techniques, is proposed. Then, this calibration is compared to the classical Welch t-test. A third, more famous, non-parametric test subject to comparison is the Wilcoxon-Mann-Whitney test. As an extra step, bootstrap calibration of the aforementioned tests is performed and the exact p-value of the Wilcoxon-Mann-Whitney test is computed. The main goal is to examine the size and power behaviour of these testing procedures when applied to small to medium sized datasets. Based on extensive simulation studies we provide strong evidence for the Welch t-test. We show, numerically, that the Welch t-test has the same power abilities as all the other testing procedures. It outperforms them, though, in terms of attaining the type I error. Further, it is computationally extremely efficient.

Guide on performing feature selection with the R package MXM

MXM is a flexible R package offering many feature selection algorithms for predictive or diagnostic models. MXM has a unique advantage over other packages: it allows for a long and comprehensive list of types of target variable to be modeled: continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts and left censored, to name a few. In this paper we briefly discuss some of these algorithms, the different types of data they can handle, and the relevant functions along with their input and output arguments.
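
A typical call might look like the sketch below; argument and slot names follow the package documentation as I recall it, so check the manual before relying on them.

```r
# Usage sketch for MXM's MMPC and SES on simulated data (names from memory).
library(MXM)
set.seed(1)
x <- matrix(rnorm(200 * 50), 200, 50)
y <- x[, 3] - 2 * x[, 7] + rnorm(200)
m1 <- MMPC(y, x, max_k = 3, threshold = 0.05, test = "testIndFisher")
m1@selectedVars   # indices of the selected features
m2 <- SES(y, x, max_k = 3, threshold = 0.05, test = "testIndFisher")
m2@signatures     # statistically equivalent sets of features
```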

Bootstrap for Value at Risk Prediction

We evaluate the predictive performance of a variety of value-at-risk (VaR) models for a portfolio consisting of five assets. Traditional VaR models, such as historical simulation with bootstrap and filtered historical simulation methods, are considered. We suggest a new method for estimating Value at Risk: the filtered historical simulation GJR-GARCH method based on bootstrapping the standardized GJR-GARCH residuals. The predictive performance is evaluated in terms of three criteria: the tests of unconditional coverage, independence and conditional coverage, and a quadratic loss function. The results show that classical methods are inefficient under moderate departures from normality and that the new method produces the most accurate forecasts of extreme losses.
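
To fix ideas, plain historical simulation with a bootstrap confidence interval looks like the sketch below; the GJR-GARCH filtering step that the paper bootstraps is omitted here, so this is only the baseline ingredient.

```r
# Historical-simulation VaR with a bootstrap interval (simplified baseline).
hs.var <- function(returns, p = 0.99, B = 2000) {
  v0 <- -quantile(returns, 1 - p, names = FALSE)     # point estimate of VaR
  vb <- replicate(B, -quantile(sample(returns, replace = TRUE), 1 - p))
  c(VaR = v0, quantile(vb, c(0.025, 0.975)))         # bootstrap 95% interval
}

set.seed(42)
hs.var(rnorm(1000, 0, 0.01))   # toy daily returns
```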

The k-NN algorithm for compositional data: a revised approach with and without zero values present

arXiv: Methodology, 2014

In compositional data, an observation is a vector with non-negative components which sum to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, biology, economics and political science, among others. The goal of this paper is to extend the taxicab metric and a newly suggested metric for compositional data by employing a power transformation. Both metrics are to be used in the k-nearest neighbours algorithm regardless of the presence of zeros. Examples with real data are exhibited.
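
A stripped-down version of the idea, with a power transformation followed by k-NN under the taxicab (L1) metric, might look like this; the transformation and all names below are simplified stand-ins, not the paper's exact definitions.

```r
# Power-transform compositions, then classify with k-NN and the L1 metric.
power.close <- function(x, a) {
  y <- x^a
  y / rowSums(y)                       # re-close the powered composition
}
knn.comp <- function(xnew, x, ina, a = 0.5, k = 5) {
  xa <- power.close(x, a); xna <- power.close(xnew, a)
  apply(xna, 1, function(v) {
    d  <- rowSums(abs(xa - matrix(v, nrow(xa), ncol(xa), byrow = TRUE)))
    nn <- order(d)[1:k]
    names(which.max(table(ina[nn])))   # majority vote among the k neighbours
  })
}

set.seed(1)
x <- matrix(rgamma(300, 2), 100, 3); x <- x / rowSums(x)   # 100 compositions
ina <- factor(rep(1:2, each = 50))
knn.comp(x[1:5, ], x, ina)
```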

A novel, divergence based, regression for compositional data

arXiv: Methodology, 2015

In compositional data, an observation is a vector with non-negative components which sum to a constant, typically 1. Data of this type arise in many areas, such as geology, archaeology, biology, economics and political science, amongst others. The goal of this paper is to propose a new, divergence based, regression modelling technique for compositional data. To do so, a recently proven metric which is a special case of the Jensen-Shannon divergence is employed. A strong advantage of this new regression technique is that zeros are naturally handled. An example with real data and simulation studies are presented, and both are compared with the log-ratio based regression suggested by Aitchison in 1986.
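
For reference, the Jensen-Shannon divergence between two compositions, whose square root is the metric in question, can be computed as below (0·log 0 is treated as 0, which is why zeros cause no trouble).

```r
# Jensen-Shannon divergence between two compositions p and q.
js.div <- function(p, q) {
  m  <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

js.div(c(0.2, 0.3, 0.5), c(0.0, 0.6, 0.4))  # zero component handled naturally
```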

A very brief guide to using MXM

MXM offers many feature selection algorithms, namely MMPC, SES, MMMB, FBED, forward and backward regression. The target set of variables to be selected, ideally what we want to discover, is called the Markov Blanket and it consists of the parents, children and parents of children (spouses) of the variable of interest, assuming a Bayesian Network for all variables. MMPC stands for Max-Min Parents and Children. The idea is to use the Max-Min heuristic when choosing variables to put in the selected variables set and proceed in this way. Parents and Children comes from the fact that the algorithm will identify the parents and children of the variable of interest, assuming a Bayesian Network. What it will not recover is the spouses of the children of the variable of interest. For more information the reader is referred to [23]. MMMB (Max-Min Markov Blanket) extends MMPC to discovering the spouses of the variable of interest [19]. SES (Statistically Equivalent Signatures) on the other hand...

Regression analysis with compositional data containing zero values

Regression analysis, for prediction purposes, with compositional data is the subject of this paper. We examine both cases, where compositional data are either response or predictor variables. A parametric model is assumed, but the interest lies in the accuracy of the predicted values. For this reason, a data based power transformation is employed in both cases and the results are compared with the standard log-ratio approach. There are some interesting results, and one advantage of the methods proposed here is the handling of zero values.

A folded model for compositional data analysis

Australian & New Zealand Journal of Statistics, 2020

A folded type model is developed for analyzing compositional data. The proposed model involves an extension of the α-transformation for compositional data and provides a new and flexible class of distributions for modeling data defined on the simplex sample space. Despite its seemingly complex structure, employment of the EM algorithm guarantees efficient parameter estimation. The model is validated through simulation studies and examples which illustrate that the proposed model performs better in terms of capturing the data structure, when compared to the popular logistic normal distribution, and can be advantageous over a similar model without folding.

Taking R to its limits: 70+ tips

R has many capabilities, most of which are not known to many users and are waiting to be discovered. For this reason we provide more than 70 tips on how to write really efficient code without having to program in C++, programming advice, and tips to avoid errors and numerical overflows. This is the first time, to the best of our knowledge, that such a long list of tips has been provided. The tips are categorized according to their use: matrices, simple functions, numerical optimization, parallel computing, programming tips, general advice, etc.
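
One flavour of the kind of tip collected there: computing log(sum(exp(x))) safely, since the direct expression overflows long before the answer does. This example is mine, not necessarily one of the paper's tips.

```r
# Numerically safe log-sum-exp, a standard trick against overflow.
logsumexp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

log(sum(exp(c(1000, 1001))))  # Inf: exp() overflows
logsumexp(c(1000, 1001))      # about 1001.313, the correct value
```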

Efficient feature selection on gene expression data: Which algorithm to use?

Background: Feature selection seeks to identify a minimal-size subset of features that is maximally predictive of the outcome of interest. It is particularly important for biomarker discovery from high-dimensional molecular data, where the features could correspond to gene expressions, Single Nucleotide Polymorphisms (SNPs), protein concentrations, etc. We evaluate, empirically, three state-of-the-art feature selection algorithms, scalable to high-dimensional data: a novel generalized variant of OMP (gOMP), LASSO and FBED. All three greedily select the next feature to include; the first two employ the residuals resulting from the current selection, while the latter rebuilds a statistical model. The algorithms are compared in terms of predictive performance, number of selected features and computational efficiency, on gene expression data with either survival time (censored time-to-event) or disease status (case-control) as an outcome. This work attempts to answer a) whether gOMP ...
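
As a taste of the LASSO arm of the comparison (gOMP and FBED live in other packages), a cross-validated glmnet fit on an n << p design looks like this; the data here are simulated, not the paper's.

```r
# LASSO with cross-validated penalty on a gene-expression-like design.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 1000), 100, 1000)   # n = 100 samples, p = 1000 features
y <- x[, 1] - x[, 2] + rnorm(100)
cvm <- cv.glmnet(x, y)
sel <- which(coef(cvm, s = "lambda.min")[-1] != 0)  # selected features
sel
```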

A Dirichlet Regression Model for Compositional Data with Zeros

Lobachevskii Journal of Mathematics, 2018

Compositional data are met in many different fields, such as economics, archaeometry, ecology, geology and political science. Regression where the dependent variable is a composition is usually carried out via a log-ratio transformation of the composition or via the Dirichlet distribution. However, when there are zero values in the data these two approaches are not readily applicable. Suggestions for this problem exist, but most of them rely on substituting the zero values. In this paper we adjust the Dirichlet distribution when covariates are present, in order to allow for zero values to be present in the data, without modifying any values. To do so, we modify the log-likelihood of the Dirichlet distribution to account for zero values. Examples and simulation studies exhibit the performance of the zero adjusted Dirichlet regression.
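
For reference, the textbook Dirichlet density that the adjusted likelihood starts from is shown below; the precise form of the zero adjustment is in the paper itself.

```latex
f(\mathbf{x};\,\mathbf{a}) \;=\;
\frac{\Gamma\!\left(\sum_{j=1}^{D} a_j\right)}{\prod_{j=1}^{D}\Gamma(a_j)}
\prod_{j=1}^{D} x_j^{\,a_j-1},
\qquad x_j \ge 0,\;\; \sum_{j=1}^{D} x_j = 1.
```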

Gaussian Asymptotic Limits for the α-transformation in the Analysis of Compositional Data

Sankhya A, 2019

Compositional data consist of vectors of proportions whose components sum to 1. Such vectors lie in the standard simplex, which is a manifold with boundary. One issue that has been rather controversial within the field of compositional data analysis is the choice of metric on the simplex. One popular possibility has been to use the metric implied by log-transforming the data, as proposed by Aitchison [1, 2]; another popular approach has been to use the standard Euclidean metric inherited from the ambient space. Tsagris et al. [21] proposed a one-parameter family of power transformations, the α-transformations, which include both the metric implied by Aitchison's transformation and the Euclidean metric as particular cases. Our underlying philosophy is that, with many datasets, it may make sense to use the data to help us determine a suitable metric. A related possibility is to apply the α-transformations to a parametric family of distributions, and then estimate α along with the other parameters. However, as we shall see, when one follows this last approach with the Dirichlet family, some care is needed in a certain limiting case which arises (α → 0), as we found out when fitting this model to real and simulated data. Specifically, when the maximum likelihood estimator of α is close to 0, the other parameters tend to be large. The main purpose of the paper is to study this limiting case both theoretically and numerically and to provide insight into these numerical findings.
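
As a reminder of the object under study (my notation, following the α-transformation literature), the transformation maps a composition x in the D-part simplex to

```latex
u_\alpha(\mathbf{x}) =
\left(\frac{x_1^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}},\;\ldots,\;
      \frac{x_D^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}}\right),
\qquad
z_\alpha(\mathbf{x}) = \frac{1}{\alpha}\bigl(D\,u_\alpha(\mathbf{x}) - \mathbf{1}_D\bigr),
```

so that α = 1 corresponds to a linear map of the raw composition (the Euclidean case), while letting α → 0 recovers Aitchison's centred log-ratio geometry, which is exactly the limiting case the paper analyzes.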

Constraint-based causal discovery with mixed data

International Journal of Data Science and Analytics, 2018

We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial, and ordinal variables. We use likelihood-ratio tests based on appropriate regression models and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs, respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data and show that the proposed approach outperforms alternatives in terms of learning accuracy.
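
A stripped-down illustration of such a likelihood-ratio test for a mixed pair (binary y, continuous x) given z, fitted in both directions, is sketched below; the paper's symmetrization is more careful than simply reporting both p-values.

```r
# Likelihood-ratio conditional independence tests in both directions.
lr.ci <- function(x, y, z) {
  a1 <- anova(glm(y ~ z, family = binomial),
              glm(y ~ z + x, family = binomial), test = "Chisq")
  a2 <- anova(lm(x ~ z), lm(x ~ z + y), test = "Chisq")
  c(p.y.dir = a1[2, "Pr(>Chi)"], p.x.dir = a2[2, "Pr(>Chi)"])
}

set.seed(1)
z <- rnorm(300); x <- z + rnorm(300)
y <- rbinom(300, 1, plogis(z))   # y depends on z only
lr.ci(x, y, z)                   # both p-values should be large
```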

Feature selection with the R package MXM

F1000Research, 2018

Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented in R as a package. The R package MXM is such an example, which not only offers a variety of feature selection algorithms, but has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc; b) it contains a variety of regression models to plug into the feature selection algorithms; c) it includes an algorithm for detecting multiple solutions (many sets of equivalent features); and d) it includes memory efficient algorithms for high volume data, data that cannot be loaded into R. In this paper...

Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation

Machine Learning, 2018

Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we again employ the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop the training of models on new folds of inferior (with high probability) configurations. We name this method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is both efficient and provides accurate performance estimates.
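
The core of the method can be sketched in a few lines: given the pooled out-of-sample losses of every configuration, bootstrap the selection step and score the winner on the out-of-bag samples. This is my minimal reading of the idea, not the authors' implementation.

```r
# BBC sketch: losses is an n x C matrix of per-sample losses, one column
# per configuration, pooled over the CV folds.
bbc <- function(losses, B = 1000) {
  n <- nrow(losses)
  perf <- replicate(B, {
    idx  <- sample(n, replace = TRUE)
    best <- which.min(colMeans(losses[idx, , drop = FALSE]))  # select on bootstrap
    oob  <- setdiff(seq_len(n), idx)
    mean(losses[oob, best])                                   # score out-of-bag
  })
  mean(perf, na.rm = TRUE)   # bias-corrected performance estimate
}

set.seed(1)
losses <- matrix(rbeta(500 * 20, 2, 5), 500, 20)  # 20 toy configurations
bbc(losses)                  # compare with the optimistic min(colMeans(losses))
```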

Exploring the distribution for the estimator of Rosenthal's ‘fail-safe’ number of unpublished studies in meta-analysis

Communications in Statistics - Theory and Methods, 2016

The present paper discusses the statistical distribution of the estimator of Rosenthal's 'file drawer' number N_R, which is an estimator of the number of unpublished studies in meta-analysis. We calculate the probability distribution function of N_R. This is achieved based on the Central Limit Theorem and the proposition that certain components of the estimator N_R follow a half normal distribution, derived from the standard normal distribution. Our proposed distributions are supported by simulations and investigation of convergence.
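
For context, the estimator in question is usually written as follows (my rendering of the standard fail-safe formula):

```latex
N_R \;=\; \frac{\left(\sum_{i=1}^{k} Z_i\right)^{2}}{z_{\alpha}^{2}} \;-\; k,
```

where the Z_i are the standard normal scores of the k published studies and z_α is the one-sided critical value (z_{0.05} ≈ 1.645, so z_α² ≈ 2.706).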

Multivariate data analysis in R

Statistics using Excel 2010

English-Greek Dictionary of Statistical Terms

Statistics using IBM SPSS 22 (Στατιστική με τη χρήση του IBM SPSS 22)

English-Arabic dictionary of statistical terms

An English-Arabic dictionary of statistical terms. It is not a lengthy one, but it covers the needs of the students.

Multivariate outliers, the Forward Search and the Cronbach's reliability coefficient

MSc thesis by Michail Theodoros Tsagris, submitted to the Department of Statistics of the Athens University of Economics and Business in partial fulfilment of the requirements for the MSc in Statistics. ABSTRACT: Multivariate outliers, the Forward Search and Cronbach's reliability coefficient, May 2010. Multivariate outliers are of great interest due to the nature of the data. While in the univariate case things are straightforward, when moving to more than one variable things can become very difficult. In this thesis, multivariate outlier detection methods are discussed and the Forward Search is implemented. Robust estimates of scatter and location are the key feature for the detection of outliers. Finally, Cronbach's reliability coefficient is discussed and applied to the Forward Search as a monitoring statistic.

Skellam reference manual

Densities and Sampling for the Skellam Distribution.
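
Since the Skellam distribution is the law of the difference of two independent Poisson variables, a quick sanity check of the density against simulation looks like this (dskellam() follows the usual d/p/q/r naming convention, which I believe the package uses; verify against the manual).

```r
library(skellam)
# P(K = 0) for the difference of Poisson(3) and Poisson(2) variables
dskellam(0, lambda1 = 3, lambda2 = 2)
# cross-check by simulation
set.seed(1)
mean(rpois(1e5, 3) - rpois(1e5, 2) == 0)
```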

Rfast reference manual

A collection of fast and very fast R functions written in R or C++.
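
A taste of what such functions replace, assuming colmeans() is among them (that is my recollection of the package; check the manual):

```r
library(Rfast)
x <- matrix(rnorm(1e6), 1000, 1000)
a <- colMeans(x)    # base R
b <- colmeans(x)    # the package's C++ counterpart
all.equal(a, as.vector(b))
```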

Compositional reference manual

An R package for compositional data analysis.

Directional reference manual

An R package for directional data analysis.

MXM reference manual

MXM is an R package which offers variable selection for high-dimensional data in cases of regression and classification. Many regression models are offered. In addition, some functions for Bayesian Networks and graphical models are included.

Efficient feature selection on gene expression data: Which algorithm to use

Background: Feature selection seeks to identify a minimal-size subset of features that is maximally predictive of the outcome of interest. It is particularly important for biomarker discovery from high-dimensional molecular data, where the features could correspond to gene expressions, Single Nucleotide Polymorphisms (SNPs), protein concentrations, etc. We evaluate, empirically, three state-of-the-art feature selection algorithms, scalable to high-dimensional data: a novel generalized variant of OMP (gOMP), LASSO and FBED. All three greedily select the next feature to include; the first two employ the residuals resulting from the current selection, while the latter rebuilds a statistical model. The algorithms are compared in terms of predictive performance, number of selected features and computational efficiency, on gene expression data with either survival time (censored time-to-event) or disease status (case-control) as an outcome. This work attempts to answer a) whether gOMP is to be preferred over LASSO and b) whether residual-based algorithms, e.g. gOMP, are to be preferred over algorithms, such as FBED, that rely heavily on regression model fitting.

Results: gOMP is on par with, or outperforms, LASSO in all metrics: predictive performance, number of features selected and computational efficiency. Contrasting gOMP to FBED, both exhibit similar performance in terms of predictive performance and number of selected features. Overall, gOMP combines the benefits of both LASSO and FBED; it is computationally efficient and produces parsimonious models of high predictive performance.

Conclusions: The use of gOMP is suggested for variable selection with high-dimensional gene expression data, and the target variable need not be restricted to time-to-event or case-control, as examined in this paper.

A folded model for compositional data analysis

A folded type model is developed for analyzing compositional data. The proposed model, which is based upon the α-transformation for compositional data, provides a new and flexible class of distributions for modeling data defined on the simplex sample space. Despite its seemingly complex structure, employment of the EM algorithm guarantees efficient parameter estimation. The model is validated through simulation studies and examples which illustrate that the proposed model performs better in terms of capturing the data structure, when compared to the popular logistic normal distribution.

Taking R to its limits: 70+ tips

R has many capabilities, most of which are not known to many users and are waiting to be discovered. For this reason we provide more than 70 tips on how to write really efficient code without having to program in C++, programming advice, and tips to avoid errors and numerical overflows.

Novel methods for the statistical analysis of compositional data

PhD Thesis, 2014

They have been there for me all these four years, at every step of my studies, and I know I have made their heads go big sometimes. They were very patient and tolerant of me. I will not go beyond that; they know how I feel about them and that is enough for me. I will only say that if it were not for them, this would not have been made possible. The knowledge I acquired in other fields is also due to them. They are very knowledgeable and gave me answers on many topics and helped me understand many things. I will stop here by saying that they have been amazing guides and supporters of a PhD student. Second, I would like to say that Gely, my wife, was there for me always. This thesis is dedicated to her and our little baby, Theodoros. Acknowledgements should go to Spyridon as well, for sharing my un-fightable experiences from time to time. My internal examiner Theo and my external examiner Dr Alfred Kume deserve thanks for their valuable and detailed comments, which were crucial not only for the presentation but also for some technicalities. Dr Janice Scealy from the University of Canberra deserves thanks for the long and very constructive discussions we had. I also have to mention Sergey, for enduring me and my short discussions with him. Also Tasos from Greece; he knows why. The B50 beasts, Johhny, Lisa, Beny, V, Shaker, Sammy and Michalis, made life in the office easier. Thanos was a very good imposter and his help with statistics was crucial, and Yannis and Thodoris were good friends. Of course, the list goes further back, to the professors in Greece, my parents, my school teachers, all the way back in time. All of them put a brick in my being where I am now.

Statistics using IBM SPSS 26 and Eviews 11 (Στατιστική με τη χρήση των ΙΒΜ SPSS 26 και Eviews 11)

...of a satisfied student. The reason that the notion of distance does not exist is that these particular variables, quantities and measurements are not numbers.