A novel, divergence based, regression for compositional data

A measure of difference for compositional data based on measures of divergence

For the application of many statistical methods it is necessary to establish the measure of difference to be used. This measure has to be defined in accordance with the nature of the data. In this study we propose a measure of difference when the data set is compositional. We analyze its properties and we present examples to illustrate its performance.

A Dirichlet Regression Model for Compositional Data with Zeros

Lobachevskii Journal of Mathematics

Compositional data are met in many different fields, such as economics, archaeometry, ecology, geology and political sciences. Regression where the dependent variable is a composition is usually carried out via a log-ratio transformation of the composition or via the Dirichlet distribution. However, when there are zero values in the data these two approaches are not readily applicable. Suggestions for this problem exist, but most of them rely on substituting the zero values. In this paper we adjust the Dirichlet distribution when covariates are present, in order to allow for zero values to be present in the data, without modifying any values. To do so, we modify the log-likelihood of the Dirichlet distribution to account for zero values. Examples and simulation studies exhibit the performance of the zero-adjusted Dirichlet regression.

Compositional data and their analysis: an introduction

Geological Society, London, Special Publications, 2006

Compositional data are those which contain only relative information. They are parts of some whole. In most cases they are recorded as closed data, i.e. data summing to a constant, such as 100%, whole-rock geochemical data being classic examples. Compositional data have important and particular properties that preclude the application of standard statistical techniques on such data in raw form. Standard techniques are designed to be used with data that are free to range from −∞ to +∞. Compositional data are always positive and range only from 0 to 100, or any other constant, when given in closed form. If one component increases, others must, perforce, decrease, whether or not there is a genetic link between these components. This means that the results of standard statistical analysis of the relationships between raw components or parts in a compositional dataset are clouded by spurious effects. Although such analyses may give apparently interpretable results, they are, at best, approximations and need to be treated with considerable circumspection. The methods outlined in this volume are based on the premise that it is the relative variation of components which is of interest, rather than absolute variation. Log-ratios of components provide the natural means of studying compositional data. In this contribution the basic terms and operations are introduced using simple numerical examples to illustrate their computation and to familiarize the reader with their use.
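The closure operation and the log-ratio idea described in this abstract can be illustrated with a minimal sketch. The function names `closure` and `clr` and the sample values are assumptions made for illustration; the centred log-ratio is used here as one standard choice of log-ratio, not necessarily the one the chapter works through.

```python
import math

def closure(parts, total=1.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]

def clr(parts):
    """Centred log-ratio: log of each part relative to the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical raw measurements of three parts of a whole.
raw = [12.0, 30.0, 18.0]
percent = closure(raw, total=100.0)  # closed form, sums to 100

# Log-ratios carry only relative information, so they are the same
# whether computed on the raw parts or on the closed percentages.
clr_raw = clr(raw)
clr_percent = clr(percent)
```

Because the centred log-ratio depends only on ratios between parts, the choice of closure constant (1, 100, or any other) does not affect it.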

The α-k-NN regression for compositional data

2020

Compositional data arise in many real-life applications and versatile methods for properly analyzing this type of data in the regression context are needed. This paper, through use of the α-transformation, extends the classical k-NN regression to what is termed α-k-NN regression, yielding a highly flexible non-parametric regression model for compositional data. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and can be incorporated into the proposed model without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using α-k-NN regression for complex relationships between the response data and predictor variables in two cases, namely when the response data are compositional and the predictor variables are continuous (or categorical), and vice versa. Both cases suggest that α-k-NN regression can lead to more accurate predictions compared to current...
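The abstract does not state the α-transformation itself. The sketch below uses a power-transformation form that appears in the wider compositional-data literature, and it should be read as an assumption rather than the paper's exact definition: the published version additionally multiplies by a Helmert sub-matrix, which is omitted here. The names `power_closure` and `alpha_transform` are hypothetical.

```python
import math

def power_closure(x, alpha):
    """Raise each part to the power alpha and re-close to unit sum."""
    powered = [xi ** alpha for xi in x]
    s = sum(powered)
    return [p / s for p in powered]

def alpha_transform(x, alpha):
    """Assumed form: (D * u_alpha(x) - 1) / alpha, for alpha != 0."""
    d = len(x)
    return [(d * ui - 1.0) / alpha for ui in power_closure(x, alpha)]

def clr(x):
    """Centred log-ratio, used only to check the alpha -> 0 limit."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

x = [0.2, 0.3, 0.5]
near_zero = alpha_transform(x, 1e-6)   # approaches clr(x) as alpha -> 0
with_zero = alpha_transform([0.0, 0.4, 0.6], 0.5)  # zero part, alpha > 0
```

For α > 0 no logarithm is taken, so a zero part yields a finite value, which is consistent with the abstract's remark that zeros need no modification. As α tends to 0 the transform tends to the centred log-ratio.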

Non-parametric regression models for compositional data

2020

Compositional data arise in many real-life applications and versatile methods for properly analyzing this type of data in the regression context are needed. This paper, through use of the α-transformation, extends the classical k-NN regression to what is termed α-k-NN regression, yielding a highly flexible non-parametric regression model for compositional data. The α-k-NN is further extended to the α-kernel regression by adopting the Nadaraya-Watson estimator. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and can be incorporated into the proposed models without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using these non-parametric regressions for complex relationships between the compositional response data and Euclidean predictor variables. Both suggest that α-k-NN and α-ker...

The α-k-NN regression for compositional data

arXiv: Methodology, 2020

Compositional data arise in many real-life applications and versatile methods for properly analyzing this type of data in the regression context are needed. This paper, through use of the α-transformation, extends the classical k-NN regression to what is termed α-k-NN regression, yielding a highly flexible non-parametric regression model for compositional data. Unlike many of the recommended regression models for compositional data, zero values (which commonly occur in practice) are not problematic and can be incorporated into the proposed model without modification. Extensive simulation studies and real-life data analyses highlight the advantage of using α-k-NN regression for complex relationships between the compositional response data and Euclidean predictor variables. Both suggest that α-k-NN regression can lead to more accurate predictions compared to current regression models which assume a, sometimes restrictive, parametric re...

Modelling compositional data using Dirichlet regression models

Compositional data are non-negative proportions with unit-sum. These types of data arise whenever we classify objects into disjoint categories and record their resulting relative frequencies, or partition a whole measurement into percentage contributions from its various parts. Under the unit-sum constraint, the elementary concepts of covariance and correlation are misleading. Therefore, compositional data are rarely analyzed with the usual multivariate statistical methods. Aitchison (1986) introduced the logratio analysis to model compositional data. Campbell and Mosimann (1987a) suggested the Dirichlet Covariate Model as a null model for such data. In this paper we investigate the Dirichlet Covariate Model and compare it to the logratio analysis. Maximum likelihood estimation methods are developed and the sampling distributions of these estimates are investigated. Measures of total variability and goodness of fit are proposed to assess the adequacy of the suggested models in anal...
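As a rough illustration of what maximum likelihood estimation in a Dirichlet covariate model optimises, the sketch below evaluates a Dirichlet log-likelihood in which each concentration parameter depends on a single covariate through a log link. The log link, the single covariate, the function names and the data are all assumptions for illustration; the abstract does not specify the parameterisation used in the paper.

```python
import math

def dirichlet_logpdf(x, alphas):
    """Log density of a Dirichlet(alphas) at a composition x with positive parts."""
    return (math.lgamma(sum(alphas))
            - sum(math.lgamma(a) for a in alphas)
            + sum((a - 1.0) * math.log(xi) for a, xi in zip(alphas, x)))

def dirichlet_reg_loglik(betas, covariates, compositions):
    """Sum of log densities with alpha_j = exp(b0_j + b1_j * z) (assumed log link).

    betas holds one (intercept, slope) pair per component.
    """
    total = 0.0
    for z, x in zip(covariates, compositions):
        alphas = [math.exp(b0 + b1 * z) for b0, b1 in betas]
        total += dirichlet_logpdf(x, alphas)
    return total

# Hypothetical data: two observations, three components, one covariate.
covariates = [0.5, 1.5]
compositions = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]

# With all coefficients zero every alpha_j equals 1 (the uniform Dirichlet).
flat = dirichlet_reg_loglik([(0.0, 0.0)] * 3, covariates, compositions)
```

A numerical optimiser would maximise `dirichlet_reg_loglik` over `betas`. The density takes the logarithm of every part, so a composition containing an exact zero cannot be evaluated with this plain form.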

Compositional data: the sample space and its structure

TEST, 2019

The log-ratio approach to compositional data (CoDa) analysis has now entered a mature phase. The principles and statistical tools introduced by J. Aitchison in the eighties have proven successful in solving a number of applied problems. The algebraic-geometric structure of the sample space, tailored to those principles, was developed at the beginning of the millennium. Two main ideas completed J. Aitchison's seminal work: the conception of compositions as equivalence classes of proportional vectors, and their representation in the simplex endowed with an interpretable Euclidean structure. These achievements allowed the representation of compositions in meaningful coordinates (preferably Cartesian), as well as orthogonal projections compatible with the Aitchison distance introduced two decades before. These ideas and concepts are reviewed up to the normal distribution on the simplex and the associated central limit theorem. Exploratory tools, specifically designed for CoDa, are also reviewed. To illustrate the adequacy and interpretability of the sample space structure, a new inequality index, based on the Aitchison norm, is proposed. Most concepts are illustrated with an example of mean household gross income per capita in Spain.

On new developments in divergence statistics

Journal of Mathematical Sciences, 2009

In this paper, we discuss measures of divergence and focus on a recently introduced measure of divergence, the so-called BHHJ measure. A general class of such measures is introduced, and goodness of fit tests for multinomial populations are presented. Simulations are performed to check the appropriateness of the proposed test statistics. Bibliography: 20 titles.

Modelling Compositional Data. The Sample Space Approach

Handbook of Mathematical Geosciences, 2018

Compositions describe parts of a whole and carry relative information. Compositional data appear in all fields of science, and their analysis requires paying attention to the appropriate sample space. The log-ratio approach proposes the simplex, endowed with the Aitchison geometry, as an appropriate representation of the sample space. The main characteristics of the Aitchison geometry are presented, which open the door to statistical analysis aimed at extracting the relative, not absolute, information. As a consequence, compositions can be represented in Cartesian coordinates by using an isometric log-ratio transformation. Standard statistical techniques can be used with these coordinates.
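The isometric log-ratio idea can be sketched as follows. The particular orthonormal basis (a Helmert-type, sequential one) is an assumption, since the abstract does not fix a basis, and the function names and sample compositions are hypothetical. The check at the end illustrates the isometry: the ordinary Euclidean distance between ilr coordinates matches the Aitchison distance, computed here through centred log-ratios.

```python
import math

def clr(x):
    """Centred log-ratio of a composition with positive parts."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def ilr(x):
    """D-1 Cartesian coordinates using a Helmert-type basis (one possible choice)."""
    logs = [math.log(xi) for xi in x]
    coords = []
    running = 0.0  # sum of the first i log-parts
    for i in range(1, len(x)):
        running += logs[i - 1]
        # Balance between the first i parts and part i+1.
        coords.append(math.sqrt(i / (i + 1)) * (running / i - logs[i]))
    return coords

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = [0.2, 0.3, 0.5]
y = [0.6, 0.1, 0.3]

aitchison_distance = euclidean(clr(x), clr(y))
ilr_distance = euclidean(ilr(x), ilr(y))  # same value, in D-1 coordinates
```

Since the coordinates are unconstrained real numbers, standard multivariate techniques can be applied to the output of `ilr` directly.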