Compound compositional data processes

2015

Compositional data are non-negative data subject to the unit-sum constraint. The logistic normal distribution provides a framework for compositional data when they satisfy subcompositional coherence, that is, when inference about a subcomposition is the same whether it is based on the full composition or on the subcomposition alone. However, in many cases subcompositions are not coherent because of additional structure on the compositions, which can be modelled as process(es) inducing change. Sometimes data are collected with a model already well validated, and hence the focus is on estimating the model parameters. Alternatively, the appropriate model is sometimes unknown in advance and the data must be used to identify a suitable model. In both cases, a hierarchy of possible structure(s) is very helpful. This is evident in the evaluation of, for example, geochemical and household expenditure data. In the case of geochemical data, the structural process might be the stoichiometric ...
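A minimal sketch of the coherence idea described above, using made-up numbers: the log-ratio of two parts is unchanged when a subcomposition is extracted and re-closed, even though the individual proportions change. The helper name `closure` and the four-part values are illustrative assumptions, not taken from the paper.

```python
import math

def closure(parts, total=1.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]

# Hypothetical four-part composition (illustrative values only).
full = closure([10.0, 20.0, 30.0, 40.0])

# Subcomposition: keep the first three parts and re-close them.
sub = closure(full[:3])

# The proportions themselves change under re-closure ...
share_full, share_sub = full[0], sub[0]

# ... but the log-ratio between any two retained parts does not.
log_ratio_full = math.log(full[0] / full[1])
log_ratio_sub = math.log(sub[0] / sub[1])
```

Statistics built on log-ratios therefore give the same answer for the full composition and the subcomposition, whereas statistics built on the raw shares generally do not.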

Compositional data and their analysis: an introduction

Geological Society, London, Special Publications, 2006

Compositional data are those which contain only relative information. They are parts of some whole. In most cases they are recorded as closed data, i.e. data summing to a constant such as 100%, with whole-rock geochemical data being classic examples. Compositional data have important and particular properties that preclude the application of standard statistical techniques to such data in raw form. Standard techniques are designed to be used with data that are free to range from −∞ to +∞. Compositional data are always positive and range only from 0 to 100, or any other constant, when given in closed form. If one component increases, others must, perforce, decrease, whether or not there is a genetic link between these components. This means that the results of standard statistical analysis of the relationships between raw components or parts in a compositional dataset are clouded by spurious effects. Although such analyses may give apparently interpretable results, they are, at best, approximations and need to be treated with considerable circumspection. The methods outlined in this volume are based on the premise that it is the relative variation of components which is of interest, rather than absolute variation. Log-ratios of components provide the natural means of studying compositional data. In this contribution the basic terms and operations are introduced using simple numerical examples to illustrate their computation and to familiarize the reader with their use.
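The closure effect described above can be illustrated with a small sketch. The raw amounts are hypothetical and the helper `closure` is an illustrative name, not something defined in the paper. Only the first raw amount is increased, yet the closed percentages of the other two parts fall, while their log-ratio stays the same.

```python
import math

def closure(parts, total=100.0):
    """Rescale positive parts so that they sum to `total` (here, percent)."""
    s = sum(parts)
    return [total * p / s for p in parts]

# Hypothetical raw amounts; only the first part differs between samples.
raw_a = [2.0, 3.0, 5.0]
raw_b = [4.0, 3.0, 5.0]

closed_a = closure(raw_a)  # approximately [20.0, 30.0, 50.0]
closed_b = closure(raw_b)  # approximately [33.3, 25.0, 41.7]

# Parts 2 and 3 appear to decrease in percent although their raw
# amounts are unchanged: this is the spurious effect of closure.
part2_drop = closed_a[1] - closed_b[1]

# The log-ratio of part 2 to part 3 is unaffected by the closure.
log_ratio_a = math.log(closed_a[1] / closed_a[2])
log_ratio_b = math.log(closed_b[1] / closed_b[2])
```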

Time Series Analysis of Compositional Data Using a Dynamic Linear Model Approach

Compositional time series data comprise multivariate observations that, at each time point, are essentially proportions of a whole quantity. This kind of data occurs frequently in many disciplines such as economics, geology and ecology. The usual multivariate statistical procedures available in the literature are not applicable to such data, since they ignore the inherently constrained nature of these observations as parts of a whole. This article describes new techniques for modeling compositional time series data in a hierarchical Bayesian framework. Modified dynamic linear models are fitted to compositional data via Markov chain Monte Carlo techniques. The distribution of the underlying errors is assumed to be a scale mixture of multivariate normals, of which the multivariate normal, multivariate t, multivariate logistic, etc., are special cases. In particular, multivariate normal and Student-t error structures are considered and compared through predictive distributions. The approach is illustrated on a data set.

A folded model for compositional data analysis

A folded-type model is developed for analyzing compositional data. The proposed model, which is based upon the α-transformation for compositional data, provides a new and flexible class of distributions for modeling data defined on the simplex sample space. Despite its seemingly complex structure, employment of the EM algorithm guarantees efficient parameter estimation. The model is validated through simulation studies and examples, which illustrate that the proposed model captures the data structure better than the popular logistic normal distribution.

Modeling asymmetric compositional data

Acta Scientiarum. Technology, 2014

Compositional data belong to the simplex sample space, but they are transformed to the sample space of the real numbers using the additive log-ratio transformation to allow the application of standard statistical techniques. This study aims to model skewed compositional data of three soil components after additive log-ratio transformation. The modeling was done for compositional data of sand, silt and clay (simplex), and bivariate data (real) using the standard skew theory with and without the inclusion of the covariate soil porosity. The analyses were run using the R statistical software and the package sn, and the goodness-of-fit was found after applying the covariate.
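As a rough illustration of the additive log-ratio (alr) step mentioned above, the sketch below maps a three-part composition to two real coordinates and back. The soil fractions are invented, and the choice of the last part (clay) as the reference is an assumption; the abstract does not say which part the authors used. The study itself was run in R, so this Python version is only a language-neutral restatement of the transformation.

```python
import math

def alr(parts):
    """Additive log-ratio: log of each part over the last (reference) part."""
    ref = parts[-1]
    return [math.log(p / ref) for p in parts[:-1]]

def alr_inv(coords):
    """Inverse alr: exponentiate, append 1 for the reference part, re-close."""
    expd = [math.exp(c) for c in coords] + [1.0]
    s = sum(expd)
    return [e / s for e in expd]

# Hypothetical sand, silt, clay fractions (sum to 1).
soil = [0.6, 0.3, 0.1]

coords = alr(soil)      # two unconstrained real coordinates
back = alr_inv(coords)  # returns to the simplex
```

A three-part composition yields bivariate real data, which is why a bivariate model can be fitted after the transformation.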

Compositional data: the sample space and its structure

TEST, 2019

The log-ratio approach to compositional data (CoDa) analysis has now entered a mature phase. The principles and statistical tools introduced by J. Aitchison in the eighties have proven successful in solving a number of applied problems. The algebraic-geometric structure of the sample space, tailored to those principles, was developed at the beginning of the millennium. Two main ideas completed J. Aitchison's seminal work: the conception of compositions as equivalence classes of proportional vectors, and their representation in the simplex endowed with an interpretable Euclidean structure. These achievements allowed the representation of compositions in meaningful coordinates (preferably Cartesian), as well as orthogonal projections compatible with the Aitchison distance introduced two decades before. These ideas and concepts are reviewed up to the normal distribution on the simplex and the associated central limit theorem. Exploratory tools, specifically designed for CoDa, are also reviewed. To illustrate the adequacy and interpretability of the sample space structure, a new inequality index, based on the Aitchison norm, is proposed. Most concepts are illustrated with an example of mean household gross income per capita in Spain.
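The Aitchison norm mentioned above can be sketched as the Euclidean norm of the centred log-ratio (clr) vector. The code below is an illustration with invented numbers; it does not reproduce the inequality index proposed in the paper, whose exact definition is not given in the abstract. It shows two properties that make the norm a plausible basis for such an index: proportional vectors have the same norm, and a composition with equal parts has norm zero.

```python
import math

def clr(parts):
    """Centred log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_norm(parts):
    """Euclidean norm of the clr vector."""
    return math.sqrt(sum(v * v for v in clr(parts)))

# Equal shares: the neutral composition has norm 0.
norm_equal = aitchison_norm([1.0, 1.0, 1.0, 1.0])

# Proportional vectors represent the same composition.
norm_small = aitchison_norm([1.0, 2.0, 3.0])
norm_scaled = aitchison_norm([10.0, 20.0, 30.0])

# More uneven shares lie further from the neutral composition.
norm_mild = aitchison_norm([3.0, 3.0, 4.0])
norm_strong = aitchison_norm([1.0, 1.0, 8.0])
```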

Modelling Compositional Data. The Sample Space Approach

Handbook of Mathematical Geosciences, 2018

Compositions describe parts of a whole and carry relative information. Compositional data appear in all fields of science, and their analysis requires paying attention to the appropriate sample space. The log-ratio approach proposes the simplex, endowed with the Aitchison geometry, as an appropriate representation of the sample space. The main characteristics of the Aitchison geometry are presented, which open the door to statistical analysis addressed to extract the relative, not absolute, information. As a consequence, compositions can be represented in Cartesian coordinates by using an isometric log-ratio transformation. Standard statistical techniques can be used with these coordinates.
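One way to see the isometric log-ratio (ilr) idea in action is the sketch below. It uses one particular orthonormal basis (a Helmert-type construction, chosen here as an assumption; the chapter may use a different basis) and checks numerically that the ordinary Euclidean distance between ilr coordinates matches the Aitchison distance computed from pairwise log-ratios. The two compositions are invented.

```python
import math

def ilr(parts):
    """ilr coordinates for a Helmert-type basis: coordinate i contrasts the
    geometric mean of the first i parts with part i + 1."""
    logs = [math.log(p) for p in parts]
    coords = []
    for i in range(1, len(parts)):
        mean_log_first_i = sum(logs[:i]) / i
        coords.append(math.sqrt(i / (i + 1)) * (mean_log_first_i - logs[i]))
    return coords

def aitchison_distance(x, y):
    """Aitchison distance from all pairwise log-ratio differences."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(d):
            diff = math.log(x[i] / x[j]) - math.log(y[i] / y[j])
            total += diff * diff
    return math.sqrt(total / (2 * d))

def euclidean(u, v):
    """Ordinary Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical three-part compositions.
x = [0.2, 0.3, 0.5]
y = [0.6, 0.3, 0.1]

dist_in_coordinates = euclidean(ilr(x), ilr(y))
dist_in_simplex = aitchison_distance(x, y)
```

Because the two distances agree, methods that rely on Euclidean geometry (regression, clustering, and so on) can be applied to the coordinates directly.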

Units Recovery Methods in Compositional Data Analysis

Natural resources research, 2020

Compositional data carry relative information. Hence, their statistical analysis has to be performed on coordinates with respect to a log-ratio basis. Frequently, the modeler is required to back-transform the estimates obtained with the modeling to have them in the original units, such as euros, kg or mg/liter. Approaches for recovering original units need to be formally introduced and their properties explored. Here, we formulate and analyze the properties of two procedures: a simple approach consisting of adding a residual part to the composition, and an approach based on the use of an auxiliary variable. Both procedures are illustrated using a geochemical data set where the original units are recovered when spatial models are applied.
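The following sketch shows one possible reading of the residual-part idea, not the authors' actual procedure. It assumes concentrations in mg/kg, so that the measured parts plus a residual part sum to a known total of 1,000,000. Log-ratios are taken against the residual, and the original units are recovered by inverting those log-ratios and rescaling by the known total. The function names, the total, and the concentration values are all hypothetical.

```python
import math

TOTAL = 1_000_000.0  # assumed closure constant: mg per kg

def to_log_ratios(measured, total=TOTAL):
    """Log-ratio of each measured part to the residual part (total - sum)."""
    residual = total - sum(measured)
    if residual <= 0:
        raise ValueError("measured parts must sum to less than the total")
    return [math.log(m / residual) for m in measured]

def recover_units(coords, total=TOTAL):
    """Invert the log-ratios and rescale so that the parts plus the
    residual sum to `total` again, giving values in the original units."""
    expd = [math.exp(c) for c in coords]
    denom = 1.0 + sum(expd)  # the 1.0 stands for the residual part
    return [total * e / denom for e in expd]

# Hypothetical concentrations in mg/kg.
measured = [1200.0, 350.0, 80.0]

coords = to_log_ratios(measured)   # model these coordinates
recovered = recover_units(coords)  # back in mg/kg
```

In practice the coordinates would be replaced by model estimates before the back-transformation; here they are passed through unchanged only to show that the round trip returns the original values.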

Univariate statistical analysis of environmental (compositional) data: Problems and possibilities

Science of The Total Environment, 2009

For almost 30 years it has been known that compositional (closed) data have special geometrical properties. In environmental sciences, where the concentration of chemical elements in different sample materials is investigated, almost all datasets are compositional. In general, compositional data are parts of a whole which only give relative information. Data that sum to a constant, e.g. 100 wt.% or 1,000,000 mg/kg, are the best-known example. It is widely neglected that the "closure" characteristic remains even if only one of all possible elements is measured; it is an inherent property of compositional data. No variable is free to vary independently of all the others. Existing transformations to "open" closed data are seldom applied. They are more complicated than a log transformation, and the relationship to the original data unit is lost. Results obtained when using classical statistical techniques for data analysis appeared reasonable, and the possible consequences of working with closed data were rarely questioned. Here the simple univariate case of data analysis is investigated. It can be demonstrated that data closure must be overcome prior to calculating even simple statistical measures like the mean or standard deviation, or plotting graphs of the data distribution, e.g. a histogram. Some measures like the standard deviation (or the variance) make no statistical sense with closed data, and all statistical tests building on the standard deviation (or variance) will thus provide erroneous results if used with the original data.