A new method for correlation analysis of compositional (environmental) data – a worked example
Related papers
The bivariate statistical analysis of environmental (compositional) data
The Science of the total environment, 2010
Environmental sciences usually deal with compositional (closed) data. Whenever the concentration of chemical elements is measured, the data will be closed, i.e. the relevant information is contained in the ratios between the variables rather than in the data values reported for the variables. Data closure has severe consequences for statistical data analysis. Most classical statistical methods are based on the usual Euclidean geometry - compositional data, however, do not plot into Euclidean space because they have their own geometry which is not linear but curved in the Euclidean sense. This has severe consequences for bivariate statistical analysis: correlation coefficients computed in the traditional way are likely to be misleading, and the information contained in scatterplots must be used and interpreted differently from sets of non-compositional data. As a solution, the ilr transformation applied to a variable pair can be used to display the relationship and to compute a measu...
Univariate statistical analysis of environmental (compositional) data: Problems and possibilities
Science of The Total Environment, 2009
For almost 30 years it has been known that compositional (closed) data have special geometrical properties. In environmental sciences, where the concentration of chemical elements in different sample materials is investigated, almost all datasets are compositional. In general, compositional data are parts of a whole which only give relative information. Data that sum up to a constant, e.g. 100 wt.% or 1,000,000 mg/kg, are the best known example. It is widely neglected that the "closure" characteristic remains even if only one of all possible elements is measured; it is an inherent property of compositional data. No variable is free to vary independently of all the others. Existing transformations to "open" closed data are seldom applied. They are more complicated than a log transformation and the relationship to the original data unit is lost. Results obtained when using classical statistical techniques for data analysis appeared reasonable and the possible consequences of working with closed data were rarely questioned. Here the simple univariate case of data analysis is investigated. It can be demonstrated that data closure must be overcome prior to calculating even simple statistical measures like mean or standard deviation or plotting graphs of the data distribution, e.g. a histogram. Some measures like the standard deviation (or the variance) make no statistical sense with closed data and all statistical tests building on the standard deviation (or variance) will thus provide erroneous results if used with the original data.
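The statement that no variable is free to vary independently of the others can be illustrated with a small numerical sketch. It is not taken from the paper: the helper `close`, the sample size, the value range and the random seed are all assumptions chosen for illustration. Two independent positive variables are drawn, closed to 100, and the ordinary Pearson correlation is computed before and after closure.

```python
import numpy as np


def close(parts, total=1.0):
    """Rescale each row of positive parts so that it sums to `total`.

    Hypothetical helper for this illustration, not an API from the paper.
    """
    parts = np.asarray(parts, dtype=float)
    return total * parts / parts.sum(axis=1, keepdims=True)


# Two independent positive variables (invented data, fixed seed).
rng = np.random.default_rng(0)
raw = rng.uniform(1.0, 10.0, size=(5000, 2))

# Express the same rows as percentages of their own total.
closed = close(raw, total=100.0)

# Pearson correlation before and after closure.
r_raw = np.corrcoef(raw[:, 0], raw[:, 1])[0, 1]
r_closed = np.corrcoef(closed[:, 0], closed[:, 1])[0, 1]

print(f"correlation of raw parts:    {r_raw:+.3f}")
print(f"correlation of closed parts: {r_closed:+.3f}")
```

With only two parts the closed values satisfy x1 + x2 = 100, so their correlation is -1 by construction, whatever the relationship between the raw variables. With more parts the induced dependence is weaker but still present.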
Visualization of correlation-based environmental data
Environmetrics, 2004
A method for the visualization of environmental data via the analysis of correlations has been proposed. As an example, the data in correlation matrices of environmental parameters that describe air pollution in Vilnius city (10 parameters) as well as the development of coastal dunes and their vegetation in Finland (16 parameters) are visualized. These applied problems are very urgent because of their ecological nature: a visual presentation of data stored in the correlation matrices makes it possible for ecologists to discover additional knowledge hidden in the matrices and to make proper decisions. The method consists of two stages: building a system of vectors based on the correlation matrix, and visualizing it. Sammon's mapping and the self-organizing map (two of its realizations, which have different features for visualization) were applied in the visualization. The advantage of the method lies in the possibility of restoring, from the correlation matrix, the system of multidimensional vectors describing the variables: one vector for one variable.
Compositional data and their analysis: an introduction
Geological Society, London, Special Publications, 2006
Compositional data are those which contain only relative information. They are parts of some whole. In most cases they are recorded as closed data, i.e. data summing to a constant, such as 100%, whole-rock geochemical data being classic examples. Compositional data have important and particular properties that preclude the application of standard statistical techniques on such data in raw form. Standard techniques are designed to be used with data that are free to range from −∞ to +∞. Compositional data are always positive and range only from 0 to 100, or any other constant, when given in closed form. If one component increases, others must, perforce, decrease, whether or not there is a genetic link between these components. This means that the results of standard statistical analysis of the relationships between raw components or parts in a compositional dataset are clouded by spurious effects. Although such analyses may give apparently interpretable results, they are, at best, approximations and need to be treated with considerable circumspection. The methods outlined in this volume are based on the premise that it is the relative variation of components which is of interest, rather than absolute variation. Log-ratios of components provide the natural means of studying compositional data. In this contribution the basic terms and operations are introduced using simple numerical examples to illustrate their computation and to familiarize the reader with their use.
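A minimal sketch of two of the basic operations mentioned here, closure and log-ratios, is given below. The three-part sample is invented, and the helper names `close` and `clr` are assumptions for illustration rather than code from the volume. The centred log-ratio (clr) used here is one standard log-ratio transformation; the sketch checks that log-ratios do not depend on the closure constant.

```python
import numpy as np


def close(parts, total=1.0):
    """Rescale each row of positive parts so that it sums to `total`."""
    parts = np.asarray(parts, dtype=float)
    return total * parts / parts.sum(axis=1, keepdims=True)


def clr(parts):
    """Centred log-ratio: log of each part minus the row mean of the logs,
    i.e. the log of each part relative to the row's geometric mean."""
    logs = np.log(np.asarray(parts, dtype=float))
    return logs - logs.mean(axis=1, keepdims=True)


# One invented three-part sample, in arbitrary units.
sample = np.array([[48.0, 32.0, 20.0]])

# The same composition closed to two different constants.
as_percent = close(sample, total=100.0)
as_fraction = close(sample, total=1.0)

# A simple pairwise log-ratio of part 1 to part 2.
log_ratio_percent = np.log(as_percent[0, 0] / as_percent[0, 1])
log_ratio_fraction = np.log(as_fraction[0, 0] / as_fraction[0, 1])

# clr coordinates for both closed versions.
clr_percent = clr(as_percent)
clr_fraction = clr(as_fraction)

print(log_ratio_percent, log_ratio_fraction)
print(clr_percent, clr_fraction)
```

Both closed versions give the same pairwise log-ratio, ln(48/32) = ln(1.5), and the same clr coordinates, which sum to zero within each row. Only the ratios between parts carry information; the closure constant does not.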
STATISTICAL ANALYSIS OF CHEMICAL COMPOSITIONAL DATA AND THE COMPARISON OF ANALYSES
Recent statistical work on approaches to analysing compositional data, where variables sum to a constant for each row of a data matrix, may encounter difficulties when applied to data of the kind typically arising in scientific archaeology. The reason is that results obtained may be unsatisfactory from a substantive viewpoint for identifiable technical reasons. This paper explores and illustrates some possible resolutions of the problem. A feature of the approach used is to analyse subsets of the variables on separate scales. A synthesis of the results obtained from separate analyses is essential and the use of multiple correspondence analysis for this purpose is illustrated.
Spatial analysis of compositional data: A historical review
Journal of Geochemical Exploration, 2016
Like the statistical analysis of compositional data in general, spatial analysis of compositional data requires specific tools. An historical overview of their development is presented in three steps: (a) the recognition of the problem, known as spurious spatial covariance, (b) first attempts to use the logratio approach, and (c) the application of the principle of working in coordinates using isometric logratio representations. Also mentioned are the use of matrix-valued variation-variograms as a tool to model crossvariograms, and the simplicial approach to indicator kriging, that solves inconsistencies in the standard approach to indicator kriging.
In the eighties, John Aitchison (1986) developed a new methodological approach for the statistical analysis of compositional data. This new methodology was implemented in Basic routines grouped under the name CODA and later NEWCODA in Matlab (Aitchison, 1997). After that, several other authors have published extensions to this methodology: Marín-Fernández and others (2000), Barceló-Vidal and others (2001), Pawlowsky-Glahn and Egozcue (2001, 2002) and Egozcue and others (2003). (...) Geologische Vereinigung; Universitat de Barcelona, Equip de Recerca Arqueomètrica; Institut d’Estadística de Catalunya; International Association for Mathematical Geology; Patronat de l’Escola Politècnica Superior de la Universitat de Girona; Fundació privada: Girona, Universitat i Futur.
Interpretation and analysis of complex environmental data using chemometric methods
TrAC Trends in Analytical Chemistry, 1994
An overview of the application of chemometric data analysis methods to complex chemical mixtures in various environmental media is presented. Reviews of selected research are given as examples of the application of principal components analysis and other statistical methods to identify contributions from multiple sources of contamination in air, water, sediments, and biota. Other examples are cited that illustrate how scientists have used classification and regression methods to model the distribution of anthropogenic contaminants and predict their environmental effects or fate.
Monitoring procedures in environmental geochemistry and compositional data analysis theory
2003
The first discussion of compositional data analysis is attributable to Karl Pearson, in 1897. However, notwithstanding the recent developments on the algebraic structure of the simplex, more than twenty years after Aitchison's idea of log-transformations of closed data, the scientific literature is again full of statistical treatments of this type of data using traditional methodologies. This is particularly true in environmental geochemistry where, besides the problem of closure, the spatial structure (dependence) of the data has to be considered. In this work we propose the use of log-contrast values, obtained by a simplicial principal component analysis, as indicators of given environmental conditions. The investigation of the log-contrast frequency distributions allows pointing out the statistical laws able to generate the values and to govern their variability. The changes, if compared, for example, with the mean values of the random variables assumed as models, or other reference parameters, allow defining monitors to be used to assess the extent of possible environmental contamination. Case study on running