Software and Programs
This page is intended to provide some of the code I have developed over the years to automate recurring tasks, or to work around the absence of some advanced but not yet widely applied statistical methods in standard statistical software packages. For each file a short description of the method and its purpose is given, and all files come with a basic in-file description of their use. However, those short descriptions mainly serve as user manuals rather than as in-depth descriptions of the methods. Wherever applicable, citations for further reading are provided, in most cases covering the original work in which the respective method was described. Every user is strongly encouraged to acquire at least a basic knowledge of the method before using the software. No liability whatsoever will be accepted for any losses resulting directly or indirectly from the use of this software.
Part of the software consists of MS Excel sheets, in most cases saved in the xlsx file format. This format has been around for a while now, and I assume everyone has access to a system running MS Office 2007 or higher. In principle, xlsx files can be converted to xls files, and some online and/or freeware converters exist for that purpose. I cannot guarantee, however, that I did not use expressions that are incompatible with older versions of MS Excel. Whether or not the spreadsheets can be used in OpenOffice, LibreOffice, or any other free alternative to MS Excel remains to be tested.
The majority of the code, however, especially for the more advanced techniques, is written for R (http://www.r-project.org/) or Matlab. It is provided as files in ASCII format, which can be opened with any text editor. The R code is normally provided as functions, which can either be copied into the R console or imported via the source() command, and thereafter invoked with a normal function call. If third-party packages are required, they are loaded automatically if installed, but they will not be installed automatically if they are missing from your R installation. Please check beforehand whether all necessary packages are installed, and if not, install them before using the code. Each code file further contains one or more examples (either in-code or as separate files), demonstrating the use and also serving as examples of the required input file format.
Before presenting my own small contribution, I would like to give you a list of other useful programs and web pages concerned with statistics (feel free to spread the word):
Graham, D. and Midgley, N. Triangular diagram plotting spreadsheet (TRI-PLOT).
Hammer, Ø., Harper, D. A. T., and Ryan, P. D. (2001) PAST: Paleontological Statistics software package for education and data analysis. Palaeontologia Electronica 4 (1): 1–9.
McDonald, J. H. Handbook of biological statistics.
R Development Core Team (2011) R: A language and environment for statistical computing. (R Foundation for Statistical Computing: Vienna), ISBN 3-900051-07-0.
Wood, M. Making sense of statistics: A non-mathematical approach.
R codes
This section contains ASCII code files for use in R. In principle, you can open them in any text editor, and either copy the code into the R console or use source("FilePath") to make the function known to R. Furthermore, you can work directly from the console.
However, it is often more convenient to write code in an editor and send command lines directly to R. Two possible options for that are Notepad++ with the NppToR extension installed on your computer, or Tinn-R as an editor.
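The workflow of sourcing a file and then calling its function can be sketched as follows. The file content and the function name `my_mean` are placeholders made up for this illustration, not one of the files offered here; the package check uses `stats` only because it ships with every R installation.

```r
# Write a small function file to a temporary location
# (a stand-in for a downloaded code file).
code_file <- tempfile(fileext = ".R")
writeLines("my_mean <- function(x) sum(x) / length(x)", code_file)

# Make the function known to R, then call it like any other function.
source(code_file)
my_mean(c(2, 4, 6))

# Check for a required package before running code that depends on it.
# Replace 'stats' with the package named in the header of the code file.
if (!requireNamespace("stats", quietly = TRUE)) {
  stop("Please install the missing package first, e.g. install.packages('stats')")
}
```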
Function analysis (R)
Sometimes one can fit a complex yet explanatorily valid function to a set of observed data. Often it is then necessary to work with parameters of that fitted function. While it is comparatively easy to find the extremal points of such a function or to calculate the goodness of fit, some tasks are harder to perform. The curvature function is a well-known equation that makes it possible to find the point of maximum curvature of a function, but it is rather difficult to calculate, partly because it requires complex mathematics with the first and second derivatives of the original function and, in principle, also with the resulting curvature function. The R code provided calculates and returns the curvature function of any input function. It also iteratively calculates the position of maximum curvature in the input function. While this iterative approach is in principle inferior to the mathematical approach using derivatives, it circumvents some problems and can still deliver results to virtually any desired accuracy.
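As an illustration of the idea, the following minimal sketch builds the curvature function κ(x) = |f''(x)| / (1 + f'(x)²)^(3/2) for an example function and searches for its maximum numerically. The example function exp(x), the search interval, and the use of optimize() are assumptions made for this sketch; the provided code may proceed differently.

```r
# Curvature of y = f(x): kappa(x) = |f''(x)| / (1 + f'(x)^2)^(3/2)
f_expr <- quote(exp(x))      # example function (an assumption)
d1 <- D(f_expr, "x")         # first derivative, symbolic
d2 <- D(d1, "x")             # second derivative, symbolic

curvature <- function(x) {
  abs(eval(d2, list(x = x))) / (1 + eval(d1, list(x = x))^2)^(3 / 2)
}

# Numerical search for the maximum within a chosen interval.
opt <- optimize(curvature, interval = c(-5, 5), maximum = TRUE, tol = 1e-8)
opt$maximum                  # analytical solution for exp(x) is -log(2) / 2
```

The interval passed to optimize() must contain only one curvature maximum; for functions with several local maxima, the search would have to be repeated on sub-intervals.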
Conversion of geographical coordinates (R)
Everybody working in the geosciences knows the problem of differing systems of geographical coordinates. While many people still prefer the degrees–minutes–seconds scheme, many programs require decimal degrees. Some sources even mix both systems and provide coordinates as degrees and decimal minutes. The problem becomes even more pressing when someone compiles data from different sources and wants to convert them into one scheme. This code provides the means to convert a set of coordinates provided in any of the aforementioned systems into any other of those systems. I did not find a way to implement automated format recognition (and I doubt that it is possible), so the data encoding must be entered manually in the first or second column of the dataset. The old version v. 1.0 is provided for compatibility with older datasets.
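The arithmetic behind such a conversion is simple and can be sketched as below. The function names and the hemisphere argument are hypothetical and chosen for this example only; they do not reflect the interface of the provided code.

```r
# Convert degrees-minutes-seconds to decimal degrees.
# hemisphere: "N"/"E" give positive values, "S"/"W" negative values.
dms_to_decimal <- function(deg, min = 0, sec = 0, hemisphere = "N") {
  sgn <- ifelse(hemisphere %in% c("S", "W"), -1, 1)
  sgn * (abs(deg) + min / 60 + sec / 3600)
}

# Convert decimal degrees back to degrees, minutes, and seconds.
decimal_to_dms <- function(dd) {
  a <- abs(dd)
  deg <- floor(a)
  min <- floor((a - deg) * 60)
  sec <- (a - deg - min / 60) * 3600
  data.frame(sign = sign(dd), deg = deg, min = min, sec = sec)
}

dms_to_decimal(53, 4, 30, "N")   # 53 + 4/60 + 30/3600
dms_to_decimal(8, 48.5, 0, "W")  # degrees and decimal minutes: leave sec at 0
```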
Conductivity to salinity conversion (R)
This function converts conductivity values of sea water (at known temperature and pressure) into salinity values on the practical salinity scale.
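For orientation, a reduced sketch of the practical salinity scale (PSS-78) calculation is given below. It covers atmospheric pressure only, so the pressure correction is left out, and it handles one value at a time. The function name, the unit of mS/cm, and the reference conductivity of 42.914 mS/cm for standard seawater at 15 °C are assumptions of this sketch, not taken from the provided code.

```r
# Practical salinity (PSS-78) from conductivity at atmospheric pressure.
# cond: conductivity in mS/cm; temp: temperature in degrees C (single values).
# The pressure correction is omitted in this sketch.
cond_to_salinity <- function(cond, temp) {
  a  <- c(0.0080, -0.1692, 25.3851, 14.0941, -7.0261, 2.7081)
  b  <- c(0.0005, -0.0056, -0.0066, -0.0375, 0.0636, -0.0144)
  cc <- c(0.6766097, 2.00564e-2, 1.104259e-4, -6.9698e-7, 1.0031e-9)
  k  <- 0.0162
  R  <- cond / 42.914            # ratio to standard seawater (S = 35, 15 degrees C)
  rt <- sum(cc * temp^(0:4))     # temperature dependence of standard seawater
  Rt <- R / rt
  pw <- Rt^((0:5) / 2)           # Rt^0, Rt^0.5, ..., Rt^2.5
  sum(a * pw) + (temp - 15) / (1 + k * (temp - 15)) * sum(b * pw)
}

cond_to_salinity(42.914, 15)     # standard seawater, should be close to 35
```

The scale is only defined for salinities of roughly 2 to 42, so results outside that range should not be trusted.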
Extraction of data points from a diagram (R)
Most scientists know the problem of the availability of raw data (or rather, the lack thereof), especially for older works. An author may have presented results that are quite useful for one's own work, but one only has access to the diagrams. In such cases, one often wants to reconstruct the data in digitised form in order to work with them more easily. Admittedly, that is possible in many advanced image processing tools, but those can cause problems, for instance when the _x_- and _y_-axes are not to the same scale. And being able to create a table of _x_–_y_-coordinates in a given program does not necessarily mean that one can simply export that table*. One can also perform the task by hand and manually enter the data into an Excel sheet or similar, but that requires a high degree of precision and some manual calculations, and is overall rather time consuming. This R code provides a function to plot a 2D diagram as an image, set the scale for both axes separately, digitise the points, and obtain an automatically exported table with the _x_- and _y_-coordinates of all digitised points as a .txt file. The image needs to be in ppm file format. Conversion into this format is possible with most image processing programs, but it can also be done directly from R using ImageMagick (a short code to convert several logically named files as a batch job is included).
*Personal experience of the author: Sometimes the only way is to manually copy and paste each and every value, one by one, into a spreadsheet.
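The core of setting the scale for both axes separately is a linear mapping from pixel positions to data values. The sketch below assumes linear axes and two known reference ticks per axis; the function name and all pixel values are made up for illustration.

```r
# Map pixel positions to data coordinates using two known reference points per axis.
# px: clicked pixel position(s); px_ref: pixel positions of two axis ticks;
# val_ref: data values of those two ticks.
pixel_to_data <- function(px, px_ref, val_ref) {
  val_ref[1] + (px - px_ref[1]) * diff(val_ref) / diff(px_ref)
}

# Hypothetical calibration: x-axis ticks 0 and 10 at pixels 100 and 600,
# y-axis ticks 0 and 50 at pixels 400 and 50 (image rows run top to bottom).
clicks <- data.frame(px_x = c(150, 350), px_y = c(330, 225))
digitised <- data.frame(
  x = pixel_to_data(clicks$px_x, c(100, 600), c(0, 10)),
  y = pixel_to_data(clicks$px_y, c(400, 50), c(0, 50))
)
digitised
# write.table(digitised, "points.txt", sep = "\t", row.names = FALSE)
```

Because each axis has its own calibration, differently scaled or inverted axes are handled without extra steps; logarithmic axes would need the values transformed first.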
False discovery rate correction (R)
This code calculates significance levels that are corrected for multiple testing. Such corrections are often performed automatically in analyses where multiple tests are standard procedure (such as post-hoc tests in ANOVA), but are rarely available on their own when one has to correct one's own results. The function calculates the significance level corrected after Benjamini and Hochberg (1995) on the basis of a provided set of calculated _p_-values, as well as the Bonferroni correction of the significance level. Note that the p.adjust command of the R stats package also allows you to correct your _p_-values according to several correction schemes. In contrast to this code, p.adjust does not provide the recalculated significance level for the whole dataset but corrects the entered _p_-values on a per-test basis. Which version is preferable depends on the intentions and customs of the user.
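The difference between the two views can be sketched with a few made-up _p_-values. The way the corrected significance level is expressed here (the largest passing threshold of the step-up procedure) is one common reading and an assumption of this sketch, not necessarily the output of the provided function.

```r
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)  # made-up p-values
alpha <- 0.05
m <- length(p)

# Benjamini-Hochberg: compare the i-th smallest p-value with (i / m) * alpha;
# the largest threshold that is met serves as the corrected significance level.
p_sorted <- sort(p)
passes <- p_sorted <= seq_len(m) / m * alpha
bh_level <- if (any(passes)) max(which(passes)) / m * alpha else 0

# Bonferroni: a single, stricter threshold.
bonf_level <- alpha / m

# Per-test alternative from the stats package.
p_adj <- p.adjust(p, method = "BH")

sum(p <= bh_level) == sum(p_adj <= alpha)   # both views reject the same tests
```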
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57 (1): 289–300.
Calculating bootstrapped confidence intervals (R)
Bootstrapping is another way to calculate confidence intervals of data means when standard approaches are not suitable for some reason (e.g. the data are not normally distributed). This code calculates a confidence interval of the mean of several data values at any chosen confidence level. Although bootstrapping mathematically works with as few as three measurement values per sample, larger sample sizes are strongly recommended. The code includes a self-adaptive approach that uses accelerated bootstrapping when possible and falls back to basic bootstrapping if the former is not suitable (Dixon 2002). This self-adaptation can be turned off in order to use basic bootstrapping throughout. The old v. 1.1 is provided for backwards compatibility. As of v. 3.0, the standard deviation of the sample values, including its 95% confidence interval (Sheskin 2011), is also calculated.
If you prefer to use MS Excel rather than R, you should have a look at resample.xls on http://woodm.myweb.port.ac.uk/nms/, which does nearly the same as this code (although it only allows basic bootstrapping) but involves much more manual work for large datasets.
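To show the principle, here is a minimal sketch of a simple percentile bootstrap for the mean. It does not reproduce the accelerated variant or the self-adaptive switching described above, and the sample values, the number of resamples, and the variable names are assumptions for illustration.

```r
set.seed(1)                                   # reproducible resampling
x <- c(4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.9, 4.9, 5.2)  # made-up sample
n_boot <- 2000
conf <- 0.95

# Resample with replacement and store the mean of each resample.
boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

# Percentile interval of the bootstrap distribution.
ci <- quantile(boot_means, c((1 - conf) / 2, 1 - (1 - conf) / 2))
ci
```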
Dixon, Ph. M. (2002) Bootstrap resampling. in El-Shaarawi, A. H. and Piegorsch, W. W. (eds) Encyclopedia of Environmetrics, pp. 212–20 (John Wiley & Sons, Ltd.: Chichester).
Sheskin, D. J. (2011) Handbook of Parametric and Nonparametric Statistical Procedures. 5th ed., 1886 pp. (Chapman & Hall/CRC Press: Boca Raton, London, New York).
Calculating confidence intervals for permutation data (R)
Sometimes, for instance when calculating the Measurement Based Weight of organisms (Beer et al. 2010), one only obtains one data value per sample, which is in fact already the mean of several organisms. Such procedures are often used when analysing each specimen individually would be too time consuming. If such a per-specimen analysis is possible in principle, however, this code provides the necessary tools to estimate the confidence interval of your data. If you perform a Monte Carlo permutation, measuring only a random subset of at least one sample several times, those data reflect the variability of your organisms and can thus be used for confidence interval calculations. The first part of the provided code does exactly that, using the _Q_8(p) definition of the quantiles (Hyndman and Fan 1996).
The second part of the code can be used to include those calculated confidence intervals in correlation analyses, which would be hard to do using standard approaches. For that, a randomisation approach is applied that, in each rerun, randomly chooses the value used for the correlation from within the calculated confidence interval.
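Both steps can be sketched roughly as follows. All data values are made up, and the choices of a uniform draw within the interval, a single interval half-width shared by all samples, and 1000 reruns are assumptions of this sketch rather than details of the provided code.

```r
set.seed(42)
# Made-up repeated subset means for one sample (e.g. mean weight per random subset).
perm_means <- c(10.2, 9.8, 10.9, 10.4, 9.5, 10.1, 10.7, 9.9, 10.3, 10.0)

# 95 % interval from the permutation values, quantile definition 8.
ci <- quantile(perm_means, c(0.025, 0.975), type = 8)
half_width <- unname(diff(ci)) / 2

# Randomised correlation: draw each sample's value from within its interval.
y   <- c(10.1, 11.4, 12.0, 13.2, 13.9, 15.1)   # one measured value per sample
env <- c(1.0, 2.1, 2.9, 4.2, 5.0, 6.1)         # environmental variable
r_runs <- replicate(1000, {
  y_rand <- runif(length(y), y - half_width, y + half_width)
  cor(env, y_rand)
})
quantile(r_runs, c(0.025, 0.5, 0.975), type = 8)
```

The spread of the resulting correlation coefficients then indicates how much the measurement uncertainty could affect the correlation.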
Beer, Ch. J., Schiebel, R., and Wilson, P. A. (2010) Technical note: On methodologies for determining the size-normalised weight of planktic Foraminifera. Biogeosciences 7: 2193–8.
Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages. The American Statistician 50 (4): 361–5.
Cite as:
Weinkauf, M. F. G., Moller, T., Koch, M. C., and Kučera, M. (2013) Calcification intensity in planktonic Foraminifera reflects ambient conditions irrespective of environmental stress. Biogeosciences 10: 6639–55.
Calculating coefficients of variation (R)
The coefficient of variation can be used to assess how variable a population is in any parameter. It is defined as the ratio of the standard deviation to the mean of the population. Many methods exist to calculate confidence intervals for the coefficient of variation, and one such method (Vangel 1996) is implemented in the program. The code reads a set of parameters that are grouped in some way and calculates, for each parameter, the by-group coefficient of variation including its confidence interval.
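The by-group calculation of the coefficient itself can be sketched in a few lines. The data frame and its column names are invented for this example, and the confidence interval after Vangel (1996) is not reproduced here.

```r
# Made-up measurements of one parameter in two groups.
dat <- data.frame(
  group = rep(c("A", "B"), each = 5),
  size  = c(10, 12, 11, 9, 13, 20, 27, 18, 25, 30)
)

# Coefficient of variation = standard deviation / mean, computed per group.
cv <- function(x) sd(x) / mean(x)
cv_by_group <- tapply(dat$size, dat$group, cv)
cv_by_group
```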
Vangel, M. G. (1996) Confidence intervals for a normal coefficient of variation. The American Statistician 50 (1): 21–6.
Calculating confidence intervals for relative abundances (R)
The calculation of the confidence interval of relative abundances of species, morphotypes, etc. in a sample is not as easy as in many other cases. Since relative abundances cannot be negative and individual relative abundances of several groups in a sample are not independent of each other, procedures for multinomial mathematics must be applied. Luckily, Heslop et al. (2011) developed a modern adaptation of those methods. The R-code provided reads abundances of species in samples (either absolute or relative abundances) and calculates the multinomial confidence intervals for the relative abundances of all species.
Heslop, D., De Schepper, S., and Proske, U. (2011) Diagnosing the uncertainty of taxa relative abundances derived from count data. Marine Micropaleontology 79: 114–20.
Proportions z-test (R)
Comparing proportions (i.e. incidences of observations within a sample) is not straightforward. Proportions, strictly speaking, are nominal variables (McDonald 2009), which means that standard approaches such as the _t_-test, the Mann–Whitney _U_ test, or ANOVA-like approaches are not applicable. Contrary to common belief, this problem is also not solved by an arcsine square-root transformation of the data: this only eliminates the unimodal relationship between mean and standard deviation in such data, but by no means necessarily transforms the data to a normal distribution with homoscedasticity, nor does it turn the data from nominal into scale variables. Luckily, the _z_-test can be adapted to work with proportions (without requiring any transformation). The provided R code does exactly this, and also includes a Benjamini–Hochberg (1995) correction of the _p_-values in case more than one comparison between groups is involved.
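A minimal sketch of a two-sample _z_-test for proportions with a pooled standard error is shown below. The function name, the counts, and the additional _p_-values used for the correction step are made up; the provided code may differ in interface and details.

```r
# Two-sample z-test for proportions with a pooled standard error.
prop_z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se
  c(z = z, p.value = 2 * pnorm(-abs(z)))
}

# Made-up counts: 45 of 100 versus 30 of 100 specimens show a trait.
res <- prop_z_test(45, 100, 30, 100)
res

# Several comparisons: adjust the p-values after Benjamini and Hochberg.
p.adjust(c(res[["p.value"]], 0.20, 0.03), method = "BH")
```

For two groups, the result agrees with prop.test() from the stats package when its continuity correction is switched off.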
McDonald, J. H. (2009) Handbook of Biological Statistics. 2nd ed., 319 pp. (Sparky House Publishing: Baltimore).
Kendall–Theil robust line fitting (R)
Linear regression in general is one of the most abused methods of statistical analysis, given that many people do not understand the fundamental (and partly philosophical) difference between correlation and regression, and simply go for regression (even worse, often far from linear) because a line ‘looks neat’. Matters get worse when one counts how often model I linear regression is used on datasets that are simply not suitable for that analysis. Besides the obvious prerequisite that a straight line best describes the monotonic relationship between the dependent and the independent variable, several other assumptions are made: (a) _x_-values are assumed to be measured essentially without error (as far as that is possible in reality), (b) _x_-values were chosen by the experimenter, and (c) _y_-values are normally distributed with the same variance for all values of _x_. If any of those assumptions is violated, a model I linear regression is not suitable and should be replaced by a model II linear regression method. Given these prerequisites, model I linear regression is mainly of use for laboratory experiments, whereas points (a) and (b) in particular will seldom hold true for data collected in the field or from the fossil record. The Kendall–Theil robust line fitting method invoked here is one of the most robust of the model II linear regressions that should be used in such cases, and it is nevertheless still not implemented in any program I know except that one.
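The principle of the Kendall–Theil line can be sketched as follows: the slope is the median of the slopes between all pairs of points. The intercept rule used here (median of _y_ minus slope times median of _x_) is one of several conventions in use and an assumption of this sketch, as are the function name and the data.

```r
# Kendall-Theil line: slope = median of all pairwise slopes,
# intercept = median(y) - slope * median(x).
kendall_theil <- function(x, y) {
  pairs <- combn(length(x), 2)
  dx <- x[pairs[2, ]] - x[pairs[1, ]]
  dy <- y[pairs[2, ]] - y[pairs[1, ]]
  slope <- median(dy[dx != 0] / dx[dx != 0])   # skip pairs with identical x
  c(intercept = median(y) - slope * median(x), slope = slope)
}

x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 40)      # last point is an outlier
kendall_theil(x, y)
coef(lm(y ~ x))                                # least squares, for comparison
```

Comparing both outputs shows how little the single outlier affects the median-based slope, whereas the least-squares slope is pulled strongly towards it.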
Stukel's goodness-of-fit test (R)
It can sometimes be difficult to calculate the goodness of fit (i.e. the ability of the model to describe the data well) for binomial models. While methods like Pearson's chi-square test work when the data can be grouped into strata, they fail as soon as either any of the strata contains fewer than _c._ 5 cases or at least one of the independent variables is continuous, which prevents the construction of strata altogether. For such cases, the test devised by Stukel (1988) is a versatile alternative for testing how well the model describes the data.
Stukel, Th. A. (1988) Generalized logistic models. Journal of the American Statistical Association 83 (402): 426–31.
Traditional morphometrics (R)
Traditional morphometrics is one of the three major branches of morphometrics (outline analyses and geometric morphometrics being the other two). It uses only a few selected measurements to describe the morphology of an object, which is why it is often neglected nowadays. What it lacks in descriptive power, however, it makes up for in comprehensibility and in the time-efficiency of data gathering. The script provided here includes some of the main approaches used in traditional morphometric analyses. Most of the functions are based on Claude (2008), but have been reworked/reassembled to obtain an even higher degree of automation.
Claude, J. (2008) Morphometrics with R. Gentleman, R., Hornik, K., and Parmigiani, G. (eds) Use R!, vol. 2, 316 pp. (Springer).
Outline analyses (R)
Outline analyses are one of the three major branches of morphometrics (traditional morphometrics and geometric morphometrics being the other two). While traditional morphometrics use only a few selected measurements to describe the morphology of an object, outline analyses and geometric morphometrics try to capture the whole picture of morphology. Geometric morphometrics relies on the definition of relatively few, well chosen landmarks within the object to perform that task. Outline analyses, on the other hand, use the digitized outline of the object to describe its shape.
While outline analyses are often inferior to geometric morphometrics in terms of descriptive power, and have been criticised because they neglect features within the object by focusing exclusively on the outline and mostly involve more or less drastic mathematical recalculations, they have some advantages as well. They can capture shape on a more objective level than any other method, because all of the other methods rely on the more or less subjective definition of measurement points. They can describe form (in terms of outer shape) better than any other method. They allow a complete, smooth reconstruction of any possible shape and can model that shape as a whole, not being limited to the reconstruction of the position of a few points. Lastly, in contrast to geometric morphometrics, results can be analysed using traditional multivariate approaches without further parameter tweaking.
Outline analysis is particularly problematic in practice, because much of the software is rather old and hardly runs on modern machines. Furthermore, the software is heavily fragmented, often making it necessary to switch repeatedly between four or five programs to perform a complete analysis. The program SHAPE is one of the few noteworthy programs that run perfectly well on modern computers and are able to perform most tasks (from data acquisition to analysis) in a single framework. The script provided here aims for the same goal in the more adaptable R environment, containing all functions necessary to extract outlines and perform a Zahn–Roskies Fourier analysis and an elliptic Fourier analysis. Most of the functions are based on Claude (2008), but have been reworked/reassembled to obtain an even higher degree of automation.
Claude, J. (2008) Morphometrics with R. Gentleman, R., Hornik, K., and Parmigiani, G. (eds) Use R!, vol. 2, 316 pp. (Springer).
Geometric Morphometrics (R)
Geometric morphometrics are one of the three major branches of morphometrics (traditional morphometrics and outline analyses being the other two). While traditional morphometrics use only a few selected measurements to describe the morphology of an object, outline analyses and geometric morphometrics try to capture the whole picture of morphology. Geometric morphometrics relies on the definition of relatively few, well chosen landmarks within the object to perform that task. Outline analyses, on the other hand, use the digitized outline of the object to describe its shape.
Geometric morphometrics is the newest branch of morphometric analysis. It tries to capture the whole shape (i.e. size-independent form) of objects using a coherent set of landmarks, instead of a combination of more or less arbitrarily chosen linear measurements. It can thus describe general shape change independently of size (and is therefore not affected by scaling problems). On the other hand, it concentrates on only a few well-chosen features of the object, instead of extracting the whole outline regardless of the local explanatory value of its parts. It therefore occupies a middle ground between outline analyses and traditional morphometrics. In contrast to both other methods, however, landmarks are not independent of each other after fitting, so that traditional statistical approaches cannot be applied to landmark data without modifications.
The script provided here aims to offer, in one coherent R script, the majority of steps one might undertake when extracting and analysing landmark data. It allows the extraction of landmarks, the reading and writing of .nts and .tps files, and most statistical analyses that could be performed on such data. Apart from R, MorphoJ is a very versatile alternative for landmark analyses. Most of the functions are based on Claude (2008) and Zelditch et al. (2012), but have been reworked/reassembled to obtain an even higher degree of automation.
Claude, J. (2008) Morphometrics with R. Gentleman, R., Hornik, K., and Parmigiani, G. (eds) Use R!, vol. 2, 316 pp. (Springer).
Zelditch, M. L., Swiderski, D. L., and Sheets, H. D. (2012) Geometric Morphometrics for Biologists: A Primer. 2nd ed., 478 pp. (London, Waltham, San Diego: Academic Press).
PCA for Globigerinella species (R)
This program can be used to project morphometric data extracted from specimens of the planktonic foraminifer genus Globigerinella into a predefined morphospace separating the three species G. siphonifera, G. calida, and G. radians. It can thus be used to objectively distinguish between those species. For further details see Weiner et al. (2015).
Cite as:
Weiner, A. K. M., Weinkauf, M. F. G., Kurasawa, A., Darling, K. F., and Kučera, M. (2015) Genetic and morphometric evidence for parallel evolution of the Globigerinella calida morphotype. Marine Micropaleontology 114: 19–35. doi:10.1016/j.marmicro.2014.10.003
Matlab codes
This section contains ASCII code files for use in Matlab. In principle, you can open them in any text editor and copy the code into the Matlab console. Furthermore, you can work directly from the console.
EOF of palaeoclimate data (Matlab)
This program performs an empirical orthogonal function (EOF) analysis of unevenly spaced palaeoclimatic time series. It also includes an optional randomisation procedure to quantify the influence of age-model and proxy uncertainties on the interpretation of the data. For further details see Milker et al. (2013).
Cite as:
Milker, Y., Rachmayani, R., Weinkauf, M. F. G., Prange, M., Raitzsch, M., Schulz, M., and Kučera, M. (2013) Global and regional sea surface temperature trends during Marine Isotope Stage 11. Climate of the Past 9: 2231–52. doi:10.5194/cp-9-2231-2013
Excel spreadsheets
This section contains MS Excel spreadsheets in xls or xlsx format. Whether or not they work in OpenOffice, LibreOffice, or the like is unknown to me. Due to Excel's limitations they are by far not as advanced as comparable R code, and some of the Excel formulas used are rather complicated and confusing.
Conversion of geographical coordinates (Excel)
An early and very basic version of the R code above. It was prepared to convert large numbers of degrees–minutes–seconds coordinates into decimal degrees for purposes of program compatibility. Nothing special, nothing complicated, nothing one could not easily do by oneself; but now it is there and someone may have use for it. Admittedly, it does not even make use of Excel's full potential and requires a high degree of manual data manipulation, which would not actually be necessary. It might, however, still be useful for people who are afraid of more complicated implementations. If nothing else, it can at least serve as an example that everybody starts at a VERY basic level.
Calculating the mean of stereographical data (Excel)
This spreadsheet was created some years ago for the purpose of calculating the mean of several structural geological data collected in the field, together with some statistical tests for the sufficiency of sample size of those data. Though not completely free of minor problems I provide it here because it does a reasonable job in more than 90% of cases (and also because I am not likely to ever improve it further, anyway, so possibly someone else will do the job).
Conductivity to salinity conversion (Excel)
This Excel sheet converts conductivity values of sea water (at known temperature and pressure) into salinity values on the practical salinity scale.
False discovery rate correction (Excel)
This Excel spreadsheet provides the means to calculate corrected levels of significance for cases of multiple testing. Both the Benjamini and Hochberg (1995) correction and the simple Bonferroni correction are calculated. It is provided as an alternative to the R code above and performs the same task. The only limitation in comparison to the R version is that no more than 50 _p_-values can be entered without manually changing some formulas, but this will seldom if ever matter anyway.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B 57 (1): 289–300.
Calculating confidence intervals for relative abundances (Excel)
The calculation of the confidence interval of relative abundances of species, morphotypes, etc. in a sample is not as easy as in many other cases. Since relative abundances cannot be negative and individual relative abundances of several groups in a sample are not independent of each other, procedures for multinomial mathematics must be applied. Luckily, Heslop et al. (2011) developed a modern adaptation of those methods. The spreadsheet provided reads relative abundances and calculates the multinomial confidence intervals for all species. It is provided as an alternative to the R-code above.
Heslop, D., De Schepper, S., and Proske, U. (2011) Diagnosing the uncertainty of taxa relative abundances derived from count data. Marine Micropaleontology 79: 114–20.
Calculating pseudo-landmarks (Excel)
When landmarks that describe morphological structures as outlines are extracted by hand, the individual landmarks are naturally not equally spaced along the outline of the structure. Since equal spacing is often a necessity for morphometric analysis, this Excel sheet converts such a set of landmarks into pseudo-landmarks that are equally spaced along the reconstructed outline of the structure.
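For readers who prefer R, the same idea can be sketched as below. The sketch assumes an open outline and straight-line interpolation between the digitised landmarks, which may differ from how the spreadsheet reconstructs the outline; the function name and coordinates are made up.

```r
# Resample an open outline to n points equally spaced along its length.
# Assumption: straight-line interpolation between the digitised landmarks.
equal_spacing <- function(x, y, n) {
  d <- c(0, cumsum(sqrt(diff(x)^2 + diff(y)^2)))   # cumulative path length
  target <- seq(0, max(d), length.out = n)         # equally spaced positions
  data.frame(x = approx(d, x, xout = target, rule = 2)$y,
             y = approx(d, y, xout = target, rule = 2)$y)
}

# Unevenly spaced landmarks along an L-shaped structure (made-up values).
lm_x <- c(0, 1, 4, 4, 4)
lm_y <- c(0, 0, 0, 1, 4)
pseudo <- equal_spacing(lm_x, lm_y, 5)
pseudo
```

The points are equally spaced along the path of the outline; at sharp corners the straight-line distance between neighbouring pseudo-landmarks can therefore be slightly shorter.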
Disclaimer
Since all of this software was written for my own purposes, and since I am not a professional programmer, it does not cover all possibilities. It basically follows my own habits and provides some options, but there may be occasions when it requires file formats that you would not normally use, or when it does not give you a choice that you would have deemed useful. That's how it is, live with it!
In any case, you are allowed to alter and correct all of the code as you please. Just give a correct reference when using the code. And if you find a true bug or error, or have significantly improved some of the code (anything beyond small changes according to your own taste like ‘Hey, I added a line so that all plotted curves change colour every second and the program plays the Star Wars soundtrack while running’), I would be more than happy if you could send me the reworked version.