Normalization of High Dimensional Genomics Data Where the Distribution of the Altered Variables Is Skewed
Related papers
Modeling Skewness in Human Transcriptomes
2012
Gene expression data are influenced by multiple biological and technological factors, leading to a wide range of dispersion scenarios, yet skewed patterns are not commonly addressed in microarray analyses. In this study, the distribution pattern of several human transcriptomes was examined using free-access microarray gene expression data.
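As a minimal illustration of how skewness can be screened across a transcriptome, the sketch below computes per-gene sample skewness on a hypothetical log-intensity matrix; the simulated matrix `expr` and the cut-off of |skewness| > 1 are assumptions for illustration, not values from the study.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical expression matrix: rows = genes, columns = samples
rng = np.random.default_rng(0)
expr = rng.lognormal(mean=5.0, sigma=1.0, size=(1000, 20))
log_expr = np.log2(expr)

# Per-gene sample skewness across samples
gene_skew = skew(log_expr, axis=1, bias=False)

# Flag genes whose distribution is markedly asymmetric (threshold is illustrative)
skewed_genes = np.where(np.abs(gene_skew) > 1.0)[0]
print(f"{skewed_genes.size} of {log_expr.shape[0]} genes have |skewness| > 1")
```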
Analysis strategy for comparison of skewed outcomes from biological data: A recent development
When faced with the problem of comparing positively skewed outcome values, data transformations such as the log and square root are often used. However, this approach suffers from difficulties in interpretability and accuracy: while the mean can be back-transformed, the standard deviation cannot. This paper presents an analysis for comparing positively skewed outcome data using a generalized pivotal approach together with a log transformation for lognormally distributed data. A simulation experiment was conducted to examine the characteristics of the generalized pivotal approach for small sample sizes and for large standard deviations. For comparing positively skewed biological data between two groups, the generalized p-value and confidence interval approach for the lognormal distribution is considered efficient, as it provides direct statistical inference such as point estimates, 95% confidence intervals and p-values.
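The back-transformation issue raised above can be seen in a few lines. The sketch below is a hedged illustration using a plain log transform and a t-test on the log scale (not the authors' generalized pivotal procedure), with simulated lognormal groups as stand-in data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two positively skewed (lognormal) outcome groups -- simulated stand-ins
group_a = rng.lognormal(mean=1.0, sigma=0.8, size=30)
group_b = rng.lognormal(mean=1.3, sigma=0.8, size=30)

# Compare on the log scale, where the data are approximately normal
log_a, log_b = np.log(group_a), np.log(group_b)
t_stat, p_value = stats.ttest_ind(log_a, log_b)

# The mean back-transforms to a geometric-mean (median-type) estimate ...
geo_mean_a = np.exp(log_a.mean())
# ... but exponentiating the log-scale SD does not recover the SD of the raw data
naive_sd_a = np.exp(log_a.std(ddof=1))   # not a valid back-transformed SD
raw_sd_a = group_a.std(ddof=1)

print(f"p-value (log scale): {p_value:.4f}")
print(f"geometric mean A: {geo_mean_a:.2f}, raw SD A: {raw_sd_a:.2f}, "
      f"exp(log-SD) A: {naive_sd_a:.2f}")
```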
Evaluation of normalization methods for microarray data
BMC bioinformatics, 2003
Microarray technology allows the monitoring of expression levels for thousands of genes simultaneously. This technique helps us to understand gene regulation as well as gene-by-gene interactions more systematically. In microarray experiments, however, many undesirable systematic variations are observed; some variation is present even in replicated experiments. Normalization is the process of removing sources of variation that affect the measured gene expression levels. Although a number of normalization methods have been proposed, it has been difficult to decide which methods perform best. Normalization plays an important role in the early stage of microarray data analysis, and the subsequent analysis results depend heavily on it. In this paper, we use the variability among replicated slides to compare the performance of normalization methods. We also compare normalization methods with regard to bias and mean square error using simulated data.
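As a hedged sketch of the evaluation idea (not the paper's exact procedure), one can normalize replicated arrays in different ways and compare the residual gene-wise variability across replicates; the median-centering and quantile normalization below are generic stand-ins, and the simulation parameters are assumptions.

```python
import numpy as np

def median_center(arrays):
    """Subtract each array's (column's) median: a global log-scale normalization."""
    return arrays - np.median(arrays, axis=0, keepdims=True)

def quantile_normalize(arrays):
    """Force every array (column) to share the same empirical distribution."""
    ranks = np.argsort(np.argsort(arrays, axis=0), axis=0)
    mean_quantiles = np.sort(arrays, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

def replicate_variability(arrays):
    """Average gene-wise variance across replicated arrays (lower is better)."""
    return arrays.var(axis=1, ddof=1).mean()

rng = np.random.default_rng(2)
genes, replicates = 2000, 4
true_signal = rng.normal(8, 2, size=(genes, 1))
noise = rng.normal(0, 0.3, size=(genes, replicates))
array_bias = rng.normal(0, 1, size=(1, replicates))   # slide-to-slide shift
data = true_signal + noise + array_bias               # replicated log intensities

for name, fn in [("none", lambda x: x),
                 ("median-centering", median_center),
                 ("quantile", quantile_normalize)]:
    print(f"{name:>16}: mean replicate variance = "
          f"{replicate_variability(fn(data)):.4f}")
```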
The covariance shift (C-SHIFT) algorithm for normalizing biological data
2020
Omics technologies are powerful tools for analyzing patterns in gene expression data for thousands of genes. Due to a number of systematic variations in experiments, the raw gene expression data are often obscured by undesirable technical noise. Various normalization techniques have been designed in an attempt to remove these non-biological errors prior to any statistical analysis. One of the reasons for normalizing data is the need to recover the covariance matrix used in gene network analysis. In this paper, we introduce a novel normalization technique, called the covariance shift (C-SHIFT) method. This normalization algorithm uses optimization techniques together with the blessing-of-dimensionality philosophy and an energy minimization hypothesis to recover the covariance matrix under additive noise (known in biology as bias). Thus, it is perfectly suited for the analysis of logarithmic gene expression data. Numerical experiments on synthetic data demonstrate the method's advantage over classical normalization techniques; the comparison is made with rank, quantile, cyclic LOESS (locally estimated scatterplot smoothing), and MAD (median absolute deviation) normalization methods.
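C-SHIFT itself is not reproduced here, but the MAD (median absolute deviation) baseline it is compared against is simple to sketch. The implementation below is a generic illustration under the usual definition of MAD scaling, not code from the paper, and the simulated data are an assumption.

```python
import numpy as np

def mad_normalize(log_expr):
    """Center each sample (column) at its median and scale by its MAD,
    a robust analogue of z-scoring used as a classical normalization baseline."""
    med = np.median(log_expr, axis=0, keepdims=True)
    mad = np.median(np.abs(log_expr - med), axis=0, keepdims=True)
    return (log_expr - med) / (1.4826 * mad)  # 1.4826 makes MAD consistent with the SD

rng = np.random.default_rng(3)
raw = rng.normal(8, 1, size=(500, 6)) + rng.normal(0, 2, size=(1, 6))  # additive bias
normalized = mad_normalize(raw)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```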
A model-based analysis of microarray experimental error and normalisation
Nucleic Acids Research, 2003
A statistical model is proposed for the analysis of errors in microarray experiments and is employed in the analysis and development of a combined normalisation regime. Through analysis of the model and two-dye microarray data sets, this study found the following. The systematic error introduced by microarray experiments mainly involves spot intensity-dependent, feature-specific and spot position-dependent contributions. It is difficult to remove all these errors effectively without a suitable combined normalisation operation. Adaptive normalisation using a suitable regression technique is more effective in removing spot intensity-related dye bias than self-normalisation, while regional normalisation (block normalisation) is an effective way to correct spot position-dependent errors. However, dye-flip replicates are necessary to remove feature-specific errors, and also allow the analyst to identify the experimentally introduced dye bias contained in non-self-self data sets. In this case, the bias present in the data sets may include both experimentally introduced dye bias and the biological difference between two samples. Self-normalisation is capable of removing dye bias without identifying the nature of that bias. The performance of adaptive normalisation, on the other hand, depends on its ability to correctly identify the dye bias. If adaptive normalisation is combined with an effective dye bias identification method then there is no systematic difference between the outcomes of the two methods.
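Adaptive, intensity-dependent normalisation of two-dye data is commonly implemented as LOESS smoothing on an MA-plot. The sketch below uses statsmodels' lowess as a generic illustration of that idea; the parameter values (such as frac=0.4) and the simulated dye bias are assumptions, not the paper's settings.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def ma_loess_normalize(red, green, frac=0.4):
    """Remove intensity-dependent dye bias from a two-dye array.
    M = log-ratio, A = mean log-intensity; the LOESS fit of M on A is subtracted."""
    m = np.log2(red) - np.log2(green)
    a = 0.5 * (np.log2(red) + np.log2(green))
    # With return_sorted=False, lowess returns fitted values in the input order
    trend = lowess(m, a, frac=frac, return_sorted=False)
    return m - trend, a

rng = np.random.default_rng(4)
green = rng.lognormal(8, 1, size=5000)
red = green * 2 ** (0.5 - 0.05 * np.log2(green))  # simulated intensity-dependent dye bias
m_corrected, a = ma_loess_normalize(red, green)
print(f"mean M before: {np.mean(np.log2(red) - np.log2(green)):.3f}, "
      f"after: {m_corrected.mean():.3f}")
```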
Correction of scaling mismatches in oligonucleotide microarray data
BMC BIOINFORMATICS, 2006
Background: Gene expression microarray data is notoriously subject to high signal variability. Moreover, unavoidable variation in the concentration of transcripts applied to microarrays may result in poor scaling of the summarized data which can hamper analytical interpretations. This is especially relevant in a systems biology context, where systematic biases in the signals of particular genes can have severe effects on subsequent analyses. Conventionally it would be necessary to replace the mismatched arrays, but individual time points cannot be rerun and inserted because of experimental variability. It would therefore be necessary to repeat the whole time series experiment, which is both impractical and expensive.
Application of skew-normal distribution for detecting differential expression to microRNA data
Traditional statistical modeling of continuous outcome variables relies heavily on the assumption of a normal distribution. However, in some applications, such as the analysis of microRNA (miRNA) data, normality may not hold. Skewed distributions play an important role in such studies and can lead to more robust results in the presence of extreme outliers. We apply a skew-normal (SN) distribution, which is indexed by three parameters (location, scale and shape), in the context of miRNA studies. We developed a test statistic for comparing the means of two conditions, replacing the normal assumption with the SN distribution. We compared the performance of the statistic with other Wald-type statistics through simulations, and two real miRNA datasets are analyzed to illustrate the methods. Our simulation findings showed that the use of a SN distribution can result in improved identification of differentially expressed miRNAs, especially with markedly skewed data and when the two groups have different variances. The statistic with the SN assumption also performs comparably with other Wald-type statistics irrespective of the sample size or distribution. Moreover, the real dataset analyses suggest that the statistic with the SN assumption can be used effectively for identification of important miRNAs. Overall, the statistic with the SN distribution is useful when data are asymmetric and when the two groups have different variances.
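A hedged sketch of a skew-normal two-group comparison is shown below; scipy's skewnorm is used for maximum-likelihood fitting, and the likelihood-ratio comparison is an illustrative stand-in rather than the Wald-type statistic derived in the paper. The simulated groups and their parameters are assumptions.

```python
import numpy as np
from scipy.stats import skewnorm, chi2

def sn_loglik(data, params):
    a, loc, scale = params
    return skewnorm.logpdf(data, a, loc=loc, scale=scale).sum()

rng = np.random.default_rng(5)
# Simulated miRNA expression for two conditions, markedly skewed
group1 = skewnorm.rvs(4, loc=2.0, scale=1.0, size=25, random_state=rng)
group2 = skewnorm.rvs(4, loc=2.8, scale=1.5, size=25, random_state=rng)

# Alternative model: separate skew-normal fits for each group
fit1 = skewnorm.fit(group1)
fit2 = skewnorm.fit(group2)
loglik_alt = sn_loglik(group1, fit1) + sn_loglik(group2, fit2)

# Null model: one common skew-normal fit for the pooled data
pooled = np.concatenate([group1, group2])
fit0 = skewnorm.fit(pooled)
loglik_null = sn_loglik(pooled, fit0)

# Likelihood-ratio test with a 3-parameter difference (illustrative, asymptotic)
lr_stat = 2 * (loglik_alt - loglik_null)
p_value = chi2.sf(lr_stat, df=3)
print(f"LR statistic: {lr_stat:.2f}, p-value: {p_value:.4f}")
```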
Robust Local Normalization Of Gene Expression Microarray Data
2000
A key step in the analysis of expression microarray data is normalization: the process of adjusting the signal from two different reporter channels on a single microarray, or a single reporter channel on multiple microarrays, to a common scale. Current methods involve normalization either to some representative statistic of all of the data or to a statistic of some subset of the data, such as a set of "housekeeping" genes. The first method fails to correct any non-linearity between the data channels; the second is sometimes undone by differential expression of genes that were thought to be unregulated. Agilent has invented a normalization method that uses robust statistical methods to establish the "central tendency" of a set of differential expression data. Normalization utilizes the data clustered near this central tendency; these points comprise an experimentally determined set of housekeeping genes. The resulting algorithm is rapid, robust and capable of correctly normalizing microarray data from different platforms, such as cDNA and in situ synthesized oligonucleotide microarrays. In addition, the method provides an easily interpreted measurement of the degree to which normalization has altered the original data.
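The general idea of normalizing to a data-driven "central tendency" can be sketched in a few lines. The iterative trimming below is only a generic illustration of that idea, not Agilent's algorithm; the trimming width, iteration count, and simulated data are assumptions.

```python
import numpy as np

def central_tendency_normalize(log_ratios, width=1.0, n_iter=5):
    """Normalize log-ratios to the center of the densest cluster of points,
    treating those central points as experimentally determined 'housekeeping' genes."""
    center = np.median(log_ratios)
    for _ in range(n_iter):
        # Keep points near the current center and re-estimate it from them
        core = log_ratios[np.abs(log_ratios - center) < width]
        center = np.median(core)
    return log_ratios - center, center

rng = np.random.default_rng(6)
# Mostly unchanged genes plus a block of up-regulated genes that would bias a global mean
log_ratios = np.concatenate([rng.normal(0.3, 0.2, size=1800),
                             rng.normal(2.5, 0.3, size=200)])
normalized, center = central_tendency_normalize(log_ratios)
print(f"estimated central tendency: {center:.3f}")
```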
PloS one, 2014
Normalization procedures are widely used in high-throughput genomic data analyses to remove various technological noise and variation. They are known to have a profound impact on the subsequent gene differential expression analysis. Although there has been some research in evaluating different normalization procedures, few attempts have been made to systematically evaluate the gene detection performance of normalization procedures from the bias-variance trade-off point of view, especially with strong gene differentiation effects and large sample sizes. In this paper, we conduct a thorough study to evaluate the effects of normalization procedures combined with several commonly used statistical tests and multiple testing procedures (MTPs) under different configurations of effect size and sample size. We conduct a theoretical evaluation based on a random effect model, as well as simulation and biological data analyses to verify the results. Based on our findings, we provide some practical guidance for selecting a suitable normalization procedure.
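As a hedged sketch of the kind of simulation such evaluations rely on (not this paper's actual design), the code below simulates differential expression, applies median-centering normalization, tests each gene with a t-test, and adjusts with the Benjamini-Hochberg procedure; the effect size, sample size, and FDR threshold are assumptions.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
genes, n_per_group, n_de, effect = 2000, 20, 200, 1.5

# Log expression with array-specific additive bias and true DE in the first n_de genes
base = rng.normal(8, 1, size=(genes, 2 * n_per_group))
bias = rng.normal(0, 0.5, size=(1, 2 * n_per_group))
data = base + bias
data[:n_de, n_per_group:] += effect

# Simple normalization: median-center each array (column)
data -= np.median(data, axis=0, keepdims=True)

# Gene-wise two-sample t-tests followed by Benjamini-Hochberg adjustment
_, pvals = stats.ttest_ind(data[:, :n_per_group], data[:, n_per_group:], axis=1)
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

tp = reject[:n_de].sum()
fp = reject[n_de:].sum()
print(f"true positives: {tp}/{n_de}, false positives: {fp}")
```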
BMC Bioinformatics, 2010
Background: Oligonucleotide arrays have become one of the most widely used high-throughput tools in biology. Due to their sensitivity to experimental conditions, normalization is a crucial step when comparing measurements from these arrays. Normalization is, however, far from a solved problem: we frequently encounter datasets with significant technical effects that currently available methods are not able to correct. Results: We show that by a careful decomposition of probe-specific amplification, hybridization and array location effects, a normalization can be performed that allows for a much improved analysis of these data. Identification of the technical sources of variation between arrays has allowed us to build statistical models that are used to estimate how the signal of individual probes is affected, based on their properties. This enables a model-based normalization that is probe-specific, in contrast with the signal intensity distribution normalization performed by many current methods. In addition, we propose a novel way of handling background correction, enabling the use of background information to weight probes during summarization. Testing of the proposed method shows a much improved detection of differentially expressed genes over earlier proposed methods, even when tested on (experimentally tightly controlled and replicated) spike-in datasets.