Microarray data analysis for differential expression: a tutorial
Related papers
DNA microarray is a technology that simultaneously provides quantitative measurements of the expression of thousands of genes. DNA microarrays have been used to assess gene expression between groups of cells of different organs or different populations. In order to understand the role and function of the genes, one needs complete information about their mRNA transcripts and proteins. Unfortunately, exploring protein function is very difficult because of the proteins' complicated three-dimensional structure. To overcome this difficulty, one may concentrate on the mRNA molecules produced by the genes' expression. In this paper, we describe some of the methods for preprocessing gene expression data and for pairwise comparison in genomic experiments. Previous studies assessing the efficiency of different methods for pairwise comparisons have found little agreement in the lists of significant genes. Finally, we describe procedures to control false discovery rates, a sample size approach for these experiments, and available software for microarray data analysis. This paper is written for professionals who are new to microarray data analysis for differential expression and want an overview of the specific steps or the different approaches for this sort of analysis.
Analysis of Microarray Gene Expression Data
Current Bioinformatics, 2006
This article reviews the methods used in processing and analyzing gene expression data generated using DNA microarrays. This type of experiment makes it possible to determine relative levels of mRNA abundance in a set of tissues or cell populations for thousands of genes simultaneously. Naturally, such an experiment requires computational and statistical analysis techniques. At the outset of the processing pipeline, the computational procedures are largely determined by the technology and experimental setup that are used. Subsequently, as more reliable intensity values for genes emerge, pattern discovery methods come into play. The most striking peculiarity of this kind of data is that one usually obtains measurements for thousands of genes under a much smaller number of conditions. This is at the root of several of the statistical questions discussed here.
DNA microarray is a powerful technology that can simultaneously determine the levels of thousands of transcripts (generated, for example, from genes/miRNAs) across different experimental conditions or tissue samples. The aim of differential expression analysis is to identify the transcripts whose expression changes significantly across different types of samples or experimental conditions. A number of statistical testing methods are available for this purpose. In this article, we provide a comprehensive survey of parametric and nonparametric testing methodologies for identifying differential expression from microarray datasets. The performance of the different testing methods has been compared on real-life miRNA and mRNA expression data sets. To validate the resulting differentially expressed miRNAs, the outcomes of each test are checked against the information available for miRNAs in the standard miRNA database PhenomiR 2.0. Subsequently, we prepared simulated datasets of different sample sizes (from 10 to 100 per group/population) and calculated the power of each test individually. The comparative simulation study may help in forming robust and comprehensive judgements about the performance of each test on the basis of the assumed data distribution. Finally, a list of advantages and limitations of the different statistical tests is provided, along with indications of some areas where further studies are required.
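The power calculation on simulated data described above can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation design: the two-sample z-test with known unit variance, the effect size, the number of simulations, and the function name `estimate_power` are all assumptions chosen to keep the example dependency-free.

```python
import math
import random
from statistics import NormalDist, mean


def estimate_power(n_per_group, effect, n_sim=500, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which a two-sided z-test rejects.

    Assumption: both groups are normal with known variance 1, so the
    difference in means has standard error sqrt(2 / n_per_group).
    """
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        z = (mean(group_b) - mean(group_a)) / math.sqrt(2.0 / n_per_group)
        if abs(z) > critical:
            rejections += 1
    return rejections / n_sim


# Hypothetical effect size of 0.5 standard deviations at the two extreme
# sample sizes mentioned in the abstract.
power_small = estimate_power(10, 0.5)
power_large = estimate_power(100, 0.5)
```

Swapping the z-test for any other test only requires replacing the rejection rule inside the loop, which is what makes this kind of simulation convenient for comparing tests under a chosen data distribution.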
Microarray analysis of gene expression: considerations in data mining and statistical treatment
Physiological Genomics, 2006
DNA microarray represents a powerful tool in biomedical discoveries. Harnessing the potential of this technology depends on the development and appropriate use of data mining and statistical tools. Significant current advances have made microarray data mining more versatile. Researchers are no longer limited to default choices that generate suboptimal results. Conflicting results in repeated experiments can be resolved through attention to the statistical details. In the current dynamic environment, there are many choices and potential pitfalls for researchers who intend to incorporate microarrays as a research tool. This review is intended to provide a simple framework to understand the choices and identify the pitfalls. Specifically, this review article discusses the choice of microarray platform, preprocessing raw data, differential expression and validation, clustering, annotation and functional characterization of genes, and pathway construction in light of emergent concepts an...
A Comparative Study of Methods of Analyzing Gene Expression Data
2004
In analyzing microarray data it is often necessary to detect genes that are differentially expressed between two or more samples. This project aims to apply two methods to address statistical issues that arise when identifying differentially expressed genes. The first is a gene-by-gene analysis that attempts to overcome the small sample size issue that is often present in microarray data sets. By averaging the variances of genes with similar expression levels, we are able to stabilize the test statistics used in determining significant genes and obtain more powerful tests. When looking at thousands of tests, one for each gene, problems arise involving the type I error rates. This leads to multiple testing issues that must be addressed. We applied many methods of correcting or adjusting the p-values for multiple testing. Based on this study, the false discovery rate method appears to provide a reasonable balance between the type I error rate and allowing sufficient power to detect diff...
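One way to picture the variance averaging described above is a sliding window over genes ordered by mean expression. The sketch below is only an illustration under assumptions: the abstract does not state how "similar expression levels" is defined, so the rank-based window, the `half_window` parameter, and the function name `smoothed_variances` are hypothetical.

```python
from statistics import mean


def smoothed_variances(means, variances, half_window=1):
    """Average each gene's variance with those of its neighbours in mean expression.

    Assumption: "similar expression level" means adjacent in rank after
    sorting genes by their mean expression.
    """
    order = sorted(range(len(means)), key=lambda i: means[i])
    smoothed = [0.0] * len(means)
    for rank, gene in enumerate(order):
        lo = max(0, rank - half_window)
        hi = min(len(order), rank + half_window + 1)
        neighbours = [variances[order[r]] for r in range(lo, hi)]
        smoothed[gene] = mean(neighbours)
    return smoothed


# Hypothetical per-gene mean log-expression and sample variances.
gene_means = [5.0, 9.0, 7.0, 6.0]
gene_variances = [0.10, 0.40, 0.01, 0.30]
stabilized = smoothed_variances(gene_means, gene_variances)
```

In this toy input the third gene has a very small sample variance (0.01), which would inflate its t statistic; after smoothing, its variance is pulled toward that of its neighbours, which is the stabilizing effect the abstract refers to.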
2008
Microarray experiments are becoming a common laboratory tool for monitoring the expression levels of thousands of genes in cells simultaneously. The new data promise to enhance fundamental understanding of life on a molecular level and may prove useful in medical diagnosis, treatment and drug design. The greatest challenge to array technology lies in the analysis of gene expression data to identify which genes are differentially expressed across tissue samples or experimental conditions. A simple fold change was used to test the differential expression of genes. The ordinary t-test and t-test approaches with minor variations are usually used to find differentially expressed genes under two conditions. Analysis of variance (ANOVA) and mixed model ANOVA proved to be powerful under multiple conditions or several sources of variation. Since thousands of hypotheses are tested simultaneously, there is an increased chance of false positives, and it becomes necessary to adjust for multiple testing when assessing the statistical significance of findings. Bayesian variable selection and empirical Bayesian approaches offer yet another avenue.
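The two simplest statistics mentioned above, fold change and a two-sample t statistic, can be sketched for a single gene as follows. The example assumes the intensities are already on the log2 scale and uses Welch's unequal-variance form of the t-test as one of the "minor variations"; the data values and function names are hypothetical.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated):
    """Difference of group means; assumes values are already log2 intensities."""
    return mean(treated) - mean(control)


def welch_t(control, treated):
    """Welch's t statistic and its approximate degrees of freedom."""
    n_c, n_t = len(control), len(treated)
    v_c, v_t = variance(control), variance(treated)
    se2 = v_c / n_c + v_t / n_t  # squared standard error of the mean difference
    t = (mean(treated) - mean(control)) / math.sqrt(se2)
    df = se2 ** 2 / ((v_c / n_c) ** 2 / (n_c - 1) + (v_t / n_t) ** 2 / (n_t - 1))
    return t, df


# Hypothetical log2 intensities for one gene, four replicates per condition.
control = [7.0, 7.2, 6.8, 7.0]
treated = [8.0, 8.2, 7.8, 8.0]
lfc = log2_fold_change(control, treated)
t_stat, df = welch_t(control, treated)
```

Fold change alone ignores replicate variability, whereas the t statistic scales the same difference by its standard error, which is why the two can rank genes differently.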
2007
This paper proposes a novel algorithm for the identification of differentially expressed genes in two-group microarray experiments. The algorithm, called PUL, is compared to other popular algorithms using published implementations. The comparison is based on established measures from information retrieval (recall and precision). Surprisingly, a clear ordering in the performance of the algorithms was observed. PUL outperformed the other algorithms by a factor of two. PUL was applied successfully in different practical applications. For these experiments, the importance of the genes proposed by PUL was independently verified. Contact: ultsch@informatik.uni-marburg.de
A unified framework for finding differentially expressed genes from microarray experiments
BMC Bioinformatics, 2007
Background: This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module uses two gene selection algorithms, namely, a) two-way clustering and b) combined adaptive ranking, to rank the genes. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion. The DEGs are selected using FDR analysis. The third module performs a three-fold validation of the obtained DEGs. The robustness of the proposed unified framework in gene selection is first illustrated using false discovery rate analysis. In addition, the clustering-based validation of the DEGs is performed by employing an adaptive subspace-based clustering algorithm on the training and the test datasets. Finally, a projection-based visualization is performed to validate the DEGs obtained using the unified framework. Results: The performance of the unified framework is compared with well-known ranking algorithms such as t-statistics, Significance Analysis of Microarrays (SAM), Adaptive Ranking, Combined Adaptive Ranking and Two-way Clustering. The performance curves obtained using 50 simulated microarray datasets, each following two different distributions, indicate the superiority of the unified framework over the other reported algorithms. Further analyses on 3 real cancer datasets and 3 Parkinson's datasets show a similar improvement in performance. First, a three-fold validation process is provided for the two-sample cancer datasets. In addition, the analysis on 3 sets of Parkinson's data is performed to demonstrate the scalability of the proposed method to multi-sample microarray datasets.
Conclusion: This paper presents a unified framework for the robust selection of genes from two-sample as well as multi-sample microarray experiments. The two different ranking methods used in module 1 bring diversity to the selection of genes. The conversion of ranks to p-values, the fusion of p-values and FDR analysis aid in the identification of significant genes which cannot be judged based on gene ranking alone. The three-fold validation, namely, robustness in selection of genes using FDR analysis, clustering, and visualization, demonstrates the relevance of the DEGs. Empirical analyses on 50 artificial datasets and 6 real microarray datasets illustrate the efficacy of the proposed approach. The analyses on 3 cancer datasets demonstrate the utility of the proposed approach on microarray datasets with two classes of samples. The scalability of the proposed unified approach to multi-sample (more than two sample classes) microarray datasets is addressed using three sets of Parkinson's data. Empirical analyses show that the unified framework outperformed other gene selection methods in selecting differentially expressed genes from microarray data.
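The fusion step in the second module relies on Fisher's omnibus criterion, which can be sketched in a few lines. This is the textbook form of Fisher's method, not the paper's implementation: it assumes the p-values being combined are independent and lie in (0, 1], and the function name `fisher_combine` is hypothetical. For k p-values the statistic is -2 times the sum of their logarithms, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis; for even degrees of freedom the tail probability has the closed form used below.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Returns the chi-square statistic and the combined p-value.
    Assumption: every p-value is in (0, 1]; a value of 0 raises ValueError.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi-square survival function with 2k degrees of freedom (closed form).
    tail = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return statistic, tail


# Two hypothetical p-values for the same gene, one per ranking method.
statistic, combined_p = fisher_combine([0.05, 0.05])
```

Because both rankings are derived from the same expression data, the independence assumption may hold only approximately in practice, so the combined p-value is best read as a ranking score that is then passed to the FDR step.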
Testing for differentially expressed genes with microarray data
Nucleic Acids Research, 2003
This paper compares the type I error and power of the one- and two-sample t-tests, and the one- and two-sample permutation tests, for detecting differences in gene expression between two microarray samples with replicates using Monte Carlo simulations. When data are generated from a normal distribution, type I errors and powers of the one-sample parametric t-test and one-sample permutation test are very close, as are the two-sample t-test and two-sample permutation test, provided that the number of replicates is adequate. When data are generated from a t-distribution, the permutation tests outperform the corresponding parametric tests if the number of replicates is at least five. For data from a two-color dye swap experiment, the one-sample test appears to perform better than the two-sample test since expression measurements for control and treatment samples from the same spot are correlated. For data from independent samples, such as the one-channel array or two-channel array experiment using reference design, the two-sample t-tests appear more powerful than the one-sample t-tests.
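A two-sample permutation test of the kind compared above can be sketched by enumerating every relabeling of the pooled replicates. The choice of the absolute difference in means as the test statistic, the exact enumeration (practical only for small replicate counts), and the function name `permutation_test` are assumptions made for illustration; the paper's own test statistic may differ.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    count = 0
    # Every way of choosing which pooled values are labeled as group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical expression values with three replicates per condition.
p_value = permutation_test([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
```

With three replicates per group there are only 20 relabelings, so the smallest attainable two-sided p-value is 2/20 = 0.1 even for perfectly separated groups. This is one concrete reason the number of replicates matters so much for permutation tests.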
2020
Volume 7(8). Abstract: Identification of genes differentially expressed across multiple conditions has become an important statistical problem in analyzing large-scale microarray data. Many statistical methods have been developed to address this challenging problem. Therefore, an extensive comparison among these statistical methods is extremely important for experimental scientists to choose a valid method for their data analysis. In this study, we conducted simulation studies to compare six statistical methods: the Bonferroni (B-) procedure, the Benjamini and Hochberg (BH-) procedure, the Local false discovery rate (Localfdr) method, the Optimal Discovery Procedure (ODP), the Ranking Analysis of F-statistics (RAF), and the Significance Analysis of Microarrays (SAM) in identifying differentially expressed genes. We demonstrated that the strength of treatment effect, the sample size, proportion of differentially expressed genes and variance of gene expression will significantly...
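The first two procedures in the comparison above, Bonferroni and Benjamini-Hochberg, can be contrasted with a short sketch of their adjusted p-values. The code follows the standard textbook definitions rather than the study's own software, and the input p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        gene = order[rank - 1]
        running_min = min(running_min, p_values[gene] * m / rank)
        adjusted[gene] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
hits_bonf = sum(p <= 0.05 for p in bonf)
hits_bh = sum(p <= 0.05 for p in bh)
```

On this toy input Bonferroni keeps two genes at the 0.05 level while Benjamini-Hochberg keeps all four, reflecting the general trade-off: Bonferroni controls the chance of any false positive, whereas BH controls the expected proportion of false positives among the genes called significant.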