Tests for finding complex patterns of differential expression in cancers: towards individualized medicine

On the gene expression landscape of cancer

2020

A principal component analysis of the TCGA data for 15 cancer localizations unveils the following qualitative facts about tumors: 1) The state of a tissue in gene expression space may be described by a few variables. In particular, there is a single variable describing the progression from a normal tissue to a tumor. 2) Each cancer localization is characterized by a gene expression profile, in which genes have specific weights in the definition of the cancer state. There are no fewer than 2500 differentially expressed genes, which lead to power-like tails in the expression distribution functions. 3) Tumors in different localizations share hundreds or even thousands of differentially expressed genes. There are 6 genes common to the 15 studied tumor localizations. 4) The tumor region is a kind of attractor. Tumors in advanced stages converge to this region independently of patient age or genetic variability. 5) There is a landscape of cancer in gene expression space with an approximate...

A Comparative Analysis of Gene-Expression Data of Multiple Cancer Types

PLoS ONE, 2010

A comparative study of public gene-expression data of seven types of cancers (breast, colon, kidney, lung, pancreatic, prostate and stomach cancers) was conducted with the aim of deriving marker genes, along with associated pathways, that are either common to multiple types of cancers or specific to individual cancers. The analysis results indicate that (a) each of the seven cancer types can be distinguished from its corresponding control tissue based on the expression patterns of a small number of genes, e.g., 2, 3 or 4; (b) the expression patterns of some genes can distinguish multiple cancer types from their corresponding control tissues, potentially serving as general markers for all or some groups of cancers; (c) the proteins encoded by some of these genes are predicted to be blood secretory, thus providing potential cancer markers in blood; (d) the numbers of differentially expressed genes across different cancer types in comparison with their control tissues correlate well with the five-year survival rates associated with the individual cancers; and (e) some metabolic and signaling pathways are abnormally activated or deactivated across all cancer types, while other pathways are more specific to certain cancers or groups of cancers. The novel findings of this study offer considerable insight into these seven cancer types and have the potential to provide exciting new directions for diagnostic and therapeutic development.

Identification and Validation of Commonly Overexpressed Genes in Solid Tumors by Comparison of Microarray Data

Neoplasia, 2004

Cancers originating from epithelial cells are the most common malignancies. No common expression profile of solid tumors compared to normal tissues has been described so far. We therefore asked whether genes differentially expressed in the majority of carcinomas could be identified using bioinformatic methods. Complete data sets were downloaded for carcinomas of the prostate, breast, lung, ovary, colon, pancreas, stomach, bladder, liver, and kidney, and were subjected to an expression analysis using SAM. In each experiment, a gene was scored as differentially expressed if the q value was below 25%. Probe identifiers were unified by comparing the respective probe sequences to the Unigene build 155 using BlastN. To obtain differentially expressed genes within the set of analyzed carcinomas, the number of experiments in which differential expression was observed was counted. Differential expression was assigned to genes if they were differentially expressed in at least eight experiments of tumors of different origin. The identified candidate genes ADRM1, EBNA1BP2, FDPS, FOXM1, H2AFX, HDAC3, IRAK1, and YY1 were subjected to further validation. Using this comparative approach, 100 genes were identified as upregulated and 21 genes as downregulated in the carcinomas. Neoplasia (2004) 6, 744-750
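The counting rule in this abstract (score a gene per experiment at q below 25%, then keep genes scored in at least eight experiments) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the input layout (one dict of gene to q value per experiment), and the toy data are all assumptions made for the example, and the gene names are placeholders.

```python
from collections import Counter


def count_recurrent_genes(experiments, q_cutoff=0.25, min_experiments=8):
    """Return genes scored as differentially expressed in enough experiments.

    experiments: list of dicts mapping gene symbol -> q value
    (a hypothetical input layout chosen for this sketch).
    """
    votes = Counter()
    for q_values in experiments:
        for gene, q in q_values.items():
            # A gene gets one vote per experiment in which q is below the cutoff.
            if q < q_cutoff:
                votes[gene] += 1
    return sorted(gene for gene, n in votes.items() if n >= min_experiments)


# Toy input: 10 hypothetical experiments with invented q values.
toy = [{"GENE_A": 0.01, "GENE_B": 0.40} for _ in range(9)]
toy.append({"GENE_A": 0.30, "GENE_B": 0.05})

recurrent = count_recurrent_genes(toy)
print(recurrent)
```

In the toy input, GENE_A passes the cutoff in nine experiments and is kept, while GENE_B passes in only one and is dropped.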

Gene Expression Profiling of Human Cancers

Annals of the New York Academy of Sciences, 2004

DNA microarrays allow us to visualize simultaneously the expression of potentially all genes within a cell population or tissue sample, revealing the "transcriptome." The analysis of this type of data is commonly called "gene expression profiling" (GEP) because it provides a comprehensive picture of the pattern of gene expression in a particular biological sample. For this reason microarrays are revolutionizing life sciences research and are leading to the development of novel and powerful methods for investigating cancer biology, classifying cancers, and predicting clinical outcome of cancers. Several recent high-profile reports have revealed how clustering of GEP data can clearly identify clinically (and prognostically) important subtypes of cancer among patients considered by established clinicopathological criteria to have similar tumors. Accurate "prognostic signatures" can be obtained from GEP data, which represent relatively small numbers of genes. These signatures can be valuable in directing appropriate treatment and in predicting clinical outcome, and they generally outperform other systems based on clinical and histological criteria. In this paper the basic principles of DNA microarray technology and the different types of microarray platforms available will be introduced, and the power of the technique will be illustrated by reviewing some recent GEP studies on selected cancers, including a preliminary analysis of hepatocellular carcinoma from our Palermo laboratory. GEP is likely to be adopted in the future as a key decision-making tool in the clinical arena. However, several issues relating to data analysis, reproducibility, cross-comparability, validation, and cost need to be resolved before the technology can be adopted broadly in this context.

Micro Array Based Gene Expression Analysis using Parametric Multivariate Tests per Gene - A Generalized Application of Multiple Procedures with Data-driven Order of Hypotheses

Biometrical Journal, 2004

Micro array technology allows the simultaneous analysis of tens of thousands of genes. Most often, however, the analysis is based on a few replications only. This causes problems in the application of classical multivariate tests which require sample sizes exceeding the number of observed variables. To overcome these problems, a class of stable, multivariate procedures based on the theory of spherical distributions has been proposed by Läuter, Glimm, and Kropf (1996). These methods allow the use of multivariate information of many genes for testing differential gene expression. Furthermore, multiple testing procedures based on these principles have been constructed (e.g., Kropf and Läuter, 2002), which strictly keep the familywise type I error rate (FWE). In this paper, these methods have been generalized to allow for the use of full multivariate information on expression intensities of individual genes analysed by the Affymetrix GeneChip technology. In contrast to the usual strategy, which constructs an expression score for each gene, based on averaging of the different oligonucleotide (perfect- and mismatch) information, and then performs some test on these summarized expression values, we suggest using a test procedure based on the complete multivariate perfect match information. We show that a multiple FWE-controlling procedure for normally distributed data proposed by Westfall, Kropf, and Finos (2004) can be generalised to a more powerful procedure based on left-spherically distributed scores derived from the perfect match information, without losing the FWE-controlling property. To illustrate the proposed test procedures, which have been implemented in the statistical programming environment R, we analyse two already published data sets, comparing gene expression of tumour and healthy tissues within identical patients and between two groups of different patients, respectively. Using these examples, we demonstrated that the incorporation of the multivariate perfect match information is superior to classical expression score based methods with respect to the number of identifiable differentially expressed genes.

Adaptive trimmed t‐statistics for identifying predominantly high expression in a microarray experiment

2011

Often, interesting candidate tumor markers are not only genes that show homogeneously higher expression (HHE) in tumor samples compared to control samples, but also genes with only predominantly higher expression (PHE), i.e., genes which exhibit higher expression in at least 80 per cent of tumor samples. Standard parametric test statistics used in the analysis of microarray experiments may fail with PHE as a consequence of the mixture of distributions present in the tumor group.
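To see why a mixture in the tumor group weakens a standard statistic, the sketch below compares a plain Welch t-statistic with one computed after dropping the lowest 20 per cent of tumor values. It is a generic illustration only and is not the adaptive trimmed statistic of the paper, whose definition is not given in this abstract. The function names, the fixed trimming fraction, and the toy expression values are assumptions.

```python
import math
import statistics


def welch_t(x, y):
    """Welch two-sample t-statistic for mean(x) minus mean(y)."""
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    diff = statistics.mean(x) - statistics.mean(y)
    return diff / math.sqrt(var_x / len(x) + var_y / len(y))


def lower_trimmed_t(tumor, control, trim_fraction=0.2):
    """Welch t after discarding the lowest trim_fraction of tumor values.

    Hypothetical helper for illustration; not the published statistic.
    """
    n_drop = int(len(tumor) * trim_fraction)
    kept = sorted(tumor)[n_drop:]
    return welch_t(kept, control)


# Invented toy values: 8 of 10 tumor samples are elevated (a PHE pattern).
control = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.1, 5.0, 4.9, 5.0]
tumor = [8.0, 8.2, 7.9, 8.1, 8.0, 7.8, 8.1, 8.0, 5.0, 4.9]

plain = welch_t(tumor, control)
trimmed = lower_trimmed_t(tumor, control)
print(f"plain t = {plain:.2f}, trimmed t = {trimmed:.2f}")
```

The two non-responding tumor samples inflate the tumor variance, which pulls the plain statistic down. Note that trimming after sorting changes the null distribution, so the trimmed value cannot be compared against standard t tables; a permutation scheme or a purpose-built reference distribution would be needed.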

An Efficient and Robust Statistical Modeling Approach to Discover Differentially Expressed Genes Using Genomic Expression Profiles

Genome Research, 2001

We have developed a statistical regression modeling approach to discover genes that are differentially expressed between two predefined sample groups in DNA microarray experiments. Our model is based on well-defined assumptions, uses rigorous and well-characterized statistical measures, and accounts for the heterogeneity and genomic complexity of the data. In contrast to cluster analysis, which attempts to define groups of genes and/or samples that share common overall expression profiles, our modeling approach uses known sample group membership to focus on expression profiles of individual genes in a sensitive and robust manner. Further, this approach can be used to test statistical hypotheses about gene expression. To demonstrate this methodology, we compared the expression profiles of 11 acute myeloid leukemia (AML) and 27 acute lymphoblastic leukemia (ALL) samples from a previous study (Golub et al. 1999) and found 141 genes differentially expressed between AML and ALL with a 1%...

A multivariate statistical test for differential expression analysis

Scientific Reports

Statistical tests of differential expression usually suffer from two problems. Firstly, their statistical power is often limited when applied to small and skewed data sets. Secondly, gene expression data are usually discretized by applying arbitrary criteria to limit the number of false positives. In this work, a new statistical test obtained from a convolution of multivariate hypergeometric distributions, the Hy-test, is proposed to address these issues. Hy-test has been carried out on transcriptomic data from breast and kidney cancer tissues, and it has been compared with other differential expression analysis methods. Hy-test allows implicit discretization of the expression profiles and is more selective in retrieving both differentially expressed genes and terms of Gene Ontology. Hy-test can be adopted together with other tests to retrieve information that would remain hidden otherwise, e.g., terms of (1) cell cycle deregulation for breast cancer and (2) “programmed cell death” f...

Empirical evaluation of consistency and accuracy of methods to detect differentially expressed genes based on microarray data

Computers in Biology and Medicine, 2014

Background: In this study, we empirically evaluated the consistency and accuracy of five different methods to detect differentially expressed genes (DEGs) based on microarray data. Methods: Five different methods were compared, including the t-test, significance analysis of microarrays (SAM), the empirical Bayes t-test (eBayes), t-tests relative to a threshold (TREAT), and assumption adequacy averaging (AAA). The percentage of overlapping genes (POG) and percentage of overlapping genes related (POGR) scores were used to rank the different methods on their ability to maintain a consistent list of DEGs both within the same data set and across two different data sets concerning the same disease. The power of each method was evaluated based on a simulation approach which mimics the multivariate distribution of the original microarray data. Results: For smaller sample sizes (6 or fewer per group), moderated versions of the t-test (SAM, eBayes, and TREAT) were superior in terms of both power and consistency relative to the t-test and AAA, with TREAT having the highest consistency in each scenario. Differences in consistency were most pronounced for comparisons between two different data sets for the same disease. For larger sample sizes AAA had the highest power for detecting small effect sizes, while TREAT had the lowest. Discussion: For smaller sample sizes moderated versions of the t-test can generally be recommended, while for larger sample sizes selection of a method to detect DEGs may involve a compromise between consistency and power.
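A consistency score of the POG kind can be sketched as below. The abstract does not spell out the formula, so this assumes the simplest reading: the share of genes in one DEG list that also appear in a second list, expressed as a percentage. The function name and the placeholder gene lists are hypothetical, and POGR (which also credits related genes) is not covered.

```python
def percentage_overlapping_genes(list_a, list_b):
    """Percentage of distinct genes in list_a that also occur in list_b.

    Assumed definition for this sketch; the study's exact formula may differ.
    """
    genes_a = set(list_a)
    if not genes_a:
        raise ValueError("list_a must contain at least one gene")
    shared = genes_a & set(list_b)
    return 100.0 * len(shared) / len(genes_a)


# Placeholder DEG lists, e.g. from the same method run on two data sets.
degs_first = ["G1", "G2", "G3", "G4"]
degs_second = ["G3", "G4", "G5", "G6"]

pog = percentage_overlapping_genes(degs_first, degs_second)
print(pog)
```

With lists of equal length the score is symmetric; with unequal lengths it depends on which list is used as the denominator, which is one reason comparisons are usually made on lists of the same size.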