Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes
Patrick Warnat et al. BMC Bioinformatics. 2005.
Abstract
Background: The extensive use of DNA microarray technology in the characterization of the cell transcriptome is leading to an ever-increasing amount of microarray data from cancer studies. Although these studies address similar questions for the same type of cancer, a comparative analysis of their results is hampered by the use of heterogeneous microarray platforms and analysis methods.
Results: In contrast to a meta-analysis approach, where results of different studies are combined on an interpretative level, we investigate here how to directly integrate raw microarray data from different studies for the purpose of supervised classification analysis. We use median rank scores and quantile discretization to derive numerically comparable measures of gene expression from different platforms. These transformed data are then used for training of classifiers based on support vector machines. We apply this approach to six publicly available cancer microarray gene expression data sets, which consist of three pairs of studies, each examining the same type of cancer, i.e. breast cancer, prostate cancer or acute myeloid leukemia. For each pair, one study was performed by means of cDNA microarrays and the other by means of oligonucleotide microarrays. In each pair, high classification accuracies (> 85%) were achieved in a cross-validation analysis with training and testing on data instances randomly chosen from both data sets. To exemplify the potential of this cross-platform classification analysis, we use two leukemia microarray data sets to show that an integrated analysis selects genes that are important to the biology of leukemia but are missed in either single-set analysis.
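The two transformations named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the details, so the function names, the choice of eight equal-frequency bins, and the reading of median rank scores as "replace each value by the reference study's gene median of the same rank" are assumptions made here.

```python
from statistics import median


def quantile_discretize(sample, n_bins=8):
    """Replace each expression value of one array by its equal-frequency bin index.

    n_bins=8 is an illustrative assumption; ties are broken by position.
    """
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        # Integer arithmetic maps ranks 0..n-1 onto bins 0..n_bins-1.
        bins[i] = rank * n_bins // n
    return bins


def median_rank_scores(reference_samples, sample):
    """Map one array onto the scale of a reference study (assumed reading of MRS).

    reference_samples: list of arrays from the reference study, same gene order.
    sample: one array from the other study, same gene order.
    """
    n_genes = len(sample)
    # Per-gene median across the reference study, sorted ascending.
    gene_medians = sorted(
        median(ref[g] for ref in reference_samples) for g in range(n_genes)
    )
    # The gene with rank r in the sample receives the r-th smallest median.
    order = sorted(range(n_genes), key=lambda i: sample[i])
    scores = [0.0] * n_genes
    for rank, i in enumerate(order):
        scores[i] = gene_medians[rank]
    return scores
```

Both functions only use the within-array ordering of the values, which is why arrays measured on different platforms and scales become numerically comparable after either transformation.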
Conclusion: Cross-platform classification of multiple cancer microarray data sets yields discriminative gene expression signatures that are found and validated on a large number of microarray samples, generated by different laboratories and microarray technologies. Predictive models generated by this approach are better validated than those generated on a single data set, while showing high predictive power and improved generalization performance.
Figures
Figure 1
Barplot of the number of UniGene clusters represented in each data set. Grey coloured bars indicate the proportion of UniGene clusters common to a pair of studies.
Figure 2
Quantile-quantile plots (QQ-plots) comparing the distribution of gene expression values from microarrays of all investigated studies before and after the respective application of MRS or QD. One microarray per study was selected and a QQ-plot was produced for every pair of microarrays from corresponding studies. In every QQ-plot the quantiles of all gene expression values of a first microarray are plotted against the quantiles of all gene expression values of a second microarray. If the gene expression values of the two different microarrays share the same distribution, the points in the plot should form a straight line. Abbreviations: MRS, median rank scores; QD, quantile discretization.
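The points of such a QQ-plot can be computed as follows. This is a minimal sketch using only the Python standard library; the function name `qq_points` and the default of 20 quantile groups are assumptions for illustration.

```python
from statistics import quantiles


def qq_points(values_a, values_b, n_quantiles=20):
    """Return (quantile of array A, quantile of array B) pairs for a QQ-plot."""
    # statistics.quantiles returns the n - 1 cut points dividing the data
    # into n groups of equal probability.
    qa = quantiles(values_a, n=n_quantiles)
    qb = quantiles(values_b, n=n_quantiles)
    return list(zip(qa, qb))
```

Plotting the returned pairs and checking how closely they follow a straight line gives the visual comparison described in the caption.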
Figure 3
Barplot of results from a classification analysis using SVM classifiers, where all data from one study are used to build a classifier (training), which is then used to classify all samples of the other study (test). The names below the bars indicate which study was used for classifier training (left name) and testing (right name). The bars represent the achieved classification accuracies, i.e. the fraction of samples correctly classified. The colour of a bar indicates the method used for data integration. P-values are obtained by statistical testing with the null hypothesis that the two marked classification approaches perform equally well on the given test set (see Methods for details). The target variable for classification analysis of the prostate cancer data was 'type of tissue' (normal vs. tumor tissue), for the breast cancer data the estrogen receptor (ER) status (ER positive vs. ER negative), and for the leukemia data the karyotype of the samples (one of the chromosomal aberrations t(8;21), t(15;17), inv(16) or normal karyotype, respectively). Abbreviations: MRS, median rank scores; QD, quantile discretization; SVM, support vector machine.
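The two quantities shown in this figure can be illustrated with a short sketch. Accuracy is computed as defined in the caption. The caption does not name the statistical test, so the exact McNemar test used below is an assumption: it is one common choice for comparing two classifiers on the same test set, and may differ from the test described in the paper's Methods.

```python
from math import comb


def accuracy(y_true, y_pred):
    """Fraction of samples correctly classified."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcnemar_exact_p(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value (assumed test, not confirmed by the source).

    Null hypothesis: both classifiers perform equally well on this test set.
    """
    # Discordant samples: exactly one of the two classifiers is correct.
    only_a = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    only_b = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```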
Figure 4
Venn diagrams showing the overlap between lists of genes generated by RFE analysis based on single sets (Bullinger et al. or Valk et al.) and based on both data sets integrated by MRS or QD. Abbreviations: MRS, median rank scores; QD, quantile discretization; RFE, recursive feature elimination.
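The region sizes of a two-list Venn diagram follow directly from set operations. The sketch below is illustrative only: the function name is an assumption, and any gene identifiers passed to it would come from the reader's own RFE output, not from the paper.

```python
def venn_counts(genes_a, genes_b):
    """Return the sizes of the three regions of a two-set Venn diagram."""
    a, b = set(genes_a), set(genes_b)
    return {
        "only_a": len(a - b),   # selected only in the first list
        "both": len(a & b),     # selected in both lists
        "only_b": len(b - a),   # selected only in the second list
    }
```

Comparing a single-set gene list against the list from the integrated data in this way shows how many genes appear only after integration.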
Figure 5
Hierarchical clustering of leukemia samples based on expression values of genes selected by RFE analysis. The coloured bars indicate the true class affiliations of every sample, the black and white bars below indicate study origin. (a) Clustering result for data from Valk et al. or (b) Bullinger et al. using only genes selected by RFE on this data set. (c) Clustering of data from Valk et al. after data integration by MRS algorithm using only expression values of genes selected by RFE on the data of Bullinger et al. (d) Clustering of data from Bullinger et al. based on genes selected on data from Valk et al. Data integrated by QD or non-integrated data yielded results similar to those shown here (data not shown). (e) Clustering results of all samples of both studies using gene lists generated on the combined set integrated by MRS or (f) QD. Abbreviations: MRS, median rank scores; QD, quantile discretization; RFE, recursive feature elimination.
Figure 6
Flow diagram of the presented cross-platform classification approach (see Methods for details) compared to a meta-analysis approach.