Simultaneous non-negative matrix factorization for multiple large scale gene expression datasets in toxicology

Clare M Lee et al. PLoS One. 2012.

Abstract

Non-negative matrix factorization is a useful tool for reducing the dimension of large datasets. This work considers simultaneous non-negative matrix factorization of multiple sources of data. In particular, we perform the first study that involves more than two datasets. We discuss the algorithmic issues required to convert the approach into a practical computational tool and apply the technique to new gene expression data quantifying the molecular changes in four tissue types due to different dosages of an experimental panPPAR agonist in mouse. This study is of interest in toxicology because, whilst PPARs form potential therapeutic targets for diabetes, it is known that they can induce serious side-effects. Our results show that the practical simultaneous non-negative matrix factorization developed here can add value to the data analysis. In particular, we find that factorizing the data as a single object allows us to distinguish between the four tissue types, but does not correctly reproduce the known dosage level groups. Applying our new approach, which treats the four tissue types as providing distinct, but related, datasets, we find that the dosage level groups are respected. The new algorithm then provides separate gene list orderings that can be studied for each tissue type, and compared with the ordering arising from the single factorization. We find that many of our conclusions can be corroborated with known biological behaviour, and others offer new insights into the toxicological effects. Overall, the algorithm shows promise for early detection of toxicity in the drug discovery process.
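
The idea of factorizing several related datasets at once can be illustrated with a minimal sketch. It assumes each dataset A_i (genes by mice) is approximated as W_i H with a single shared factor H across the common set of mice, and that the objective is the plain sum of squared Frobenius errors, minimised with Lee and Seung style multiplicative updates. The function name `simultaneous_nmf`, its parameters, and the synthetic data are illustrative assumptions; the paper's own algorithm may differ in normalisation, weighting and stopping rules.

```python
import numpy as np

def simultaneous_nmf(datasets, k, n_iter=500, eps=1e-9, seed=0):
    """Factorise each A_i (genes_i x samples) as W_i @ H with one shared H.

    Assumption: the objective is sum_i ||A_i - W_i H||_F^2, minimised with
    multiplicative updates. This is a sketch, not the published algorithm.
    """
    rng = np.random.default_rng(seed)
    n_samples = datasets[0].shape[1]
    Ws = [rng.random((A.shape[0], k)) + eps for A in datasets]
    H = rng.random((k, n_samples)) + eps
    history = []
    for _ in range(n_iter):
        # Each W_i is updated against its own dataset only.
        for i, A in enumerate(datasets):
            W = Ws[i]
            Ws[i] = W * (A @ H.T) / (W @ (H @ H.T) + eps)
        # The shared H pools information from every dataset.
        numer = sum(W.T @ A for W, A in zip(Ws, datasets))
        denom = sum(W.T @ W for W in Ws) @ H + eps
        H *= numer / denom
        history.append(
            sum(np.linalg.norm(A - W @ H) ** 2 for W, A in zip(Ws, datasets))
        )
    return Ws, H, history

# Synthetic example: four "tissues" with different gene counts, 12 shared samples.
rng = np.random.default_rng(1)
H_true = rng.random((3, 12))
datasets = [rng.random((n, 3)) @ H_true for n in (40, 50, 60, 30)]
Ws, H, history = simultaneous_nmf(datasets, k=3)
clusters = H.argmax(axis=0)  # one cluster label per shared sample
```

Because H is shared, the sample clustering is global, while each W_i gives its own gene weighting and hence its own gene ordering per dataset.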

Conflict of interest statement

Competing Interests: DC was employed by Pfizer and legacy company Wyeth during the course of this work and is now employed by Sanofi. GM was employed by Pfizer and legacy company Wyeth during the course of this work and is now employed by Epistem. CRW is employed by CXR. Pfizer agreed to the publication of this manuscript. MM, DRH and JKV were funded through the TMRC. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

Figures

Figure 1. Three measures of the performance versus specified cluster size, [formula], when the data set is factorised as a single entity. (a) The value of the objective function for [formula]. (b) The area under the consensus cumulative density. (c) The cophenetic correlation coefficient.
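
The consensus-based measure in panel (b) can be sketched as follows. The code assumes the usual consensus-clustering construction: each NMF run yields one cluster label per sample, the consensus matrix records the fraction of runs in which two samples share a cluster, and the area is the integral of the empirical cumulative distribution of its off-diagonal entries. The function names and the toy labelings are illustrative; the exact formula used in the paper is not reproduced in this listing.

```python
import numpy as np

def consensus_matrix(labelings):
    """Average connectivity matrix over runs; entry (i, j) is the fraction
    of runs in which samples i and j are placed in the same cluster."""
    labelings = np.asarray(labelings)
    same = labelings[:, :, None] == labelings[:, None, :]
    return same.mean(axis=0)

def area_under_cdf(C):
    """Integral of the empirical CDF of the off-diagonal consensus entries,
    taken between their smallest and largest values."""
    iu = np.triu_indices_from(C, k=1)
    x = np.sort(C[iu])
    cdf = np.arange(1, x.size + 1) / x.size
    # The CDF is a step function: on [x[j-1], x[j]) it equals cdf[j-1].
    return float(np.sum(np.diff(x) * cdf[:-1]))

# Two hypothetical runs that agree on every sample.
stable = consensus_matrix([[0, 0, 1, 1], [1, 1, 0, 0]])
# Two hypothetical runs that disagree about the second sample.
unstable = consensus_matrix([[0, 0, 1, 1], [0, 1, 1, 1]])
area_stable = area_under_cdf(stable)
area_unstable = area_under_cdf(unstable)
```

The connectivity comparison is unaffected by how each run numbers its clusters, which is why the first two labelings count as identical.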

Figure 2. Factorising as a single dataset; reordering using the NMF for [formula]. The columns show the samples and the rows the gene expression for each of the 45037 genes. Genes and samples are organised by cluster number. Elements within each cluster are ordered, with the largest value at the bottom/right. Each tissue is characterised by a group of highly expressed genes; from the top left to bottom right these are heart, skeletal muscle, liver and kidney. For comparison purposes, the characteristic 100 “best” genes in the four columns are named [formula], [formula], [formula] and [formula].
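
The reordering described in this caption can be sketched in a few lines. The code assumes that each gene or sample is assigned to the cluster in which its factor weight is largest (a common convention, not confirmed by this listing) and that items inside a cluster are sorted by ascending weight so the largest lands at the bottom or right. The helper name `cluster_order` and the small matrices are illustrative.

```python
import numpy as np

def cluster_order(F):
    """F has one row per item and one column per cluster.

    Returns (order, labels): items grouped by cluster number and, within a
    cluster, sorted by ascending membership weight (largest last)."""
    labels = F.argmax(axis=1)
    weights = F.max(axis=1)
    # np.lexsort treats the LAST key as the primary sort key.
    order = np.lexsort((weights, labels))
    return order, labels

# Toy factors: W is genes x clusters, H is clusters x samples.
W = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.1], [0.1, 0.3]])
H = np.array([[0.7, 0.1, 0.9], [0.2, 0.6, 0.1]])
A = W @ H

gene_order, gene_labels = cluster_order(W)
sample_order, sample_labels = cluster_order(H.T)
A_reordered = A[np.ix_(gene_order, sample_order)]
```

Applying the two orderings with `np.ix_` permutes rows and columns together, which is what produces the block structure in a reordered heat map.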

Figure 3. Factorising as a single dataset. The clustering of the mouse samples for [formula]. Within each column the samples in the same colour are clustered together. No value of [formula] reveals the known tissue/dosage subgroups, or places different tissues in the same cluster.

Figure 4. Three measures of the performance versus specified cluster size, [formula], when the four tissue types are factorised separately. (a) The value of the objective function for [formula]. (b) The area under the consensus cumulative density function for [formula]. (c) The cophenetic correlation coefficient.
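
The cophenetic correlation coefficient in panel (c) can be computed from a consensus matrix as in the sketch below. It assumes the convention of treating one minus the consensus value as a distance and clustering with average linkage; the linkage method, the function name `cophenetic_coefficient` and the toy matrices are assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def cophenetic_coefficient(C):
    """Cophenetic correlation of a symmetric consensus matrix C, using
    1 - C as the distance and average-linkage hierarchical clustering."""
    D = 1.0 - np.asarray(C, dtype=float)
    np.fill_diagonal(D, 0.0)
    d = squareform(D, checks=False)   # condensed pairwise distances
    Z = linkage(d, method="average")
    c, _ = cophenet(Z, d)             # correlation with tree distances
    return float(c)

# A perfectly stable two-cluster consensus matrix.
perfect = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]], dtype=float)
# A less stable consensus matrix with intermediate entries.
noisy = np.array([[1.0, 0.9, 0.1, 0.3],
                  [0.9, 1.0, 0.2, 0.1],
                  [0.1, 0.2, 1.0, 0.8],
                  [0.3, 0.1, 0.8, 1.0]])
c_perfect = cophenetic_coefficient(perfect)
c_noisy = cophenetic_coefficient(noisy)
```

A value close to one indicates that the hierarchical tree reproduces the consensus distances well, so the clustering is stable across runs for that cluster size.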

Figure 5. Factorisation of the four separate tissue types using simultaneous NMF with [formula]. Top left, kidney; top right, liver; lower left, heart; lower right, skeletal muscle. The four tissue types are treated as separate sources of information across a common set of mice. Genes are therefore ordered differently in each of the four tissues, but the mice ordering is global. The resulting mouse ordering and mouse clusters are detailed in Table 3.

Figure 6. Factorisation of the four separate tissue types simultaneously. The clustering of the mice for [formula]; colour indicates cluster number. One “misclassification” is found for several values of [formula]. This involves the mouse showing a toxic response to the lower (6 mg/kg) dose of PPAR agonist, as discussed in the main text.

Figure 7. Enrichment of canonical pathways in the four tissue specific gene clusters. The top one hundred most influential probe-sets in the four tissue specific gene clusters obtained in the first factorization were subjected to signalling and metabolic pathways analysis in the IPA software. This graph shows the comparison of canonical pathways enriched in the four tissue specific gene clusters, [formula], [formula], [formula] and [formula]. The coloured bars show the significance of the enrichment for a particular pathway in the cluster computed by Fisher's exact test.
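
The enrichment significance mentioned here can be illustrated with a small worked example. The sketch assumes a right-tailed Fisher's exact test, which for a 2 by 2 enrichment table equals the upper tail of the hypergeometric distribution; the background set and test orientation actually used by the IPA software are not stated in this listing, and the counts below are invented for illustration.

```python
from math import comb

def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """Right-tailed Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least n_overlap pathway members among
    n_selected items drawn without replacement from n_universe items,
    of which n_pathway belong to the pathway."""
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 genes in total, 5 in the pathway,
# 5 selected as "most influential", 3 of which are in the pathway.
p = enrichment_p_value(n_universe=20, n_pathway=5, n_selected=5, n_overlap=3)
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow, which matters when the universe holds tens of thousands of probe-sets.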

Figure 8. Enrichment of toxicity functions in the four tissue specific gene clusters. The top one hundred most influential probe-sets in the four tissue specific gene clusters obtained in the first factorization were subjected to IPA-Tox analysis in the IPA software. This graph shows the comparison of toxicity functions enriched in the four tissue specific gene clusters. The coloured bars show the significance of the enrichment for a particular toxicity function in the cluster computed by Fisher's exact test.

Figure 9. Heart and muscle genes enriched in calcium signalling – muscle contraction pathway. IPA analysis of the top 100 probe-sets from the heart and muscle gene clusters (Figure 7) showed the enrichment of the calcium signalling pathway. The genes present in this pathway are highlighted in orange. Though this pathway is generalised for skeletal muscle contraction and cardiac muscle contraction, the two differ in which members of the same gene family are involved. The heart and muscle genes present in this pathway are given in Tables 7 and 8. The pathway diagram was drawn using the Path Designer function of IPA.

Figure 10. Liver genes enriched in FXR/RXR activation pathway. IPA analysis of the top 100 probe-sets from the [formula] cluster (Figure 7) showed the enrichment of the FXR/RXR activation pathway. The genes present in this pathway are highlighted in orange. The liver genes present in the pathway are given in Table 9. The pathway diagram was drawn using the Path Designer function of IPA.

Figure 11. Enrichment of canonical pathways in the liver heart gene cluster no. 2. This gene cluster has 49 common probe-sets between the top one hundred most influential probe-sets in the liver gene cluster and top one hundred probe-sets in cluster number 2 (6 mg/kg dose rate) of the heart dataset reordered by 4-way simultaneous factorization. Canonical pathways enrichment for these 49 probe-sets analysed using the IPA software is shown in this figure. The length of the bars shows the Fisher's exact test p-value for enrichment for a particular pathway in the cluster.
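
The comparison of two top-100 lists reduces to a set intersection, sketched below. The code assumes that "most influential" means the largest factor weight within a cluster and that both weight vectors index the same probe-sets in the same order; the function names and the toy weights are illustrative only.

```python
def top_items(weights, n):
    """Indices of the n largest entries of a weight vector."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(ranked[:n])

def common_top(weights_a, weights_b, n=100):
    """Probe-set indices that appear in the top n of both weight vectors."""
    return top_items(weights_a, n) & top_items(weights_b, n)

# Toy example with five probe-sets and n = 3.
liver_weights = [0.9, 0.1, 0.8, 0.3, 0.7]
heart_weights = [0.2, 0.95, 0.85, 0.1, 0.6]
shared = common_top(liver_weights, heart_weights, n=3)
```

With real data the two vectors would be columns of the relevant gene factors, and the size of `shared` is the count reported in the caption (49 in the paper's case).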

Figure 12. Enrichment of toxicity functions in [formula] [formula] cluster 2. This gene cluster has 49 common probe-sets between the top one hundred most influential probe-sets in [formula] [formula] cluster 2 (6 mg/kg dose rate). Toxicity functions enrichment for these 49 probe-sets analysed using the IPA software is shown in this figure. The length of the bars shows the Fisher's exact test p-value for enrichment for a particular pathway in the cluster.
