Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt

Protocol
Published: 23 July 2009

Nature Protocols volume 4, pages 1184–1191 (2009)

Abstract

Genomic experiments produce multiple views of biological systems, among them DNA sequence and copy number variation, and mRNA and protein abundance. Understanding these systems requires integrated bioinformatic analysis. Public databases such as Ensembl provide relationships and mappings between the relevant sets of probe and target molecules. However, the relationships can be biologically complex and the content of the databases is dynamic. We demonstrate how to use the computational environment R to integrate and jointly analyze experimental datasets, employing BioMart web services to provide the molecule mappings. We also discuss typical problems that are encountered in making gene-to-transcript-to-protein mappings. The approach provides a flexible, programmable and reproducible basis for state-of-the-art bioinformatic data integration.
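
The following is a minimal sketch of the kind of identifier mapping described above; it is not code taken from the protocol. The function fetch_probe_map() is a hypothetical helper that wraps the biomaRt functions useMart() and getBM(); the array type (affy_hg_u133_plus_2) is chosen only for illustration, and calling the helper requires the biomaRt package and network access. The rest of the example runs offline on an invented mapping table with placeholder identifiers (p1, G1 and so on) to show one typical problem: probes that map to more than one gene.

```r
# Hypothetical helper: query Ensembl BioMart for probe-to-gene mappings.
# Defined but not called here, because it needs biomaRt and network access.
fetch_probe_map <- function(probe_ids) {
  mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  biomaRt::getBM(
    attributes = c("affy_hg_u133_plus_2", "ensembl_gene_id"),
    filters    = "affy_hg_u133_plus_2",
    values     = probe_ids,
    mart       = mart
  )
}

# Invented mapping table with the same columns the query above would return.
# All identifiers are placeholders, not real probe or gene IDs.
probe_map <- data.frame(
  affy_hg_u133_plus_2 = c("p1", "p2", "p2", "p3"),
  ensembl_gene_id     = c("G1", "G2", "G3", "G1"),
  stringsAsFactors    = FALSE
)

# Count distinct genes per probe; p2 maps to two genes and is flagged.
genes_per_probe <- tapply(
  probe_map$ensembl_gene_id,
  probe_map$affy_hg_u133_plus_2,
  function(g) length(unique(g))
)
ambiguous <- names(genes_per_probe)[genes_per_probe > 1]

# Keep only probes with a single gene assignment.
unique_map <- probe_map[!probe_map$affy_hg_u133_plus_2 %in% ambiguous, ]

# Invented expression values; p4 has no mapping at all.
expr <- data.frame(
  affy_hg_u133_plus_2 = c("p1", "p2", "p3", "p4"),
  value               = c(1.0, 2.0, 3.0, 4.0),
  stringsAsFactors    = FALSE
)

# Inner join: ambiguous and unmapped probes drop out of the merged table.
merged <- merge(expr, unique_map, by = "affy_hg_u133_plus_2")
print(merged)
```

Discarding ambiguous probes is only one possible policy. Depending on the analysis, it may be preferable to keep all mappings or to summarize several probes per gene. Because database content changes between releases, recording the release used for the query helps keep such a mapping reproducible.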

References

R Development Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2008) ISBN 3-900051-07-0.
Gentleman, R.C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5 (10): R80 (2004).
Kasprzyk, A. et al. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 14 (1): 160–169 (2004).
Hubbard, T.J. et al. Ensembl 2009. Nucleic Acids Res. 37 (Database issue): D690–D697 (2009).
Rogers, A. et al. WormBase 2007. Nucleic Acids Res. 36 (Database issue): D612–D617 (2008).
Matthews, L. et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 37 (Database issue): D619–D622 (2009).
Durinck, S. et al. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 21, 3439–3440 (2005).
Durinck, S. Integrating biological data resources into R with biomaRt. The Newsletter of the R Project 6/5, 40–45 (2006).
Boutros, M. et al. Analysis of cell-based RNAi screens. Genome Biol. 7, R66 (2006).
Wei, J.S. et al. The MYCN oncogene is a direct target of miR-34a. Oncogene 27 (39): 5204–5213 (2008).
Hahne, F. et al. Bioconductor Case Studies (Springer Verlag, New York, USA, 2008).
Pruitt, K.D., Tatusova, T. & Maglott, D.R. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35 (Database issue): D61–D65 (2007).
Bruford, E.A. et al. The HGNC database in 2008: a resource for the human genome. Nucleic Acids Res. 36 (Database issue): D445–D448 (2008).
Neve, R.M. et al. A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell 10, 515–527 (2006).
Parkinson, H. et al. ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 37, D868–D872 (2009).
Irizarry, R.A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003).

Acknowledgements

We thank Arek Kasprzyk and Rhoda Kinsella for insightful discussions.

This work was partially funded by the U24 CA126551 grant.

Author information

Authors and Affiliations

Lawrence Berkeley National Laboratory, Berkeley, California, USA
Steffen Durinck & Paul T Spellman
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Ewan Birney & Wolfgang Huber

Authors

Steffen Durinck
Paul T Spellman
Ewan Birney
Wolfgang Huber

Corresponding author

Correspondence to Steffen Durinck.

Supplementary information

Supplementary Data 1

Zip archive containing the raw data of the Neve et al. study on a panel of 51 breast cell lines. It consists of Affymetrix CEL files of gene expression measurements deposited in ArrayExpress as experiment E-TABM-157, and Array CGH and protein quantification data which are available from http://cancer.lbl.gov/breastcancer. (ZIP 168067 kb)

About this article

Cite this article

Durinck, S., Spellman, P., Birney, E. et al. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4, 1184–1191 (2009). https://doi.org/10.1038/nprot.2009.97

Issue Date: August 2009
DOI: https://doi.org/10.1038/nprot.2009.97