BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata

Nucleic Acids Res. 2012 Jan;40(Database issue):D57-63.

doi: 10.1093/nar/gkr1163. Epub 2011 Dec 1.

Tanya Barrett, Karen Clark, Robert Gevorgyan, Vyacheslav Gorelenkov, Eugene Gribov, Ilene Karsch-Mizrachi, Michael Kimelman, Kim D Pruitt, Sergei Resenchuk, Tatiana Tatusova, Eugene Yaschenko, James Ostell

Tanya Barrett et al. Nucleic Acids Res. 2012 Jan.

Abstract

As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, previously, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases. It captures descriptive information about research projects that result in high volume submissions to archival databases, ties together related data across multiple archives and serves as a central portal by which to inform users of data availability. Concomitantly, the BioSample database is being developed to capture descriptive information about the biological samples investigated in projects. BioProject and BioSample records link to corresponding data stored in archival repositories. Submissions are supported by a web-based Submission Portal that guides users through a series of forms for input of rich metadata describing their projects and samples. Together, these databases offer improved ways for users to query, locate, integrate and interpret the masses of data held in NCBI's archival repositories. The BioProject and BioSample databases are available at http://www.ncbi.nlm.nih.gov/bioproject and http://www.ncbi.nlm.nih.gov/biosample, respectively.

Figures

Figure 1.

Schematic depicting how BioProject, BioSample and data objects can be organized and linked. This example is composed of one umbrella project that encompasses three subprojects, each of which generated data derived from two BioSample records. Users can query either the BioProject or the BioSample database to retrieve the relevant records, and then navigate through links to the corresponding experimental data which continue to be stored in NCBI's primary data archives, including GenBank, SRA, dbGaP and GEO. This schematic depicts direct links that can be applied between objects; it does not depict links to corresponding records in other NCBI databases, including PubMed, Gene, Genome and Taxonomy.
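The arrangement described in this caption can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, field names, `build_example` helper and all identifiers (`PRJ-UMBRELLA`, `SAM-1-1`, `ARCHIVE-1-1`, and so on) are hypothetical placeholders, not real accessions and not NCBI's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BioSample:
    # Hypothetical placeholder identifier, not a real BioSample accession.
    accession: str
    # Identifiers of records held in primary archives (e.g. SRA, GenBank);
    # the values used below are invented for illustration.
    data_links: List[str] = field(default_factory=list)


@dataclass
class BioProject:
    # Hypothetical placeholder identifier, not a real BioProject accession.
    accession: str
    title: str
    samples: List[BioSample] = field(default_factory=list)
    subprojects: List["BioProject"] = field(default_factory=list)

    def all_samples(self) -> List[BioSample]:
        """Collect samples from this project and every project beneath it."""
        found = list(self.samples)
        for sub in self.subprojects:
            found.extend(sub.all_samples())
        return found


def build_example() -> BioProject:
    """Build the layout from the caption: one umbrella project,
    three subprojects, two samples per subproject."""
    umbrella = BioProject("PRJ-UMBRELLA", "Umbrella project")
    for i in range(1, 4):
        sub = BioProject(f"PRJ-SUB{i}", f"Subproject {i}")
        for j in range(1, 3):
            sample = BioSample(f"SAM-{i}-{j}", [f"ARCHIVE-{i}-{j}"])
            sub.samples.append(sample)
        umbrella.subprojects.append(sub)
    return umbrella


if __name__ == "__main__":
    umbrella = build_example()
    # Navigate from the umbrella project down to each sample's archive links.
    for sample in umbrella.all_samples():
        print(sample.accession, sample.data_links)
```

Starting from the umbrella record and walking down to the samples mirrors the navigation the caption describes: a user retrieves a project or sample record first, then follows its links to the experimental data, which stays in the primary archives.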

Figure 2.

Screenshot of a Genome Sequencing project that is a component of an umbrella project that encompasses data generated from an E. coli pathogen outbreak (upper panel) (17) and a corresponding sample record (lower panel). The records display the project title, summary, data type, locus_tag prefix and various project attributes including the scope and capture method (A). The Project Data section (B) lists the availability of corresponding sequence and assembly data in the Nucleotide and SRA databases, where the data can be downloaded. Navigation panels help users link to Genome-level resources for that organism (C), ‘Navigate Up’ to the parent umbrella project, or ‘Navigate across’ to sibling projects that are part of that umbrella project, as well as to any additional projects related by organism (D). The ‘Related Information’ panel (E) contains the full list of linkages for that record; clicking the BioSample link directs the user to the sample record shown in the lower panel, which lists the attributes that were collected for that sample, including the collection date, isolation source, country, strain and serovar (F).

References

1. Cochrane G, Karsch-Mizrachi I, Nakamura Y. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res. 2011;39:D15–D18.
2. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2011;39:D32–D37.
3. Shumway M, Cochrane G, Sugawara H. Archiving next generation sequencing data. Nucleic Acids Res. 2010;38:D870–D871.
4. Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, et al. NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res. 2011;39:D1005–D1010.
5. Fingerman IM, McDaniel L, Zhang X, Ratzat W, Hassan T, Jiang Z, Cohen RF, Schuler GD. NCBI Epigenomics: a new public resource for exploring epigenomic data sets. Nucleic Acids Res. 2011;39:D908–D912.
