Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation

Nucleic Acids Res. 2018 Jan 4;46(D1):D221-D228.

doi: 10.1093/nar/gkx1031.

Nuala A O'Leary 1, Catherine M Farrell 1, Jane E Loveland 2, Jonathan M Mudge 2, Craig Wallin 1, Carlos G Girón 2, Mark Diekhans 3, If Barnes 2, Ruth Bennett 2, Andrew E Berry 2, Eric Cox 1, Claire Davidson 2, Tamara Goldfarb 1, Jose M Gonzalez 2, Toby Hunt 2, John Jackson 1, Vinita Joardar 1, Mike P Kay 2, Vamsi K Kodali 1, Fergal J Martin 2, Monica McAndrews 4, Kelly M McGarvey 1, Michael Murphy 1, Bhanu Rajput 1, Sanjida H Rangwala 1, Lillian D Riddick 1, Ruth L Seal 5, Marie-Marthe Suner 2, David Webb 1, Sophia Zhu 4, Bronwen L Aken 2, Elspeth A Bruford 5, Carol J Bult 4, Adam Frankish 2, Terence Murphy 1, Kim D Pruitt 1


Shashikant Pujar et al. Nucleic Acids Res. 2018.

Abstract

The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, the HUGO Gene Nomenclature Committee, Mouse Genome Informatics and the University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps maintain the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.

Published by Oxford University Press on behalf of Nucleic Acids Research 2017.


Figures

Figure 1.

Number of CCDS IDs and genes represented in the human (A) and mouse (B) CCDS releases. The _X_-axis indicates the year in which a CCDS dataset was made public. Details about CCDS releases are available on the CCDS Releases and Statistics web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=SHOW_STATISTICS).

Figure 2.

Fraction of all genes in a CCDS release that are represented by at least two current CCDS IDs.
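The quantity plotted in Figure 2 can be illustrated with a minimal sketch. It assumes a release is available as (gene, CCDS ID) pairs restricted to current CCDS IDs; the function name `fraction_multi_ccds` and the example records are hypothetical and are not real CCDS data or part of any CCDS tool.

```python
from collections import defaultdict


def fraction_multi_ccds(records):
    """Return the fraction of genes represented by at least two CCDS IDs.

    `records` is an iterable of (gene, ccds_id) pairs, assumed to contain
    only current CCDS IDs from a single release.
    """
    ids_by_gene = defaultdict(set)
    for gene, ccds_id in records:
        ids_by_gene[gene].add(ccds_id)
    if not ids_by_gene:
        return 0.0
    # Count genes that have two or more distinct CCDS IDs.
    multi = sum(1 for ids in ids_by_gene.values() if len(ids) >= 2)
    return multi / len(ids_by_gene)


# Made-up illustrative records (hypothetical gene names and IDs).
example = [
    ("GENE_A", "CCDS0001"),
    ("GENE_A", "CCDS0002"),
    ("GENE_B", "CCDS0003"),
    ("GENE_C", "CCDS0004"),
]
fraction = fraction_multi_ccds(example)
print(f"{fraction:.2f}")
```

In this toy input, one of the three genes has two CCDS IDs, so the fraction is one third.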

Figure 3.

Changes in the human (A) and mouse (B) datasets with every new CCDS release. ‘New’ = CCDS IDs added in the release; ‘dropped’ = CCDS IDs present in the previous release but withdrawn in the subsequent release; ‘updated’ = CCDS IDs that have an incremented accession version compared to the previous release, indicating a sequence update in the coding region.
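The three categories in the Figure 3 caption can be sketched as a comparison of two releases. This is a minimal illustration, assuming each release is given as a list of current (non-withdrawn) versioned accessions in the `CCDS<number>.<version>` form seen in CCDS3542.1; the helper names and the example accessions are hypothetical and do not reflect how the CCDS pipeline itself tracks changes.

```python
def split_accession(accession):
    """Split a versioned accession such as 'CCDS3542.1' into ('CCDS3542', 1)."""
    ccds_id, _, version = accession.partition(".")
    return ccds_id, int(version)


def classify_changes(previous, current):
    """Classify CCDS IDs as new, dropped or updated between two releases.

    Both arguments are iterables of versioned accessions, assumed to list
    only the IDs that are current in the respective release.
    """
    prev = dict(split_accession(a) for a in previous)
    curr = dict(split_accession(a) for a in current)
    return {
        # IDs present only in the later release.
        "new": sorted(set(curr) - set(prev)),
        # IDs present only in the earlier release.
        "dropped": sorted(set(prev) - set(curr)),
        # IDs in both releases whose accession version was incremented.
        "updated": sorted(i for i in set(prev) & set(curr) if curr[i] > prev[i]),
    }


# Made-up accessions for illustration only.
previous_release = ["CCDS0001.1", "CCDS0002.1", "CCDS0003.2"]
current_release = ["CCDS0001.2", "CCDS0003.2", "CCDS0004.1"]
changes = classify_changes(previous_release, current_release)
print(changes)
```

Here CCDS0004 is new, CCDS0002 is dropped, CCDS0001 is updated, and CCDS0003 is unchanged and therefore appears in no category.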

Figure 4.

A view of the graphical display accessed from the report page of CCDS3542.1 (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=ALLFIELDS&DATA=CCDS3542&ORGANISM=0&BUILDS=CURRENTBUILDS) using the purple ‘S’ icon. (A) Transcripts and proteins from NCBI Annotation Release 108. (B) Transcripts and proteins from Ensembl Release 85. The green bar indicates the gene; transcripts are shown in purple and proteins in red. Positioning the cursor over any of these objects (gene, transcript or protein) opens a tool tip that includes additional information and links. Proteins in the NCBI annotation display that are in the CCDS set include a link to the CCDS ID in the tool tip. The gray box to the right (indicated by the vertical arrow) is the tool tip corresponding to the protein accession NP_002514.1. Differences between any two objects can also be revealed as vertical lines (indicated by horizontal arrows) when the objects (NM_002523.2 and ENST00000265634 in the figure) are selected while holding the ‘Control’ or ‘Command’ key on the keyboard.

Figure 5.

Distribution of human and mouse CCDS IDs by their ‘Review status’ in the current human (Release 20) and mouse (Release 21) CCDS releases at the time of data freeze. Details of the review status categories and sub-categories are provided in Table 1. Reviewed 1 = CCDS IDs reviewed ‘by RefSeq and HAVANA’, Reviewed 2 = CCDS IDs reviewed ‘by CCDS collaboration’, Reviewed 3 = CCDS IDs reviewed ‘by RefSeq, HAVANA and CCDS collaboration’.


