Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes

Hum Mol Genet. 2014 Nov 15;23(22):5866-78.

doi: 10.1093/hmg/ddu309. Epub 2014 Jun 16.

Iakes Ezkurdia et al. Hum Mol Genet. 2014.

Abstract

Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.

© The Author 2014. Published by Oxford University Press.

Figures

Figure 1.

The percentage of gene products detected in proteomics experiments as a function of gene conservation. Gene conservation is expressed using MI score, displayed in bins. Bin ‘0’ is MI scores from 0 to 0.019, ‘0.02’ is from 0.02 to 0.039, etc. The ‘missing’ genes are those where the conservation was so poor that INERTIA was not able to generate a score.
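The binning described in this caption can be sketched as follows. This is a minimal illustration, assuming MI scores are reported to three decimal places and that a gene without an INERTIA score is represented as `None`; the function names and the input layout are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def mi_bin_label(score):
    """Return the 0.02-wide bin label for an MI score ('0', '0.02', ...).

    A score of None stands for a gene that could not be scored and is
    placed in the 'missing' bin (assumed representation).
    """
    if score is None:
        return "missing"
    # Work in integer thousandths so that floating-point error cannot
    # push a score such as 0.06 into the wrong bin.
    thousandths = round(score * 1000)
    bin_start = (thousandths // 20) * 20
    return f"{bin_start / 1000:g}"


def percent_detected_by_bin(genes):
    """genes: iterable of (mi_score_or_None, detected) pairs.

    Returns {bin_label: percentage of genes in that bin with peptides}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for score, detected in genes:
        label = mi_bin_label(score)
        totals[label] += 1
        hits[label] += int(detected)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}
```

Integer thousandths are used instead of dividing by 0.02 directly because that division can land just below a bin boundary in floating point.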

Figure 2.

The percentage of genes for which peptides are detected in proteomics experiments against gene family age. Gene products with gene families that appeared in the oldest phylogenetic divisions (towards the left) are detected much more often in proteomics experiments than those genes with families that appeared in the most recent phylogenetic divisions.

Figure 3.

Transcript ubiquity for human genes. UniGene contains transcript evidence for most human genes over 45 different tissues. For each gene, we counted the number of tissues in which there was transcript evidence of at least five transcripts per million. We separated the numbers of tissues in which transcripts were detected in UniGene into ten bins and calculated the percentage of genes in each of the ten bins. We split the GENCODE 12 genes into three groups: those genes for which we found peptides (‘Detected’, in dark red), those genes for which we did not find peptides and that were also in the potential non-coding set (‘Potential NC’, in yellow) and those for which we did not find peptides but that were not in the potential non-coding set (‘Not Detected’, in orange).
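The counting and binning steps in this caption could look roughly like the sketch below. The caption does not state how the ten bins are delimited, so equal-width bins over the range 0 to 45 tissues are an assumption here, as are the function names and the dictionary layout of the expression data.

```python
def tissues_expressed(tpm_by_tissue, threshold=5.0):
    """Count tissues with at least `threshold` transcripts per million."""
    return sum(1 for tpm in tpm_by_tissue.values() if tpm >= threshold)


def ubiquity_histogram(genes, n_tissues=45, n_bins=10):
    """genes: {gene_id: {tissue_name: transcripts_per_million}}.

    Returns a list of n_bins percentages, one per tissue-count bin.
    Bin boundaries are equal-width over 0..n_tissues (an assumption;
    the caption only says that ten bins were used).
    """
    counts = [0] * n_bins
    for tpm_by_tissue in genes.values():
        n = tissues_expressed(tpm_by_tissue)
        # Map a tissue count in 0..n_tissues onto a bin index 0..n_bins-1.
        counts[n * n_bins // (n_tissues + 1)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [100.0 * c / total for c in counts]
```

Running the function separately on the ‘Detected’, ‘Potential NC’ and ‘Not Detected’ gene sets would give one histogram per group, which is the comparison the figure draws.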

Figure 4.

RFC scores for pairwise alignments with four species. The RFC scores are calculated as per the section Materials and methods. RFC scores for alignments between (A) human and chimp, (B) human and macaque, (C) human and mouse and (D) human and dog. We split the GENCODE 12 genes into three groups: those genes for which we found peptides (‘Detected’, in dark red), those genes for which we did not find peptides and that were in the potential non-coding set (‘Potential NC’, in yellow) and those that we did not detect but that were not in the potential non-coding set (‘Not Detected’, in orange). For comparison, we included the results for a set of long non-coding genes (‘Non-coding’, in blue). RFC scores are shown on the _y_-axis; the _x_-axis is the proportion of each set. RFC scores are ordered from highest to lowest.
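The layout of these curves (scores ordered from highest to lowest, plotted against the proportion of the set) can be reproduced with a few lines. The RFC calculation itself is described in Materials and methods and is not reproduced here; the sketch below only assumes that a list of precomputed scores is available for each gene set, and the function name is hypothetical.

```python
def ranked_score_curve(scores):
    """Return (proportion_of_set, score) points with scores in descending order.

    The proportion for the i-th highest score is (i + 1) / n, so the
    final point always sits at x = 1.0.
    """
    ordered = sorted(scores, reverse=True)
    n = len(ordered)
    return [((i + 1) / n, score) for i, score in enumerate(ordered)]
```

Because the x-axis is a proportion rather than a gene count, curves for sets of very different sizes can be overlaid on the same panel.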

Figure 5.

RFC scores for genes from the potential NC set. The RFC scores were calculated as per the section Materials and methods for alignments between human and mouse only. We split the potential NC set genes that we could classify into four groups: the 342 genes that we felt were likely protein-coding genes (‘Possible coding’), the 396 genes that we felt were possible pseudogenes (‘Possible pseudogenes’), the 229 read-through genes and the 969 genes that we felt were likely to be non-coding (‘Possible non-coding’). We compared these four sets against three background sets: those protein-coding genes for which we found peptides (‘Detected’, in dark red), those coding genes for which we did not find peptides and that were not in the potential non-coding set (‘Not Detected’, in orange) and a set of long non-coding genes (‘Non-coding’, in blue). RFC scores are shown on the _y_-axis; the _x_-axis in all the figures is the proportion of all the valid pairwise alignments included in the RFC calculations. RFC scores are ordered from highest to lowest.
