Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes
Hum Mol Genet. 2014 Nov 15;23(22):5866-78.
doi: 10.1093/hmg/ddu309. Epub 2014 Jun 16.
- PMID: 24939910
- PMCID: PMC4204768
- DOI: 10.1093/hmg/ddu309
Iakes Ezkurdia et al. Hum Mol Genet. 2014.
Abstract
Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
© The Author 2014. Published by Oxford University Press.
Figures
Figure 1.
The percentage of gene products detected in proteomics experiments as a function of gene conservation. Gene conservation is expressed using MI score, displayed in bins. Bin ‘0’ is MI scores from 0 to 0.019, ‘0.02’ is from 0.02 to 0.039, etc. The ‘missing’ genes are those where the conservation was so poor that INERTIA was not able to generate a score.
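The binning described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each gene is given as a hypothetical `(mi_score, detected)` pair, that a score of `None` stands for a gene INERTIA could not score, and that bins are labelled by their lower edge (so the caption's bin '0' appears as `"0.00"`).

```python
import math
from collections import defaultdict

BIN_WIDTH = 0.02  # bin width stated in the caption


def mi_bin(score):
    """Return the bin label for an MI score, or 'missing' when there is no score."""
    if score is None:
        return "missing"
    # Floor to the lower bin edge; the small epsilon guards against
    # floating-point error at exact bin boundaries such as 0.06.
    index = math.floor(score / BIN_WIDTH + 1e-9)
    return f"{index * BIN_WIDTH:.2f}"


def percent_detected_by_bin(genes):
    """genes: iterable of (mi_score_or_None, detected_bool) pairs (assumed input shape)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for score, detected in genes:
        label = mi_bin(score)
        totals[label] += 1
        if detected:
            hits[label] += 1
    # Percentage of genes with peptide evidence in each bin
    return {label: 100.0 * hits[label] / totals[label] for label in totals}
```

Calling `percent_detected_by_bin` on the full gene list would give one percentage per conservation bin, which is the quantity plotted on the vertical axis of the figure.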
Figure 2.
The percentage of genes for which peptides are detected in proteomics experiments against gene family age. Gene products with gene families that appeared in the oldest phylogenetic divisions (towards the left) are detected much more often in proteomics experiments than those genes with families that appeared in the most recent phylogenetic divisions.
Figure 3.
Transcript ubiquity for human genes. UniGene contains transcript evidence for most human genes over 45 different tissues. For each gene, we counted the number of tissues in which there was transcript evidence of at least five transcripts per million. We separated the numbers of tissues in which transcripts were detected in UniGene into ten bins and calculated the percentage of genes in each of the ten bins. We split the GENCODE 12 genes into three groups: those genes for which we found peptides (‘Detected’ in dark red), those genes for which we did not find peptides that were also in the potential non-coding set (‘Potential NC’ genes marked in yellow) and those for which we did not find peptides but that were not in the potential non-coding set (‘_Not Detected_’ genes, in orange).
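The counting step in this caption can be illustrated with a short Python sketch. The 45 tissues, the threshold of five transcripts per million and the ten bins come from the caption; everything else is an assumption made for illustration, including the input shape (a hypothetical mapping of gene to per-tissue values) and the use of equal-width bins over 0 to 45 tissues, since the caption does not say where the bin boundaries fall.

```python
N_TISSUES = 45      # number of UniGene tissues, from the caption
TPM_THRESHOLD = 5   # minimum transcripts per million, from the caption
N_BINS = 10         # number of bins, from the caption


def tissue_count(tpm_by_tissue):
    """Count tissues in which a gene reaches the transcript threshold."""
    return sum(1 for tpm in tpm_by_tissue.values() if tpm >= TPM_THRESHOLD)


def ubiquity_bin(count):
    """Map a tissue count (0..45) to a bin index (0..9).

    Equal-width binning is an assumption; the caption does not give the edges.
    """
    return count * N_BINS // (N_TISSUES + 1)


def bin_percentages(genes):
    """genes: dict of gene -> {tissue: tpm}. Returns the percentage of genes per bin."""
    counts = [0] * N_BINS
    for tpm_by_tissue in genes.values():
        counts[ubiquity_bin(tissue_count(tpm_by_tissue))] += 1
    total = len(genes)
    if total == 0:
        return [0.0] * N_BINS
    return [100.0 * c / total for c in counts]
```

Running `bin_percentages` separately on the 'Detected', 'Potential NC' and 'Not Detected' groups would give the three distributions compared in the figure.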
Figure 4.
RFC scores for pairwise alignments with four species. The RFC scores are calculated as per the section Materials and methods. RFC scores for alignments between (A) human and chimp, (B) human and macaque, (C) human and mouse and (D) human and dog. We split the GENCODE 12 genes into three groups, those genes for which we found peptides (‘Detected’ in dark red), those genes for which we did not find peptides and that were in the potential non-coding set (‘Potential NC’ genes marked in yellow) and those that we did not detect but that were not in the potential non-coding set (‘_Not Detected_’ genes, in orange). As a comparison, we included the results for a set of long non-coding genes (‘Non-coding’ shown in blue). RFC scores are shown on the _y_-axis; the _x_-axis is the proportion of each set. RFC scores are ordered from highest to lowest.
Figure 5.
RFC scores for genes from the potential NC set. The RFC scores were calculated as per the section Materials and methods for alignments between human and mouse only. We split the Potential NC set genes that we could classify into four groups: the 342 genes that we felt were likely protein-coding genes (Possible coding), the 396 genes that we felt were possible pseudogenes (Possible pseudogenes), the 229 read-through genes and the 969 genes that we felt were likely to be non-coding (Possible non-coding). We compared these four sets against three background sets: those protein-coding genes for which we found peptides (Detected in dark red), those coding genes for which we did not find peptides and that were not in the potential non-coding set (Not Detected genes, in orange) and a set of long non-coding genes (Non-coding shown in blue). RFC scores are shown on the _y_-axis; the _x_-axis in all the figures is the proportion of all the valid pairwise alignments included in the RFC calculations. RFC scores are ordered from highest to lowest.
Similar articles
- Pinstripe: a suite of programs for integrating transcriptomic and proteomic datasets identifies novel proteins and improves differentiation of protein-coding and non-coding genes.
Gascoigne DK, Cheetham SW, Cattenoz PB, Clark MB, Amaral PP, Taft RJ, Wilhelm D, Dinger ME, Mattick JS. Bioinformatics. 2012 Dec 1;28(23):3042-50. doi: 10.1093/bioinformatics/bts582. Epub 2012 Oct 7. PMID: 23044541
- Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function.
Ezkurdia I, del Pozo A, Frankish A, Rodriguez JM, Harrow J, Ashman K, Valencia A, Tress ML. Mol Biol Evol. 2012 Sep;29(9):2265-83. doi: 10.1093/molbev/mss100. Epub 2012 Mar 22. PMID: 22446687 Free PMC article.
- Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics.
Fermin D, Allen BB, Blackwell TW, Menon R, Adamski M, Xu Y, Ulintz P, Omenn GS, States DJ. Genome Biol. 2006;7(4):R35. doi: 10.1186/gb-2006-7-4-r35. Epub 2006 Apr 28. PMID: 16646984 Free PMC article.
- Proteogenomics: needs and roles to be filled by proteomics in genome annotation.
Ansong C, Purvine SO, Adkins JN, Lipton MS, Smith RD. Brief Funct Genomic Proteomic. 2008 Jan;7(1):50-62. doi: 10.1093/bfgp/eln010. Epub 2008 Mar 10. PMID: 18334489 Review.
- Small Proteins Encoded by Unannotated ORFs are Rising Stars of the Proteome, Confirming Shortcomings in Genome Annotations and Current Vision of an mRNA.
Delcourt V, Staskevicius A, Salzet M, Fournier I, Roucou X. Proteomics. 2018 May;18(10):e1700058. doi: 10.1002/pmic.201700058. Epub 2017 Oct 11. PMID: 28627015 Review.
Cited by
- Noncoding RNAs in gastric cancer: Research progress and prospects.
Zhang M, Du X. World J Gastroenterol. 2016 Aug 7;22(29):6610-8. doi: 10.3748/wjg.v22.i29.6610. PMID: 27547004 Free PMC article. Review.
- Alternatively Spliced Homologous Exons Have Ancient Origins and Are Highly Expressed at the Protein Level.
Abascal F, Ezkurdia I, Rodriguez-Rivas J, Rodriguez JM, del Pozo A, Vázquez J, Valencia A, Tress ML. PLoS Comput Biol. 2015 Jun 10;11(6):e1004325. doi: 10.1371/journal.pcbi.1004325. eCollection 2015 Jun. PMID: 26061177 Free PMC article.
- NOX4 has the potential to be a biomarker associated with colon cancer ferroptosis and immune infiltration based on bioinformatics analysis.
Yang X, Yu Y, Wang Z, Wu P, Su X, Wu Z, Gan J, Zhang D. Front Oncol. 2022 Sep 28;12:968043. doi: 10.3389/fonc.2022.968043. eCollection 2022. PMID: 36249057 Free PMC article.
- Compartment-Specific Proximity Ligation Expands the Toolbox to Assess the Interactome of the Long Non-Coding RNA NEAT1.
Mamontova V, Trifault B, Burger K. Int J Mol Sci. 2022 Apr 17;23(8):4432. doi: 10.3390/ijms23084432. PMID: 35457249 Free PMC article. Review.
- Long noncoding RNAs of single hematopoietic stem and progenitor cells in healthy and dysplastic human bone marrow.
Wu Z, Gao S, Zhao X, Chen J, Keyvanfar K, Feng X, Kajigaya S, Young NS. Haematologica. 2019 May;104(5):894-906. doi: 10.3324/haematol.2018.208926. Epub 2018 Dec 13. PMID: 30545929 Free PMC article.