Gene characterization index: assessing the depth of gene annotation

Danielle Kemmer et al. PLoS One. 2008.

Abstract

Background: We introduce the Gene Characterization Index, a bioinformatics method for scoring the extent to which a protein-encoding gene is functionally described. Inherently a reflection of human perception, the Gene Characterization Index is applied to assess the characterization status of individual genes. It thereby serves both genome annotation and applied genomics research by enabling rapid and unbiased identification of groups of uncharacterized genes for applications such as directed functional studies and delineation of novel drug targets.

Methodology/principal findings: The scoring procedure is based on a global survey of researchers, who assigned characterization scores from 1 (poor) to 10 (extensive) for a sample of genes based on major online resources. By evaluating the survey as training data, we developed a bioinformatics procedure to assign gene characterization scores to all genes in the human genome. We analyzed snapshots of functional genome annotation over a period of 6 years to assess temporal changes reflected by the increase of the average Gene Characterization Index. Applying the Gene Characterization Index to genes within pharmaceutically relevant classes, we confirmed known drug targets as high-scoring genes and revealed potentially interesting novel targets with low characterization indexes. Removing known drug targets and genes linked to sequence-related patent filings from the entirety of indexed genes, we identified sets of low-scoring genes particularly suited for further experimental investigation.
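The final filtering step described above can be illustrated with a minimal sketch. It assumes a hypothetical mapping from gene identifiers to GCI scores and two hypothetical exclusion sets; the gene names are invented, and the 3.5 cutoff is borrowed from the Figure 4 caption rather than stated in the abstract. It is not the authors' pipeline.

```python
from typing import Dict, List, Set


def low_scoring_candidates(
    gci: Dict[str, float],
    drug_targets: Set[str],
    patented: Set[str],
    threshold: float = 3.5,  # assumption: cutoff taken from the Figure 4 caption
) -> List[str]:
    """Return genes that are neither known targets nor patented and score below threshold."""
    excluded = drug_targets | patented
    return sorted(
        gene for gene, score in gci.items()
        if gene not in excluded and score < threshold
    )


# Hypothetical toy data; real identifiers and scores would come from the GCI resource.
scores = {"GENE_A": 9.1, "GENE_B": 2.0, "GENE_C": 1.4, "GENE_D": 3.0, "GENE_E": 6.2}
candidates = low_scoring_candidates(
    scores,
    drug_targets={"GENE_A", "GENE_C"},
    patented={"GENE_D"},
)
print(candidates)
```

Using sets for the exclusion lists keeps the membership test cheap even when the exclusion lists cover many thousands of genes.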

Conclusions/significance: The Gene Characterization Index is intended to serve as a tool for the scientific community and granting agencies, helping them focus resources and efforts on unexplored areas of the genome. The Gene Characterization Index is available from http://cisreg.ca/gci/.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. GCI Model Cross-validation Performance.

GCI predictor performance: leave-one-out cross-validation results for the final GCI predictor model, which uses the MARS (multivariate adaptive regression splines) method on z-score normalized data. The X-axis displays the average evaluator-assigned scores, while the Y-axis displays the predicted score for each gene in the leave-one-out cross-validation analysis (the score assigned when the gene was not included in the training data). As observed, the MARS model can assign scores greater than 10 (in all further analyses such scores are rounded down to 10).
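The evaluation scheme in this caption can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: MARS is replaced by a plain one-variable least-squares line so that the example needs nothing beyond the standard library, the single input feature and all numbers are hypothetical, and computing the z-score parameters inside each fold is a choice made here, not something the source confirms.

```python
from statistics import mean, pstdev


def zscore_params(values):
    """Mean and population standard deviation (falls back to 1.0 if constant)."""
    sigma = pstdev(values)
    return mean(values), (sigma if sigma > 0 else 1.0)


def fit_line(xs, ys):
    """Ordinary least squares for one feature; a stand-in for MARS, not MARS itself."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def loo_predictions(feature, scores, cap=10.0):
    """Predict each gene's score from a model trained on all other genes."""
    preds = []
    for i in range(len(feature)):
        train_x = feature[:i] + feature[i + 1:]
        train_y = scores[:i] + scores[i + 1:]
        mu, sigma = zscore_params(train_x)          # normalize with training data only
        zx = [(x - mu) / sigma for x in train_x]
        slope, intercept = fit_line(zx, train_y)
        pred = slope * (feature[i] - mu) / sigma + intercept
        preds.append(min(pred, cap))                # scores above 10 are set to 10
    return preds


# Hypothetical feature values and average evaluator scores for five genes.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
evaluator_scores = [1.0, 3.0, 5.5, 8.0, 9.5]
predicted = loo_predictions(feature, evaluator_scores)
print(predicted)
```

With this toy data the held-out prediction for the last gene exceeds 10 before capping, which mirrors the behavior the caption describes for the MARS model.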

Figure 2. Genome-wide GCI Score Distribution.

Histogram displaying the frequency of scores observed in the analysis of genes at 3 different time points after the release of the first draft of the human genome sequence. Genes based only on predictions and/or EST sequences have been removed (∼3000 genes in 2007 data).

Figure 3. Resnik Scores for Depth of GO Gene Annotation Correspond with GCI Scores.

The Resnik score describes the granularity of annotations attached to each gene. There is an overall Pearson correlation of 0.6 between GCI and Resnik scores. The plot shows the distribution of Resnik scores within ranges of GCI scores.
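For readers unfamiliar with the statistic quoted in this caption, the following is a minimal sketch of a Pearson correlation between two paired score lists. The function name and the sample values are hypothetical; the sketch does not reproduce the reported value of 0.6, which depends on the authors' data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical paired scores for six genes (GCI score, Resnik-style depth score).
gci_scores = [1.2, 2.5, 4.0, 6.3, 8.1, 9.7]
resnik_scores = [0.5, 2.9, 2.2, 4.8, 4.1, 6.0]
r = pearson(gci_scores, resnik_scores)
print(round(r, 3))
```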

Figure 4. A. Distribution of GCI Scores for Genes in Selected Protein Families and Classes.

750 G Protein-Coupled Receptors: 79 DTG, 671 NDTG; 50 Nuclear Receptors: 23 DTG, 27 NDTG; 66 Ligand-Gated Ion Channels: 21 DTG, 45 NDTG; 111 Potassium Ion Channels: 14 DTG, 97 NDTG.

B. Genome-wide GCI Score Distribution for Drug Targets, Patented and All Other Genes.

Based on genome release July 2006: 1095 drug targets, 14237 patented, 14913 non-target, non-patented genes. Of the non-target, non-patented genes, 10867 were highly uncharacterized, with GCI scores <3.5.
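The counts in this caption can be checked with simple arithmetic. The short sketch below uses only the numbers given in the caption; the variable names are chosen here for illustration. It confirms that each family's DTG and NDTG counts add up to the stated total and computes the share of non-target, non-patented genes scoring below 3.5.

```python
# (total, DTG, NDTG) per family, copied from the Figure 4 caption.
families = {
    "G protein-coupled receptors": (750, 79, 671),
    "Nuclear receptors": (50, 23, 27),
    "Ligand-gated ion channels": (66, 21, 45),
    "Potassium ion channels": (111, 14, 97),
}

for name, (total, dtg, ndtg) in families.items():
    # The two subgroups should account for every gene in the family.
    assert dtg + ndtg == total, name
    print(f"{name}: {dtg / total:.1%} DTG")

# Panel B: non-target, non-patented genes and the subset with GCI < 3.5.
other_genes = 14913
low_scoring_other = 10867
share_low = low_scoring_other / other_genes
print(f"Share of non-target, non-patented genes with GCI < 3.5: {share_low:.1%}")
```

By this calculation, roughly 73% of the non-target, non-patented genes fall below the 3.5 cutoff.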

Figure 5. Evolution of Patented versus Non-Patented Genes between 2001 and 2007.

Histogram showing substantial differences in annotation progress between patented and non-patented genes. Gene numbers fluctuate because of changes in genome annotations and transcript mappings.

Figure 6. Screenshot of GCI Web Page.

Example entry for Calmodulin-like protein 6 as returned by the GCI search engine, showing the gene-specific GCI score and links to data sources.
