Computational inference of homologous gene structures in the human genome - PubMed

Exon- and nucleotide-level accuracy of similarity-based gene-prediction programs as a function of protein similarity. (A) Exon-level sensitivity (ESn: percent of exons predicted exactly) and (B) exon-level specificity (ESp: percent of predicted exons exactly correct) were calculated for subsets of the SingleGene dataset, grouped according to the level of BLASTP similarity (in the context of a database search) between the encoded protein and the protein used in the prediction, for GenomeScan, Procrustes, and GeneWise (as described by Guigó et al. 2000). The definitions of the subsets and the number of genes per subset were as follows: 10^-5 > _P_ > 10^-10 (90); 10^-10 > _P_ > 10^-20 (103); 10^-20 > _P_ > 10^-30 (102); 10^-30 > _P_ > 10^-40 (97); 10^-40 > _P_ > 10^-60 (114); 10^-60 > _P_ > 10^-80 (97); 10^-80 > _P_ > 10^-120 (97); and _P_ < 10^-120 (72). For example, 114 of the 175 sequences in the SingleGene dataset had a homolog with a BLAST _P_-value in the range 10^-60 < _P_ < 10^-40. For sequences in this subset, GenomeScan was run using the results of a BLASTX run of the genomic sequence against the top hit in the nonredundant protein database that had sequence similarity in the desired range (10^-40 > _P_ > 10^-60). GeneWise and Procrustes data, run using the same peptides as input, are from Guigó et al. (2000). (C) Nucleotide-level sensitivity (NSn: percent of coding nucleotides predicted correctly) and (D) nucleotide-level specificity (NSp: percent of predicted coding nucleotides that are correct). Accuracy statistics on the SingleGene dataset as a whole for the ab initio gene-prediction methods GENSCAN, HMMGene 1.1, and GRAIL 3.1, respectively, were as follows: ESn (0.79, 0.75, 0.47); ESp (0.77, 0.68, 0.61); NSn (0.93, 0.86, 0.68); NSp (0.91, 0.74, 0.94).
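
The four accuracy measures defined in this legend can be illustrated with a minimal Python sketch. It is not the evaluation code used in the study. The function names, the representation of exons as 1-based inclusive `(start, end)` pairs, and the toy coordinates are assumptions made for illustration. Values are returned as fractions, matching the 0-to-1 figures quoted for the ab initio programs. The sketch scores a single sequence and makes no claim about how the study aggregated results across sequences.

```python
from typing import Set, Tuple

# Hypothetical representation: an exon is a (start, end) pair,
# 1-based and inclusive. This convention is an assumption.
Exon = Tuple[int, int]


def exon_accuracy(annotated: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (ESn, ESp) as fractions.

    ESn: share of annotated exons predicted exactly (both boundaries match).
    ESp: share of predicted exons that are exactly correct.
    """
    exact = len(annotated & predicted)
    esn = exact / len(annotated) if annotated else 0.0
    esp = exact / len(predicted) if predicted else 0.0
    return esn, esp


def coding_positions(exons: Set[Exon]) -> Set[int]:
    """Expand exon intervals into the set of coding nucleotide positions."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (NSn, NSp) as fractions.

    NSn: share of annotated coding nucleotides that were predicted as coding.
    NSp: share of predicted coding nucleotides that are annotated as coding.
    """
    true_pos = coding_positions(annotated)
    pred_pos = coding_positions(predicted)
    shared = len(true_pos & pred_pos)
    nsn = shared / len(true_pos) if true_pos else 0.0
    nsp = shared / len(pred_pos) if pred_pos else 0.0
    return nsn, nsp


# Toy example (made-up coordinates): one exon is predicted exactly,
# one has a shifted start, and one prediction is entirely spurious.
annotated = {(101, 200), (301, 400), (501, 600)}
predicted = {(101, 200), (311, 400), (701, 750)}

esn, esp = exon_accuracy(annotated, predicted)
nsn, nsp = nucleotide_accuracy(annotated, predicted)
print(f"ESn={esn:.2f} ESp={esp:.2f} NSn={nsn:.2f} NSp={nsp:.2f}")
```

In the toy example, the exon with the shifted start counts as a miss at the exon level but still contributes 90 correct nucleotides. This is why nucleotide-level scores are typically higher than exon-level scores for the same prediction, as in the figures quoted above for GENSCAN and HMMGene 1.1.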