GenBlastA: enabling BLAST to identify homologous gene sequences

Rong She et al. Genome Res. 2009 Jan.

Abstract

BLAST is an extensively used local similarity search tool for identifying homologous sequences. When a gene sequence (either a protein sequence or a nucleotide sequence) is used as a query to search for homologous sequences in a genome, the search results, represented as a list of high-scoring pairs (HSPs), are fragments of candidate genes rather than full-length candidate genes. Relevant HSPs ("signals"), which represent candidate genes in the target genome sequences, are buried within a report that also contains hundreds to thousands of random HSPs ("noise"). Consequently, BLAST results are often overwhelming and confusing, even to experienced users. To use BLAST effectively, a program is needed that extracts the relevant HSPs representing candidate homologous genes from the entire HSP report. To achieve this goal, we have designed a graph-based algorithm, genBlastA, which automatically filters HSPs into well-defined groups, each representing a candidate gene in the target genome. The novelty of genBlastA is an edge length metric that reflects a set of biologically motivated requirements, so that each shortest path corresponds to an HSP group representing a homologous gene. We have demonstrated that this novel algorithm is both efficient and accurate for identifying homologous sequences, and that it outperforms existing approaches with similar functionalities.
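The idea of reading an HSP group off a shortest path can be illustrated with a generic collinear-chaining sketch. This is not genBlastA's edge length metric, which the abstract does not specify: the `HSP` fields, the `edge_length` formula, and the `gap_weight` parameter are all hypothetical placeholders chosen only to make the example runnable.

```python
from typing import List, NamedTuple, Optional


class HSP(NamedTuple):
    """One high-scoring pair; field names are illustrative assumptions."""
    q_start: int   # start on the query
    q_end: int     # end on the query
    t_start: int   # start on the target genome
    t_end: int     # end on the target genome
    score: float   # BLAST-style score


def edge_length(a: HSP, b: HSP, gap_weight: float) -> float:
    """Placeholder length (NOT the genBlastA metric): penalise the genomic
    gap between a and b, and reward the score of the destination HSP."""
    return gap_weight * (b.t_start - a.t_end) - b.score


def best_chain(hsps: List[HSP], gap_weight: float = 0.001) -> List[HSP]:
    """Shortest path in a DAG whose nodes are HSPs sorted by target start."""
    order = sorted(hsps, key=lambda h: h.t_start)
    n = len(order)
    if n == 0:
        return []
    # dist[j]: length of the shortest path that ends at HSP j.
    dist = [-h.score for h in order]
    prev: List[Optional[int]] = [None] * n
    for j in range(n):
        for i in range(j):
            a, b = order[i], order[j]
            # An edge exists only if b follows a in both query and target.
            if a.q_end < b.q_start and a.t_end < b.t_start:
                cand = dist[i] + edge_length(a, b, gap_weight)
                if cand < dist[j]:
                    dist[j], prev[j] = cand, i
    # Walk back from the endpoint of the overall shortest path.
    end: Optional[int] = min(range(n), key=dist.__getitem__)
    path = []
    while end is not None:
        path.append(order[end])
        end = prev[end]
    return path[::-1]


hsps = [
    HSP(1, 50, 1000, 1150, 80.0),      # plausible first exon
    HSP(51, 100, 1400, 1550, 90.0),    # plausible second exon
    HSP(20, 60, 90000, 90120, 30.0),   # distant, low-scoring hit
]
group = best_chain(hsps)
```

Here the two nearby, collinear HSPs form the returned group and the distant hit is left out. Under this placeholder metric, separating several genes would require repeating the search on the remaining HSPs.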

Figures

Figure 1.

Grouping of HSPs into groups representing paralogs (Gene1 and Gene2) in tandem in the target genome. For simplicity, this figure shows only a small portion of the HSPs returned by BLAST. Each HSP may correspond to a coding segment (likely an exon) of a gene, so a group of HSPs may collectively represent a full-length gene. Each shaded box at the bottom of the figure represents an HSP at its corresponding genomic position. Candidate genes are shown on the genome, with exons (black boxes) connected by introns (lines). The HSP groups that best represent the genes are shown under the corresponding genes, with the relevant HSPs in each group circled. Because the two paralogous genes lie in tandem, the boundary between them must be correctly resolved.

Figure 2.

Grouping HSPs into groups representing individual genes. genBlastA was able to resolve all five members, while ML resolved only two and WU only one. Gene models are shown in the Gene Models track. HSPs are shown as blue boxes in the All HSPs track. The shade of each box indicates the PID of the HSP, with darker shades indicating higher PID. The genBlastA Group, ML Group, and WU Group tracks show the HSP groupings returned by genBlastA, ML, and WU-BLAST, respectively.

Figure 3.

Grouping of HSPs to represent individual homologous genes in tandem clusters. This figure shows the average resolve rate across 30 tandemly duplicated gene clusters in the EvsE data set for genBlastA (GB), Cui et al. (2007) (ML), and WU-BLAST (WU). The ratio of specific groups was calculated as the number of genes resolved divided by the total number of genes in each tandem gene cluster. A gene is considered resolved if the HSP group overlaps exactly one gene in WormBase and the span similarity is ≥50%. Gapped and ungapped denote two independent BLAST runs, using the gapped and ungapped settings, respectively. The GB alpha value is 0.5. The ML distance threshold is 1000. Error bars, SE. (***) Statistical significance (P < 0.001) by paired Student’s t-test.
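The "resolved" criterion and the per-cluster ratio described in this caption could be computed roughly as follows. This is a minimal sketch under stated assumptions: spans are closed 1-based intervals `(start, end)`, span similarity is the Jaccard similarity of two such intervals (as stated in the Figure 4 caption), and the function names and the example coordinates are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # closed interval (start, end), an assumed convention


def overlaps(a: Span, b: Span) -> bool:
    """True if the two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def span_jaccard(a: Span, b: Span) -> float:
    """Jaccard similarity of two closed intervals, counted in positions."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def resolved_genes(groups: List[Span], genes: Dict[str, Span],
                   min_similarity: float = 0.5) -> Set[str]:
    """Genes hit by a group that overlaps exactly one gene and whose
    span similarity to that gene is at least min_similarity."""
    resolved = set()
    for group in groups:
        hits = [name for name, span in genes.items() if overlaps(group, span)]
        if len(hits) == 1 and span_jaccard(group, genes[hits[0]]) >= min_similarity:
            resolved.add(hits[0])
    return resolved


def resolve_ratio(groups: List[Span], genes: Dict[str, Span]) -> float:
    """Number of resolved genes over the total number of genes in the cluster."""
    return len(resolved_genes(groups, genes)) / len(genes)


# Made-up tandem cluster: the second group spans both genes, so it resolves neither.
genes = {"gene1": (100, 500), "gene2": (600, 1000)}
groups = [(120, 480), (450, 900)]
ratio = resolve_ratio(groups, genes)
```

In this made-up cluster only `gene1` is resolved, so the ratio is 0.5.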

Figure 4.

(A) Average coverage for the EvsE data set. (B) Average span similarity for the EvsE data set. (C) Average coverage for the EvsB data set. (D) Average span similarity for the EvsB data set. In all cases, the values are averages over 464 test genes for three programs: genBlastA (GB), Cui et al. (2007) (ML), and WU-BLAST (WU). Gapped and ungapped denote two independent BLAST runs, using the gapped and ungapped settings, respectively. Span similarity is calculated as the Jaccard similarity. The GB alpha value is 0.5. The ML distance threshold is 1000. Error bars, SE. (***) Statistical significance (P < 0.001) by paired Student’s t-test.
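As an illustration of the span similarity measure, the Jaccard similarity of two genomic spans can be computed as the length of their intersection divided by the length of their union. The sketch below assumes each span is a single closed 1-based interval `(start, end)`; the caption does not say how spans are represented, so this convention and the function name are assumptions.

```python
from typing import Tuple

Span = Tuple[int, int]  # closed interval (start, end), an assumed convention


def span_jaccard(a: Span, b: Span) -> float:
    """Jaccard similarity of two closed intervals, counted in positions."""
    # Shared positions; zero when the intervals do not overlap.
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    # Positions covered by either interval.
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


predicted = (100, 199)   # hypothetical span of an HSP group
annotated = (150, 249)   # hypothetical span of the annotated gene
similarity = span_jaccard(predicted, annotated)  # 50 shared / 150 covered
```

Identical spans give 1.0 and disjoint spans give 0.0, so the measure penalises both over-extended and truncated groups.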

Figure 5.

(A) HSPs returned by BLAST. Q1, Q2, Q3, and Q4 represent query segments, while T1, T2, T3, T4, T5, and T6 represent target segments. (B) Example of groups of HSPs. (C) The HSP graph, with solid lines representing edges and dotted lines indicating skip edges. (D) The HSP graph, with vertical bars indicating separating edges.
