Detection of protein coding sequences using a mixture model for local protein amino acid sequence. (original) (raw)

Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene nding approaches. A nite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short ( 50 amino acids) non-coding segments extracted from the S. cerevisiea genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.

Loading...

Loading Preview

Sorry, preview is currently unavailable. You can download the paper by clicking the button above.