William Hayes - Academia.edu

Papers by William Hayes

AZuRe, a scalable system for automated term disambiguation of gene and protein names

Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004.

Ontology-Based Interactive Information Extraction From Scientific Abstracts

Comparative and Functional Genomics, 2005

Over recent years, there has been a growing interest in extracting information automatically or semi-automatically from the scientific literature. This paper describes a novel ontology-based interactive information extraction (OBIIE) framework and a specific OBIIE system. We describe how this system enables life scientists to make ad hoc queries similar to using a standard search engine, but where the results are obtained in a database format similar to a pre-programmed information extraction engine. We present a case study in which the system was evaluated for extracting co-factors from EMBASE and MEDLINE.

Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL)

Database: The Journal of Biological Databases and Curation, 2016

Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, w...
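As a rough illustration of the nanopub structure this abstract describes (a BEL statement plus citation and supporting evidence), the sketch below models one as a small Python record. The class name `BelNanopub`, the helper `namespaces_in`, the example statement, and the placeholder citation are all assumptions made for illustration; they are not taken from the BioCreative V corpora or from any BEL tooling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BelNanopub:
    """Hypothetical container: one BEL statement with its provenance."""
    statement: str   # the BEL statement itself
    citation: str    # where the relationship was reported
    evidence: str    # supporting text excerpt


def namespaces_in(statement: str) -> list:
    """Return the sorted namespace prefixes (e.g. HGNC, GO) used in a statement.

    Assumes prefixes are upper-case tokens immediately followed by a colon.
    """
    return sorted(set(re.findall(r"\b([A-Z][A-Z0-9]*):", statement)))


# Illustrative values only; the citation is a placeholder, not a real record.
example = BelNanopub(
    statement='p(HGNC:TNF) increases bp(GO:"inflammatory response")',
    citation="PMID:<hypothetical>",
    evidence="<text excerpt supporting the statement>",
)
print(namespaces_in(example.statement))
```

Keeping the statement as a plain string and deriving the namespaces from it is only one possible design; a real extraction system would more likely parse the statement into typed terms.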

Reputation-Based Collaborative Network Biology

On Crowd-verification of Biological Networks

Bioinformatics and Biology Insights, 2013

Biological networks with a structured syntax are a powerful way of representing biological information generated from high density data; however, they can become unwieldy to manage as their size and complexity increase. This article presents a crowd-verification approach for the visualization and expansion of biological networks. Web-based graphical interfaces allow visualization of causal and correlative biological relationships represented using Biological Expression Language (BEL). Crowdsourcing principles enable participants to communally annotate these relationships based on literature evidence. Gamification principles are incorporated to further engage domain experts throughout biology to gather robust peer-reviewed information from which relationships can be identified and verified. The resulting network models will represent the current status of biological knowledge within the defined boundaries, here processes related to human lung disease. These models are amenable to co...

Bacterial start site prediction

Nucleic Acids Research, 1999

With the growing number of completely sequenced bacterial genes, accurate gene prediction in bacterial genomes remains an important problem. Although the existing tools predict genes in bacterial genomes with high overall accuracy, their ability to pinpoint the translation start site remains unsatisfactory. In this paper, we present a novel approach to bacterial start site prediction that takes into account multiple features of a potential start site, viz., ribosome binding site (RBS) binding energy, distance of the RBS from the start codon, distance from the beginning of the maximal ORF to the start codon, the start codon itself and the coding/non-coding potential around the start site. Mixed integer programming was used to optimize the discriminatory system. The accuracy of this approach is up to 90%, compared to 70% using the most common tools in fully automated mode (that is, without expert human post-processing of results). The approach is evaluated using Bacillus subtilis, Escherichia coli and Pyrococcus furiosus. These three genomes cover a broad spectrum of bacterial genomes, since B. subtilis is a Gram-positive bacterium, E. coli is a Gram-negative bacterium and P. furiosus is an archaebacterium. A significant problem is generating a set of 'true' start sites for algorithm training, in the absence of experimental work. We found that sequence conservation between P. furiosus and the related Pyrococcus horikoshii clearly delimited the gene start in many cases, providing a sufficient training set.
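Two of the features listed in this abstract (the start codon itself and the distance from the beginning of the maximal ORF) can be sketched in a few lines of Python. This is a minimal sketch, not the paper's method: the function name `candidate_starts` is hypothetical, the maximal ORF is assumed to begin at the most upstream in-frame start codon after the previous in-frame stop, ATG/GTG/TTG are assumed as start codons, and the RBS and coding-potential features and the mixed integer programming step are not modelled.

```python
# Assumed codon sets for a bacterial genome (standard start and stop codons).
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, stop_index):
    """List in-frame candidate start codons upstream of a stop codon.

    seq        -- DNA sequence of the coding strand
    stop_index -- 0-based index of the first base of the ORF's stop codon
    Returns one dict per candidate with its position, codon, and distance
    (in bases) from the most upstream candidate, taken here as the
    beginning of the maximal ORF.
    """
    seq = seq.upper()
    # Walk upstream codon by codon until the previous in-frame stop
    # codon or the start of the sequence.
    orf_begin = stop_index
    pos = stop_index - 3
    while pos >= 0 and seq[pos:pos + 3] not in STOP_CODONS:
        orf_begin = pos
        pos -= 3
    positions = [
        p for p in range(orf_begin, stop_index, 3)
        if seq[p:p + 3] in START_CODONS
    ]
    if not positions:
        return []
    first = positions[0]
    return [
        {"position": p, "codon": seq[p:p + 3],
         "distance_from_orf_start": p - first}
        for p in positions
    ]


# Toy sequence: TAA ATG AAA GTG CCC TAA, with the final stop at index 15.
for candidate in candidate_starts("TAAATGAAAGTGCCCTAA", 15):
    print(candidate)
```

Each candidate record could then be extended with further features before being scored by whatever discriminator is chosen.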

Causal biological network database: a comprehensive platform of causal biological network models focused on the pulmonary and vascular systems

Database: The Journal of Biological Databases and Curation, 2015

With the wealth of publications and data available, powerful and transparent computational approaches are required to represent measured data and scientific knowledge in a computable and searchable format. We developed a set of biological network models, scripted in the Biological Expression Language, that reflect causal signaling pathways across a wide range of biological processes, including cell fate, cell stress, cell proliferation, inflammation, tissue repair and angiogenesis in the pulmonary and cardiovascular context. This comprehensive collection of networks is now freely available to the scientific community in a centralized web-based repository, the Causal Biological Network database, which is composed of over 120 manually curated and well annotated biological network models and can be accessed at http://causalbionet.com. The website accesses a MongoDB database, which stores all versions of the networks as JSON objects and allows users to search for genes, proteins, biological proc...

GeneLynx: A Gene-Centric Portal to the Human Genome

Genome Research, 2001

GeneLynx is a meta-database providing an extensive collection of hyperlinks to human gene-specific information in diverse databases available on the Internet. The GeneLynx project is based on the simple notion that given any gene-specific identifier (accession number, gene name, text, or sequence), scientists should be able to access a single location that provides a set of links to all the publicly available information pertinent to the specified human gene. GeneLynx was implemented as an extensible relational database with an intuitive and user-friendly Web interface. The data are automatically extracted from more than 40 external resources, using appropriate approaches to maximize coverage of the available data. Construction and curation of the system are mediated by a custom set of software tools. An indexing utility is provided to facilitate the establishment of hyperlinks in external databases. A unique feature of the GeneLynx system is a communal curation system for user-aided annotation. GeneLynx can be accessed freely at http://www.genelynx.org.

Information needs and the role of text mining in drug development

Pacific Symposium on Biocomputing, 2008

Drug development generates information needs from groups throughout a company. Knowing where to look for high-quality information is essential for minimizing costs and remaining competitive. Using 1131 research requests that came to our library between 2001 and 2007, we show that drugs, diseases, and genes/proteins are the most frequently searched subjects, and journal articles, patents, and competitive intelligence literature are the most frequently consulted textual resources.

Computer survey for likely genes in the one megabase contiguous genomic sequence data of Synechocystis sp. strain PCC6803

DNA Research: An International Journal for Rapid Publication of Reports on Genes and Genomes, 1995


Applications of GeneMark in multispecies environments

Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB), 1996

This paper is intended to bridge the gap between practical experience in using GeneMark for a rapidly widening repertoire of genomes, and the available publications that determine and compare the gene prediction accuracy of the GeneMark method for different genomes. Here we focus on the genome-specific variability of prediction error rates and their sources. DNA sequence inhomogeneity is present both in training and control sets of coding and non-coding regions. Coding region inhomogeneity, caused by differences in sequence composition between "native" and horizontally transferred genes or between genes expressed at different levels, contributes to the false negative error rate. Inhomogeneity of non-coding regions may frequently be caused by the presence of unnoticed genes and contributes to the false positive error rate. We have documented such unnoticed genes in GenBank sequences for several species. Some of the protein products of these genes have been characterized by similarity search methods. For others, which we call "pioneer genes", no significant similarity has been found at a protein sequence level although the confidence of GeneMark prediction is high. For instance, to date a majority of those pioneer gene predictions made for E. coli now show strong similarity to more recently characterized proteins that have been added to protein sequence databases. Another practical question is related to genomic sequence inhomogeneity at the interspecies level: if GeneMark has not been trained for a particular species, is it possible to apply models derived for phylogenetically close genomes? The answer is yes. The results of cross-species gene prediction experiments show that cross-species prediction can often be reasonably accurate.

Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli

Current Biology, 1996

Background: The 1.83 Megabase (Mb) sequence of the Haemophilus influenzae chromosome, the first completed genome sequence of a cellular life form, has been recently reported. Approximately 75% of the 4.7 Mb genome sequence of Escherichia coli is also available. The life styles of the two bacteria are very different: H. influenzae is an obligate parasite that lives in human upper respiratory mucosa and can be cultivated only on rich media, whereas E. coli is a saprophyte that can grow on minimal media. A detailed comparison of the protein products encoded by these two genomes is expected to provide valuable insights into bacterial cell physiology and genome evolution.

Gene identification and classification in the Synechocystis genomic sequence by recursive gene mark analysis

DNA Sequence: The Journal of DNA Sequencing and Mapping, 1997

The GeneMark method has proven to be an efficient gene-finding tool for the analysis of prokaryotic genomic sequence data. We have developed a procedure for deriving and utilizing several GeneMark models in order to get better gene-detection performance. Upon applying this procedure to the 1.0 Mb contiguous DNA sequence of Synechocystis sp. strain PCC6803, we were able to cluster predicted genes into distinct classes and to produce class-specific GeneMark models reflecting the statistical characteristics of each gene class. One gene class apparently includes genes of exogenous origin. Using class-specific models reduces the gene underprediction error rate to 1.7%, compared with the 8.1% reported in the previous study, in which only one GeneMark model was used.

The complete genome sequence of the gastric pathogen Helicobacter pylori

Nature, 1997

Helicobacter pylori, strain 26695, has a circular genome of 1,667,867 base pairs and 1,590 predicted coding sequences. Sequence analysis indicates that H. pylori has well-developed systems for motility, for scavenging iron, and for DNA restriction and modification. Many putative adhesins, lipoproteins and other outer membrane proteins were identified, underscoring the potential complexity of host-pathogen interaction. Based on the large number of sequence-related genes encoding outer membrane proteins and the presence of homopolymeric tracts and dinucleotide repeats in coding sequences, H. pylori, like several other mucosal pathogens, probably uses recombination and slipped-strand mispairing within repeats as mechanisms for antigenic variation and adaptive evolution. Consistent with its restricted niche, H. pylori has a few regulatory networks, and a limited metabolic repertoire and biosynthetic capacity. Its survival in acid conditions depends, in part, on its ability to establish a positive inside-membrane potential in low pH.

Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction

Pacific Symposium on Biocomputing, 1998

Accurate prediction of the position of translation initiation (N-terminal prediction) is a difficult problem. N-terminal prediction from DNA sequence alone is ambiguous if several candidate start sites are close to each other. Protein similarity search is usually unable to indicate the true start of a gene, as it would require strong protein sequence similarity at the N-terminal portion of a protein, where conserved regions are rarely situated. With the aid of the GeneMark program for gene identification, we extract DNA sequence fragments presumably containing ribosome binding sites (RBS) from unannotated complete genomic sequences. These DNA segments are aligned to generate the RBS model using the Gibbs sampling method. N-terminal prediction is then performed by using the RBS model in conjunction with the GeneMark start codon prediction to aid in determining the true N-terminal site.

How to interpret an anonymous bacterial genome: machine learning approach to gene identification

Genome Research, 1998


Research paper thumbnail of GeneLynx: A Gene-Centric Portal to the Human Genome

Genome Research, 2001

GeneLynx is a meta-database providing an extensive collection of hyperlinks to human gene-specifi... more GeneLynx is a meta-database providing an extensive collection of hyperlinks to human gene-specific information in diverse databases available on the Internet. The GeneLynx project is based on the simple notion that given any gene-specific identifier (accession number, gene name, text, or sequence), scientists should be able to access a single location that provides a set of links to all the publicly available information pertinent to the specified human gene. GeneLynx was implemented as an extensible relational database with an intuitive and user-friendly Web interface. The data are automatically extracted from more than 40 external resources, using appropriate approaches to maximize coverage of the available data. Construction and curation of the system is mediated by a custom set of software tools. An indexing utility is provided to facilitate the establishment of hyperlinks in external databases. A unique feature of the GeneLynx system is a communal curation system for user-aided...

Research paper thumbnail of AZuRe, a scalable system for automated term disambiguation of gene and protein names

Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004.

Research paper thumbnail of Ontology-Based Interactive Information Extraction From Scientific Abstracts

Comparative and Functional Genomics, 2005

Over recent years, there has been a growing interest in extracting information automatically or s... more Over recent years, there has been a growing interest in extracting information automatically or semi-automatically from the scientific literature. This paper describes a novel ontology-based interactive information extraction (OBIIE) framework and a specific OBIIE system. We describe how this system enables life scientists to make ad hoc queries similar to using a standard search engine, but where the results are obtained in a database format similar to a pre-programmed information extraction engine. We present a case study in which the system was evaluated for extracting co-factors from EMBASE and MEDLINE.

Research paper thumbnail of Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL)

Database : the journal of biological databases and curation, 2016

Success in extracting biological relationships is mainly dependent on the complexity of the task ... more Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, w...

Research paper thumbnail of Reputation-Based Collaborative Network Biology

Research paper thumbnail of On Crowd-verification of Biological Networks

Bioinformatics and Biology Insights, 2013

Biological networks with a structured syntax are a powerful way of representing biological inform... more Biological networks with a structured syntax are a powerful way of representing biological information generated from high density data; however, they can become unwieldy to manage as their size and complexity increase. This article presents a crowd-verification approach for the visualization and expansion of biological networks. Web-based graphical interfaces allow visualization of causal and correlative biological relationships represented using Biological Expression Language (BEL). Crowdsourcing principles enable participants to communally annotate these relationships based on literature evidences. Gamification principles are incorporated to further engage domain experts throughout biology to gather robust peer-reviewed information from which relationships can be identified and verified. The resulting network models will represent the current status of biological knowledge within the defined boundaries, here processes related to human lung disease. These models are amenable to co...

Research paper thumbnail of Bacterial start site prediction

Nucleic Acids Research, 1999

With the growing number of completely sequenced bacterial genes, accurate gene prediction in bact... more With the growing number of completely sequenced bacterial genes, accurate gene prediction in bacterial genomes remains an important problem. Although the existing tools predict genes in bacterial genomes with high overall accuracy, their ability to pinpoint the translation start site remains unsatisfactory. In this paper, we present a novel approach to bacterial start site prediction that takes into account multiple features of a potential start site, viz., ribosome binding site (RBS) binding energy, distance of the RBS from the start codon, distance from the beginning of the maximal ORF to the start codon, the start codon itself and the coding/non-coding potential around the start site. Mixed integer programing was used to optimize the discriminatory system. The accuracy of this approach is up to 90%, compared to 70%, using the most common tools in fully automated mode (that is, without expert human post-processing of results). The approach is evaluated using Bacillus subtilis, Escherichia coli and Pyrococcus furiosus. These three genomes cover a broad spectrum of bacterial genomes, since B.subtilis is a Gram-positive bacterium, E.coli is a Gram-negative bacterium and P.furiosus is an archaebacterium. A significant problem is generating a set of 'true' start sites for algorithm training, in the absence of experimental work. We found that sequence conservation between P.furiosus and the related Pyrococcus horikoshii clearly delimited the gene start in many cases, providing a sufficient training set.

Research paper thumbnail of Causal biological network database: a comprehensive platform of causal biological network models focused on the pulmonary and vascular systems

Database : the journal of biological databases and curation, 2015

With the wealth of publications and data available, powerful and transparent computational approa... more With the wealth of publications and data available, powerful and transparent computational approaches are required to represent measured data and scientific knowledge in a computable and searchable format. We developed a set of biological network models, scripted in the Biological Expression Language, that reflect causal signaling pathways across a wide range of biological processes, including cell fate, cell stress, cell proliferation, inflammation, tissue repair and angiogenesis in the pulmonary and cardiovascular context. This comprehensive collection of networks is now freely available to the scientific community in a centralized web-based repository, the Causal Biological Network database, which is composed of over 120 manually curated and well annotated biological network models and can be accessed at http://causalbionet.com. The website accesses a MongoDB, which stores all versions of the networks as JSON objects and allows users to search for genes, proteins, biological proc...

Research paper thumbnail of GeneLynx: A Gene-Centric Portal to the Human Genome

Genome Research, 2001

GeneLynx is a meta-database providing an extensive collection of hyperlinks to human gene-specifi... more GeneLynx is a meta-database providing an extensive collection of hyperlinks to human gene-specific information in diverse databases available on the Internet. The GeneLynx project is based on the simple notion that given any gene-specific identifier (accession number, gene name, text, or sequence), scientists should be able to access a single location that provides a set of links to all the publicly available information pertinent to the specified human gene. GeneLynx was implemented as an extensible relational database with an intuitive and user-friendly Web interface. The data are automatically extracted from more than 40 external resources, using appropriate approaches to maximize coverage of the available data. Construction and curation of the system is mediated by a custom set of software tools. An indexing utility is provided to facilitate the establishment of hyperlinks in external databases. A unique feature of the GeneLynx system is a communal curation system for user-aided annotation. GeneLynx can be accessed freely at http://www.genelynx.org.

Research paper thumbnail of Information needs and the role of text mining in drug development

Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing, 2008

Drug development generates information needs from groups throughout a company. Knowing where to l... more Drug development generates information needs from groups throughout a company. Knowing where to look for high-quality information is essential for minimizing costs and remaining competitive. Using 1131 research requests that came to our library between 2001 and 2007, we show that drugs, diseases, and genes/proteins are the most frequently searched subjects, and journal articles, patents, and competitive intelligence literature are the most frequently consulted textual resources.

Research paper thumbnail of Computer survey for likely genes in the one megabase contiguous genomic sequence data of Synechocystis sp. strain PCC6803

DNA research : an international journal for rapid publication of reports on genes and genomes, 1995

The user has requested enhancement of the downloaded file. All in-text references underlined in b... more The user has requested enhancement of the downloaded file. All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately.

Research paper thumbnail of Applications of GeneMark in multispecies environments

Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology, 1996

This paper is supposed to bridge the gap between practical experience in using GeneMark for a rap... more This paper is supposed to bridge the gap between practical experience in using GeneMark for a rapidly widening repertoire of genomes, and the available publications that determine and compare the gene prediction accuracy of the GeneMark method for different genomes. Here we tbcus on the genome-specific variability of prediction error rates and their sources. DNA sequence inhomogeneity is present both in training and control sets of coding and non-coding regions. Coding region inhomogeneity, caused by differences in sequence composition between "native" and horizontally transferred genes or between genes expressed at different levels, contributes to the false negative error rate. Inhomogeneity of non-coding region may frequently be caused by the presence of unnoticed genes and contributes to the false positive error rate. We have documented such unnoticed genes in GenBank sequences for several species. Some of protein products of these genes have been characterized by similarity search methods. For others, which we call "pioneer genes", no significant similarity has been found at a protein sequence level although the confidence of GeneMark prediction is high. For instance, to date a majority of those pioneer gene predictions made for E. coil now show strong similarity to more recently characterized proteins that have been added to protein sequence database. Another practical question is related to genomic sequence inhomogeneity at interspecies level: if GeneMark has not been trained for a particular species, is it possible to apply models derived for phylogenetically close genomes? The answer is, yes. The results of cross-species gene prediction experiments show that cross-species prediction can often be reasonably accurate.

Research paper thumbnail of Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli

Current biology : CB, 1996

Background: The 1.83 Megabase (Mb) sequence of the Haemophilus influenzae chromosome, the first c... more Background: The 1.83 Megabase (Mb) sequence of the Haemophilus influenzae chromosome, the first completed genome sequence of a cellular life form, has been recently reported. Approximately 75 % of the 4.7 Mb genome sequence of Escherichia coli is also available. The life styles of the two bacteria are very different -H. influenzae is an obligate parasite that lives in human upper respiratory mucosa and can be cultivated only on rich media, whereas E. coli is a saprophyte that can grow on minimal media. A detailed comparison of the protein products encoded by these two genomes is expected to provide valuable insights into bacterial cell physiology and genome evolution.

Research paper thumbnail of Gene identification and classification in the Synechocystis genomic sequence by recursive gene mark analysis

DNA sequence : the journal of DNA sequencing and mapping, 1997

The GeneMark method has proven to be an efficient gene-finding tool for the analysis of prokaryot... more The GeneMark method has proven to be an efficient gene-finding tool for the analysis of prokaryotic genomic sequence data. We have developed a procedure of deriving and utilizing several GeneMark models in order to get better gene-detection performance. Upon applying this procedure to the 1.0 Mb contiguous DNA sequence of Synechocystis sp. strain PCC6803, we were able to cluster predicted genes into distinct classes and to produce the class-specific GeneMark models reflecting statistical characteristics of each gene class. One gene class apparently includes genes of exogenous origin. Using class-specific models reduces the gene under prediction error rate down to 1.7% in comparison with 8.1% reported in the previous study when only one GeneMark model was used.

The complete genome sequence of the gastric pathogen Helicobacter pylori

Nature, 1997

Helicobacter pylori, strain 26695, has a circular genome of 1,667,867 base pairs and 1,590 predicted coding sequences. Sequence analysis indicates that H. pylori has well-developed systems for motility, for scavenging iron, and for DNA restriction and modification. Many putative adhesins, lipoproteins and other outer membrane proteins were identified, underscoring the potential complexity of host-pathogen interaction. Based on the large number of sequence-related genes encoding outer membrane proteins and the presence of homopolymeric tracts and dinucleotide repeats in coding sequences, H. pylori, like several other mucosal pathogens, probably uses recombination and slipped-strand mispairing within repeats as mechanisms for antigenic variation and adaptive evolution. Consistent with its restricted niche, H. pylori has a few regulatory networks, and a limited metabolic repertoire and biosynthetic capacity. Its survival in acid conditions depends, in part, on its ability to establish a positive inside-membrane potential in low pH.

Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction

Pacific Symposium on Biocomputing, 1998

Accurate prediction of the position of translation initiation (N-terminal prediction) is a difficult problem. N-terminal prediction from DNA sequence alone is ambiguous if several candidate start sites are close to each other. Protein similarity search is usually unable to indicate the true start of a gene, as it would require strong protein sequence similarity at the N-terminal portion of a protein, where conserved regions are rarely situated. With the aid of the GeneMark program for gene identification, we extract DNA sequence fragments presumably containing ribosome binding sites (RBS) from unannotated complete genomic sequences. These DNA segments are aligned to generate the RBS model using the Gibbs sampling method. N-terminal prediction is then performed by using the RBS model in conjunction with the GeneMark start codon prediction to aid in determining the true N-terminal site.
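
The idea of using an RBS model to choose among nearby candidate start sites can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the RBS-containing segments are already aligned (the abstract uses Gibbs sampling for that step, which is omitted here), it swaps in a plain log-odds position weight matrix as the RBS model, and it does not involve GeneMark at all. The function names, the pseudocount, the uniform background, and the 20-base upstream window are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites.

    Assumption: the sites are already aligned; no Gibbs sampling is done here.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    total = len(aligned_sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window of the same width."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_rbs_score(pwm, upstream):
    """Return the best window score in an upstream region (-inf if too short)."""
    width = len(pwm)
    if len(upstream) < width:
        return float("-inf")
    return max(
        score_window(pwm, upstream[i:i + width])
        for i in range(len(upstream) - width + 1)
    )


def pick_start(pwm, sequence, candidate_starts, upstream_len=20):
    """Pick the candidate start whose upstream region best matches the model."""
    def upstream_score(start):
        upstream = sequence[max(0, start - upstream_len):start]
        return best_rbs_score(pwm, upstream)
    return max(candidate_starts, key=upstream_score)


if __name__ == "__main__":
    # Toy aligned sites and a toy sequence with two candidate ATG starts.
    sites = ["AGGAGG", "AGGAGG", "AGGAGA", "AAGAGG"]
    model = build_pwm(sites)
    seq = "TTTTTTTTTTATGCCCCCCAGGAGGTTTTTATGAAA"
    print(pick_start(model, seq, [10, 30]))
```

In this toy input only the second ATG (index 30) has an AGGAGG-like motif upstream, so it is the one selected; real data would need a properly trained model and a more careful treatment of spacing between the RBS and the start codon.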

How to interpret an anonymous bacterial genome: machine learning approach to gene identification

Genome research, 1998
