David Ussery | Norwegian University of Science and Technology
Papers by David Ussery
Standards in Genomic Sciences, 2013
The Firmicutes represent a major component of the intestinal microflora. The intestinal Firmicutes are a large, diverse group of organisms, many of which are poorly characterized due to their anaerobic growth requirements. Although most Firmicutes are Gram positive, members of the class Negativicutes, including the genus Veillonella, stain Gram negative. Veillonella are among the most abundant organisms of the oral and intestinal microflora of animals and humans, in spite of being strict anaerobes. In this work, the genomes of 24 Negativicutes, including eight Veillonella spp., are compared to 20 other Firmicutes genomes; a further 101 prokaryotic genomes were included, covering 26 phyla. Thus a total of 145 prokaryotic genomes were analyzed by various methods to investigate the apparent conflict of the Veillonella Gram stain and their taxonomic position within the Firmicutes. Comparison of the genome sequences confirms that the Negativicutes are distantly related to Clostridium spp., based on 16S rRNA, complete genomic DNA sequences, and a consensus tree based on conserved proteins. The genus Veillonella is relatively homogeneous: inter-genus pairwise comparison identifies at least 1,350 shared proteins, although less than half of these are found in any given Clostridium genome. Only 27 proteins are found conserved in all analyzed prokaryote genomes. Veillonella has distinct metabolic properties, and significant similarities to genomes of Proteobacteria are not detected, with the exception of a shared LPS biosynthesis pathway. The clade within the class Negativicutes to which the genus Veillonella belongs exhibits unique properties, most of which are in common with Gram-positives and some with Gram-negatives. They are only distantly related to Clostridia, but are even less closely related to Gram-negative species. Though the Negativicutes stain Gram-negative and possess two membranes, the genome and proteome analyses presented here confirm their place within the (mainly) Gram-positive phylum of the Firmicutes. Further studies are required to unveil the evolutionary history of the Veillonella and other Negativicutes.
Escherichia coli is an important component of the biosphere and is an ideal model for studies of processes involved in bacterial genome evolution. Sixty-one publicly available E. coli and Shigella spp. sequenced genomes are compared, using basic methods to produce phylogenetic and proteomics trees, and to identify the pan- and core genomes of this set of sequenced strains. A hierarchical clustering of variable genes allowed clear separation of the strains into clusters, including known pathotypes; clinically relevant serotypes can also be resolved in this way. In contrast, when in silico MLST was performed, many of the various strains appear jumbled and less well resolved. The predicted pan-genome comprises 15,741 gene families, and only 993 (6%) of the families are represented in every genome, comprising the core genome. The variable or 'accessory' genes thus make up more than 90% of the pan-genome and about 80% of a typical genome; some of these variable genes tend to be co-localized on genomic islands. The diversity within the species E. coli, and the overlap in gene content between this and related species, suggests a continuum rather than sharp species borders in this group of Enterobacteriaceae.
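The pan- and core-genome counts reported in the abstract above can be illustrated with a minimal Python sketch. It assumes each genome has already been reduced to a set of gene family identifiers; the function name `pan_and_core` and the toy strain and family names are hypothetical and are not taken from the paper, whose actual clustering pipeline is not described here.

```python
from typing import Dict, Set, Tuple


def pan_and_core(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (pan, core) gene-family sets for a mapping genome -> families.

    pan  = families present in at least one genome (set union)
    core = families present in every genome (set intersection)
    """
    family_sets = list(genomes.values())
    if not family_sets:
        return set(), set()
    pan = set().union(*family_sets)
    core = set.intersection(*family_sets)
    return pan, core


# Toy input (hypothetical): three strains, five gene families in total.
toy = {
    "strain_A": {"f1", "f2", "f3"},
    "strain_B": {"f1", "f2", "f4"},
    "strain_C": {"f1", "f5"},
}

pan, core = pan_and_core(toy)
# Share of the pan-genome that is core, analogous to the 993 / 15,741 figure.
core_fraction = len(core) / len(pan)
print(len(pan), len(core), core_fraction)
```

In this toy case one family out of five is shared by all strains, so the core fraction is 0.2; the abstract's figures correspond to 993 / 15,741, or about 6%.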
Thirty-two genome sequences of various Vibrionaceae members are compared, with emphasis on what makes V. cholerae unique. As few as 1,000 gene families are conserved across all the Vibrionaceae genomes analysed; this fraction roughly doubles for gene families conserved within the species V. cholerae. Of these, approximately 200 gene families that cluster on various locations of the genome are not found in other sequenced Vibrionaceae; these are possibly unique to the V. cholerae species. By comparing gene family content of the analysed genomes, the relatedness to a particular species is identified for two unspeciated genomes. Conversely, two genomes presumably belonging to the same species have suspiciously dissimilar gene family content. We are able to identify a number of genes that are conserved in, and unique to, V. cholerae. Some of these genes may be crucial to the niche adaptation of this species.
Bacterial pathogens are being sequenced at an increasing rate. To many microbiologists, it appears that there simply is not enough time to digest all the information suddenly available. In this chapter we present several tools for comparison of sequenced pathogenic genomes, and discuss differences between pathogens and non-pathogens. The presented tools allow comparison of large numbers of genomes in a hypothesis-driven manner. Visualization of the results is very important for clear presentation of the results and various ways of graphical representation are introduced.
Standards in genomic sciences, 2014
More than 80% of the microbial genomes in GenBank are of 'draft' quality (12,553 draft vs. 2,679 finished, as of October, 2013). We have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences. Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have...
Computers & chemistry, 2002
We examined more than 700 DNA sequences (full length chromosomes and plasmids) for stretches of purines (R) or pyrimidines (Y) and alternating YR stretches; such regions will likely adopt structures which are different from the canonical B-form. Since one turn of the DNA helix is roughly 10 bp, we measured the fraction of each genome which contains purine (or pyrimidine) tracts of lengths of 10 bp or longer (hereafter referred to as 'purine tracts'), as well as stretches of alternating pyrimidines/purines ('pyr/pur tracts') of the same length. Using these criteria, a random sequence would be expected to contain 1.0% of purine tracts and also 1.0% of the alternating pyr/pur tracts. In the vast majority of cases, there are more purine tracts than would be expected from a random sequence, with an average of 3.5%, significantly larger than the expectation value. The fraction of the chromosomes containing pyr/pur tracts was slightly less than expected, with an average of 0.8%. O...
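The tract measurement described above can be sketched in a few lines of Python. This is an illustrative reading of the abstract, not the authors' code: it counts the fraction of bases lying in uninterrupted runs of purines (A/G) or of pyrimidines (C/T) that are at least 10 bp long on the given strand. The function name `tract_fraction` is hypothetical, and the alternating pyr/pur tracts are deliberately left out of this sketch.

```python
import re


def tract_fraction(seq: str, min_len: int = 10) -> float:
    """Fraction of bases inside purine-only or pyrimidine-only runs >= min_len."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # A run of purines [AG] or a run of pyrimidines [CT], each at least min_len long.
    pattern = re.compile(r"[AG]{%d,}|[CT]{%d,}" % (min_len, min_len))
    covered = sum(len(m.group(0)) for m in pattern.finditer(seq))
    return covered / len(seq)


# Toy sequence: a 10 bp purine run followed by 6 bp of alternating C/A.
example = "AGAGAGAGAG" + "CACACA"
print(tract_fraction(example))
```

Here 10 of the 16 bases fall in a qualifying tract. On a real chromosome the same function would be applied to the full sequence string and the result compared with the random-sequence expectation quoted in the abstract.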
We have analysed the complete sequence of the Escherichia coli K12 isolate MG1655 genome for chromatin-associated protein binding sites, and compared the locations of predicted sites with experimental expression data from 'DNA chip' experiments. Of the dozen proteins associated with chromatin in E. coli, only three have been shown to have significant binding preferences: integration host factor (IHF) has the strongest binding site preference, FIS sites show a weak consensus, and there is no clear consensus site for binding of the H-NS protein. Using hidden Markov models (HMMs), we predict the location of 608 IHF sites, scattered throughout the genome. A subset of the IHF sites associated with repeats tends to be clustered around the origin of replication. We estimate there could be roughly 6000 FIS sites in E. coli, and the sites tend to be localised in two regions flanking the replication termini. We also show that the regions upstream of genes regulated by H-NS are more curved and have a higher AT content than regions upstream of other genes. These regions in general would also be localised near the replication terminus. © 2001 Société française de biochimie et biologie moléculaire / Éditions scientifiques et médicales Elsevier SAS
Microbial Informatics and Experimentation, 2012
Background: The thermophilic Campylobacter jejuni and Campylobacter coli are considered weakly clonal populations where incongruences between genetic markers are assumed to be due to random horizontal transfer of genomic DNA. In order to investigate the population genetics structure we extracted a set of 1180 core gene families (CGF) from 27 sequenced genomes of C. jejuni and C. coli. We adopted a principal component analysis (PCA) on the normalized evolutionary distances in order to reveal any patterns in the evolutionary signals contained within the various CGFs.
Infection and Immunity, 2015
Urinary tract infections (UTIs) are among the most common infectious diseases of humans, with Escherichia coli responsible for >80% of all cases. One extreme of UTI is asymptomatic bacteriuria (ABU), which occurs as an asymptomatic carrier state that resembles commensalism. To understand the evolution and molecular mechanisms that underpin ABU, the genome of the ABU E. coli strain VR50 was sequenced. Analysis of the complete genome indicated that it most resembles E. coli K-12, with the addition of a 94-kb genomic island (GI-VR50-pheV), eight prophages, and multiple plasmids. GI-VR50-pheV has a mosaic structure and contains genes encoding a number of UTI-associated virulence factors, namely, Afa (afimbrial adhesin), two autotransporter proteins (Ag43 and Sat), and aerobactin. We demonstrated that the presence of this island in VR50 confers its ability to colonize the murine bladder, as a VR50 mutant with GI-VR50-pheV deleted was attenuated in a mouse model of UTI in vivo. We established that Afa is the island-encoded factor responsible for this phenotype using two independent deletion (Afa operon and AfaE adhesin) mutants. E. coli VR50afa and VR50afaE displayed significantly decreased ability to adhere to human bladder epithelial cells. In the mouse model of UTI, VR50afa and VR50afaE displayed reduced bladder colonization compared to wild-type VR50, similar to the colonization level of the GI-VR50-pheV mutant. Our study suggests that E. coli VR50 is a commensal-like strain that has acquired fitness factors that facilitate colonization of the human bladder.
PLoS ONE, 2014
Shiga toxin-producing Escherichia coli (STEC) cause infections in humans ranging from asymptomatic carriage to bloody diarrhoea and haemolytic uremic syndrome (HUS). Here we present whole genome comparison of Norwegian non-O157 STEC strains with the aim to distinguish between strains with the potential to cause HUS and less virulent strains. Whole genome sequencing and comparisons were performed across 95 non-O157 STEC strains. Twenty-three of these were classified as HUS-associated, including strains from patients with HUS (n = 19) and persons with an epidemiological link to a HUS-case (n = 4). Genomic comparison revealed considerable heterogeneity in gene content across the 95 STEC strains. A clear difference in gene profile was observed between strains with and without the Locus of Enterocyte Effacement (LEE) pathogenicity island. Phylogenetic analysis of the core genome showed a high degree of diversity among the STEC strains, but all HUS-associated STEC strains were distributed in two distinct clusters within phylogroup B1. However, non-HUS strains were also found in these clusters. A number of accessory genes were found to be significantly overrepresented among HUS-associated STEC, but none of them were unique to this group of strains, suggesting that different sets of genes may contribute to the pathogenic potential in different phylogenetic STEC lineages. In this study we were not able to clearly distinguish between HUS-associated and non-HUS non-O157 STEC by extensive genome comparisons. Our results indicate that STECs from different phylogenetic backgrounds have independently acquired virulence genes that determine pathogenic potential, and that the content of such genes is overlapping between HUS-associated and non-HUS strains. Citation: Haugum K, Johansen J, Gabrielsen C, Brandal LT, Bergh K, et al. (2014) Comparative Genomics to Delineate Pathogenic Potential in Non-O157 Shiga Toxin-Producing Escherichia coli (STEC) from Patients with and without Haemolytic Uremic Syndrome (HUS) in Norway. PLoS ONE 9(10): e111788.
BMC Genomics, 2003
Background: For most sequenced prokaryotic genomes, about a third of the protein coding genes annotated are "orphan proteins", that is, they lack homology to known proteins. These hypothetical genes are typically short and randomly scattered throughout the genome. This trend is seen for most of the bacterial and archaeal genomes published to date.
Nature biotechnology, 2014
Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the mea...
BioMed Research International, 2015
Helicobacter pylori is a human gastric pathogen implicated as the major cause of peptic ulcer and second leading cause of gastric cancer (~70%) around the world. Conversely, an increased resistance to antibiotics and hindrances in the development of vaccines against H. pylori are observed. Pangenome analyses of the global representative H. pylori isolates consisting of 39 complete genomes are presented in this article. Phylogenetic analyses have revealed close relationships among geographically diverse strains of H. pylori. The conservation among these genomes was further analyzed by pangenome approach; the predicted conserved gene families 193) constitute ~77% of the average H. pylori genome and 45% of the global gene repertoire of the species. Reverse vaccinology strategies have been adopted to identify and narrow down the potential core-immunogenic candidates. A total of 29 non-host homolog proteins were characterized as universal therapeutic targets based on their functional annotation and protein-protein interaction. Finally, pathogenomics and genome plasticity analysis revealed 3 highly conserved and 2 highly variable putative pathogenicity islands in all of the H. pylori genomes analyzed.
Standards in Genomic Sciences, 2010
We present the pan-genome tree as a tool for visualizing similarities and differences between closely related microbial genomes within a species or genus. Distance between genomes is computed as a weighted relative Manhattan distance based on gene family presence/absence. The weights can be chosen with emphasis on groups of gene families conserved to various degrees inside the pan-genome. The software is available for free as an R-package.
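A weighted Manhattan distance on presence/absence vectors can be sketched as below. The abstract does not give the exact formula, so the normalisation used here is an assumption: the weighted count of differing gene families is divided by the total weight of families present in at least one of the two genomes. The function name and toy vectors are hypothetical, and the example is written in Python rather than R, so it is not the authors' package.

```python
from typing import Sequence


def weighted_relative_manhattan(
    a: Sequence[int], b: Sequence[int], weights: Sequence[float]
) -> float:
    """Weighted Manhattan distance between two 0/1 presence vectors.

    ASSUMPTION: "relative" is taken to mean normalising by the total weight
    of gene families present in at least one of the two genomes.
    """
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have equal length")
    num = 0.0
    den = 0.0
    for x, y, w in zip(a, b, weights):
        num += w * abs(x - y)   # families present in only one genome
        if x or y:
            den += w            # families present in either genome
    return num / den if den else 0.0


# Toy presence/absence vectors over four gene families.
genome_a = [1, 1, 0, 1]
genome_b = [1, 0, 1, 1]

print(weighted_relative_manhattan(genome_a, genome_b, [1, 1, 1, 1]))
# Up-weighting the second family increases its influence on the distance.
print(weighted_relative_manhattan(genome_a, genome_b, [1, 2, 1, 1]))
```

A full matrix of such pairwise distances could then be passed to any hierarchical clustering routine to draw a tree; the choice of weights controls whether core-like or rarely shared families dominate.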
PLoS Computational Biology, 2008
Oligonucleotide usage in archaeal and bacterial genomes can be linked to a number of properties, including codon usage (trinucleotides), DNA base-stacking energy (dinucleotides), and DNA structural conformation (di- to tetranucleotides). We wanted to assess the statistical information potential of different DNA 'word-sizes' and explore how oligonucleotide frequencies differ in coding and non-coding regions. In addition, we used oligonucleotide frequencies to investigate DNA composition and how DNA sequence patterns change within and between prokaryotic organisms. Among the results found was that prokaryotic chromosomes can be described by hexanucleotide frequencies, suggesting that prokaryotic DNA is predominantly short range correlated, i.e., information in prokaryotic genomes is encoded in short oligonucleotides. Oligonucleotide usage varied more within AT-rich and host-associated genomes than in GC-rich and free-living genomes, and this variation was mainly located in non-coding regions. Bias (selectional pressure) in tetranucleotide usage correlated with GC content, and coding regions were more biased than non-coding regions. Non-coding regions were also found to be approximately 5.5% more AT-rich than coding regions, on average, in the 402 chromosomes examined. Pronounced DNA compositional differences were found both within and between AT-rich and GC-rich genomes. GC-rich genomes were more similar and biased in terms of tetranucleotide usage in non-coding regions than AT-rich genomes. The differences found between AT-rich and GC-rich genomes may possibly be attributed to lifestyle, since tetranucleotide usage within host-associated bacteria was, on average, more dissimilar and less biased than free-living archaea and bacteria.
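Oligonucleotide ("word") frequencies of the kind discussed above can be tabulated with a short Python sketch. It counts overlapping words of length k on one strand and skips any word containing a character other than A, C, G or T; the function name `kmer_frequencies` is hypothetical, and the bias and correlation statistics used in the paper are not reproduced here.

```python
from collections import Counter
from typing import Dict


def kmer_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Relative frequencies of overlapping k-letter words in seq (one strand)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet  # skip words with N or other symbols
    )
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


# Toy sequence: tetranucleotide (k=4) usage.
freqs = kmer_frequencies("ACGTACGT", 4)
for word in sorted(freqs):
    print(word, freqs[word])
```

Setting k to 2, 3, 4 or 6 gives the di-, tri-, tetra- and hexanucleotide tables mentioned in the abstract; comparing coding and non-coding regions would mean calling the function on the two sets of subsequences separately.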
F1000Research, 2012
The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to smaller changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored. Corresponding author: Lars-Gustav Snipen (lars.snipen@umb.no). Citation: Snipen LG, Ussery DW (2013) A domain sequence approach to pangenomics: applications to ...
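The domain-sequence clustering idea from the F1000Research abstract above can be sketched as a simple grouping step in Python. The sketch assumes the domain hits per protein have already been computed and ordered along the protein (for example from a Pfam-A search); that search itself is not shown. The function name, protein identifiers and domain labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def domain_sequence_families(
    protein_domains: Dict[str, Sequence[str]]
) -> Dict[Tuple[str, ...], List[str]]:
    """Group proteins by their ordered tuple of domains.

    Proteins with no domain hits are left out, since they have no
    domain sequence to cluster on.
    """
    families: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for protein, domains in protein_domains.items():
        if domains:
            families[tuple(domains)].append(protein)
    return dict(families)


# Toy input (hypothetical identifiers): ordered domain hits per protein.
toy_hits = {
    "genomeA_p1": ["DomX", "DomY"],
    "genomeB_p7": ["DomX", "DomY"],   # same ordered domains -> same family
    "genomeA_p2": ["DomY", "DomX"],   # same domains, different order -> new family
    "genomeB_p9": [],                 # no hits -> skipped
}

families = domain_sequence_families(toy_hits)
for key, members in families.items():
    print(key, members)
```

Because each protein is assigned to a family by a single dictionary lookup, the cost grows linearly with the number of proteins, which is consistent with the linear scaling the abstract describes.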
Frontiers in Microbiology, 2014
We have compared chromosome-specific genes in a set of 18 finished Vibrio genomes, and, in addition, also calculated the pan- and core-genomes from a data set of more than 250 draft Vibrio genome sequences. These genomes come from 9 known species and 2 unknown species. Within the finished chromosomes, we find a core set of 1269 encoded protein families for chromosome 1, and a core of 252 encoded protein families for chromosome 2. Many of these core proteins are also found in the draft genomes (although which chromosome they are located on is unknown). Of the chromosome specific core protein families, 1169 and 153 are uniquely found in chromosomes 1 and 2, respectively. Gene ontology (GO) terms for each of the protein families were determined, and the different sets for each chromosome were compared. A total of 363 different "Molecular Function" GO categories were found for chromosome 1 specific protein families, and these include several broad activities: pyridoxine 5' phosphate synthetase, glucosylceramidase, heme transport, DNA ligase, amino acid binding, and ribosomal components; in contrast, chromosome 2 specific protein families have only 66 Molecular Function GO terms and include many membrane-associated activities, such as ion channels, transmembrane transporters, and electron transport chain proteins. Thus, it appears that whilst there are many "housekeeping systems" encoded in chromosome 1, there are far fewer core functions found in chromosome 2. However, the presence of many membrane-associated encoded proteins in chromosome 2 is surprising.
F1000Research, 2012
The comparative genomics of prokaryotes has shown the presence of conserved regions containing highly similar genes (the 'core genome') and other regions that vary in gene content (the 'flexible' regions). A significant part of the latter is involved in surface structures that are phage recognition targets. Another sizeable part provides for differences in niche exploitation. Metagenomic data indicates that natural populations of prokaryotes are composed of assemblages of clonal lineages or "meta-clones" that share a core of genes but contain a high diversity by varying the flexible component. This meta-clonal diversity is maintained by a collection of phages that equalize the populations by preventing any individual clonal lineage from hoarding common resources. Thus, this polyclonal assemblage and the phages preying upon them constitute natural selection units.
PLoS ONE, 2010
Campylobacter jejuni strain M1 (laboratory designation 99/308) is a rarely documented case of direct transmission of C. jejuni from chicken to a person, resulting in enteritis. We have sequenced the genome of C. jejuni strain M1, and compared this to 12 other C. jejuni sequenced genomes currently publicly available. Compared to these, M1 is closest to strain 81116. Based on the 13 genome sequences, we have identified the C. jejuni pan-genome, as well as the core genome, the auxiliary genes, and genes unique between strains M1 and 81116. The pan-genome contains 2,427 gene families, whilst the core genome comprised 1,295 gene families, or about two-thirds of the gene content of the average of the sequenced C. jejuni genomes. Various comparison and visualization tools were applied to the 13 C. jejuni genome sequences, including a species pan- and core genome plot, a BLAST Matrix and a BLAST Atlas. Trees based on 16S rRNA sequences and on the total gene families in each genome are presented. The findings are discussed in the background of the proven virulence potential of M1.
Standards in Genomic Sciences, 2013
The Firmicutes represent a major component of the intestinal microflora. The intestinal Firmicute... more The Firmicutes represent a major component of the intestinal microflora. The intestinal Firmicutes are a large, diverse group of organisms, many of which are poorly characterized due to their anaerobic growth requirements. Although most Firmicutes are Gram positive, members of the class Negativicutes, including the genus Veillonella, stain Gram negative. Veillonella are among the most abundant organisms of the oral and intestinal microflora of animals and humans, in spite of being strict anaerobes. In this work, the genomes of 24 Negativicutes, including eight Veillonella spp., are compared to 20 other Firmicutes genomes; a further 101 prokaryotic genomes were included, covering 26 phyla. Thus a total of 145 prokaryotic genomes were analyzed by various methods to investigate the apparent conflict of the Veillonella Gram stain and their taxonomic position within the Firmicutes. Comparison of the genome sequences confirms that the Negativicutes are distantly related to Clostridium spp., based on 16S rRNA, complete genomic DNA sequences, and a consensus tree based on conserved proteins. The genus Veillonella is relatively homogeneous: inter-genus pairwise comparison identifies at least 1,350 shared proteins, although less than half of these are found in any given Clostridium g enome. Only 27 proteins are found conserved in all analyzed prokaryote genomes. Veillonella has distinct metabolic properties, and significant similarities to genomes of Proteobacteria are not detected, with the exception of a shared LPS biosynthesis pathway. The clade within the class Negativicutes to which the genus Veillonella belongs exhibits unique properties, most of which are in common with Gram-positives and some with Gram negatives. They are only distantly related to Clostridia, but are even less closely related to Gram-negative species. 
Thoug h the Negativicutes stain Gram-negative and possess two membranes, the genome and proteome analysis presented here confirm their place within the (mainly) Gram positive phylum of the Firmicutes. Further studies are required to unveil the evolutionary history of the Veillonella and other Negativicutes.
Escherichia coli is an important component of the biosphere and is an ideal model for studies of ... more Escherichia coli is an important component of the biosphere and is an ideal model for studies of processes involved in bacterial genome evolution. Sixty-one publically available E. coli and Shigella spp. sequenced genomes are compared, using basic methods to produce phylogenetic and proteomics trees, and to identify the pan-and core genomes of this set of sequenced strains. A hierarchical clustering of variable genes allowed clear separation of the strains into clusters, including known pathotypes; clinically relevant serotypes can also be resolved in this way. In contrast, when in silico MLST was performed, many of the various strains appear jumbled and less well resolved. The predicted pan-genome comprises 15,741 gene families, and only 993 (6%) of the families are represented in every genome, comprising the core genome. The variable or 'accessory' genes thus make up more than 90% of the pan-genome and about 80% of a typical genome; some of these variable genes tend to be co-localized on genomic islands. The diversity within the species E. coli, and the overlap in gene content between this and related species, suggests a continuum rather than sharp species borders in this group of Enterobacteriaceae.
Thirty-two genome sequences of various Vibrio-naceae members are compared, with emphasis on what ... more Thirty-two genome sequences of various Vibrio-naceae members are compared, with emphasis on what makes V. cholerae unique. As few as 1,000 gene families are conserved across all the Vibrionaceae genomes analysed ; this fraction roughly doubles for gene families conserved within the species V. cholerae. Of these, approximately 200 gene families that cluster on various locations of the genome are not found in other sequenced Vibrionaceae; these are possibly unique to the V. cholerae species. By comparing gene family content of the analysed genomes, the relatedness to a particular species is identified for two unspeciated genomes. Conversely, two genomes presumably belonging to the same species have suspiciously dissimilar gene family content. We are able to identify a number of genes that are conserved in, and unique to, V. cholerae. Some of these genes may be crucial to the niche adaptation of this species.
Bacterial pathogens are being sequenced at an increasing rate. To many microbiologists, it appears that there simply is not enough time to digest all the information suddenly available. In this chapter we present several tools for comparison of sequenced pathogenic genomes, and discuss differences between pathogens and non-pathogens. The presented tools allow comparison of large numbers of genomes in a hypothesis-driven manner. Visualization is very important for clear presentation of the results, and various ways of graphical representation are introduced.
Standards in genomic sciences, 2014
More than 80% of the microbial genomes in GenBank are of 'draft' quality (12,553 draft vs. 2,679 finished, as of October 2013). We have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences. Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition, and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have...
Computers & chemistry, 2002
We examined more than 700 DNA sequences (full length chromosomes and plasmids) for stretches of purines (R) or pyrimidines (Y) and alternating YR stretches; such regions will likely adopt structures which are different from the canonical B-form. Since one turn of the DNA helix is roughly 10 bp, we measured the fraction of each genome which contains purine (or pyrimidine) tracts of lengths of 10 bp or longer (hereafter referred to as 'purine tracts'), as well as stretches of alternating pyrimidines/purines ('pyr/pur tracts') of the same length. Using these criteria, a random sequence would be expected to contain 1.0% of purine tracts and also 1.0% of the alternating pyr/pur tracts. In the vast majority of cases, there are more purine tracts than would be expected from a random sequence, with an average of 3.5%, significantly larger than the expectation value. The fraction of the chromosomes containing pyr/pur tracts was slightly less than expected, with an average of 0.8%. O...
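The tract measurement described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, A/G are treated as purines and C/T as pyrimidines, any other symbol (such as N) interrupts a tract, and only one strand is scanned (a pyrimidine run on this strand corresponds to a purine run on the complementary strand).

```python
from itertools import groupby

PURINES = set("AG")
PYRIMIDINES = set("CT")


def _classes(seq):
    """Map each base to 0 (purine), 1 (pyrimidine) or None (anything else)."""
    labels = []
    for base in seq.upper():
        if base in PURINES:
            labels.append(0)
        elif base in PYRIMIDINES:
            labels.append(1)
        else:
            labels.append(None)
    return labels


def _covered_fraction(labels, min_len):
    """Fraction of positions lying in runs of one label at least min_len long."""
    if not labels:
        return 0.0
    covered = 0
    for key, run in groupby(labels):
        run_length = sum(1 for _ in run)
        if key is not None and run_length >= min_len:
            covered += run_length
    return covered / len(labels)


def purine_tract_fraction(seq, min_len=10):
    """Fraction of seq covered by all-purine or all-pyrimidine tracts."""
    return _covered_fraction(_classes(seq), min_len)


def alternating_tract_fraction(seq, min_len=10):
    """Fraction of seq covered by strictly alternating pyr/pur tracts.

    XOR-ing each class with the position parity turns an alternating stretch
    into a run of one constant label, so the same run counter can be reused.
    """
    labels = [None if c is None else c ^ (i % 2)
              for i, c in enumerate(_classes(seq))]
    return _covered_fraction(labels, min_len)
```

For example, `purine_tract_fraction("A" * 10 + "C" + "GGG")` counts only the ten-base purine run out of 14 positions. As a consistency check on the quoted expectation (a standard run-length argument, not stated in the abstract): for a sequence with independent positions and equal purine/pyrimidine probability, the expected covered fraction for tracts of length L or more is (L + 1) / 2^L, about 1.07% for L = 10.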
We have analysed the complete sequence of the Escherichia coli K12 isolate MG1655 genome for chromatin-associated protein binding sites, and compared the location of predicted sites with experimental expression data from 'DNA chip' experiments. Of the dozen proteins associated with chromatin in E. coli, only three have been shown to have significant binding preferences: integration host factor (IHF) has the strongest binding site preference, FIS sites show a weak consensus, and there is no clear consensus site for binding of the H-NS protein. Using hidden Markov models (HMMs), we predict the location of 608 IHF sites, scattered throughout the genome. A subset of the IHF sites associated with repeats tends to be clustered around the origin of replication. We estimate there could be roughly 6000 FIS sites in E. coli, and the sites tend to be localised in two regions flanking the replication termini. We also show that the regions upstream of genes regulated by H-NS are more curved and have a higher AT content than regions upstream of other genes. These regions in general would also be localised near the replication terminus. © 2001 Société française de biochimie et biologie moléculaire / Éditions scientifiques et médicales Elsevier SAS
Microbial Informatics and Experimentation, 2012
Background: The thermophilic Campylobacter jejuni and Campylobacter coli are considered weakly clonal populations where incongruences between genetic markers are assumed to be due to random horizontal transfer of genomic DNA. In order to investigate the population genetics structure we extracted a set of 1180 core gene families (CGF) from 27 sequenced genomes of C. jejuni and C. coli. We adopted a principal component analysis (PCA) on the normalized evolutionary distances in order to reveal any patterns in the evolutionary signals contained within the various CGFs.
Infection and Immunity, 2015
Urinary tract infections (UTIs) are among the most common infectious diseases of humans, with Escherichia coli responsible for >80% of all cases. One extreme of UTI is asymptomatic bacteriuria (ABU), which occurs as an asymptomatic carrier state that resembles commensalism. To understand the evolution and molecular mechanisms that underpin ABU, the genome of the ABU E. coli strain VR50 was sequenced. Analysis of the complete genome indicated that it most resembles E. coli K-12, with the addition of a 94-kb genomic island (GI-VR50-pheV), eight prophages, and multiple plasmids. GI-VR50-pheV has a mosaic structure and contains genes encoding a number of UTI-associated virulence factors, namely, Afa (afimbrial adhesin), two autotransporter proteins (Ag43 and Sat), and aerobactin. We demonstrated that the presence of this island in VR50 confers its ability to colonize the murine bladder, as a VR50 mutant with GI-VR50-pheV deleted was attenuated in a mouse model of UTI in vivo. We established that Afa is the island-encoded factor responsible for this phenotype using two independent deletion (Afa operon and AfaE adhesin) mutants. E. coli VR50afa and VR50afaE displayed significantly decreased ability to adhere to human bladder epithelial cells. In the mouse model of UTI, VR50afa and VR50afaE displayed reduced bladder colonization compared to wild-type VR50, similar to the colonization level of the GI-VR50-pheV mutant. Our study suggests that E. coli VR50 is a commensal-like strain that has acquired fitness factors that facilitate colonization of the human bladder.
PLoS ONE, 2014
Shiga toxin-producing Escherichia coli (STEC) cause infections in humans ranging from asymptomatic carriage to bloody diarrhoea and haemolytic uremic syndrome (HUS). Here we present whole genome comparison of Norwegian non-O157 STEC strains with the aim to distinguish between strains with the potential to cause HUS and less virulent strains. Whole genome sequencing and comparisons were performed across 95 non-O157 STEC strains. Twenty-three of these were classified as HUS-associated, including strains from patients with HUS (n = 19) and persons with an epidemiological link to a HUS-case (n = 4). Genomic comparison revealed considerable heterogeneity in gene content across the 95 STEC strains. A clear difference in gene profile was observed between strains with and without the Locus of Enterocyte Effacement (LEE) pathogenicity island. Phylogenetic analysis of the core genome showed a high degree of diversity among the STEC strains, but all HUS-associated STEC strains were distributed in two distinct clusters within phylogroup B1. However, non-HUS strains were also found in these clusters. A number of accessory genes were found to be significantly overrepresented among HUS-associated STEC, but none of them were unique to this group of strains, suggesting that different sets of genes may contribute to the pathogenic potential in different phylogenetic STEC lineages. In this study we were not able to clearly distinguish between HUS-associated and non-HUS non-O157 STEC by extensive genome comparisons. Our results indicate that STECs from different phylogenetic backgrounds have independently acquired virulence genes that determine pathogenic potential, and that the content of such genes is overlapping between HUS-associated and non-HUS strains.
Citation: Haugum K, Johansen J, Gabrielsen C, Brandal LT, Bergh K, et al. (2014) Comparative Genomics to Delineate Pathogenic Potential in Non-O157 Shiga Toxin-Producing Escherichia coli (STEC) from Patients with and without Haemolytic Uremic Syndrome (HUS) in Norway. PLoS ONE 9(10): e111788.
BMC Genomics, 2003
Background: For most sequenced prokaryotic genomes, about a third of the protein coding genes annotated are "orphan proteins", that is, they lack homology to known proteins. These hypothetical genes are typically short and randomly scattered throughout the genome. This trend is seen for most of the bacterial and archaeal genomes published to date.
Nature biotechnology, 2014
Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the mea...
BioMed Research International, 2015
Helicobacter pylori is a human gastric pathogen implicated as the major cause of peptic ulcer and second leading cause of gastric cancer (~70%) around the world. Conversely, an increased resistance to antibiotics and hindrances in the development of vaccines against H. pylori are observed. Pangenome analyses of the global representative H. pylori isolates consisting of 39 complete genomes are presented in this article. Phylogenetic analyses have revealed close relationships among geographically diverse strains of H. pylori. The conservation among these genomes was further analyzed by a pangenome approach; the predicted conserved gene families 193) constitute ~77% of the average H. pylori genome and 45% of the global gene repertoire of the species. Reverse vaccinology strategies have been adopted to identify and narrow down the potential core-immunogenic candidates. A total of 29 non-host homolog proteins were characterized as universal therapeutic targets based on their functional annotation and protein-protein interaction. Finally, pathogenomics and genome plasticity analysis revealed 3 highly conserved and 2 highly variable putative pathogenicity islands in all of the H. pylori genomes analyzed.
Standards in Genomic Sciences, 2010
We present the pan-genome tree as a tool for visualizing similarities and differences between closely related microbial genomes within a species or genus. Distance between genomes is computed as a weighted relative Manhattan distance based on gene family presence/absence. The weights can be chosen with emphasis on groups of gene families conserved to various degrees inside the pan-genome. The software is available for free as an R-package.
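As an illustration of the distance described above, here is a minimal sketch of one plausible reading of a "weighted relative Manhattan distance" on presence/absence vectors: the weighted sum of absolute differences, divided by the total weight. The exact normalization used in the paper's R package is not given in the abstract, so this formula, the function name, and the example weights are assumptions. Python is used here for illustration only.

```python
from typing import Optional, Sequence


def weighted_relative_manhattan(
    a: Sequence[int],
    b: Sequence[int],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted Manhattan distance between two 0/1 gene family profiles,
    scaled by the total weight so the result lies between 0 and 1.

    Assumed formula: sum_k w_k * |a_k - b_k| / sum_k w_k.
    """
    if len(a) != len(b):
        raise ValueError("profiles must cover the same gene families")
    if weights is None:
        weights = [1.0] * len(a)
    if len(weights) != len(a):
        raise ValueError("one weight per gene family is required")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_diff = sum(w * abs(x - y) for w, x, y in zip(weights, a, b))
    return weighted_diff / total_weight


# Two toy genomes described over four gene families (1 = present).
genome_1 = [1, 1, 0, 1]
genome_2 = [1, 0, 1, 1]
unweighted = weighted_relative_manhattan(genome_1, genome_2)
# Hypothetical weights that emphasize the second family and ignore the last.
emphasized = weighted_relative_manhattan(genome_1, genome_2, [1, 2, 1, 0])
```

Choosing larger weights for families shared by few genomes, or by most genomes, shifts the emphasis of the resulting tree in the way the abstract describes.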
PLoS Computational Biology, 2008
Oligonucleotide usage in archaeal and bacterial genomes can be linked to a number of properties, including codon usage (trinucleotides), DNA base-stacking energy (dinucleotides), and DNA structural conformation (di- to tetranucleotides). We wanted to assess the statistical information potential of different DNA 'word-sizes' and explore how oligonucleotide frequencies differ in coding and non-coding regions. In addition, we used oligonucleotide frequencies to investigate DNA composition and how DNA sequence patterns change within and between prokaryotic organisms. Among the results found was that prokaryotic chromosomes can be described by hexanucleotide frequencies, suggesting that prokaryotic DNA is predominantly short range correlated, i.e., information in prokaryotic genomes is encoded in short oligonucleotides. Oligonucleotide usage varied more within AT-rich and host-associated genomes than in GC-rich and free-living genomes, and this variation was mainly located in non-coding regions. Bias (selectional pressure) in tetranucleotide usage correlated with GC content, and coding regions were more biased than non-coding regions. Non-coding regions were also found to be approximately 5.5% more AT-rich than coding regions, on average, in the 402 chromosomes examined. Pronounced DNA compositional differences were found both within and between AT-rich and GC-rich genomes. GC-rich genomes were more similar and biased in terms of tetranucleotide usage in non-coding regions than AT-rich genomes. The differences found between AT-rich and GC-rich genomes may possibly be attributed to lifestyle, since tetranucleotide usage within host-associated bacteria was, on average, more dissimilar and less biased than free-living archaea and bacteria.
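The oligonucleotide ('word') frequencies that the analysis above builds on can be computed with a short sliding-window count. The sketch below is a generic illustration, not the paper's method: the function name is hypothetical, overlapping windows on a single strand are assumed, and words containing symbols other than A, C, G, T are skipped.

```python
from collections import Counter
from typing import Dict

DNA_ALPHABET = set("ACGT")


def kmer_frequencies(seq: str, k: int) -> Dict[str, float]:
    """Relative frequencies of overlapping k-mers (words of length k) in seq."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= DNA_ALPHABET
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Dinucleotide (k=2) usage of a toy sequence; k=4 or k=6 would give the
# tetranucleotide and hexanucleotide profiles discussed in the abstract.
dinucleotides = kmer_frequencies("ACGTAC", 2)
```

Comparing such profiles between coding and non-coding regions, or between genomes, requires a choice of distance or bias measure that the abstract does not specify.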
F1000Research, 2012
The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in the future. A Heaps law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to smaller changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored. Corresponding author: Lars-Gustav Snipen (lars.snipen@umb.no). Snipen LG, Ussery DW (2013) A domain sequence approach to pangenomics: applications to...
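The grouping step described above, clustering proteins by their ordered sequence of domains, can be sketched in a few lines. This assumes the domain hits per protein have already been obtained and ordered along the protein (for example from a Pfam-A scan); the function name, protein identifiers, and domain labels below are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def domain_sequence_families(
    protein_domains: Dict[str, Sequence[str]]
) -> Dict[Tuple[str, ...], List[str]]:
    """Group proteins whose ordered domain content is identical.

    Proteins with no domain hit are left out, since they have no domain
    sequence to cluster on.
    """
    families: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for protein_id, domains in protein_domains.items():
        if domains:
            families[tuple(domains)].append(protein_id)
    return dict(families)


# Toy input: protein id -> domains in order along the protein (placeholders).
toy_hits = {
    "genomeA_p1": ["DomX", "DomY"],
    "genomeB_p7": ["DomX", "DomY"],
    "genomeB_p9": ["DomY", "DomX"],  # same domains, different order
    "genomeA_p2": ["DomZ"],
    "genomeA_p3": [],                # no domain hit, not clustered
}
families = domain_sequence_families(toy_hits)
```

Because each protein is handled once and keyed by its own domain tuple, the work grows linearly with the number of proteins, in line with the linear scaling noted in the abstract.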
Frontiers in Microbiology, 2014
We have compared chromosome-specific genes in a set of 18 finished Vibrio genomes, and, in addition, also calculated the pan- and core-genomes from a data set of more than 250 draft Vibrio genome sequences. These genomes come from 9 known species and 2 unknown species. Within the finished chromosomes, we find a core set of 1269 encoded protein families for chromosome 1, and a core of 252 encoded protein families for chromosome 2. Many of these core proteins are also found in the draft genomes (although which chromosome they are located on is unknown). Of the chromosome specific core protein families, 1169 and 153 are uniquely found in chromosomes 1 and 2, respectively. Gene ontology (GO) terms for each of the protein families were determined, and the different sets for each chromosome were compared. A total of 363 different "Molecular Function" GO categories were found for chromosome 1 specific protein families, and these include several broad activities: pyridoxine 5' phosphate synthetase, glucosylceramidase, heme transport, DNA ligase, amino acid binding, and ribosomal components; in contrast, chromosome 2 specific protein families have only 66 Molecular Function GO terms and include many membrane-associated activities, such as ion channels, transmembrane transporters, and electron transport chain proteins. Thus, it appears that whilst there are many "housekeeping systems" encoded in chromosome 1, there are far fewer core functions found in chromosome 2. However, the presence of many membrane-associated encoded proteins in chromosome 2 is surprising.
F1000Research, 2012
The comparative genomics of prokaryotes has shown the presence of conserved regions containing highly similar genes (the 'core genome') and other regions that vary in gene content (the 'flexible' regions). A significant part of the latter is involved in surface structures that are phage recognition targets. Another sizeable part provides for differences in niche exploitation. Metagenomic data indicates that natural populations of prokaryotes are composed of assemblages of clonal lineages or "meta-clones" that share a core of genes but contain a high diversity by varying the flexible component. This meta-clonal diversity is maintained by a collection of phages that equalize the populations by preventing any individual clonal lineage from hoarding common resources. Thus, this polyclonal assemblage and the phages preying upon them constitute natural selection units.
PLoS ONE, 2010
Campylobacter jejuni strain M1 (laboratory designation 99/308) is a rarely documented case of direct transmission of C. jejuni from chicken to a person, resulting in enteritis. We have sequenced the genome of C. jejuni strain M1, and compared this to 12 other C. jejuni sequenced genomes currently publicly available. Compared to these, M1 is closest to strain 81116. Based on the 13 genome sequences, we have identified the C. jejuni pan-genome, as well as the core genome, the auxiliary genes, and genes unique between strains M1 and 81116. The pan-genome contains 2,427 gene families, whilst the core genome comprises 1,295 gene families, or about two-thirds of the gene content of the average of the sequenced C. jejuni genomes. Various comparison and visualization tools were applied to the 13 C. jejuni genome sequences, including a species pan- and core genome plot, a BLAST Matrix and a BLAST Atlas. Trees based on 16S rRNA sequences and on the total gene families in each genome are presented. The findings are discussed in the background of the proven virulence potential of M1.
This book is part of the series Computing for Comparative Microbial Genomics. The book is written for microbiologists needing an introduction to genomics as well as for bioinformaticists who need to be introduced to microbiology. First, a brief overview of molecular biology and of the concept of sequences as biological information is given. There are four main parts: Introduction, Comparative Genomics, Transcriptomics and Proteomics, and Microbial Communities. "It is a very well-written review of genomics and proteomics of microbes, and makes convincing arguments for the practicality of applying bioinformatics to the study of communities of these species. The references are well chosen. The writing style is superb. … There is an amazing amount of interesting material… The book is probably more suitable as an introduction to contemporary applications of bioinformatics and microbiology for computational scientists." (Anthony J. Duben, ACM Computing Reviews,