First gene-ontology enrichment analysis based on bacterial coregenome variants: insights into adaptations of Salmonella serovars to mammalian- and avian-hosts

Arnaud Felten et al. BMC Microbiol. 2017.

Abstract

Background: Many bacterial genomic studies exploring the evolutionary processes of host adaptation focus on the accessory genome, describing how gene gains and losses can explain the colonization of new habitats. By contrast, we developed a new approach focusing on the coregenome in order to describe the host adaptation of Salmonella serovars.

Methods: In the present work, we propose bioinformatic tools allowing (i) robust phylogenetic inference based on SNPs and recombination events, (ii) identification of fixed SNPs and InDels distinguishing homoplastic and non-homoplastic coregenome variants, and (iii) gene-ontology enrichment analyses to describe metabolic processes involved in adaptation of Salmonella enterica subsp. enterica to mammalian- (S. Dublin), multi- (S. Enteritidis), and avian- (S. Pullorum and S. Gallinarum) hosts.

Results: The 'VARCall' workflow produced a robust phylogenetic inference confirming that the monophyletic clade S. Dublin diverged from the polyphyletic clade S. Enteritidis which includes the divergent clades S. Pullorum and S. Gallinarum (i). The scripts 'phyloFixedVar' and 'FixedVar' detected non-synonymous and non-homoplastic fixed variants supporting the phylogenetic reconstruction (ii). The scripts 'GetGOxML' and 'EveryGO' identified representative metabolic pathways related to host adaptation using the first gene-ontology enrichment analysis based on bacterial coregenome variants (iii).

Conclusions: We propose a new coregenome approach that couples the identification of fixed SNPs and InDels, with regard to inferred phylogenetic clades, with gene-ontology enrichment analysis in order to describe the adaptation of Salmonella serovars Dublin (i.e. mammalian-hosts), Enteritidis (i.e. multi-hosts), Pullorum (i.e. avian-hosts) and Gallinarum (i.e. avian-hosts) at the coregenome scale. These versatile bioinformatic tools can be applied to other bacterial genera without additional development.

Keywords: Bacterial fixed variants; Bacterial genomics; Gene-ontology enrichment analysis.

Conflict of interest statement

Not applicable.

Competing interests

The authors declare that they have no conflict of interest.

Figures

Fig. 1

Boxplots (median, 25th percentile, 75th percentile, minimum and maximum) of pairwise distances expressed as single nucleotide polymorphisms (SNPs) (a) or small insertions/deletions (InDels) (b) within Salmonella enterica subsp. enterica serovars Dublin (n = 60), Enteritidis (n = 528), Pullorum (n = 10) and Gallinarum (n = 28). Normality of distribution and equality of variances were checked with Shapiro-Wilk and Fisher tests, respectively. Statistical differences (*: p < 5.0 × 10⁻²; **: p < 1.0 × 10⁻²; ***: p < 1.0 × 10⁻³; ****: p < 1.0 × 10⁻⁴; *****: p < 1.0 × 10⁻⁵; ******: p < 1.0 × 10⁻⁶) were calculated with Wilcoxon rank sum (i.e. non-normal distribution with equality of variances) or Kolmogorov-Smirnov (i.e. non-normal distribution without equality of variances) tests
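To make the notion of pairwise distance concrete, the following is a minimal Python sketch that counts differing variant positions between every pair of genomes. It is not the authors' 'VCFtoMATRIX' script; the function name `pairwise_snp_distances` and the toy genotype strings are assumptions introduced for illustration only.

```python
from itertools import combinations


def pairwise_snp_distances(genotypes):
    """Count differing variant positions for every pair of genomes.

    `genotypes` maps a genome name to a string of alleles, one character
    per variant position (hypothetical input layout, not the VARCall format).
    """
    lengths = {len(seq) for seq in genotypes.values()}
    if len(lengths) > 1:
        raise ValueError("all genotype strings must have the same length")
    distances = {}
    for a, b in combinations(sorted(genotypes), 2):
        # Distance = number of positions where the two genomes disagree.
        distances[(a, b)] = sum(x != y for x, y in zip(genotypes[a], genotypes[b]))
    return distances


# Toy data: three genomes genotyped at four variant positions.
example = {"g1": "ACGT", "g2": "ACGA", "g3": "TCGA"}
distances = pairwise_snp_distances(example)
```

The per-serovar distributions summarized in the boxplots would then be obtained by keeping only the pairs whose two genomes belong to the same serovar.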

Fig. 2

Phylogenetic inference based on coregenome single nucleotide polymorphisms (SNPs) identified in Salmonella enterica subsp. enterica serovars Dublin, Enteritidis, Pullorum, and Gallinarum. The color legend corresponds to serovars presented by Langridge et al. (Proc. Natl. Acad. Sci. 2015;112:863–8). The variants were identified by the ‘VARCall’ workflow against the reference genome S. Enteritidis (strain P125109, accession NC_011294.1). The phylogeny was inferred from the produced pseudogenomes (4,685,848 bp) with RAxML, based on a bootstrap analysis and a search for the best-scoring Maximum Likelihood tree with the General Time-Reversible model of substitution and the secondary structure 16-state model. Bootstrap values higher than 90% are represented by black circles. The phylogenetic inference converged after 200 bootstrap replicates with a log likelihood score of −8.106 for 1000 computed trees. The tree is rooted on the branch of S. Dublin

Fig. 3

Densities of single nucleotide polymorphisms (SNPs) per 1000 bp (curves), Salmonella pathogenicity islands (dotted lines), and recombination events (rectangles) across Salmonella enterica subsp. enterica serovars (a: 59 genomes, 12,929 SNPs), including Dublin (b: 13 genomes, 5084 SNPs), Enteritidis (c: 33 genomes, 5136 SNPs), Pullorum (d: 5 genomes, 2225 SNPs), and Gallinarum (e: 8 genomes, 671 SNPs). The pathogenicity island database from KonKuk University (Seoul, South Korea) was used to locate the Salmonella pathogenicity islands (SPIs) SPI-1 (2,890,501–2,934,879), SPI-2 (1,727,425–1,769,273), SPI-4 (4,333,507–4,361,514), SPI-5 (1,053,174–1,074,167), SPI-6 (299,796–330,890), SPI-11 (1,904,313–1,912,607), SPI-12 (2,328,077–2,347,757) and PAI III 536 (2,801,306–2,810,695) of the reference genome S. Enteritidis (strain P125109, accession NC_011294.1)
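A SNP density per 1000 bp, as plotted in these curves, can be sketched in Python by binning SNP positions into fixed windows along the reference genome. This is an illustrative assumption about how such a density may be computed, not the authors' code; `snp_density` and `count_in_region` are hypothetical helper names, and positions are assumed to be 1-based reference coordinates.

```python
def snp_density(positions, genome_length, window=1000):
    """Return the number of SNPs in each consecutive window of `window` bp.

    `positions` are 1-based coordinates on the reference genome.
    """
    n_windows = (genome_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if not 1 <= pos <= genome_length:
            raise ValueError(f"position {pos} lies outside the genome")
        counts[(pos - 1) // window] += 1
    return counts


def count_in_region(positions, start, end):
    """Count SNPs falling inside an inclusive interval, e.g. one SPI."""
    return sum(start <= pos <= end for pos in positions)


# Toy example on a 3000 bp genome.
positions = [1, 999, 1000, 1001, 2500]
density = snp_density(positions, genome_length=3000)
```

With real data, `count_in_region` could be called with the island coordinates listed in the caption (for instance 2,890,501 to 2,934,879 for SPI-1) to check whether SNP-dense windows coincide with a pathogenicity island.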

Fig. 4

Homoplastic (grey bars) and non-homoplastic (white bars) variants (SNPs versus InDels, intragenic versus intergenic, non-synonymous versus synonymous) fixed across all branches of the phylogenetic inference including genomes of Salmonella enterica subsp. enterica serovars Enteritidis (n = 33), Pullorum (n = 5), Gallinarum (n = 8) and Dublin (n = 13). The variant annotation was performed with SnpEff against the reference genome S. Enteritidis (strain P125109, accession NC_011294.1). The fixed non-homoplastic variants are defined by common genotypes across the considered group of genomes, as well as different genotypes in all the other compared genomes. The fixed homoplastic variants are defined by common genotypes across the considered group of genomes and genomes of independent phylogenetic clades, as well as different genotypes in genomes of the compared child-branches. The term ‘reference genotype’ refers to fixed variants presenting the genotype of the reference genome. This analysis was performed with the script ‘phyloFixedVar’ (i.e. dependently of the phylogenetic inference). Statistical differences (*: p < 1.0 × 10⁻⁶; **: p < 1.0 × 10⁻⁷; ***: p < 1.0 × 10⁻⁸; ****: p < 1.0 × 10⁻⁹; *****: p < 1.0 × 10⁻¹⁰) were calculated with Wilcoxon signed rank tests. The vertical bars represent the standard deviation
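The two definitions above can be illustrated with a short Python sketch that classifies a single variant position. It is a simplified reading of the caption, not the 'phyloFixedVar' implementation: the function `classify_variant`, its arguments and the toy genome names are assumptions, and every genome that is neither in the considered group nor in the compared child-branches is treated as belonging to an independent clade.

```python
def classify_variant(alleles, group, siblings):
    """Classify one variant position for a considered group of genomes.

    alleles  : dict mapping genome name -> allele at this position
    group    : genomes of the considered clade
    siblings : genomes of the compared child-branch(es)
    """
    group_alleles = {alleles[g] for g in group}
    if len(group_alleles) != 1:
        return "not fixed"  # the group does not share a common genotype
    allele = next(iter(group_alleles))
    if any(alleles[g] == allele for g in siblings):
        return "not fixed"  # the genotype is also seen in the compared branch
    others = set(alleles) - set(group) - set(siblings)
    if any(alleles[g] == allele for g in others):
        return "homoplastic"  # same genotype reappears in an independent clade
    return "non-homoplastic"  # genotype is exclusive to the considered group


# Toy position: two Dublin-like genomes against two sibling genomes and one
# genome from an independent clade (all names are hypothetical).
group = {"d1", "d2"}
siblings = {"e1", "e2"}
exclusive = classify_variant(
    {"d1": "A", "d2": "A", "e1": "G", "e2": "G", "p1": "G"}, group, siblings
)
shared = classify_variant(
    {"d1": "A", "d2": "A", "e1": "G", "e2": "G", "p1": "A"}, group, siblings
)
```

Applying such a function to every variant and every branch of the tree would yield counts comparable in spirit to the bars of the figure.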

Fig. 5

Genes impacted by single nucleotide polymorphisms (SNPs), involved in amino acid catabolism, and fixed at the branch representing divergence between Salmonella serovars Dublin and Enteritidis/Pullorum/Gallinarum. Round bars represent missense (white) and synonymous (grey) SNPs

Fig. 6

Amino acid pathways in which intragenic and non-homoplastic fixed single nucleotide polymorphisms (SNPs) differentiating Salmonella serovars Dublin versus Enteritidis have been detected. The dotted lines represent enzymatic steps for which the corresponding genes encoding enzymes have been specifically mutated. AST, NADH, OAA and PPi stand for ammonia-producing arginine succinyltransferase, nicotinamide adenine dinucleotide, oxaloacetic acid and pyrophosphate, respectively. The KEGG database was used as the reference pathway database (Nucl. Acids Res. 2016;44:D457–62)

Fig. 7

Programs (i.e. black letters) and commands (i.e. grey letters) implemented in the ‘VARCall’ workflow aiming to call single nucleotide polymorphisms (SNPs) and small insertions/deletions (InDels). The scripts referring to alignment against reference genome (i.e. ‘BAMmaker’), variant calling (i.e. ‘VCFmaker_SNP’ and ‘VCFmaker_INDEL’), variant combination (i.e. ‘SNP-INDEL_merge’), pairwise distances (i.e. ‘VCFtoMATRIX’), variant concatenation (i.e. ‘VCFtoFASTA’), pseudogenome assemblies (i.e. ‘VCFtoPseudoGenome’), and report about breadth and depth coverages (i.e. ‘reportMaker’) were written with Python 2.7 and are invoked by the driver script ‘VARCall’ (i.e. black arrow). The script ‘BAMmaker’ is run for each genome (i.e. circular arrow)

Fig. 8

Programs (i.e. black letters) and corresponding effects (i.e. grey letters) implemented in the scripts ‘phyloFixedVar’, ‘GetGOxML’ and ‘EveryGO’ aiming to identify sensitive (Se) and specific (Sp) variants (SNPs and InDels) at each branch of the corresponding phylogenetic inference, associate prokaryotic gene ontology (GO) terms with these homoplastic and/or non-homoplastic variants, and perform gene-ontology enrichment analysis based on the parent-child approach integrating hypergeometric tests and Bonferroni corrections, respectively. Online databases are queried by the scripts ‘GOSlimer’ and ‘GOxML’ (i.e. clouds). The GO database of the Gene Ontology Consortium is used by the script ‘GOSlimer’ to identify prokaryotic GO-terms. The QuickGO browser of the UniProt GO annotation program is queried by the script ‘GOxML’ to associate the variants with the corresponding GO-terms. These scripts were written with Python 2.7 and call the R functions ‘p.adjust’ and ‘phyper’. The whole workflow is semi-automated (i.e. grey arrows) and the scripts ‘GetGOxML’ and ‘EveryGO’ can be run for each variant and each branch, respectively (i.e. circular arrow)
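The statistical core of such an enrichment analysis, a hypergeometric test followed by a Bonferroni correction, can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions rather than the 'EveryGO' script: the function names, the toy GO identifiers and the counts are hypothetical, and the choice of population is left to the caller (in a parent-child setting it would be the genes annotated to the parents of the tested term rather than all genes).

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term.

    Equivalent in spirit to R's phyper(k - 1, K, N - K, n, lower.tail = FALSE).
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def enrichment(term_counts, n_variant_genes, n_population_genes):
    """Return {term: (raw p-value, Bonferroni-adjusted p-value)}.

    term_counts maps a GO term to a pair:
    (variant genes carrying the term, population genes carrying the term).
    """
    terms = sorted(term_counts)
    raw = [
        hypergeom_upper_tail(term_counts[t][0], term_counts[t][1],
                             n_variant_genes, n_population_genes)
        for t in terms
    ]
    return dict(zip(terms, zip(raw, bonferroni(raw))))


# Toy counts with hypothetical GO identifiers (not results from the study).
toy = {"GO:hypothetical_A": (8, 10), "GO:hypothetical_B": (2, 40)}
result = enrichment(toy, n_variant_genes=20, n_population_genes=200)
```

In this toy example the first term is strongly over-represented among the variant genes (8 of its 10 annotated genes are hit), whereas the second is not, so its raw p-value is much larger.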
