The evolution of DNA regulatory regions for proteo-gamma bacteria by interspecies comparisons

Comparative Study

Genome Res. 2002 Feb;12(2):298-308.

doi: 10.1101/gr.207502.

Nikolaus Rajewsky et al. Genome Res. 2002 Feb.

Abstract

The comparison of homologous noncoding DNA for organisms a suitable evolutionary distance apart is a powerful tool for the identification of cis-regulatory elements for transcription and translation and for the study of how they assemble into functional modules. We have fit the three parameters of an affine global probabilistic alignment algorithm to establish the background mutation rate of noncoding sequence between E. coli and a series of gamma proteobacteria ranging from Salmonella to Vibrio. The lower bound we find to the neutral mutation rate is sufficiently high, even for Salmonella, that most of the conservation of noncoding sequence is indicative of selective pressures rather than of insufficient time to evolve. We then use a local version of the alignment algorithm combined with our inferred background mutation rate to assign a significance to the degree of local sequence conservation between orthologous genes, and thereby deduce a probability profile for the upstream regulatory region of all E. coli protein-coding genes. We recover 75%-85% (depending on significance level) of all regulatory sites from a standard compilation for E. coli, and 66%-85% of sigma sites. We also trace the evolution of known regulatory sites and the groups associated with a given transcription factor. Furthermore, we find that approximately one-third of paralogous gene pairs in E. coli have a significant degree of correlation in their regulatory sequence. Finally, we demonstrate an inverse correlation between the rate of evolution of transcription factors and the number of genes they regulate. Our predictions are available at http://www.physics.rockefeller.edu/~siggia.
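To make the idea of a global alignment with affine gaps concrete, the following is a minimal sketch of the classic score-based Gotoh recurrences. It is not the probabilistic algorithm fitted in the paper: the function name `affine_global_score` and the four numeric parameters (`match`, `mismatch`, `gap_open`, `gap_extend`) are illustrative assumptions, and their default values are arbitrary rather than fitted to any data.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1.0, mismatch=-1.0,
                        gap_open=-3.0, gap_extend=-1.0):
    """Best global alignment score of a and b with affine gap costs.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All parameter values are illustrative assumptions, not fitted values.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a new gap pays gap_open; continuing one pays gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    print(affine_global_score("ACGT", "AGT"))
```

In a probabilistic formulation the scores would be replaced by log-probabilities of substitution, gap opening, and gap extension, and the parameters would be estimated from the sequence pairs rather than chosen by hand.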

Figures

Figure 1

Phylogeny of relevant bacterial species. The three-letter abbreviations are as follows: eco, Escherichia coli K12 (GenBank entry NC_000913); stm, Salmonella typhimurium LT2 (genome.wustl.edu/gsc/bacterial/salmonella.shtml); kpn, Klebsiella pneumoniae MGH78578 (genome.wustl.edu/gsc/Projects/bacterial/klebsiella.shtml); ype, Yersinia pestis CO-92 (www.sanger.ac.uk/Projects/Y_pestis/); vch, Vibrio cholerae N16961 (GenBank NC_002505 and NC_002506); hin, Haemophilus influenzae Rd (GenBank NC_000907). The phylogenetic tree is based on 16S ribosomal RNA sequences. H. influenzae is shown only for comparative purposes and was not analyzed in our study.

Figure 2

The probability profiles for the orthologous region upstream of the gene lpdA (lipoamide dehydrogenase (NADH)). The abscissa is in bp units, and the start codon for lpdA begins at position 325. In (a), κ = 0 for all species, whereas in (b) it is optimized separately in each case (as explained in the text), which yields κ = 0.006, 0.003, 0.01, and 0.06 for kpn, stm, vch, and ype, respectively. The two known factor binding sites, for sigma 70 (rpoD17) and for the anaerobic factor arcA, are marked. In (b), the predictions of McCue et al. (2001) are marked with “W”, and the remaining bars are our predictions from the summed profiles.

Figure 3

The probability profiles for the intergenic region between the conserved, divergently transcribed pair of E. coli genes yfhD (to the left) and purL (to the right); the 5′ end of purL begins at position 396. Optimal values of κ = 0.006, 0.001, 0, and 0.003 were determined for kpn, stm, vch, and ype, respectively. There is only one documented binding site, for the purine repressor (purR). The predictions of McCue et al. (2001) for both genes are combined without distinction and labeled with “W”.

Figure 4

Normalized score histograms of genes with known function and genes with unknown function.

Figure 5

Protein conservation and DNA binding specificity. The plot shows (1-PID) versus DNA binding specificity x (Eq. 5). Each data point corresponds to one of the 51 E. coli transcription factors that have an ortholog in Vibrio cholerae. The straight line shown is a linear fit with slope 0.086 ± 0.005. Note that there is an upper cutoff of 0.7 in (1-PID) since, by definition, all orthologs have a PID of at least 0.3. The two obvious outliers at x = 3.4 and (1-PID) ∼ 0.7 are FarR and SoxS. Note that for some of the factors (e.g., FarR), only very few binding sites are known; that is, our estimate of the binding specificity has a large error.
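The two quantities in this caption can be illustrated with a short sketch. It assumes that PID denotes the fraction of identical residues over the aligned, non-gap columns of an ortholog pair, which is one common convention and may differ from the definition used in the paper. The helper names `fraction_identity` and `least_squares_line` and the toy inputs are hypothetical; the binding specificity x of Eq. (5) is not reproduced here.

```python
def fraction_identity(aligned_a, aligned_b):
    """Fraction of non-gap alignment columns with identical residues.

    Assumes both strings come from the same pairwise alignment and use
    '-' for gaps (an assumed convention, not taken from the paper).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def least_squares_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Toy alignment, not real data: 3 of 4 non-gap columns are identical.
    pid = fraction_identity("MKV-L", "MRVAL")
    print(1.0 - pid)
    # Toy points lying exactly on y = 2x + 1.
    print(least_squares_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))
```

With real data, each transcription factor would contribute one point (x, 1-PID), and the fitted slope would be compared against the value quoted in the caption.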

References

    1. Bailey, T. and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings ISMB'94, pp. 28–36.
    2. Blanchette, M., Schwikowski, B., and Tompa, M. 2000. An exact algorithm to identify motifs in orthologous sequences from multiple species. Proceedings of ISMB2000, pp. 37–45.
    3. Brazma A, Jonassen I, Vilo J, Ukkonen E. Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 1998;8:1202–1215.
    4. Bussemaker HJ, Li H, Siggia ED. Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis. Proc Natl Acad Sci. 2000;97:10096–10100.
    5. Bussemaker HJ, Li H, Siggia ED. Regulatory element detection using correlation with genome-wide mRNA expression data. Nat Genetics. 2001;2:167–171.
