Connecting protein structure with predictions of regulatory sites

Alexandre V Morozov et al. Proc Natl Acad Sci U S A. 2007.

Abstract

A common task posed by microarray experiments is to infer the binding site preferences for a known transcription factor from a collection of genes that it regulates and to ascertain whether the factor acts alone or in a complex. The converse problem can also be posed: Given a collection of binding sites, can the regulatory factor or complex of factors be inferred? Both tasks are substantially facilitated by using relatively simple homology models for protein-DNA interactions, as well as the rapidly expanding protein structure database. For budding yeast, we are able to construct reliable structural models for 67 transcription factors and with them redetermine factor binding sites by using a Bayesian Gibbs sampling algorithm and an extensive protein localization data set. For 49 factors in common with a prior analysis of this data set (based largely on phylogenetic conservation), we find that half of the previously predicted binding motifs are in need of some revision. We also solve the inverse problem of ascertaining the factors from the binding sites by assigning a correct protein fold to 25 of the 49 cases from a previous study. Our approach is easily extended to other organisms, including higher eukaryotes. Our study highlights the utility of enlarging current structural genomics projects that exhaustively sample fold structure space to include all factors with significantly different DNA-binding specificities.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.

PWM predictions for five TFs in the Zn2-Cys6 binuclear cluster family, with co-crystal structures showing extensive spacing and orientation variability. (Left) Structure-based priors. (Right) PWMs refined with Gibbs sampling (see Materials and Methods). Arrows show the relative orientation of two monomeric half-sites in the dimeric site from the crystal structure. The monomeric half-sites can be arranged in direct (tail-to-head; HAP1), inverted (head-to-head; GAL4, PPR1, PUT3), and everted (tail-to-tail; LEU3) orientations.
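The three half-site arrangements named in this caption can be made concrete with a small sketch. It is illustrative only: the function names, the use of "N" for spacer positions, and the spacer lengths passed in are assumptions, not values taken from the paper. The half-site is treated as an arrow pointing 5' to 3', so its reverse complement is the arrow pointing the other way.

```python
# Illustrative sketch: compose a dimeric site from one monomeric half-site.
# Names and spacer lengths are hypothetical, chosen only for demonstration.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (N is left unchanged)."""
    return seq.translate(COMPLEMENT)[::-1]


def dimeric_site(half_site: str, spacer_len: int, orientation: str) -> str:
    """Join two copies of a half-site around a spacer of unspecified bases."""
    forward = half_site.upper()
    backward = reverse_complement(forward)
    if orientation == "direct":        # tail-to-head:  -> ->
        left, right = forward, forward
    elif orientation == "inverted":    # head-to-head:  -> <-
        left, right = forward, backward
    elif orientation == "everted":     # tail-to-tail:  <- ->
        left, right = backward, forward
    else:
        raise ValueError(f"unknown orientation: {orientation!r}")
    return left + "N" * spacer_len + right


if __name__ == "__main__":
    for name in ("direct", "inverted", "everted"):
        print(name, dimeric_site("CGG", 4, name))
```

With a CGG half-site, the inverted arrangement reads CGG...CCG and the everted arrangement reads CCG...CGG, which is why the same monomer specificity can yield visibly different dimeric PWMs.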

Fig. 2.

Illustration of how structural and sequence data are mined in the case of ARG81. A DNA-binding domain of the Zn2-Cys6 binuclear cluster type is found in the ARG81 protein sequence. The HAP1 homodimer (PDB code 1hwt) is identified as the homolog with the highest interface scores S_hm (93.5 for chain C, 88.7 for chain D). The interface scores reflect the similarity of the HAP1 and ARG81 DNA-binding interfaces on the basis of their protein sequence alignments. Interface amino acids are labeled “b” for the DNA phosphate backbone contacts and “s” for the DNA base contacts. Observed amino acid mutations at the interface are sufficiently conservative and thus are assumed not to change the binding specificity significantly. However, to approximate previously characterized ARG81 binding sites (26), columns 4–6 are removed from the HAP1 PWM, and the CGC half-sites are replaced by the more common CGG half-sites. The 1hwt-based PWM modified in this way is used as the informative prior for the Gibbs sampling algorithm, which is run on the intergenic sequences known to be bound by ARG81 from the ChIP-chip experiment (2). After the ARG81 sites are identified, their alignment is used to compile the ARG81 PWM. Each site in the alignment is weighted by its posterior probability p(s, c) (>0.05).
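The last two steps of this caption (editing the prior PWM and compiling a posterior-weighted PWM) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the PWM is represented as a list of per-position dictionaries, and the "(>0.05)" in the caption is read here as an inclusion threshold on p(s, c).

```python
# Minimal sketch, assuming a PWM is a list of {base: probability} columns.
# Function names and the threshold interpretation are assumptions.

BASES = "ACGT"


def drop_columns(pwm, first, last):
    """Remove columns first..last (1-based, inclusive) from a PWM."""
    return pwm[: first - 1] + pwm[last:]


def weighted_pwm(sites, posteriors, threshold=0.05):
    """Compile a PWM from aligned sites, weighting each by its posterior.

    Sites whose posterior does not exceed `threshold` are ignored
    (assumed reading of the caption's "(>0.05)").
    """
    kept = [(s.upper(), p) for s, p in zip(sites, posteriors) if p > threshold]
    if not kept:
        raise ValueError("no site passes the posterior threshold")
    length = len(kept[0][0])
    if any(len(site) != length for site, _ in kept):
        raise ValueError("aligned sites must all have the same length")

    pwm = []
    for pos in range(length):
        column = {base: 0.0 for base in BASES}
        for site, weight in kept:
            column[site[pos]] += weight      # each site counts by its posterior
        total = sum(column.values())
        pwm.append({base: column[base] / total for base in BASES})
    return pwm


if __name__ == "__main__":
    sites = ["CGG", "CGG", "CGC", "TTT"]
    posteriors = [0.5, 0.25, 0.25, 0.01]     # made-up values for illustration
    for column in weighted_pwm(sites, posteriors):
        print(column)
```

Weighting by posterior probability lets confidently identified sites dominate the final PWM while marginal sites contribute only in proportion to their support.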

Fig. 3.

Prediction of the informative prior for the phosphatase system regulator PHO4. (A) Crystal structure of the PHO4 helix–loop–helix dimer bound to its consensus site (PDB code 1a0a). (B) Atomic profile: the number of heavy atoms, N_i, within 4.5 Å of base pair i in the binding site. (C) Consensus base probability profile: the probability w_iα(N_i) of the consensus base α at position i in the binding site (cf. Eq. 1). (D) Structure-based PWM prediction.
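The atomic profile in panel B can be sketched as a simple contact count. This is a rough illustration under several assumptions: coordinates are plain (x, y, z) tuples in Å, the counted atoms are protein heavy atoms, "within 4.5 Å of base pair i" is read as within 4.5 Å of any atom of that base pair, and the function names are hypothetical. Eq. 1, which maps N_i to the consensus probability, is not reproduced on this page, so the second helper takes that probability as an input and, as a further assumption, spreads the remainder uniformly over the other three bases.

```python
import math

# Hedged sketch: atom-contact profile and a PWM column built from a given
# consensus probability. The mapping from N_i to that probability (Eq. 1 in
# the paper) is NOT implemented here because it is not shown in this text.


def atomic_profile(protein_atoms, base_pairs, cutoff=4.5):
    """Count protein heavy atoms within `cutoff` Å of each base pair.

    protein_atoms: list of (x, y, z) coordinates.
    base_pairs: list of base pairs, each a list of (x, y, z) coordinates.
    """
    profile = []
    for bp_atoms in base_pairs:
        n_i = sum(
            1
            for atom in protein_atoms
            if any(math.dist(atom, bp_atom) <= cutoff for bp_atom in bp_atoms)
        )
        profile.append(n_i)
    return profile


def pwm_column(consensus_base, w):
    """Build one PWM column: consensus base gets w, others share 1 - w.

    The uniform split over non-consensus bases is an assumption.
    """
    rest = (1.0 - w) / 3.0
    return {base: (w if base == consensus_base else rest) for base in "ACGT"}


if __name__ == "__main__":
    protein = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    dna = [[(0.0, 0.0, 4.0)], [(10.0, 0.0, 5.0)]]
    print(atomic_profile(protein, dna))
    print(pwm_column("G", 0.7))
```

Positions with many nearby protein atoms are the ones where the structure constrains the base most strongly, so they would receive a consensus probability close to 1 under any monotonic mapping.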

References

    1. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. Science. 2000;290:2306–2309.
    2. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, MacIsaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al. Nature. 2004;431:99–104.
    3. Mukherjee S, Berger MF, Jona G, Wang XS, Muzzey D, Snyder M, Young RA, Bulyk ML. Nat Genet. 2004;36:1331–1339.
    4. Liu X, Noll DM, Lieb JD, Clarke ND. Genome Res. 2005;15:421–427.
    5. MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E. BMC Bioinformatics. 2006;7:113.
