Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities - PubMed (original) (raw)

doi: 10.1101/gr.100552.109. Epub 2010 Apr 8.

Teemu Kivioja, Jarkko Toivonen, Lu Cheng, Gonghong Wei, Martin Enge, Mikko Taipale, Juan M Vaquerizas, Jian Yan, Mikko J Sillanpää, Martin Bonke, Kimmo Palin, Shaheynoor Talukder, Timothy R Hughes, Nicholas M Luscombe, Esko Ukkonen, Jussi Taipale

Affiliations

Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities

Arttu Jolma et al. Genome Res. 2010 Jun.

Abstract

The genetic code-the binding specificity of all transfer-RNAs--defines how protein primary structure is determined by DNA sequence. DNA also dictates when and where proteins are expressed, and this information is encoded in a pattern of specific sequence motifs that are recognized by transcription factors. However, the DNA-binding specificity is only known for a small fraction of the approximately 1400 human transcription factors (TFs). We describe here a high-throughput method for analyzing transcription factor binding specificity that is based on systematic evolution of ligands by exponential enrichment (SELEX) and massively parallel sequencing. The method is optimized for analysis of large numbers of TFs in parallel through the use of affinity-tagged proteins, barcoded selection oligonucleotides, and multiplexed sequencing. Data are analyzed by a new bioinformatic platform that uses the hundreds of thousands of sequencing reads obtained to control the quality of the experiments and to generate binding motifs for the TFs. The described technology allows higher throughput and identification of much longer binding profiles than current microarray-based methods. In addition, as our method is based on proteins expressed in mammalian cells, it can also be used to characterize DNA-binding preferences of full-length proteins or proteins requiring post-translational modifications. We validate the method by determining binding specificities of 14 different classes of TFs and by confirming the specificities for NFATC1 and RFX3 using ChIP-seq. Our results reveal unexpected dimeric modes of binding for several factors that were thought to preferentially bind DNA as monomers.

PubMed Disclaimer

Figures

Figure 1.

Figure 1.

Schematic description of the high-throughput SELEX process. (A) Protein expression. (Top) Proteins are expressed as fusion proteins with SBP-tagged _Gaussia_-luciferase. (Bottom) The GATEWAY recombination cloning system is used to transfer DNA sequences encoding DBDs or TFs from donor-vectors to the pD40htSELEX expression vector. (B) Ligand design that accommodates multiplexing of samples using barcodes. Each DNA ligand contains a 14-bp randomized region (14N), and a 5-bp barcode (Barcode) that uniquely identifies the individual SELEX sample. To increase specificity, each barcode differs from all other barcodes by at least 2 bp. These variable sequences are flanked by constant sequences that include an Illumina Genome Analyzer sequencing primer site (Seq. primer) and bridge amplification primer binding regions (Fw, Rev; arrows), which are extended in their 5′ regions to accommodate partially nested primers (used in successive SELEX rounds). (C) Basic principle of high-throughput SELEX. A double-stranded DNA mixture containing all possible 14-bp sequences (from B) is incubated with a DNA-binding protein immobilized into a well of a 96-well plate, resulting in binding of DNA to the protein. After washing and elution, the resulting population of more specific sequences is amplified by PCR and subjected to high-throughput single-molecule sequencing. The specificity of the TF can then be constructed by iterating the process and calculating the abundance of distinct sequences after different numbers of cycles. In each cycle, multiple reactions are mixed into a single sequencing lane, and the TFs are identified using the barcode sequences.

Figure 2.

Figure 2.

Enrichment of specific sequences during the SELEX process. (A) Position weight matrices built around the most enriched sequence for four different TFs (see Methods for details). The height of the letter at each position is directly proportional to the incidence of the indicated base in sequences where all other bases exactly match the most enriched sequence. Note that clear enrichment of sequences is observed after one or two SELEX rounds, and that two separate experiments for TCF4 result in a very similar enrichment pattern. In the first cycle, the algorithm used here detects incorrect binding profile for PRDM1 (asterisk) due to a low number of the relatively long consensus sequences. The enrichment of high-affinity sequences can, however, be detected by seeding the algorithm with consensus from the later cycles (see Supplemental Fig. S4A). (B) The fraction of all fragments containing the most enriched sequence from the third SELEX cycle plotted as a function of the SELEX cycle.

Figure 3.

Figure 3.

Description of the bioinformatic visualization and quality-control pipeline. (A) Hamming distance plot. Incidences of all possible subsequences of length 10 (substring count) are plotted as a function of their Hamming distance (number of substitutions) from the most enriched sequence or its reverse complement. To facilitate visualization, random floating point values between −0.3 and 0.3 are added to all plotted x and y values. Note that in a successful experiment (left, GLI2), clear enrichment of sequences is observed, and many enriched sequences are found at a short Hamming distance (1 to 2). In a failed experiment (right, FOXO3), enrichment is very weak, and the enriched sequences are not clearly related to each other. (Insets) Position weight matrices from the same experiments. (B) Position plot. (Bottom) Fractional incidence (frequency) of subsequences of indicated length at each position in both strands (blue), the forward (direction indicated in top; light blue), and reverse (yellow) strand of the 14-bp random sequence. Numbers are separately normalized for each set of bars to add up to 100%, and uniform distribution (red) is shown as control. (Left) Note that in cases where flanking sequences do not interfere with binding, a very uniform distribution of sequences is observed. (Right) In cases where a part of the binding sequence for a TF is found in the constant region or barcode, a strong positional bias is observed. (C) Enrichment plot. Enrichment of a sample of the most enriched sequences and random sequences are plotted as a function of the SELEX cycle. (Left) Note that the enriched sequences show exponential enrichment (log scale), whereas the random sequences are not appreciably enriched. (Right) In cases of barcode contamination (see text), different sequences can appear to enrich in different SELEX cycles ([black sequence] correct; [red sequence] contaminating sequence). The data in C are from SELEX analysis using purified ARID5A protein-coated plates (see Methods).

Figure 4.

Figure 4.

Distance dendogram based on the minimum Kullback-Leibler divergences between TF position weight matrices from the Jaspar database and reference matrices of Figure5. Note that the binding profiles generated using the SELEX method are in general similar to existing matrices for the same or related factors in cases where they are available. Note also that the profiles generated using SELEX for 14 of the 23 major DNA-binding domain families occupy most major branches of TF binding specificities, highlighting the broad utility of the method.

Figure 5.

Figure 5.

Binding profiles. (A) Comparison of determined binding-specificity models with previously known data. The left columns indicate the transcription factor analyzed and its DNA-binding domain family. The SELEX cycle from where the model is derived and the number of independent sequences included in it are also indicated. The previous model for the same protein or for the closest related ortholog (o) or paralog (p) is shown, including reference. These are RUNX1 for RUNX3, mouse RXRA for RXRG, RFX1 for RFX3, mouse MEIS1 for MEIS2, mouse EHF for EHF, mouse POU2F2 for POU2F2, mouse MEIS1 for MEIS2, CEBPA for CEBPE, and MEF2A for MEF2C. References: (1) Kel et al. (1999); (2) Badis et al. (2009); (3) Wei et al. (2010); (4) Lord et al. (2009); (5) Hallikas et al. (2006); (6) Merika and Orkin (1993); (7) Meyers et al. (1993); (8) Emery et al. (1996); (9) Berger et al. (2008); (10) Kroeger and Morimoto (1994); (11) Fisher et al. (1991); (12) Pscherer et al. (1996); (13) Clauss et al. (1996); (14) Grange et al. (1991); (15) Pollock and Treisman (1991). (B) Monomeric and dimeric binding modes. Factors that can bind DNA either as monomers or as dimers are shown; arrows indicate orientations of the monomeric sites. Two dimeric motifs, dimer1 (top) and dimer2 (bottom), were found for NFATC1; a profile similar to the dimer2 has been previously reported by Falvo et al. (2008) for the paralog NFATC2.

Figure 6.

Figure 6.

Validation of binding models by ChIP-seq. (A) Comparison of in vitro derived binding models for RFX3 and NFATC1 and previously described models (Kel et al. 1999; Badis et al. 2009) with motifs that are enriched in peaks from a ChIP-seq experiment for the same factors. The MEME algorithm was used to identify enriched motifs in the ChIP-seq peaks. For both factors, two different antibodies were used in the experiments shown; the expectation value for the motifs is indicated on the right. (Top) Note that both our model and the model of Badis et al. (2009) (PBM model) are supported by ChIP-seq. For NFATC1, the position where our model matches the ChIP-seq-derived model better than the previous model is boxed. Note also that the ChIP-seq-enriched motif for NFATC1 (bottom) appears to contain also signal (g/a)TGA(g/c) that is located right of the NFATC1 monomer profile tGGAAAa(t/a). This signal is likely derived from a dimerization partner, as it is well known that NFAT proteins dimerize with many other TFs (Macian 2005). (B) Relative fractions of peaks containing monomer and dimer sites and combinations thereof for NFATC1 and ERG (ChIP-seq data from Wei et al. 2010). For all matrices, the cut-off score was set to yield 1 site per 10 kb of human genome. Note that a significant fraction of peaks contain motifs that match the dimer model but not the monomer model. The total fraction of peaks with sites was 24% and 14% for NFATC1 and ERG, respectively.

References

    1. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, et al. 2001. The InterPro database, an integrated documentation resource for protein families, domains, and functional sites. Nucleic Acids Res 29: 37–40 - PMC - PubMed
    1. Badis G, Chan ET, van Bakel H, Pena-Castillo L, Tillo D, Tsui K, Carlson CD, Gossett AJ, Hasinoff MJ, Warren CL, et al. 2008. A library of yeast transcription factor motifs reveals a widespread function for Rsc3 in targeting nucleosome exclusion at promoters. Mol Cell 32: 878–887 - PMC - PubMed
    1. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, et al. 2009. Diversity and complexity in DNA recognition by transcription factors. Science 324: 1720–1723 - PMC - PubMed
    1. Bailey TL, Elkan C 1995. The value of prior knowledge in discovering motifs with MEME. Technical Report CS95-413. Department of Computer Science, University of California, San Diego - PubMed
    1. Benos PV, Bulyk ML, Stormo GD 2002. Additivity in protein–DNA interactions: How good an approximation is it? Nucleic Acids Res 30: 4442–4451 - PMC - PubMed

Publication types

MeSH terms

Substances

LinkOut - more resources