TAL Effector-Nucleotide Targeter (TALE-NT) 2.0: tools for TAL effector design and target prediction

Abstract

Transcription activator-like (TAL) effectors are repeat-containing proteins used by plant pathogenic bacteria to manipulate host gene expression. Repeats are polymorphic and individually specify single nucleotides in the DNA target, with some degeneracy. A TAL effector-nucleotide binding code that links repeat type to specified nucleotide enables prediction of genomic binding sites for TAL effectors and customization of TAL effectors for use in DNA targeting, in particular as custom transcription factors for engineered gene regulation and as site-specific nucleases for genome editing. We have developed a suite of web-based tools called TAL Effector-Nucleotide Targeter 2.0 (TALE-NT 2.0; https://boglab.plp.iastate.edu/) that enables design of custom TAL effector repeat arrays for desired targets and prediction of TAL effector binding sites, ranked by likelihood, in a genome, promoterome or other sequence of interest. Search parameters can be set by the user to work with any TAL effector or TAL effector nuclease architecture. Applications range from designing highly specific DNA targeting tools and identifying potential off-target sites to predicting effector targets important in plant disease.

INTRODUCTION

Transcription activator-like (TAL) effectors from the plant pathogenic bacterial genus Xanthomonas represent a class of DNA binding proteins that can be readily engineered to target novel DNA sequences. During infection, TAL effectors are deployed by the bacteria to modulate host gene expression, with each effector directly binding an effector-specific DNA target (1,2). A central region in the protein, composed of a variable number of tandem, near-identical, 33–35 amino acid repeats, determines the target(s) of each TAL effector. Repeat-to-repeat variation occurs primarily at residues 12 and 13 (termed the repeat variable diresidue or RVD). The RVD sequence has been shown both computationally and experimentally to correspond directly to the DNA target site sequence; repeats with different RVDs recognize different nucleotides, in a code-like fashion, with some degeneracy (3,4). Custom TAL effectors can be targeted to novel DNA sequences by assembling an array of repeats that corresponds to the intended target site (5). Designing custom TAL effectors for DNA targeting has proved to be a much simpler and less labor-intensive process than the design of other customizable DNA binding proteins such as zinc fingers (6), and a variety of rapid construction methods for custom TAL effectors and TAL effector fusion proteins have recently been developed (7–12). Increasingly, therefore, TAL effectors and TAL effector-based fusion proteins have been adopted as tools for DNA targeting applications (6).

Site-specific DNA modification has been achieved using TAL effector-endonuclease fusion proteins (TAL effector nucleases or TALENs), which create targeted double-stranded breaks (DSBs) in DNA. TALEN architectures to date combine the central repeat region and some portion of the flanking parts of the TAL effector with the catalytic domain of FokI, which functions as a dimer. Thus, TALENs work in heteromeric pairs, with the monomers binding to opposing target sites oriented 5′–3′ on opposite strands across a spacer, allowing the C-terminal FokI domains to dimerize and create a DSB in that spacer (5,13–15). In eukaryotes, such breaks are repaired by either non-homologous end joining (NHEJ), which is prone to errors and useful for creating gene knockouts, or by homologous recombination (HR), which replaces the sequences surrounding the break with a supplied template. TALEN-mediated NHEJ and HR have been demonstrated in a variety of cell types and organisms (7,8,13,14,16–20). TAL effectors have also been used as custom transcriptional activators, most effectively in plants in their native form and in mammalian cells with the native acidic activation domain replaced by the VP16 activation domain (or tetrameric derivative VP64) from herpes simplex virus (10,21,22). The TAL effector targeting domain has also been fused to a repressor protein and used successfully for specific gene repression in plants (23).

Custom TAL effector and TALEN architectures using different proportions of the flanking regions on either side of the repeat region have been reported. In the case of TALENs, the optimum spacer length between the TALEN monomers depends on the architecture used (5,13,15,24). In addition, different methods assemble different numbers of repeats in an array, with some allowing a wide range (7–12). Therefore, tools for designing custom TAL effectors and TALENs should allow for a range or a prescribed number of repeats and for TALENs, various spacer lengths. Most TAL effector targets in nature are preceded by a T at the 5′-end (3), but at least one example of a TAL effector target preceded by a C has been identified (25), and some custom TAL effectors have been reported to be active on sites preceded by a C (13). The preference for T at this position is structurally specified by the region of the protein immediately N-terminal to the central repeat region (26,27). Thus, accounting for the ‘−1 position’ nucleotide is important for target prediction and custom TAL effector design as well.

Because of the degeneracy in the TAL effector–DNA recognition code, off-targeting is a concern in the use of TAL effectors and TALENs. RVDs associate preferentially, but not exclusively, with specific nucleotides. TAL effector–target pairs in nature are observed to have up to 50% mismatches (positions at which the RVD is not aligned with its most frequently associated nucleotide) (3), and TAL effector-based custom transcription factors have been shown to activate transcription of off-target genes whose promoters contain sequences similar but not identical to the intended targets (21). If custom TAL effectors and TALENs are to be a widely adopted biotechnology, accurate prediction of potential off-target activity and design to minimize such activity will be crucial. Prediction of potential TAL effector binding sites that takes the degeneracy into account is also important for the identification of disease targets in plant hosts of Xanthomonas spp.

To aid researchers in adopting TALENs as gene editing tools, we converted a computational script we had created for TALEN design, called ‘TALEN Targeter’, to a web-based tool and posted it on a site we called TAL Effector-Nucleotide Targeter (TALE-NT 1.0). The tool designed TALEN pairs according to five design guidelines based on positional and composition biases observed in known TAL effector–target pairs (7). Herein, we describe a new version of the web site, TALE-NT 2.0, that offers a suite of tools for TALEN and TAL effector design as well as target prediction. Since the publication of TALE-NT 1.0, it was shown that the design guidelines have little effect on TALEN efficiency (12). Therefore, for TALE-NT 2.0, we removed the guidelines from TALEN Targeter and instead allow users to search for TALENs targeting a specific base. We also updated TALEN Targeter to provide users with additional options for entering sequences, customizing their queries to accommodate different TALEN architectures and other preferences, searching for sites preceded by T or C and viewing output. We added two new tools: ‘TAL Effector Targeter’, for designing custom TAL effectors for single sites, and ‘Target Finder’, for predicting candidate TAL effector targets, ranked by likelihood, in the genomes or gene promoteromes of several model organisms or in a sequence supplied by the user. The new web site (https://boglab.plp.iastate.edu/) makes this suite of tools freely available to the research community. Applications range from designing highly specific DNA targeting reagents and identifying potential off-target sites to predicting effector targets important in plant disease.

SOFTWARE AND ALGORITHMS

Programming

TALE-NT 2.0 tools other than Target Finder are written in the Python programming language and use Biopython libraries for parsing input DNA sequences and other sequence operations (28). Target Finder is written in C and uses kseq.h (http://lh3lh3.users.sourceforge.net/parsefastq.shtml) for sequence parsing.

Tools for designing custom TALENs and TAL effectors

TALEN Targeter and TAL Effector Targeter design paired and single custom TAL effector repeat arrays, respectively, for targeting DNA sequences of interest. Both tools require one or more FASTA-formatted DNA sequences as input. TALEN Targeter identifies paired monomer (typically heteromeric) binding sites oriented 5′–3′ on opposite strands of the DNA and separated by a spacer. TAL Effector Targeter identifies single TAL effector binding sites; users have the option of searching the reverse complement in addition to searching the DNA sequence as entered. In either tool the number of TAL effector repeats can vary across a user-specified range. TALEN Targeter allows users also to specify a range for the spacer length; all possible combinations of repeat numbers and spacer lengths are considered. Identified binding sites are converted into RVD sequences using the four most common RVD–nucleotide pairs (NI-A, HD-C, NN-G and NG-T). Target sites, RVD sequences and other information are returned to the user (the output for all tools is described more fully in the ‘Web Interface’ section). Arrays designed with TAL Effector Targeter are by default designed to meet the five guidelines developed based on TAL effectors observed in nature, as the effect of these guidelines on TAL effector transcription factors has not been determined. Users may choose not to enforce one or more guidelines.
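The conversion step described above can be illustrated with a minimal Python sketch. It is not the TALE-NT source code: the function names `site_to_rvds` and `find_single_sites` are hypothetical, only the forward strand is scanned, and none of the five design guidelines is applied. The only detail taken from the text is the use of the four most common RVD–nucleotide pairs and a user-specified range for the number of repeats.

```python
# The four most common RVD-nucleotide pairs named in the text.
NUCLEOTIDE_TO_RVD = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def site_to_rvds(site):
    """Return the space-separated RVD string for a binding site.

    `site` excludes the -1 nucleotide (the T or C preceding the site).
    """
    site = site.upper()
    unknown = sorted(set(site) - set(NUCLEOTIDE_TO_RVD))
    if unknown:
        raise ValueError(f"unsupported nucleotide(s): {unknown}")
    return " ".join(NUCLEOTIDE_TO_RVD[base] for base in site)


def find_single_sites(seq, min_repeats, max_repeats, upstream="T"):
    """Yield (start, site, rvds) for forward-strand sites preceded by `upstream`.

    `start` is the 0-based index of the first base bound by a repeat.
    Windows containing anything other than A, C, G or T are skipped.
    Hypothetical helper: no design guidelines are enforced here.
    """
    seq = seq.upper()
    for i, base in enumerate(seq):
        if base not in upstream:
            continue
        for n in range(min_repeats, max_repeats + 1):
            site = seq[i + 1 : i + 1 + n]
            if len(site) < n:
                break  # ran off the end of the sequence
            if set(site) <= set(NUCLEOTIDE_TO_RVD):
                yield i + 1, site, site_to_rvds(site)
```

Passing `upstream="TC"` would accept sites preceded by either T or C, mirroring the option the tools expose.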

A tool for identifying candidate TAL effector targets

Target Finder allows users to enter an RVD sequence and search for candidate targets in the genome or gene promoterome of any of several model organisms or in one or more user-provided, FASTA-formatted DNA sequences. The tool may be used to identify and rank candidate plant genomic targets of TAL effectors important for disease or potential off-target binding sites of custom TAL effector proteins.

To predict and rank sites, the tool uses a simple scoring function developed by Moscou and Bogdanove (3) based on RVD–nucleotide association frequencies found in a set of known TAL effector–target pairs. We have used this scoring function to predict 21 previously unknown and subsequently experimentally verified plant targets for 14 TAL effectors (A. Cernadas, E. Doyle and A. Bogdanove, unpublished results). Briefly, for each RVD in the set of TAL effector–target pairs, the frequency with which it pairs with each nucleotide was calculated and then converted to a weighted RVD–nucleotide association probability. For RVDs that are not observed in the set of pairs, each nucleotide association was given an equal probability. The score for a DNA/RVD sequence alignment is found by summing the negative logs of the appropriate RVD–nucleotide association probabilities, such that better alignments have lower scores. A detailed description of the scoring function, including weighted RVD–nucleotide association probabilities, is provided in the Supplementary Material.
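The scoring rule can be sketched in a few lines of Python. The probabilities in `ASSOCIATION` below are invented for illustration only; the actual weighted values are given in the Supplementary Material and are not reproduced here. The use of the natural logarithm is also an assumption, since the text does not state the base.

```python
import math

# Illustrative, made-up association probabilities (NOT the published values).
ASSOCIATION = {
    "NI": {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    "HD": {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    "NN": {"A": 0.30, "C": 0.05, "G": 0.60, "T": 0.05},
    "NG": {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
}
# RVDs absent from the table get an equal probability for every nucleotide.
UNIFORM = {base: 0.25 for base in "ACGT"}


def score_alignment(rvds, site):
    """Sum the negative logs of the RVD-nucleotide probabilities.

    `rvds` is a list such as ["NI", "HD"]; `site` is the aligned DNA
    string of the same length. Lower scores indicate better alignments.
    """
    if len(rvds) != len(site):
        raise ValueError("RVD sequence and site must have equal length")
    return sum(
        -math.log(ASSOCIATION.get(rvd, UNIFORM)[base])
        for rvd, base in zip(rvds, site.upper())
    )
```

With these placeholder values, `score_alignment(["NI", "HD"], "AC")` is lower (better) than `score_alignment(["NI", "HD"], "GT")`, and an RVD missing from the table contributes `-log(0.25)` regardless of the nucleotide.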

Target Finder returns a list of the lowest scoring (best) sites in the queried DNA sequence below a cutoff (discussed in the Web Interface section). Candidate binding sites are not required to conform to any of the design guidelines except that they must be preceded by a T, or C if that option is selected. The tool by default searches for binding sites on both strands of the DNA sequence(s), but users may opt to search only the forward strand.
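The strand handling described above can be sketched as follows. This is an illustrative Python sliding-window scan, not the C implementation of Target Finder; `candidate_sites` and its parameters are hypothetical names. Sites found on the reverse strand are reported with coordinates mapped back onto the forward strand, which is one reasonable convention and an assumption here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string (other symbols pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_sites(seq, n_rvds, upstream="T", both_strands=True):
    """Yield (strand, start, site) for every window preceded by `upstream`.

    `site` is read 5'-3' on the strand that is bound. `start` is the
    0-based forward-strand index of the leftmost base covered by the site.
    """
    seq = seq.upper()
    length = len(seq)
    strands = [("+", seq)]
    if both_strands:
        strands.append(("-", reverse_complement(seq)))
    for strand, s in strands:
        for j in range(length - n_rvds):
            if s[j] not in upstream:
                continue  # the -1 position must match
            site = s[j + 1 : j + 1 + n_rvds]
            if strand == "+":
                start = j + 1
            else:
                # Map the reverse-strand window back to forward coordinates.
                start = length - j - 1 - n_rvds
            yield strand, start, site
```

Each yielded site would then be scored against the RVD sequence, and only the lowest-scoring sites below the cutoff retained.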

WEB INTERFACE

Design and general features

The TALE-NT 2.0 web site is powered by Drupal 7 on Red Hat Enterprise Linux 6. Job queuing is handled by Celery using Redis as a message broker. All features of the web site have been tested on common web browsers.

TALE-NT 2.0 makes the three tools for design of TAL effectors and prediction of TAL effector targets freely available to all users, with no log-in requirement. Upon submitting a job, users are taken to a bookmarkable page updating them on the status of their query or supplying a link that will take them to their results when the query has finished processing. Processing times are not excessive: a search of the entire rice genome for possible targets of an average length TAL effector (18 RVDs) took <3 min. Nevertheless, users have the option to enter an address to receive email notification when their job finishes. Results for each tool are displayed in a sortable table by default. Users may also or instead download results as a tab-delimited text file. For Target Finder users can download either or both of two formats, standard or GFF3, described further below.

All tools allow users to design or search for TAL effector binding sites preceded by a 5′ T only, C only, or T or C. In our hands, however, with either TALENs or TAL effectors, sites preceded by a C are significantly less active than those preceded by a T, so we suggest using T, which is the default selection.

TALEN Targeter

The TALEN Targeter web interface allows users to design custom TALENs to target one or more DNA sequences of interest. Users enter their sequence(s) in a text box or upload a file containing the sequence(s). Allowable file size is up to 2 MB. Users may select from four common TALEN architectures (7,13,15,24) with pre-selected ranges for spacer size and numbers of repeats or enter their own ranges for these parameters. Users may choose to allow binding sites to be preceded by a T only, C only, or by T or C. In addition, users have the option to output just TALEN pairs targeting a specific base, a filtered list including up to one TALEN pair for each base or the complete list of all TALEN pairs targeting anywhere in the sequence. The targeted base is defined as the base in the center (or immediately to the left of the center) of the spacer. If users choose to return up to one TALEN pair per base, the pair with the smallest average number of RVDs and shortest spacer targeting a given base will be returned.
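The pair geometry and the definition of the targeted base can be sketched in Python. This is an illustrative enumeration under stated assumptions, not the TALEN Targeter code: the names `find_talen_pairs` and `one_pair_per_base` are hypothetical, both monomers are required to have a 5′ T, and ties in the per-base filter are broken by average RVD count first and spacer length second, an ordering the text does not specify.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def find_talen_pairs(seq, repeat_range, spacer_range):
    """Yield one dict per candidate TALEN pair on `seq`.

    `repeat_range` and `spacer_range` are inclusive (min, max) tuples.
    All indices are 0-based positions on the sequence as entered.
    """
    seq = seq.upper()
    lo_r, hi_r = repeat_range
    lo_s, hi_s = spacer_range
    for i, base in enumerate(seq):
        if base != "T":
            continue  # left monomer needs a T at its -1 position
        for n_left in range(lo_r, hi_r + 1):
            spacer_start = i + 1 + n_left
            for spacer_len in range(lo_s, hi_s + 1):
                right_start = spacer_start + spacer_len
                for n_right in range(lo_r, hi_r + 1):
                    right_end = right_start + n_right  # exclusive
                    # The right monomer reads the opposite strand, so its
                    # 5' T appears as an A just past its site on this strand.
                    if right_end >= len(seq) or seq[right_end] != "A":
                        continue
                    yield {
                        "left_start": i + 1,
                        "left_site": seq[i + 1 : spacer_start],
                        "spacer": seq[spacer_start:right_start],
                        "right_site": reverse_complement(seq[right_start:right_end]),
                        # Center of the spacer, or just left of center
                        # when the spacer length is even.
                        "target_index": spacer_start + (spacer_len - 1) // 2,
                    }


def one_pair_per_base(pairs):
    """Keep, per targeted base, the pair with the fewest RVDs on average,
    then the shortest spacer (assumed tie-break order)."""
    best = {}
    for pair in pairs:
        key = (
            (len(pair["left_site"]) + len(pair["right_site"])) / 2,
            len(pair["spacer"]),
        )
        idx = pair["target_index"]
        if idx not in best or key < best[idx][0]:
            best[idx] = (key, pair)
    return [best[idx][1] for idx in sorted(best)]
```

For a spacer of length 4 starting at index 4, the targeted base is index 5; for a spacer of length 5 at the same start, it is index 6.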

Each line of the output describes a pair of monomers that will function together as a TALEN. Information provided includes the name of the input sequence (in case more than one sequence was entered), the starting position and number of repeats for each monomer, the target sequence including the −1 nucleotide and RVD sequences corresponding to each monomer.

The output is ordered by sequence name, with TALENs for each sequence grouped by the start position of the first monomer. The order of TALENs in the output does not relate to how well they might function. No scoring or other prioritization method is used. Users should select TALENs closest to the site of desired cleavage that have the fewest or poorest predicted off-target sites determined using Target Finder and further analysis. This analysis should take into account not only the heterodimer but both of the possible homodimers as well. Users may also wish to choose TALENs that straddle a spacer with a diagnostic restriction endonuclease site to facilitate detection of NHEJ-mediated mutation of the spacer sequence (such restriction endonuclease sites are indicated in the output table) or in the case of HR, a TALEN that uniquely binds the sequence to be replaced, so that following replacement it does not cut again. Users might opt to introduce silent mismatches in the HR template to facilitate this.
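Two of the selection steps above lend themselves to a short sketch: listing the three dimer configurations that should be checked for off-targets, and testing whether a spacer carries a restriction site usable as a diagnostic. The Python below is illustrative only. The enzyme table contains three well-known recognition sequences and is not the set reported by TALE-NT, and the requirement that the site occur exactly once in the PCR amplicon is a practical assumption rather than something stated in the text.

```python
# Recognition sequences of three widely used enzymes (all palindromic).
RECOGNITION_SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def dimers_to_check(left_rvds, right_rvds):
    """Return the heterodimer and both homodimers as (label, first, second)."""
    return [
        ("heterodimer", left_rvds, right_rvds),
        ("left homodimer", left_rvds, left_rvds),
        ("right homodimer", right_rvds, right_rvds),
    ]


def diagnostic_sites(spacer, amplicon):
    """Return enzymes whose site lies in the spacer and is unique in the amplicon.

    Loss of the site after NHEJ-mediated mutation of the spacer can then
    be detected by digesting the amplicon (assumed assay design).
    """
    spacer, amplicon = spacer.upper(), amplicon.upper()
    return sorted(
        name
        for name, site in RECOGNITION_SITES.items()
        if site in spacer and amplicon.count(site) == 1
    )
```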

TAL Effector Targeter

TAL Effector Targeter allows users to design a single custom TAL effector repeat array to target a DNA sequence of interest. The array may be used for the design of a TAL effector or any TAL effector-based fusion protein that functions as a monomer. As with TALEN Targeter, users enter sequence(s) in a text box or upload a FASTA-formatted file. Additional text boxes allow users to specify a range for the number of repeats in each TAL effector (default is 15–30). Checkbox options allow users to turn off individual design guidelines. By default, the tool searches only the sequence(s) as entered; a checkbox option includes also the reverse complement sequence(s). Users may also choose to search only for sites preceded by T at the 5′-end, or to allow T and/or C.

Each line of the output describes a single custom TAL effector repeat array. Information returned includes the name of the input sequence (in case multiple sequences were entered), the strand and coordinates of the target, the target sequence including the −1 nucleotide and corresponding RVD sequences.

The output is ordered first by sequence name, with target sites for each sequence sorted by their position in the sequence. As with TALEN Targeter, arrays are not ranked or prioritized. Users should choose according to their own targeting criteria. If there are no target sites in a sequence or no sites in the desired region, users may be able to increase the number of sites by relaxing some of the design guidelines and/or changing the range for number of repeats. We recommend that users leave the percent composition rule enforced to help ensure good overall affinity. Although data regarding relative affinities of different RVDs for their partner nucleotide(s) are lacking, based on the published structures (26,27), we predict that the highest affinity interactions are those that occur most frequently in nature. So, if a user chooses not to enforce the percent composition guideline, at a minimum, arrays should be selected that have overall higher numbers of HDs and NGs than NIs and NNs. Also, in most contexts tested, NN pairs with G or A, so we anticipate that minimizing NN content will maximize specificity.

Target Finder

Target Finder uses the scoring function of Moscou and Bogdanove (3) to identify the best-scoring sites in a DNA sequence for a user-specified string of RVDs. Users input a TAL effector RVD sequence containing 12 to 35 RVDs, each separated by a space, using single-letter amino acid abbreviations. All possible RVDs constructed using the standard 20 amino acids and ‘*’ to indicate a missing 13th amino acid are allowed. Users may select a genome or gene promoterome from a drop-down list or provide their own sequence of interest in a text box or as a file upload. A promoterome is defined here as the collection of sequences 1000 bp upstream of annotated translational start sites in a genome. Available genomes and promoteromes include those of rice (Oryza sativa), Arabidopsis thaliana, human (Homo sapiens), fruit fly (Drosophila melanogaster), mouse (Mus musculus), nematode (Caenorhabditis elegans) and zebrafish (Danio rerio). Genome sequences were obtained from Ensembl (www.ensembl.org), except for the rice genome, which was downloaded from the MSU Rice Genome Annotation Project (http://rice.plantbiology.msu.edu, Version 6.1). Promoteromes were downloaded from the UCSC Genome Bioinformatics Site (http://genome.ucsc.edu/index.html), except for the rice promoterome, which was downloaded from the Rice Genome Annotation Project and the Arabidopsis promoterome, which was downloaded from The Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org/). By default, the tool searches both strands, but users may choose to search only the forward strand. Users may choose to search for sites preceded by a 5′ T only, C only, or T or C.
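The input rules for the RVD sequence can be expressed as a small validator. This Python sketch follows the constraints stated above (12 to 35 space-separated RVDs, each built from the 20 standard amino acids, with ‘*’ allowed in place of a missing 13th residue). The function name `parse_rvd_sequence` is hypothetical, and accepting lower-case input is an added convenience, not a documented behavior of the web form.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def parse_rvd_sequence(text, min_rvds=12, max_rvds=35):
    """Split and validate a space-separated RVD string.

    Returns the RVDs as a list of two-character strings, or raises
    ValueError describing the first problem found.
    """
    rvds = text.upper().split()
    if not min_rvds <= len(rvds) <= max_rvds:
        raise ValueError(
            f"expected {min_rvds} to {max_rvds} RVDs, got {len(rvds)}"
        )
    for rvd in rvds:
        if len(rvd) != 2:
            raise ValueError(f"RVD {rvd!r} is not two characters long")
        residue_12, residue_13 = rvd
        if residue_12 not in AMINO_ACIDS:
            raise ValueError(f"invalid residue 12 in {rvd!r}")
        # '*' marks a missing 13th amino acid.
        if residue_13 not in AMINO_ACIDS and residue_13 != "*":
            raise ValueError(f"invalid residue 13 in {rvd!r}")
    return rvds
```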

Target Finder returns all targets whose ratio of observed score to best possible score is 3.0 or less. The best possible score is the score for the array aligned to its code-specified, perfect match DNA sequence. Users may relax the cutoff to 4.0. The default threshold of 3.0 was selected because the naturally occurring TAL effector–target site pairs we analyzed to decipher the TAL effector DNA binding code and develop the scoring matrix (3) typically had scores <3.0 times the best possible score for the TAL effector. In addition, for binding sites of naturally occurring TAL effectors predicted using the scoring function, experimentally verified sites also had scores <3.0 times the best possible score (A. Cernadas, E. Doyle and A. Bogdanove, unpublished results). Users wishing to identify more sites should choose the less stringent cutoff. The best possible score for an array is included at the top of the output table or file. Information provided for each target in the output includes the coordinates, the DNA strand being targeted, the score for the target and the target sequence. Targets in each DNA sequence are ordered from best (lowest scoring) to worst (highest scoring). For searches of genomes or promoteromes from the drop-down list, the output in the default displayed table includes a link for each target that configures a custom track to show the target location in a corresponding genome browser when available. The number of targets displayed in the output table is set by default to a maximum of 10, though users can specify a different number. In the downloadable output, all targets below the threshold are included. As noted above, the downloadable output is available in two formats. The standard file includes the same information for each target as is displayed in the browser output table. The GFF3 file contains gene feature coordinates and can be used in many genome browser and related applications.
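The cutoff, the ranking and a GFF3-style export can be sketched together. The Python below is illustrative and assumes each hit is a dict with the hypothetical keys `seqid`, `start` (0-based), `site`, `strand` and `score`. The GFF3 source name, the feature type `binding_site` and the `Name` attribute are placeholders; the columns actually written by Target Finder may differ.

```python
def filter_and_rank(hits, best_score, cutoff=3.0):
    """Keep hits with score / best_score <= cutoff, best (lowest) first."""
    if best_score <= 0:
        raise ValueError("best possible score must be positive")
    kept = [hit for hit in hits if hit["score"] / best_score <= cutoff]
    return sorted(kept, key=lambda hit: hit["score"])


def to_gff3(hits, source="TargetFinderSketch", feature_type="binding_site"):
    """Format hits as GFF3 text (nine tab-separated columns per feature)."""
    lines = ["##gff-version 3"]
    for hit in hits:
        start = hit["start"] + 1              # GFF3 is 1-based
        end = hit["start"] + len(hit["site"])  # and end-inclusive
        lines.append(
            "\t".join(
                [
                    hit["seqid"],
                    source,
                    feature_type,
                    str(start),
                    str(end),
                    f"{hit['score']:.2f}",
                    hit["strand"],
                    ".",  # phase applies only to CDS features
                    f"Name={hit['site']}",
                ]
            )
        )
    return "\n".join(lines)
```

Calling `filter_and_rank(hits, best_score, cutoff=4.0)` corresponds to the relaxed threshold; truncating the returned list to its first 10 entries would mirror the default size of the displayed table.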

It is important to point out that the contributions of individual RVD–nucleotide associations to overall binding affinity are not yet worked out. Scoring based on RVD–nucleotide association frequencies provides only an estimate of relative affinities. Thus, not all sites returned may be efficiently bound by the TAL effector, irrespective of their score, and some biologically relevant targets may be missed if their scores rise above the arbitrary cutoff. Target Finder output should be considered a best estimate of the most probable binding sites for a given TAL effector based on available information. It represents a good starting point for further study to identify true targets or off-targets through experimentation, and a tool for initial assessment of probable relative specificities to choose among multiple arrays that might be available to target a sequence of interest.

A final note about the web interface for Target Finder: TALE-NT 2.0 provides a convenient way to use the tool, but working through the web interface restricts users to the model organism sequences available on the site, or relatively short, user-provided sequences. To study large datasets not included on the site, users can download from TALE-NT 2.0 a C-coded version of Target Finder and run it locally under an open source license.

CONCLUSION

TALEN Targeter and TAL Effector Targeter are versatile tools that allow design of custom RVD arrays for gene editing, engineered gene regulation and other applications. Although other TALEN design tools exist (TALEN Hit, http://talen-hit.cellectis-bioresearch.com/search; ZiFit Targeter 4.0 (16), http://zifit.partners.org/ZiFiTBeta/Introduction.aspx and idTALE (29), http://idtale.kaust.edu.sa/index.html), TALEN Targeter is the only tool that works with any architecture by allowing users to specify ranges for both spacer size and number of repeats. TAL Effector Targeter is the only tool available that targets single custom TAL effector arrays. Although idTALE allows users to search a genome for paired TALEN sites, its search function identifies exact matches only. Target Finder uses a scoring function that allows a biologically relevant number of mismatches in the TAL effector–target alignment, useful for identifying candidate targets of naturally occurring TAL effectors, as well as potential off-targets for custom TAL effector-based proteins.

In addition to these design and targeting tools, TALE-NT 2.0 also includes help pages and tutorials, with guides to interpretation of results. A ‘Protocols and Reagents’ page provides useful links to other TAL effector and TALEN resources. With its wide range of capabilities and content, TALE-NT 2.0 should be a valuable resource for anyone studying TAL effector function or using custom TAL effectors, TALENs or other TAL effector-based proteins for DNA targeting applications.

A new tool, Paired Target Finder, that automates the identification of potential off-target sites for TALENs, as heterodimer or either homodimer, was recently added to the web site.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Table 1.

FUNDING

The National Science Foundation [IOS#0820831] (to A.B. in part and ISO#1221984 to V.B. in part) and the National Institutes of Health [R01GM098861] (to A.B. and D.V.). Funding for open access charge: NIH grant [R01GM098861].

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors thank D. Mistry, W. Kokulapalan and the Iowa State University Bioinformatics and Computational Biology Laboratory for assistance creating the web site and members of the Bogdanove and Voytas laboratories for beta-testing and suggestions.

REFERENCES