The Phyre2 web portal for protein modelling, prediction and analysis

Author manuscript; available in PMC: 2017 Feb 8.

Published in final edited form as: Nat Protoc. 2015 May 7;10(6):845–858. doi: 10.1038/nprot.2015.053

Summary

Phyre2 is a suite of tools available on the web to predict and analyse protein structure, function and mutations. The focus of Phyre2 is to provide biologists with a simple and intuitive interface to state-of-the-art protein bioinformatics tools. Phyre2 replaces Phyre, the original version of the server for which we previously published a protocol. In this updated protocol, we describe Phyre2, which uses advanced remote homology detection methods to build 3D models, predict ligand binding sites, and analyse the effect of amino-acid variants (e.g. nsSNPs) for a user’s protein sequence. Users are guided through results by a simple interface at a level of detail determined by them. This protocol will guide a user from submitting a protein sequence to interpreting the secondary and tertiary structure of their models, their domain composition and model quality. A range of additional available tools is described to find a protein structure in a genome, to submit large numbers of sequences at once and to automatically run weekly searches for proteins that are difficult to model. The server is available at http://www.sbg.bio.ic.ac.uk/phyre2. A typical structure prediction will be returned between 30 minutes and 2 hours after submission.

Keywords: protein modelling, protein structure prediction, homology modelling, phyre2, poing, structural bioinformatics, nsSNPs, disease variants, protein modelling server, CASP

Introduction

In September 2014, the UniProtKB/TrEMBL protein database contained over 80 million protein sequences. The Protein Data Bank contains just over 100,000 experimentally determined 3D structures. This ever-widening gap between our knowledge of sequence space and structure space poses serious challenges for researchers seeking the structure and function of a protein sequence of interest.

Fortunately, advances in computational techniques to predict protein structure and function can substantially shrink this gap. On average 50-70% of a typical genome can be structurally modelled using such techniques1. The key principles on which such techniques work are a) that protein structure is more conserved in evolution than protein sequence and b) that there is evidence of a finite and relatively small (1,000-10,000) number of unique protein folds in nature2. These principles permit the protein structure prediction problem to be considered as a problem of matching a sequence of interest to a library of known structures, rather than the more complex and error-prone approach of simulated folding.

For over 30 years researchers have developed and refined computational methods for protein structure prediction. Such methods include simulated folding using physics-based or empirically-derived energy functions, construction of the model from small fragments of known structure, threading where the compatibility of a sequence with an experimentally-derived fold is determined using similar energy functions, and template-based modelling, where a sequence is aligned to a sequence of known structure based on patterns of evolutionary variation. Template-based modelling encompasses the strategies that have been called homology modelling, comparative modelling and fold recognition. It is this technique that has become the most universally reliable and widely used by both the modelling and wider bioscience communities. The success of template-based modelling over other methods is due to three main factors: 1, the development of powerful statistical techniques to extract evolutionary relationships from homologous sequences; 2, the enormous growth in sequencing projects which provides the raw information; and 3, the power of computing to process large databases with a fast turn-around.

Today, the most widely used and reliable methods for protein structure prediction rely on some method to compare a protein sequence of interest to a large database of sequences, construct an evolutionary/statistical profile of that sequence, and subsequently scan this profile against a database of profiles for known structures. This results in an alignment between two sequences, one of unknown structure and one of known structure. One can then use this alignment, or set of equivalences, to construct a model of one sequence based on the structure of another. When the sequence similarity between the protein of interest and the database protein(s) is low, then detection of the relationship and the subsequent alignment can be enhanced if structural information is included to augment the sequence analysis.

Phyre2 is the successor of Phyre, for which we previously published a protocol3. Although the original Phyre3 and the new Phyre2 share the common aim of protein modelling, the new Phyre2 system described in this updated protocol has been designed from scratch and shares no components with its predecessor. Phyre2 was launched in January 2011, but users have needed to reference the original Phyre protocol3, which has been cited over 2,400 times. Phyre2 is one of the most widely used protein structure prediction servers and serves approximately 40,000 unique users per year, processing approximately 700-1000 user-submitted proteins per day. In collaboration with other groups we have applied Phyre2 to the annotation of a wide range of genomes4-6.

Comparison to other methods

A number of other powerful structure prediction servers exist on the web. However, for the majority of modelling tasks the differences in accuracy between such tools are minor7. The key differentiating factor for Phyre2 is ease of use. One of the primary objectives of the Phyre2 server is to provide a user-friendly interface to cutting-edge bioinformatics methods. This enables biologists inexpert in bioinformatics to use state-of-the-art techniques without the very steep learning curve typical of many online modelling tools.

Some of the most widely used web servers for protein modelling are Phyre2, i-TASSER8, Swiss-Model9, HHpred10, PSI-Pred11, Robetta12 and Raptor13. In international blind trials of protein structure prediction methods (CASP)7, it is observed that for the majority of modelling tasks, there is no significant difference in the accuracy of these methods. In extremely difficult modelling tasks, where remote homology is uncertain and where significant regions of a sequence cannot be matched to a known structure, i-TASSER8 has shown a small but significant performance improvement over other methods. Phyre2 has been tested in CASP9, 10 and 11 experiments (results can be seen at: http://predictioncenter.org/index.cgi). To compare performance to other systems we consider fully automated systems for template-based modelling (TBM) and the average model quality (known as GDT_TS in CASP) they produce over the course of the CASP experiment (120 protein domain targets in CASP9, and 98 in CASP10). Typically these domains share <30% sequence identity with an identified template. As a single research group may submit multiple servers in CASP, we consider only the single best performing server from each participating group. In CASP9 Phyre2 ranked 6th out of approximately 55 unique groups. The 5 superior groups to Phyre2 had an average improvement in model quality of 2.8% with only i-TASSER showing a 5% improvement. In CASP10, Phyre2 ranked 10th out of approximately 45 groups. Excluding i-TASSER (8% improvement) the remaining 8 superior groups showed an average improvement over Phyre2 of 3.7%.

To understand these improvements in a structural context one should note that in a typical 200 residue protein a 1% improvement in model quality roughly corresponds to 2 extra residues being within 4.5Å of the native. CASP11 data for average model quality is not yet available from the prediction centre website. We consider that the primary difference between these servers and Phyre2 is not in accuracy, but in ease-of-use by non-bioinformaticians.
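The arithmetic behind this rule of thumb can be illustrated with a minimal sketch. It assumes the model has already been superposed on the native structure and uses the four standard GDT_TS distance cutoffs (1, 2, 4 and 8 Å); the official CASP calculation searches over many superpositions, so this simplified function is only an approximation. The function name `gdt_ts` and the example deviations are illustrative assumptions, not part of Phyre2.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS (0-100) from per-residue C-alpha deviations in angstroms.

    Assumes a single, fixed superposition of model on native; the real GDT
    procedure maximises each fraction over many superpositions.
    """
    if not distances:
        raise ValueError("need at least one residue")
    n = len(distances)
    # Fraction of residues within each cutoff, then the mean of those fractions.
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(fractions)


# Hypothetical 200-residue protein: half the residues are modelled well.
base = [0.5] * 100 + [10.0] * 100
# The same protein with 2 extra residues brought close to the native.
better = [0.5] * 102 + [10.0] * 98

print(round(gdt_ts(better) - gdt_ts(base), 6))  # 1.0
```

In this toy case, moving 2 of 200 residues inside all cutoffs raises the score by one point, which is the scale of difference being compared between servers above.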

Limitations

There are two principal limitations to the methods used by Phyre2 and other related servers. First, if homology cannot be detected between a user-supplied sequence and a sequence of known structure, then modelling will either be impossible or extremely unreliable. This reflects the wider on-going difficulty of the protein-folding problem. There are still no reliable methods to predict a protein structure purely from sequence alone without reference to known structures.

The second limitation, again applicable to all widely used methods, is predicting the structural effects of point mutations. Phyre2 has functionality to predict the phenotypic effect of a point mutation, but is unable to accurately determine, beyond the estimated position of a sidechain, the wider structural impact of a point mutation. This means a user attempting to model several single position variants using Phyre2 will receive essentially identical models with a different sidechain at the position of the variant. No alterations of the backbone of the protein will generally be observed.

It is often the case that a user does not want only a single chain model of their protein but a model of a multimer. This is not yet possible in Phyre2, but work is currently underway to add this functionality by using known multimeric structures as templates for complex building.

Finally it is important to understand the potential limitations of modelling multi-domain proteins using the ‘intensive’ mode of Phyre2 (described in stage 3b of the ‘Modelling a single sequence’ section, below). If homology models of separate domains without any mutual overlap are combined using the ab initio techniques described in stage 3b, the relative orientation of the domains in the resulting multi-domain homology model is very likely to be incorrect. Such cases can be discerned by examining the table discussed in Step 12b. This can also apply to transmembrane proteins where a homologous crystal structure of the globular/hydrophilic domain may be found and then grafted onto a transmembrane domain from another protein. This limitation will not apply if homology can be detected to a structure that spans the entirety of the user sequence. Future versions of Phyre2 will automatically detect these cases and provide a warning to the user.

The Phyre2 Server

The Phyre2 system is a combination of a large number of disparate software components created by our own group and others written in multiple languages. The system runs on a shared Linux farm of approximately 300 CPU cores. The Phyre2 server may be used in several different ways depending on the focus of the user’s research. The most commonly used facility is the prediction of the 3D structure of a single submitted protein sequence. Advanced facilities include a) Backphyre to search a structure against a range of genomes, b) batch submission of a large number of protein sequences for modelling, c) one-to-one threading of a user sequence onto a user structure, d) Phyrealarm for automatic weekly scans for proteins difficult to model and e) Phyre Investigator for in-depth analysis of model quality, function and the effects of mutations. First, modelling of a single sequence will be discussed, followed by brief explanations of tools a) to e). The Procedure will deal mainly with a single query submission to Phyre2. The advanced facilities will not be detailed in the Procedure, with the exception of the use of Phyre Investigator (optional Procedure Steps 35-39), as the results they produce and their interpretation largely follow that described for a single sequence.

Modelling a single sequence

The core method of Phyre2 for generating a 3D model of a protein sequence is composed of 4 underlying technical stages, described below and illustrated in Figure 1. There is also an optional ‘intensive mode’ which attempts to create a complete full-length model of a sequence through a combination of multiple template modelling and simplified ab initio folding simulation. This is described in stage 3b and illustrated in Figure 2. These stages and the corresponding figures refer to the underlying algorithm being used for structure prediction. In contrast, the steps in the Procedure are a guide to user navigation and analysis of the results of this algorithm. Throughout, the term ‘query’ will refer to the user-submitted protein sequence.

Figure 1.


Normal mode Phyre2 pipeline showing algorithmic stages. Stage numbers are shown in circles and elements within a stage are surrounded by a dashed box. Stage 1 (gathering homologous sequences): A query sequence is scanned against the specially curated nr20 (no sequences with >20% mutual sequence identity) protein sequence database with HHblits. The resulting multiple sequence alignment is used to predict secondary structure with PSI-pred and both the alignment and secondary structure prediction combined into a query hidden Markov model. Stage 2 (Fold library scanning): This is scanned against a database of HMMs of proteins of known structure. The top scoring alignments from this search are used to construct crude backbone-only models. Stage 3 (loop modelling): Insertions and deletions in these models are corrected by loop modelling. Stage 4 (Side chain placement): Finally amino acid side chains are added to generate the final Phyre2 model.

Figure 2.


Intensive mode Phyre2 pipeline.

Once a set of models has been generated as shown in stages 1-3 of Figure 1, models are chosen by heuristics to maximise both confidence and coverage of the query sequence. Pairwise Cα-Cα distances are extracted from these models and treated as linear inelastic springs in Poing. Regions not covered by templates are handled by the ab initio components of the Poing algorithm: preferential bombardment of hydrophobic residues by notional solvent molecules to encourage burial, predicted secondary structure springs to maintain alpha helix or beta strand conformations, and prevention of steric clash. The new protein is ‘synthesised’ from a virtual ribosome in the context of these forces and the final Cα structure is used to construct a full backbone using Pulchra followed by sidechain addition using R3.

Stage 1: Gathering homologous sequences

The first stage is to determine an evolutionary profile for the query that captures the residue preferences at each position along its length. In order to construct an evolutionary profile, one needs to gather a large number of diverse yet true homologues. Diversity is key in order to create a statistically representative distribution of amino acid preferences at each position in the protein, whilst avoiding false positives is vital so as not to pollute this distribution. Diversity may be achieved by searching the ever-growing protein sequence databases. In the past the sequence database was mined using programs such as PSI-Blast14 that iteratively evolve a profile through multiple BLAST14 scans of the sequence database – so called sequence-profile matching. However, the most powerful approach to specific and sensitive collection of homologues is through profile-profile matching. Unfortunately, applying such a technique to large sequence databases is computationally prohibitive. Fortunately recent powerful heuristics have been developed that overcome much of this computational burden. These heuristics effectively reduce profile-profile matching to sequence-profile matching by discretizing the vectors of 20 amino acid probabilities at each position into a restricted alphabet. This method, known as HHblits15, demonstrates 50-100% increase in sensitivity (% of all true homologues detected) over PSI-Blast and more accurate alignments without sacrificing computational speed. HHblits is used to scan the query against a sequence database where no pair of sequences shares more than 20% identity, resulting in a sequence profile. In addition, the secondary structure of the query is predicted using PSI-Pred16. PSI-Pred is one of the most widely-used methods for secondary structure prediction and uses neural networks trained on protein sequence profiles to predict the presence of alpha helices, beta strands and coils with an average 3-state accuracy of 75-80%.
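To make the idea of an evolutionary profile concrete, the following minimal sketch computes per-position amino acid probabilities from a small set of aligned sequences. It is an illustration only: it uses uniform pseudocounts and no sequence weighting, whereas HHblits and PSI-Blast use considerably more sophisticated schemes. The function name `column_profile` and the toy alignment are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, pseudocount=1.0):
    """Return a list with one dict per alignment column: amino acid -> probability.

    Gaps and non-standard characters are ignored when counting. A uniform
    pseudocount keeps unobserved residues at a small non-zero probability.
    """
    if not alignment:
        raise ValueError("empty alignment")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment of three homologues ('-' marks a gap).
profile = column_profile(["ACD", "ACE", "A-D"])
most_likely = [max(col, key=col.get) for col in profile]
print(most_likely)  # ['A', 'C', 'D']
```

Each column is the vector of 20 probabilities referred to above; a diverse set of true homologues makes these vectors statistically representative, while false positives distort them.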

Stage 2: Fold library scanning

The profile calculated in stage 1, together with the predicted secondary structure is converted to a hidden Markov model (HMM). This HMM is then scanned using HMM-HMM matching against a pre-compiled database of HMMs of known structure known as the fold library. The fold library is composed of a representative set of experimentally determined protein structures whose profiles have been calculated using the same approach as stage 1. The alignment algorithm used in Phyre2 is HHsearch10, which is one of the leading homology detection methods as demonstrated in international blind trials of protein structure prediction (CASP)7. The end result of the fold library scan is a list of query-template alignments ranked by their posterior probabilities as produced by HHsearch. These alignments are used to generate crude backbone models often containing insertions and deletions (indels) and without sidechains.

Stage 3: Loop modelling

Indels are handled using a library of fragments of known protein structures from lengths of 2-15 amino acids. This library is constructed by a complete fragmentation of the structure database followed by structural clustering. A given gap in a model is characterised by its sequence, geometry of flanking regions and distance between end points. For insertions, a sequence-profile search is performed using the missing inserted sequence to detect fragments with similar sequence composition and end-point distances, creating a short list (typically 100 members) of potentially useful fragments. Similarly for deletions, the sequence encompassing a window either side of the deletion is used. These fragments are fitted to the crude model using cyclic coordinate descent (CCD)17, a robotic arm algorithm that attempts to fit the ends of the fragment to the crude model whilst minimising changes in the dihedral angles of the fragment. Finally fitted fragments are ranked using a combination of empirical energy terms and the top scoring model selected. In some cases it is not possible to fit a fragment to an indel and such gaps remain in the backbone. This is often an indication of errors in the original alignment. See Steps 25-28 on alignment interpretation in the Procedure.
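The principle of cyclic coordinate descent can be shown with a deliberately simplified planar example. The sketch below closes a 2D chain of rigid links onto a target point by rotating about one joint at a time; it is not the Phyre2 implementation, which operates on backbone dihedral angles in 3D and additionally tries to minimise changes to those angles. The function name `ccd_close`, the iteration limit and the tolerance are assumptions for illustration.

```python
import math


def ccd_close(points, target, max_iter=200, tol=1e-3):
    """Rotate a planar chain about its joints until its last point nears target.

    points: list of (x, y) tuples; points[0] is a fixed anchor.
    Returns (new_points, final_distance_to_target).
    """
    pts = [tuple(p) for p in points]
    for _ in range(max_iter):
        if math.dist(pts[-1], target) < tol:
            break
        # Visit pivots from the joint nearest the free end back to the anchor.
        for i in range(len(pts) - 2, -1, -1):
            px, py = pts[i]
            ex, ey = pts[-1]
            # Angle that swings the pivot->end direction onto pivot->target.
            delta = (math.atan2(target[1] - py, target[0] - px)
                     - math.atan2(ey - py, ex - px))
            c, s = math.cos(delta), math.sin(delta)
            # Rigidly rotate every point downstream of the pivot.
            for j in range(i + 1, len(pts)):
                dx, dy = pts[j][0] - px, pts[j][1] - py
                pts[j] = (px + c * dx - s * dy, py + s * dx + c * dy)
    return pts, math.dist(pts[-1], target)


# A straight three-link 'fragment' whose free end must reach a new anchor point.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
closed, error = ccd_close(chain, target=(1.5, 1.5))
print(len(closed), error < 0.05)
```

Because each rotation is rigid, link lengths are preserved throughout, which mirrors the way a fitted fragment keeps its internal geometry while its ends are brought onto the crude model.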

Stage 3b: Multiple template modelling with Poing

This stage is only performed if ‘intensive mode’ is used. The aim of this stage is to create a complete model of the query protein even when different regions/domains are modelled by separate templates, or when there are no templates at all (ab initio modelling). To do this we use Poing18, a simplified protein-folding simulator. First heuristics are used to select a subset of models produced in stages 2 and 3 that increase coverage of the query protein whilst maintaining high confidence as reported by HHsearch. These input models provide distance constraints between different pairs of residues. These restraints are modelled as linear inelastic springs. In Poing, restraints are added as the query protein is slowly synthesised from a virtual ribosome. Residues not constrained by input models are modelled ab initio by Poing’s solvent bombardment model, predicted secondary structure springs and penalisation of steric clashes. 5-100 models are synthesised in this way depending on protein size (fewer for large proteins due to computational demand) and clustered to choose a final representative model. As this model is composed only of alpha carbons, its backbone is reconstructed using Pulchra19.
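As a rough illustration of how template-derived distances can act as springs, the sketch below extracts pairwise Cα-Cα distances from a template model and scores a trial conformation with a simple harmonic penalty. This is a minimal sketch under stated assumptions: the distance cutoff, the spring constant `k`, the harmonic energy form and the function names are all illustrative choices, and Poing's actual force model is not reproduced here.

```python
import math
from itertools import combinations


def extract_restraints(ca_coords, max_dist=15.0):
    """Return {(i, j): distance} for C-alpha pairs within max_dist angstroms."""
    restraints = {}
    for (i, a), (j, b) in combinations(enumerate(ca_coords), 2):
        d = math.dist(a, b)
        if d <= max_dist:
            restraints[(i, j)] = d
    return restraints


def spring_energy(ca_coords, restraints, k=1.0):
    """Sum of harmonic penalties 0.5 * k * (d - d0)**2 over restrained pairs."""
    return sum(
        0.5 * k * (math.dist(ca_coords[i], ca_coords[j]) - d0) ** 2
        for (i, j), d0 in restraints.items()
    )


# Hypothetical three-residue template with C-alpha atoms 3.8 angstroms apart.
template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
restraints = extract_restraints(template)

# A trial conformation in which the last residue has drifted by 1 angstrom.
trial = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (8.6, 0.0, 0.0)]
print(spring_energy(template, restraints), round(spring_energy(trial, restraints), 6))
```

A conformation that reproduces the template distances incurs no penalty, while deviations are penalised in proportion to the square of the displacement; residues with no restraints are left to the ab initio terms.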

Stage 4: Sidechain placement

Sidechain fitting to the backbone generated in stage 3 or 3b is performed using the R3 protocol20 that involves a fast graph-based technique and sidechain rotamer library to place sidechains in their most probable rotamer whilst avoiding steric clashes. This technique is approximately 80% accurate if the backbone is correct.

Advanced facilities in Phyre2

Backphyre – detecting a structure across genomes

Frequently, users have a protein structure of interest and want to determine if homologous structures exist in other genomes. For this purpose, HMM libraries must be generated for genomes of interest. Phyre2 currently contains such libraries for 30 genomes and this number is constantly growing based on user-requests.

In Backphyre a user uploads a structure in PDB format. The sequence of this structure is extracted and processed as in stage 1 above, whilst also including the known secondary structure within the HMM. This HMM is scanned as in stage 2 against one or more user-selected genomes from the 30 available. The final output screen is similarly laid out to that described in Steps 23-30 of the Procedure.

Batch analysis

It is possible to run the single sequence protocol on a large number of sequences uploaded by a user. By default users are permitted to upload 100 sequences at a time, but this limit can be changed on request. Batch jobs are processed on spare computing power as it becomes available and so are often somewhat slower than individually submitted jobs. Phyre2 processes on average 16,000 individual submissions per month and 7,000 batch sequences a month. Batch jobs can be monitored during processing. Summary pages for batch jobs are made available, as are facilities to download detailed or summary results for the entire batch. Each individual sequence has associated results pages whose interpretation is the same as in Steps 23-34 of the Procedure.
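Users with more sequences than the default upload limit may find it convenient to split a FASTA file into batches before submission. The sketch below is one way to do this locally; it assumes standard FASTA text and the default limit of 100 sequences stated above, and the function names `parse_fasta` and `batches` are hypothetical helpers rather than part of the Phyre2 server.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def batches(records, size=100):
    """Split records into consecutive lists of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be positive")
    return [records[i:i + size] for i in range(0, len(records), size)]


example = ">seq1\nMKV\nLLT\n>seq2\nGGS\n"
print(parse_fasta(example))  # [('seq1', 'MKVLLT'), ('seq2', 'GGS')]
```

Each resulting batch can then be written back to its own file and uploaded separately.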

One-to-one threading

Although the detection of remote homologous structures by Phyre2 has high specificity and sensitivity, it is sometimes the case that a user wishes to use a particular structural template on which to model their protein. Perhaps a user has a newly solved structure that is not yet published or a user has some biological information that indicates their chosen template would produce a more accurate model than the one(s) automatically chosen by Phyre2. One-to-one threading allows a user to upload both a sequence to be modelled and the template on which to model it. HMMs of both the sequence and uploaded structure are calculated as in stage 1 above and aligned using the HHsearch algorithm. Unlike ordinary Phyre2 results, one-to-one threading does not of course produce a list of hits. Instead the user is presented with an alignment view and a model of the protein together with information on the confidence of the match. See Steps 25-28 of the Procedure for how to interpret this.

Phyrealarm

Based on statistics from 30,000 Phyre2 submissions over two months, on average more than 50% of all proteins submitted have had over 75% of their length modelled with >90% confidence. Of the remaining 50% of submissions, 25% have had less than a quarter of their sequence modelled and 25% have between a quarter and three quarters confidently modelled. A failure to detect confident structural matches for significant regions of a query is typically caused by one of three factors: 1. A lack of a sufficient number and diversity of homologous sequences to the query to create a useful profile/HMM, 2. The evolutionary distance between the query and any known structure being too great to detect with the HMM-HMM matching method, or 3. The query adopting a novel fold not present in the current structural database.

Fortunately, both the protein sequence database and structure database are growing every week, meaning a currently undetectable homology may likely become detectable in the near future. For this reason the Phyrealarm service was developed. If a user query cannot be modelled confidently, the protein may be added to an automated scan of new structures and new sequence databases each week. Every week approximately 100 new structures are added to the Phyre2 fold library and every few months the clustered sequence database used for profile construction is updated. If a confident match is detected to this newly released data, the user query is automatically processed through the full Phyre2 modelling pipeline and the user sent the results and links by email.

Phyre Investigator

Given a confident model produced by Phyre2, it is often desirable to perform more in-depth analyses of model quality, potential function and the effects of mutations (see optional Procedure Steps 35-39). For these purposes Phyre Investigator was developed. Any model produced by Phyre2 can be submitted to Phyre Investigator with one click from the results page for a range of further analyses.

The Phyre Investigator interface (Figure 3) has been designed to make this large amount of data easy to navigate and interpret simultaneously in a sequence and structural context. The screen is divided into 3 main sections from top to bottom: 1. the information box, 2. the structure view and analyses buttons and 3. the sequence view.

Figure 3.


Phyre Investigator user interface.

a. information box, b. structure and analyses view, c. sequence view. The structure and analyses view shows an interactive 3D JSmol viewer, buttons to toggle different analyses and two bar graphs, in this case for residue A34, showing the sequence profile preferences and predicted likelihood of a phenotypic effect from each of the 20 possible mutations at this position.

The structure view and analyses section is itself divided into 3 regions, from left to right: the JSmol interactive viewer (http://www.jmol.org/), the Analyses buttons, and two graphs showing sequence profile and mutational predictions. Clicking on an analysis button will display, in the information box, a brief summary of whichever analysis is currently active and links to downloadable raw data. It will also colour the structure in the JSmol view in accordance with the analysis chosen and display a colour-coded key to the left of the structure. Finally it will add an extra row to the sequence view, illustrating the same information but in a sequence context.

The sequence view displays the predicted secondary structure of your sequence, the confidence in this prediction, the secondary structure of the model, the amino acid sequence and which regions have been modelled. In addition, clicking on an analysis button will reveal an extra row showing the corresponding information from the analysis in a sequence context.

Hovering over a sequence position will highlight that position with vertical bars to either side of the residue in question. It will also highlight that residue in the JSmol 3D viewer as a red halo around the atoms of that residue. Finally, it will show the appropriate sequence profile and mutation graphs for that position described later. Clicking on a residue will cause that residue to be spacefilled in the JSmol viewer. You may select multiple residues by repeated clicking. At any time you can clear your selection by clicking the "Clear selection" button above the sequence view. You may also take a snapshot of the structure at any time using the "Take JMol snaphot" button.

The sequence profile graph represents residue preferences in your protein at a particular sequence position. Residue preference for each amino acid type is displayed as a vertical coloured bar, with tall, red bars being more favourable than shorter blue bars. These values are taken from the position-specific scoring matrix (PSSM) calculated by a PSI-Blast scan of the query against a large sequence database (Uniref50).

The mutational analysis graph represents the predicted effect of mutations at a particular position in your sequence. Tall, red bars above a residue type indicate that a mutation to this residue is strongly predicted to have a phenotypic effect. These predictions are made using the SuSPect25 method. SuSPect uses sequence conservation, solvent accessibility and protein-protein interaction network information to predict how likely a variant is to lead to disease in humans, demonstrating superior benchmark performance over other available methods, such as PolyPhen-229, SIFT30 and Condel31. The SuSPect method is available as a standalone web server (http://www.sbg.bio.ic.ac.uk/suspect/), with more options for uploading sets of sequences, viewing pre-calculated results for the entire human proteome and more.

When using SuSPect through Phyre Investigator, it is important that your sequence is the wild-type. Submitting a mutant protein to Phyre2 and then Investigator will lead to misleading predictions from SuSPect. If the protein is human, pre-calculated scores will be returned. For non-human proteins, scores will be calculated using a version of SuSPect incorporating protein structure but no network information. By incorporating network information, SuSPect performs best on mutations in human proteins.

Phyre2 job manager

If a user registers with the Phyre2 server (which is free), they gain access to various other tools including the Phyre2 job manager. This is accessed via the ‘View past jobs’ link at the top of the home page when logged in. Clicking the job manager takes the user to a page allowing them to see a summary and links to all of their past and running jobs. Every completed job has a link to results which, when hovered over with the mouse, displays an image of the top scoring model with summary confidence and coverage information. Completed jobs remain by default on the server for 30 days. The job manager permits a user to select past jobs and renew them to prevent expiry, or delete them. This is also possible within the results page as described in Step 8 of the Procedure.

Materials

Equipment

Procedure

Sequence submission


Obtaining results

Main Results Page Navigation

Figure 4.


Example Phyre2 summary results page.

On the left is an image of a large all-beta structure. Clicking on the image will download a PDB formatted file containing this structure. On the right are various data regarding the model including: PDB code of the template used, information about the protein template extracted from the PDB file, confidence in the model and coverage of the query sequence (100% and 28% respectively). In this case there is additional text informing the user that although only 28% of the query could be modelled by a single template, other high confidence templates were also detected that could increase this coverage to 55% by using Phyre’s intensive mode. Finally there is a link to launch the JSmol 3D viewer in the browser and a link to a FAQ describing popular external molecular viewing software.

Figure 5.


Samples of the three main sections of a typical Phyre2 results page. The sections are labelled a-c and discussed below.

a. Example secondary structure and disorder prediction. The query sequence is coloured as described in Step 17. Question marks indicate predicted disordered regions. Each type of prediction is associated with a rainbow colour-coded confidence (red: highest confidence, blue: lowest confidence).

b. Example of the domain analysis results section described in Steps 20-22. The width of the box indicates the length of the query sequence. In this example confident (red) matches have been found at the N-terminus (rank 6) and the C-terminus (ranks 1-5) but no confident matches have been found to the intervening segment.

c. Example of the detailed table of results described in Steps 23-24, and 29-32. In this example, the rank 1 and 2 matches have confidence of 100% and sequence identities of 23 and 24% respectively.

Figure 6.


Example alignment between user query sequence and known structure, as described in Steps 25-28. Sequence colouring is as described in Step 17. Identical residues between query and template have a grey background. Secondary structures (predicted and known) are displayed; in this case alpha helices. Colour-coded per-residue confidence in both the alignment (from HHsearch) and in secondary structure prediction is displayed. The level of residue conservation for both the query and template sequences is also shown where thicker horizontal bars indicate greater degrees of conservation.

Summary Section

Critical Step

Sometimes the confidence of the top model is too low to be useful. We do not recommend considering models with a confidence value <90%. Similarly, the top model may not cover a significant fraction of the user protein. Sometimes this is because there are multiple domains in the protein covered by separate templates. See Steps 20-22 and the associated troubleshooting to see if ‘intensive’ mode may be valuable here. Phyre2 attempts to automatically determine whether other templates covering additional portions of your protein are available and, if so, will provide a message to that effect and a recommendation to try ‘intensive’ mode.

However, if the confidence is poor (<90%) and there are no extra templates, the user is alerted to use Phyrealarm. In this case, clicking on the Phyrealarm icon or link will take the user to a pre-filled web form where one click will add their sequence to the Phyrealarm system. As discussed in the Introduction, once added to Phyrealarm, the sequence will automatically be scanned against new structures as they become available in the fold library each week. If a confident hit is detected, a full Phyre2 modelling job is automatically run and the user is emailed the results. If this happens, the user should resume the protocol at Step 11.
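The decision logic described in this Critical Step can be summarised in a short sketch. This is only an illustration and not part of the Phyre2 server: the `TopModel` record, the `next_action` function and the `min_coverage` default of 50% are hypothetical (the text does not quantify a "significant fraction" of the protein); only the 90% confidence cutoff is taken from the protocol.

```python
from dataclasses import dataclass

CONFIDENCE_CUTOFF = 90.0  # percent; cutoff recommended in the text


@dataclass
class TopModel:
    """Hypothetical record of the values shown in the summary section."""
    confidence: float      # percent, 0-100
    coverage: float        # percent of query residues covered, 0-100
    extra_templates: bool  # other confident templates cover more of the query


def next_action(model: TopModel, min_coverage: float = 50.0) -> str:
    """Suggest a next step; min_coverage is an assumed, illustrative value."""
    if model.extra_templates:
        # Phyre2 itself reports this case and recommends 'intensive' mode.
        return "try intensive mode"
    if model.confidence < CONFIDENCE_CUTOFF:
        # Poor confidence and no further templates: wait for new structures.
        return "register with Phyrealarm"
    if model.coverage < min_coverage:
        return "use model for the covered region only"
    return "use model"


# The summary example discussed earlier: 100% confidence, 28% coverage,
# with other confident templates available.
print(next_action(TopModel(confidence=100.0, coverage=28.0, extra_templates=True)))
```

In practice the server presents these recommendations directly on the results page, so a sketch like this is mainly useful when triaging many results by hand.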

Sequence analysis

Secondary structure and disorder prediction

Domain analysis

? Troubleshooting

Detailed template information

? Troubleshooting

Phyre Investigator

Superposition of Models

Binding site prediction

Transmembrane helix prediction

Troubleshooting

Step 4: How to handle long sequences?

There is currently a sequence length limit of 1200 amino acids. Work is underway to extend this limit. If the query exceeds this limit, we advise submitting it to the Conserved Domain Database (ref. 29) to determine likely domain boundaries. The query may then be cut at these boundaries so that each fragment is below the limit, and the fragments resubmitted to Phyre2. Future versions of Phyre2 will automate this step and display optional cut points to the user.
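Once likely domain boundaries are known, the cutting step can be done by hand or with a few lines of code. The following is a minimal sketch under stated assumptions: `split_at_boundaries` is a hypothetical helper, the boundaries are supplied by the user as 0-based cut positions (for example, read off a Conserved Domain Database result), and the greedy strategy of making as few cuts as possible is our choice for illustration, not something Phyre2 prescribes.

```python
from typing import List

MAX_LEN = 1200  # current Phyre2 sequence length limit stated in the text


def split_at_boundaries(sequence: str, boundaries: List[int],
                        max_len: int = MAX_LEN) -> List[str]:
    """Cut `sequence` only at the given 0-based positions so that every
    fragment is at most `max_len` residues, using as few cuts as possible."""
    n = len(sequence)
    # Candidate cut points inside the sequence, plus the sequence end.
    edges = sorted({b for b in boundaries if 0 < b < n}) + [n]
    fragments = []
    start = 0  # start of the fragment currently being grown
    prev = 0   # last edge that still fits in the current fragment
    for edge in edges:
        if edge - start > max_len:
            if prev == start:
                raise ValueError("no domain boundary available within the limit")
            fragments.append(sequence[start:prev])
            start = prev
            if edge - start > max_len:
                raise ValueError("a single domain exceeds the length limit")
        prev = edge
    fragments.append(sequence[start:n])
    return fragments


# Toy 2500-residue query with assumed boundaries at 400, 1100 and 1900.
parts = split_at_boundaries("A" * 2500, [400, 1100, 1900])
print([len(p) for p in parts])  # [1100, 800, 600]
```

Each fragment can then be submitted to Phyre2 as a separate job. If a single predicted domain is itself longer than the limit, the sketch raises an error rather than cutting inside the domain.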

Step 4: What if I only have an identifier and no sequence?

If the user has only an identifier or descriptor of the protein of interest, as opposed to the sequence itself, they can click the ‘sequence finder’ on the main submission page. This performs a rapid keyword search of a number of sequence databases to retrieve likely matches to the user query. Matches are returned as a table of sequences, species and UniProt descriptors. One click inserts the chosen sequence into the main form.

Step 22: Should I resubmit my protein in intensive mode?

This step gives you vital information on whether you should consider the ‘intensive’ mode of Phyre2. If you see multiple, high confidence, largely non-overlapping hits, this indicates that your protein contains multiple domains each of which can be modelled confidently. In this case, you should consider trying ‘intensive’ mode as it will attempt to connect these individual domains together using ab initio modelled connecting segments where required.

CAUTION

If you observe long (>100 residue) unmodelled segments, you can try ‘intensive’ mode, but such regions are extremely unlikely to be well modelled due to the limitations of ab initio protein modelling.
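The two checks described in this troubleshooting entry (multiple confident, largely non-overlapping hits, and long unmodelled segments) can be sketched as follows. This is an illustrative sketch only: the `Hit` record and both functions are hypothetical, hit positions are assumed to be 1-based inclusive query coordinates copied from the results table, and the 20% overlap tolerance is an assumed stand-in for "largely non-overlapping", which the text does not quantify.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    start: int         # 1-based, inclusive position on the query
    end: int           # 1-based, inclusive position on the query
    confidence: float  # percent, 0-100


def overlap(a: Hit, b: Hit) -> int:
    """Number of query residues shared by two hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def confident_domains(hits: List[Hit], cutoff: float = 90.0,
                      max_overlap: float = 0.2) -> List[Hit]:
    """Greedily keep confident hits that largely do not overlap each other."""
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: -h.confidence):
        if hit.confidence < cutoff:
            continue
        length = hit.end - hit.start + 1
        if all(overlap(hit, c) <= max_overlap * min(length, c.end - c.start + 1)
               for c in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


def unmodelled_segments(query_len: int, hits: List[Hit]) -> List[Tuple[int, int]]:
    """Return (start, end) stretches of the query not covered by any hit."""
    gaps = []
    pos = 1
    for h in sorted(hits, key=lambda h: h.start):
        if h.start > pos:
            gaps.append((pos, h.start - 1))
        pos = max(pos, h.end + 1)
    if pos <= query_len:
        gaps.append((pos, query_len))
    return gaps


# Made-up hits for a 600-residue query.
hits = [Hit(1, 150, 100.0), Hit(10, 140, 99.5), Hit(400, 600, 98.0), Hit(200, 250, 40.0)]
domains = confident_domains(hits)
gaps = unmodelled_segments(600, domains)
long_gaps = [g for g in gaps if g[1] - g[0] + 1 > 100]
print(len(domains) >= 2)  # True: 'intensive' mode is worth considering
print(long_gaps)          # [(151, 399)]: unlikely to be well modelled ab initio
```

Here two confident domains are found, so ‘intensive’ mode is worth trying, but the 249-residue linker between them falls under the caution above.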

Step 29: What if a template is found but not modelled?

If a structural template of interest is present lower down the list and thus has not been automatically modelled, you can generate a model from this template using the One-to-one threading method. Clicking on the identifier in the ‘Template’ column of the detailed results table takes you to the Phyre2 fold library, where you can download the PDB coordinates of the template. You may then upload your sequence and this template to the One-to-one threading method. Simply return to the Phyre2 home page, switch to ‘expert mode’ (in the top left of the home page once logged in to Phyre2) and navigate to One-to-one threading.

Timing

In ‘Normal’ mode, job completion typically takes between 20 minutes and several hours, depending on sequence length, number of detected homologous sequences and server load. ‘Intensive’ mode jobs can take considerably longer (2-6 hours) if a significant portion of the sequence cannot be modelled by known homologous structures or if the protein is large (>700 amino acids).

Anticipated Results

Once the job is completed the user is notified by an e-mail containing information on the confidence of the modelling, a link to a web page of results and an attachment containing the top scoring model in PDB format (see Step 7). The web page of results contains:

Editorial Summary.

Phyre2 is a web-based tool for predicting and analysing protein structure and function. Phyre2 uses advanced remote homology detection methods to build 3D models, predict ligand binding sites, and analyse amino-acid variants in a protein sequence.

Acknowledgments

This work was supported by the Biotechnology and Biological Sciences Research Council (BBSRC) (LA Kelley: BB/J019240/1, M Wass: BB/F020481/1), the Medical Research Council (MRC) (C Yates: MRC Standard Research Student (DTA) G1000390-1/1) and the Engineering and Physical Sciences Research Council (EPSRC) (S Mezulis: EPSRC Standard Research Student (DTG) EP/K502856/1).

Footnotes

Author contribution

L.A.K. designed the Phyre2 system and wrote the paper. M.J.E.S. supervised the project. S.M. developed the multiple template modelling protocol. C.M.Y. developed the SuSPect method. M.N.W. developed the 3D-Ligandsite web resource.

Competing financial interests

MJES is a Director and shareholder in Equinox Pharma Ltd which uses bioinformatics and chemoinformatics in drug discovery research and services.

References