H++ 3.0: automating pK prediction and the preparation of biomolecular structures for atomistic molecular modeling and simulations (original) (raw)

Abstract

The accuracy of atomistic biomolecular modeling and simulation studies depend on the accuracy of the input structures. Preparing these structures for an atomistic modeling task, such as molecular dynamics (MD) simulation, can involve the use of a variety of different tools for: correcting errors, adding missing atoms, filling valences with hydrogens, predicting p_K_ values for titratable amino acids, assigning predefined partial charges and radii to all atoms, and generating force field parameter/topology files for MD. Identifying, installing and effectively using the appropriate tools for each of these tasks can be difficult for novice and time-consuming for experienced users. H++ (http://biophysics.cs.vt.edu/) is a free open-source web server that automates the above key steps in the preparation of biomolecular structures for molecular modeling and simulations. H++ also performs extensive error and consistency checking, providing error/warning messages together with the suggested corrections. In addition to numerous minor improvements, the latest version of H++ includes several new capabilities and options: fix erroneous (flipped) side chain conformations for HIS, GLN and ASN, include a ligand in the input structure, process nucleic acid structures and generate a solvent box with specified number of common ions for explicit solvent MD.

INTRODUCTION

Molecular modeling and simulations are routinely used to study the structure, function and activity of biomolecules (1–6). The accuracy of such modeling and simulations depends critically on the accuracy of the input structure (7). Approximately 88% of the structures in the protein data bank (PDB) (8) are determined by X-ray crystallography, which can not, in general, resolve positions of most hydrogen atoms. While the positions for most of the missing hydrogen atoms can be easily estimated using simple chemical rules, predicting the protonation states of titratable groups such as the side chains of ASP, GLU, ARG, LYS, TYR, HIS or CYS is non-trivial. Their protonation states depend on complex electrostatic interactions between these groups themselves and with the surrounding environment.

Many theoretical methods have been developed for predicting protonation states of titratable groups (9–23). While these methods vary in the details of the underlying physical models, they share one common feature—computational and algorithmic complexity. Hence, the computational process usually involves multiple non-trivial steps. There is often an additional complication arising from errors and inconsistencies in the input PDB structures, requiring significant ‘pre-processing’ of structures. As a result, predicting protonation equilibria (p_K_), adding missing protons to PDB structures and generating the necessary input files for molecular dynamics (MD) simulation can involve the use of multiple tools from different packages. The complexity of this computational ‘pipe-line’ makes it difficult for novice and even moderately experienced users to identify, install, learn and effectively use the appropriate tools, increasing the likelihood of errors and creating unnecessary barriers for the adoption of these tools. Even for experts, the careful setup required for MD simulations takes a considerable amount of time.

H++ (http://biophysics.cs.vt.edu/H++) is a freely available open-source web server for automating the addition of missing protons to the PDB structures based on predicted p_K_ s of titratable groups, and generating basic parameter/topology and coordinate files for MD simulations (currently only in AMBER format). The calculations are based on the standard continuum solvent methodology (11), within the framework of the Poisson Boltzmann (PB) model (21,24). The server is intended for both experts and non-experts. Typical default values are used for processing parameters and options, although a limited subset of parameters and options can be modified on the input screen.

The original implementation of H++ was described in Gordon et al. (25). The present work summarizes the key features of H++ and describes in additional detail significant new enhancements since the original implementation, see the ‘Materials and Methods’ section below. The accuracy of the current version of H++ is presented in the ‘Results’ section.

MATERIALS AND METHODS

The input structure to be processed by H++ can be either in the PDB format, or alternatively in the PQR format, which is similar to the PDB format with atomic charges and radii included. Figure 1 shows the five main processing steps in H++, which are described below.

Figure 1.

Figure 1.

Flowchart of computations performed by the H++ server described here.

Step 1: pre-process the input structure

Input PDB files can contain numerous errors and format inconsistencies, such as missing heavy atoms, suboptimal residue conformations and non-standard atom names. H++ attempts to make automatic, albeit conservative corrections for many of these problems when possible. Otherwise the errors are identified for possible manual correction. For example, the N and O atoms in the amide groups of ASN and GLN, and the N and C atoms in the imidazole ring of HIS cannot be easily distinguished from electron density maps. Thus, the assignment of these atoms in the PDB file may be (optionally) ‘flipped’ using the reduce algorithm that is based on an analysis of van der Waals contacts and H-bonding (26). An example of errors that are identified for manual correction are missing residues in the middle of protein chains. Input PDB files may also contain HETATM entries for solvent and ligand molecules; H++ removes these entries. Solvent molecules are removed by default because they are treated implicitly by the continuum solvent methodology used. Non-protein ligands are removed by default, but an option is now available to manually include many ligands and specific buried water molecules for processing, as described on the H++ site. Inclusion of buried waters has been shown to improve the accuracy of computed p_K_ of nearby groups (27). For peptide, protein, DNA and RNA ligands, current AMBER force field parameters are used to add H atoms and assign atomic partial charges. For other organic ligands, H++ uses OpenBabel (28) to add H atoms, and atomic partial charges are assigned using the antechamber module from AmberTools (29) and the generalized AMBER force field (GAFF) parameter set. PDB structures may also contain residues with partial occupancy representing multiple possible conformations. Without manual intervention from the user, H++ selects the ‘A’ conformation and ignores all others.

An input PQR file, on the other hand, is assumed to have already been validated (e.g. in order to compute the atomic charges and radii included in the PQR file). Therefore, most error and consistency checks are bypassed for input PQR files. In addition, H++ requires AMBER compatible atom and residue names in the input PQR file.

Step 2: add missing atoms

In this step, missing heavy atoms and protons are added (assuming standard protonation states of all titratable groups at this stage), and atomic partial charges and radii [Bondi (30)] are assigned using the LEAP modules in AmberTools. The position of added atoms, and assignment of atomic partial charges and radii are based on the latest AMBER force field parameters (ff10). This step is bypassed for PQR files, (Figure 1) assuming that missing atoms have been added, and other consistency checks have been performed by the user.

Step 3: calculate interaction energies between titratable groups

The interaction energies between all titratable groups and other “fixed” charges in the molecule are required to calculate the titration curves in the next step. The finite difference Poisson-Boltzmann (FDPB) continuum electrostatics methodology (11,31), implemented in MEAD (32), is used to calculate the energetics of proton transfer. See Gordon et al. (25) for additional details.

A key input to the FDPB computations are the atomic partial charges in the protonated and deprotonated states of titratable groups. The partial charges from the AMBER force field parameters are used for the protonated and deprotonated states of ASP, CYS, GLU, HIS, LYS, and the deprotonated state of ARG and TYR. Since AMBER does not contain parameters for the rarely occurring protonated (neutral) state of ARG and TYR, these partial charges were obtained from Ryde (33) and MEAD, respectively.

H++ uses the single-conformer version of MEAD. However, conformational variability is partially accounted for by the ‘smeared charge’ representation of titratable groups, and the optimization of the positions of added protons. In the case of ASP and GLU, the charge of the additional proton in the protonated state is equally divided between the two O atoms to which it may be attached. In the case of ARG, HIS and LYS, the charge reduction in the deprotonated state is equally divided between the protons that may be removed. While not the most systematic or exhaustive way of incorporating conformational variability, we believe, based on previous experiences (18,31,34,35), that this particular model is a reasonable balance between speed and accuracy, and is therefore a good choice for a web-based calculation.

Step 4: calculate titration curves

The titration curve for a titratable group is the protonation probability as a function of solution pH. The protonation probability can be calculated as the Boltzmann average over the protonation states for the set of interaction energies calculated above (11). However, the exact computation of these statistical averages involves summations over all possible 2_N_ microstates, where N is the number of titratable groups. To make the computation of Boltzmann averages tractable, two different approximations are used. For structures with less than 80 titratable groups, a clustering algorithm is employed to approximate the Boltzmann averages (36,37). For larger structures, the full Monte Carlo method as implemented in MCTI (12), is used. The p_K_ of each group is computed as the mid-point of the corresponding titration curve, that is p_K_ = p_K_1/2. Users are encouraged to check shapes of titration curves displayed by H++, as p_K_1/2 loses its meaning for titration curves that strongly deviate from the standard sigmoidal shape (38).

Step 5: generating outputs

H++ produces several outputs useful for molecular modeling and simulations. These include the estimated p_K_ value for each titratable group, a PDB and the corresponding PQR file in the predicted protonation state, AMBER format topology and force field parameter files for explicit and implicit solvent simulations, energies of protonation microstates (for structures with <25 titratable groups), breakdown of contribution to p_K_ shift of each group and the computed isoelectric point.

For efficiency, the titration curve is only calculated in the experimentally accessible pH range from 0 to 12. In some cases, the pH value at which the calculated protonation probability is 0.5 may be outside this range. For cases where the 0.5 protonation probability occurs outside the above pH range, the p_K_ value is reported as ‘<0’ or ‘>12’.

The input PDB file is updated with titratable groups in their predicted protonation states based on estimated p_K_ values and the user specified pH value—protonated if pH<p_K_, deprotonated otherwise. The PDB file is updated using the LEAP module in AmberTools which also adds atomic radii and partial charges producing a PQR format file. The LEAP module also produces AMBER force field parameter and coordinate files for running molecular dynamics simulations using AMBER or NAMD. These files can optionally include a cubic or octahedral solvent box with the user-specified number and type of (commonly used) ions, for explicit solvent simulations.

RESULTS

The accuracy of H++ was evaluated for a set of 23 high-quality structures (X-ray structures with resolution ≤2.50 A and no missing residues in the middle of the protein sequence), for which experimentally determined p_K_ values are available (39) (Table 1). The set includes 201 titratable groups, of which 66 have large p_K_ shifts (Δp_K_ ≥1.0) and 135 have small p_K_ shifts (Δp_K_ <1.0). For comparison, predictions of a ‘Null’ model are also shown. Additional details for the results shown in Table 1, including PDB codes of the test proteins, can be found at http://biophysics.cs.vt.edu/faq.php.

Table 1.

RMS error in p_K_ values computed by H++ relative to experiment, based on 23 protein structures with a total of 201 titratable groups

Δp_K_ RMS error Prot. accuracy (%) at pH 7 Δp_K_ direction accuracy (%)
H++ Null model H++ Null model H++ Null model
≥1.0 1.44 2.05 98 88 92 50
<1.0 1.39 0.51 97 95 75 50

Most of the test sets used for benchmarking p_K_ prediction methods, including the one described above, lack groups with extreme p_K_ shifts. To address this deficiency, we added to our test set a membrane-embedded protein bacteriorhodopsin (BR), a system with functionally relevant extreme p_K_ shifts (40–42) (Table 2). The trends of the predicted p_K_ shifts are in reasonable agreement with experiment. However, for the deeply buried groups, the use of a protein dielectric constant of 4, lower than the default value ϵ_in_ = 6, results in better agreement with experiment.

Table 2.

Experimental and computed p_K_ values of the membrane-embedded Bacteriorhodopsin

Site Experiment H++
ϵ_in_ = 6 (default) ϵ_in_ = 4
ARG 82 >13.8 >12 >12
ASP 85 2.6 2.3 1.8
ASP 86 >12 7.6 >12
ASP 115 >9.5 8.0 10.4
ASP 212 <2.5 <0 <0
Schiff Base (216) >12 >12 >12

There are many factors that limit the accuracy of p_K_ predicted by H++. Arguably, the more important factors are: shortcomings of the semi-microscopic, 2-dielectric PB implicit solvent model; incomplete accounting of conformational flexibility; and the absence of coupling between protonation and conformational states. Additional factors that affect the accuracy of p_K_ prediction in general, can be found in (20) and (43). The PB implicit solvent model in H++ uses a fixed dielectric value for the solute (protein), even though the appropriate dielectric value can vary considerably depending on the location of the group in question within the protein (16,20,27,44,45). Significant improvement in p_K_ prediction have also been reported when conformational flexibility is taken into account (46,47). Moreover, constant pH molecular dynamics studies have shown a strong correlation between conformation and protonation states (48,49). Inclusion of variable dielectric, conformational flexibility and the coupling between protonation states using currently available methods involve significant additional computational costs, which is counter to our goal of a fast web-based application.

CONCLUSION

The free open-source H++ web server described here automates the process of p_K_ estimation and many of the steps required to prepare biomolecular structures for atomistic simulations. Input PDB structures are protonated according to the computed p_K_s, and force-field parameter/topology and coordinate files compatible with AMBER/NAMD formats are generated. The server is intended for both novice and expert users alike. A distinctive feature of H++ is that it includes extensive input consistency checks, safeguards, and pre- and post-processing warning messages that can help users make informed decisions when using the outputs from the server. The p_K_ calculations in H++ are based on the established PB continuum electrostatics methodology that has been successfully used for this purpose for over two decades. The accuracy and usefulness of H++ has been improved in the latest version which includes several new capabilities: such as fixing erroneous (flipped) residue conformations; including a ligand in the input structure; processing nucleic acid structures; and generating a solvent box with specified number of common ions for explicit solvent MD. In addition, upgrades to the H++ server resulted in a 3× speedup relative to H++ 1.0 (25). We continue to enhance H++, with future plans to automate ligand processing and to add the ability to generate force field parameter/topology files for other popular modeling packages. Since its original release in August 2005, over 38 000 structures have been uploaded to H++ for processing by 2287 registered users and thousands of anonymous users, proving the usefulness of this tool to the molecular modeling and simulation community. The source code for H++ is also available, upon request, for offline batch processing of a large number of structures.

FUNDING

Funding open access charge: National Institutes of Health [GM076121 to A.V.O.].

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank John Gordon for helping to maintain the H++ server during the first 3 years of its operation.

REFERENCES