PubChem chemical structure standardization

Volker D Hähnke et al. J Cheminform. 2018.

Abstract

Background: PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.

Results: The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in Substance to 45,808,881 in Compound, as identified by de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures have an individual standardization time above 0.1 s), total standardization time is dominated by edge cases: 90% of the time to standardize all structures in PubChem Substance is spent on the 2.05% of structures with the highest individual standardization times. Notably, 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI, primarily because of preferences for a different tautomeric form.

Conclusions: Standardization of chemical structures is complicated by the diversity of chemical information and its representation approaches. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid or incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to validate thoroughly, because slight changes often affect many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource (https://pubchem.ncbi.nlm.nih.gov/standardize) and via programmatic interfaces.
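As a rough illustration of the programmatic access mentioned above, the following sketch builds a request URL for a single SMILES string. It assumes a PUG REST route of the form /rest/pug/standardize/smiles/<SMILES>/SDF; the route layout, the base URL constant, and the helper names build_standardize_url and standardize are assumptions for illustration and should be checked against the current PUG REST documentation.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the PUG REST service (not stated in the abstract).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_standardize_url(smiles: str, output: str = "SDF") -> str:
    """Build an (assumed) PUG REST standardize URL for one SMILES string."""
    # Percent-encode every reserved character, since SMILES may contain
    # '#', '/', '\\' and similar characters that are not URL-safe.
    return f"{PUG_REST_BASE}/standardize/smiles/{quote(smiles, safe='')}/{output}"


def standardize(smiles: str, timeout: float = 30.0) -> str:
    """Fetch the standardized structure as text; requires network access."""
    with urlopen(build_standardize_url(smiles), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Only build_standardize_url runs offline; standardize performs the actual request. SMILES that contain path-like characters such as '/' may need to be sent as a POST argument instead of inside the URL path.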

Keywords: Aromaticity; InChI; Kekulization; PubChem; Standardization; Tautomerism.

Figures

Fig. 1

Exemplary drawing conventions for functional groups. a Examples taken from the IUPAC graphical representation standards for chemical structure diagrams concerning ionic bonds and salts and nitrogen compounds [7]. b Examples taken from the FDA substance registration system standard operating procedure substance definition manual. For nitro group and nitrogen oxides, both conventions agree on the preferred representation [8]

Fig. 2

Natural effects contributing to molecular diversity. Implicit hydrogen atoms on carbon atoms are not shown. a Three tautomeric variants of pyrimidin-4-one. The bottom two structures are different Kekulé representations of the same tautomer. b Thioacetic acid as an example for tautomerism (top, in left–right direction), ionization (left and right, in top–bottom direction) and mesomerism (bottom, in left–right direction). Redrawn with permission from Sayle 2010 [32]

Fig. 3

Effects of solvent on tautomeric preference for simple heterocycles. Listed are percentages of three tautomeric variants of the same structure in different solvents [36]

Fig. 4

Tautomers of Guanine. Tautomers were generated by the approach described in the “Methods” section (under Standardize Valence Bond Form) in the indicated order. The dashed frame highlights the variant chosen from the ensemble as the canonical tautomer

Fig. 5

Comparison of five aromaticity perception models. Structure classification as aromatic is indicated by color (blue: aromatic; orange: not aromatic; grey: not available). Aromaticity was perceived in every structure using the function OEAssignAromaticFlags in the OpenEye OEChem C++ toolkit with the aromaticity models OEAroModelMDL (MDL), OEAroModelTripos (Tripos), OEAroModelMMFF (MMFF), OEAroModelDaylight (Daylight) and OEAroModelOpenEye (OpenEye). If at least one atom or bond in a structure was identified as aromatic, the whole structure was classified as aromatic. Atomic element Te is not available in the MMFF and Tripos aromaticity models. Redrawn with permission from the OpenEye Scientific Software Inc. OEChem C++ toolkit documentation [62]

Fig. 6

Substance to Compound relationship for guanine. In total, 153 entries in Substance are standardized to and mapped to the structure for CID 764. a Eight representative SIDs with non-identical structures that get standardized to guanine. No explicit hydrogen atoms were provided for (i), (ii), (iv), (v) and (viii). Hydrogen atoms are depicted as deposited for (iii), (vi) and (vii). b Standardized structure of Guanine (CID 764)

Fig. 7

PubChem structure standardization protocols. For detailed descriptions of each step, see the “Methods” section

Fig. 8

Standardization statistics. The version of PubChem Substance used in this study contains 116,641,122 deposited substances. Almost 90% of those entries contain fully specified structures. The majority of these are organic. The average standardization success rate is 99.64%, with 44.43% of successfully standardized structures being modified in the process

Fig. 9

Standardization failure examples. Each structure is shown as it enters the respective standardization step, including modifications from previous steps. a SID 479450 contains two tetra-valent oxygen atoms (indicated by arrows) and fails the verification of atomic valences. b SID 8021026 contains a penta-valent carbon atom (indicated by arrow) and fails the verification of atomic valences as well. c SID 235635 contains two adjacent nitrogen atoms that both have a positive charge (indicated by arrows). A post-processing evaluation in the determination of a canonical tautomer lets structures fail if they have neighboring charges of the same type
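The post-processing check described for panel c can be sketched in a few lines. This is a minimal illustration on a plain atom/bond representation, not the PubChem implementation; the function name has_adjacent_like_charges and the dictionary-plus-bond-list data layout are assumptions made for the example.

```python
def has_adjacent_like_charges(charges, bonds):
    """Return True if any bond joins two atoms whose formal charges share a sign.

    charges: mapping of atom index -> formal charge (int)
    bonds:   iterable of (atom_index, atom_index) pairs
    """
    for a, b in bonds:
        # The product is positive only when both charges are non-zero
        # and have the same sign (+/+ or -/-).
        if charges[a] * charges[b] > 0:
            return True
    return False


# Hypothetical fragment with two bonded, positively charged nitrogen atoms.
rejected = has_adjacent_like_charges({0: 1, 1: 1, 2: 0}, [(0, 1), (1, 2)])
# Hypothetical zwitterionic pair with opposite charges on bonded atoms.
accepted = has_adjacent_like_charges({0: 1, 1: -1}, [(0, 1)])
```

Under this sketch, the first fragment would be flagged and the second would not.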

Fig. 10

Functional group standardization statistics. A total of 522,757 substances are modified in the Verify Functional Groups step, which normalizes non-standard functional group configurations to preferred ones based on a set of standardization rules, each of which is designated with an integer called a “transformation index” for convenience. The total number of substances modified in this step is smaller than the sum of functional group transformations because multiple changes can be performed in the same structure. Nine standardization rules set ionic bonds (8, 9, 10, 22–27), one sets complex bonds (15, the processing of transition metals), and two set dative bonds (11, 28). Rule 13 is not used, indicating that carbon monoxide is only encountered in the correct configuration

Fig. 11

Binned tautomer counts per substance. Histogram is non-cumulative. The first data series (blue) shows how many substances have the respective range of tautomers generated during valence bond canonicalization. The second data series (red) indicates the total number of tautomers generated for those substances in the tautomer count range

Fig. 12

Example for a substance with the highest number of generated tautomers. Shown is SID 30283854 as it enters the step Standardize Valence Bond Form. Dashed lines indicate complex bonds as set in Verify Functional Groups. Zr and Cl ions skip valence bond canonicalization. Each one of the non-monoatomic connected components reaches the limit of 250,000 generated tautomers. In total, 1 million tautomers are generated during the standardization of this substance, with none of them being considered preferred over the original one by the standardization protocol

Fig. 13

Top-five substances with highest standardization times. Dotted lines indicate complex bonds that were set in conjunction with charges on connected atoms. Shown are fully standardized structures. a SID 143137591, 9648 s; b SID 142254533, 7555 s; c SID 143474510, 3094 s; d SID 143474488, 2231 s; e SID 138154965, 1187 s

Fig. 14

Standardization time statistics. Time was measured as wall time on a mixed-use, heterogeneous compute cluster. a Per-substance standardization time as a non-cumulative histogram. For each bin, the lower (inclusive) and upper (exclusive) boundary is provided. Where the unit changes from s to min, a value of 0.17 min equals 10 s. b Cumulative standardization time per substance (sorted by ascending standardization time). 10% of total standardization time is spent on 97.95% of all substances. Within those 97.95%, the average standardization time is 0.0019 s (± 0.0012 s). c Average contribution to per-substance standardization time per standardization step. Standardization steps are numbered by Roman numerals: verify element (I), verify hydrogen (II), verify functional groups (III), verify valence (IV), standardize annotations (V), standardize valence bond form (VI), standardize aromaticity (VII), standardize stereochemistry (VIII), standardize explicit hydrogens (IX). For each substance, the time necessary for standardization is dominated by step (VI), which performs valence bond canonicalization (44 ± 12%)
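The cumulative statistic in panel b (a small share of total time covering most substances) can be reproduced from a list of per-substance times. The sketch below is an illustration on toy data, not the analysis code used in the study; the function name fastest_fraction_within_share is hypothetical, and times are assumed to be non-negative.

```python
from itertools import accumulate


def fastest_fraction_within_share(times, share):
    """Fraction of substances, taken in ascending order of time, whose
    cumulative time stays within `share` of the total time."""
    ordered = sorted(times)
    budget = share * sum(ordered)
    # Running totals are non-decreasing for non-negative times, so counting
    # the totals within the budget gives the length of the fastest prefix.
    within = sum(1 for running in accumulate(ordered) if running <= budget)
    return within / len(ordered)


# Toy data: nine fast substances (1 s each) and one slow outlier (91 s).
toy_times = [1] * 9 + [91]
fraction = fastest_fraction_within_share(toy_times, 0.10)  # 0.9
```

In this toy case, 10% of the total time covers 90% of the substances, mirroring the kind of skew reported in the caption.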

Fig. 15

Structure duplicate frequencies in PubChem. Structure equivalency determined by de-aromatized canonical isomeric SMILES before standardization (a), after PubChem standardization (b) and by standard InChIs (c). The x-axis indicates the number of duplicates per structure, the y-axis the frequency of this number of duplicates. Plots are double-logarithmic for clarity to emphasize the region of low duplicate counts where the largest differences occur

Fig. 16

Differences in structure duplicate frequencies. a Frequency differences between before and after standardization, structure equivalency determined by de-aromatized canonical isomeric SMILES; b frequency differences between PubChem standardization and InChI normalization. The x-axis indicates the number of duplicates per structure, the y-axis the frequency of this number of duplicates. Plots are double-logarithmic for clarity to emphasize the region of low duplicate counts where the largest differences occur

Fig. 17

Comparison between PubChem standardization and InChI Normalization

Fig. 18

Example structure rejected by InChI normalization and PubChem standardization. SID 7822769 contains various invalid valences and therefore is rejected by both approaches

Fig. 19

Differences between PubChem-standardized and InChI-derived structures—protonation. Sulfuric acid is the most commonly deposited structure in PubChem. The structures shown in a with their SIDs are normalized to the protonated form of sulfuric acid b by InChI normalization but not by PubChem standardization, which does not alter them at all

Fig. 20

Differences between PubChem-standardized and InChI-derived structures—valence models. Examples illustrate modifications from InChI-derived structures applied in Verify Hydrogen (a) and Standardize Explicit Hydrogen (b) during PubChem standardization. a SID 1300. State in (i) as deposited. InChI-derived structure results in alternate Kekulé structure (ii). Subsequent PubChem standardization adds positive charge to tetra-valent nitrogen in step Verify Hydrogen (iii). The final result of PubChem standardization is shown in (iv). b SID 576083. State in (i) as deposited. Standard InChI-derived structure disconnects nitrogen and palladium as well as palladium and oxygen and places charges as appropriate (ii). During subsequent PubChem standardization, two hydrogen atoms are added to each oxygen atom (iii). The result of original PubChem standardization is shown in (iv). Even though (iii) does not possess the complex bonds between nitrogen, palladium and oxygen, the SMILES strings generated for the structures in (iii) and (iv) are identical

Fig. 21

Differences between PubChem-standardized and InChI-derived structures—functional groups. The y-axis indicates the number of affected substances (not the absolute number of modifications in a substance) in the Verify Functional Groups step during the PubChem standardization of InChI-derived structures. Transformation indices represent respective standardization rules used to normalize non-standard functional group configurations (see the “Methods” section). All non-visible bars indicate zero affected substances

Fig. 22

Analysis of 37,882,174 substances first modified in the Standardize Valence Bond Form step during PubChem standardization of InChI-derived structures. Modifications made in this step can be tracked by comparing the de-aromatized canonical isomeric SMILES of InChI-derived structures and PubChem-standardized structures. RDB stands for the count of ring double bonds, and RDBPubChem and RDBInChI are the RDB counts for PubChem-standardized and InChI-derived structures, respectively, emphasizing the tautomer scoring approach in PubChem for exocyclic double bonds

Fig. 23

Differences between PubChem-standardized and InChI-derived structures—tautomeric preference in functional groups. Examples for tautomeric preferences of characteristic functional groups. a Amide (SID 75764); b Thioamide (SID 108898); c Amidine (SID 132494); d Functional groups and their preferences can occur simultaneously (SID 5856091). In all cases: (i) InChI-derived structure; (ii) structure after subsequent PubChem standardization

Fig. 24

Five types of tautomeric state differences between PubChem-standardized and InChI-derived structures. Differences in tautomeric state between PubChem-standardized and InChI-derived structures are identified using SMARTS. Crossed double bonds are used to indicate stereogenic double bonds with undefined cis/trans configuration

Fig. 25

Differences between PubChem-standardized and InChI-derived structures—cyclic double bonds. Examples for preference of cyclic double bonds for PubChem standardization. a SID 1462; b SID 70471; c SID 78008. In all cases: (i) InChI-derived structure; (ii) structure after subsequent PubChem standardization

Fig. 26

Differences between PubChem-standardized and InChI-derived structures—tautomeric preference. Examples for tautomeric preferences not rooted in specific functional group preferences or the size of conjugated systems. a SID 1403; b SID 4970. c SID 468090. In all cases: (i) InChI-derived structure; (ii) structure after subsequent PubChem standardization

Fig. 27

Examples for diverging annotation of stereochemistry in PubChem-standardized and InChI-derived structures. a SID 12127575, the phosphorus atom is not considered to be chiral by PubChem standardization. b Analogous case in SID 2438124. c SID 127817816, PubChem standardization recognizes that the stereogenic carbon atoms do not have neighbors of four different symmetry classes and removes the annotated stereo configurations. d SID 158375861, the fully configured double bond in (i) is not considered to be a stereocenter by PubChem standardization due to the identical symmetry classes of adjacent atoms in the ring system. In all cases: (i) InChI-derived structure; (ii) Structure after subsequent PubChem standardization
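The symmetry-class criterion used in panels c and d can be sketched as follows. This is a simplified illustration, assuming symmetry classes are already available as integers and that an implicit hydrogen is passed in as a neighbor with its own class; the function names are hypothetical and the sketch does not cover every case handled by PubChem standardization.

```python
def is_tetrahedral_stereocenter(neighbor_classes):
    """True only for four neighbors with pairwise distinct symmetry classes."""
    return len(neighbor_classes) == 4 and len(set(neighbor_classes)) == 4


def is_stereogenic_double_bond(left_classes, right_classes):
    """True only if each end of the double bond carries two substituents
    that belong to different symmetry classes."""
    return all(
        len(end) == 2 and end[0] != end[1]
        for end in (left_classes, right_classes)
    )


# Two neighbors share class 2, so an annotated configuration would be dropped.
keeps_center = is_tetrahedral_stereocenter([1, 2, 2, 3])
# One end has two substituents of the same class, so cis/trans is undefined.
keeps_bond = is_stereogenic_double_bond((5, 5), (6, 7))
```

Both toy inputs evaluate to False, which corresponds to the situation in which an annotated stereo configuration would be removed.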

Fig. 28

Element classifications as used in PubChem standardization. a Organic elements; b metals; c transition metals; d semiconductors. Note that B, Si, As, Te, and At are not included in any element class because of the diversity of bonding possibilities of these elements

Fig. 29

Non-standard bond types in PubChem. a Ionic bond between sodium and sulfur in sodium thiopental (CID 23665410); b Complex bond between nitrogen atoms and Fe(II) represented as Fe2+ in heme b (CID 4973); c Dative bond between boron and oxygen in boron trifluoride diethyl etherate (CID 517922). Contrary to other annotations of this bond type, in PubChem, the participating atoms do not get assigned charges

Fig. 30

Functional group standardization I. If not mentioned otherwise, hydrogen atoms are as depicted and wildcard asterisks (*), representing any connected atoms, can be hydrogen atoms. Connected carbon atoms are shown without labels and should not be confused with ‘any’ connections. Parenthesis indicates terminal atom. Numbers above arrows are transformation indices for respective standardization rules (see the text for the description of transformation indices). a Oxygen and sulfur terminal; no implicit hydrogens on central atom. b Both oxygen atoms terminal, no implicit hydrogens on manipulated oxygen or center atom. c Center atom has one more explicit connection that is not further specified (with respect to bond order and adjacent atom). Oxygen is terminal, but carbon does not have to be terminal. Center atom and charged partner have no implicit hydrogen atoms. d Oxygen is terminal. Hydrogen atoms on uncharged carbon atoms are not checked. Center atom and charged partner cannot have implicit hydrogen atoms. e Ionic bond is set if situation is unambiguous, with A1 and A2 being the only matches of their kind. f No charges are assigned if A2 is di-valent oxygen or tri-valent nitrogen (after modification to ionic bond). Charge modification is incremental. Charge limit is +1 on A1 and −1 on A2. g No charges are assigned if A2 is di-valent oxygen or tri-valent nitrogen (after modification to ionic bond). Charge modification is incremental. Charge limit is +2 on A1 and −2 on A2. h Bond is annotated as dative bond. i M is a metal as defined in Fig. 28

Fig. 31

Functional group standardization II. Shown are cases that will not be modified (a–c) and pre-processing steps carried out before the covalent single bond is replaced by a complex bond. Z indicates the transition metals and semiconductors (see Fig. 28). Z′ as used in b and c is a subset of the elements in Z. Terminal atoms are specified as such by visually restraining them using a parenthesis ‘]’. The transformation index for transition metal processing (d–f) is 15 (see the text for the description of transformation indices). a Bonds that are not modified (true for all elements in Z): double bond to oxygen, single bond to oxygen that is single-bonded to a metal M (see Fig. 28b), single bond to a halogen X, single bond to hydrogen. b Bonds that are not modified for elements in Z′: single bond to tetra-valent carbon. c Bonds that are not modified for elements in Z′: single bond to di-valent oxygen, single bond to di-valent sulfur, single bond to tri-valent nitrogen. d A positive charge is moved from tetra-valent nitrogen to the transition metal. e Special case of carbon and nitrogen in carbon-only and nitrogen-containing five-membered aromatic rings, respectively. The same transformation applies to seven-membered aromatic carbon-only rings. f Special case of carbon and nitrogen double-bonded to oxygen

Fig. 32

Functional group standardization III. If not mentioned otherwise, hydrogen atoms are as depicted, and wildcard asterisks (*), representing any connected atoms, can be hydrogen atoms. Connected carbon atoms are shown without labels and should not be confused with ‘any’ connections. Parenthesis indicates terminal atom. Numbers above arrows are transformation indices for respective standardization rules (see the text for the description of transformation indices). a Penta-valent nitrogen connected to terminal nitrogen (triple bond) and carbon, nitrogen or oxygen (double bond). b Penta-valent nitrogen connected to terminal oxygen or sulfur (double bond) and non-terminal carbon (triple bond). c Nitro group and nitrate (penta-valent representation). d Single-bonded atoms adjacent to nitrogen are (not necessarily terminal) carbon. e Covalent single bond between penta-valent nitrogen and oxygen or sulfur replaced by ionic bond. f, g Covalent single bond between penta-valent nitrogen and halogen replaced by ionic bond. h, i Covalent single bond between tetra-valent nitrogen and halogen replaced by ionic bond. j Double bond between tetra-valent nitrogen and boron replaced by dative bond. k, l Nitrogen without implicit hydrogens. m Nitro group (tetra-valent representation)

Fig. 33

Functional group standardization IV. Numbers above arrows are transformation indices for respective standardization rules (see the text for the description of transformation indices). a Different representations of cyclopentadiene used in metallocene structures are unified. b Case for benzene analogous to a. c Standardization of thiophene derivatives. Transformation is only successful if implicit hydrogen counts are sufficient

Fig. 34

Valence list statistics. This heatmap illustrates the number of configurations per element in the valence database that are valid. For every element, the valence database contains configurations describing valid combinations of the charge, number of sigma bonds, number of pi bonds and number of implicit hydrogen atoms. All combinations are supplied as supporting information in Additional file 1
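A valence database of this kind can be pictured as a per-element set of allowed (charge, sigma bonds, pi bonds, implicit hydrogens) tuples. The sketch below uses a handful of illustrative entries only; they are not taken from Additional file 1, and the convention that sigma bonds count explicit bonds while implicit hydrogens are counted separately is an assumption made for this example.

```python
from typing import NamedTuple


class Configuration(NamedTuple):
    charge: int
    sigma_bonds: int  # explicit sigma bonds (assumed convention)
    pi_bonds: int
    implicit_h: int


# Illustrative entries only; NOT the contents of the actual valence database.
VALID_CONFIGURATIONS = {
    "C": {
        Configuration(0, 0, 0, 4),  # methane-like carbon
        Configuration(0, 1, 0, 3),  # methyl carbon
        Configuration(0, 4, 0, 0),  # quaternary carbon
    },
    "N": {
        Configuration(0, 3, 0, 0),  # tertiary amine nitrogen
        Configuration(1, 4, 0, 0),  # quaternary ammonium nitrogen
    },
}


def is_valid_configuration(element: str, config: Configuration) -> bool:
    """Look up whether the configuration is listed for the element."""
    return config in VALID_CONFIGURATIONS.get(element, set())


# A carbon with five explicit sigma bonds is not listed and would be rejected.
pentavalent_carbon_ok = is_valid_configuration("C", Configuration(0, 5, 0, 0))
```

A set lookup keeps the check constant-time per atom, which matters little for one structure but adds up over millions of substances.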

Fig. 35

Standardization of Kekulé structure and aromaticity annotation. SID 7 is used as example, without annotation of stereochemistry or isotopic information for clarity. a Deposited structure of SID 7. b Intermediate representation of the same structure with bond order information in the aromatic system deleted. All bonds are represented as aromatic bonds. This structure is submitted to the OEKekulize function of the OpenEye OEChem C++ toolkit that generates a Kekulé form with single and double bonds instead of aromatic bonds. c Result of the aromaticity standardization. The detection of aromaticity, deletion of bond orders and Kekulization resulted in a different Kekulé structure

Fig. 36

Stereoconfiguration examples. A Variants of identical tetrahedral configuration of the same chiral center. a Representations with explicit hydrogen atom using: (i) two bonds in, one behind and one in front of the drawing plane (favored representation); (ii) one bond in, two in front of and one behind the drawing plane; (iii) one bond in, one bond in front of and two bonds behind the drawing plane; (iv and v) two bonds in front of and two bonds behind the drawing plane. b Equivalent representations of the same configuration with implicit hydrogen atom if possible. B Representation of a double bond with unknown cis/trans configuration as represented in PubChem (i) and corresponding accepted IUPAC recommendations (ii–iii)

References

    1. Brown FK. Chapter 35. Chemoinformatics: what is it and how does it impact drug discovery. In: James AB, editor. Annual reports in medicinal chemistry. New York: Academic; 1998. pp. 375–384.
    2. Hann M, Green R. Chemoinformatics – a new name for an old problem? Curr Opin Chem Biol. 1999;3(4):379–383. doi: 10.1016/S1367-5931(99)80057-X.
    3. Gasteiger J. Chemoinformatics: a new field with a long tradition. Anal Bioanal Chem. 2006;384(1):57–64. doi: 10.1007/s00216-005-0065-y.
    4. Engel T. Basic overview of chemoinformatics. J Chem Inf Model. 2006;46(6):2267–2277. doi: 10.1021/ci600234z.
    5. Varnek A, Baskin II. Chemoinformatics as a theoretical chemistry discipline. Mol Inform. 2011;30(1):20–32. doi: 10.1002/minf.201000100.
