A new, automated retrosynthetic search engine: ARChem
1. ARChem Route Designer: Antony Williams, on behalf of A. Peter Johnson, University of Leeds / SimBioSys Inc.; Howard Y. Ando (Pfizer Inc.), Zsolt Zsoldos, Aniko Simon, James Law, Darryl Reid, Yang Liu, Sing Yoong Khew, Sarah Major
2. Computer Aided Organic Synthesis Design (CAOSD): assists chemists in finding synthetic routes to target compounds. The first CAOS system (LHASA) was introduced by E.J. Corey nearly 40 years ago. Since then…
3. CAOSD systems: LHASA (Logic and Heuristics Applied to Synthetic Analysis; E.J. Corey et al.); SynChem (Gelernter et al.); IGOR (Ugi et al.); EROS and WODCA (Gasteiger et al.); SynGen (Hendrickson et al.); Arthur (commercial: Synthematix); Others…
4. Typical Retrosynthetic Analysis: Retrosynthetic analysis works backward from the target and generates increasingly simple precursors.
5. Retrosynthetic Analysis versus Reaction Databases: Reaction databases are popular aids in reaction planning. Databases are large and highly curated, and good tools exist for searching and data mining. Compare with “general prediction technology”: LogP, NMR spectra, etc…
6. CAOSD Systems: There is little routine use of retrosynthetic analysis tools. Why? Chemists ARE the knowledge base? Reaction database data mining suffices? Who trusts computers anyway? While reaction databases are very valuable, couple “predictions” with reference data.
7. Knowledge base creation: Retrosynthetic analysis is driven by rules describing the scope, limitations and structure changes associated with a reaction. Rules have to be manually encoded. Only experienced synthetic chemists have the knowledge to create good rules.
8. Goals of Route Designer: perform rule-based retrosynthetic analyses of target molecules back to readily available materials; provide fully automated generation of retrosynthetic reaction rules by analysis of a reaction database (avoiding time-consuming manual creation); provide the user with literature examples of the transformations suggested by the retrosynthetic analysis; provide a set of alternative routes to a given target.
9. System Design Overview: Inputs: user input (molecule); starting materials (Aldrich, Acros, Lancaster); reaction rules produced by automatic rule extraction from a reaction DB (MOS, Beilstein, CASREACT etc.). Process: Route Designer search. Output: reaction pathways.
10. Starting Materials: Automatic selection of starting materials from commercially available compounds is important for retrosynthetic analysis. With known starting materials as a basis, analysis is directed at portions of the target molecule that cannot be made from available starting materials. Sources: Aldrich, Acros, Lancaster, others.
11. Previous work: Rule extraction from reaction databases. 1) Satoh et al.: A Novel Approach to Retrosynthetic Analysis Using Knowledge Bases Derived from Reaction Databases, J. Chem. Inf. Comput. Sci. 1999, 39, 316-325. 2) Gelernter et al.: Building and Refining a Knowledge Base for Synthetic Organic Chemistry via the Methodology of Inductive and Deductive Machine Learning, J. Chem. Inf. Comput. Sci. 1990, 30, 492-504. 3) Wang et al.: Construction of a generic reaction knowledge base by reaction data mining, Journal of Molecular Graphics and Modelling 2001, 19, 427–433.
12. Route Designer Reaction Rules: From reaction DB (RDF format) to reaction rules. 1) The extraction process converts many reactions into a few rules. 2) The combinatorial explosion of the retrosynthetic search process is controlled. Steps: find the core of the reaction; group identical reaction cores; cluster reactions by chemical equivalence; generic leaving groups reduce the number of rules. Methods in Organic Synthesis (MOS): ~42k reactions. Large reaction DBs OK (millions of reactions supported). 4k rules from 47k reactions.
13. Identifying Reaction Cores: The core: atoms that undergo “changes” during a reaction. Atom mappings identify atoms attached to bonds changed, made or broken in the reaction. Extracted core: [figure]
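The core definition above can be sketched in a few lines. This is a minimal illustration, not Route Designer's implementation: it assumes each side of the reaction is already available as a table of atom-mapped bonds, and the names `bond`, `BondTable` and `reaction_core` are hypothetical.

```python
from typing import Dict, FrozenSet, Set

Bond = FrozenSet[int]          # unordered pair of atom-map numbers
BondTable = Dict[Bond, int]    # bond -> bond order


def bond(a: int, b: int) -> Bond:
    """Build an order-independent key for the bond between mapped atoms a and b."""
    return frozenset((a, b))


def reaction_core(reactants: BondTable, products: BondTable) -> Set[int]:
    """Return map numbers of atoms attached to bonds made, broken or changed."""
    core: Set[int] = set()
    for b in set(reactants) | set(products):
        # A missing bond counts as order 0, so made and broken bonds are
        # caught by the same comparison as changed bond orders.
        if reactants.get(b, 0) != products.get(b, 0):
            core |= b
    return core


# Toy esterification (assumed mapping): acid C1(=O2)-O3 plus alcohol O4-C5
# gives ester C1(=O2)-O4-C5. Bond 1-3 is broken, bond 1-4 is made.
reactants = {bond(1, 2): 2, bond(1, 3): 1, bond(4, 5): 1}
products = {bond(1, 2): 2, bond(1, 4): 1, bond(4, 5): 1}
core = reaction_core(reactants, products)
print(sorted(core))
```

The carbonyl oxygen (atom 2) and the alkyl carbon (atom 5) are left out because their bonds are unchanged; deciding which such neighbours still matter is the extension step on the next slide.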
14. Extension to “non-reacting” atoms: The initial core is extended to include structural features essential for the reaction (a difficult process). Empirical rules attempt to capture these features.
16. Generic leaving groups: Reaction rule to generic rule. Nucleofuge (NF): a leaving group which carries away the bonding electron pair.
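A rough sketch of how a generic nucleofuge label can collapse several specific rules into one. The rule representation (a tuple of labels) and the `NUCLEOFUGES` set are assumptions made for illustration; the slides do not say which groups the system treats as generic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed, illustrative set of nucleofuges; not the system's actual list.
NUCLEOFUGES = {"Cl", "Br", "I", "OTs", "OMs"}

Rule = Tuple[str, ...]  # toy rule: tuple of atom/group labels in the core


def generalize(rule: Rule) -> Rule:
    """Replace every specific leaving group with the generic label NF."""
    return tuple("NF" if label in NUCLEOFUGES else label for label in rule)


def merge_rules(rules: List[Rule]) -> Dict[Rule, List[Rule]]:
    """Group specific rules under the generic rule they reduce to."""
    merged: Dict[Rule, List[Rule]] = defaultdict(list)
    for rule in rules:
        merged[generalize(rule)].append(rule)
    return dict(merged)


specific = [("C", "Cl", "N"), ("C", "Br", "N"), ("C", "OTs", "N"), ("C", "OH", "O")]
generic = merge_rules(specific)
for generic_rule, members in generic.items():
    print(generic_rule, "covers", len(members), "specific rule(s)")
```

Four specific rules become two here: the three substitution rules share one NF rule, while the hydroxyl rule is untouched because OH is not in the assumed nucleofuge set.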
17. Clustering Cores: Established approaches (Morgan numbers) are used to identify the reaction core and the entire extended core. Reactions are clustered by exact matching of the extended cores, and different extended cores may be combined. Rules specifying bond making and breaking operations are constructed.
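The following sketch shows the general idea of Morgan numbers (extended connectivity) and of clustering by exact key match. It is a simplification under stated assumptions: the key built by the hypothetical `core_key` is a graph invariant rather than a full canonical form, bond orders are ignored, and the toy cores are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Adjacency = Dict[int, List[int]]  # atom id -> neighbouring atom ids


def morgan_numbers(adj: Adjacency) -> Dict[int, int]:
    """Extended connectivity: start from atom degrees, repeatedly sum the
    neighbours' values, and stop when the number of distinct values no
    longer increases."""
    values = {atom: len(nbrs) for atom, nbrs in adj.items()}
    classes = len(set(values.values()))
    while True:
        new = {atom: sum(values[n] for n in adj[atom]) for atom in adj}
        new_classes = len(set(new.values()))
        if new_classes <= classes:
            return values
        values, classes = new, new_classes


def core_key(elements: Dict[int, str], adj: Adjacency) -> Tuple[Tuple[str, int], ...]:
    """Numbering-independent key: sorted (element, Morgan number) pairs."""
    values = morgan_numbers(adj)
    return tuple(sorted((elements[a], values[a]) for a in adj))


def cluster(cores) -> Dict[tuple, List[str]]:
    """Group core names whose keys match exactly."""
    clusters: Dict[tuple, List[str]] = defaultdict(list)
    for name, elements, adj in cores:
        clusters[core_key(elements, adj)].append(name)
    return dict(clusters)


# Two O-C-O cores with different atom numbering, and one C-C-N core.
cores = [
    ("core_a", {1: "C", 2: "O", 3: "O"}, {1: [2, 3], 2: [1], 3: [1]}),
    ("core_b", {10: "O", 11: "C", 12: "O"}, {10: [11], 11: [10, 12], 12: [11]}),
    ("core_c", {1: "C", 2: "C", 3: "N"}, {1: [2], 2: [1, 3], 3: [2]}),
]
clusters = cluster(cores)
print(list(clusters.values()))
```

A production system would need a true canonical labelling that also encodes bond orders and the bond changes themselves; the invariant used here can, in principle, give the same key to non-identical cores.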
18. Rule Generation Summary: Reaction DB in RDF file format. Esterification examples clustered: Esterification Reaction Rule: ...-> ... Other examples clustered: Some other rule.
19. Rule Generation from MOS DB: The Methods in Organic Synthesis database contains ca. 42k reactions. Rule extraction performed on this database gave ~3800 rules.
21. FG Interchange and FG Addition: Non-disconnective transformations such as FGI (functional group interchange) and FGA (functional group addition) must either lead to an available starting material or change functionality so that a disconnective transform can be triggered. Uncontrolled use of FGI/FGA would give a combinatorial explosion.
25. Exhaustive Search: Control the combinatorial explosion!!! Fundamental transformations (esterification, amide formation …) are applied at the first stage to break up the target. User-selectable search constraints: depth limit for search; minimum number of examples for a rule to be used; strategic disconnections / preserved bonds. FGI/FGA (e.g. CH2OH ==> CO2H) is restricted to cases triggering subsequent disconnective transforms or finding starting materials. Unstable functionality generated (some organometallics or acyl chlorides) must be removed at the next step.
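Three of these constraints (depth limit, minimum example count, and the FGI restriction) can be sketched as a small depth-limited search. Everything below is a toy under assumptions: molecules are plain strings, each hypothetical `Rule` is a lookup table from product to precursors, and the search returns only the first route found rather than a set of alternatives.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

Step = Tuple[str, str, Tuple[str, ...]]  # (rule name, target, precursors)


@dataclass
class Rule:
    name: str
    kind: str                           # "disconnect" or "fgi"
    examples: int                       # database examples behind the rule
    table: Dict[str, Tuple[str, ...]]   # toy lookup: product -> precursors


def search(target: str, rules: List[Rule], stock: Set[str], depth_limit: int,
           min_examples: int, after_fgi: bool = False) -> Optional[List[Step]]:
    """Return the first route that reaches stock compounds, or None."""
    if target in stock:
        return []                       # available starting material: done
    if depth_limit == 0:
        return None                     # depth limit reached
    for rule in rules:
        if rule.examples < min_examples:
            continue                    # too few literature examples
        if after_fgi and rule.kind == "fgi":
            continue                    # an FGI must enable a disconnection
        precursors = rule.table.get(target)
        if precursors is None:
            continue                    # rule does not match this target
        route: List[Step] = [(rule.name, target, precursors)]
        for p in precursors:
            sub = search(p, rules, stock, depth_limit - 1, min_examples,
                         after_fgi=(rule.kind == "fgi"))
            if sub is None:
                break                   # this precursor is a dead end
            route += sub
        else:
            return route                # every precursor was resolved
    return None


rules = [
    Rule("amide formation", "disconnect", 120, {"amide": ("acid", "amine")}),
    Rule("oxidation", "fgi", 40, {"acid": ("alcohol",)}),
]
stock = {"amine", "alcohol"}
route = search("amide", rules, stock, depth_limit=3, min_examples=5)
print(route)
```

Raising `min_examples` to 50 or lowering `depth_limit` to 1 makes the same call return `None`, which shows how each constraint prunes the search tree.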
26. Preserve and Target Bonds: Preserve selected bonds and target other bonds during analysis.
27. Ordering of Results: Prioritization of solutions: disconnective transforms before FGI/FGA; minimize wastage (atom-efficient reactions); starting material coverage; prefer thoroughly explored chemistry (based on example count); the more bonds broken in the retrosynthetic transformation, the better.
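One simple way to combine such criteria is a lexicographic sort key. The slides do not state how the criteria are actually weighted, so the strict ordering below, the field names and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Route:
    name: str
    fgi_steps: int          # number of non-disconnective (FGI/FGA) steps
    atom_efficiency: float  # 0..1, higher means less wastage
    sm_coverage: float      # 0..1, fraction of target covered by starting materials
    example_count: int      # literature examples supporting the rules used
    bonds_broken: int       # bonds broken by the retrosynthetic transforms


def priority(route: Route) -> Tuple[int, float, float, int, int]:
    """Smaller tuples sort first; 'higher is better' fields are negated."""
    return (route.fgi_steps, -route.atom_efficiency, -route.sm_coverage,
            -route.example_count, -route.bonds_broken)


routes: List[Route] = [
    Route("A", fgi_steps=0, atom_efficiency=0.9, sm_coverage=1.0, example_count=50, bonds_broken=2),
    Route("B", fgi_steps=1, atom_efficiency=0.95, sm_coverage=1.0, example_count=300, bonds_broken=3),
    Route("C", fgi_steps=0, atom_efficiency=0.9, sm_coverage=1.0, example_count=200, bonds_broken=2),
]
ranked = sorted(routes, key=priority)
print([r.name for r in ranked])
```

Route B ranks last despite its higher atom efficiency because, under this assumed ordering, an FGI step outweighs every later criterion; a weighted sum would be the obvious alternative if softer trade-offs are wanted.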
28. Real World Examples: Tested on hundreds of examples including drugs, natural products, publications. An example: Zatosetron, a potent, selective and long-acting 5HT receptor antagonist from Lilly used in the treatment of nausea and emesis associated with oncolytic drugs. Analysis time: < 5 mins on a standard PC.
31. Zatosetron: The route shown is virtually identical to the published route for Zatosetron (Robertson et al., J. Med. Chem., 1992, Vol. 35, pp 310-319). Minor differences include: using 3-bromo-2-methylpropene rather than the chloro version; the aminotropane was not found in our starting material database, but the tropinone precursor was.
32. How Fast? How Complex? Retrosynthetic analyses take from minutes to hours, depending on complexity and constraints. The system is based on construction of skeletal connections, not on stereochemistry. It does not take into account conditions (temperature, pressure, etc.). It estimates yields based on database reactions, but they are ESTIMATES!
33. How Fast? How Complex? Very intuitive and fast to initiate a request. Can be expanded easily with other catalogs of starting materials. “Training” via cluster analysis is not difficult, but is not an everyday task either. Can be tested using an online platform. Proven application at a number of companies already.
34. Under Development: Indicate preferred starting materials to bias the search. Improve clustering to fully capture chemical constraints, including better regioselectivity and stereoselectivity; the target is a much smaller rule set. Deal with interfering functional groups. Order search results to reflect chemists’ preferences ...
35. Interfering functionality: Compatible functionality is detected through comparison with reaction example databases. Possible interfering functionality is inferred. Search result rank is marginally weighted against interfering functional groups ...
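A possible reading of this inference, sketched under assumptions: functional groups seen as bystanders in a rule's database examples are treated as compatible, any other group on the target is flagged as possibly interfering, and each flag applies a small multiplicative penalty. The function names, the group labels and the 5% penalty are hypothetical.

```python
from typing import Iterable, Set


def compatible_groups(examples: Iterable[Set[str]]) -> Set[str]:
    """Union of bystander functional groups observed in a rule's examples."""
    seen: Set[str] = set()
    for groups in examples:
        seen |= groups
    return seen


def possible_interference(target_groups: Set[str],
                          examples: Iterable[Set[str]]) -> Set[str]:
    """Groups on the target that no example shows surviving the reaction."""
    return target_groups - compatible_groups(examples)


def adjusted_score(score: float, interfering: Set[str],
                   penalty: float = 0.05) -> float:
    """Marginally down-weight a result per possibly interfering group."""
    return score * (1.0 - penalty) ** len(interfering)


# Bystander groups recorded for three hypothetical database examples of one rule.
examples = [{"ester", "ether"}, {"ether", "nitrile"}, set()]
target_groups = {"ether", "aldehyde"}
interfering = possible_interference(target_groups, examples)
print(interfering, adjusted_score(1.0, interfering))
```

Absence from the examples only suggests interference; it does not prove it, which fits the slide's choice to weight the rank marginally rather than discard the route.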
36. Rule set optimizations: Promotion of heteroaromatic rules using a lower example threshold than for other rules. 300k rules generated from the Beilstein database gave >50k rules with heteroaromatic relevance. Initial results show dramatic performance improvements. Future extension to other rule categories is under investigation ...
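Category-specific thresholds of this kind can be expressed as a simple filter. The threshold values (10 and 3) and the `ExtractedRule` fields below are illustrative assumptions; the slides do not give the actual cut-offs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractedRule:
    name: str
    examples: int          # number of database reactions supporting the rule
    heteroaromatic: bool   # whether the rule has heteroaromatic relevance


def select_rules(rules: List[ExtractedRule], default_min: int = 10,
                 hetero_min: int = 3) -> List[ExtractedRule]:
    """Keep rules meeting their category's minimum example count."""
    kept = []
    for rule in rules:
        threshold = hetero_min if rule.heteroaromatic else default_min
        if rule.examples >= threshold:
            kept.append(rule)
    return kept


candidates = [
    ExtractedRule("esterification", 500, False),
    ExtractedRule("rare alkylation", 4, False),
    ExtractedRule("pyridine coupling", 4, True),
    ExtractedRule("obscure ring closure", 1, True),
]
active = select_rules(candidates)
print([r.name for r in active])
```

With the same example count of 4, the heteroaromatic rule is promoted while the general rule is dropped, which is the effect the slide describes.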
37. Route Designer summary: “Predictions” and reference data are proven approaches; extend them to retrosynthetic analysis. Route Designer already provides good routes for a variety of targets AND they are “predictions”. Any updates in reaction databases can be easily incorporated into the rule base (“user training”). The starting materials database can be enhanced and extended easily: there are tens of catalog vendors which can be added.
38. Acknowledgements: Pfizer: Robert Wade, Robert Dugger, James Gage, Peter Wuts, David Entwistle, Inaki Morano, Richard Nugent, Bryan Li, Julian Smith. Accelrys: Rob Brown, Eric Jamois. Symyx-MDL: Maurizio Bronzetti, Jochen Tannemann.