Improve synthon substructure search performance ~20% by d-b-w · Pull Request #9305 · rdkit/rdkit
sortAndUniquifyToTry previously built a parallel vector of (index, string) pairs, sorted by string, erased duplicates, then rebuilt the original vector — O(N log N) with one heap allocation per candidate product.
Replace with an erase-remove over a boost::unordered_flat_set keyed on buildProductHash (boost::hash_combine over synthon IDs + reaction ID). Dedup is now O(N) average with no string allocations on the hot path.
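For readers unfamiliar with the pattern, here is a minimal sketch of hash-keyed erase-remove dedup. It is not the RDKit code: `Candidate`, `combine`, `productHash`, and `dedupCandidates` are hypothetical names, the mixer is a hand-rolled stand-in in the spirit of boost::hash_combine, and std::unordered_set is used instead of boost::unordered_flat_set so the snippet builds without Boost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one candidate product: the reaction it belongs
// to plus the synthon chosen at each position.
struct Candidate {
  std::size_t reactionId;
  std::vector<std::size_t> synthonIds;
};

// Hand-rolled mixer in the spirit of boost::hash_combine (an assumption for
// illustration, not the Boost implementation).
inline void combine(std::size_t &seed, std::size_t value) {
  seed ^= value + 0x9e3779b9u + (seed << 6) + (seed >> 2);
}

// Hypothetical analogue of buildProductHash: integer IDs only, so no string
// is built or allocated per candidate.
inline std::size_t productHash(const Candidate &c) {
  std::size_t seed = 0;
  combine(seed, c.reactionId);
  for (std::size_t id : c.synthonIds) {
    combine(seed, id);
  }
  return seed;
}

// Erase-remove dedup: one pass, one set insert per candidate. With the usual
// front-to-back remove_if implementations the first occurrence is the one
// kept, and survivors stay in their original relative order.
inline void dedupCandidates(std::vector<Candidate> &toTry) {
  std::unordered_set<std::size_t> seen;
  seen.reserve(toTry.size());
  toTry.erase(std::remove_if(toTry.begin(), toTry.end(),
                             [&seen](const Candidate &c) {
                               // insert().second is false for a repeated hash
                               return !seen.insert(productHash(c)).second;
                             }),
              toTry.end());
}
```

Calling `dedupCandidates` on a vector holding the same (reaction, synthons) combination several times leaves a single copy of each. Note that this sketch stores only the hash, so two distinct candidates whose hashes collide would be treated as duplicates; whether the actual change guards against that is not stated in this description.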
Also switch SearchResults::d_molNames from std::unordered_set<std::string> to boost::unordered_flat_set<std::string> for the same open-addressing cache locality benefit during mergeResults.
Perf (42-rxn / 140B-product Freedom space, maxHits=3000, hitStart=1000, 9 queries; vanilla.log → 2unordered_flat_set.log):
  Benzene:       6.92s → 5.64s (−19%)
  Tolueneish:    6.19s → 5.07s (−18%)
  Acetaminophen: 4.50s → 3.63s (−19%)
  Allopurinol:   4.41s → 3.94s (−11%)
  Theophylline:  4.39s → 3.90s (−11%)
  Nicotine:      4.87s → 3.97s (−18%)
  Ciprofloxacin: 6.82s → 6.09s (−11%)
  Aspirin:       4.51s → 3.42s (−24%)
  Metoprolol:    5.11s → 4.07s (−20%)
  Total:         48.40s → 40.33s (−17%)
Hit counts and MaxNumResults unchanged across all queries.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
greglandrum pushed a commit that referenced this pull request