RLHF in Drug Discovery Models: Architecture & QA Explained

[Revised February 27, 2026]

Executive Summary

Reinforcement Learning from Human Feedback (RLHF) is an emerging paradigm originally developed to align large language models with human preferences ([1]). In recent years, researchers and industry have begun adapting RLHF concepts to molecular design and drug discovery tasks. The core idea is to use expert feedback (e.g. from medicinal chemists or biologists) to train a reward model, which then guides a generative or optimization policy (the drug design model) via reinforcement learning. This human-in-the-loop approach promises to inject domain expertise into generative pipelines, improving molecule validity, drug-likeness, and overall alignment with clinical objectives. For example, Insilico Medicine’s ReLEHF initiative explicitly invites chemists to rate AI-proposed molecules, aiming to refine its Chemistry42 platform ([2]).

This report presents a comprehensive technical overview of RLHF in drug discovery models. We first review background on generative modeling for molecules and standard RL pipelines. We then detail the RLHF architecture: collecting expert preferences, training reward models, and using policy optimization (commonly PPO) to fine-tune generators. We contrast RLHF with classical RL approaches, highlighting advantages (e.g. flexible preference learning) and challenges (data efficiency, human variability).

Quality assurance (QA) is crucial: drug discovery demands rigorous validation. We discuss model and output evaluation, including validity/uniqueness of molecules, ADMET and docking score checks, and compliance with regulatory guidelines ([3]) ([4]). Transparency and documentation (following FDA’s Good Machine Learning Practices) are emphasized. We illustrate real-world usage via case studies: Insilico (Chemistry42, GENTRL, and ReLEHF — whose AI-discovered rentosertib published Phase IIa results in Nature Medicine in June 2025 ([5])), novel research like DrugGen (a transformer + PPO system achieving 100% chemically valid molecules, now published in Scientific Reports ([6])), multi-agent RL (MolRL-MGPT) ([7]), and others. Industry trends (e.g. Lilly’s TuneLab platform ([8]) and their January 2026 co-innovation AI lab with NVIDIA ([9]), Nabla Bio/Takeda’s $1B+ partnership ([10])) underscore the rapid adoption of AI in pharma.

Finally, we discuss broader implications: ethical and regulatory considerations of AI-driven design, future research directions (automated labs with closed-loop RLHF, integration with large language models), and the potential impact on drug development timelines. The report concludes that RLHF offers a powerful new tool for steering generative drug design, but success will depend on careful architecture, robust QA, and alignment with human expertise.

Introduction and Background

Drug discovery is inherently challenging: identifying novel molecules with therapeutic effect involves massive search spaces, complex multi-objective criteria (potency, selectivity, ADMET profiles, synthetic accessibility), and high costs. Traditional pipelines relied on iterative experimentation guided by domain experts. Modern computational drug design employs machine learning to accelerate lead discovery. Generative models (deep neural networks that propose novel molecules) have shown promise. Representative approaches include variational autoencoders (VAEs) for SMILES strings ([11]), generative adversarial networks, and more recently graph-based networks (e.g. graph VAEs or graph transformers) that directly construct molecular graphs.

However, purely data-driven models can produce chemically valid but clinically irrelevant molecules. To steer generation towards desirable therapeutic properties, reinforcement learning (RL) methods have been applied. For example, Insilico’s GENTRL (2019) combined a VAE with RL: it generated novel DDR1 inhibitors by optimizing objectives like synthetic feasibility, novelty, and biological activity ([11]). Likewise, graph-based RL and policy gradient methods have been used to maximize docking scores or QSAR predictions for targets ([12]). These successes illustrate RL’s utility in molecular design. Yet fixed reward functions (e.g. a deterministic docking score) can be limited: they may misalign with expert intuition, overlook secondary considerations, or encourage "gaming" (optimizing the reward function rather than true efficacy).

Reinforcement Learning from Human Feedback (RLHF) addresses this gap by learning the reward function itself from human judgements. In the context of large language models (LLMs), RLHF typically involves three stages: supervised fine-tuning on example outputs, training a reward model on collected human preference data (e.g. rankings of answer pairs), and policy optimization (e.g. PPO) against that reward model to update the base model ([1]). The result is an agent that better reflects nuanced human values or task objectives, as famously demonstrated by OpenAI’s InstructGPT (which learned a “helpfulness” objective from human feedback) ([13]).

Translating RLHF to drug discovery involves a key adaptation: the “language” of human feedback now consists of domain-specific evaluations of molecular structures. Expert chemists or biologists can examine AI-generated candidates and indicate which are more promising (or acceptable). This feedback – numeric ratings, pairwise comparisons, or categorical approvals – is used to train a molecular reward model. The generative policy (which may output SMILES, SELFIES, or graphs) is then fine-tuned using that reward model, nudging generation towards chemist-approved designs. Initial research suggests this integration can substantially improve outcomes. Nahal et al. (2024) showed that allowing chemists to interactively refine property predictors via RLHF led to higher predictive accuracy and “drug-likeness” in the top generated molecules ([14]). In other words, expert-in-the-loop approaches help realign the AI’s objectives with human judgement.

This report examines the technical architectures of RLHF-based drug discovery models and the associated quality assurance processes. We first delve into RLHF fundamentals and how they map to generative molecule models, then discuss system design details, human-data integration, and validation. Throughout, we cite state-of-the-art studies and real industry applications. We emphasize multiple perspectives – academic breakthroughs (e.g. DrugGen, MolRL-MGPT), proprietary platforms (Lilly, Insilico), and news on how regulators and firms view AI in pharma. The goal is a deep, nuanced survey that equips researchers and engineers with the knowledge to design robust, effective RLHF pipelines for drug discovery.

RLHF Concepts and Pipeline

RLHF Overview

Reinforcement Learning from Human Feedback (RLHF) is a hybrid learning paradigm that bridges supervised learning (trainer-provided examples) and reinforcement learning (self-optimization with rewards). Its essence is to align a model’s outputs with human preferences when it is hard to specify a reward function explicitly ([1]). In practice, RLHF pipelines typically involve these stages:

  1. Supervised Baseline / Imitation: Start with a generative model pre-trained on vast data. For molecules, this might be a language model trained on SMILES strings or a graph-based generator trained on public chemical databases.
  2. Preference Data Collection: Show outputs (e.g. proposed molecules or sequences) to humans. Experts rank or rate them according to desirability (effectiveness, novelty, safety, etc). Data can come from pairwise comparisons (“Which of these two compounds is preferred?”) or absolute ratings.
  3. Reward Model Training: Use the labeled comparisons to train a reward function R(⋅). This model, often a neural network, predicts the human-given score for any candidate molecule.
  4. Reinforcement Learning (Policy Optimization): Employ an RL algorithm (commonly Proximal Policy Optimization, PPO) that fine-tunes the original model. The policy now receives feedback from R instead of a fixed formula. Effectively, the model learns to generate molecules that receive high predicted human-reward.
  5. Evaluation and Iteration: The improved model is evaluated on held-out tasks or new feedback rounds. Additional human data can be collected in subsequent iterations to continually refine the reward model.

This cycle is illustrated schematically in Table 1 below, contrasting it with standard RL:

| Stage | RL (Fixed Reward) | RLHF (Human Feedback) |
|---|---|---|
| Reward Definition | Predefined, explicit (e.g. docking score, solubility) | Learned from human-provided labels ([1]) ([15]) |
| Data Source | Simulation or calculators (no human needed) | Expert chemists/biologists providing preference signals ([2]) |
| Adaptability | Rigid; may not capture nuanced preferences | Flexible; can incorporate subjective criteria ([15]) |
| Bias/Noise | Bias arises from mis-specified reward | Bias arises from human label variability or errors |
| Examples | GENTRL VAE+RL optimizing predicted affinity ([11]) | ChatGPT/InstructGPT (guided by labelers) ([13]); Insilico ReLEHF (expert annotations) ([2]) |

Reward Modeling in Drug RLHF

A key challenge in molecular RLHF is designing the reward model architecture. The input is a candidate compound (often encoded as a graph or SMILES), optionally with context (such as target info). The output is a scalar score predicting human approval. Approaches include:

Architecturally, the reward model can be a graph neural network (GNN) for molecular structures, or a transformer on SMILES. It may incorporate domain features (e.g. predicted TPSA, LogP) or embeddings from pre-trained chemical models. A separate invalid-structure penalizer can also be included: DrugGen, a recent RLHF system, uses an auxiliary model to score molecule validity, ensuring the generator avoids syntactically invalid SMILES ([16]).
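When expert feedback arrives as pairwise comparisons, one common way to fit a scalar reward is the Bradley-Terry objective used in LLM-style RLHF. The sketch below is a minimal illustration under stated assumptions: it uses a linear reward over a hypothetical two-element descriptor vector instead of a GNN or transformer, and the names `fit_linear_reward`, `pairwise_loss`, and the toy features are invented for this example rather than taken from any cited system.

```python
import math
from typing import List, Sequence, Tuple

Features = Sequence[float]
Pair = Tuple[Features, Features]  # (preferred, rejected)

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def reward(weights: Sequence[float], x: Features) -> float:
    """Linear reward: dot product of weights and the descriptor vector."""
    return sum(w * xi for w, xi in zip(weights, x))

def pairwise_loss(weights: Sequence[float], preferred: Features, rejected: Features) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_pref - r_rej)."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    # Stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def fit_linear_reward(pairs: List[Pair], n_features: int,
                      lr: float = 0.1, epochs: int = 200) -> List[float]:
    """Plain stochastic gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient of -log sigmoid(margin) w.r.t. w is
            # -(1 - sigmoid(margin)) * (x_pref - x_rej).
            scale = -(1.0 - sigmoid(margin))
            for i in range(n_features):
                weights[i] -= lr * scale * (preferred[i] - rejected[i])
    return weights

# Hypothetical descriptors: [drug-likeness score, number of structural alerts].
toy_pairs: List[Pair] = [
    ([0.8, 0.0], [0.4, 1.0]),
    ([0.7, 0.0], [0.6, 2.0]),
    ([0.9, 1.0], [0.3, 1.0]),
]
learned = fit_linear_reward(toy_pairs, n_features=2)
```

On this toy data the learned weight on the first descriptor is positive and the weight on the alert count is negative, which is the ordering the simulated chemist expressed. A real reward model would replace the linear function with a learned encoder but could keep the same loss.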

Policy Optimization

The core RL algorithm in RLHF is often PPO (Proximal Policy Optimization) ([16]), though other policy gradient methods or evolutionary strategies can be used. The policy network starts from the pre-trained generative model. During training, batches of molecules are sampled and evaluated by the reward model; gradients are computed to nudge the policy towards higher rewards. A KL-penalty or replay buffer may be used to prevent catastrophic forgetting, keeping the model from drifting too far from the original chemical space it was trained on.
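The two ingredients mentioned above, a KL penalty against the pre-trained model and PPO's clipped update, can be written down in a few lines. This is a sketch under assumptions: log-probabilities are taken at the whole-sequence level, `beta` and `clip_eps` are illustrative values, and the function names are invented for this example rather than drawn from DrugGen or any specific library.

```python
import math

def kl_shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward-model score minus beta times the log-ratio of policy to reference.

    The log-ratio is a single-sample estimate of KL(policy || reference), so
    molecules the fine-tuned policy favors far more than the pre-trained
    model does are penalized.
    """
    return rm_score - beta * (logp_policy - logp_ref)

def ppo_clipped_objective(logp_new: float, logp_old: float, advantage: float,
                          clip_eps: float = 0.2) -> float:
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# A sampled molecule that the policy now likes more than the reference does:
shaped = kl_shaped_reward(rm_score=1.0, logp_policy=-8.0, logp_ref=-10.0, beta=0.5)
```

In a training loop the surrogate is averaged over a batch and maximized. The clip keeps any single update from moving the policy too far, while the KL term limits cumulative drift away from the original chemical space.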

Notably, RL for molecules has been done on various representations:

In RLHF, most research thus far leverages language-type models (SMILES) because they integrate naturally with established RLHF frameworks. For instance, DrugGen extends a transformer called DrugGPT: it decodes protein sequences to propose binding molecules, then uses PPO with a reward model based on predicted binding affinity to tune generation quality ([16]).

Data and Expert Feedback

The success of RLHF hinges on high-quality feedback. In drug discovery:

The efficiency of data use is critical. Techniques like active learning (choosing which molecules to label for maximum information) or synthetic feedback (using surrogate models when experts are unavailable) can help. Chemists may rate molecules on multiple criteria (safety, novelty, ease of synthesis) which could form a multi-objective reward. Designing interfaces to capture rich feedback (beyond “A is better than B”) can further strengthen the reward model.
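One simple form of the active learning mentioned above is to send experts the molecules on which an ensemble of reward models disagrees most. The sketch below assumes each ensemble member is a callable from a SMILES string to a score; `select_for_labeling` and the toy scorers are hypothetical and exist only to show the selection rule.

```python
from statistics import pvariance
from typing import Callable, List, Sequence

Scorer = Callable[[str], float]

def select_for_labeling(candidates: Sequence[str], ensemble: Sequence[Scorer],
                        k: int) -> List[str]:
    """Return the k candidates with the highest score variance across the ensemble."""
    scored = [(pvariance([model(c) for model in ensemble]), c) for c in candidates]
    # Python's sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

# Toy stand-ins for trained reward models; only the second reacts to nitrogen.
toy_ensemble: List[Scorer] = [
    lambda s: 0.1 * len(s),
    lambda s: 0.1 * len(s) + (1.0 if "N" in s else 0.0),
    lambda s: 0.1 * len(s),
]
to_label = select_for_labeling(["CCO", "CCN", "c1ccccc1"], toy_ensemble, k=1)
```

Here `to_label` contains only "CCN", the one candidate the toy models disagree on. Variance is just one acquisition rule; expected model change or diversity-aware selection would slot into the same function.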

Technical Architecture for RLHF in Drug Design

An end-to-end RLHF system for molecular generation involves multiple software components and data flows. Figure 1 (conceptual) outlines a typical architecture:

Figure 1. Schematic of an RLHF-based drug design pipeline. The core components include (1) a pre-trained generative model (policy), (2) a human feedback interface, (3) a reward model, and (4) a reinforcement learning optimizer. Solid arrows indicate data flow; dashed arrows indicate feedback retraining.

[Large Pretrained Model] --(generate molecules)--> [Human Expert Interface]
 <--(collect feedback)-- [Human Expert Interface]
[Human Expert Interface] --(labels)--> [Reward Model]
[Large Pretrained Model] + [Reward Model] --(RL/Optimizer)--> [Fine-tuned Model]
  1. Base Generative Model: Often a transformer or recurrent network trained on large chemical libraries. This initial model ensures the agent “speaks the language” of chemistry (valid SMILES, common substructures). For example, MolGPT pre-trained on millions of known drug-like molecules serves as a starting policy ([7]). This model can be either a sequence model (SMILES/SELFIES) or a graph-based generator.
  2. Expert Feedback Interface: A web or desktop tool that displays AI-proposed molecules and captures expert judgments. It may show 2D chemical structures, predicted properties, and let the chemist rank or rate them. The interface securely records these labels for later model training (with user accounts for traceability).
  3. Reward Model: A neural network (e.g. GNN or transformer encoder) that takes a molecule (and optionally context) as input and outputs a scalar reward. It is trained using loss functions appropriate to the feedback format (binary cross-entropy for pairwise data, regression loss for scores). The reward model may include ensembling for uncertainty estimation to improve reliability.
  4. RL Optimizer: An implementation of PPO or similar that updates the generative model parameters. Key implementation details include:
  5. Evaluation and QA Module: Parallel to model updates, an evaluation suite checks generated molecules against metrics (see next section). It may include:

A crucial aspect of architecture is the data pipeline for retraining. Once sufficient new feedback is collected, the reward model must be retrained (often from scratch or fine-tuned) on the expanded dataset. The RL optimizer then resumes using the updated reward model. This loop may repeat multiple times: RLHF is inherently iterative.
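The retraining loop described above can be outlined as a small driver function. This is a structural sketch only: the four callables (`sample`, `collect_feedback`, `fit_reward`, `optimize`) and the `min_new_labels` threshold are placeholders invented here, standing in for the generator, the expert interface, reward-model training, and the RL optimizer.

```python
from typing import Any, Callable, List, Optional, Tuple

def run_rlhf_rounds(
    policy: Any,
    sample: Callable[[Any], List[Any]],
    collect_feedback: Callable[[List[Any]], List[Any]],
    fit_reward: Callable[[List[Any]], Any],
    optimize: Callable[[Any, Any], Any],
    n_rounds: int,
    min_new_labels: int,
) -> Tuple[Any, Optional[Any], int]:
    """Alternate feedback collection, reward retraining, and policy updates."""
    dataset: List[Any] = []
    reward_model: Optional[Any] = None
    pending = 0  # labels gathered since the last reward-model retrain
    for _ in range(n_rounds):
        labels = collect_feedback(sample(policy))
        dataset.extend(labels)
        pending += len(labels)
        if pending >= min_new_labels:
            # Retrain on the full expanded dataset once enough new labels exist.
            reward_model = fit_reward(dataset)
            pending = 0
        if reward_model is not None:
            policy = optimize(policy, reward_model)
    return policy, reward_model, len(dataset)

# Toy wiring with integers in place of real models, just to exercise the flow.
final_policy, final_rm, n_labels = run_rlhf_rounds(
    policy=0,
    sample=lambda p: [p, p + 1],
    collect_feedback=lambda mols: [(m, m % 2) for m in mols],
    fit_reward=lambda data: len(data),
    optimize=lambda p, rm: p + 1,
    n_rounds=3,
    min_new_labels=4,
)
```

Retraining from the full dataset each time mirrors the "from scratch" option in the text. Fine-tuning on new labels only would change `fit_reward`, not the loop.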

In industry settings, this system would be implemented with a distributed architecture: e.g. a central server hosting the models and databases, a task queue for the RL jobs on GPU clusters, and a web service for experts to label. Data storage must ensure provenance (which expert labeled what, under which experiment conditions) for QA tracking.
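Provenance is easier to enforce when every label is stored as an immutable record carrying who gave it and under which conditions. The sketch below shows one possible record shape; the field names, the `label_type` vocabulary, and the JSON-lines audit format are assumptions made for this example, not a description of any vendor's schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

def _utc_now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()

@dataclass(frozen=True)
class FeedbackRecord:
    """One expert judgement plus the context needed to audit it later."""
    smiles: str          # molecule that was shown to the expert
    expert_id: str       # pseudonymous reviewer identifier
    experiment_id: str   # campaign or case study the molecule came from
    label_type: str      # e.g. "rating", "pairwise", "approval" (hypothetical vocabulary)
    value: float         # the judgement itself
    model_version: str   # generator version that proposed the molecule
    recorded_at: str = field(default_factory=_utc_now)

def to_audit_line(record: FeedbackRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)

example = FeedbackRecord(
    smiles="CCO",
    expert_id="chemist-01",
    experiment_id="exp-42",
    label_type="rating",
    value=4.0,
    model_version="policy-v3",
)
line = to_audit_line(example)
```

Freezing the dataclass prevents silent edits after a label is written, and appending one JSON line per record gives an append-only log that downstream QA can replay.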

Example (Insilico ReLEHF): Insilico’s platform Chemistry42 (a suite for generative chemistry) has integrated an RLHF program called ReLEHF. It provides an online interface for experts to score generated molecules from case studies (JAK3 inhibitor design, USP7 hit-expansion, etc.) ([2]). Their aim is to use this expert input to dynamically improve the underlying AI models. While specifics are proprietary, the concept follows the above architecture: the chemist feedback goes to a reward model, which refines the generative agent.

Quality Assurance and Validation

For any AI-driven drug discovery system, quality assurance (QA) is paramount. Unlike many consumer AI applications, errors in drug design can have severe consequences (failed trials, toxicity). Thus, RLHF pipelines must incorporate rigorous validation at multiple levels, combining ML best practices with pharmaceutical standards.

Model Development QA

If available, an external dataset (benchmarked preferences or domain rules) can validate the reward function’s generality.
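A direct check of reward-model generality is pairwise accuracy on preferences that were held out from training. The sketch below assumes the reward model is exposed as a callable from a molecule to a score; `pairwise_accuracy` and the length-based toy scorer are hypothetical.

```python
from typing import Callable, Sequence, Tuple

def pairwise_accuracy(score: Callable[[str], float],
                      held_out_pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (preferred, rejected) pairs the scorer ranks in the expert's order."""
    if not held_out_pairs:
        raise ValueError("need at least one held-out pair")
    correct = sum(1 for preferred, rejected in held_out_pairs
                  if score(preferred) > score(rejected))
    return correct / len(held_out_pairs)

# Toy scorer that simply prefers longer SMILES strings.
accuracy = pairwise_accuracy(len, [("CCCO", "CO"), ("C", "CCN")])
```

An accuracy near 0.5 means the reward model is no better than chance on unseen comparisons, which is a signal to collect more feedback before running further policy optimization.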

Output-Level QA

Once molecules are generated by the (post-RLHF) model, they undergo a battery of checks. Key metrics and tests include:

Table 2 lists key evaluation metrics and desired ranges for drug-like generation.

| Metric | Description | Target Qualities | Example Thresholds |
|---|---|---|---|
| Validity | % of outputs that are syntactically valid molecules | > 95% (ideally ≈100%) | > 90% for partially-learned |
| Uniqueness | % of unique molecules among N generated | High (depends on N; avoid collapse) | > 50% for N=1000 |
| Novelty | % not matching any training-set molecule | High (exploration encouraged) | > 80% |
| Drug-likeness | Rule-of-5 compliance, QED score, synthetic score | Similar to known drug-like distributions | e.g. QED > 0.5 (on 0-1 scale) |
| Activity Score | Predicted binding affinity or probability of activity | High for target of interest | e.g. IC50 predicted low (nM) |
| Toxicity | Absence of toxicophores or predicted organ toxicity | None/minimal flagged | PAINS alerts = 0 |
| Diversity | Average pairwise molecular distance | Broad coverage; avoid clustering | Depends on library size |

In practice, the QA process often couples automated evaluation with expert review. High-scoring molecules can be sanity-checked by chemists before synthesis or biological testing. If errors slip through (e.g. persistent invalids due to grammar quirks), the model should be patched (more training, regex fixes).
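The first three rows of Table 2 reduce to set arithmetic once a validity check is available. The sketch below takes that check as a parameter so it stays dependency-free; the `balanced_parens` validator is only a stand-in, and in practice one would plug in a cheminformatics parser (for example RDKit's `Chem.MolFromSmiles`, which returns `None` for unparsable SMILES). It also assumes SMILES strings are already canonicalized, since string equality is used to detect duplicates.

```python
from typing import Callable, Dict, Iterable, List

def generation_metrics(generated: List[str], training_set: Iterable[str],
                       is_valid: Callable[[str], bool]) -> Dict[str, float]:
    """Validity, uniqueness, and novelty as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        # valid molecules / all generated
        "validity": len(valid) / len(generated) if generated else 0.0,
        # distinct valid molecules / valid molecules
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        # distinct valid molecules absent from training data / distinct valid
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

def balanced_parens(smiles: str) -> bool:
    """Placeholder validity check: non-empty with balanced branch parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and bool(smiles)

metrics = generation_metrics(
    generated=["CCO", "CCO", "C(C", "CCN"],
    training_set={"CCO"},
    is_valid=balanced_parens,
)
```

Definitions of uniqueness and novelty vary between benchmarks (some divide by all generated molecules rather than by valid ones), so the denominators above should be stated alongside any reported numbers.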

Regulatory and Ethical QA

Pharmaceutical regulators are increasingly focusing on AI-model validation. The FDA, in collaboration with Health Canada and the UK’s MHRA, has established 10 guiding principles for Good Machine Learning Practice (GMLP), and the International Medical Device Regulators Forum (IMDRF) released its final GMLP document in January 2025 ([18]). In 2025, the FDA also published draft guidance on "Considerations for the Use of Artificial Intelligence to Support Regulatory Decision Making for Drug and Biological Products", while CDER and CBER collaborated with the EMA on 10 guiding principles for AI in drug and biological product development ([19]). These frameworks emphasize transparency, reproducibility, and lifecycle management ([4]). For RLHF in drug discovery, this means:

In summary, QA for RLHF-mediated drug discovery demands combining ML validation techniques with domain-specific pharmaceutical standards ([20]) ([4]). The development lifecycle should mirror that of software for medical devices: defined processes, thorough testing, and continual monitoring post-deployment (e.g. tracking if AI-generated leads fail in experimentation abnormally often).

Case Studies and Examples

1. Insilico Medicine – GENTRL and ReLEHF

Insilico Medicine has been a pioneer in applying generative AI to drug discovery. In 2019, they published GENTRL, a deep generative model (VAE) trained on millions of compounds and fine-tuned via RL to find DDR1 kinase inhibitors ([11]). GENTRL’s design objective encoded synthetic feasibility, novelty, and predicted activity ([11]). This pipeline led to the discovery of a potent clinical candidate (targeting fibrosis) in a remarkably short time.

More recently, Insilico introduced ReLEHF (Reinforcement Learning with Expert Human Feedback) in their Chemistry42 platform ([2]). ReLEHF is a community program where medicinal chemists rate AI-proposed structures from ongoing projects. For example, experts examine molecules from case studies (JAK3 inhibitor design, USP7 hit expansion, etc.) and leave feedback ([2]). The feedback trains reward models to further guide Chemistry42’s generative loop.

A landmark validation came in June 2025, when Insilico published Phase IIa results for rentosertib (INS018_055), a TNIK inhibitor for idiopathic pulmonary fibrosis (IPF) discovered and designed entirely using their AI platform, in Nature Medicine ([5]). In the GENESIS-IPF trial (71 patients, 22 sites), the 60 mg QD dose showed a mean improvement in forced vital capacity (FVC) of +98.4 mL versus −20.3 mL for placebo, with a manageable safety profile. This represents the industry’s first proof-of-concept clinical validation of AI-driven drug discovery. Insilico has begun regulatory discussions to advance rentosertib into larger cohorts.

Chemistry42 itself has continued to evolve in 2025–2026, with its ADMET profiling module fully rebuilt using a new training framework, adding 70+ new GPCR inhibitor models for off-target safety assessment. The platform now comprises 7 distinctive applications spanning molecular generation, free energy binding prediction, ADMET property prediction, kinase selectivity prediction, and retrosynthesis route screening ([21]). Insilico also received FDA IND clearance in January 2026 for ISM8969, an AI-designed NLRP3 inhibitor for inflammation and neurodegenerative disorders ([22]).

Key Takeaway: Industrial pipelines are evolving from purely automated optimization (GENTRL) toward hybrid human-AI loops (ReLEHF), and clinical results now validate the approach — rentosertib’s Phase IIa success marks the first AI-discovered drug to demonstrate clinical efficacy in a randomized trial.

2. DrugGen (Transformer + RL)

DrugGen exemplifies RLHF-inspired design. The authors fine-tuned a transformer (DrugGPT) on known drug-target interactions and then applied PPO-based RL using a complex reward model. The reward combined a predicted binding affinity (via PLAPT, a pre-trained protein-ligand affinity transformer) and an “invalid structure assessor” that penalizes chemically invalid SMILES. This human-like reward (inspired by what a chemist would desire: potent and valid molecules) led to dramatic improvements. DrugGen achieved 100% valid molecule generation, up from 95.5% in the unguided model, and produced molecules with higher predicted binding affinities (7.22 vs. 5.81 for DrugGPT). Originally an arXiv preprint, DrugGen was formally published in Scientific Reports in April 2025 ([6]), with the model publicly available on Hugging Face. Docking simulations further validated its ability to generate molecules for targets like FABP5 with superior scores compared to reference molecules. This result emphasizes the power of customizing the reward to reflect chemical sensibility.

3. MolRL-MGPT: Multi-Agent Collaboration

Hu et al. (2024) introduced MolRL-MGPT, a multi-agent RL approach where several GPT-based agents explore molecular space collaboratively ([7]). While not explicitly using human feedback, their method fosters diversity through agent competition. On the GuacaMol benchmark of de novo molecule generation, MolRL-MGPT showed promising results in producing diverse, high-quality candidates ([7]). This suggests that even in the absence of explicit human labels, multi-agent mechanisms can mimic some benefits of RLHF by avoiding narrow solutions.

4. NLP-to-Chemistry Transfer (ChatGPT-like)

The success of RLHF in large language models (e.g. OpenAI’s InstructGPT) is instructive. Ouyang et al. (2022) showed that even a much smaller model (1.3B parameters) could outperform a 175B GPT-3 by using RLHF, achieving better helpfulness, honesty, and harmlessness ([13]). By analogy, it suggests that domain-specific feedback (from chemists) can elevate a mid-size chemical language model to a level beyond a nominally larger one. Future drug-discovery systems may similarly transfer techniques from LLMs. Indeed, companies are exploring LLMs for chemistry queries or retrosynthesis; integrating RLHF could align them to expert reasoning.

5. Industry Partnerships and Platforms

Beyond specific models, news reports indicate broad industry adoption of AI-guided discovery. Eli Lilly’s TuneLab platform, launched in September 2025, provides biotech startups access to Lilly’s AI models trained on proprietary data obtained at a cost of over $1 billion, representing one of the industry’s most valuable datasets for drug discovery ([23]). The platform employs federated learning so that partners can leverage Lilly’s models without exposing proprietary data. In January 2026, Lilly and NVIDIA announced a co-innovation AI lab with up to $1 billion in investment over five years, integrating NVIDIA Clara foundation models into TuneLab workflows ([9]). While not explicitly RLHF, TuneLab’s goal of democratizing advanced generative and RL models for drug design aligns with the broader RLHF vision.

Similarly, Nabla Bio and Takeda expanded their AI partnership in October 2025 with a second multi-year collaboration worth over $1 billion in potential milestone payments ([24]). The partnership deploys Nabla’s proprietary Joint Atomic Model (JAM) platform for de novo antibody design across multiple targets, multispecifics, and custom therapeutics. This highlights that big pharma sees value in AI, raising the urgency for robust QA practices: industry analyses suggest AI implementation delivers 30–70% cost reductions in preclinical research and can compress early discovery timelines by 30–40% ([25]), but only if thoroughly validated.

6. Human-in-the-Loop Research Studies

Academic validations of human-in-the-loop design further support these approaches. Sundin et al. (2022) developed a framework where chemists directly optimize molecules through interactive RL (the user assesses each step) ([26]). Nahal et al. (2024) performed simulated and real HITL experiments, finding that refined human feedback progressively improved the property predictors and top molecules’ quality ([14]). These studies demonstrate that even small amounts of well-integrated human guidance can steer results favorably.

Overall, these case studies show that combining generative models with human expertise yields better outcomes than either alone. When designing an RLHF system, one can draw lessons such as: reward models should penalize unrealistic structures ([16]), maintain a diverse candidate pool ([7]), and use targeted expert input on key decision points ([2]) ([14]).

Data Analysis, Evidence, and Metrics

Quantitative evidence from the literature underscores RLHF’s impact on generative drug models. We summarize key findings:

To ground these results, we can report specific statistics:

However, not all evidence is purely numeric. Expert consensus and opinion pieces stress trust and alignment as major benefits. Clinical AI thought leaders assert that integrating human judgement in the loop is key to gaining adoption. For instance, an editorial in Nature Biotech highlights that allowing chemists to vet AI outputs “streamlines discovery” ([11]).

We must also consider limitations and null results. If human feedback is inconsistent or sparse, RLHF can fail. One critical analysis warns that “better benchmarks” are needed for molecular generation, as many reported improvements do not translate to realistic drug tasks ([28]). This underscores our emphasis on robust QA – we need to verify that higher reward scores indeed correlate with therapeutic success.

In summary, empirical data show that RLHF-like strategies can substantially increase output quality metrics (validity, drug-likeness, target affinity) over unguided generative models ([16]) ([14]). The exact gains depend on the problem and feedback quality, but the trend is clear: human feedback gives models a more refined objective. Combining these quantitative analyses with expert insights provides a compelling case for RLHF, as we have aggregated throughout this report.

Implications, Challenges, and Future Directions

Technical Implications

The integration of RLHF into drug discovery revolutionizes experimental design. ML models become adaptive collaborators, not static tools. This raises several implications:

Research and Dataset Needs

Ethical and Regulatory Implications

Future Directions

Societal Impact

Conclusion

Reinforcement Learning from Human Feedback (RLHF) is an exciting frontier for drug discovery. By merging computational power with expert insight, it offers a route to more intelligent, reliable generative models. This report has delved deeply into the technical architectures and quality assurance considerations for RLHF-driven drug design. We reviewed the end-to-end pipeline, from pre-training and human labeling to reward modeling and policy optimization, emphasizing how each component must be engineered and validated for the critical application of drug discovery.

We surveyed empirical evidence and case studies that highlight RLHF’s potential. Key outcomes include vastly improved chemical validity and aligned molecular properties ([6]) ([14]) compared to unguided generation. Industry examples — from Insilico’s pioneering pipelines (with rentosertib’s Phase IIa success marking the first clinical proof-of-concept for AI-discovered drugs) to major pharma partnerships like Lilly–NVIDIA and Nabla Bio–Takeda — underscore that RLHF-based approaches have moved from theory to real-world clinical validation.

Quality assurance emerges as a central theme. Without rigorous validation – of models, data, and final compounds – the promise of RLHF could fall short or even backfire. Combining automated metrics (validity, diversity, predicted ADMET) with human oversight creates a multi-layer safeguard. Moreover, adhering to emerging AI standards and regulatory guidelines ([4]) will ensure that RLHF tools are developed responsibly.

Looking forward, we foresee RLHF becoming a standard element of the drug developer’s toolbox, akin to how high-throughput screening revolutionized lead optimization decades ago. The challenges are nontrivial – from collecting reliable feedback to scaling RL training – but the first successes suggest the rewards are commensurate. As synthetic biology and real-time patient data grow, RLHF could even personalize therapies by incorporating individual biological feedback.

In sum, RLHF marries the adaptability of learning algorithms with the wisdom of human scientists. Done right, it can expedite the journey from molecule design to life-saving medicine, while embedding the requisite caution and scrutiny at every step. The future of drug discovery will likely be written by teams that can harness both advanced AI and deep domain expertise in concert.

References

(Note: All references above include inline citations using bracketed style for clarity.)