TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation

Jiatong Li¹, Junxian Li²∗, Yunqing Liu¹, Dongzhan Zhou³, Qing Li¹
¹The Hong Kong Polytechnic University, ²Shanghai Jiao Tong University, ³Shanghai AI Lab
jiatong.li@connect.polyu.hk, lijunxian0531@sjtu.edu.cn, yunqing617.liu@connect.polyu.hk
zhoudongzhan@pjlab.org.cn, csqli@comp.polyu.edu.hk

Abstract

In this paper, we propose Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses a dataset of three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each task further contains three subtasks, with each subtask comprising 5,000 test samples. Given the inherent complexity of open molecule generation, we have also developed an automated evaluation system that helps measure both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations and potential areas for improvement in text-guided molecule discovery. Furthermore, with the assistance of OpenMolIns, a specialized instruction tuning dataset proposed for solving challenges raised by TOMG-Bench, Llama3.1-8B could outperform all the open-source general LLMs, even surpassing GPT-3.5-turbo by 46.5% on TOMG-Bench. Our codes and datasets are available through https://github.com/phenixace/TOMG-Bench.

1 Introduction

Molecule discovery plays a pivotal role in various scientific research fields, from pharmaceuticals Keiser et al. (2010) to materials science Higuchi et al. (2023). Normally, molecule discovery is a trial and error process Ekins (2024), which requires repetitive experimentation and data analysis Mattern and Grosser (2023). Due to the inefficiency of the traditional techniques, it usually takes more than 10 years to bring a new drug candidate into the market Lee et al. (2018).

With the development of machine learning techniques and the advent of Graph Neural Networks (GNNs) Wang et al. (2023), there has been a significant step forward. As molecules could be represented as graphs, GNN-based methods can capture the structural patterns of the molecule and make accurate predictions. With the assistance of GNNs, researchers could analyse the properties of molecules Cai et al. (2022) and generate new molecule candidates Jin et al. (2018). However, challenges still exist. GNN-based methods struggle to generalize to different tasks Chen et al. (2024), necessitating costly data collection and preparation for different downstream tasks. Moreover, these methods are constrained in their capacity to generate molecules with specific, customized properties, limiting their flexibility in molecular design Li et al. (2024c).

Figure 1: Comparison of Text-Based Targeted Molecule Generation (a) vs. Text-Based Open Molecule Generation (b).

In contrast, Large Language Models (LLMs) have shown great generalization capability Achiam et al. (2023) and can be easily adapted to different research fields. For instance, Cascella et al. (2023) utilize ChatGPT for supporting clinical practice and Zhen et al. (2024) adopt LLMs as assistants for task planning in the field of Civil Engineering, showing the great potential of LLMs in scientific discovery.

As molecules can be represented as texts by Simplified Molecular Input Line Entry System (SMILES), a linear notation that encapsulates the structure of a chemical compound Weininger (1988), they can be processed and understood by LLMs, bridging the gap between molecules and natural languages. With advanced reasoning and in-context learning capabilities, LLMs are particularly adept at generalizing to the molecule domain Li et al. (2024b). The generalization capability makes LLMs a viable option in molecule discovery. Furthermore, by aligning molecules with textual data, LLMs can serve as powerful assistants to chemists Zhang et al. (2024); Li et al. (2024d). They could help interpret and generate chemical knowledge, suggest modifications to molecular structures, and even predict the properties and behaviours of compounds, which would potentially streamline the molecule discovery process, leading to breakthroughs in diverse research areas.

While the integration of LLMs into molecule discovery holds immense promise, the process of aligning molecules with textual data is challenging Li et al. (2024a). A significant challenge lies in the availability and diversity of datasets and benchmarks necessary for training and evaluation. Although the task of molecule-caption translation Edwards et al. (2022) is crucial for bridging the gap between the molecular and textual domains, it still has several limitations that need to be addressed:

On the one hand, there is a concern about the generalization of the molecule-caption translation task Li et al. (2024b). In real-world scenarios, molecule captions that describe molecular structures can be highly ambiguous, with multiple correct interpretations, while the current molecule-caption translation is actually a targeted generation task. In this case, these models often struggle to generalize to customized molecules, even for seemingly simple examples Li et al. (2024b). This suggests a fundamental mismatch between the molecular and textual spaces, raising questions about whether this task could truly guide LLMs well. On the other hand, a critical issue is the inability to propose new molecule structures. The ultimate goal of molecule discovery is not just to understand and describe existing chemical compounds but to innovate and discover new ones, particularly in the context of drug discovery, indicating that the current molecule-caption translation task and the corresponding evaluation metrics fall short in this regard. Therefore, addressing these challenges is essential for harnessing the full potential of LLMs in molecule discovery.

In our efforts to bridge the gap between the natural language and the molecular spaces and to further facilitate LLMs as chemist assistants in molecule discovery, we propose a novel benchmark, Text-based Open Molecule Generation Benchmark (TOMG-Bench). TOMG-Bench is designed to evaluate the open-domain generative capabilities of LLMs in the molecular domain through a series of structured instructions for molecule design and operations. As shown in Figure 1, different from the previous targeted molecule generation tasks, Text-based Open Molecule Generation neither sets a specific target nor requires LLMs to generate an exactly matched molecule. Instead, we adopt chemical toolboxes like RDKit Landrum (2013) to test whether the generation meets the requirements. In other words, there can be multiple correct answers for a single question, and LLMs are only required to generate one of them. Notably, TOMG-Bench is categorized into three primary tasks, i.e., molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each category contains three subtasks, and each subtask is composed of 5,000 test samples, providing a comprehensive and robust assessment of whether LLMs truly grasp the molecular space. Meanwhile, we also propose a dedicated set of evaluation metrics to evaluate and rank the performance of LLMs, which considers both the accuracy and quality of the generated molecules. Moreover, we propose an instruction-tuning dataset, OpenMolIns, by extracting and reformatting molecules from an existing molecule database. OpenMolIns is structured across five distinct data levels (i.e., light, small, medium, large, and extra-large) to suit different training purposes.

Our contributions are primarily threefold:

2 Related Work

In this section, we briefly review related work on the development of Artificial Intelligence (AI) in molecule discovery and, more specifically, in text-based molecule generation tasks.

2.1 Development of AIs in Molecule Discovery

Molecule discovery plays a pivotal role across numerous scientific fields, driving advancements in the development of drug discovery and material design Du et al. (2022). Thus, integrating artificial intelligence into molecule discovery has marked a transformative shift in the pharmaceutical landscape, significantly enhancing the efficiency and effectiveness of identifying and developing new therapeutic molecules. Recent advancements in machine learning (ML), deep learning (DL), and natural language processing (NLP) have enabled AI systems to analyze complex biological and chemical data more effectively than traditional methods Wigh et al. (2022); Wu et al. (2018); Zhou et al. (2023). For instance, MolReGPT Li et al. (2024b) leverages large language models (LLMs) like ChatGPT to learn molecule SMILES strings representation for molecule-caption translation tasks. Moreover, existing studies have explored advanced methods that utilize various AI techniques to further enhance molecule discovery processes, including Convolutional Neural Networks (CNNs) Peng and Zhao (2019); Le et al. (2019), Recurrent Neural Networks (RNNs) Grisoni et al. (2020); Popova et al. (2019), Graph Neural Networks (GNNs) Wang et al. (2023); Sun et al. (2022), and Transformer-based networks Xia et al. (2023); Balaji et al. (2023); Edwards et al. (2022).

2.2 Text-based Molecule Generation

Text-based Molecule Generation (Text2Mol) Edwards et al. (2021) has recently emerged as a transformative approach to molecule discovery. This task centres on retrieving molecules using natural language descriptions as search queries, requiring the creation of paired datasets of molecules and their corresponding textual representations. This enables the learning of a shared semantic embedding space for efficient retrieval. Early approaches leveraged transformer-based models like MolT5 Edwards et al. (2022), employing self-supervised learning on large datasets to generate high-quality Simplified Molecular Input Line Entry System (SMILES) strings from textual inputs. Subsequent advancements, such as KV-PLM Zeng et al. (2022), MoMu Su et al. (2022), and BioT5 Pei et al. (2023), integrated molecular graphs and biochemical text to improve both understanding and generation capabilities. 3D-MoLM Li et al. (2024e) further enhanced this by incorporating spatial configurations, leading to more accurate and geometrically valid molecular representations. The application of large language models (LLMs) like MolReGPT Li et al. (2024b) and ICMA Li et al. (2024a) as in-context learners has also shown significant promise. These models demonstrate the ability of LLMs to adaptively generate molecules by retrieving and leveraging relevant examples from the provided context. Most recently, MolReFlect Li et al. (2024c) underscored the importance of fine-grained alignment between molecular structures and their textual descriptions, utilizing a teacher-student training paradigm to capture these nuanced relationships effectively. Unlike this targeted generation task, in this paper we propose a Text-based Open Molecule Generation task, which neither sets a specific target molecule nor requires LLMs to generate an exactly matched one.

3 TOMG-Bench

In this section, we propose TOMG-Bench to comprehensively assess the performance of LLMs in molecular space. Specifically, the benchmark is composed of three basic tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). To ensure the integrity and effectiveness of TOMG-Bench, we have developed a robust set of evaluation metrics for different tasks. Additionally, we have created OpenMolIns, an instruction-tuning dataset aimed at enhancing the performance and adaptability of LLMs to the challenges presented by this benchmark.

Figure 2: Data construction workflow and evaluation process of TOMG-Bench.

3.1 Dataset Categorization

The categorization of the TOMG-Bench dataset initially considers the inherent characteristics of molecule SMILES representation and the role of LLMs serving as the chemist’s assistant, namely helping the chemists to edit, optimize and customize molecules as they want. Following the difficulty of the tasks, we demonstrate the content of the three basic tasks as well as their corresponding subtasks:

MolEdit emerges as the most straightforward task among the three domains, as an existing molecule is already provided, and LLMs are only required to make modifications to it, which tests the molecular structure knowledge of LLMs. In this case, we have crafted three subtasks for MolEdit: AddComponent, DelComponent, and SubComponent. In AddComponent, LLMs are instructed to add a specific functional group to the given molecule, and DelComponent challenges LLMs to remove a specified functional group from the provided molecule, while SubComponent is a hybrid of the previous two subtasks, requiring LLMs to first remove a designated functional group and then introduce a new one as specified to the molecule.

MolOpt challenges LLMs to not only edit molecules but also to discern whether the modification will steer the molecule towards a desired optimization target. To assess this capability, we concentrate on three pivotal properties that are vital for molecule discovery: LogP (Octanol-water partition coefficient, a metric of lipophilicity), MR (molecular refractivity, a proxy for the molar refractive index), and QED (Quantitative Estimate of Druglikeness, an assessment of drug-like characteristics). These metrics offer critical information about the potential pharmacological attributes of the molecule, which could help chemists filter molecules as viable drug candidates.

MolCustom is the most challenging task, where we have established three subtasks: AtomNum, BondNum, and FunctionalGroup. For AtomNum, LLMs are tasked with generating molecules that adhere to a specified count and type of atoms. BondNum involves the creation of molecules with a defined number and type of bonds. In FunctionalGroup, LLMs must generate a molecule that includes functional groups as specified. These subtasks may appear straightforward, yet they are deceptively challenging. They demand that LLMs have a sophisticated understanding of molecular syntax to precisely generate molecules that meet the complex criteria set forth.
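To make the AtomNum requirement concrete, the following is a minimal sketch of how a generated SMILES string could be checked against a requested atom count. It is not the benchmark's evaluation code, which relies on RDKit. The function names `count_heavy_atoms` and `meets_atom_request` are hypothetical, the parser is a simplified regular-expression tokenizer rather than a full SMILES parser, and the rule that every requested element must match exactly is an assumption.

```python
import re
from collections import Counter

# A token is either a bracket atom such as [NH4+] or a bare organic-subset atom.
_ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope digits, then the element symbol.
_BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|se|as|[bcnops])")


def count_heavy_atoms(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a SMILES string (simplified)."""
    counts = Counter()
    for token in _ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = _BRACKET_ELEMENT.match(token)
            if match is None:
                continue  # unrecognized bracket atom, skipped in this sketch
            symbol = match.group(1)
        else:
            symbol = token
        # Aromatic "c" and aliphatic "C" are both counted as carbon.
        symbol = symbol.capitalize()
        if symbol != "H":
            counts[symbol] += 1
    return counts


def meets_atom_request(smiles: str, required: dict) -> bool:
    """Assumed rule: each requested element must appear exactly as often as asked."""
    counts = count_heavy_atoms(smiles)
    return all(counts.get(element, 0) == n for element, n in required.items())


if __name__ == "__main__":
    print(dict(count_heavy_atoms("c1ccccc1Cl")))
    print(meets_atom_request("CCO", {"C": 2, "O": 1}))
```

The sketch does not test chemical validity (valence, ring closure), which a toolbox such as RDKit would handle before any counting is done.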

3.2 Dataset Construction

Previously, molecule-related datasets and tasks have been hindered by the scarcity of human annotations. For instance, ChEBI-20 Edwards et al. (2022), a dataset for the molecule-caption translation task, contains only 33,000 samples for training, whereas image captioning datasets like MS COCO Chen et al. (2015) have over 1,500,000 annotated captions on more than 330,000 images. Meanwhile, the annotation of molecules demands expertise from chemists and can sometimes require wet lab experiments, which are both expensive and time-consuming. In contrast, TOMG-Bench, as an open-domain generation task, does not rely on human annotations for construction. Instead, we focus on basic molecule structural properties and basic molecule operations that could be validated by chemical toolboxes to construct our dataset.

For MolCustom, we randomly generate 5,000 prompts as requests for each subtask, requiring different numbers and collections of atoms, bonds, and functional groups. For MolEdit and MolOpt, we sample molecules from a molecule database. Specifically, we consider two molecule databases for this work: Zinc-250K Sterling and Irwin (2015) and PubChem Kim et al. (2019). Zinc-250K contains 250,000 molecules, far fewer than PubChem, which has 10 million molecules. To facilitate the fast calculation of the metrics described in Section 3.3, we choose Zinc-250K for sampling the test molecules in TOMG-Bench. Each subtask is allocated 5,000 test samples. After sampling, we utilize RDKit Landrum (2013), a molecular informatics toolbox, to collect basic molecule statistics. RDKit provides functions to calculate the required characteristics, especially the structural patterns and chemical properties like LogP, MR, and QED values, which can then be integrated into our pre-defined task prompts. Further details are provided in Appendix A.

3.3 Metric Design

The evaluation of the TOMG-Bench is facilitated through a set of carefully designed metrics tailored to the specific tasks within the benchmark.

For the MolCustom task, which includes subtasks such as AtomNum, BondNum, and FunctionalGroup, the following metrics are employed:

For the MolEdit and MolOpt tasks, the following metrics are adopted:

Notably, the calculation of novelty and similarity metrics only considers valid molecules. Specifically, for the novelty metric, we assess the similarity between the generated molecule and those within the Zinc-250K database.

To comprehensively evaluate the average performance of LLMs on TOMG-Bench, we introduce a weighted average accuracy to rank the LLMs. The novelty scores for MolCustom and the similarity scores for MolEdit and MolOpt are also crucial metrics for evaluating performance; the similarity scores in particular help identify correct molecule editing operations. We therefore adopt the novelty and similarity scores as the weights, and the weighted average accuracy is computed as follows:

$$
\overline{wAcc} = \frac{1}{9}\left(\sum_{t \in \mathrm{MolCustom}} (n_t \cdot Acc_t) + \sum_{t \in \{\mathrm{MolEdit},\, \mathrm{MolOpt}\}} (s_t \cdot Acc_t)\right), \tag{1}
$$

where $n_t$ represents the novelty score for the MolCustom tasks and $s_t$ is the similarity score for the MolEdit and MolOpt tasks. $Acc_t$ is the accuracy for each subtask $t$. This weighted average accuracy, $\overline{wAcc}$, provides a balanced measure of performance that considers both the accuracy and quality of the generation.
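As an illustration, Equation (1) can be sketched in a few lines of Python. The function name `weighted_average_accuracy` and the input layout (one `(accuracy, weight)` pair per subtask, both as fractions in [0, 1]) are assumptions made for this example. The subtask names follow Section 3.1, and the numbers in the usage example are invented, not results from the paper.

```python
from typing import Dict, Tuple

# Subtask names as listed in Section 3.1.
MOLCUSTOM = ("AtomNum", "BondNum", "FunctionalGroup")  # weighted by novelty n_t
MOLEDIT = ("AddComponent", "DelComponent", "SubComponent")  # weighted by similarity s_t
MOLOPT = ("LogP", "MR", "QED")  # weighted by similarity s_t
ALL_SUBTASKS = MOLCUSTOM + MOLEDIT + MOLOPT


def weighted_average_accuracy(scores: Dict[str, Tuple[float, float]]) -> float:
    """Average of weight * accuracy over the nine subtasks.

    scores maps each subtask name to (accuracy, weight), where the weight is
    the novelty score for MolCustom subtasks and the similarity score for
    MolEdit and MolOpt subtasks.
    """
    missing = [t for t in ALL_SUBTASKS if t not in scores]
    if missing:
        raise ValueError(f"missing subtasks: {missing}")
    total = sum(scores[t][0] * scores[t][1] for t in ALL_SUBTASKS)
    return total / len(ALL_SUBTASKS)


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    example = {t: (0.5, 0.5) for t in ALL_SUBTASKS}
    print(weighted_average_accuracy(example))
```

Because each accuracy is multiplied by a weight no larger than 1, the weighted average can never exceed the plain average accuracy; a model that answers correctly but with low novelty or low similarity is penalized.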

3.4 OpenMolIns: Instruction Tuning Dataset

In this section, we introduce OpenMolIns, a specialized dataset derived from the PubChem database to help LLMs get familiar with text-based open molecule generation via instruction tuning. This instruction tuning dataset is meticulously designed to ensure that the molecules it contains do not overlap with those in the Zinc-250K dataset to promote the generation of more novel molecular structures and avoid any potential data leakage that could compromise the integrity of the model performance.
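The non-overlap requirement can be pictured as a set difference between the training candidates and the held-out test pool. The sketch below is only an illustration of that idea: `filter_non_overlapping` and its `canonicalize` parameter are hypothetical names, and the default canonicalizer merely strips whitespace. Since one molecule can be written as many different SMILES strings, a real pipeline would need to pass a proper canonicalization function from a cheminformatics toolkit.

```python
from typing import Callable, Iterable, List


def filter_non_overlapping(
    candidates: Iterable[str],
    held_out: Iterable[str],
    canonicalize: Callable[[str], str] = str.strip,
) -> List[str]:
    """Keep candidate molecules whose canonical form is absent from held_out.

    Duplicates among the candidates are also dropped, keeping the first one.
    """
    held_out_keys = {canonicalize(s) for s in held_out}
    seen = set()
    kept = []
    for smiles in candidates:
        key = canonicalize(smiles)
        if key in held_out_keys or key in seen:
            continue
        seen.add(key)
        kept.append(smiles)
    return kept


if __name__ == "__main__":
    train_pool = ["CCO", "CCN", "CCN", "c1ccccc1"]  # stand-in for PubChem molecules
    test_pool = ["CCO"]  # stand-in for Zinc-250K molecules
    print(filter_non_overlapping(train_pool, test_pool))
```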

We collect the instruction tuning dataset by introducing the samples of the nine subtasks in equal amounts. We still apply the RDKit toolbox to construct the instruction tuning dataset.

In the MolCustom domain, constructing the instruction tuning samples is rather straightforward. We calculate the molecular statistics and encapsulate them within prompts. For example, for AtomNum, we count all the atoms we would consider in the molecule, including their types and numbers. Then, we wrap these statistics with a random pre-defined prompt to construct the training sample. This method allows the generated molecules to better fit the distribution of molecular space.
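A minimal sketch of the wrapping step for AtomNum follows. The templates, the field names `instruction` and `output`, and the helper names are all hypothetical; the actual prompt templates are those of the benchmark, not the ones shown here. The atom counts are passed in directly, whereas the paper computes them with RDKit.

```python
import random
from typing import Dict

# Hypothetical templates, for illustration only.
ATOMNUM_TEMPLATES = (
    "Please generate a molecule with {spec}.",
    "Design a molecule that contains {spec}.",
)


def describe_atoms(atom_counts: Dict[str, int]) -> str:
    """Turn {'carbon': 2, 'oxygen': 1} into '2 carbon atoms and 1 oxygen atom'."""
    parts = [
        f"{n} {name} atom" + ("" if n == 1 else "s")
        for name, n in atom_counts.items()
    ]
    if len(parts) <= 1:
        return "".join(parts)
    if len(parts) == 2:
        return " and ".join(parts)
    return ", ".join(parts[:-1]) + ", and " + parts[-1]


def build_atomnum_sample(
    smiles: str, atom_counts: Dict[str, int], rng: random.Random
) -> Dict[str, str]:
    """Wrap the statistics of a known molecule with a randomly chosen template."""
    template = rng.choice(ATOMNUM_TEMPLATES)
    return {
        "instruction": template.format(spec=describe_atoms(atom_counts)),
        "output": smiles,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    print(build_atomnum_sample("CCO", {"carbon": 2, "oxygen": 1}, rng))
```

Because the target molecule comes from a real database and the prompt is derived from it, every training sample is guaranteed to have at least one valid answer.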

For the MolEdit and MolOpt domains, we randomly select a functional group from the original molecule for addition or removal, utilizing the RDKit toolbox to execute these operations. Then, similarly, we wrap the original molecule and the edited molecule with a random pre-defined prompt to construct the training sample for MolEdit. In the case of MolOpt, we also evaluate the direction of the desired property changes by the functions in RDKit to determine whether the operation improves or decreases the property value, which is then altogether wrapped by the pre-defined prompts.
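The direction labelling for MolOpt can be sketched as a comparison of a property value before and after the edit. In the sketch below the property values are plain floats supplied by the caller (the paper obtains them from RDKit), and the function names, the tolerance, and the label strings are assumptions made for illustration.

```python
def change_direction(before: float, after: float, tol: float = 1e-9) -> str:
    """Label how a property value moved after an edit."""
    delta = after - before
    if delta > tol:
        return "increase"
    if delta < -tol:
        return "decrease"
    return "unchanged"


def optimization_succeeded(before: float, after: float, target: str) -> bool:
    """True if the property moved in the requested direction."""
    if target not in ("increase", "decrease"):
        raise ValueError(f"unknown target direction: {target}")
    return change_direction(before, after) == target


if __name__ == "__main__":
    # Invented property values, for illustration only.
    print(change_direction(1.20, 1.85))
    print(optimization_succeeded(1.20, 1.85, "decrease"))
```

The same comparison serves two purposes: labelling a training sample with the direction its edit produced, and judging at test time whether a generated molecule moved the property as requested.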

To investigate the impact of data scales on the performance of LLMs, we have established five distinct data levels tailored for different training purposes: light, small, medium, large, and xlarge, as illustrated in Section 3.5. Each level represents a different quantity of data, shown in Table 1, allowing us to analyze how the amount of training data influences the model’s ability to learn and generate or edit molecules effectively.

Table 1: Statistics of TOMG-Bench and OpenMolIns.

3.5 Statistics

In this section, we outline the basic statistics of TOMG-Bench along with the OpenMolIns dataset. Table 1 shows the details of the data size in TOMG-Bench as well as the OpenMolIns dataset. For TOMG-Bench, we have three main tasks with nine subtasks in total, where each subtask contains 5,000 test samples. More details can be found in Appendix A.

For OpenMolIns, we have five distinct data scales: light, small, medium, large, and xlarge, ranging from 4,500 to 1,200,000 examples, which helps us investigate the data scaling law of applying LLMs to the Text-based Open Molecule Generation task. Notably, the nine subtasks within the TOMG-Bench are uniformly distributed in the OpenMolIns dataset.

4 Experiments

In this section, we present the experiment setup and results. Then, we illustrate our findings based on the observations.

4.1 Setup

4.1.1 Models

The models benchmarked are categorized into four groups: proprietary models, open-source general LLMs, open-source ChEBI-20 fine-tuned LLMs, and OpenMolIns fine-tuned LLMs.

Proprietary Models. This category includes LLMs that are only accessible via commercial API services. In this work, we benchmark GPT-4o, GPT-4-turbo, GPT-3.5-turbo Achiam et al. (2023), Claude-3.5 Anthropic (2024b), Claude-3 Anthropic (2024a), and Gemini-1.5-pro Deepmind (2024).

Open-source General LLMs. This group contains open-source LLMs that are tuned with the instruction following capability, which can be used for a wide range of tasks and applications. Specifically, we benchmark Llama-3-70B-Instruct, Llama-3-8B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct Dubey et al. (2024), Mistral-7B-Instruct-v0.2 Jiang et al. (2023), Qwen2-7B-Instruct Yang et al. (2024), yi-1.5-9B Young et al. (2024), and chatglm-9B GLM et al. (2024).

Open-source ChEBI-20 Fine-tuned LLMs. LLMs fine-tuned on the ChEBI-20 dataset can grasp some extent of text-based molecule generation capability. In this case, our experiments also cover LLMs like MolT5-small, MolT5-base, MolT5-large Edwards et al. (2022), and BioT5-base Pei et al. (2023).

OpenMolIns Fine-tuned LLMs. We further fine-tune LLMs like Galactica-125M Taylor et al. (2022), Llama-3.2-1B-Instruct, and Llama-3.1-8B-Instruct on the OpenMolIns dataset via instruction tuning. For Galactica-125M, we specifically include experiments on all five data sizes of OpenMolIns.

4.1.2 Implementation Details

We implement various scripts to facilitate the testing of the aforementioned models. For proprietary models, we adopt the OpenAI API framework (https://platform.openai.com/docs/). For open-source general LLMs, we utilize both the vLLM framework (https://docs.vllm.ai) and the OpenAI framework. For the remaining LLMs, we adopt the Hugging Face transformers library (https://huggingface.co/docs/transformers/) for inference. Detailed hyper-parameters are listed in Appendix B.

Furthermore, it is important to note that BioT5 is designed to use SELFIES as input instead of SMILES. Consequently, we convert the molecule SMILES strings into SELFIES format when evaluating BioT5.

4.2 Results

Figure 3: The performance of models benchmarked in TOMG-Bench. In TOMG-Bench, LLMs are divided into 4 categories: Proprietary Models, Open-source General LLMs, Open-source ChEBI-20 Fine-tuned LLMs, and OpenMolIns Fine-tuned LLMs. Models whose parameters are known are plotted as dots, while models of unknown parameters are denoted as horizontal lines.

Figure 3 presents the performances and model sizes of different models benchmarked on TOMG-Bench, as well as the instruction-tuning performance of Galactica-125M on the five distinct data levels of OpenMolIns. More precise details are further illustrated in Appendix C.

4.3 Findings

Based on the above results, we observe the following key findings:

Text-based open molecule generation is challenging for LLMs. As illustrated in Figure 3, we have calculated the weighted average accuracy across all the nine subtasks. Among the LLMs benchmarked, Claude-3.5 stands out as the top performer, achieving a weighted average accuracy of 35.92%. Gemini-1.5-pro follows closely with a weighted average accuracy of 34.80%. These results underscore the considerable scope for improvement, even among the most advanced proprietary LLMs.

It is also worth noting that while the most advanced LLMs like Claude-3.5 and GPT-4o exhibit relatively strong performance in the MolEdit and MolOpt tasks, the MolCustom task remains difficult. In MolCustom, no LLM has managed to achieve an accuracy exceeding 25% on any single subtask. This observation indicates that the generation of molecules from scratch demands a deep understanding of the molecular structural space, an area where current models still fall short.

Most powerful open-source general LLMs can already outperform GPT-3.5-turbo. In the TOMG-Bench, Llama-3-70B-Instruct has achieved an impressive weighted average accuracy of 23.93%, notably outperforming GPT-3.5-turbo, which scored 18.58%. Despite previously lagging behind proprietary models, open-source general LLMs have rapidly bridged the gap. The evolution of the Llama series, in particular, has been remarkable, finally surpassing GPT-3.5-turbo and demonstrating the fast development of open-source general LLMs.

More powerful LLMs achieve better performance in TOMG-Bench. Across all the LLMs we benchmarked, a clear trend emerged: the more powerful the LLM, the higher the performance it achieves on TOMG-Bench. For instance, the GPT series has consistently demonstrated improved performance with each new iteration from GPT-3.5-turbo to GPT-4o.

Similarly, within the Llama-3 series, we could also observe that larger models tend to achieve superior results on the TOMG-Bench. These findings underscore a strong correlation between an LLM’s capabilities and its performance in our benchmark.

However, we encountered an unexpected anomaly with certain open-source LLMs. Notably, Qwen2-7B-Instruct, despite its impressive ability to solve mathematical problems and its size of 7 billion parameters, underperformed models with as few as 1 billion parameters. This result is particularly striking and suggests that the TOMG-Bench offers a unique and comprehensive evaluation that current races for LLMs may not have adequately addressed. This discovery also highlights the significance of the TOMG-Bench as a new benchmark for LLMs. It provides a broader and more diverse assessment that exposes potential blind spots in the development of LLMs.

ChEBI-20 dataset is insufficient for LLMs to master molecular structures and editing operations. The ChEBI-20 dataset and the associated molecule-caption translation task are designed to bridge the gap between molecular structures and textual descriptions. Despite this intention, LLMs trained on ChEBI-20 have demonstrated limited effectiveness in our TOMG-Bench benchmark. For instance, BioT5-base, which is claimed as the state-of-the-art (SOTA) model for text-based molecule generation on the ChEBI-20 dataset, only achieves a weighted average accuracy of 4.21% on the TOMG-Bench. In the MolEdit and MolOpt tasks, these models are unable to execute correct operations on provided molecules, resulting in disappointing similarity scores. Similarly, in the MolCustom task, which closely mirrors the text-based molecule generation task, the performance remains unsatisfactory, with no model achieving a score above 5% in a single subtask. This performance shortfall highlights a critical limitation of the ChEBI-20 dataset, as it lacks the data quantity and diversity necessary to effectively align molecules with textual descriptions.

In contrast, TOMG-Bench offers a more comprehensive and intricate evaluation framework for text-to-molecule generation. With a larger and more varied set of test examples, TOMG-Bench could robustly assess the capabilities of language models in translating textual descriptions into molecular structures. As such, it represents a significant advancement in the evaluation of text-based molecule generation.

OpenMolIns can enable LLMs to achieve better performance than the most powerful open-source general LLMs. We have also developed OpenMolIns, an instruction-tuning dataset, to enhance LLMs’ proficiency in the tasks outlined in TOMG-Bench. Across five distinct data scales, we observed a pronounced data scaling law: as the size of the corpus increases, the performance of LLMs also improves. In particular, we assessed Galactica-125M on all five data scales. As shown in Figure 3, the outcomes were remarkable: with only 125 million parameters, Galactica-125M achieved a weighted average accuracy of 25.73% on OpenMolIns-xlarge, surpassing both Llama-3-70B-Instruct and GPT-3.5-turbo. The steady improvement of Galactica-125M across data scales indicates that LLMs are hungry for more molecule corpora to achieve better performance. Notably, OpenMolIns-large has also enabled Llama-3.1-8B-Instruct to outperform all the existing open-source general LLMs in TOMG-Bench, showing the effectiveness of the dataset.

5 Conclusion

In this study, we introduce TOMG-Bench, the first benchmark designed to assess the capabilities of Large Language Models (LLMs) in the realm of open-domain molecule generation. Benchmarking 25 LLMs, TOMG-Bench highlights the limitations of existing targeted molecule generation tasks and demonstrates the potential of general LLMs in this domain. Additionally, through instruction tuning on our proposed OpenMolIns, LLMs exhibit significant potential on the TOMG-Bench, matching the performance of GPT-3.5-turbo. Our contributions not only lie in the development of a novel benchmark for molecular discovery but also provide a diverse indication of the capabilities of LLMs.

6 Limitations

Although TOMG-Bench is carefully designed and well-validated through our experiments, we still observe several limitations:

Prompt Diversity. Prompt diversity helps mitigate over-fitting to the instructions. While we adopt several different prompt templates and randomly choose from them, we still find that the number of prompt templates is not enough to provide sufficient prompt diversity.

Data Distribution. In our data construction process, we allocate distributions to atoms, bonds, and functional groups with the aim of making our benchmark more reflective of real-world distributions. Nevertheless, the distribution we use is largely empirical and may not be accurate enough to reproduce real-world scenarios. This could potentially mask the true performance capabilities of LLMs in these specific tasks.

7 Acknowledgments

We thank all the reviewers for their insightful comments. We also thank GitHub Copilot for coding assistance and ChatGLM-4 for polishing the writing.

References

Appendix A Data Construction

In this section, we introduce the construction details of the TOMG-Bench and OpenMolIns datasets, as well as the prompt templates.

A.1 MolEdit

For the molecule editing (MolEdit) task, we consider the common operations for modifying functional groups in a given molecule (i.e., add, drop, and substitute), which are simple for human experts but challenging for LLMs. We develop three corresponding subtasks: AddComponent, DelComponent, and SubComponent. Prompt templates for MolEdit are shown in Table 2. However, functional groups differ in kind, and some play an important structural role, such as connecting two separate parts of the molecule. Such groups are unsuitable for the operations above, because editing them would entirely change the structure of the molecule. Since we aim to make only a slight change to the molecule structure, we restrict most of the functional groups we choose to end groups.

Table 2: Prompt Templates for MolEdit

Table 3 presents the functional groups that are taken into account for AddComponent and DelComponent, along with their respective selection weights. To reflect the distribution of these functional groups in real-world scenarios, we use weighted random selection for AddComponent. Less common functional groups are assigned a lower probability of being chosen, so that the selection better mirrors practical occurrences.
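The weighted random selection described above can be sketched as follows. This is only a minimal illustration: the group names and weights in `GROUP_WEIGHTS` are hypothetical placeholders rather than the values in Table 3, and `sample_group` is a name introduced here, not a function from the released code.

```python
import random

# Hypothetical weights for illustration only; the actual values are in Table 3.
GROUP_WEIGHTS = {
    "hydroxyl": 15,
    "carboxyl": 10,
    "nitro": 5,
    "thiol": 2,
}


def sample_group(rng: random.Random) -> str:
    """Draw one functional group with probability proportional to its weight."""
    groups = list(GROUP_WEIGHTS)
    weights = [GROUP_WEIGHTS[g] for g in groups]
    return rng.choices(groups, weights=weights, k=1)[0]


# Draw many samples to see that rarer groups are selected less often.
rng = random.Random(0)
draws = [sample_group(rng) for _ in range(10_000)]
counts = {g: draws.count(g) for g in GROUP_WEIGHTS}
```

With these placeholder weights, a group with weight 15 is expected to be drawn roughly 7.5 times as often as a group with weight 2.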

Table 3: Functional Groups that are considered in AddComponent and DelComponent, as well as their weights to be selected in AddComponent.

For SubComponent, we focus exclusively on end groups: hydroxyl, aldehyde, carboxyl, nitro, halo, nitrile, and thiol. Each editing operation is therefore confined to substituting an existing end group with another from this list, which preserves the molecule’s overall structure.
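As a rough sketch of this constraint, the snippet below picks a replacement from the end-group list given in the text while excluding the group being replaced. The uniform choice among candidates and the helper name `pick_replacement` are assumptions made for illustration; the text does not state how the replacement is sampled.

```python
import random

# End groups listed in the text for SubComponent.
END_GROUPS = ["hydroxyl", "aldehyde", "carboxyl", "nitro", "halo", "nitrile", "thiol"]


def pick_replacement(current: str, rng: random.Random) -> str:
    """Choose a replacement end group that differs from the current one."""
    if current not in END_GROUPS:
        raise ValueError(f"{current!r} is not a supported end group")
    # Assumption: every other end group is equally likely to be chosen.
    candidates = [g for g in END_GROUPS if g != current]
    return rng.choice(candidates)
```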

A.2 MolOpt

Molecule optimization (MolOpt), which aims to improve molecular properties by refining molecule structures, is not a brand-new task. GNN-based methods have been widely adopted for it, but each such method can only handle one specific subtask at a time. In contrast, TOMG-Bench requires a single LLM to optimize molecules under different metrics and directions. In this work, we focus on properties that are crucial for drug discovery and chemical synthesis: LogP, MR, and QED. The prompt templates for MolOpt are shown in Table 4.

Table 4: Prompt Templates for MolOpt

LogP refers to the logarithm of the partition coefficient, which is a measure of a molecule’s hydrophilicity or lipophilicity. It is an important factor in determining a compound’s bioavailability and membrane permeability.

Molecular Refractivity (MR) is a measure of the molar refractive index, which provides insight into the molecular size and the degree of molecular branching. It is used to assess the overall shape and bulk of a molecule.

Quantitative Estimation of Drug-Likeness (QED) is a computational metric that evaluates the drug-likeness of a molecule based on a set of predefined rules. A higher QED score suggests a greater likelihood that the molecule will have favourable pharmacological properties.

A.3 MolCustom

To enable the customized design of molecules, we consider three fundamental features for describing a molecule: atoms, bonds, and functional groups. Given the specified types and numbers of atoms, bonds, and functional groups, LLMs should generate a molecule that meets the request. The prompt templates for MolCustom are shown in Table 5. Below, we present the construction details of the three subtasks of MolCustom:

Table 5: Prompt Templates for MolCustom

Table 6: Atoms that are considered in AtomNum, as well as their weights to be selected.

AtomNum. Table 6 shows the atoms we consider in AtomNum, as well as their selection weights. Notably, carbon, as the basic unit of organic chemicals, is mandatory. The number of carbon atoms ranges from 1 to 40, while the number of each other selected atom ranges from 1 to 5. This setting eases generation, as LLMs can generate a carbon backbone first and then attach the remaining atoms to the backbone one by one.
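A minimal sketch of how such an AtomNum requirement could be sampled is given below. The ranges (1 to 40 for carbon, 1 to 5 for other atoms) come from the text; the elements and weights in `HETERO_WEIGHTS`, the cap `max_hetero_types`, and the function name `sample_atom_spec` are hypothetical and do not reproduce Table 6.

```python
import random

# Hypothetical heteroatom weights for illustration; the real ones are in Table 6.
HETERO_WEIGHTS = {"O": 5, "N": 5, "S": 3, "F": 2, "Cl": 2}


def sample_atom_spec(rng: random.Random, max_hetero_types: int = 3) -> dict:
    """Sample an atom-count requirement with carbon always present."""
    spec = {"C": rng.randint(1, 40)}  # carbon is mandatory: 1 to 40 atoms
    n_types = rng.randint(0, max_hetero_types)
    elements = list(HETERO_WEIGHTS)
    weights = [HETERO_WEIGHTS[e] for e in elements]
    # Keep drawing weighted elements until n_types distinct ones are selected.
    while len(spec) - 1 < n_types:
        element = rng.choices(elements, weights=weights, k=1)[0]
        if element not in spec:
            spec[element] = rng.randint(1, 5)  # other atoms: 1 to 5 each
    return spec
```

Each call returns a dictionary that maps element symbols to required counts, which could then be rendered into one of the prompt templates.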

Table 7: Chemical bonds that are considered in BondNum, as well as their weights to be selected.

BondNum. Similarly, we select five different kinds of chemical bonds: single, double, triple, rotatable, and aromatic, as shown in Table 7. For the single bond, if selected, the number can vary from 1 to 50. For the aromatic bond, the number follows the rules of aromatic bond formation and varies from 5 to 20. For each of the remaining bond types, the number, if selected, ranges from 1 to 5.

Table 8: Functional Groups that are considered in FunctionalGroup, as well as their weights to be selected.

FunctionalGroup. Lastly, we also specify functional groups in the molecule structure. Table 8 shows the range of functional groups and their weights that are taken into consideration.

Notably, in MolCustom, LLMs can generate any number of the atoms, bonds, and functional groups that are not specified in the prompt. For the specified items, however, LLMs must strictly follow the requirements.
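This rule can be illustrated with a small check, sketched under the assumption that the counts of atoms, bonds, or functional groups in a generated molecule have already been extracted (for instance with a cheminformatics toolkit, which is outside this sketch). The function name `satisfies_spec` is introduced here for illustration and is not part of the benchmark code.

```python
def satisfies_spec(observed: dict, spec: dict) -> bool:
    """Return True when every specified item occurs exactly as often as requested.

    Items absent from `spec` are unconstrained, so extra entries in
    `observed` do not affect the result.
    """
    return all(observed.get(item, 0) == count for item, count in spec.items())


spec = {"C": 6, "O": 1}
ok = satisfies_spec({"C": 6, "O": 1, "N": 2}, spec)   # nitrogen is unspecified
too_many = satisfies_spec({"C": 6, "O": 2}, spec)     # oxygen count differs
missing = satisfies_spec({"C": 6}, spec)              # oxygen is absent
```

Here only the first molecule meets the requirement, since the unspecified nitrogen atoms are allowed while any deviation in a specified count is not.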

Appendix B Hyper Parameters

In this section, we illustrate the detailed parameters adopted in this work, as shown in Table 9.

Table 9: Hyper-parameters

Appendix C Detailed Results

In this section, we first show the leaderboard of TOMG-Bench in Table 10, where Claude-3.5 achieves first place with a weighted average accuracy of 35.92%. Notably, via instruction tuning on our OpenMolIns dataset, Llama-3.1-8B achieves 6th place, outperforming all the existing open-source LLMs and ranking just behind Claude-3.

Table 10: Leaderboard of TOMG-Bench.

Then, we present the detailed results of all the subtasks in Tables 11, 12, and 13.

Table 11: Results on MolEdit. For each task, we highlight the best accuracy and underline the second best accuracy.

Table 12: Results on MolOpt. For each task, we highlight the best accuracy and underline the second best accuracy.

Table 13: Results on MolCustom. For each task, we highlight the best accuracy and underline the second best accuracy.