🌐 Website | Code | 🤗 Data and Model | 🖥️ Demo | 📃 Paper
```shell
git clone git@github.com:blender-nlp/mCLM.git
cd mCLM
mamba env create -f environment.yml
mamba activate mCLM
pip install -e ./
```

To chat with the model:

```shell
python mCLM/scripts/chat_HF.py
python mCLM/scripts/chat_HF.py --synth_only  # constrains output to synthesis-guaranteed blocks
```

Example:

> What is the BBBP of <SMILES> Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C </SMILES>?
> The molecule is predicted to have a Blood Brain Barrier permeability (BBB) of 0.000000000000
> Does [MOL] `[3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C` [/MOL] exhibit strawberry smell?

> Yes, this molecule has a strawberry smell. [...]
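The `[MOL] … [/MOL]` strings above are written in the model's block notation: building blocks are joined with `^`, and tags such as `[1*]`, `[2*]`, `[3*]` mark attachment points. As a minimal illustration (this is NOT the repo's tokenizer, which handles joining and attachment chemistry), a block string can be split into its fragments like so:

```python
# Minimal sketch of the block notation only; the real conversion is done
# by mCLM_tokenizer (get_blocks / join_fragments), not a plain split.
block_string = "[3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C"

# '^' separates the individual building blocks
fragments = block_string.split("^")
print(len(fragments))  # 3 building blocks
print(fragments[0])    # [3*]C(=O)CCCC
```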
To generate molecules:

```shell
python mCLM/scripts/chat_HF_molgen.py
python mCLM/scripts/chat_HF_molgen.py --synth_only  # constrains output to synthesis-guaranteed blocks
```

> Generate a molecule that has higher blood brain barrier permeability than [MOL] `[3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C` [/MOL].
| # | SMILES | Blocks |
|---|--------|--------|
| 1 | `CCCCCC(=O)OCCCOC(=O)c1ccccc1` | `[3*]C(=O)CCCCC^[2*]OCCCO[1*]^[3*]C(=O)c1ccccc1` |
| 2 | `CCCCCOOCC=C(C)C(=O)CCC` | `[3*]C(=O)CCC^[2*]OC/C=C(/[1*])C^[3*]OCCCCC` |
| 3 | `CCCCC(=O)C(C)=CCOCCC=C(C)C` | `[3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C` |
| 4 | `CCC(=O)C(C)=CCONc1ccc(S(N)(=O)=O)cc1` | `[3*]C(=O)CC^[2*]OC/C=C(/[1*])C^[3*]Nc1ccc(S(N)(=O)=O)cc1` |
| 5 | `CCC(=O)N(CC)CC` | `[3*]N(CC)CC^[3*]C(=O)CC` |
| 6 | `CN(C)c1ccccc1` | `[3*]N(C)C^[3*]c1ccccc1` |
| 7 | `CCN(CC)C(C)=O` | `[3*]N(CC)CC^[3*]C(C)=O` |
| 8 | `C=CC(=O)OCCCCOC(=O)CCCC` | `[3*]C(=O)CCCC^[2*]OCCCCO[1*]^[3*]C(=O)C=C` |
| 9 | `CCCCCC(=O)OCCCOC(=O)c1ccco1` | `[3*]C(=O)CCCCC^[2*]OCCCO[1*]^[3*]C(=O)c1ccco1` |
| 10 | `CCCCCCCC(=O)OCCCOC(=O)CCCCCCC` | `[3*]C(=O)CCCCCCC^[2*]OCCCO[1*]^[3*]C(=O)CCCCCCC` |
> Generate a molecule that has higher blood brain barrier permeability than [MOL] `[3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C` [/MOL].

(with the synthesis-guaranteed `--synth_only` flag)
| # | SMILES | Blocks |
|---|--------|--------|
| 1 | `CCCCC(=O)N1CC=C(c2ccccc2)CC1` | `[3*]C(=O)CCCC^[1*]N1CC=C([2*])CC1^[3*]c1ccccc1` |
| 2 | `CCC(=O)N1CC=C(c2ccccc2)CC1` | `[3*]C(=O)CC^[1*]N1CC=C([2*])CC1^[3*]c1ccccc1` |
| 3 | `CCC(=O)N(CC)CC` | `[3*]N(CC)CC^[3*]C(=O)CC` |
| 4 | `CN(C)c1ccccc1` | `[3*]N(C)C^[3*]c1ccccc1` |
| 5 | `CCCC(=O)NCC1CCCO1` | `[3*]C(=O)CCC^[3*]NCC1CCCO1` |
| 6 | `CCN(CC)C(C)=O` | `[3*]N(CC)CC^[3*]C(C)=O` |
| 7 | `CCCCCC(=O)N(C)CC(=O)c1ccc2c(c1)OCO2` | `[3*]C(=O)CCCCC^[1*]N(C)CC([2*])=O^[3*]c1ccc2c(c1)OCO2` |
| 8 | `CCCCCCCC(=O)C(=O)C1CCCO1` | `[3*]C(=O)CCCCCCC^[3*]C(=O)C1CCCO1` |
| 9 | `CCCCCC(=O)C(=O)C12CC3CC(CC(C3)C1)C2` | `[3*]C(=O)CCCCC^[3*]C(=O)C12CC3CC(CC(C3)C1)C2` |
| 10 | `CCCCC(=O)N(CC)CC(=O)C(=O)c1ccccc1` | `[3*]C(=O)CCCC^[1*]N(CC)CC([2*])=O^[3*]C(=O)c1ccccc1` |
> Generate a molecule related to <SMILES> Cc1cc(C)nc(NS(=O)(=O)c2ccc(N)cc2)n1 </SMILES>.
| # | SMILES | Blocks |
|---|--------|--------|
| 1 | `CCN(CC)C(C)=O` | `[3*]N(CC)CC^[3*]C(C)=O` |
| 2 | `CN(C)C(=O)Nc1ccc(Cl)cc1` | `[3*]N(C)C^[3*]C(=O)Nc1ccc(Cl)cc1` |
| 3 | `O=CC=Cc1ccccc1` | `[3*]c1ccccc1^[3*]C=CC=O` |
| 4 | `O=C(C=Cc1ccccc1)NCCO` | `[3*]NCCO^[3*]C(=O)C=Cc1ccccc1` |
| 5 | `O=C(Nc1ccccc1)Nc1ccccc1` | `[3*]Nc1ccccc1^[3*]C(=O)Nc1ccccc1` |
| 6 | `Cc1cc(C)nc(NS(=O)(=O)c2ccc(N)cc2)n1` | `[3*]NS(=O)(=O)c1ccc(N)cc1^[3*]c1nc(C)cc(C)n1` |
| 7 | `CCCCCC(=O)OCCc1ccccc1` | `[3*]C(=O)CCCCC^[3*]OCCc1ccccc1` |
| 8 | `CCCCCCCCc1ccc(O)cc1` | `[3*]CCCCCCCC^[3*]c1ccc(O)cc1` |
| 9 | `CCCCCCCCCCCCCCCCC(=O)OCC@HOC(=O)CCCCCCCCCCCCCC` | `[3*]C(=O)CCCCCCCCCCCCCCCC^[1*]OCC@HO[2*]^[3*]C(=O)CCCCCCCCCCCCCC` |
| 10 | `CCCCCCCCCCCCCCCC(=O)OC@HCOP(=O)([O-])OCCN+(C)C` | `[3*]C(=O)CCCCCCCCCCCCCCC^[1*]OC@HCOP(=O)([O-])OCCN+(C)C^[3*]C(=O)CCCCCCCCCCCCC` |
- Note: Converting the HF model to the right dtype is very important for getting the right outputs: `model.to(torch.bfloat16)`.
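To illustrate what this cast does (with a toy `nn.Linear` standing in for the actual mCLM model), `Module.to(torch.bfloat16)` converts every floating-point parameter in place:

```python
import torch
import torch.nn as nn

# Toy stand-in for the HF model (NOT the mCLM checkpoint itself):
# .to(torch.bfloat16) casts all floating-point parameters in place.
model = nn.Linear(4, 4)
model.to(torch.bfloat16)

assert all(p.dtype == torch.bfloat16 for p in model.parameters())
```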
The pretraining datasets and model are on Huggingface.
Training code is available, but it is not yet configured to run with the data format publicly available on HF. Updated training code based on the Huggingface dataset is to be released; the data files for the current version have not been made available.
Adapter training:

```shell
PYTHONPATH=. srun python mCLM/scripts/main.py \
    --pretrained_text_model LLMs/Qwen2.5-3B/ \
    --pretrained_tokenizer LLMs/Qwen2.5-3B/ \
    --check_val_every_n_steps 10000 \
    --batch_size=4 --lr 2e-5 --mol_lr 2e-6 \
    --ckpt_path ckpts/Qwen2.5-3B_TotalTop1k_Adaptor_splitLoss/ \
    --version Qwen2.5-3B_Adaptor_splitLoss \
    --max_epochs 1 \
    --train_adapter \
    --accumulate_grad_batches 4 \
    --data_module TotalTopK \
    --task TotalTop1k \
    --num_warmup_steps 2000 \
    --save_checkpoint_every_n_steps 1000 \
    --instruction_data_path mCLM/data/instruction_onlyblocks_top_1k/ \
    --synthetic_data_path mCLM/data/synthetic_onlyblocks_top_1k/ \
    --downsample_tulu 0.1 \
    --tokenizer_cache GNN_input_cache/Total.molecule_tokenizer.1k.pth \
    --pretrained_embeddings final_embeddings/OnlyBlocks/Top1k/128_dim/
```

Pretraining:

```shell
PYTHONPATH=. python mCLM/scripts/main.py \
    --pretrained_text_model LLMs/Qwen2.5-3B/ \
    --pretrained_tokenizer LLMs/Qwen2.5-3B/ \
    --check_val_every_n_steps 10000 \
    --batch_size=4 --lr 2e-5 --mol_lr 2e-6 \
    --ckpt_path ckpts/Pretraining/ \
    --version Qwen2.5-3B_splitLoss \
    --max_epochs 5 \
    --accumulate_grad_batches 4 \
    --data_module TotalTopK --task TotalTop1k \
    --num_warmup_steps 2000 \
    --save_checkpoint_every_n_steps 1000 \
    --instruction_data_path data/instruction_onlyblocks_top_1k/ \
    --synthetic_data_path data/synthetic_onlyblocks_top_1k/ \
    --downsample_tulu 0.1 \
    --load_GNN_ckpt ckpts_GNN/OnlyBlocks/128_dim/best_val_checkpoint.ckpt \
    --tokenizer_cache GNN_input_cache/Total.molecule_tokenizer.1k.pth \
    --pretrained_embeddings final_embeddings/OnlyBlocks/Top1k/128_dim/
```

Finetuning:

```shell
CUDA_LAUNCH_BLOCKING=1 PYTHONPATH=. srun python mCLM/scripts/main.py \
    --pretrained_text_model LLMs/Qwen2.5-3B/ \
    --pretrained_tokenizer LLMs/Qwen2.5-3B/ \
    --check_val_every_n_steps 10000 \
    --batch_size=4 --lr 2e-5 --mol_lr 2e-6 \
    --ckpt_path ckpts/Output_Checkpoint/ \
    --version YOUR_NAME_HERE \
    --max_epochs 5 \
    --finetune \
    --accumulate_grad_batches 4 \
    --data_module FinetuneTopK --task FinetuneTop1k \
    --num_warmup_steps 5000 \
    --save_checkpoint_every_n_steps 1000 \
    --synthetic_data_path data/finetune_top_1k/ \
    --tokenizer_cache GNN_input_cache/Total.molecule_tokenizer.1k.pth \
    --load_ckpt ckpts/Pretraining/latest_checkpoint-epoch=04-step=129000.ckpt \
    --pretrained_embeddings final_embeddings/OnlyBlocks/Top1k/128_dim/
```

Arguments:

- `pretrained_text_model` - A local download of `Qwen/Qwen2.5-3B`
- `pretrained_tokenizer` - A local download of `Qwen/Qwen2.5-3B`
- `check_val_every_n_steps` - How often to run validation metrics
- `batch_size` - Batch size per GPU
- `lr` - LLM learning rate
- `mol_lr` - Molecule encoder learning rate
- `ckpt_path` - The output path for checkpoints
- `version` - A name for WandB
- `max_epochs` - Number of training epochs
- `accumulate_grad_batches` - Number of gradient batches to accumulate before backprop
- `data_module` - The data module to use
- `task` - The name of your task for WandB
- `num_warmup_steps` - Number of warmup steps for the learning rate
- `save_checkpoint_every_n_steps` - How often to save backup checkpoints
- `instruction_data_path` - Processed data from other projects
- `synthetic_data_path` - Either the mCLM activity cliff data (pretraining) or finetuning data
- `tokenizer_cache` - The cached inputs for the molecule tokenizer
- `load_ckpt` - The checkpoint that is being finetuned
- `pretrained_embeddings` - The location of the pretrained GNN embeddings
- `load_GNN_ckpt` - The pretrained GNN checkpoint
- `downsample_tulu` - A float to downsample Tulu3 in the pretraining data mixture
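As a sanity check on the flags above: with gradient accumulation, the effective batch size per optimizer step is `batch_size × accumulate_grad_batches × num_gpus` (the standard Lightning convention; the GPU count here is an assumption that depends on your `srun` allocation):

```python
# Effective batch size implied by the flags above.
batch_size = 4               # --batch_size (per GPU)
accumulate_grad_batches = 4  # --accumulate_grad_batches
num_gpus = 1                 # assumption: depends on your srun allocation

effective_batch = batch_size * accumulate_grad_batches * num_gpus
print(effective_batch)  # 16
```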
There are three tokenizers used by the model:

- `mCLM_tokenizer`: This tokenizer is packaged separately. It is used to run the molecule tokenizer (without synthesis guarantees) to convert a SMILES string into our block notation and vice versa. The relevant functions are `join_fragments` and `get_blocks`.
- `mCLM/tokenizer`: This contains `MoleculeTokenizer`, a class for converting molecule token IDs into torch-geometric data objects. It contains and manages a cache for these objects, allows data type conversion, and allows switching between base language models.
- `tokenizer`: This is an `AutoTokenizer` from HuggingFace.
You can use `mCLM_tokenizer` to get the blocks from a molecule SMILES. Then, use the `MoleculeTokenizer` and `AutoTokenizer` to interact with the model. `mCLM.from_pretrained` will already create those two for you! After you're done generating, you can use `mCLM_tokenizer` to convert back to SMILES. Please see `chat_HF.py` for an example.
If you found our work useful, please cite:
```bibtex
@misc{edwards2025mclmmodularchemicallanguage,
      title={mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules},
      author={Carl Edwards and Chi Han and Gawon Lee and Thao Nguyen and Sara Szymkuć and Chetan Kumar Prasad and Bowen Jin and Jiawei Han and Ying Diao and Ge Liu and Hao Peng and Bartosz A. Grzybowski and Martin D. Burke and Heng Ji},
      year={2025},
      eprint={2505.12565},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.12565},
}
```