mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

🌐 Website | Code | 🤗 Data and Model | 🖥️ Demo | 📃 Paper

Installation

git clone git@github.com:blender-nlp/mCLM.git mamba env create -f environment.yml mamba activate mCLM pip install -e ./

Chat

python mCLM/scripts/chat_HF.py python mCLM/scripts/chat_HF.py --synth_only #constrains output to synthesis-guaranteed blocks

What is the BBBP of <SMILES> Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C </SMILES>?

The molecule is predicted to have a Blood Brain Barrier permeability (BBB) of 0.000000000000

Does [MOL] [3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C [/MOL] exhibit strawberry smell?

Yes, this molecule has a strawberry smell. [...]

To only generate molecules:

python mCLM/scripts/chat_HF_molgen.py python mCLM/scripts/chat_HF_molgen.py --synth_only #constrains output to synthesis-guaranteed blocks

Generate a molecule that has higher blood brain barrier permeability than [MOL] [3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C [/MOL].

# SMILES Blocks

1 CCCCCC(=O)OCCCOC(=O)c1ccccc1 [3*]C(=O)CCCCC^[2*]OCCCO[1*]^[3*]C(=O)c1ccccc1

2 CCCCCOOCC=C(C)C(=O)CCC [3*]C(=O)CCC^[2*]OC/C=C(/[1*])C^[3*]OCCCCC

3 CCCCC(=O)C(C)=CCOCCC=C(C)C [3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C

4 CCC(=O)C(C)=CCONc1ccc(S(N)(=O)=O)cc1 [3*]C(=O)CC^[2*]OC/C=C(/[1*])C^[3*]Nc1ccc(S(N)(=O)=O)cc1

5 CCC(=O)N(CC)CC [3*]N(CC)CC^[3*]C(=O)CC

6 CN(C)c1ccccc1 [3*]N(C)C^[3*]c1ccccc1

7 CCN(CC)C(C)=O [3*]N(CC)CC^[3*]C(C)=O

8 C=CC(=O)OCCCCOC(=O)CCCC [3*]C(=O)CCCC^[2*]OCCCCO[1*]^[3*]C(=O)C=C

9 CCCCCC(=O)OCCCOC(=O)c1ccco1 [3*]C(=O)CCCCC^[2*]OCCCO[1*]^[3*]C(=O)c1ccco1

10 CCCCCCCC(=O)OCCCOC(=O)CCCCCCC [3*]C(=O)CCCCCCC^[2*]OCCCO[1*]^[3*]C(=O)CCCCCCC

Generate a molecule that has higher blood brain barrier permeability than [MOL] [3*]C(=O)CCCC^[2*]OC/C=C(/[1*])C^[3*]CCC=C(C)C [/MOL]. (synthesis guaranteed flag)

# SMILES Blocks

1 CCCCC(=O)N1CC=C(c2ccccc2)CC1 [3*]C(=O)CCCC^[1*]N1CC=C([2*])CC1^[3*]c1ccccc1

2 CCC(=O)N1CC=C(c2ccccc2)CC1 [3*]C(=O)CC^[1*]N1CC=C([2*])CC1^[3*]c1ccccc1

3 CCC(=O)N(CC)CC [3*]N(CC)CC^[3*]C(=O)CC

4 CN(C)c1ccccc1 [3*]N(C)C^[3*]c1ccccc1

5 CCCC(=O)NCC1CCCO1 [3*]C(=O)CCC^[3*]NCC1CCCO1

6 CCN(CC)C(C)=O [3*]N(CC)CC^[3*]C(C)=O

7 CCCCCC(=O)N(C)CC(=O)c1ccc2c(c1)OCO2 [3*]C(=O)CCCCC^[1*]N(C)CC([2*])=O^[3*]c1ccc2c(c1)OCO2

8 CCCCCCCC(=O)C(=O)C1CCCO1 [3*]C(=O)CCCCCCC^[3*]C(=O)C1CCCO1

9 CCCCCC(=O)C(=O)C12CC3CC(CC(C3)C1)C2 [3*]C(=O)CCCCC^[3*]C(=O)C12CC3CC(CC(C3)C1)C2

10 CCCCC(=O)N(CC)CC(=O)C(=O)c1ccccc1 [3*]C(=O)CCCC^[1*]N(CC)CC([2*])=O^[3*]C(=O)c1ccccc1

Generate a molecule related to <SMILES> Cc1cc(C)nc(NS(=O)(=O)c2ccc(N)cc2)n1 </SMILES>.

# SMILES Blocks

1 CCN(CC)C(C)=O [3*]N(CC)CC^[3*]C(C)=O

2 CN(C)C(=O)Nc1ccc(Cl)cc1 [3*]N(C)C^[3*]C(=O)Nc1ccc(Cl)cc1

3 O=CC=Cc1ccccc1 [3*]c1ccccc1^[3*]C=CC=O

4 O=C(C=Cc1ccccc1)NCCO [3*]NCCO^[3*]C(=O)C=Cc1ccccc1

5 O=C(Nc1ccccc1)Nc1ccccc1 [3*]Nc1ccccc1^[3*]C(=O)Nc1ccccc1

6 Cc1cc(C)nc(NS(=O)(=O)c2ccc(N)cc2)n1 [3*]NS(=O)(=O)c1ccc(N)cc1^[3*]c1nc(C)cc(C)n1

7 CCCCCC(=O)OCCc1ccccc1 [3*]C(=O)CCCCC^[3*]OCCc1ccccc1

8 CCCCCCCCc1ccc(O)cc1 [3*]CCCCCCCC^[3*]c1ccc(O)cc1

9 CCCCCCCCCCCCCCCCC(=O)OCC@HOC(=O)CCCCCCCCCCCCCC [3*]C(=O)CCCCCCCCCCCCCCCC^[1*]OCC@HO[2*]^[3*]C(=O)CCCCCCCCCCCCCC

10 CCCCCCCCCCCCCCCC(=O)OC@HCOP(=O)([O-])OCCN+(C)C [3*]C(=O)CCCCCCCCCCCCCCC^[1*]OC@HCOP(=O)([O-])OCCN+(C)C^[3*]C(=O)CCCCCCCCCCCCC

Note: Converting the HF model to the right dtype is very important for getting the right outputs: model.to(torch.bfloat16).

Data

The pretraining datasets and model are on Huggingface.

Training the model

Training code is available, but not configured to be run with publicly available data format on HF.

Training the GNN

To be released.

Pretraining

Updated training code based on the Huggingface dataset to be released. Files for this current version have not been made available.

Pretraining Adapter

PYTHONPATH=. srun python mCLM/scripts/main.py \ --pretrained_text_model LLMs/Qwen2.5-3B/ \ --pretrained_tokenizer LLMs/Qwen2.5-3B/ \ --check_val_every_n_steps 10000 \ --batch_size=4 --lr 2e-5 --mol_lr 2e-6 \ --ckpt_path ckpts/Qwen2.5-3B_TotalTop1k_Adaptor_splitLoss/ \ --version Qwen2.5-3B_Adaptor_splitLoss \ --max_epochs 1 \ --train_adapter \ --accumulate_grad_batches 4 \ --data_module TotalTopK \ --task TotalTop1k \ --num_warmup_steps 2000 \ --save_checkpoint_every_n_steps 1000 \ --instruction_data_path mCLM/data/instruction_onlyblocks_top_1k/ \ --synthetic_data_path mCLM/data/synthetic_onlyblocks_top_1k/ \ --downsample_tulu 0.1 \ --tokenizer_cache GNN_input_cache/Total.molecule_tokenizer.1k.pth \ --pretrained_embeddings final_embeddings/OnlyBlocks/Top1k/128_dim/ \

Main Pretraining

PYTHONPATH=. python mCLM/scripts/main.py --pretrained_text_model LLMs/Qwen2.5-3B/ \ --pretrained_tokenizer LLMs/Qwen2.5-3B/ \ --check_val_every_n_steps 10000 \ --batch_size=4 --lr 2e-5 --mol_lr 2e-6 \ --ckpt_path ckpts/Pretraining/ --version Qwen2.5-3B_splitLoss \ --max_epochs 5 \ --accumulate_grad_batches 4 \ --data_module TotalTopK --task TotalTop1k \ --num_warmup_steps 2000 \ --save_checkpoint_every_n_steps 1000 \ --instruction_data_path data/instruction_onlyblocks_top_1k/ \ --synthetic_data_path data/synthetic_onlyblocks_top_1k/ \ --downsample_tulu 0.1 \ --load_GNN_ckpt ckpts_GNN/OnlyBlocks/128_dim/best_val_checkpoint.ckpt \ --tokenizer_cache GNN_input_cache/Total.molecule_tokenizer.1k.pth \ --pretrained_embeddings final_embeddings/OnlyBlocks/Top1k/128_dim/ \

Finetuning

CUDA_LAUNCH_BLOCKING=1 PYTHONPATH=. srun python mCLM/scripts/main.py --pretrained_text_model LLMs/Qwen2.5-3B/ \ --pretrained_tokenizer LLMs/Qwen2.5-3B/ \ --check_val_every_n_steps 10000 \ --batch_size=4 --lr 2e-5 --mol_lr 2e-6 \ --ckpt_path ckpts/Output_Checkpoint/ \ --version YOUR_NAME_HERE \ --max_epochs 5 \ --finetune \ --accumulate_grad_batches 4 \ --data_module FinetuneTopK --task FinetuneTop1k \ --num_warmup_steps 5000 \ --save_checkpoint_every_n_steps 1000 \ --synthetic_data_path data/finetune_top_1k/ \ --tokenizer_cache GNN_input_cache/Total.molecule_tokenizer.1k.pth \ --load_ckpt ckpts/Pretraining/latest_checkpoint-epoch=04-step=129000.ckpt \ --pretrained_embeddings final_embeddings/OnlyBlocks/Top1k/128_dim/ \

pretrained_text_model - A local download of Qwen/Qwen2.5-3B
pretrained_tokenizer - A local download of Qwen/Qwen2.5-3B
check_val_every_n_steps - Set how often to run validation metrics
batch_size - batch size per GPU
lr - LLM learning rate
mol_lr - Molecule encoder learning rate
ckpt_path - The output path for checkpoints
version - A name for WandB
max_epochs
accumulate_grad_batches - Number of gradient batches to accumulate before backprop.
data_module - The data module to use.
task - The name of your task for WandB
num_warmup_steps - Number of warmup steps for the learning rate.
save_checkpoint_every_n_steps - How often to save backup checkpoints.
instruction_data_path - Processed data from other projects.
synthetic_data_path - Either the mCLM activity cliff data (pretraining) or finetuning data.
tokenizer_cache - The cached inputs for the molecule tokenizer.
load_ckpt - The checkpoint that is being finetuned.
pretrained_embeddings - The location of the pretrained GNN embeddings.
load_GNN_ckpt - The pretrained GNN checkpoint.
downsample_tulu - A float to downsample Tulu3 in the pretraining data mixture.

Why so many tokenizers?

There are three tokenizers used for the model:

mCLM_tokenizer: This tokenizer is packaged separately. It is used to run the molecule tokenizer (without synthesis guarantees) to convert a SMILES string into our block notation and vice versa. Relevant functions are join_fragments and get_blocks.
mCLM/tokenizer: This contains MoleculeTokenizer, which is a class for converting molecule token IDs into torch-geometric data objects. It contains and manages a cache for these objects, allows data type conversion, and allows switching between base language models.
tokenizer: This is an AutoTokenizer from HuggingFace.

You can use mCLM_tokenizer to get the blocks from a molecule SMILES. Then, use the MoleculeTokenizer and AutoTokenizer to interact with the model. mCLM.from_pretrained will already create those two for you! After you're done generating, you can use mCLM_tokenizer to convert back to SMILES. Please see chat_HF.py for an example.

Citation

If you found our work useful, please cite:

@misc{edwards2025mclmmodularchemicallanguage, title={mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules}, author={Carl Edwards and Chi Han and Gawon Lee and Thao Nguyen and Sara Szymkuć and Chetan Kumar Prasad and Bowen Jin and Jiawei Han and Ying Diao and Ge Liu and Hao Peng and Bartosz A. Grzybowski and Martin D. Burke and Heng Ji}, year={2025}, eprint={2505.12565}, archivePrefix={arXiv}, primaryClass={cs.AI}, url={https://arxiv.org/abs/2505.12565}, }

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
mCLM		mCLM
mCLM_tokenizer		mCLM_tokenizer
resources		resources
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

Installation

Chat

To only generate molecules:

Data

Training the model

Training the GNN

Pretraining

Pretraining Adapter

Main Pretraining

Finetuning

Why so many tokenizers?

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

#	SMILES	Blocks
1	CCCCCC(=O)OCCCOC(=O)c1ccccc1	[3]C(=O)CCCCC^[2]OCCCO[1]^[3]C(=O)c1ccccc1
2	CCCCCOOCC=C(C)C(=O)CCC	[3]C(=O)CCC^[2]OC/C=C(/[1])C^[3]OCCCCC
3	CCCCC(=O)C(C)=CCOCCC=C(C)C	[3]C(=O)CCCC^[2]OC/C=C(/[1])C^[3]CCC=C(C)C
4	CCC(=O)C(C)=CCONc1ccc(S(N)(=O)=O)cc1	[3]C(=O)CC^[2]OC/C=C(/[1])C^[3]Nc1ccc(S(N)(=O)=O)cc1
5	CCC(=O)N(CC)CC	[3]N(CC)CC^[3]C(=O)CC
6	CN(C)c1ccccc1	[3]N(C)C^[3]c1ccccc1
7	CCN(CC)C(C)=O	[3]N(CC)CC^[3]C(C)=O
8	C=CC(=O)OCCCCOC(=O)CCCC	[3]C(=O)CCCC^[2]OCCCCO[1]^[3]C(=O)C=C
9	CCCCCC(=O)OCCCOC(=O)c1ccco1	[3]C(=O)CCCCC^[2]OCCCO[1]^[3]C(=O)c1ccco1
10	CCCCCCCC(=O)OCCCOC(=O)CCCCCCC	[3]C(=O)CCCCCCC^[2]OCCCO[1]^[3]C(=O)CCCCCCC

#	SMILES	Blocks
1	CCCCC(=O)N1CC=C(c2ccccc2)CC1	[3]C(=O)CCCC^[1]N1CC=C([2])CC1^[3]c1ccccc1
2	CCC(=O)N1CC=C(c2ccccc2)CC1	[3]C(=O)CC^[1]N1CC=C([2])CC1^[3]c1ccccc1
3	CCC(=O)N(CC)CC	[3]N(CC)CC^[3]C(=O)CC
4	CN(C)c1ccccc1	[3]N(C)C^[3]c1ccccc1
5	CCCC(=O)NCC1CCCO1	[3]C(=O)CCC^[3]NCC1CCCO1
6	CCN(CC)C(C)=O	[3]N(CC)CC^[3]C(C)=O
7	CCCCCC(=O)N(C)CC(=O)c1ccc2c(c1)OCO2	[3]C(=O)CCCCC^[1]N(C)CC([2])=O^[3]c1ccc2c(c1)OCO2
8	CCCCCCCC(=O)C(=O)C1CCCO1	[3]C(=O)CCCCCCC^[3]C(=O)C1CCCO1
9	CCCCCC(=O)C(=O)C12CC3CC(CC(C3)C1)C2	[3]C(=O)CCCCC^[3]C(=O)C12CC3CC(CC(C3)C1)C2
10	CCCCC(=O)N(CC)CC(=O)C(=O)c1ccccc1	[3]C(=O)CCCC^[1]N(CC)CC([2])=O^[3]C(=O)c1ccccc1

#	SMILES	Blocks
1	CCN(CC)C(C)=O	[3]N(CC)CC^[3]C(C)=O
2	CN(C)C(=O)Nc1ccc(Cl)cc1	[3]N(C)C^[3]C(=O)Nc1ccc(Cl)cc1
3	O=CC=Cc1ccccc1	[3]c1ccccc1^[3]C=CC=O
4	O=C(C=Cc1ccccc1)NCCO	[3]NCCO^[3]C(=O)C=Cc1ccccc1
5	O=C(Nc1ccccc1)Nc1ccccc1	[3]Nc1ccccc1^[3]C(=O)Nc1ccccc1
6	Cc1cc(C)nc(NS(=O)(=O)c2ccc(N)cc2)n1	[3]NS(=O)(=O)c1ccc(N)cc1^[3]c1nc(C)cc(C)n1
7	CCCCCC(=O)OCCc1ccccc1	[3]C(=O)CCCCC^[3]OCCc1ccccc1
8	CCCCCCCCc1ccc(O)cc1	[3]CCCCCCCC^[3]c1ccc(O)cc1
9	CCCCCCCCCCCCCCCCC(=O)OCC@HOC(=O)CCCCCCCCCCCCCC	[3]C(=O)CCCCCCCCCCCCCCCC^[1]OCC@HO[2]^[3]C(=O)CCCCCCCCCCCCCC
10	CCCCCCCCCCCCCCCC(=O)OC@HCOP(=O)([O-])OCCN+(C)C	[3]C(=O)CCCCCCCCCCCCCCC^[1]OC@HCOP(=O)([O-])OCCN+(C)C^[3*]C(=O)CCCCCCCCCCCCC

Folders and files

Latest commit

History

Repository files navigation

mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

Installation

Chat

To only generate molecules:

Data

Training the model

Training the GNN

Pretraining

Pretraining Adapter

Main Pretraining

Finetuning

Why so many tokenizers?

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages