We introduce ESPO (ELBO-based Sequence-level Policy Optimization), a principled reinforcement learning framework for diffusion large language models (dLLMs).
Unlike traditional autoregressive RL methods (e.g., GRPO) that rely on token-level likelihoods, ESPO views the entire sequence generation as a single action and leverages the ELBO as a tractable proxy for sequence-level likelihood. This design resolves the fundamental mismatch between RL and the non-autoregressive nature of dLLMs.
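Concretely, the sequence log-likelihood of a masked diffusion model is intractable, but it admits the standard ELBO lower bound (we state the common form here; see the paper for the exact objective ESPO optimizes):

$$
\log p_\theta(y \mid x) \;\geq\; \mathrm{ELBO}_\theta(y \mid x) \;=\; \mathbb{E}_{t \sim \mathcal{U}(0,1],\; y_t}\left[ \frac{1}{t} \sum_{i \,:\, y_t^i = \mathrm{MASK}} \log p_\theta\!\left(y^i \mid x, y_t\right) \right],
$$

where $y_t$ masks each token of $y$ independently with probability $t$. ESPO treats the entire response $y$ as a single action and optimizes this bound in place of the per-token log-probabilities that autoregressive GRPO relies on.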
ESPO introduces:
- Sequence-level optimization for diffusion LLMs via the ELBO objective.
- Per-token normalized ratio estimation and robust KL regularization for stable large-scale training (see the sketch after this list).
- Consistent gains across math, coding, and planning benchmarks.
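For intuition, here is a minimal sketch of the core idea, not the released trainer: the names (`per_token_elbo`, `sequence_level_loss`), the mask id, the Monte Carlo estimator, the clipping range, and the k3-style KL estimator are all illustrative assumptions on our part. It estimates each sequence's length-normalized ELBO under the current, old, and reference policies, forms one importance ratio per sequence, and applies a GRPO-style clipped objective with a non-negative KL penalty:

```python
import torch

MASK_ID = 126336  # LLaDA's [MASK] token id (verify against your tokenizer)

def per_token_elbo(model, x, y, num_mc=2):
    """Monte Carlo estimate of the length-normalized ELBO of response y given prompt x.

    Sketch: sample a masking ratio t ~ U(0,1], mask each response token
    independently with prob. t, score the masked positions, and weight by 1/t
    (the standard unbiased masked-diffusion ELBO estimator).
    """
    batch, length = y.shape
    total = torch.zeros(batch, device=y.device)
    for _ in range(num_mc):
        t = torch.rand(batch, 1, device=y.device).clamp(min=1e-3)
        mask = torch.rand(batch, length, device=y.device) < t
        y_masked = torch.where(mask, torch.full_like(y, MASK_ID), y)
        # HF-style output assumed; score only the response positions
        logits = model(torch.cat([x, y_masked], dim=1)).logits[:, x.shape[1]:]
        logp = logits.log_softmax(-1).gather(-1, y.unsqueeze(-1)).squeeze(-1)
        total += (mask * logp).sum(-1) / t.squeeze(1)
    return total / (num_mc * length)  # per-token normalization

def sequence_level_loss(policy, old_policy, ref_policy, x, y, adv,
                        clip_eps=0.2, kl_coef=0.01):
    """GRPO-style clipped objective with ONE ratio per sequence, not per token."""
    elbo_new = per_token_elbo(policy, x, y)
    with torch.no_grad():
        elbo_old = per_token_elbo(old_policy, x, y)
        elbo_ref = per_token_elbo(ref_policy, x, y)
    ratio = torch.exp(elbo_new - elbo_old)            # sequence-level ratio
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
    # Non-negative "k3" KL estimator, computed on the ELBO proxies
    log_r = elbo_ref - elbo_new
    kl = torch.exp(log_r) - log_r - 1.0
    return -(torch.min(ratio * adv, clipped * adv) - kl_coef * kl).mean()
```

Normalizing the ELBO by response length keeps the ratios on a comparable scale across sequences of different lengths, which is what keeps the exponentiation numerically stable at scale.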
Repository structure:

```
├── espo/
│   ├── espo_train.py        # Entry point for diffusion RL training
│   ├── espo_trainer.py      # Sequence-level trainer implementation
│   ├── configs.py           # Configuration schemas and defaults
│   ├── rewards.py           # Reward functions for different tasks
│   ├── data_utils.py        # Data loading and processing utilities
│   └── utils/               # Helpers (code exec, routing, logging)
├── recipes/
│   ├── train.yaml           # Base training configuration
│   ├── run_*.sh             # Task-specific launch scripts (GSM8K, Math, etc.)
│   ├── process_data.py      # Dataset preprocessing for coding tasks
│   └── accelerate_configs/  # accelerate/distributed launch presets
├── dataset/                 # Datasets for Sudoku and Countdown tasks
├── eval/                    # Evaluation scripts for Sudoku and Countdown tasks
```

Installation:

```bash
env=espo
conda create -n $env python=3.11 -y
conda activate $env
pip install setuptools
pip install flash-attn==2.8.0.post1 --no-build-isolation
# Install this project. Extras available: [code]
pip install -e ".[code]"
```

We split out a hard subset of AceCode-89K:
```bash
cd recipes
python process_data.py \
    --dataset_path "TIGER-Lab/AceCode-89K" \
    --output_path "./acecode_hard.jsonl" \
    --difficulty "hard"
```

For sandboxed code execution, we recommend E2B. Set `E2B_API_KEY` to your token and specify `code_provider="e2b"` in the config. A local fallback (`code_provider="local"`) is available but not recommended for untrusted code.
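Before launching a long run, you can verify your sandbox credentials with a quick standalone check. This is an illustrative snippet assuming the `e2b-code-interpreter` Python SDK; the trainer itself handles execution once `code_provider="e2b"` is set:

```python
import os
from e2b_code_interpreter import Sandbox  # pip install e2b-code-interpreter

assert os.environ.get("E2B_API_KEY"), "export E2B_API_KEY=<your token> first"

sandbox = Sandbox()                             # starts an isolated cloud sandbox
execution = sandbox.run_code("print(21 * 2)")   # runs inside the sandbox
print(execution.logs)                           # sandbox stdout/stderr
sandbox.kill()                                  # release the sandbox
```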
We release ESPO-fine-tuned checkpoints built on LLaDA-8B-Instruct.
ESPO-Code is released as a fully fine-tuned model (no LoRA). ESPO-GSM8K, ESPO-Math, ESPO-Countdown, and ESPO-Sudoku are provided as LoRA adapters that load on top of the base LLaDA-8B-Instruct model, keeping fine-tuning lightweight and efficient.
Quick Start:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from eval.generate_utils import generate

base_model_path = 'GSAI-ML/LLaDA-8B-Instruct'
peft_model_path = 'GSAI-ML/ESPO-Math'

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="cuda",
)
peft_model = PeftModel.from_pretrained(model, peft_model_path, device_map="cuda")

prompt = "The point $(0,0)$ is reflected over the vertical line $x=1$. When its image is then reflected over the line $y=2$, what is the resulting point?\n\nWrite your answer in the form $(x, y)$ where $x$ and $y$ are real numbers."
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

output_ids = generate(
    peft_model,
    input_ids,
    tokenizer,
    steps=128,
    gen_length=256,
    temperature=0.9,
    remasking="low_confidence",
)
output_text = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
print(output_text)
```
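If you plan to serve a checkpoint repeatedly, you can optionally fold the adapter into the base weights first using PEFT's standard merge API (the output path below is hypothetical; this step is our suggestion and is not required by the scripts):

```python
# Optional: bake the LoRA deltas into the base weights so the merged model
# can be saved and served without the peft wrapper.
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("./espo-math-merged")  # hypothetical output path
```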
We provide RL training scripts in the `recipes` directory. Example:
```bash
bash recipes/run_demo_llada.sh
```

For LLaDA-8B-Instruct, we include both a demo script and task-specific scripts for GSM8K, Math, Countdown, Sudoku, and Coding. For Dream-7B-Instruct, we provide a demo script (`run_demo_dream.sh`); all task scripts can be adapted by modifying the parameters in this demo. You may also customize settings directly in the corresponding `run_*.sh` files or in `train.yaml`.
For large batch sizes or full-parameter fine-tuning (e.g., code tasks), enable `gradient_checkpointing: true` in the config. For LLaDA models, add the following methods at the end of `modeling_llada.py`:
```python
...

    def tie_weights(self):
        if self.config.weight_tying:
            self.model.transformer.ff_out = self.model.transformer.wte

    # Add the following methods to modeling_llada.py.
    ## Begin
    @property
    def supports_gradient_checkpointing(self):
        return True

    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
        self.model.set_activation_checkpointing("fine_grained")

    def gradient_checkpointing_disable(self):
        self.model.set_activation_checkpointing(None)
    ## End

# Register the model so that it is available for transformers pipelines, auto-loading, etc.
AutoModel.register(LLaDAConfig, LLaDAModelLM)
```

The `fine_grained` strategy balances speed and memory.
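After patching, you can sanity-check that `transformers` picks up the new hooks. A minimal check, assuming the patched `modeling_llada.py` is the copy actually loaded via `trust_remote_code` (e.g., a local checkout or the cached remote-code file you edited):

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)
assert model.supports_gradient_checkpointing  # property added by the patch
model.gradient_checkpointing_enable()         # calls set_activation_checkpointing("fine_grained")
```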
For the Sudoku and Countdown tasks, we provide an evaluation script, `eval/run_eval.sh`, which works with both Dream-7B-Instruct and LLaDA-8B-Instruct.
For GSM8K, Math, and Code tasks, we use the official evaluation scripts from the respective model codebases:
- LLaDA-8B-Instruct: use their official evaluation script based on OpenCompass.
- Dream-7B-Instruct: use their official evaluation script based on lm-eval.
We thank the following open-source efforts:
- Models: LLaDA, Dream
- RL/eval codebases: d1, DiffuCoder, Open-R1, OpenCompass
If you find ESPO useful in your research, please consider citing our paper:
```bibtex
@article{ou2025principledrldiffusionllms,
  title   = {Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective},
  author  = {Jingyang Ou and Jiaqi Han and Minkai Xu and Shaoxuan Xu and Jianwen Xie and Stefano Ermon and Yi Wu and Chongxuan Li},
  journal = {arXiv preprint arXiv:2512.03759},
  year    = {2025},
}
```
