RAGEN (Reasoning AGENT) is a flexible RL framework for training reasoning agents.
We develop diagnostics to understand how agent RL training works, and how to fix hidden failure modes.
Looking for the V1 README? Please take a look here.
- 2026.3.12. We are excited to release RAGEN V2! We introduce a systematic study of reasoning collapse in agent RL and lightweight interventions for stable training. See the v2 paper.
- 2025.4.20. RAGEN V1 paper published on arXiv.
- 2025.1.27. Initial RAGEN release. Post.
RAGEN is built around StarPO (State-Thinking-Actions-Reward Policy Optimization), a unified RL framework for training multi-turn, trajectory-level agents with flexible control over reasoning processes, reward assignment mechanisms, and prompt-rollout structures.
RAGEN is flexible with:
- StarPO framework. Unified optimization for multi-turn agents, supporting both trajectory-level and turn-wise training.
- 10 built-in environments. Sokoban, FrozenLake, WebShop, DeepCoder, SearchQA, Lean, Bandit, Countdown, MetaMathQA, Sudoku.
- Gym-compatible interface. Easy to add custom environments.
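Because the interface is Gym-compatible, adding a task mostly amounts to implementing `reset` and `step`. Below is a minimal illustrative sketch with text observations; the toy `CounterEnv` class and its registration are our own example, not RAGEN's actual base class:

```python
# Minimal Gym-style environment sketch (illustrative; RAGEN's actual
# base class and registration hooks may differ).
class CounterEnv:
    """Toy task: reach a target count by issuing 'inc' actions."""

    def __init__(self, target=3):
        self.target = target
        self.count = 0

    def reset(self):
        # Return the initial observation as text, since states are token sequences.
        self.count = 0
        return f"count={self.count}, target={self.target}"

    def step(self, action):
        if action == "inc":
            self.count += 1
        done = self.count >= self.target
        reward = 1.0 if done else 0.0
        obs = f"count={self.count}, target={self.target}"
        return obs, reward, done, {}


env = CounterEnv()
obs = env.reset()
for _ in range(3):
    obs, reward, done, info = env.step("inc")
```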
RAGEN V2 additionally introduces:
- SNR-Adaptive Filtering (V2). Lightweight rollout filtering based on reward variance to mitigate noisy gradient updates.
- Reasoning collapse diagnostics (V2). Mutual information proxy metrics to detect and monitor template collapse during training.
StarPO operates in two interleaved stages: a rollout stage and an update stage. The LLM generates reasoning-guided actions to interact with the environment, collecting trajectory-level rewards to jointly optimize reasoning and action strategies.
MDP Formulation. Agent-environment interactions are formulated as Markov Decision Processes (MDPs) where states and actions are token sequences, allowing LLMs to reason over environment dynamics. The objective is to maximize expected cumulative rewards across multiple interaction turns.
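The multi-turn objective is the expected discounted return over a trajectory. As a quick numeric illustration (the discount factor here is an arbitrary example value):

```python
# Discounted return over a multi-turn trajectory: G = sum_t gamma^t * r_t.
def discounted_return(rewards, gamma=1.0):
    g = 0.0
    for r in reversed(rewards):  # fold from the last turn backwards
        g = r + gamma * g
    return g

# Three turns with only a final success reward:
g = discounted_return([0.0, 0.0, 1.0], gamma=0.9)  # 0.9**2 * 1.0 = 0.81
```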
Rollout Stage. Given an initial state, the LLM generates multiple trajectories. At each step, the model produces a reasoning-guided action: <think>...</think><ans> action </ans>. The environment returns feedback (reward and next state).
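Extracting the action from a model response in this format can be sketched with a regular expression; the tag names follow the rollout format above, while the fallback behavior for malformed outputs is an illustrative choice:

```python
import re

# Parse a reasoning-guided action of the form <think>...</think><ans>...</ans>.
# DOTALL lets the reasoning span multiple lines; non-greedy matches stop at
# the first closing tag.
ACTION_RE = re.compile(r"<think>(.*?)</think>\s*<ans>(.*?)</ans>", re.DOTALL)

def parse_action(response: str):
    m = ACTION_RE.search(response)
    if m is None:
        return None, None  # malformed output: no executable action
    return m.group(1).strip(), m.group(2).strip()

thought, action = parse_action(
    "<think>The box is left of me.</think><ans> Left </ans>"
)
```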
Update Stage. StarPO optimizes entire trajectories using importance sampling. It supports:
- PPO. Token-level advantage estimation via a value function over trajectories.
- GRPO. Normalized reward assigned to the full trajectory.
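For intuition, GRPO's trajectory-level credit assignment can be sketched as group-normalizing the rewards of rollouts sampled from the same initial state, then assigning that scalar to the whole trajectory. This is a simplified sketch (the epsilon value is an illustrative choice; the actual implementation lives in the veRL submodule):

```python
import statistics

# GRPO-style advantage sketch: normalize each trajectory's reward by the
# mean/std of its rollout group; the resulting scalar is shared by every
# token in that trajectory.
def grpo_advantages(group_rewards, eps=1e-6):
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts from one state: two successes, two failures.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```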
Entropy alone cannot detect template collapse, where reasoning appears diverse within a single input but becomes input-agnostic across inputs. RAGEN V2 decomposes reasoning quality into two axes:
- Within-input diversity: Conditional Entropy H(Z|X)
- Cross-input distinguishability: Mutual Information I(X;Z)
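With reasoning traces discretized (e.g., clustered into template IDs Z per input X, a step assumed to happen upstream), both axes follow from plug-in entropy estimates via I(X;Z) = H(Z) - H(Z|X). A toy sketch of that decomposition:

```python
import math
from collections import Counter

def entropy(labels):
    """Plug-in Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mi_proxy(pairs):
    """Plug-in estimate of I(X;Z) = H(Z) - H(Z|X) from (input, template) pairs."""
    xs = [x for x, _ in pairs]
    h_z = entropy([z for _, z in pairs])
    # H(Z|X): entropy of templates within each input, weighted by input frequency.
    h_z_given_x = sum(
        (xs.count(x) / len(xs)) * entropy([z for xi, z in pairs if xi == x])
        for x in set(xs)
    )
    return h_z - h_z_given_x

# Template collapse: templates look diverse, but every input gets the same
# distribution, so H(Z|X) is high while I(X;Z) is ~0.
collapsed = [("x1", "t1"), ("x1", "t2"), ("x2", "t1"), ("x2", "t2")]
# Input-driven reasoning: each input maps to its own template, so I(X;Z) is high.
driven = [("x1", "t1"), ("x1", "t1"), ("x2", "t2"), ("x2", "t2")]
```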
SNR-Adaptive Filtering uses reward variance as a lightweight proxy to select high-signal prompts each iteration, directly addressing the root cause of template collapse.
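The intuition: a prompt whose rollout group is all successes or all failures has near-zero reward variance and carries little gradient signal, so each iteration keeps only high-variance prompts. The sketch below uses a top-p-style rule (keep the highest-variance prompts until they cover a fraction p of total variance); this is our illustrative interpretation, not necessarily the repository's exact logic:

```python
import statistics

# Keep the highest-variance prompt groups until they cover a fraction `p`
# of the total reward variance across prompts. Illustrative interpretation
# of top-p filtering; the repository's exact rule may differ.
def snr_filter(group_rewards, p=0.9):
    variances = {k: statistics.pvariance(v) for k, v in group_rewards.items()}
    total = sum(variances.values())
    if total == 0.0:
        return []  # no prompt carries signal this iteration
    kept, mass = [], 0.0
    for k in sorted(variances, key=variances.get, reverse=True):
        kept.append(k)
        mass += variances[k]
        if mass / total >= p:
            break
    return kept

groups = {
    "p1": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: high variance
    "p2": [1.0, 1.0, 1.0, 1.0],  # all successes: zero variance
    "p3": [0.0, 0.0, 0.0, 1.0],  # mostly failures: some variance
}
kept = snr_filter(groups, p=0.9)
```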
2026.3.12. RAGEN V2 is released! Check out our v2 paper.
Older updates
2025.5.8. Official Documentation released.
2025.5.2. A tracking document for logging minor codebase updates is released.
2025.4.20. RAGEN V1 paper published. Codebase restructured: veRL integrated as a submodule; architecture decomposed into three modules — Environment State Manager, Context Manager, and Agent Proxy.
2025.3.13. RAGEN codebase refactoring underway. See the developing branch.
2025.3.8. KL term issue in veRL fixed. Default advantage estimator changed to GAE (PPO) for more stable training.
2025.1.27. Initial RAGEN release. Post.
```bash
git clone https://github.com/mll-lab-nu/RAGEN.git
cd RAGEN
conda create -n ragen python=3.12 -y && conda activate ragen
bash scripts/setup_ragen.sh
```

Use `bash scripts/setup_ragen.sh --with-search` to include the search environment. For WebShop, see docs/experiment_webshop_release.md.
RAGEN V2 diagnoses agent behavior along two axes — within-input diversity (Conditional Entropy) and cross-input distinguishability (Mutual Information) — yielding four distinct reasoning regimes:
Left: Input-driven reasoning adapts to the current state; templated reasoning produces nearly identical responses across different inputs. Right: Four reasoning regimes along two axes — conditional entropy H(Z|X) (within-input diversity) and mutual information I(X;Z) (input dependence). Template collapse (high entropy, low MI) is invisible to existing entropy-based metrics.
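Operationally, the two-axis view amounts to thresholding both metrics. The sketch below is only illustrative: the threshold values are arbitrary placeholders, and all regime names except "template collapse" are our paraphrases of the four quadrants:

```python
# Map (H(Z|X), I(X;Z)) to one of four reasoning regimes along the two axes.
# Thresholds are arbitrary placeholders, not the paper's choices.
def regime(cond_entropy, mutual_info, h_thresh=0.5, mi_thresh=0.5):
    if cond_entropy >= h_thresh and mutual_info >= mi_thresh:
        return "diverse, input-driven reasoning"
    if cond_entropy >= h_thresh:
        return "template collapse"      # diverse but input-agnostic
    if mutual_info >= mi_thresh:
        return "deterministic, input-driven reasoning"
    return "full collapse"              # one template for everything

label = regime(cond_entropy=0.9, mutual_info=0.1)  # high H(Z|X), low I(X;Z)
```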
Train (no filter, default):

```bash
python train.py --config-name _2_sokoban
```

Train with SNR-Adaptive Filtering (V2, Top-p):

```bash
python train.py --config-name _2_sokoban \
    actor_rollout_ref.rollout_filter_strategy=top_p \
    actor_rollout_ref.rollout.rollout_filter_value=0.9
```

Evaluate:

```bash
python -m ragen.llm_agent.agent_proxy --config-name _2_sokoban
```

SNR-Adaptive Filtering consistently improves training across algorithms, model scales, and modalities (green = gain from filtering):
See the Rollout Filtering Guide for more filtering strategies (Top-k, linear mode, etc.).
We are actively developing the next generation of RAGEN infrastructure and diagnostics, targeting a release in late March 2026.
Infrastructure
- Async rollout engine
- HTTP-based environment interface
- Layered Env Wrapper
- Optional environment dependencies
Diagnostics & Training Quality
- Expanded benchmark suite to stress-test diagnostics across diverse, real-world agent tasks
- Extended MI diagnostic dashboard, including richer WandB visualizations for entropy, MI proxy, and gradient decomposition over training
- RL training metrics guide, including a practitioner's blog on how to read training signals (reward distribution, entropy, MI, gradient norms) and act on them before committing to a full run
Framework
- Update full documentation for RAGEN V2
- Multi-modal agent support (building upon VAGEN)
- Public leaderboard for benchmark results
- Full Documentation (We will release an updated version soon.)
- Rollout Filtering Guide
- MI Metrics Reference
- Adding Custom Environments — Gym-compatible interface; see `config/envs.yaml` and documentation
- Experiment reproduction: Main Table | Intervention Sweep | FrozenLake | Sokoban Gradient | Search | DeepCoder | WebShop
- ROLL: Efficient Scaling Library for RL with LLMs
- VAGEN: Training Visual Agents with multi-turn RL
- Search-R1: Train LLMs to reason and call a search engine with RL
- ZeroSearch: Incentivize LLM search capability without searching
- Agent-R1: Training Powerful LLM Agents with End-to-End RL
- OpenManus-RL: RL tuning for LLM agents
- MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs
- s3: Efficient Yet Effective Search Agent Training via RL
Zihan Wang*, Kangrui Wang*, Qineng Wang*, Pingyue Zhang*, Linjie Li*, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
*Equal Contribution.
We thank the DeepSeek team for early conceptual inspirations. We are grateful to the veRL team for infrastructure support. We thank the TinyZero team for discoveries that informed our initial exploration. We appreciate insightful discussions with Han Liu, Xinyu Xing, Li Erran Li, John Schulman, Akari Asai, Eiso Kant, Lu Lu, Runxin Xu, Huajian Xin, Zijun Liu, Weiyi Liu, Weimin Wu, Yibo Wen, Jiarui Liu, Lorenzo Xiao, Ishan Mukherjee, Anabella Isaro, Haosen Sun, How-Yeh Wan, Lester Xue, Matthew Khoriaty, Haoxiang Sun, Jiajun Liu.
For RAGEN V2, we additionally thank Yuxiang Lin and Kyunghyun Cho for their support.
```bibtex
@misc{ragen-v2,
  title={RAGEN-V2: Understanding Reasoning Collapse in LLM Agent Reinforcement Learning},
  author={Zihan Wang and Chi Gui and Xing Jin and Qineng Wang and Licheng Liu and Kangrui Wang and Shiqi Chen and Linjie Li and Zhengyuan Yang and Pingyue Zhang and Yiping Lu and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  year={2026},
  url={https://ragen-ai.github.io/v2},
}
```

```bibtex
@misc{ragen,
  title={RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning},
  author={Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Linjie Li and Zhengyuan Yang and Xing Jin and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  year={2025},
  eprint={2504.20073},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.20073},
}
```


