RAGEN (Reasoning AGENT) is a flexible RL framework for training reasoning agents.
We develop diagnostics to understand how agent RL training works, and how to fix hidden failure modes.
Looking for the V1 README? Please take a look here.
- 2026.3.12. We are excited to release RAGEN V2! We introduce a systematic study of reasoning collapse in agent RL and lightweight interventions for stable training. See the v2 paper.
- 2025.4.20. RAGEN V1 paper published on arXiv.
- 2025.1.27. Initial RAGEN release. Post.
RAGEN is built around StarPO (State-Thinking-Actions-Reward Policy Optimization), a unified RL framework for training multi-turn, trajectory-level agents with flexible control over reasoning processes, reward assignment mechanisms, and prompt-rollout structures.
RAGEN is flexible with:
- StarPO framework. Unified optimization for multi-turn agents, supporting both trajectory-level and turn-wise training.
- 10 built-in environments. Sokoban, FrozenLake, WebShop, DeepCoder, SearchQA, Lean, Bandit, Countdown, MetaMathQA, Sudoku.
- Gym-compatible interface. Easy to add custom environments.
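Because the interface is Gym-compatible, adding a task mostly amounts to implementing `reset` and `step`. Below is a minimal illustrative sketch with text observations; the toy `CounterEnv` class and its registration are our own example, not RAGEN's actual base class:

```python
# Minimal Gym-style environment sketch (illustrative; RAGEN's actual
# base class and registration hooks may differ).
class CounterEnv:
    """Toy task: reach a target count by issuing 'inc' actions."""

    def __init__(self, target=3):
        self.target = target
        self.count = 0

    def reset(self):
        # Return the initial observation as text, since states are token sequences.
        self.count = 0
        return f"count={self.count}, target={self.target}"

    def step(self, action):
        if action == "inc":
            self.count += 1
        done = self.count >= self.target
        reward = 1.0 if done else 0.0
        obs = f"count={self.count}, target={self.target}"
        return obs, reward, done, {}


env = CounterEnv()
obs = env.reset()
for _ in range(3):
    obs, reward, done, info = env.step("inc")
```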
RAGEN V2 additionally introduces:
- SNR-Adaptive Filtering (V2). Lightweight rollout filtering based on reward variance to mitigate noisy gradient updates.
- Reasoning collapse diagnostics (V2). Mutual information proxy metrics to detect and monitor template collapse during training.
StarPO operates in two interleaved stages: a rollout stage and an update stage. The LLM generates reasoning-guided actions to interact with the environment, collecting trajectory-level rewards to jointly optimize reasoning and action strategies.
MDP Formulation. Agent-environment interactions are formulated as Markov Decision Processes (MDPs) where states and actions are token sequences, allowing LLMs to reason over environment dynamics. The objective is to maximize expected cumulative rewards across multiple interaction turns.
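The multi-turn objective is the expected discounted return over a trajectory. As a quick numeric illustration (the discount factor here is an arbitrary example value):

```python
# Discounted return over a multi-turn trajectory: G = sum_t gamma^t * r_t.
def discounted_return(rewards, gamma=1.0):
    g = 0.0
    for r in reversed(rewards):  # fold from the last turn backwards
        g = r + gamma * g
    return g

# Three turns with only a final success reward:
g = discounted_return([0.0, 0.0, 1.0], gamma=0.9)  # 0.9**2 * 1.0 = 0.81
```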
Rollout Stage. Given an initial state, the LLM generates multiple trajectories. At each step, the model produces a reasoning-guided action: <think>...</think><ans> action </ans>. The environment returns feedback (reward and next state).
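Extracting the action from a model response in this format can be sketched with a regular expression; the tag names follow the rollout format above, while the fallback behavior for malformed outputs is an illustrative choice:

```python
import re

# Parse a reasoning-guided action of the form <think>...</think><ans>...</ans>.
# DOTALL lets the reasoning span multiple lines; non-greedy matches stop at
# the first closing tag.
ACTION_RE = re.compile(r"<think>(.*?)</think>\s*<ans>(.*?)</ans>", re.DOTALL)

def parse_action(response: str):
    m = ACTION_RE.search(response)
    if m is None:
        return None, None  # malformed output: no executable action
    return m.group(1).strip(), m.group(2).strip()

thought, action = parse_action(
    "<think>The box is left of me.</think><ans> Left </ans>"
)
```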
Update Stage. StarPO optimizes entire trajectories using importance sampling. It supports:
- PPO. Token-level advantage estimation via a value function over trajectories.
- GRPO. Normalized reward assigned to the full trajectory.
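For intuition, GRPO's trajectory-level credit assignment can be sketched as group-normalizing the rewards of rollouts sampled from the same initial state, then assigning that scalar to the whole trajectory. This is a simplified sketch (the epsilon value is an illustrative choice; the actual implementation lives in the veRL submodule):

```python
import statistics

# GRPO-style advantage sketch: normalize each trajectory's reward by the
# mean/std of its rollout group; the resulting scalar is shared by every
# token in that trajectory.
def grpo_advantages(group_rewards, eps=1e-6):
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts from one state: two successes, two failures.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```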
Entropy alone cannot detect template collapse, where reasoning appears diverse within a single input but becomes input-agnostic across inputs. RAGEN V2 decomposes reasoning quality into two axes:
- Within-input diversity: Conditional Entropy H(Z|X)
- Cross-input distinguishability: Mutual Information I(X;Z)
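With reasoning traces discretized (e.g., clustered into template IDs Z per input X, a step assumed to happen upstream), both axes follow from plug-in entropy estimates via I(X;Z) = H(Z) - H(Z|X). A toy sketch of that decomposition:

```python
import math
from collections import Counter

def entropy(labels):
    """Plug-in Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mi_proxy(pairs):
    """Plug-in estimate of I(X;Z) = H(Z) - H(Z|X) from (input, template) pairs."""
    xs = [x for x, _ in pairs]
    h_z = entropy([z for _, z in pairs])
    # H(Z|X): entropy of templates within each input, weighted by input frequency.
    h_z_given_x = sum(
        (xs.count(x) / len(xs)) * entropy([z for xi, z in pairs if xi == x])
        for x in set(xs)
    )
    return h_z - h_z_given_x

# Template collapse: templates look diverse, but every input gets the same
# distribution, so H(Z|X) is high while I(X;Z) is ~0.
collapsed = [("x1", "t1"), ("x1", "t2"), ("x2", "t1"), ("x2", "t2")]
# Input-driven reasoning: each input maps to its own template, so I(X;Z) is high.
driven = [("x1", "t1"), ("x1", "t1"), ("x2", "t2"), ("x2", "t2")]
```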
SNR-Adaptive Filtering uses reward variance as a lightweight proxy to select high-signal prompts each iteration, directly addressing the root cause of template collapse.
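The intuition: a prompt whose rollout group is all successes or all failures has near-zero reward variance and carries little gradient signal, so each iteration keeps only high-variance prompts. The sketch below uses a top-p-style rule (keep the highest-variance prompts until they cover a fraction p of total variance); this is our illustrative interpretation, not necessarily the repository's exact logic:

```python
import statistics

# Keep the highest-variance prompt groups until they cover a fraction `p`
# of the total reward variance across prompts. Illustrative interpretation
# of top-p filtering; the repository's exact rule may differ.
def snr_filter(group_rewards, p=0.9):
    variances = {k: statistics.pvariance(v) for k, v in group_rewards.items()}
    total = sum(variances.values())
    if total == 0.0:
        return []  # no prompt carries signal this iteration
    kept, mass = [], 0.0
    for k in sorted(variances, key=variances.get, reverse=True):
        kept.append(k)
        mass += variances[k]
        if mass / total >= p:
            break
    return kept

groups = {
    "p1": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: high variance
    "p2": [1.0, 1.0, 1.0, 1.0],  # all successes: zero variance
    "p3": [0.0, 0.0, 0.0, 1.0],  # mostly failures: some variance
}
kept = snr_filter(groups, p=0.9)
```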
2026.3.12. RAGEN V2 is released! Check out our v2 paper.
Older updates
2025.5.8. Official Documentation released.
2025.5.2. A tracking document for logging minor codebase updates is released.
2025.4.20. RAGEN V1 paper published. Codebase restructured: veRL integrated as a submodule; architecture decomposed into three modules — Environment State Manager, Context Manager, and Agent Proxy.
2025.3.13. RAGEN codebase refactoring underway. See the developing branch.
2025.3.8. KL term issue in veRL fixed. Default advantage estimator changed to GAE (PPO) for more stable training.
2025.1.27. Initial RAGEN release. Post.
```bash
git clone https://github.com/mll-lab-nu/RAGEN.git
cd RAGEN
conda create -n ragen python=3.12 -y && conda activate ragen
bash scripts/setup_ragen.sh
```

Use `bash scripts/setup_ragen.sh --with-search` to include the search environment. For WebShop, see docs/experiment_webshop_release.md.
RAGEN V2 diagnoses agent behavior along two axes — within-input diversity (Conditional Entropy) and cross-input distinguishability (Mutual Information) — yielding four distinct reasoning regimes:
Left: Input-driven reasoning adapts to the current state; templated reasoning produces nearly identical responses across different inputs. Right: Four reasoning regimes along two axes — conditional entropy H(Z|X) (within-input diversity) and mutual information I(X;Z) (input dependence). Template collapse (high entropy, low MI) is invisible to existing entropy-based metrics.
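Operationally, the two-axis view amounts to thresholding both metrics. The sketch below is only illustrative: the threshold values are arbitrary placeholders, and all regime names except "template collapse" are our paraphrases of the four quadrants:

```python
# Map (H(Z|X), I(X;Z)) to one of four reasoning regimes along the two axes.
# Thresholds are arbitrary placeholders, not the paper's choices.
def regime(cond_entropy, mutual_info, h_thresh=0.5, mi_thresh=0.5):
    if cond_entropy >= h_thresh and mutual_info >= mi_thresh:
        return "diverse, input-driven reasoning"
    if cond_entropy >= h_thresh:
        return "template collapse"      # diverse but input-agnostic
    if mutual_info >= mi_thresh:
        return "deterministic, input-driven reasoning"
    return "full collapse"              # one template for everything

label = regime(cond_entropy=0.9, mutual_info=0.1)  # high H(Z|X), low I(X;Z)
```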
Train (no filter, default):

```bash
python train.py --config-name _2_sokoban
```

Train with SNR-Adaptive Filtering (V2, Top-p):

```bash
python train.py --config-name _2_sokoban \
    actor_rollout_ref.rollout_filter_strategy=top_p \
    actor_rollout_ref.rollout.rollout_filter_value=0.9
```

Evaluate:

```bash
python -m ragen.llm_agent.agent_proxy --config-name _2_sokoban
```

SNR-Adaptive Filtering consistently improves training across algorithms, model scales, and modalities (green = gain from filtering):
See the Rollout Filtering Guide for more filtering strategies (Top-k, linear mode, etc.).
We are actively developing the next generation of RAGEN infrastructure and diagnostics, targeting a release in late March 2026.
Infrastructure
- Async rollout engine
- HTTP-based environment interface
- Layered Env Wrapper
- Optional environment dependencies
Diagnostics & Training Quality
- Expanded benchmark suite to stress-test diagnostics across diverse, real-world agent tasks
- Extended MI diagnostic dashboard, including richer WandB visualizations for entropy, MI proxy, and gradient decomposition over training
- RL training metrics guide, including a practitioner's blog on how to read training signals (reward distribution, entropy, MI, gradient norms) and act on them before committing to a full run
Framework
- Update full documentation for RAGEN V2
- Multi-modal agent support (building upon VAGEN)
- Public leaderboard for benchmark results
- Full Documentation (We will release an updated version soon.)
- Rollout Filtering Guide
- MI Metrics Reference
- Adding Custom Environments — Gym-compatible interface; see `config/envs.yaml` and documentation
- Experiment reproduction: Main Table | Intervention Sweep | FrozenLake | Sokoban Gradient | Search | DeepCoder | WebShop
- ROLL: Efficient Scaling Library for RL with LLMs
- VAGEN: Training Visual Agents with multi-turn RL
- Search-R1: Train LLMs to reason and call a search engine with RL
- ZeroSearch: Incentivize LLM search capability without searching
- Agent-R1: Training Powerful LLM Agents with End-to-End RL
- OpenManus-RL: RL tuning for LLM agents
- MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs
- s3: Efficient Yet Effective Search Agent Training via RL
Zihan Wang*, Kangrui Wang*, Qineng Wang*, Pingyue Zhang*, Linjie Li*, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
*Equal Contribution.
We thank the DeepSeek team for early conceptual inspirations. We are grateful to the veRL team for infrastructure support. We thank the TinyZero team for discoveries that informed our initial exploration. We appreciate insightful discussions with Han Liu, Xinyu Xing, Li Erran Li, John Schulman, Akari Asai, Eiso Kant, Lu Lu, Runxin Xu, Huajian Xin, Zijun Liu, Weiyi Liu, Weimin Wu, Yibo Wen, Jiarui Liu, Lorenzo Xiao, Ishan Mukherjee, Anabella Isaro, Haosen Sun, How-Yeh Wan, Lester Xue, Matthew Khoriaty, Haoxiang Sun, Jiajun Liu.
For RAGEN V2, we additionally thank Yuxiang Lin and Kyunghyun Cho for their support.
```bibtex
@misc{ragen-v2,
  title={RAGEN-V2: Understanding Reasoning Collapse in LLM Agent Reinforcement Learning},
  author={Zihan Wang and Chi Gui and Xing Jin and Qineng Wang and Licheng Liu and Kangrui Wang and Shiqi Chen and Linjie Li and Zhengyuan Yang and Pingyue Zhang and Yiping Lu and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  year={2026},
  url={https://ragen-ai.github.io/v2},
}
```

```bibtex
@misc{ragen,
  title={RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning},
  author={Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Linjie Li and Zhengyuan Yang and Xing Jin and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  year={2025},
  eprint={2504.20073},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.20073},
}
```


