This module provides the core functionality for agent practice within the Youtu-Agent framework, implementing Training-Free Group Relative Policy Optimization (GRPO). Training-Free GRPO is a cost-effective approach that enhances agent performance without any LLM parameter updates, by leveraging group-relative semantic advantages and iteratively distilling high-quality experiential knowledge.
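At a high level, each practice step samples a group of rollouts per query, scores them with a verification function, contrasts rollouts within the group to distill textual experiences, and folds those experiences back into the agent's context. The sketch below is purely illustrative, based on the description above and the configuration fields used later in this guide; all names are hypothetical:

```python
# Conceptual sketch of one Training-Free GRPO practice step (illustrative only;
# the actual orchestration lives in utu/practice/training_free_grpo.py).
# `agent`, `verify_func`, and `extract_experiences` are hypothetical callables.

def practice_step(agent, queries, verify_func, extract_experiences, experience_bank, grpo_n=3):
    """Run one batch of Training-Free GRPO; no model parameters are updated."""
    for query in queries:
        # 1. Sample a group of rollouts for the same query (grpo_n per group).
        rollouts = [agent.run(query, experiences=experience_bank) for _ in range(grpo_n)]
        # 2. Score each rollout with the configured verification function.
        rewards = [verify_func(rollout) for rollout in rollouts]
        # 3. "Group relative semantic advantage": instead of numeric advantages,
        #    an LLM contrasts rollouts within the group and distills, in natural
        #    language, what the higher-reward rollouts did differently.
        new_experiences = extract_experiences(query, rollouts, rewards)
        # 4. Fold the distilled experiences back into the agent's context.
        experience_bank.extend(new_experiences)
    return experience_bank
```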
```
utu/practice/
├── __init__.py             # Module exports
├── training_free_grpo.py   # Main orchestrator
├── rollout_manager.py      # Rollout execution and batch processing
├── experience_updater.py   # Experience processing and integration
├── data_manager.py         # Dataset management
├── utils.py                # Configuration parsing and utilities (detailed below)
├── dataset/                # Dataset storage directory
└── verify/                 # Verification functions
```

Configurations: The system uses a hierarchical configuration approach. `TrainingFreeGRPOConfig` is loaded from config files in the `configs/practice/` directory. It is a unified configuration class with an experiment ID and references to the following practice, data, and evaluation configs:
- `PracticeArguments`: Practice-specific parameters (epochs, batch size, GRPO settings, restart behavior)
- `DataArguments`: Data processing parameters, such as the practice dataset name
- `EvalConfig`: Evaluation configuration loaded separately from configuration files in the `configs/eval/` directory
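Schematically, the three layers reference one another as sketched below (a condensed, illustrative view; the full built-in examples appear later in this guide):

```yaml
# Illustrative sketch of configs/practice/<name>.yaml (TrainingFreeGRPOConfig).
defaults:
  - /eval/<domain>/<eval_name>@evaluation  # EvalConfig, which in turn references an
                                           # agent config under configs/agents/practice/
exp_id: "<experiment_id>"
practice:  # PracticeArguments: epochs, batch size, GRPO settings, restart behavior
  epochs: 5
data:      # DataArguments: e.g., the practice dataset name
  practice_dataset_name: "<dataset>"
```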
Utilities:
- `TaskRecorder`: Records practice progress, experiences, and statistics
- `parse_training_free_grpo_config()`: Configuration parser supporting YAML files with command-line overrides
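As a rough sketch of how the configuration parser is typically used (the import path and attribute names are inferred from the module layout and config structure in this guide, and are assumptions rather than a documented API):

```python
# Illustrative sketch: load the unified config with CLI overrides applied.
# The import path is an assumption based on the module layout above.
from utu.practice.utils import parse_training_free_grpo_config

# Reads the YAML under configs/practice/ selected by --config_name,
# then applies command-line overrides such as --epochs or --batch_size.
config = parse_training_free_grpo_config()
print(config.exp_id, config.practice.epochs, config.data.practice_dataset_name)
```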
Here are step-by-step instructions to run the training-free GRPO process for agent practice and evaluation. Please follow each step carefully to ensure proper setup and execution. For built-in example workflows on math reasoning and web searching, please refer to the 📝 Example Workflows.
Please refer to the QuickStart documentation for environment setup and installation of the Youtu-Agent framework. If you are new to Python, please refer to the Beginner's QuickStart Guide for detailed instructions.
Please install all dependencies by running:
```bash
uv sync --all-extras
source .venv/bin/activate
```

For tracing and monitoring the practice and evaluation process, you can enable Phoenix tracing by following the Tracing & Monitoring Documentation. For simplicity, you can start a local Phoenix server by running:
```bash
pip install arize-phoenix
phoenix serve
# Remember to set the environment variables in a `.env` file at the project root:
# PHOENIX_ENDPOINT=http://127.0.0.1:6006/v1/traces
# PHOENIX_PROJECT_NAME=Youtu-Agent
```

Before running the training-free GRPO process, you need to prepare your datasets for practice and evaluation by uploading them to the local database. There are two ways to upload and prepare data:
If you want to load datasets from HuggingFace to the database, you can reference the provided example script, `scripts/data/process_training_free_GRPO_data.py`. You can use this script directly for our built-in datasets, or adapt it to create your own upload code for other HuggingFace datasets:
```bash
# Example: Load built-in datasets for reproducing paper experiments
python scripts/data/process_training_free_GRPO_data.py
```

This example script will download and upload the following built-in datasets to the local database:
- AIME24: AIME 2024 competition problems
- AIME25: AIME 2025 competition problems
- DAPO-Math-17k: 17k math problems from DAPO dataset
- AFM_web_RL: Web agent reinforcement learning dataset
- WebWalkerQA: Web navigation question-answering dataset
Important: When uploading datasets, make sure that each `DatasetSample` includes the required fields: `dataset`, `source`, `question`, and `answer`. In particular, set the `source` field to `"training_free_grpo"` for proper data processing.
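For example, adapting another HuggingFace dataset usually amounts to mapping its columns onto these four fields and writing a JSONL file that can be uploaded with the script described next. In this sketch, the dataset id and its column names are hypothetical:

```python
# Sketch: convert a (hypothetical) HuggingFace dataset into the required JSONL format.
import json

from datasets import load_dataset

ds = load_dataset("some-org/some-math-dataset", split="train")  # hypothetical dataset id
with open("my_dataset.jsonl", "w") as f:
    for row in ds:
        sample = {
            "dataset": "MyDataset",
            "source": "training_free_grpo",  # required for proper data processing
            "question": row["problem"],      # hypothetical column name
            "answer": row["solution"],       # hypothetical column name
        }
        f.write(json.dumps(sample) + "\n")
```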
If you want to use your own custom datasets, you can upload them from local files using the upload script:
```bash
# Upload custom dataset from local file
python scripts/data/upload_dataset.py \
    --file_path path/to/your_dataset.jsonl \
    --dataset_name YourDataset
```

Important: Ensure your local data file contains samples with all required fields: `dataset`, `source`, `question`, and `answer`. Each sample should be a dictionary with the following structure:
{ "dataset": "YourDataset", # Name of your dataset "source": "training_free_grpo", # Must be "training_free_grpo" for both practice and evaluation data "question": "What is 2+2?", # The question or prompt "answer": "4" # The expected answer, set to None if not applicable }Training-Free GRPO features a flexible and configurable reward calculation system through customizable verification functions. This modular design allows you to define domain-specific evaluation criteria that accurately assess agent performance across diverse tasks, ranging from mathematical reasoning requiring symbolic verification to web search tasks needing semantic understanding with LLMs.
Verification functions must be placed in the `utu/practice/verify/` directory and follow this interface:
```python
from utu.db import EvaluationSample


def verify_func(sample: EvaluationSample, timeout_score: float = 0, **kwargs) -> dict:
    """
    Verify the correctness of an agent response.

    Args:
        sample: EvaluationSample containing:
            - raw_question: The original question, corresponding to the `question` field in DatasetSample
            - correct_answer: Ground truth answer, corresponding to the `answer` field in DatasetSample
            - response: Agent's final response to evaluate
            - other metadata fields
        timeout_score: Score to assign when verification times out
        **kwargs: Additional arguments passed by the processor, including:
            - llm: LLM client for verifications that require LLM judgment; please refer to
              `utu/practice/verify/webwalker.py` for example usage.

    Returns:
        dict: {
            "reward": float,         # ranges from 0.0 to 1.0
            "reasoning": str | None  # extra details provided for experience extraction in training-free GRPO
        }
    """
    # Your verification logic here
    pass
```

The module provides two built-in verification functions for reproducing paper results:
- Math Verification (`utu/practice/verify/math.py`)
  - Verification function: `verify_func`
  - Purpose: Evaluates mathematical problem solutions using symbolic math verification
  - Dependencies: Requires the `math_verify` package for expression extraction and comparison
  - Logic: Compares mathematical expressions extracted from the agent response with the ground truth, using both LaTeX and expression extraction (a hedged sketch of this approach follows this list)
- Web Search Verification (`utu/practice/verify/webwalker.py`)
  - Verification function: `verify_func`
  - Purpose: Evaluates web search task responses using LLM-based judgment
  - Logic: Uses an LLM judge (passed via the `llm` parameter in kwargs) to compare the agent response with the ground truth
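For illustration, here is a minimal sketch of how such a symbolic check can be built with the `math_verify` package, using its documented `parse`/`verify` API. The function below is an assumption for demonstration; the actual built-in implementation in `utu/practice/verify/math.py` may differ in details:

```python
# Hypothetical sketch of a math_verify-based checker (the real built-in lives in
# utu/practice/verify/math.py and may differ). Install with: uv pip install math-verify
from math_verify import parse, verify

from utu.db import EvaluationSample


def sketch_math_verify(sample: EvaluationSample, timeout_score: float = 0, **kwargs) -> dict:
    """Symbolically compare the agent's answer with the ground truth."""
    try:
        gold = parse(sample.correct_answer)  # extract LaTeX/expressions from the ground truth
        pred = parse(sample.response)        # extract LaTeX/expressions from the response
        reward = 1.0 if verify(gold, pred) else 0.0
    except Exception:
        # Fall back to the timeout score instead of crashing a rollout batch.
        reward = timeout_score
    return {"reward": reward, "reasoning": None}
```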
To create custom verification functions:
- Create a Python file in `utu/practice/verify/`
- Implement a function matching the interface above. If an LLM is needed, use `kwargs['llm']` to access the LLM client; refer to `utu/practice/verify/webwalker.py` for an example.
- Add any required dependencies to your environment (e.g., `uv pip install <package_name>`)
Example of a simple string-matching verifier:
```python
# utu/practice/verify/str_match.py
from utu.db import EvaluationSample


def string_match_verify(sample: EvaluationSample, timeout_score: float = 0, **kwargs) -> dict:
    """Simple string matching verification."""
    # Guard against missing answers/responses and normalize case before comparing.
    answer = (sample.correct_answer or "").strip().lower()
    response = (sample.response or "").strip().lower()
    if answer and answer == response:
        return {"reward": 1.0, "reasoning": None}
    return {"reward": 0.0, "reasoning": None}
```

The Training-Free GRPO module uses a hierarchical configuration system that prioritizes YAML configuration files. There are three main configuration directories, in which you can find built-in configurations or create your own custom configurations:
```
configs/agents/                      # For agent configurations
├── practice/                        # Built-in simple agents for practice
│   ├── math_agent.yaml              # Built-in math agent config
│   ├── math_practice_agent.yaml     # Built-in math agent config after practice via training-free GRPO
│   ├── web_agent.yaml               # Built-in web search agent config
│   └── web_practice_agent.yaml      # Built-in web search agent config after practice via training-free GRPO
configs/eval/                        # For evaluation configurations
├── math/
│   ├── math_AIME24.yaml             # Built-in math evaluation settings for AIME24
│   ├── math_AIME25.yaml             # Built-in math evaluation settings for AIME25
│   ├── math_practice_AIME24.yaml    # Built-in math evaluation settings for AIME24 after practice via training-free GRPO
│   └── math_practice_AIME25.yaml    # Built-in math evaluation settings for AIME25 after practice via training-free GRPO
├── web/
│   ├── web.yaml                     # Built-in web evaluation settings for WebWalkerQA
│   └── web_practice.yaml            # Built-in web evaluation settings for WebWalkerQA after practice via training-free GRPO
configs/practice/                    # For practice configurations
├── math_reasoning.yaml              # Built-in configs for math practice
└── web_search.yaml                  # Built-in configs for web search practice
```

Ensure you have an agent configuration file in the `configs/agents/practice/` directory. You can use built-in agents or create your own custom agent configuration. For more details, please refer to Agent Configuration.
Please create the evaluation configuration file in the `configs/eval/my_domain/` directory:
```yaml
# configs/eval/my_domain/my_eval.yaml
# @package _global_
defaults:
  - /agents/practice/my_agent@agent  # Reference your agent config
  - _self_

exp_id: "my_eval"  # Experiment ID for evaluation

# Evaluation dataset configuration
data:
  dataset: "MyEvalDataset"  # Name of evaluation dataset in database
  type: "single"            # Dataset type

# Evaluation settings
concurrency: 64  # Parallel evaluation processes
pass_k: 3        # Pass@K evaluation metric

# Verification function settings
verify_filename: "my_verify.py"     # Custom verification function file
verify_func_name: "my_verify_func"  # Function name in the file

# Optional: Custom judge model settings for LLM-based verification
judge_model:
  model_provider:
    type: ${oc.env:JUDGE_LLM_TYPE}    # define in environment variable (.env) for security
    model: ${oc.env:JUDGE_LLM_MODEL}  # define in environment variable (.env) for security
    base_url: ${oc.env:JUDGE_LLM_BASE_URL}
    api_key: ${oc.env:JUDGE_LLM_API_KEY}
  model_params:
    temperature: 0.5
```

Then create the practice configuration in the `configs/practice/` directory:
```yaml
# configs/practice/my_practice.yaml
# @package _global_
defaults:
  - /eval/my_domain/my_eval@evaluation  # Reference your evaluation config
  - _self_

exp_id: "my_practice"  # Experiment ID for practice

# Practice Arguments
practice:
  epochs: 5                 # Number of epochs for practice
  batch_size: 32            # Samples per batch
  grpo_n: 3                 # Rollouts per GRPO group
  rollout_concurrency: 64   # Parallel rollout processes
  rollout_temperature: 0.7  # LLM temperature during rollouts
  task_timeout: 3600        # Timeout per task (seconds)
  do_eval: false            # Enable evaluation during Training-Free GRPO
  eval_strategy: "epoch"    # Evaluate at the end of each epoch / step; unused if do_eval is false
  restart_step: null        # Use cache for all steps
  # Briefly introduce the working agent to the learning process for experience
  # extraction. Example for a math reasoning agent:
  agent_objective: |
    input: A math question that could fall into any area of mathematics
    output: A step-by-step reasoning process that leads to the final answer
  # Briefly describe the learning goal and expected experiences for the practice
  # process. Example for a math reasoning agent:
  learning_objective: |
    Help the agent improve its solving capability on math questions by extracting general and concise guidelines.
  num_experiences_per_query: 1  # Number of experiences to extract per query

# Data Arguments
data:
  practice_dataset_name: "MyPracticeDataset"  # Name of practice dataset in database
```

After preparation is complete, use the `scripts/run_eval.py` script with configuration files to evaluate the baseline agent before training-free GRPO:
```bash
python scripts/run_eval.py \
    --config_name my_domain/my_eval  # Your evaluation config name
```

After evaluating the baseline, we can proceed to the training-free GRPO process. The system primarily uses YAML configuration files located in the `configs/practice/` directory, with optional command-line parameter overrides:
```bash
# Using the configuration file
python scripts/run_training_free_GRPO.py \
    --config_name my_practice  # Your practice config name

# Using the configuration file with command-line parameter overrides
python scripts/run_training_free_GRPO.py \
    --config_name my_practice \
    --experiment_name my_practice \
    --epochs 5 \
    --batch_size 64
```

To control the restart behavior and caching mechanism, use the `--restart_step` argument:
```bash
# Complete restart (no caching)
python scripts/run_training_free_GRPO.py \
    --config_name my_practice \
    --restart_step 0

# Resume from cached results (default behavior)
python scripts/run_training_free_GRPO.py \
    --config_name my_practice \
    --restart_step null

# Partial restart: use cache for steps 0-2, restart from step 3
python scripts/run_training_free_GRPO.py \
    --config_name my_practice \
    --restart_step 3
```

The practice process will generate:
- Enhanced Agent Configuration: A new YAML file with integrated experiences
- Tracing Logs: Detailed logs of the training-free GRPO process and its progress via Phoenix (if enabled), including:
  - The detailed trajectories during rollouts
  - The step-by-step outputs of experience extraction
  - The extracted experiences and their statistics at each step
  - Evaluation performance (if `do_eval` is enabled) at each epoch or step
- Experience Records: Structured records of learned experiences in the database (configurable via the environment variables in the `.env` file)
When practice completes, you'll receive a path to the enhanced agent configuration file (e.g., `configs/agents/practice/my_practice_agent.yaml`). Next, you can evaluate the enhanced agent on your evaluation datasets. Similarly, create a new evaluation configuration file that references the enhanced agent configuration:
```yaml
# configs/eval/my_domain/my_practice.yaml
# @package _global_
defaults:
  - /agents/practice/my_practice_agent@agent  # Use your enhanced config after practice
  - _self_

exp_id: "my_practice_eval"

# Evaluation dataset configuration
data:
  dataset: "MyEvalDataset"  # Name of evaluation dataset in database
  type: "single"            # Dataset type

# Evaluation settings
concurrency: 64  # Parallel evaluation processes
pass_k: 3        # Pass@K evaluation metric

# Verification function settings
verify_filename: "my_verify.py"     # Custom verification function file
verify_func_name: "my_verify_func"  # Function name in the file

# Optional: Custom judge model settings for LLM-based verification
judge_model:
  model_provider:
    type: ${oc.env:JUDGE_LLM_TYPE}    # define in environment variable (.env) for security
    model: ${oc.env:JUDGE_LLM_MODEL}  # define in environment variable (.env) for security
    base_url: ${oc.env:JUDGE_LLM_BASE_URL}
    api_key: ${oc.env:JUDGE_LLM_API_KEY}
  model_params:
    temperature: 0.5
```

Then run the evaluation with the enhanced agent configuration:
```bash
python scripts/run_eval.py \
    --config_name my_domain/my_practice  # Your evaluation config name for the enhanced agent
```

**Example Workflow: Math Reasoning**

Step 0. Environment Setup: Besides the common environment setup described above, please install the `math-verify` package for math verification:
```bash
uv pip install math-verify
```

Step 1. Data Preparation: Use built-in datasets.
```bash
python scripts/data/process_training_free_GRPO_data.py
```

Step 2. Setup the Verification Functions: The built-in math verification function is already provided in `utu/practice/verify/math.py`.
Step 3. Prepare Configurations: For the deepseek-chat LLM from the official DeepSeek API, you can use the built-in math agent, evaluation, and practice configurations, i.e., `configs/agents/practice/math_agent.yaml`, `configs/eval/math/*.yaml`, and `configs/practice/math_reasoning.yaml`, respectively. Note that the optimal configurations may vary for different LLMs; you can modify these configurations accordingly.
Step 4. Evaluate the Baseline: Use the built-in math evaluation configuration.
```bash
python scripts/run_eval.py --config_name math/math_AIME24
python scripts/run_eval.py --config_name math/math_AIME25
```

Step 5. Run Training-Free GRPO: Use the built-in math reasoning practice configuration.
```bash
python scripts/run_training_free_GRPO.py --config_name math_reasoning
```

After practice, the enhanced agent configuration will be automatically saved as `configs/agents/practice/math_practice_agent.yaml`.
Step 6. Evaluate the Enhanced Agent: Use the built-in math evaluation configuration with the enhanced agent.
```bash
python scripts/run_eval.py --config_name math/math_practice_AIME24
python scripts/run_eval.py --config_name math/math_practice_AIME25
```

**Example Workflow: Web Searching**

Step 0. Environment Setup: Besides the common environment setup described above, please set up additional environment variables in `.env` for web search APIs (e.g., Serper, Jina):
```
SERPER_API_KEY=replace-with-your-serper-api-key
JINA_API_KEY=replace-with-your-jina-api-key
```

Step 1. Data Preparation: Use built-in datasets.
```bash
python scripts/data/process_training_free_GRPO_data.py
```

Step 2. Setup the Verification Functions: The built-in web search verification function is already provided in `utu/practice/verify/webwalker.py`.
Step 3. Prepare Configurations: To run with DeepSeek-V3.2-Exp, you can use the built-in web search agent, evaluation, and practice configurations, i.e., `configs/agents/practice/web_agent.yaml`, `configs/eval/web/web.yaml`, and `configs/practice/web_search.yaml`, respectively. Note that the optimal configurations may vary for different LLMs; you can modify these configurations accordingly.
Step 4. Evaluate the Baseline: Use the built-in web evaluation configuration.
```bash
python scripts/run_eval.py --config_name web/web
```

Step 5. Run Training-Free GRPO: Use the built-in web searching practice configuration.
```bash
python scripts/run_training_free_GRPO.py --config_name web_search
```

After practice, the enhanced agent configuration will be automatically saved as `configs/agents/practice/web_practice_agent.yaml`.
Step 6. Evaluate the Enhanced Agent: Use the built-in web evaluation configuration with the enhanced agent.
```bash
python scripts/run_eval.py --config_name web/web_practice
```

If you find this work useful, please consider citing:
```bibtex
@misc{training_free_grpo,
  title={Training-Free Group Relative Policy Optimization},
  author={Tencent Youtu Lab},
  year={2025},
  eprint={2510.08191},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.08191},
}

@misc{youtu-agent-2025,
  title={Youtu-agent: A Simple yet Powerful Agent Framework},
  author={Tencent Youtu Lab},
  year={2025},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/TencentCloudADP/youtu-agent}},
}
```