
cxcscmu/General-AgentBench


# Benchmark Test-Time Scaling of General LLM Agents

Xiaochuan Li¹, Ryan Ming¹, Pranav Setlur¹, Abhijay Paladugu¹, Andy Tang¹, Hao Kang¹, Shuai Shao², Rong Jin², Chenyan Xiong¹

¹ Language Technologies Institute, Carnegie Mellon University
² Meta


## ❓ Is Test-Time Scaling as Effective as You Think?

*Figure: (a) sequential test-time scaling; (b) parallel test-time scaling.*
  • Sequential scaling hits a context ceiling: performance initially improves with more interaction turns, but then plateaus and even declines as the growing context destabilizes the agent.
  • Parallel scaling suffers from a verifiability gap: while pass@K grows steadily with more samples, self-choice accuracy remains nearly flat — agents cannot reliably identify the correct trajectory among candidates.
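The pass@K metric referenced above is conventionally computed with the unbiased estimator of Chen et al. (2021). A minimal sketch of that estimator (not necessarily this repository's exact implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k trajectories, drawn without replacement from n sampled trajectories
    of which c are correct, solves the task."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 sampled trajectories of which 2 are correct, `pass_at_k(4, 2, 1)` gives 0.5. The verifiability gap is precisely that this oracle metric keeps rising with K while the agent's own pick among the K candidates does not.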

## 🔍 Overview

  • We introduce General AgentBench, a benchmark that provides a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting.

*Figure: Performance comparison between specialized-agent and general-agent settings. Top: absolute performance. Bottom: relative performance degradation under the general-agent setting.*

  • Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). We find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: the context ceiling in sequential scaling and the verifiability gap in parallel scaling.

## 🏗️ Repository Structure

| Directory | Description |
| --- | --- |
| `general_agent/` | General agent system: a unified MCP-based framework that connects a single agent to all benchmark tools simultaneously |
| `benchmarks/` | Individual benchmark implementations (tau2-bench, mcp-bench, swebench, terminalbench, mathhay, deepresearch, etc.) |
| `benchmarks/instructions/` | Setup and running guides for each individual benchmark |

## ✅ Getting Started

### 🎁 API Sources

General AgentBench may call external APIs for (1) LLM inference and (2) benchmark tools.

#### LLM inference (via LiteLLM)

We use LiteLLM-style model strings (a.k.a. “model routes”) to specify which API/provider a run uses. If a model lists multiple routes, all of them were used across different runs or stages.

| Model | Source |
| --- | --- |
| Qwen3-235B | AWS Bedrock |
| Qwen3-Next | Hugging Face via Together (gateway) |
| OpenAI-oss-120B | Hugging Face via Novita (gateway) |
| Gemini-2.5-Flash | Google Gemini API |
| Gemini-2.5-Pro | Google Gemini API |
| Claude-Haiku-4.5 | AWS Bedrock (Anthropic) |
| Claude-Sonnet-4.5 | AWS Bedrock (Anthropic) |
| DeepSeek-R1 | Hugging Face via Novita + Together (gateways) |
| DeepSeek-V3.2 | Hugging Face via Novita + Fireworks (gateways) + AWS Bedrock Converse |
| GPT-5 | OpenAI API |

For the exact LiteLLM model routes used in experiments, see `general_agent/scripts/models.py`.
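To illustrate the `<provider>/<model-id>` convention that LiteLLM routes follow, here is a hedged sketch. The route strings below are hypothetical examples of the syntax, not the routes used in our experiments (those live in `general_agent/scripts/models.py`):

```python
# Hypothetical examples of LiteLLM's "<provider>/<model-id>" route syntax.
# The actual routes used in experiments are in general_agent/scripts/models.py.
EXAMPLE_ROUTES = {
    "Gemini-2.5-Pro": ["gemini/gemini-2.5-pro"],
    "GPT-5": ["openai/gpt-5"],
    # A model with multiple routes was run through each gateway at some stage:
    "DeepSeek-R1": ["novita/deepseek-r1", "together_ai/deepseek-r1"],
}

def provider_of(route: str) -> str:
    """Extract the provider prefix from a LiteLLM model route."""
    return route.split("/", 1)[0]
```

A route is then passed directly as the `model` argument to LiteLLM's completion call, which dispatches to the named provider.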

#### Tooling APIs

| Benchmark | External API |
| --- | --- |
| search | Serper (Google Search API wrapper) |
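For orientation, a Serper search call is a POST of a JSON query to its search endpoint with the key in an `X-API-KEY` header. The sketch below only assembles the request (endpoint and header names per Serper's public API; the repository's actual client may differ) so it can be inspected without sending anything:

```python
import json

SERPER_ENDPOINT = "https://google.serper.dev/search"

def build_serper_request(query: str, api_key: str):
    """Assemble (url, headers, body) for a Serper search call, without sending it."""
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    body = json.dumps({"q": query})
    return SERPER_ENDPOINT, headers, body

url, headers, body = build_serper_request("test-time scaling", "YOUR_API_KEY")
```

The returned triple can be handed to any HTTP client; the response is a JSON object of organic search results.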

## 📚 Citation

If you find this work or code useful, please consider citing:

coming soon

## 📝 License

This project is released under the MIT License. See `LICENSE` for details.
