Benchmark Test-Time Scaling of General LLM Agents

Xiaochuan Li¹, Ryan Ming¹, Pranav Setlur¹, Abhijay Paladugu¹, Andy Tang¹, Hao Kang¹,

Shuai Shao², Rong Jin², Chenyan Xiong¹

¹ Language Technology Institute, Carnegie Mellon University
² Meta

❓ Is Test-Time Scaling as Effective as You Think?

(a) Sequential test-time scaling

(b) Parallel test-time scaling

Sequential scaling hits a context ceiling: performance initially improves with more interaction turns, but then plateaus and even declines as the growing context destabilizes the agent.
Parallel scaling suffers from a verifiability gap: while pass@K grows steadily with more samples, self-choice accuracy remains nearly flat — agents cannot reliably identify the correct trajectory among candidates.

🔍 Overview

We introduce General AgentBench, a benchmark that provides a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting.

Performance comparison between specialized-agent and general-agent settings.
Top: Absolute performance. Bottom: Relative performance degradation under the general-agent setting.

Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). We find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling.

🏗️ Repository Structure

Directory	Description
`general_agent/`	General agent system — unified MCP-based framework that connects a single agent to all benchmark tools simultaneously
`benchmarks/`	Individual benchmark implementations (tau2-bench, mcp-bench, swebench, terminalbench, mathhay, deepresearch, etc.)
`benchmarks/instructions/`	Setup and running guides for each individual benchmark

✅ Getting Started

Run the general agent system: see general_agent/scripts/ for experiment scripts and general_agent/README.md for details.
Run individual benchmarks independently: see benchmarks/instructions/ for per-benchmark setup and usage guides.

🎁 API Sources

General AgentBench may call external APIs for (1) LLM inference and (2) benchmark tools.

LLM inference (via LiteLLM)

We use LiteLLM-style model strings (a.k.a. “model routes”) to specify which API/provider a run uses. If a single model has multiple routes listed, it means we have used all of those routes in different runs/stages.

Model	Source
Qwen3-235B	AWS Bedrock
Qwen3-Next	Hugging Face via Together (gateway)
OpenAI-oss-120B	Hugging Face via Novita (gateway)
Gemini-2.5-Flash	Google Gemini API
Gemini-2.5-Pro	Google Gemini API
Claude-Haiku-4.5	AWS Bedrock (Anthropic)
Claude-Sonnet-4.5	AWS Bedrock (Anthropic)
DeepSeek-R1	Hugging Face via Novita + Together (gateways)
DeepSeek-V3.2	Hugging Face via Novita + Fireworks (gateways) + AWS Bedrock Converse
GPT-5	OpenAI API

For the exact LiteLLM model routes used in experiments, see general_agent/scripts/models.py.

Tooling APIs

Benchmark	External API
`search`	Serper (Google Search API wrapper)

📚 Citation

If you find this work or code useful, please consider citing:

coming soon

📝 License

This project is released under the MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets		assets
benchmarks		benchmarks
general_agent		general_agent
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Benchmark Test-Time Scaling of General LLM Agents

❓ Is Test-Time Scaling as Effective as You Think?

🔍 Overview

🏗️ Repository Structure

✅ Getting Started

🎁 API Sources

LLM inference (via LiteLLM)

Tooling APIs

📚 Citation

📝 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Benchmark Test-Time Scaling of General LLM Agents

❓ Is Test-Time Scaling as Effective as You Think?

🔍 Overview

🏗️ Repository Structure

✅ Getting Started

🎁 API Sources

LLM inference (via LiteLLM)

Tooling APIs

📚 Citation

📝 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages