Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery
ResearchClawBench is a benchmark that measures whether AI coding agents can independently conduct scientific research, from reading raw data to producing publication-quality reports, and then rigorously evaluates the results against real human-authored papers.
Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: given the same data and tools a human researcher had, can an AI agent arrive at the same (or better) scientific conclusions?
| Feature | Description |
|---|---|
| Two-Stage Pipeline | Autonomous research + rigorous peer-review-style evaluation |
| 40 Real-Science Tasks | 10 disciplines, complete datasets from published papers |
| Expert-Annotated Data | Tasks, checklists & datasets curated by domain experts |
| Multi-Agent Support | Claude Code, Codex CLI, OpenClaw, Nanobot & custom agents |
| Re-Discovery to New-Discovery | 50 = match the paper, 70+ = surpass it |
| Fine-Grained Checklist | Per-item keywords, weights & reasoning |
| Live Streaming UI | Watch agents code, plot & write in real time |
| Lightweight Dependencies | Pure Flask + vanilla JS, no heavy frameworks |
- 2026-03-20: Added Nanobot as a new agent, an ultra-lightweight OpenClaw alternative with reliable multi-step tool execution. Agent config moved to `agents.json` for easy customization.
- 2026-03-19: Initial release with Claude Code, Codex CLI, and OpenClaw support. 40 tasks across 10 scientific domains.
Most AI benchmarks evaluate what models know. We evaluate what agents can do.
- Real science, not toy problems. 40 tasks sourced from published papers across 10 disciplines, each with complete experimental datasets.
- Two-stage pipeline. Autonomous research first, rigorous evaluation second, just like peer review.
- Fine-grained, multimodal scoring. A weighted checklist with text and image criteria, judged by an LLM acting as a strict peer reviewer.
- Agent-agnostic. Ships with first-class support for Claude Code, Codex CLI, OpenClaw, and Nanobot. Bring your own agent in one line.
- From Re-Discovery to New-Discovery. Scoring above 50 means matching the original paper; above 70 means surpassing it. The frontier is wide open.
Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:

    flowchart TD
        A["High-Quality Paper Collection\n(Target Paper)"] --> B["Human Expert Extraction\n(Core Task Instructions)"]
        B --> C["Evaluation Checklist\n(Criteria + Keywords + Weights)"]
        B --> D["Data & Related Work Collection\n(Datasets + Reference Papers)"]
        C --> E["Human Reproduction & Validation\n(Verify checklist is reproducible)"]
        D --> E
        style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
        style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
        style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
        style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
        style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px

- High-Quality Paper Collection: Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.
- Expert Task Extraction: Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.
- Checklist Design: Experts create a fine-grained evaluation checklist with weighted criteria (text and image items), each with specific technical keywords that a judge must verify.
- Data & Related Work Collection: The original datasets used in the paper are gathered, along with relevant reference materials, to form a self-contained research workspace.
- Human Reproduction & Validation: Human researchers independently reproduce the paper's results using only the provided data and instructions, verifying that every checklist item is achievable. This ensures the benchmark is fair and the checklist is grounded in reality.

ResearchClawBench operates in two distinct stages:

    flowchart LR
        subgraph Stage1["Stage 1: Auto Research"]
            A["Raw Data\n+ Instructions"] --> B["AI Agent\n(autonomous)"]
            B --> C["Code\n+ Figures\n+ Report"]
        end
        subgraph Stage2["Stage 2: Evaluation"]
            C --> D["LLM Judge"]
            E["Target Paper\n+ Checklist"] --> D
            D --> F["Per-Item Scores\n+ Reasoning"]
        end
        style Stage1 fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
        style Stage2 fill:#fff7ed,stroke:#f59e0b,stroke-width:2px

The AI agent receives a workspace containing raw datasets, reference materials, and task instructions. It must independently:
- Explore the data and understand the research question
- Write code to analyze, model, and visualize the data
- Produce a research report (`report/report.md`) with figures, methodology, results, and discussion
No hand-holding. No chain-of-thought hints. The agent works in its own sandboxed workspace with full tool access, just like a real researcher.
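As a rough illustration of the Stage 1 deliverable, the sketch below checks that `report/report.md` exists in a workspace and that every locally referenced figure is actually on disk. The function name `missing_figures` and the idea of validating figure links are assumptions made for this example; they are not taken from the benchmark's own code.

```python
import re
from pathlib import Path

# Matches Markdown image syntax ![alt](path) and captures the path.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def missing_figures(workspace: Path) -> list[str]:
    """Return image paths referenced in report/report.md that are not on disk.

    Hypothetical helper: the benchmark may validate reports differently.
    """
    report = workspace / "report" / "report.md"
    if not report.is_file():
        raise FileNotFoundError(f"expected deliverable not found: {report}")

    text = report.read_text(encoding="utf-8")
    missing = []
    for ref in IMAGE_REF.findall(text):
        if ref.startswith(("http://", "https://")):
            continue  # remote images are not checked here
        # Relative paths are resolved against the report's own directory.
        if not (report.parent / ref).is_file():
            missing.append(ref)
    return missing
```

An empty return value means every local figure link in the report resolves to a file.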
Once the agent finishes, its report is evaluated against the original published paper using a fine-grained checklist. The judge receives the task instructions, the AI report, and the checklist criteria, then scores each item using a dual-mode rubric:

    flowchart TD
        subgraph Inputs
            I["INSTRUCTIONS.md\n(task background)"]
            R["Agent Report\n(text + figures)"]
            CL["Checklist\n(from target paper)"]
        end
        I & R & CL --> J["Multimodal LLM Judge"]
        J --> DET{"Determine\nEvaluation Mode"}
        DET -->|"Quantitative\nresults"| OBJ["Mode A: Objective\n(Metric Optimization)"]
        DET -->|"Qualitative\nreasoning"| SUB["Mode B: Subjective\n(Mechanism Analysis)"]
        OBJ --> SO["Score by metric\naccuracy vs paper"]
        SUB --> SS["Score by evidence\nstrength vs paper"]
        SO & SS --> T["Per-Item Scores\n+ Reasoning\n→ Weighted Total"]
        style Inputs fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
        style J fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
        style OBJ fill:#dbeafe,stroke:#3b82f6,stroke-width:2px
        style SUB fill:#fce7f3,stroke:#ec4899,stroke-width:2px
        style T fill:#f0fdf4,stroke:#22c55e,stroke-width:2px

Each checklist item includes:
- Specific criteria extracted from the paper's key contributions
- Technical keywords the judge must verify (e.g., "ROC-AUC improvement", "Monte Carlo integration")
- Weight reflecting the item's importance
- Type: `text` for methodology/findings, `image` for figure comparison (multimodal vision)
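The four fields listed above map naturally onto a small record type. The sketch below is one possible representation, together with a weighted total computed as a weight-normalized average of per-item scores. The field names and the normalization are assumptions for illustration; the source only states that per-item scores are combined into a weighted total, not the exact formula or schema.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """Hypothetical in-memory form of one checklist entry."""
    criterion: str        # what the report must demonstrate
    keywords: list[str]   # technical terms the judge must verify
    weight: float         # relative importance of this item
    type: str = "text"    # "text" or "image"


def weighted_total(items: list[ChecklistItem], scores: list[float]) -> float:
    """Weight-normalized average of per-item scores (assumed aggregation)."""
    if len(items) != len(scores):
        raise ValueError("need exactly one score per checklist item")
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(item.weight * s for item, s in zip(items, scores))
    return weighted_sum / total_weight
```

Normalizing by the total weight keeps the result on the same 0 to 100 scale as the per-item scores, whatever units the weights use.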
The judge automatically determines which evaluation mode applies to each item, then scores it with the corresponding rubric (see below).
For checklist items involving specific numerical results, metrics, or quantitative outcomes:
| Score | Meaning |
|---|---|
| 0 | Criterion completely absent |
| 1-10 | Mentioned but no quantitative results provided |
| 11-20 | Results given but methodology has fundamental errors |
| 21-30 | Significant methodological flaws; metrics deviate severely |
| 31-40 | Methodology mostly correct but metrics notably worse than the paper |
| 41-50 | Metrics roughly comparable to the paper |
| 51-60 | Metrics slightly better than the paper |
| 61-70 | Metrics clearly better than the paper |
| 71-80 | Methodology and metrics both substantially improved |
| 81-90 | Metrics dramatically surpass the paper |
| 91-100 | Breakthrough results far exceeding the paper |
For checklist items involving theoretical explanations, mechanistic insights, or interpretive analysis:
| Score | Meaning |
|---|---|
| 0 | Criterion completely absent |
| 1-10 | Mentioned only with vague, generic statements |
| 11-20 | Some description but no substantive analysis |
| 21-30 | Analysis attempted but evidence insufficient or logic has gaps |
| 31-40 | Correct direction but lacks depth; key arguments missing |
| 41-50 | Analysis depth and rigor comparable to the paper |
| 51-60 | More supporting evidence provided than the paper |
| 61-70 | More complete logical chain and more rigorous argumentation |
| 71-80 | Significantly deeper analysis with novel insights |
| 81-90 | Analysis depth far exceeds the paper |
| 91-100 | Original contributions with breakthrough insights |
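Both rubrics share the same band boundaries (0, then ten-point bands up to 100), so a score can be mapped back to its band description with simple integer arithmetic. The sketch below shows one way to do that lookup; `band_label` and the two tuples are illustrative names, and the labels are copied from the tables above.

```python
OBJECTIVE_BANDS = (
    "Mentioned but no quantitative results provided",
    "Results given but methodology has fundamental errors",
    "Significant methodological flaws; metrics deviate severely",
    "Methodology mostly correct but metrics notably worse than the paper",
    "Metrics roughly comparable to the paper",
    "Metrics slightly better than the paper",
    "Metrics clearly better than the paper",
    "Methodology and metrics both substantially improved",
    "Metrics dramatically surpass the paper",
    "Breakthrough results far exceeding the paper",
)

SUBJECTIVE_BANDS = (
    "Mentioned only with vague, generic statements",
    "Some description but no substantive analysis",
    "Analysis attempted but evidence insufficient or logic has gaps",
    "Correct direction but lacks depth; key arguments missing",
    "Analysis depth and rigor comparable to the paper",
    "More supporting evidence provided than the paper",
    "More complete logical chain and more rigorous argumentation",
    "Significantly deeper analysis with novel insights",
    "Analysis depth far exceeds the paper",
    "Original contributions with breakthrough insights",
)


def band_label(score: int, mode: str) -> str:
    """Map a 0-100 score to its rubric description for the given mode."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score == 0:
        return "Criterion completely absent"
    bands = OBJECTIVE_BANDS if mode == "objective" else SUBJECTIVE_BANDS
    # Scores 1-10 fall in band 0, 11-20 in band 1, ..., 91-100 in band 9.
    return bands[(score - 1) // 10]
```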
Strict by design. The judge is highly skeptical of AI-generated content: plausible-sounding claims must be backed by concrete evidence. Longer reports do not score higher. Substance over style.
Each domain contains 4 carefully curated tasks with complete experimental data from real published research:
| Domain | Example Topics | Data Types |
|---|---|---|
| Astronomy | Black hole superradiance, Bayesian stellar inference | .dat, .csv |
| Chemistry | GNN molecular prediction, protein-ligand docking | .pdb, .sdf, .csv |
| Earth | Glacier mass balance, climate datasets | .csv, multi-region series |
| Energy | Battery degradation, renewable energy modeling | .xlsx, time series |
| Information | NLP benchmarks, deep learning analysis | .pdf, .tex, .ipynb |
| Life | Nanopore sequencing, genomic analysis | .csv, .xlsx |
| Material | Materials property prediction, pretrained models | .pt, .csv |
| Math | Multi-agent pathfinding, optimization | .json, .npy, grid maps |
| Neuroscience | Neural decoding, brain signal processing | .csv, .h5, .yaml |
| Physics | Quantum geometry, superfluid stiffness | .h5, .json, .csv |
40 tasks total, each a self-contained research challenge selected from high-quality human-authored publications, spanning the full spectrum from data analysis to novel scientific insight.

    git clone https://github.com/InternScience/ResearchClawBench.git
    cd ResearchClawBench
    pip install -r evaluation/requirements.txt

Create `evaluation/.env` with your scoring model credentials:

    OPENAI_API_KEY=sk-xxx
    OPENAI_BASE_URL=https://api.openai.com/v1
    SCORER_MODEL=gpt-5.1

Launch the server:

    python -m evaluation

Open http://localhost:5000, browse tasks, pick an agent, hit Start Run, and watch the research happen live.
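Before launching, it can help to confirm that the `.env` file defines the three variables shown above. The following is a minimal sketch of a `KEY=VALUE` parser and a completeness check; it is not how the benchmark itself loads its configuration, and treating all three keys as required is an assumption.

```python
# Assumption: all three variables from the example .env are required.
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_BASE_URL", "SCORER_MODEL")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so URLs with "=" in them survive.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

For example, `missing_keys(parse_env(open("evaluation/.env").read()))` returns an empty list when the file is complete.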
After a run completes, switch to the Evaluation tab and click Score. The multimodal LLM judge evaluates each checklist item and returns per-item scores with reasoning.
ResearchClawBench ships with built-in support for four frontier coding agents: Claude Code, Codex CLI, OpenClaw, and Nanobot.
Agent configuration is stored in evaluation/agents.json. To add a new agent, simply append an entry:

    {
      "my_agent": {
        "label": "My Agent",
        "icon": "M",
        "logo": "/static/logos/my_agent.svg",
        "cmd": "my-agent run -m <PROMPT> -w <WORKSPACE>"
      }
    }

| Placeholder | Replaced With | Notes |
|---|---|---|
| `<PROMPT>` | Prompt content (via file path or `$(cat ...)`) | Required. For `-p` style flags, replaced with the file path; otherwise replaced with `"$(cat 'path')"` to pass the content |
| `<WORKSPACE>` | Absolute path to the workspace directory | Optional. Only replaced if present in `cmd` |
The prompt injected into `<PROMPT>` is auto-generated from `evaluation/instructions_tmpl.py`, which combines a unified agent persona (autonomous execution guidelines, workspace sandbox rules) with task-specific instructions (description, data files, deliverables). All agents receive the exact same prompt. Adding an agent requires no code changes: just edit the JSON file and restart the server.
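To make the placeholder rules concrete, here is a hedged sketch of how a `cmd` template could be expanded. `build_command` is a hypothetical function, not the benchmark's implementation, and it assumes that a "`-p` style flag" means `<PROMPT>` appears directly after `-p` in the template.

```python
import shlex


def build_command(cmd_template: str, prompt_path: str, workspace: str) -> str:
    """Expand <PROMPT> and <WORKSPACE> in an agent command template.

    Sketch only; the real substitution logic may differ in its details.
    """
    if "<PROMPT>" not in cmd_template:
        raise ValueError("cmd must contain the required <PROMPT> placeholder")

    quoted_path = shlex.quote(prompt_path)
    if "-p <PROMPT>" in cmd_template:
        # Assumed "-p style": the agent reads the prompt from a file path.
        prompt_value = quoted_path
    else:
        # Otherwise let the shell pass the file's content as the argument.
        prompt_value = f'"$(cat {quoted_path})"'

    cmd = cmd_template.replace("<PROMPT>", prompt_value)
    # <WORKSPACE> is optional; replace() is a no-op when it is absent.
    return cmd.replace("<WORKSPACE>", shlex.quote(workspace))
```

With the `my_agent` entry above, a prompt at `/tmp/ws/prompt.md`, and a workspace of `/tmp/ws`, this yields `my-agent run -m "$(cat /tmp/ws/prompt.md)" -w /tmp/ws`.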
You can view the leaderboard on our Website, which is updated in real time.
The built-in dashboard aggregates the best score per (task, agent) pair and displays:
- Frontier chart: best score per task across all agents
- Leaderboard table: clickable cells linking to individual runs
- Per-task breakdown: view any agent's report, code, and score reasoning
The frontier represents the state of the art: every point above 50 is uncharted territory where AI surpasses human researchers on that specific task.
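The two aggregations described above (best score per (task, agent) pair, then best score per task across agents) can be sketched in a few lines. The tuple layout of a run record and the function names are assumptions for illustration, not the dashboard's actual data model.

```python
def best_per_pair(runs):
    """Keep the highest score for each (task, agent) pair.

    `runs` is an iterable of (task, agent, score) tuples (assumed layout).
    """
    best = {}
    for task, agent, score in runs:
        key = (task, agent)
        if key not in best or score > best[key]:
            best[key] = score
    return best


def frontier(best):
    """Reduce per-pair bests to the best score per task across all agents."""
    front = {}
    for (task, _agent), score in best.items():
        front[task] = max(score, front.get(task, score))
    return front
```

Running `frontier(best_per_pair(runs))` gives one number per task, which is what the frontier chart plots.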

    ResearchClawBench/
    ├── evaluation/                 # Core evaluation framework
    │   ├── server.py               # Flask API + SSE streaming
    │   ├── run_task.py             # Workspace setup + agent subprocess
    │   ├── score.py                # Multimodal LLM scoring engine
    │   ├── config.py               # Paths, constants, loads agents.json
    │   ├── agents.json             # Agent presets (add your own here)
    │   ├── instructions_tmpl.py    # Unified prompt template for all agents
    │   ├── utils.py                # File tree, path safety, discovery
    │   ├── static/app.js           # Single-file frontend
    │   └── templates/index.html    # Entry point
    ├── tasks/                      # 40 research tasks
    │   ├── Astronomy_000/
    │   │   ├── task_info.json      # Task description + data manifest
    │   │   ├── data/               # Raw experimental datasets
    │   │   ├── related_work/       # Reference papers
    │   │   └── target_study/       # Paper + checklist + images
    │   ├── Chemistry_000/
    │   └── ...                     # 10 domains x 4 tasks
    └── workspaces/                 # Generated at runtime (gitignored)

We welcome contributions in several forms; see CONTRIBUTING.md for detailed guidelines.
- New tasks: Add research challenges in existing or new domains
- New agents: Add presets for emerging coding agents
- Bug reports: Open an issue
Email: xu_wanghan@sjtu.edu.cn
If you would like to cite our work, please use the following BibTeX.

    @article{xu2025probing,
      title={Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows},
      author={Xu, Wanghan and Zhou, Yuhao and Zhou, Yifan and Cao, Qinglong and Li, Shuo and Bu, Jia and Liu, Bo and Chen, Yixin and He, Xuming and Zhao, Xiangyu and others},
      journal={arXiv preprint arXiv:2512.16969},
      year={2025}
    }




