Summary
Benchpark is a very helpful way to bring order to complex experiment and benchmarking setups.
Many teams now benchmark LLM and RAG workloads in addition to classic HPC codes.
I maintain WFGY, an MIT-licensed framework with a 16-problem RAG and agent failure map:
- WFGY ProblemMap (RAG + agents, 16 problems)
https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
WFGY has around 1.5k GitHub stars, and the ProblemMap has been referenced by:
- Harvard MIMS Lab ToolUniverse
- QCRI LLM Lab Multimodal RAG Survey
- University of Innsbruck Rankify project
I think there is a natural fit between Benchpark and this taxonomy, especially for LLM or search-heavy workloads.
Why it might be useful
When evaluating LLM / RAG systems as workloads, people care about:
- Latency and throughput.
- GPU and memory usage.
- The structure of failures, not just raw performance.
For example:
- A workload that fails mostly through "wrong chunk retrieved" has a different profile than one that fails through "reasoning collapse after correct retrieval".
If a Benchpark experiment could optionally attach WFGY problem labels, users would gain:
- A standard way to describe what their LLM workload is actually stressing.
- Better comparability across clusters and configurations.
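To make the comparability point concrete, here is a minimal sketch of how tagged failures could be summarized per cluster. It is not part of Benchpark: the record layout, the cluster names, the counts, and the `failure_profile` helper are all hypothetical, and only the two tag strings come from this proposal.

```python
from collections import Counter

# Hypothetical failure records: (cluster name, WFGY problem tag).
# Cluster names and counts are invented for illustration only.
failures = [
    ("cluster_a", "No.3 ghost-match retrieval"),
    ("cluster_a", "No.3 ghost-match retrieval"),
    ("cluster_a", "No.7 cross-document confusion"),
    ("cluster_b", "No.7 cross-document confusion"),
]


def failure_profile(records, cluster):
    """Return the share of each problem tag among one cluster's failures."""
    counts = Counter(tag for name, tag in records if name == cluster)
    total = sum(counts.values())
    # An unknown cluster has no failures recorded, so its profile is empty.
    return {tag: n / total for tag, n in counts.items()} if total else {}


profile_a = failure_profile(failures, "cluster_a")
profile_b = failure_profile(failures, "cluster_b")
```

With shared tags, two runs can be compared by failure mix as well as by latency or throughput.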
Proposal
If you think this is within scope, I would propose:
1. Add an optional “WFGY tag” field to relevant LLM / RAG experiments
   For example:
       wfgy_problem_tags:
         - No.3 ghost-match retrieval
         - No.7 cross-document confusion
2. Short documentation note
   A small section in the docs, for LLM workloads, that:
   - Links to the WFGY ProblemMap.
   - Explains that users can tag experiments with problem codes if they want a richer failure description.
This is entirely optional and does not change any core Benchpark logic. It is just extra metadata that some users may find valuable.
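As a rough illustration of how lightweight this metadata could be, the sketch below reads the optional field from an experiment description that has already been loaded into a dictionary. The tag shape ("No.<number> <label>") is an assumption inferred from the example above, and `parse_wfgy_tag` and `read_tags` are hypothetical helpers, not existing Benchpark functions.

```python
import re

# Assumed tag shape, inferred from the example above: "No.<number> <short label>".
TAG_PATTERN = re.compile(r"^No\.(\d+)\s+(\S.*)$")
NUM_PROBLEMS = 16  # the ProblemMap lists 16 problems


def parse_wfgy_tag(tag):
    """Split a tag into (problem number, label); raise ValueError if malformed."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"malformed WFGY tag: {tag!r}")
    number = int(match.group(1))
    if not 1 <= number <= NUM_PROBLEMS:
        raise ValueError(f"problem number out of range: {number}")
    return number, match.group(2).strip()


def read_tags(experiment):
    """Return parsed tags, or an empty list when the optional field is absent."""
    return [parse_wfgy_tag(t) for t in experiment.get("wfgy_problem_tags", [])]
```

Experiments that omit the field yield an empty list, so existing experiments would behave as before.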
Contribution
I am happy to:
- Draft the documentation text in your style.
- Suggest one or two minimal example experiments that show how the tags look in practice.
- Open a small PR if this feature request is accepted.