[Proposal] Optional WFGY 16-problem RAG diagnostics as a workload profile in Benchpark #1253

@onestardao

Description

Summary

Benchpark brings welcome order to complex experiment and benchmarking setups, and many teams now benchmark LLM and RAG workloads alongside classic HPC codes.

I maintain WFGY, an MIT-licensed framework built around a 16-problem failure map (the ProblemMap) for RAG and agent pipelines. WFGY has around 1.5k GitHub stars, and the ProblemMap has been referenced by:

  • Harvard MIMS Lab ToolUniverse
  • QCRI LLM Lab Multimodal RAG Survey
  • University of Innsbruck Rankify project

I think there is a natural fit between Benchpark and this taxonomy, especially for LLM or search-heavy workloads.


Why it might be useful

When evaluating LLM / RAG systems as workloads, people care about:

  • Latency and throughput.
  • GPU and memory usage.
  • The structure of failures, not just their frequency.

For example:

  • A workload that fails mostly by retrieving the wrong chunk has a different profile
    than one that fails through reasoning collapse after correct retrieval.

If a Benchpark experiment could optionally attach WFGY problem labels, users would gain:

  • A standard way to describe what their LLM workload is actually stressing.
  • Better comparability across clusters and configurations.
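
To make the comparability point concrete, here is a minimal sketch of how tagged failures could be aggregated into a per-run failure profile. The record format (`cluster`, `wfgy_problem_tags` keys) is invented for illustration and is not an existing Benchpark structure:

```python
from collections import Counter

# Invented record format: each benchmark run reports the WFGY tags
# of its observed failures as plain strings.
runs = [
    {"cluster": "A", "wfgy_problem_tags": ["No.3 ghost-match retrieval"]},
    {"cluster": "A", "wfgy_problem_tags": ["No.7 cross-document confusion"]},
    {"cluster": "B", "wfgy_problem_tags": ["No.3 ghost-match retrieval"]},
]

def failure_profile(runs):
    """Count how often each failure tag appears across runs."""
    counts = Counter()
    for run in runs:
        counts.update(run["wfgy_problem_tags"])
    return counts

profile = failure_profile(runs)
```

With shared tag names, two clusters running the same workload can be compared by failure structure rather than only by aggregate error rate.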

Proposal

If you think this is within scope, I would propose:

  1. Add an optional “WFGY tag” field to relevant LLM / RAG experiments

     For example:

     wfgy_problem_tags:
       - "No.3 ghost-match retrieval"
       - "No.7 cross-document confusion"

  2. Add a short documentation note

    A short section in the docs for LLM workloads that:

    • Links to the WFGY ProblemMap.
    • Explains that users can tag experiments with problem codes if they want a richer failure description.

This is entirely optional and does not change any core Benchpark logic. It is just extra metadata that some users may find valuable.
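
As a sketch of what the metadata could look like in an experiment spec (every field name here except `wfgy_problem_tags` is hypothetical, chosen only to illustrate placement, not Benchpark's actual schema):

```yaml
# Hypothetical experiment spec -- field names other than
# wfgy_problem_tags are illustrative, not Benchpark's real schema.
experiment:
  name: rag-qa-baseline
  workload: llm-rag
  # Optional extra metadata; ignored by core logic.
  wfgy_problem_tags:
    - "No.3 ghost-match retrieval"
    - "No.7 cross-document confusion"
```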


Contribution

I am happy to:

  • Draft the documentation text in your style.
  • Suggest one or two minimal example experiments that show how the tags look in practice.
  • Open a small PR if this feature request is accepted.
