[Proposal] Optional WFGY 16-problem RAG diagnostics as a workload profile in Benchpark #1253

@onestardao

Description

Summary

Benchpark brings welcome order to complex experiment and benchmarking setups, and many teams now benchmark LLM and RAG workloads alongside classic HPC codes.

I maintain WFGY, an MIT-licensed framework built around a 16-problem failure map (the ProblemMap) for RAG and agent pipelines. WFGY has around 1.5k GitHub stars, and the ProblemMap has been referenced by:

  • Harvard MIMS Lab ToolUniverse
  • QCRI LLM Lab Multimodal RAG Survey
  • University of Innsbruck Rankify project

I think there is a natural fit between Benchpark and this taxonomy, especially for LLM or search-heavy workloads.


Why it might be useful

When evaluating LLM / RAG systems as workloads, people care about:

  • Latency and throughput.
  • GPU and memory usage.
  • The structure of failures, not just their frequency.

For example:

  • A workload that fails mostly by retrieving the wrong chunk has a different profile
    than one that fails through reasoning collapse after correct retrieval.

If a Benchpark experiment could optionally attach WFGY problem labels, users would gain:

  • A standard way to describe what their LLM workload is actually stressing.
  • Better comparability across clusters and configurations.
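
To make the comparability point concrete, here is a minimal sketch of how tagged failures could be aggregated into a per-run failure profile. The record format (`cluster`, `wfgy_problem_tags` keys) is invented for illustration and is not an existing Benchpark structure:

```python
from collections import Counter

# Invented record format: each benchmark run reports the WFGY tags
# of its observed failures as plain strings.
runs = [
    {"cluster": "A", "wfgy_problem_tags": ["No.3 ghost-match retrieval"]},
    {"cluster": "A", "wfgy_problem_tags": ["No.7 cross-document confusion"]},
    {"cluster": "B", "wfgy_problem_tags": ["No.3 ghost-match retrieval"]},
]

def failure_profile(runs):
    """Count how often each failure tag appears across runs."""
    counts = Counter()
    for run in runs:
        counts.update(run["wfgy_problem_tags"])
    return counts

profile = failure_profile(runs)
```

With shared tag names, two clusters running the same workload can be compared by failure structure rather than only by aggregate error rate.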

Proposal

If you think this is within scope, I would propose:

  1. Add an optional “WFGY tag” field to relevant LLM / RAG experiments

     For example:

     wfgy_problem_tags:
       - "No.3 ghost-match retrieval"
       - "No.7 cross-document confusion"

  2. Add a short documentation note

    A short section in the docs for LLM workloads that:

    • Links to the WFGY ProblemMap.
    • Explains that users can tag experiments with problem codes if they want a richer failure description.

This is entirely optional and does not change any core Benchpark logic. It is just extra metadata that some users may find valuable.
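
As a sketch of what the metadata could look like in an experiment spec (every field name here except `wfgy_problem_tags` is hypothetical, chosen only to illustrate placement, not Benchpark's actual schema):

```yaml
# Hypothetical experiment spec -- field names other than
# wfgy_problem_tags are illustrative, not Benchpark's real schema.
experiment:
  name: rag-qa-baseline
  workload: llm-rag
  # Optional extra metadata; ignored by core logic.
  wfgy_problem_tags:
    - "No.3 ghost-match retrieval"
    - "No.7 cross-document confusion"
```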


Contribution

I am happy to:

  • Draft the documentation text in your style.
  • Suggest one or two minimal example experiments that show how the tags look in practice.
  • Open a small PR if this feature request is accepted.
