
feat: add PushUpBench video repetition counting benchmark #1262

Open
anonymous-atom wants to merge 1 commit into EvolvingLMMs-Lab:main from anonymous-atom:pushupbench

Conversation


@anonymous-atom anonymous-atom commented Mar 21, 2026

Summary

  • Add PushUpBench (Li et al., "Your VLM is not good at counting pushups") as an lmms-eval task using the Hugging Face dataset anonymousatom/pushupbench.
  • Project page: https://anonymousse123456.github.io/pushupbench/
  • PushUpBench evaluates vision-language models on exercise repetition counting in long-form workout videos (avg. 36.7 s). It contains 227 test samples of diverse fitness content from 11 creators.
  • The best frontier model (Gemini 3 Flash) achieves 42.1% exact accuracy; open-source 4B models score ~6–12%, on par with supervised baselines. The paper shows that training on counting transfers to general video understanding (MVBench +2.15, PerceptionTest +1.88, TVBench +4.54).

In scope

  • New task folder lmms_eval/tasks/pushupbench/ with:
    • utils.py: pushupbench_doc_to_visual, pushupbench_doc_to_text, pushupbench_doc_to_target, pushupbench_process_results, and aggregation functions for all 4 metrics
    • _default_template_yaml: dataset path, generation kwargs (max_new_tokens: 4096 to support thinking/reasoning models), lmms_eval_specific_kwargs
    • pushupbench.yaml: task config with 4 metrics

Out of scope

  • No changes to existing tasks, models, or eval harness outside lmms_eval/tasks/pushupbench/.
  • No new dependencies beyond existing HF datasets usage.

Metrics

| Metric | Direction | Description |
| --- | --- | --- |
| exact_match | ↑ | Prediction matches any value in the ground-truth count list (%) |
| mae | ↓ | Mean absolute error between prediction and primary ground truth |
| obo | ↑ | Off-by-one accuracy: prediction within ±1 of any ground truth (%) |
| r_squared | ↑ | R² coefficient of determination (outliers with \|error\| > 50 excluded) |
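The only non-standard metric is the outlier-filtered R². A minimal sketch of an aggregation that matches the description above (the item shape and the fallback value for degenerate inputs are assumptions):

```python
def aggregate_r_squared(items, outlier_threshold=50):
    """items: list of {"pred": int | None, "gt": int} from process_results."""
    # Drop unparseable predictions and outliers with |error| > threshold,
    # per the metric description, before computing R².
    pairs = [
        (it["pred"], it["gt"])
        for it in items
        if it["pred"] is not None and abs(it["pred"] - it["gt"]) <= outlier_threshold
    ]
    if len(pairs) < 2:
        return 0.0  # assumed fallback when too few points survive filtering
    gts = [g for _, g in pairs]
    mean_gt = sum(gts) / len(gts)
    ss_res = sum((g - p) ** 2 for p, g in pairs)
    ss_tot = sum((g - mean_gt) ** 2 for g in gts)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
```

Note that R² can go negative (as in the validation run below): predicting worse than the ground-truth mean yields `ss_res > ss_tot`.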

Validation

Full evaluation run of Qwen3-VL-4B-Thinking with the vLLM backend (RTX 5090, fps=5):

```shell
python -m lmms_eval \
  --model vllm_generate \
  --model_args model=Qwen/Qwen3-VL-4B-Thinking,tensor_parallel_size=1,gpu_memory_utilization=0.9,max_model_len=32768,fps=5 \
  --tasks pushupbench \
  --batch_size 1 \
  --output_path results/pushupbench \
  --log_samples --log_samples_suffix pushupbench
```

Output Table

| Metric | Value |
| --- | --- |
| exact_match | 11.89% |
| mae | 14.39 |
| obo | 22.91% |
| r_squared | -0.09 |

227/227 samples processed successfully. The dataset auto-downloads from Hugging Face, with videos resolved via `create_link: True`.
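As a quick sanity check that the percentage metrics are consistent with the 227-sample test set, each reported rate can be converted back to a whole sample count (helper name is mine, not part of the PR):

```python
def pct_to_count(pct, n=227):
    # Convert a reported percentage back to the implied number of samples.
    return round(pct / 100 * n)

# exact_match 11.89% of 227 implies 27 correct samples;
# obo 22.91% of 227 implies 52 samples within +/-1.
```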

Risk / Compatibility

  • Additive only: new task files under lmms_eval/tasks/pushupbench/. No changes to existing task or model APIs.

Type of Change

  • New benchmark/task
Add PushUpBench, a benchmark for evaluating vision-language models on exercise repetition counting in videos. 227 workout video samples with exact match, MAE, off-by-one, and R² metrics. Dataset: https://huggingface.co/datasets/anonymousatom/pushupbench
