
feat: add PushUpBench video repetition counting benchmark #1262

Open
anonymous-atom wants to merge 1 commit into EvolvingLMMs-Lab:main from anonymous-atom:pushupbench

Conversation


@anonymous-atom anonymous-atom commented Mar 21, 2026

Summary

  • Add PushUpBench (Li et al., "Your VLM is not good at counting pushups") as an lmms-eval task using the Hugging Face dataset anonymousatom/pushupbench.
  • Project page: https://anonymousse123456.github.io/pushupbench/
  • PushUpBench evaluates vision-language models on exercise repetition counting in long-form workout videos (avg. 36.7 s). It contains 227 test samples of diverse fitness content from 11 creators.
  • The best frontier model (Gemini 3 Flash) achieves 42.1% exact accuracy; open-source 4B models score ~6–12%, on par with supervised baselines. The paper shows that training on counting transfers to general video understanding (MVBench +2.15, PerceptionTest +1.88, TVBench +4.54).

In scope

  • New task folder lmms_eval/tasks/pushupbench/ with:
    • utils.py: pushupbench_doc_to_visual, pushupbench_doc_to_text, pushupbench_doc_to_target, pushupbench_process_results, and aggregation functions for all 4 metrics
    • _default_template_yaml: dataset path, generation kwargs (max_new_tokens: 4096 to support thinking/reasoning models), lmms_eval_specific_kwargs
    • pushupbench.yaml: task config with 4 metrics

Out of scope

  • No changes to existing tasks, models, or eval harness outside lmms_eval/tasks/pushupbench/.
  • No new dependencies beyond existing HF datasets usage.

Metrics

| Metric | Direction | Description |
| --- | --- | --- |
| exact_match | ↑ | Prediction matches any value in the ground-truth count list (%) |
| mae | ↓ | Mean absolute error between prediction and primary ground truth |
| obo | ↑ | Off-by-one accuracy: prediction within ±1 of any ground truth (%) |
| r_squared | ↑ | R² coefficient of determination (outliers with \|error\| > 50 excluded) |
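The only non-standard metric is the outlier-filtered R². A minimal sketch of an aggregation that matches the description above (the item shape and the fallback value for degenerate inputs are assumptions):

```python
def aggregate_r_squared(items, outlier_threshold=50):
    """items: list of {"pred": int | None, "gt": int} from process_results."""
    # Drop unparseable predictions and outliers with |error| > threshold,
    # per the metric description, before computing R².
    pairs = [
        (it["pred"], it["gt"])
        for it in items
        if it["pred"] is not None and abs(it["pred"] - it["gt"]) <= outlier_threshold
    ]
    if len(pairs) < 2:
        return 0.0  # assumed fallback when too few points survive filtering
    gts = [g for _, g in pairs]
    mean_gt = sum(gts) / len(gts)
    ss_res = sum((g - p) ** 2 for p, g in pairs)
    ss_tot = sum((g - mean_gt) ** 2 for g in gts)
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
```

Note that R² can go negative (as in the validation run below): predicting worse than the ground-truth mean yields `ss_res > ss_tot`.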

Validation

Full evaluation run of Qwen3-VL-4B-Thinking with the vLLM backend (RTX 5090, fps=5):

```shell
python -m lmms_eval \
  --model vllm_generate \
  --model_args model=Qwen/Qwen3-VL-4B-Thinking,tensor_parallel_size=1,gpu_memory_utilization=0.9,max_model_len=32768,fps=5 \
  --tasks pushupbench \
  --batch_size 1 \
  --output_path results/pushupbench \
  --log_samples --log_samples_suffix pushupbench
```

Output Table

| Metric | Value |
| --- | --- |
| exact_match | 11.89% |
| mae | 14.39 |
| obo | 22.91% |
| r_squared | -0.09 |

227/227 samples processed successfully. The dataset auto-downloads from Hugging Face, with videos resolved via `create_link: True`.
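As a quick sanity check that the percentage metrics are consistent with the 227-sample test set, each reported rate can be converted back to a whole sample count (helper name is mine, not part of the PR):

```python
def pct_to_count(pct, n=227):
    # Convert a reported percentage back to the implied number of samples.
    return round(pct / 100 * n)

# exact_match 11.89% of 227 implies 27 correct samples;
# obo 22.91% of 227 implies 52 samples within +/-1.
```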

Risk / Compatibility

  • Additive only: new task files under lmms_eval/tasks/pushupbench/. No changes to existing task or model APIs.

Type of Change

  • New benchmark/task
Add PushUpBench, a benchmark for evaluating vision-language models on exercise repetition counting in videos. 227 workout video samples with exact match, MAE, off-by-one, and R² metrics. Dataset: https://huggingface.co/datasets/anonymousatom/pushupbench
