This repository includes a brief introduction, the dataset link, and the source code for SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation (ICASSP 2026).
This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking the Spatially Aligned Audio-Video Generation (SAVG) task. We introduce a spatially aligned audio-visual dataset, whose audio and video data are curated based on whether sound events are onscreen or not. We also propose a new alignment metric that aims to evaluate the spatial alignment between audio and video. Then, using the dataset and metric, we benchmark two types of baseline methods: one is based on a joint audio-video generation model, and the other is a two-stage method that combines a video generation model and a video-to-audio generation model. Our experimental results demonstrate that gaps exist between the baseline methods and the ground truth in terms of video and audio quality, as well as spatial alignment between the two modalities.
demo_joint_1.mp4
demo_joint_2.mp4
demo_two-stage_1.mp4
demo_two-stage_2.mp4
The SAVGBench dataset can be downloaded from Zenodo. We use the development set for training and the evaluation set for evaluation. Note that this dataset has been used in SVG Challenge 2024.
demo_gt_1.mp4
demo_gt_2.mp4
You can use git clone and move into the directory.
```shell
git clone https://github.com/SonyResearch/SAVGBench.git
cd SAVGBench
```

You can download the pretrained models from Google Drive.
```shell
unzip YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1.zip
```

You need to move the pt files of the joint baseline method.
```shell
mkdir joint_method/pretrained_models
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/model330011.pt joint_method/pretrained_models/
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/model_SR_mmdiff_120000.pt joint_method/pretrained_models/
```

You need to move the pth files of the object detection and sound event localization and detection (SELD) models, which are used in the computation of Spatial AV-Align.
```shell
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/yolox_tiny_8x8_300e_coco_20211124_171234-b4047906.pth av_spatial_evaluation/object_detection_svg_infer/
mkdir -p av_spatial_evaluation/stereo_seld_infer/data/model_monitor/20240912162834/
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/params_swa_20240912162834_0040000.pth av_spatial_evaluation/stereo_seld_infer/data/model_monitor/20240912162834/
```

The inference and metric computation have been tested on Python 3.10.19 and torch 2.1.2. For example, you can use a miniforge environment.
```shell
conda create --name savgbench python=3.10 -y
conda activate savgbench
# conda install -c conda-forge ffmpeg  # if needed
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements_joint_and_metric.txt
pip install openmim==0.3.9
mim install mmengine==0.10.7 mmcv==2.1.0 mmdet==3.3.0  # mmdetection for object detection: https://mmdetection.readthedocs.io/en/latest/get_started.html
```

Before running the evaluation script ./evaluation_script.sh, please modify the paths in the script so that they point to the correct locations.
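Before launching the evaluation, you can sanity-check that the checkpoints landed where the scripts expect them. This is a minimal sketch (not part of the repository); the paths are taken from the move commands above:

```python
from pathlib import Path

# Expected checkpoint locations after the "mv" commands above.
EXPECTED_CHECKPOINTS = [
    "joint_method/pretrained_models/model330011.pt",
    "joint_method/pretrained_models/model_SR_mmdiff_120000.pt",
    "av_spatial_evaluation/object_detection_svg_infer/"
    "yolox_tiny_8x8_300e_coco_20211124_171234-b4047906.pth",
    "av_spatial_evaluation/stereo_seld_infer/data/model_monitor/"
    "20240912162834/params_swa_20240912162834_0040000.pth",
]

def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_CHECKPOINTS if not (root / p).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for p in missing:
            print(" -", p)
    else:
        print("All pretrained models are in place.")
```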
The inference script ./joint_method/run_baseline.sh is called from the evaluation script. You may configure the batch size and the GPUs to be used in this file.
You can run the evaluation script.
```shell
./evaluation_script.sh VIDEO_OUTPUT_DIR VIDEO_REFERENCE_DIR NUM_GPUS
```

Please use absolute paths. For example, to run on 1 GPU:
```shell
./evaluation_script.sh /home/USER_DIR/SAVGBench/result_outputs/ /home/USER_DIR/SAVGBench/SAVGBench_Dataset_Evaluation/video_eval/ 1
```

It takes around 30 minutes to generate 96 samples (the default). Then you can check the evaluation results in ./results.out.
```
AV spatial each component: num_TP: 2382, num_FN: 2320, num_TP + num_FN: 4702
AV_final_score: 0.5065929391748192
{'fvd': 1218.8477783203125, 'kvd': 55.60466751187866, 'fad': 9.357064962387085}
Temporal AV-Align: 0.6680675666494543
```

The results may vary depending on random seeds etc., but you should obtain results similar to the ones reported. Please note that the FAD score is better than the value in the paper because a bug in the FAD computation was fixed.
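For reference, the printed counts are consistent with the aggregate spatial score being the fraction of sound events matched across modalities, i.e. TP / (TP + FN). A minimal sketch (the helper name is ours, not the repository's):

```python
def spatial_av_align_score(num_tp: int, num_fn: int) -> float:
    """Aggregate spatial alignment score as the true-positive fraction.

    Assumption: the final score equals num_TP / (num_TP + num_FN), which is
    consistent with the counts and AV_final_score shown in the sample output.
    """
    return num_tp / (num_tp + num_fn)

print(spatial_av_align_score(2382, 2320))  # matches AV_final_score above
```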
See joint_method/TRAINING_JOINT.md.
If you find this repository useful, please consider citing:
```bibtex
@inproceedings{shimada2026savgbench,
  title={SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation},
  author={Shimada, Kazuki and Simon, Christian and Shibuya, Takashi and Takahashi, Shusuke and Mitsufuji, Yuki},
  booktitle={Proc. of IEEE ICASSP},
  year={2026}
}
```

This repository is built on the following papers and open-source repositories.
- Joint Method and Video/Audio Quality Metrics: MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
- Temporal AV-Align Metric: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

