
SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation

This repository provides a brief introduction, the dataset link, and the source code for SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation (ICASSP 2026).

Abstract

This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking the Spatially Aligned Audio-Video Generation (SAVG) task. We introduce a spatially aligned audio-visual dataset, whose audio and video data are curated based on whether sound events are onscreen or not. We also propose a new alignment metric that aims to evaluate the spatial alignment between audio and video. Then, using the dataset and metric, we benchmark two types of baseline methods: one is based on a joint audio-video generation model, and the other is a two-stage method that combines a video generation model and a video-to-audio generation model. Our experimental results demonstrate that gaps exist between the baseline methods and the ground truth in terms of video and audio quality, as well as spatial alignment between the two modalities.

Data Examples:

Spatial AV-Align Metric:

Generated Videos

demo_joint_1.mp4
demo_joint_2.mp4
demo_two-stage_1.mp4
demo_two-stage_2.mp4

SAVGBench Dataset

The SAVGBench dataset can be downloaded from Zenodo. We use the development set for training and the evaluation set for evaluation. Note that this dataset was also used in the SVG Challenge 2024.

demo_gt_1.mp4
demo_gt_2.mp4

Getting Started - Joint Method / Evaluation Metrics

Git clone

You can use git clone and move into the directory.

git clone https://github.com/SonyResearch/SAVGBench.git
cd SAVGBench

Download pretrained models

You can download the pretrained models from Google Drive.

unzip YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1.zip

Set pretrained models for inference of joint baseline method

You need to move the .pt checkpoint files of the joint baseline method.

mkdir joint_method/pretrained_models
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/model330011.pt joint_method/pretrained_models/
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/model_SR_mmdiff_120000.pt joint_method/pretrained_models/

Set pretrained models for Spatial AV-Align metric

You need to move the .pth files of the object detection and sound event localization and detection (SELD) models, which are used to compute Spatial AV-Align.

mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/yolox_tiny_8x8_300e_coco_20211124_171234-b4047906.pth av_spatial_evaluation/object_detection_svg_infer/
mkdir -p av_spatial_evaluation/stereo_seld_infer/data/model_monitor/20240912162834/
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/params_swa_20240912162834_0040000.pth av_spatial_evaluation/stereo_seld_infer/data/model_monitor/20240912162834/
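For intuition, the Spatial AV-Align computation matches sound events localized by the SELD model against objects found by the object detector. The toy sketch below illustrates this style of matching; the azimuth mapping, the field-of-view value, the tolerance, and all function names are assumptions for illustration, not the repository's actual implementation.

```python
# Illustrative sketch only (NOT the repository's implementation) of matching
# localized sound events against detected object boxes, Spatial AV-Align style.

def azimuth_from_box(x_center: float, frame_width: float, fov_deg: float = 90.0) -> float:
    # Map a box's horizontal center to an azimuth angle, assuming a symmetric
    # horizontal field of view (fov_deg is an assumed camera parameter).
    return ((x_center / frame_width) - 0.5) * fov_deg

def spatial_av_align(sound_azimuths, box_centers, frame_width, tol_deg: float = 10.0) -> float:
    # Count a true positive when a sound event's azimuth lies within tol_deg
    # of some detected object's azimuth; otherwise count a false negative.
    tp = fn = 0
    for az in sound_azimuths:
        if any(abs(az - azimuth_from_box(x, frame_width)) <= tol_deg for x in box_centers):
            tp += 1
        else:
            fn += 1
    return tp / (tp + fn) if (tp + fn) else 0.0
```

With one object at the center of a 1280-pixel-wide frame, a sound event at 0 degrees matches while one at 40 degrees does not, giving a score of 0.5.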

Prepare python environment

The inference and metric computation have been tested with Python 3.10.19 and torch 2.1.2. For example, you can use a miniforge environment.

conda create --name savgbench python=3.10 -y
conda activate savgbench
# conda install -c conda-forge ffmpeg  # if needed
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements_joint_and_metric.txt
pip install openmim==0.3.9
mim install mmengine==0.10.7 mmcv==2.1.0 mmdet==3.3.0  # mmdetection for object detection: https://mmdetection.readthedocs.io/en/latest/get_started.html

Run evaluation script (including inference and metric computation)

Before running the evaluation script ./evaluation_script.sh, please edit the paths in the script so they match your environment.

The inference script ./joint_method/run_baseline.sh is called from the evaluation script. You can configure the batch size and the GPUs to be used in this file.

You can run the evaluation script.

./evaluation_script.sh VIDEO_OUTPUT_DIR VIDEO_REFERENCE_DIR NUM_GPUS

Please use absolute paths. For example, to run on 1 GPU:

./evaluation_script.sh /home/USER_DIR/SAVGBench/result_outputs/ /home/USER_DIR/SAVGBench/SAVGBench_Dataset_Evaluation/video_eval/ 1

It takes around 30 minutes to generate the 96 samples (the default). You can then check the evaluation results in ./results.out.

AV spatial each component: num_TP: 2382, num_FN: 2320, num_TP + num_FN: 4702
AV_final_score: 0.5065929391748192
{'fvd': 1218.8477783203125, 'kvd': 55.60466751187866, 'fad': 9.357064962387085}
Temporal AV-Align: 0.6680675666494543

The results may vary depending on random seeds, etc., but you should obtain results similar to the ones reported above. Please note that the FAD score is better than the value in the paper because a bug in the FAD computation has been fixed.
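In the sample output, AV_final_score is consistent with a recall-style ratio, num_TP / (num_TP + num_FN). The following minimal sketch reproduces that relation from the printed counts; it is an inference from the reported numbers (the function name is ours), not the repository's exact code.

```python
def av_final_score(num_tp: int, num_fn: int) -> float:
    # Recall-style spatial alignment score: fraction of sound events matched
    # to an on-screen object (TP) among all reference events (TP + FN).
    total = num_tp + num_fn
    return num_tp / total if total else 0.0

# Using the counts from the sample output: 2382 / (2382 + 2320) ~= 0.5066
print(av_final_score(2382, 2320))
```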

Run training scripts of joint method

See joint_method/TRAINING_JOINT.md.

Citation

If you find this repository useful, please consider citing:

@inproceedings{shimada2026savgbench,
  title={SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation},
  author={Shimada, Kazuki and Simon, Christian and Shibuya, Takashi and Takahashi, Shusuke and Mitsufuji, Yuki},
  booktitle={Proc. of IEEE ICASSP},
  year={2026}
}

References

This repository is built on the following papers and their open-source repositories.

  1. Joint Method and Video/Audio Quality Metrics: MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

  2. Temporal AV-Align Metric: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
