This repository includes a brief introduction, the dataset link, and the source code for SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation (ICASSP 2026).
This work addresses the lack of multimodal generative models capable of producing high-quality videos with spatially aligned audio. While recent advancements in generative models have been successful in video generation, they often overlook the spatial alignment between audio and visuals, which is essential for immersive experiences. To tackle this problem, we establish a new research direction in benchmarking the Spatially Aligned Audio-Video Generation (SAVG) task. We introduce a spatially aligned audio-visual dataset, whose audio and video data are curated based on whether sound events are onscreen or not. We also propose a new alignment metric that aims to evaluate the spatial alignment between audio and video. Then, using the dataset and metric, we benchmark two types of baseline methods: one is based on a joint audio-video generation model, and the other is a two-stage method that combines a video generation model and a video-to-audio generation model. Our experimental results demonstrate that gaps exist between the baseline methods and the ground truth in terms of video and audio quality, as well as spatial alignment between the two modalities.
demo_joint_1.mp4
demo_joint_2.mp4
demo_two-stage_1.mp4
demo_two-stage_2.mp4
The SAVGBench dataset can be downloaded from Zenodo. We use the development set for training and the evaluation set for evaluation. Note that this dataset has been used in SVG Challenge 2024.
demo_gt_1.mp4
demo_gt_2.mp4
You can use git clone and move into the directory.
```shell
git clone https://github.com/SonyResearch/SAVGBench.git
cd SAVGBench
```

You can download the pretrained models from Google Drive.
```shell
unzip YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1.zip
```

You need to move the pt files of the joint baseline method.
```shell
mkdir joint_method/pretrained_models
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/model330011.pt joint_method/pretrained_models/
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/model_SR_mmdiff_120000.pt joint_method/pretrained_models/
```

You need to move the pth files of the object detection and sound event localization and detection (SELD) models, which are used in the computation of Spatial AV-Align.
```shell
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/yolox_tiny_8x8_300e_coco_20211124_171234-b4047906.pth av_spatial_evaluation/object_detection_svg_infer/
mkdir -p av_spatial_evaluation/stereo_seld_infer/data/model_monitor/20240912162834/
mv YOUR_DOWNLOAD_PATH/SAVGBench_PretrainedModels_V1/params_swa_20240912162834_0040000.pth av_spatial_evaluation/stereo_seld_infer/data/model_monitor/20240912162834/
```

The inference and metric computation have been tested on Python 3.10.19 and torch 2.1.2. For example, you can use a miniforge environment.
```shell
conda create --name savgbench python=3.10 -y
conda activate savgbench
# conda install -c conda-forge ffmpeg  # if needed
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements_joint_and_metric.txt
pip install openmim==0.3.9
mim install mmengine==0.10.7 mmcv==2.1.0 mmdet==3.3.0  # mmdetection for object detection: https://mmdetection.readthedocs.io/en/latest/get_started.html
```

Before running the evaluation script ./evaluation_script.sh, please modify the paths in the script so that they point to the correct locations.
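Before launching the evaluation, you can sanity-check that the checkpoints landed where the scripts expect them. This is a minimal sketch (not part of the repository); the paths are taken from the move commands above:

```python
from pathlib import Path

# Expected checkpoint locations after the "mv" commands above.
EXPECTED_CHECKPOINTS = [
    "joint_method/pretrained_models/model330011.pt",
    "joint_method/pretrained_models/model_SR_mmdiff_120000.pt",
    "av_spatial_evaluation/object_detection_svg_infer/"
    "yolox_tiny_8x8_300e_coco_20211124_171234-b4047906.pth",
    "av_spatial_evaluation/stereo_seld_infer/data/model_monitor/"
    "20240912162834/params_swa_20240912162834_0040000.pth",
]

def missing_checkpoints(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_CHECKPOINTS if not (root / p).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for p in missing:
            print(" -", p)
    else:
        print("All pretrained models are in place.")
```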
The inference script ./joint_method/run_baseline.sh is called from the evaluation script. You may configure the batch size and the GPUs to be used in this file.
You can run the evaluation script.
```shell
./evaluation_script.sh VIDEO_OUTPUT_DIR VIDEO_REFERENCE_DIR NUM_GPUS
```

Please use absolute paths. For example, to run on 1 GPU:
```shell
./evaluation_script.sh /home/USER_DIR/SAVGBench/result_outputs/ /home/USER_DIR/SAVGBench/SAVGBench_Dataset_Evaluation/video_eval/ 1
```

It takes around 30 minutes to generate 96 samples (the default). Then you can check the evaluation results in ./results.out.
```
AV spatial each component: num_TP: 2382, num_FN: 2320, num_TP + num_FN: 4702
AV_final_score: 0.5065929391748192
{'fvd': 1218.8477783203125, 'kvd': 55.60466751187866, 'fad': 9.357064962387085}
Temporal AV-Align: 0.6680675666494543
```

The results may vary depending on random seeds etc., but you should obtain results similar to the ones reported. Please note that the FAD score is better than the value in the paper because a bug in the FAD computation was fixed.
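For reference, the printed counts are consistent with the aggregate spatial score being the fraction of sound events matched across modalities, i.e. TP / (TP + FN). A minimal sketch (the helper name is ours, not the repository's):

```python
def spatial_av_align_score(num_tp: int, num_fn: int) -> float:
    """Aggregate spatial alignment score as the true-positive fraction.

    Assumption: the final score equals num_TP / (num_TP + num_FN), which is
    consistent with the counts and AV_final_score shown in the sample output.
    """
    return num_tp / (num_tp + num_fn)

print(spatial_av_align_score(2382, 2320))  # matches AV_final_score above
```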
See joint_method/TRAINING_JOINT.md.
If you find this repository useful, please consider citing:
```bibtex
@inproceedings{shimada2026savgbench,
  title={SAVGBench: Benchmarking Spatially Aligned Audio-Video Generation},
  author={Shimada, Kazuki and Simon, Christian and Shibuya, Takashi and Takahashi, Shusuke and Mitsufuji, Yuki},
  booktitle={Proc. of IEEE ICASSP},
  year={2026}
}
```

This repository is built on the following papers and open-source repositories.
- Joint Method and Video/Audio Quality Metrics: MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
- Temporal AV-Align Metric: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

