Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang
University of Washington, National University of Singapore
We propose Scale-Aware KV Cache (ScaleKV), a novel KV cache compression framework tailored to VAR's next-scale prediction paradigm. ScaleKV builds on two critical observations: cache demands vary across transformer layers, and attention patterns differ across scales. Based on these insights, ScaleKV categorizes transformer layers into two functional groups, termed drafters and refiners, and identifies each layer's role at every scale. This enables adaptive cache allocation that aligns with the specific computational demands of each layer during multi-scale inference. On Infinity-8B, ScaleKV achieves a 10x memory reduction, from 85 GB to 8.5 GB, with negligible quality degradation (the GenEval score remains at 0.79 and the DPG score marginally decreases from 86.61 to 86.49).
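The role-based allocation described above can be illustrated with a minimal sketch. All names here (`allocate_cache_budgets`, the 4:1 drafter/refiner ratio) are hypothetical for illustration and are not the repository's actual API; the idea is simply that layers identified as drafters retain a larger share of the KV cache budget than refiners.

```python
# Illustrative sketch of role-based KV cache budgeting (hypothetical names;
# not the repository's actual API). Drafters, which attend broadly to tokens
# from earlier scales, keep a larger cache than refiners, which focus on
# local detail.

def allocate_cache_budgets(layer_roles, total_budget, drafter_weight=4.0):
    """Split a total KV-cache entry budget across layers by role.

    layer_roles:  list of "drafter" / "refiner" strings, one per layer.
    total_budget: total number of KV entries to keep across all layers.
    """
    weights = [drafter_weight if r == "drafter" else 1.0 for r in layer_roles]
    total_w = sum(weights)
    # Each layer's budget is proportional to its role weight.
    return [int(total_budget * w / total_w) for w in weights]

roles = ["drafter", "refiner", "refiner", "drafter"]
budgets = allocate_cache_budgets(roles, total_budget=1000)
print(budgets)  # drafters receive 4x the per-layer budget of refiners
```

In the actual method this classification is performed per scale, so the same layer may receive different budgets at different steps of next-scale prediction.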
- 🔥 May 26, 2025: Our paper is available now!
- 🔥 May 25, 2025: Code repo is released! The arXiv paper is coming soon!
Install requirements:

```shell
pip install -r requirements.txt
```

Download google flan-t5-xl:

```shell
pip install -U huggingface_hub
huggingface-cli download google/flan-t5-xl --local-dir ./weights/flan-t5-xl
```

Download Infinity-2B:

```shell
huggingface-cli download FoundationVision/Infinity --include "infinity_2b_reg.pth" --local-dir ./weights/
huggingface-cli download FoundationVision/Infinity --include "infinity_vae_d32reg.pth" --local-dir ./weights/
```

Download Infinity-8B:

```shell
huggingface-cli download FoundationVision/Infinity --include "infinity_8b_weights/**" --local-dir ./weights/infinity_8b_weights
huggingface-cli download FoundationVision/Infinity --include "infinity_vae_d56_f8_14_patchify.pth" --local-dir ./weights/
```

Sample images with ScaleKV-compressed Infinity-8B (10% KV Cache):

```shell
python infer_8B.py
```

Sample images with ScaleKV-compressed Infinity-2B (10% KV Cache):

```shell
python infer_2B.py
```

For evaluation, sample images with the original Infinity-8B:

```shell
torchrun --nproc_per_node=$N_GPUS scripts/sample_8b.py
```

Sample images with ScaleKV-compressed Infinity-8B (10% KV Cache):

```shell
torchrun --nproc_per_node=$N_GPUS scripts/sample_kv_8b.py
```

After you sample all the images, you can calculate PSNR, LPIPS, and FID with:

```shell
python scripts/compute_metrics.py --input_root0 samples/gt_8b --input_root1 samples/scalekv_8b
```

The same evaluation for Infinity-2B:

```shell
torchrun --nproc_per_node=$N_GPUS scripts/sample_2b.py
torchrun --nproc_per_node=$N_GPUS scripts/sample_kv_2b.py
python scripts/compute_metrics.py --input_root0 samples/gt_2b --input_root1 samples/scalekv_2b
```

Thanks to Infinity for their wonderful work and codebase!
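For reference, PSNR between two aligned images of the same size can be computed as below. This is a minimal NumPy sketch of the standard definition, independent of the repository's `compute_metrics.py` (LPIPS and FID require learned models and are not reproduced here).

```python
import numpy as np

def psnr(img0, img1, max_val=255.0):
    """Peak signal-to-noise ratio (dB) between two same-shape images."""
    mse = np.mean((img0.astype(np.float64) - img1.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: constant offset of 10 gives MSE = 100.
a = np.zeros((8, 8), dtype=np.uint8)
b = np.full((8, 8), 10, dtype=np.uint8)
print(round(psnr(a, b), 2))  # 10 * log10(255^2 / 100) ≈ 28.13
```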
If our research assists your work, please give us a star ⭐ or cite us using:
```bibtex
@article{li2025scalekv,
  title={Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression},
  author={Li, Kunjun and Chen, Zigeng and Yang, Cheng-Yen and Hwang, Jenq-Neng},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
```




