StableAnimator

StableAnimator: High-Quality Identity-Preserving Human Image Animation
Shuyuan Tu¹, Zhen Xing¹, Xintong Han³, Zhi-Qi Cheng⁴, Qi Dai², Chong Luo², Zuxuan Wu¹
[¹Fudan University; ²Microsoft Research Asia; ³Huya Inc; ⁴Carnegie Mellon University]

Pose-driven human image animations generated by StableAnimator, showing its ability to synthesize high-fidelity, ID-preserving videos. All animations are synthesized directly by StableAnimator without any face-related post-processing tools, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.

Comparison results between StableAnimator and state-of-the-art (SOTA) human image animation models highlight the superior performance of StableAnimator in delivering high-fidelity, identity-preserving human image animation.

Overview

An overview of the framework of StableAnimator.

Current diffusion models for human image animation struggle to ensure identity (ID) consistency. This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework, which synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. Building upon a video diffusion model, StableAnimator contains carefully designed modules for both training and inference that strive for identity consistency. In particular, StableAnimator begins by computing image and face embeddings with off-the-shelf extractors; the face embeddings are then refined by interacting with the image embeddings through a global content-aware Face Encoder. Then, StableAnimator introduces a novel distribution-aware ID Adapter that prevents interference caused by temporal layers while preserving ID via alignment. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality. We demonstrate that solving the HJB equation can be integrated into the diffusion denoising process, and the resulting solution constrains the denoising path and thus benefits ID preservation. Experiments on multiple benchmarks show the effectiveness of StableAnimator both qualitatively and quantitatively.

News

[2024-11-26]: 🔥 The project page, code, technical report, and a basic model checkpoint are released. The training code, data pre-processing code, evaluation dataset, and StableAnimator-pro will be released very soon. Stay tuned!

To-Do List

StableAnimator-basic
Inference Code
Evaluation Samples
Evaluation Dataset
Training Code
Data Pre-Processing Code (Skeleton Extraction and Human Face Mask Extraction)
StableAnimator-pro
Inference Code with HJB-based Face Optimization

Quickstart

The basic model checkpoint supports generating videos at 576x1024 or 512x512 resolution. If you run out of memory, reduce the number of animated frames.

Environment setup

We recommend Python 3 with torch 2.x; this setup has been validated on an Nvidia V100 GPU. We also recommend using the Docker image 2.1.0-cuda11.8-cudnn8-devel or deeptimhe/ubuntu20.04-cuda11.3.1-python3.8-pytorch1.12:orig-sing-pytorch3d0.7.2. Run the commands below to install all the dependencies of StableAnimator:

pip install -r requirements.txt
conda install xformers -c xformers -y
pip install onnxruntime-gpu==1.17.0 --index-url=https://pkgs.dev.azure.com/onnxruntime/onnxruntime/_packaging/onnxruntime-cuda-12/pypi/simple
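After installation, a quick import check can catch a missing package before a long inference run. The sketch below is not part of the repository. It assumes the import names `torch`, `xformers`, and `onnxruntime` (the `onnxruntime-gpu` wheel is imported as `onnxruntime`), and the helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages installed above.
# Note: the onnxruntime-gpu wheel is imported as "onnxruntime".
REQUIRED_MODULES = ["torch", "xformers", "onnxruntime"]


def missing_modules(names):
    """Return the top-level module names that cannot be located.

    find_spec only looks the module up; it does not import it, so the
    check is fast and does not initialize CUDA.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules can be located.")
```

This only confirms that the packages are installed; it does not verify that their CUDA builds match your driver.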

Download weights

If you experience connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: export HF_ENDPOINT=https://hf-mirror.com. Please download weights manually as follows:

cd StableAnimator/
mkdir checkpoints

All the weights should be organized under checkpoints/ as follows:

checkpoints/
├── DWPose
│   ├── dw-ll_ucoco_384.onnx
│   └── yolox_l.onnx
├── Animation
│   ├── pose_net.pth
│   ├── face_encoder.pth
│   └── unet.pth
├── SVD
│   └── stable-video-diffusion-img2vid-xt
│       ├── feature_extractor
│       ├── image_encoder
│       ├── scheduler
│       ├── unet
│       ├── vae
│       ├── model_index.json
│       ├── svd_xt.safetensors
│       └── svd_xt_image_decoder.safetensors

Download DWPose pretrained model: dwpose
Download the pre-trained checkpoint of StableAnimator from Hugging Face
Download the SVD pretrained model: SVD
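Once the downloads finish, the checkpoints/ layout listed above can be verified with a short script. This is a minimal sketch and not part of the repository: the relative paths are copied from the tree above, while the function name `missing_checkpoints` and the constant `EXPECTED_PATHS` are hypothetical.

```python
from pathlib import Path

# Relative paths copied from the checkpoints/ tree above.
SVD_DIR = "SVD/stable-video-diffusion-img2vid-xt"
EXPECTED_PATHS = [
    "DWPose/dw-ll_ucoco_384.onnx",
    "DWPose/yolox_l.onnx",
    "Animation/pose_net.pth",
    "Animation/face_encoder.pth",
    "Animation/unet.pth",
    f"{SVD_DIR}/feature_extractor",
    f"{SVD_DIR}/image_encoder",
    f"{SVD_DIR}/scheduler",
    f"{SVD_DIR}/unet",
    f"{SVD_DIR}/vae",
    f"{SVD_DIR}/model_index.json",
    f"{SVD_DIR}/svd_xt.safetensors",
    f"{SVD_DIR}/svd_xt_image_decoder.safetensors",
]


def missing_checkpoints(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_checkpoints("checkpoints"):
        print("missing:", rel)
```

Run it from the StableAnimator/ directory; no output means every listed entry was found. The script checks existence only, not file integrity.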

Evaluation Samples

The evaluation samples presented in the paper can be downloaded from OneDrive. Please download evaluation samples manually as follows:

cd StableAnimator/
mkdir inference

All the evaluation samples should be organized as follows:

inference/
├── case-1
│   ├── poses
│   ├── faces
│   └── reference.png
├── case-2
│   ├── poses
│   ├── faces
│   └── reference.png
├── case-3
│   ├── poses
│   ├── faces
│   └── reference.png
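The inference/ layout above can be checked the same way. The sketch below is an illustration, not repository code: it assumes every case folder is named `case-<n>` and needs `poses`, `faces`, and `reference.png`, as in the listing. The function name `incomplete_cases` is hypothetical.

```python
from pathlib import Path

# Entries each case folder is expected to contain, per the listing above.
REQUIRED_ENTRIES = ("poses", "faces", "reference.png")


def incomplete_cases(root):
    """Map each incomplete case folder name to the entries it lacks."""
    problems = {}
    for case_dir in sorted(Path(root).glob("case-*")):
        if not case_dir.is_dir():
            continue  # ignore stray files that happen to match the pattern
        missing = [name for name in REQUIRED_ENTRIES
                   if not (case_dir / name).exists()]
        if missing:
            problems[case_dir.name] = missing
    return problems


if __name__ == "__main__":
    for case, missing in incomplete_cases("inference").items():
        print(f"{case}: missing {', '.join(missing)}")
```

An empty result means every case folder found under inference/ is complete; it does not confirm that all cases from the paper were downloaded.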

Note that the data pre-processing code, including human skeleton extraction and human face extraction, will be released very soon. Stay tuned!

Model inference

A sample configuration for testing is provided as command_basic_infer.sh. You can also easily modify the various configurations according to your needs.

bash command_basic_infer.sh

StableAnimator supports human image animation at two resolutions: 512x512 and 576x1024. All of the following options are set in command_basic_infer.sh. "--width" and "--height" set the resolution of the animation. "--output_dir" is the path where the generated animation is saved. "--validation_control_folder" and "--validation_image" are the paths of the given pose sequence and the reference image, respectively. "--pretrained_model_name_or_path" is the path of the pretrained SVD model. "--posenet_model_name_or_path", "--face_encoder_model_name_or_path", and "--unet_model_name_or_path" are the paths of the pretrained StableAnimator weights. If you have enough GPU resources, you can increase the value of "--decode_chunk_size" (4=>8=>16) to improve the temporal smoothness of the animation.
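To make the options above concrete, the following sketch assembles them into an argument list and rejects unsupported resolutions. It is an illustration under stated assumptions, not the repository's own tooling: it reads "576x1024" as width x height, it assumes all options use a leading `--` and the `--name=value` form, and the names `build_inference_args`, `PATH_OPTIONS`, and `SUPPORTED_RESOLUTIONS` are hypothetical.

```python
# (width, height) pairs; reading "576x1024" as width x height is an assumption.
SUPPORTED_RESOLUTIONS = {(512, 512), (576, 1024)}

# Path-valued options described above, without the leading dashes.
PATH_OPTIONS = [
    "pretrained_model_name_or_path",
    "output_dir",
    "validation_control_folder",
    "validation_image",
    "posenet_model_name_or_path",
    "face_encoder_model_name_or_path",
    "unet_model_name_or_path",
]


def build_inference_args(paths, width=512, height=512, decode_chunk_size=4):
    """Return a list of "--name=value" strings for the options above.

    `paths` maps every name in PATH_OPTIONS to a filesystem path.
    """
    if (width, height) not in SUPPORTED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution {width}x{height}")
    missing = [name for name in PATH_OPTIONS if name not in paths]
    if missing:
        raise KeyError(f"missing options: {missing}")
    args = [f"--{name}={paths[name]}" for name in PATH_OPTIONS]
    args += [
        f"--width={width}",
        f"--height={height}",
        f"--decode_chunk_size={decode_chunk_size}",
    ]
    return args
```

Validating the resolution up front avoids starting a long run with a size the basic checkpoint does not support. Compare the resulting values against command_basic_infer.sh before relying on them.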

Tip: if your GPU memory is limited, you can reduce the number of animated frames. This command generates two outputs: the animated_images folder and animated_images.gif. To obtain a high-quality MP4 file, we recommend running ffmpeg on the frames in animated_images as follows:

cd animated_images
ffmpeg -framerate 20 -i frame_%d.png -c:v libx264 -crf 10 -pix_fmt yuv420p /path/animation.mp4

"-framerate" refers to the fps setting. "-crf" indicates the quality of the generated MP4 file, with smaller values corresponding to higher quality.

VRAM requirement and Runtime

For the 15s demo video, the 16-frame basic model requires 18GB VRAM and finishes in 12 minutes on a 4090 GPU.

The minimum VRAM requirement for the 16-frame U-Net model is 10GB; however, the VAE decoder demands 16GB. You have the option to run the VAE decoder on CPU.
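The figures above suggest a simple placement rule, sketched below. This is an illustration only: it treats the 10GB (U-Net) and 16GB (VAE decoder) figures as independent thresholds on available VRAM, which is an assumption, and the function name `plan_devices` is hypothetical. It does not query the GPU or move any model.

```python
# Thresholds taken from the figures above, for the 16-frame setting.
UNET_MIN_GB = 10
VAE_DECODER_GB = 16


def plan_devices(available_vram_gb):
    """Suggest where to run the U-Net and the VAE decoder.

    Raises ValueError when even the U-Net would not fit on the GPU.
    """
    if available_vram_gb < UNET_MIN_GB:
        raise ValueError(
            f"{available_vram_gb}GB is below the {UNET_MIN_GB}GB U-Net minimum"
        )
    # Fall back to the CPU for decoding when the GPU is too small for the VAE.
    vae_device = "cuda" if available_vram_gb >= VAE_DECODER_GB else "cpu"
    return {"unet": "cuda", "vae_decoder": vae_device}
```

For example, a 12GB card would keep the U-Net on the GPU and decode on the CPU, trading decoding speed for a lower memory footprint.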

Contact

If you have any suggestions or find our work helpful, feel free to contact me.

Email: francisshuyuan@gmail.com

If you find our work useful, please consider giving a star to this GitHub repository and citing it:

@article{tu2024stableanimator,
  title={StableAnimator: High-Quality Identity-Preserving Human Image Animation},
  author={Shuyuan Tu and Zhen Xing and Xintong Han and Zhi-Qi Cheng and Qi Dai and Chong Luo and Zuxuan Wu},
  journal={arXiv preprint arXiv:2411.17697},
  year={2024}
}
