Improving Progressive Generation with Decomposable Flow Matching
Moayed Haji-Ali*, Willi Menapace*, Ivan Skorokhodov, Arpit Sahni, Sergey Tulyakov, Vicente Ordonez, Aliaksandr Siarohin
Snap Research & Rice University
TL;DR: Decomposable Flow Matching (DFM) is a simple framework to progressively generate visual modalities scale-by-scale, achieving up to 50% faster convergence compared to Flow Matching. DFM applies flow matching independently at each level of a multi-scale representation (e.g., a Laplacian pyramid) in an end-to-end fashion, staying compatible with standard flow-matching pipelines while improving quality and convergence speed.
This repo provides a reimplementation of DFM on top of SiT, following the REPA setup. The architecture does not exactly match the one used in the paper, so results may differ. Below, we provide a comparison between SiT and DFM produced with this repo.
Decomposable Flow Matching (DFM) combines multiscale decomposition with Flow Matching. DFM progressively synthesizes different representation scales by generating the coarse-structure scale first and incrementally refining it with finer scales.
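To make the multiscale decomposition concrete, here is a minimal NumPy sketch of a Laplacian-style pyramid. The average-pool downsampling and nearest-neighbor upsampling are illustrative choices for this sketch, not necessarily the operators used in the repo.

```python
import numpy as np

def down(x):
    """2x downsampling via average pooling (illustrative choice)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):
    """2x upsampling via nearest-neighbor repetition (illustrative choice)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(x, levels=2):
    """Split an image into fine-detail bands plus a coarse residual."""
    bands = []
    for _ in range(levels - 1):
        coarse = down(x)
        bands.append(x - up(coarse))  # detail lost by downsampling
        x = coarse
    bands.append(x)  # coarsest scale: the one DFM generates first
    return bands

def reconstruct(bands):
    """Invert the decomposition exactly: upsample and add details back."""
    x = bands[-1]
    for detail in reversed(bands[:-1]):
        x = up(x) + detail
    return x
```

Because each detail band stores exactly what downsampling discarded, `reconstruct(laplacian_pyramid(x))` recovers `x` bit-for-bit; DFM runs flow matching independently on each band of such a decomposition.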
- DFM Architecture: Adds per-scale patchification and timestep-embedding layers to DiT while keeping the core backbone untouched.
- DFM Training: Samples the stage count from a categorical distribution, draws each stage flow-timestep from a logit-normal distribution biased toward lower noise in early stages, and trains one DiT backbone to jointly predict all stage-wise velocities.
- DFM Inference: Denoises the coarsest stage first and activates each subsequent stage once the previous one reaches a predetermined per-stage noise threshold.
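The training and inference recipes above can be sketched in a few lines. Note the specific mean-shift schedule biasing earlier stages toward lower noise, and the linear noise decay assumed by the activation rule, are guesses for illustration; see the paper for the actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_active_stages(stage_weights=(0.9, 0.1)):
    """Training: categorical draw over how many stages participate this step."""
    return int(rng.choice(len(stage_weights), p=stage_weights)) + 1

def sample_stage_timesteps(num_active, bias=1.0):
    """Training: one logit-normal flow timestep per active stage.
    The mean shift pushing earlier (coarser) stages toward lower noise
    is a hypothetical schedule, not the paper's exact one."""
    mus = [-bias * (num_active - 1 - s) for s in range(num_active)]
    return [1.0 / (1.0 + np.exp(-rng.normal(mu, 1.0))) for mu in mus]

def activation_step(num_steps, threshold):
    """Inference: step index at which the next stage activates, assuming the
    previous stage's noise level falls linearly from 1 to 0 over its steps."""
    for i in range(num_steps + 1):
        if 1.0 - i / num_steps < threshold:
            return i
```

For example, with the defaults in the table below (`stage_weights=[0.9, 0.1]`), roughly 90% of training steps would use a single stage, and a second stage would activate only late in the coarse stage's denoising trajectory.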
| Method | FID | sFID | IS | Precision | Recall |
|---|---|---|---|---|---|
| SiT-XL/2 | 33.24 | 8.55 | 48.22 | 0.308 | 0.581 |
| DFM-SiT-XL/2 | 18.27 | 6.51 | 85.50 | 0.452 | 0.557 |
Pretrained checkpoints of the above experiments will be released soon.
```shell
conda create -n dfm python=3.9 -y
conda activate dfm
pip install -r requirements.txt
```

Download ImageNet. Then run the following preprocessing and VAE latent extraction scripts.
```shell
# Convert raw ImageNet data to a ZIP archive at 256x256 resolution
python dataset_tools.py convert \
  --source=[YOUR_DOWNLOAD_PATH]/ILSVRC/Data/CLS-LOC/train \
  --dest=[TARGET_PATH]/images \
  --resolution=256x256 \
  --transform=center-crop-dhariwal

# Convert the pixel data to VAE latents
python dataset_tools.py encode \
  --source=[TARGET_PATH]/images \
  --dest=[TARGET_PATH]/vae-sd
```

Here, `YOUR_DOWNLOAD_PATH` is the directory where you downloaded the dataset, and `TARGET_PATH` is the directory where the preprocessed images and corresponding compressed latent vectors will be saved. This directory will be used by your experiment scripts.
Training uses the unified train.py script with YAML configuration files or CLI arguments. Update data_dir in the config to point to your data directory.
```shell
# From CLI args
accelerate launch train.py --model [MODEL_NAME] --exp-name [EXP_NAME] --data-dir [DATA_DIR]

# Or from a YAML config
accelerate launch train.py --config [CONFIG_PATH] --data-dir [DATA_DIR]
```

where `[MODEL_NAME]` can be specified as a SiT or DFM-SiT baseline (e.g., `SiT-XL/2` or `DFM-SiT-XL/2`).
Sample training configurations can be found in `experiments/train`.
```shell
# From CLI args
accelerate launch train.py --model DFM-SiT-XL/2 --exp-name dfm-sit-xl-2-256px --data-dir [DATA_DIR]

# Or from a YAML config
accelerate launch train.py --config experiments/train/dfm_sit_b_256.yaml --data-dir [DATA_DIR]
```

The main DFM-specific options to adjust are:
| Parameter | Description | Default |
|---|---|---|
| `model` | Model architecture: `SiT-B/2`, `SiT-XL/2`, `DFM-SiT-B/2`, `DFM-SiT-XL/2`, etc. | — |
| `stages_count` | Number of stages in DFM | `2` |
| `stage_weights` | Sampling weight of each stage during training | `[0.9, 0.1]` |
| `num_steps_per_scale` | Number of inference steps for each stage | `[40, 10]` |
| `stage_sampling_thresholds` | Noise threshold at which the next stage's generation is initialized | `[0.1]` |
Please refer to the paper for guidelines on choosing DFM hyperparameters.
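As a starting point, a hypothetical YAML fragment combining the options above might look as follows (field names follow the table; values mirror the listed defaults — check `experiments/train` for the real configs):

```yaml
# Hypothetical config fragment; the actual files in experiments/train are authoritative.
model: DFM-SiT-XL/2
data_dir: /path/to/preprocessed/data
stages_count: 2
stage_weights: [0.9, 0.1]
num_steps_per_scale: [40, 10]
stage_sampling_thresholds: [0.1]
```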
Sampling uses the unified generate.py script with DDP:
```shell
# From CLI args
torchrun --nproc_per_node=8 generate.py \
  --model SiT-B/2 --ckpt exps/sit-b-2-256px/checkpoints/0400000.pt

# Or from a YAML config
torchrun --nproc_per_node=8 generate.py \
  --config experiments/generation/sit_b_256.yaml \
  --ckpt exps/sit-b-2-256px/checkpoints/0400000.pt
```

```shell
# From CLI args
torchrun --nproc_per_node=8 generate.py \
  --model DFM-SiT-B/2 --ckpt exps/dfm-sit-b-2-256px/checkpoints/0400000.pt

# Or from a YAML config
torchrun --nproc_per_node=8 generate.py \
  --config experiments/generation/dfm_sit_b_256.yaml \
  --ckpt exps/dfm-sit-b-2-256px/checkpoints/0400000.pt
```

We provide evaluation scripts in `experiments/evaluation/` that generate samples and compute FID, sFID, IS, Precision, and Recall.
```shell
bash experiments/evaluation/eval_dfm_sit_b_256.sh
```

This will generate samples under the `results/` directory along with an `.npz` file that can be used for evaluation. To run the reference TensorFlow evaluation on ImageNet, we use the ADM evaluation suite.
Note: Please make sure that the model hyperparameters match the training ones and refer to the paper for guidelines on choosing DFM inference hyperparameters.
This code is mainly built upon REPA. We thank the authors for open-sourcing their codebase.
```bibtex
@article{dfm,
  title={Improving Progressive Generation with Decomposable Flow Matching},
  author={Moayed Haji-Ali and Willi Menapace and Ivan Skorokhodov and Arpit Sahni and Sergey Tulyakov and Vicente Ordonez and Aliaksandr Siarohin},
  journal={arXiv preprint arXiv:2506.19839},
  year={2025}
}
```