DreamEngine is a unified framework that integrates multimodal encoders such as QwenVL with diffusion models through a two-stage training approach. It enables text-image interleaved control and achieves state-of-the-art performance when generating images from complex, concept-merged inputs.
Updates:
- 2025-03-03: Released the checkpoint and a demo for text-guided object fusion.
    bash setup.sh
    # set up the paths in demo.py
    python src/scripts/eval/demo.py
If you find this work helpful, please cite:
    @misc{chen2025multimodalrepresentationalignmentimage,
      title={Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think},
      author={Liang Chen and Shuai Bai and Wenhao Chai and Weichu Xie and Haozhe Zhao and Leon Vinci and Junyang Lin and Baobao Chang},
      year={2025},
      eprint={2502.20172},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20172}
    }