This repo contains the code for Guiding Instruction-based Image Editing via Multimodal Large Language Models (ICLR'24 Spotlight)
MGIE is an implementation of
"Guiding Instruction-based Image Editing via Multimodal Large Language Models"
Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan
in International Conference on Learning Representations (ICLR) 2024
Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training.
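As a conceptual illustration of the flow described above, the sketch below uses hypothetical placeholder names (`mllm`, `edit_model`, and their methods are not the actual API of this repo):

```python
# Conceptual sketch of the MGIE flow described above.
# All names (mllm, edit_model, derive_expressive_instruction, edit) are
# hypothetical placeholders, not the actual API of this repo.
def mgie_edit(image, brief_instruction, mllm, edit_model):
    # The MLLM turns the brief command + image into an expressive instruction,
    # i.e., explicit, visually grounded guidance for the edit.
    expressive = mllm.derive_expressive_instruction(image, brief_instruction)
    # The editing model captures this guidance and performs the manipulation;
    # in MGIE both parts are trained jointly, end-to-end.
    return edit_model.edit(image, guidance=expressive)
```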
Set up the environment:
```
conda create -n mgie python=3.10 -y
conda activate mgie
conda update -n base -c defaults conda setuptools -y
conda install -c conda-forge git git-lfs ffmpeg vim htop ninja gpustat -y
conda clean -a -y

pip install -U pip cmake cython==0.29.36 pydantic==1.10 numpy
pip install -U gdown pydrive2 wget jupyter jupyterlab jupyterthemes ipython
pip install -U sentencepiece transformers diffusers tokenizers datasets gradio==3.37 accelerate evaluate git+https://github.com/openai/CLIP.git
pip install -U https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchvision-0.13.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchaudio-0.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl
pip install -U deepspeed

# git clone this repo
cd ml-mgie
git submodule update --init --recursive

cd LLaVA
pip install -e .
pip install -U https://download.pytorch.org/whl/cu113/torch-1.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchvision-0.13.0%2Bcu113-cp310-cp310-linux_x86_64.whl https://download.pytorch.org/whl/cu113/torchaudio-0.12.0%2Bcu113-cp310-cp310-linux_x86_64.whl
pip install -U ninja flash-attn==1.0.2
pip install -U pydrive2 gdown wget
cd ..

cp mgie_llava.py LLaVA/llava/model/llava.py
cp mgie_train.py LLaVA/llava/train/train.py
```
Put the official LLaVA-7B in _ckpt/LLaVA-7B-v1 and download the pre-trained checkpoint (trained on IPr2Pr + MagicBrush) into _ckpt/mgie_7b.
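Before launching anything, a quick sanity check of the environment and checkpoint layout can help. This is a minimal sketch; the expected torch/CUDA versions and directory names simply follow the wheels and paths listed above:

```python
# Sanity-check the environment and checkpoint layout before running the demo.
from pathlib import Path
import torch

print('torch', torch.__version__, '| cuda', torch.version.cuda)  # expect 1.12.0 / 11.3 per the wheels above
print('gpu available:', torch.cuda.is_available())               # the demo and training assume a CUDA GPU

for d in ('_ckpt/LLaVA-7B-v1', '_ckpt/mgie_7b'):
    print(d, '->', 'found' if Path(d).is_dir() else 'missing')
```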
Run our demo via demo.ipynb.

Notices: Apple's rights in the attached weight differentials are hereby licensed under the CC-BY-NC license. Apple makes no representations with regards to LLaMa or any other third party software, which are subject to their own terms.
Download the CLIP-filtered IPr2Pr dataset and process it (including the summarized expressive instructions) into _data.
process_data.ipynb contains examples to help prepare the data.
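Conceptually, each processed example pairs a source image and its edited goal with the brief instruction and its summarized expressive instruction. The sketch below is illustrative only; the field names and file paths are placeholders, not the exact schema produced by process_data.ipynb:

```python
# Illustrative placeholder for one processed training example; the real schema
# produced by process_data.ipynb may differ.
example = {
    'input': '_data/xxx_0.jpg',            # source image (placeholder path)
    'goal': '_data/xxx_1.jpg',             # edited goal image (placeholder path)
    'instruction': 'make it a snowy day',  # brief human instruction
    'expressive': 'cover the ground and rooftops with snow and give the sky a '
                  'cold, overcast tone',   # summarized expressive instruction
}
print(example['instruction'], '->', example['expressive'])
```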
Put Vicuna-7B in _ckpt/vicuna-7b-v1.1 and LLaVA-7B in _ckpt/LLaVA-7B-v1.
Train MGIE:
```
WANDB_DISABLED='true' torchrun --nnodes=1 --nproc_per_node=8 --master_port=7122 LLaVA/llava/train/train_mem.py \
  --model_name_or_path ./_ckpt/vicuna-7b-v1.1 --version v1 \
  --vision_tower openai/clip-vit-large-patch14 --mm_vision_select_layer -2 --mm_use_im_start_end True \
  --bf16 True --output_dir _snapshot/mgie \
  --num_train_epochs 40 --per_device_train_batch_size 4 --per_device_eval_batch_size 2 \
  --dataloader_num_workers 2 --gradient_accumulation_steps 1 \
  --evaluation_strategy 'no' --save_strategy 'steps' --save_steps 2000 --save_total_limit 10 \
  --learning_rate 5e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type 'cosine' \
  --logging_steps 1 --tf32 True --model_max_length 512 --gradient_checkpointing True --lazy_preprocess True
```
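With these flags, one optimizer update corresponds to an effective batch of 8 GPUs × 4 examples per device × 1 gradient-accumulation step = 32 examples; if you train on a different number of GPUs, adjust --nproc_per_node together with --per_device_train_batch_size or --gradient_accumulation_steps to keep the effective batch comparable.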
Extract the trained checkpoint into _ckpt/mgie_7b via extract_ckpt.ipynb, then run our demo:
demo.ipynb

If you find this work useful, please cite:
```
@inproceedings{fu2024mgie,
  author    = {Tsu-Jui Fu and Wenze Hu and Xianzhi Du and William Yang Wang and Yinfei Yang and Zhe Gan},
  title     = {{Guiding Instruction-based Image Editing via Multimodal Large Language Models}},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024}
}
```

Acknowledgement:
- LLaVA: the codebase we built upon

