# Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention (WACV 2023)
Zineng Tang*, Jaemin Cho*, Jie Lei, Mohit Bansal
Perceiver-VL learns vision-and-language representations with iterative latent attention that scales linearly with long inputs.
![Perceiver-VL Architecture Overview](assets/architecture.png)
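For intuition, the sketch below shows the core iterative latent-attention pattern in PyTorch: a small set of learned latents repeatedly cross-attends to the (possibly very long) concatenated vision-and-language token sequence and is refined with self-attention, so compute grows linearly with input length. This is a simplified illustration, not the repository's `model/modules/perceiver_vl.py`; the module name, depth, and dimensions are assumptions.

```python
# Minimal sketch of Perceiver-style iterative latent attention (illustrative only).
import torch
import torch.nn as nn

class IterativeLatentAttention(nn.Module):
    def __init__(self, dim=768, num_latents=64, num_heads=12, depth=4):
        super().__init__()
        # A small, learned set of latent vectors shared across all inputs.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth)]
        )
        self.self_attn = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True) for _ in range(depth)]
        )

    def forward(self, inputs):
        # inputs: (batch, seq_len, dim) -- e.g. concatenated video/image and text embeddings.
        b = inputs.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        for cross, block in zip(self.cross_attn, self.self_attn):
            # Latents attend to the long input: cost is O(num_latents * seq_len), not O(seq_len^2).
            attended, _ = cross(latents, inputs, inputs)
            latents = block(latents + attended)
        return latents  # (batch, num_latents, dim)

# Usage: 64 latents summarize 4096 input tokens.
out = IterativeLatentAttention()(torch.randn(2, 4096, 768))
print(out.shape)  # torch.Size([2, 64, 768])
```

Because the latent set stays small and fixed, the per-layer attention cost scales with `num_latents * seq_len` rather than quadratically in the input length.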
## Setup

```bash
# Create and activate a conda environment (you can also use another environment manager)
conda create -n Perceiver-VL python=3.8
conda activate Perceiver-VL

# Install dependencies
pip install -r requirements.txt
```

TODO: Finish datasets/tasks instructions and scripts
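The codebase is built on PyTorch Lightning (see `model/datamodules` and `model/modules`), so a quick import check can confirm the environment is ready; the exact version pins live in `requirements.txt`.

```python
# Sanity-check the installed environment (illustrative; versions are pinned in requirements.txt).
import torch
import pytorch_lightning as pl

print("torch:", torch.__version__, "| pytorch-lightning:", pl.__version__)
print("CUDA available:", torch.cuda.is_available())
```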
## Pretraining

```bash
# Pretrain on WebVid + GCC
bash scripts/co_pretrain.sh

# Pretrain on WebVid
bash scripts/webvid_pretrain.sh

# Pretrain on GCC
bash scripts/gcc_pretrain.sh

# Pretrain on ImageNet
bash scripts/imagenet_pretrain.sh
```

Download the pretrained checkpoint: [link]
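Once downloaded, the checkpoint can be inspected before finetuning. The snippet below is a minimal sketch assuming a standard PyTorch Lightning `.ckpt` file with weights nested under a `state_dict` key; the filename is hypothetical, and actual checkpoint loading is presumably configured through `run.py` and the options in `model/config.py`.

```python
# Illustrative only: inspect a downloaded pretraining checkpoint.
# Assumes a standard PyTorch Lightning .ckpt file; the filename below is hypothetical.
import torch

ckpt = torch.load("perceiver_vl_pretrained.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # Lightning nests weights under "state_dict"
print(len(state_dict), "tensors; first key:", next(iter(state_dict)))
```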
## Finetuning

```bash
# Finetune on MSRVTT Retrieval
bash scripts/msrvtt_vrtr_finetune.sh

# Finetune on VQA
bash scripts/vqa_finetune.sh
```

## Code Structure

```
Perceiver_VL
│
├── assets                   # illustrations
│   └── architecture.png
│
├── model                    # main source
│   ├── datamodules          # pytorch-lightning data wrappers
│   │   ├── datamodule_base.py
│   │   └── ...
│   ├── datasets             # datasets
│   │   ├── vqa_dataset.py
│   │   └── ...
│   ├── gadgets
│   │   └── my_metrics.py    # metric utils
│   ├── modules
│   │   ├── heads.py         # model heads
│   │   ├── model_module.py  # pytorch-lightning wrapper for the model
│   │   ├── model_utils.py   # pytorch-lightning wrapper for training metrics
│   │   ├── objectives.py    # pretraining/finetuning objectives
│   │   └── perceiver_vl.py  # main model
│   ├── transforms           # image transformation utils
│   │   └── ...
│   └── config.py            # all configurations
│
├── scripts                  # all scripts
│   ├── vqa_finetune.sh
│   ├── co_pretrain.sh
│   └── ...
│
├── run.py                   # main entry point
└── requirements.txt
```

## Citation

```bibtex
@inproceedings{tang2023wacv,
  title     = {Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention},
  author    = {Zineng Tang and Jaemin Cho and Jie Lei and Mohit Bansal},
  booktitle = {WACV},
  year      = {2023}
}
```

## Acknowledgement

Our codebase is based on ViLT. We thank the authors for their open-source contributions.
## Contact

Zineng Tang (zn.tang.terran@gmail.com)
