
Perceiver-VL

PyTorch code for "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention" (WACV 2023)

Zineng Tang*, Jaemin Cho*, Jie Lei, Mohit Bansal

Perceiver-VL learns vision-and-language representations with iterative latent attention, so compute scales linearly with long inputs.

Introduction

Figure: Perceiver-VL architecture overview

Install

Set up the Python environment

conda create -n Perceiver-VL python=3.8  # other environment managers also work
conda activate Perceiver-VL

Install other dependencies

pip install -r requirements.txt 

Training

TODO: Finish datasets/tasks instructions and scripts

Pretraining (scripts)

# Pretrain on WebVid + GCC
bash scripts/co_pretrain.sh

# Pretrain on WebVid
bash scripts/webvid_pretrain.sh

# Pretrain on GCC
bash scripts/gcc_pretrain.sh

# Pretrain on ImageNet
bash scripts/imagenet_pretrain.sh

Pretrained Checkpoint

Download the pretrained checkpoint: [link]

Finetuning on Downstream Tasks (scripts)

# Finetune on MSRVTT Retrieval
bash scripts/msrvtt_vrtr_finetune.sh

# Finetune on VQA
bash scripts/vqa_finetune.sh

Code Structure

Perceiver_VL
│
├── assets                      # illustrations
│   └── architecture.png
│
├── model                       # main source
│   ├── datamodules             # pytorch-lightning wrap
│   │   ├── datamodule_base.py
│   │   └── ...
│   ├── datasets                # Datasets
│   │   ├── vqa_dataset.py
│   │   └── ...
│   ├── gadgets
│   │   └── my_metrics.py       # metric utils
│   ├── modules
│   │   ├── heads.py            # model heads
│   │   ├── model_module.py     # pytorch-lightning wrap for model
│   │   ├── model_utils.py      # pytorch-lightning wrap for training metrics
│   │   ├── objectives.py       # pretraining/finetuning objectives
│   │   └── perceiver_vl.py     # main model
│   ├── transforms              # image transformation utils
│   │   └── ...
│   └── config.py               # all configurations
│
├── scripts                     # all scripts
│   ├── vqa_finetune.sh
│   ├── co_pretrain.sh
│   └── ...
│
├── run.py                      # main
└── requirements.txt

Citation

@inproceedings{tang2023wacv,
  title     = {Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention},
  author    = {Zineng Tang and Jaemin Cho and Jie Lei and Mohit Bansal},
  booktitle = {WACV},
  year      = {2023}
}

Acknowledgement

Our codebase is based on ViLT. We thank the authors for their open-source contributions.

Contact

Zineng Tang (zn.tang.terran@gmail.com)
