# Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention (WACV 2023)
Zineng Tang*, Jaemin Cho*, Jie Lei, Mohit Bansal
Perceiver-VL learns vision-and-language representations with iterative latent attention that scales linearly with long inputs.
![Perceiver-VL Architecture Overview](assets/architecture.png)
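For intuition, the sketch below shows the core iterative latent-attention pattern in PyTorch: a small set of learned latents repeatedly cross-attends to the (possibly very long) concatenated vision-and-language token sequence and is refined with self-attention, so compute grows linearly with input length. This is a simplified illustration, not the repository's `model/modules/perceiver_vl.py`; the module name, depth, and dimensions are assumptions.

```python
# Minimal sketch of Perceiver-style iterative latent attention (illustrative only).
import torch
import torch.nn as nn

class IterativeLatentAttention(nn.Module):
    def __init__(self, dim=768, num_latents=64, num_heads=12, depth=4):
        super().__init__()
        # A small, learned set of latent vectors shared across all inputs.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(depth)]
        )
        self.self_attn = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True) for _ in range(depth)]
        )

    def forward(self, inputs):
        # inputs: (batch, seq_len, dim) -- e.g. concatenated video/image and text embeddings.
        b = inputs.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        for cross, block in zip(self.cross_attn, self.self_attn):
            # Latents attend to the long input: cost is O(num_latents * seq_len), not O(seq_len^2).
            attended, _ = cross(latents, inputs, inputs)
            latents = block(latents + attended)
        return latents  # (batch, num_latents, dim)

# Usage: 64 latents summarize 4096 input tokens.
out = IterativeLatentAttention()(torch.randn(2, 4096, 768))
print(out.shape)  # torch.Size([2, 64, 768])
```

Because the latent set stays small and fixed, the per-layer attention cost scales with `num_latents * seq_len` rather than quadratically in the input length.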
## Setup

```bash
# Create and activate a conda environment (you can also use another environment manager)
conda create -n Perceiver-VL python=3.8
conda activate Perceiver-VL

# Install dependencies
pip install -r requirements.txt
```

TODO: Finish datasets/tasks instructions and scripts
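The codebase is built on PyTorch Lightning (see `model/datamodules` and `model/modules`), so a quick import check can confirm the environment is ready; the exact version pins live in `requirements.txt`.

```python
# Sanity-check the installed environment (illustrative; versions are pinned in requirements.txt).
import torch
import pytorch_lightning as pl

print("torch:", torch.__version__, "| pytorch-lightning:", pl.__version__)
print("CUDA available:", torch.cuda.is_available())
```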
## Pretraining

```bash
# Pretrain on WebVid + GCC
bash scripts/co_pretrain.sh

# Pretrain on WebVid
bash scripts/webvid_pretrain.sh

# Pretrain on GCC
bash scripts/gcc_pretrain.sh

# Pretrain on ImageNet
bash scripts/imagenet_pretrain.sh
```

Download the pretrained checkpoint: [link]
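Once downloaded, the checkpoint can be inspected before finetuning. The snippet below is a minimal sketch assuming a standard PyTorch Lightning `.ckpt` file with weights nested under a `state_dict` key; the filename is hypothetical, and actual checkpoint loading is presumably configured through `run.py` and the options in `model/config.py`.

```python
# Illustrative only: inspect a downloaded pretraining checkpoint.
# Assumes a standard PyTorch Lightning .ckpt file; the filename below is hypothetical.
import torch

ckpt = torch.load("perceiver_vl_pretrained.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # Lightning nests weights under "state_dict"
print(len(state_dict), "tensors; first key:", next(iter(state_dict)))
```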
## Finetuning

```bash
# Finetune on MSRVTT Retrieval
bash scripts/msrvtt_vrtr_finetune.sh

# Finetune on VQA
bash scripts/vqa_finetune.sh
```

## Code Structure

```
Perceiver_VL
│
├── assets                   # illustrations
│   └── architecture.png
│
├── model                    # main source
│   ├── datamodules          # pytorch-lightning data wrappers
│   │   ├── datamodule_base.py
│   │   └── ...
│   ├── datasets             # datasets
│   │   ├── vqa_dataset.py
│   │   └── ...
│   ├── gadgets
│   │   └── my_metrics.py    # metric utils
│   ├── modules
│   │   ├── heads.py         # model heads
│   │   ├── model_module.py  # pytorch-lightning wrapper for the model
│   │   ├── model_utils.py   # pytorch-lightning wrapper for training metrics
│   │   ├── objectives.py    # pretraining/finetuning objectives
│   │   └── perceiver_vl.py  # main model
│   ├── transforms           # image transformation utils
│   │   └── ...
│   └── config.py            # all configurations
│
├── scripts                  # all scripts
│   ├── vqa_finetune.sh
│   ├── co_pretrain.sh
│   └── ...
│
├── run.py                   # main entry point
└── requirements.txt
```

## Citation

```bibtex
@inproceedings{tang2023wacv,
  title     = {Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention},
  author    = {Zineng Tang and Jaemin Cho and Jie Lei and Mohit Bansal},
  booktitle = {WACV},
  year      = {2023}
}
```

## Acknowledgement

Our codebase is based on ViLT. We thank the authors for their open-source contributions.
## Contact

Zineng Tang (zn.tang.terran@gmail.com)
