This is the code for reproducing the results from our paper ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval, accepted at CBMI 2022.
Our code is based on OSCAR, whose repository is available here.

Requirements:
- Python 3.7
- PyTorch 1.2
- torchvision 0.4.0
- CUDA 10.0
```bash
# create a new environment
conda create --name oscar python=3.7
conda activate oscar

# install pytorch 1.2
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch

export INSTALL_DIR=$PWD

# install apex
cd $INSTALL_DIR
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
python setup.py install --cuda_ext --cpp_ext

# install this repo
cd $INSTALL_DIR
git clone --recursive https://github.com/mesnico/OSCAR-TERAN-distillation
cd OSCAR-TERAN-distillation/coco_caption
./get_stanford_models.sh
cd ..
python setup.py build develop

# install requirements
pip install -r requirements.txt

unset INSTALL_DIR
```

Download the checkpoint folder with azcopy:
```bash
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/checkpoint-0132780/' <checkpoint-target-folder> --recursive
```

Download the IR data:
```bash
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/datasets/coco_ir' <data-folder> --recursive
```

Download the pre-extracted Bottom-Up features:
```bash
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/' <features-folder> --recursive
```

Then, launch the training:

```bash
cd alad
python train.py --data_dir <data-folder>/coco_ir \
    --img_feat_file <features-folder>/features.tsv \
    --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 \
    --config configs/<config>.yaml \
    --logger_name <output-folder> \
    --val_step 7000 \
    --max_seq_length 50 \
    --max_img_seq_length 34
```

The parameter `--config` is very important. The configurations are placed in YAML format inside the `configs` folder:
- `alad-alignment-triplet.yaml`: trains the alignment head using hinge-based triplet ranking loss, also finetuning the Vin-VL backbone;
- `alad-matching-triplet-finetune.yaml`: trains only the matching head using hinge-based triplet ranking loss. The parameter `--load-teacher-model` can be used to provide a backbone previously trained using the `alad-alignment-triplet.yaml` configuration;
- `alad-matching-distill-finetune.yaml`: trains only the matching head by distilling the scores from the alignment head. In this case the parameter `--load-teacher-model` IS NEEDED to provide a correctly trained alignment head, previously trained using the `alad-alignment-triplet.yaml` configuration;
- `alad-matching-triplet-e2e.yaml`: trains the matching head, also finetuning the Vin-VL backbone;
- `alad-alignment-and-matching-distill.yaml`: trains the whole architecture (matching + alignment heads) end-to-end. The variable `activate_distillation_after` inside the configuration file controls how many epochs to wait before activating the distillation loss (so that the backbone is minimally stable); alternatively, you can load a pre-trained backbone using the `--load-teacher-model` option.
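To give an intuition of the two losses mentioned above, here is a minimal PyTorch sketch of a hinge-based triplet ranking loss and of a score-distillation loss over a batch similarity matrix. This is an illustrative approximation, not the repository's actual implementation: function names, the use of a sum (rather than hardest-negative) triplet formulation, and the KL-based distillation variant are assumptions.

```python
import torch
import torch.nn.functional as F


def triplet_ranking_loss(scores, margin=0.2):
    """Hinge-based triplet ranking loss over a batch similarity matrix.

    scores[i, j] is the similarity between image i and caption j; the
    diagonal is assumed to hold the positive (matching) pairs.
    """
    pos = scores.diag().view(-1, 1)
    cost_cap = (margin + scores - pos).clamp(min=0)      # image -> caption direction
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption -> image direction
    mask = torch.eye(scores.size(0), dtype=torch.bool)   # ignore the positives themselves
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_cap.sum() + cost_img.sum()


def score_distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """Distill the alignment-head (teacher) scores into the matching head (student).

    Hypothetical variant: KL divergence between the row-wise softmaxes of
    the two similarity matrices; the paper's exact formulation may differ.
    """
    t = temperature
    log_p = F.log_softmax(student_scores / t, dim=1)
    q = F.softmax(teacher_scores / t, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean") * (t ** 2)
```

When both heads produce a full batch-vs-batch similarity matrix, the distillation term pushes the cheap matching-head scores toward the fine-grained alignment-head scores without needing extra labels.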
Training and validation metrics, as well as model checkpoints, are saved inside the `<output-folder>` path. You can monitor all the metrics live using TensorBoard:
```bash
tensorboard --logdir <output-folder>
```

The following script tests a model on the 1k MS-COCO test set (you can download our best model from here; it is obtained with the `alad-alignment-and-matching-distill.yaml` configuration):
```bash
cd alad
python test.py --data_dir <data-folder>/coco_ir \
    --img_feat_file <features-folder>/features.tsv \
    --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 \
    --max_seq_length 50 \
    --max_img_seq_length 34 \
    --eval_img_keys_file test_img_keys_1k.tsv \
    --load_checkpoint <path/to/checkpoint.pth.tar>
```

To test on the 5k test set, simply set `--eval_img_keys_file test_img_keys.tsv`.
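Image-text retrieval on these test sets is commonly evaluated with Recall@K. As a rough sketch of what such an evaluation computes (a hypothetical helper, not the repository's `test.py`, which additionally handles the five-captions-per-image layout of MS-COCO):

```python
import numpy as np


def recall_at_k(scores, ks=(1, 5, 10)):
    """Recall@K for image-to-text retrieval from a similarity matrix.

    scores[i, j]: similarity between image i and caption j. For simplicity,
    the ground-truth caption of image i is assumed to sit at column i.
    """
    ranks = []
    for i, row in enumerate(scores):
        order = np.argsort(row)[::-1]              # captions sorted by decreasing similarity
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the ground-truth caption
    ranks = np.asarray(ranks)
    # fraction of queries whose ground truth appears in the top K results
    return {k: float((ranks < k).mean()) for k in ks}
```

For example, a perfectly diagonal similarity matrix yields Recall@1 = 1.0.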
If you found this code useful, please cite the following paper:
```
@inproceedings{messina2022aladin,
  title={ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval},
  author={Messina, Nicola and Stefanini, Matteo and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Amato, Giuseppe and Cucchiara, Rita},
  booktitle={International Conference on Content-based Multimedia Indexing},
  pages={64--70},
  year={2022}
}
```