Skip to content

mesnico/ALADIN

Repository files navigation

ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Introduction

This is the code for reproducing the results from our paper ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval, accepted at CBMI 2022.

Our code is based on OSCAR, whose repository is available here.

Architecture

Installation

Requirements

  • Python 3.7
  • Pytorch 1.2
  • torchvision 0.4.0
  • cuda 10.0

Setup with Conda

# create a new environment conda create --name oscar python=3.7 conda activate oscar # install pytorch1.2 conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch export INSTALL_DIR=$PWD # install apex cd $INSTALL_DIR git clone https://github.com/NVIDIA/apex.git cd apex git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0 python setup.py install --cuda_ext --cpp_ext # install this repo cd $INSTALL_DIR git clone --recursive https://github.com/mesnico/OSCAR-TERAN-distillation cd OSCAR-TERAN-distillation/coco_caption ./get_stanford_models.sh cd .. python setup.py build develop # install requirements pip install -r requirements.txt unset INSTALL_DIR

Download OSCAR & Vin-VL Retrieval data:

Download the checkpoint folder with azcopy:

path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/checkpoint-0132780/' <checkpoint-target-folder> --recursive 

Download the IR data

path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/datasets/coco_ir' <data-folder> --recursive 

Download the pre-extracted Bottom-Up features

path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/' <features-folder> --recursive 

Training

cd alad python train.py --data_dir <data-folder>/coco_ir --img_feat_file <features-folder>/features.tsv --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 --config configs/<config>.yaml --logger_name <output-folder> --val_step 7000 --max_seq_length 50 --max_img_seq_length 34 

Configurations

The parameter --config is very important. Configurations are placed in yaml format inside the configs folder:

  • alad-alignment-triplet.yaml: Trains the alignment head using hinge-based triplet ranking loss, finetuning also the Vin-VL backbone;
  • alad-matching-triplet-finetune.yaml: Trains only the matching head using hinge-based triplet ranking loss. The parameter --load-teacher-model can be used to provide a backbone previously trained using the alad-alignment-triplet.yaml configuration;
  • alad-matching-distill-finetune.yaml: Trains only the matching head by distilling the scores from the alignment head. The parameter --load-teacher-model in this case IS NEEDED to provide a correctly trained alignment head, previously trained using the alad-alignment-triplet.yaml configuration;
  • alad-matching-triplet-e2e.yaml: Trains the matching head, finetuning also the Vin-VL backbone;
  • alad-alignment-and-matching-distill.yaml: Trains the whole architecture (matching+alignment heads) end-to-end. The variable activate_distillation_after inside the configuration file controls how many epochs to wait before activating the distillation loss (wait that the backbone is minimally stable); alternatively, you can load a pre-trained backbone using the --load-teacher-model option.

Monitor Training

Training and validation metrics, as well as model checkpoints are put inside the <output-folder> path. You can live monitor all the metrics using tensorboard:

tensorboard --logdir <output-folder> 

Testing

The following script tests a model on the 1k MS-COCO test set (you can download our best model from here; it is obtained with the alad-alignment-and-matching-distill.yaml configuration.)

cd alad python test.py --data_dir <data-folder>/coco_ir --img_feat_file <features-folder>/features.tsv --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 --max_seq_length 50 --max_img_seq_length 34 --eval_img_keys_file test_img_keys_1k.tsv --load_checkpoint <path/to/checkpoint.pth.tar> 

To test on 5k test set, simply set --eval_img_keys_file test_img_keys.tsv.

Reference

If you found this code useful, please cite the following paper:

@inproceedings{messina2022aladin, title={ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval}, author={Messina, Nicola and Stefanini, Matteo and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Amato, Giuseppe and Cucchiara, Rita}, booktitle={International Conference on Content-based Multimedia Indexing}, pages={64--70}, year={2022} } 

About

Official implementation of the paper "ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval"

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors