This is the code for reproducing the results from our paper ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval, accepted at CBMI 2022.
Our code is based on OSCAR, whose repository is available here.

Requirements:
- Python 3.7
- PyTorch 1.2
- torchvision 0.4.0
- CUDA 10.0
```bash
# create a new environment
conda create --name oscar python=3.7
conda activate oscar

# install pytorch 1.2
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch

export INSTALL_DIR=$PWD

# install apex
cd $INSTALL_DIR
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
python setup.py install --cuda_ext --cpp_ext

# install this repo
cd $INSTALL_DIR
git clone --recursive https://github.com/mesnico/OSCAR-TERAN-distillation
cd OSCAR-TERAN-distillation/coco_caption
./get_stanford_models.sh
cd ..
python setup.py build develop

# install requirements
pip install -r requirements.txt

unset INSTALL_DIR
```

Download the checkpoint folder with azcopy:
```bash
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/checkpoint-0132780/' <checkpoint-target-folder> --recursive
```

Download the IR data:
```bash
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/datasets/coco_ir' <data-folder> --recursive
```

Download the pre-extracted Bottom-Up features:
```bash
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/' <features-folder> --recursive
```

Then, launch the training:

```bash
cd alad
python train.py --data_dir <data-folder>/coco_ir \
    --img_feat_file <features-folder>/features.tsv \
    --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 \
    --config configs/<config>.yaml \
    --logger_name <output-folder> \
    --val_step 7000 \
    --max_seq_length 50 \
    --max_img_seq_length 34
```

The parameter `--config` is very important. The configurations are placed in YAML format inside the `configs` folder:
- `alad-alignment-triplet.yaml`: trains the alignment head using hinge-based triplet ranking loss, also finetuning the Vin-VL backbone;
- `alad-matching-triplet-finetune.yaml`: trains only the matching head using hinge-based triplet ranking loss. The parameter `--load-teacher-model` can be used to provide a backbone previously trained using the `alad-alignment-triplet.yaml` configuration;
- `alad-matching-distill-finetune.yaml`: trains only the matching head by distilling the scores from the alignment head. In this case the parameter `--load-teacher-model` IS NEEDED to provide a correctly trained alignment head, previously trained using the `alad-alignment-triplet.yaml` configuration;
- `alad-matching-triplet-e2e.yaml`: trains the matching head, also finetuning the Vin-VL backbone;
- `alad-alignment-and-matching-distill.yaml`: trains the whole architecture (matching + alignment heads) end-to-end. The variable `activate_distillation_after` inside the configuration file controls how many epochs to wait before activating the distillation loss (so that the backbone is minimally stable); alternatively, you can load a pre-trained backbone using the `--load-teacher-model` option.
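To give an intuition of the two losses mentioned above, here is a minimal PyTorch sketch of a hinge-based triplet ranking loss and of a score-distillation loss over a batch similarity matrix. This is an illustrative approximation, not the repository's actual implementation: function names, the use of a sum (rather than hardest-negative) triplet formulation, and the KL-based distillation variant are assumptions.

```python
import torch
import torch.nn.functional as F


def triplet_ranking_loss(scores, margin=0.2):
    """Hinge-based triplet ranking loss over a batch similarity matrix.

    scores[i, j] is the similarity between image i and caption j; the
    diagonal is assumed to hold the positive (matching) pairs.
    """
    pos = scores.diag().view(-1, 1)
    cost_cap = (margin + scores - pos).clamp(min=0)      # image -> caption direction
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption -> image direction
    mask = torch.eye(scores.size(0), dtype=torch.bool)   # ignore the positives themselves
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    return cost_cap.sum() + cost_img.sum()


def score_distillation_loss(student_scores, teacher_scores, temperature=1.0):
    """Distill the alignment-head (teacher) scores into the matching head (student).

    Hypothetical variant: KL divergence between the row-wise softmaxes of
    the two similarity matrices; the paper's exact formulation may differ.
    """
    t = temperature
    log_p = F.log_softmax(student_scores / t, dim=1)
    q = F.softmax(teacher_scores / t, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean") * (t ** 2)
```

When both heads produce a full batch-vs-batch similarity matrix, the distillation term pushes the cheap matching-head scores toward the fine-grained alignment-head scores without needing extra labels.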
Training and validation metrics, as well as model checkpoints, are saved inside the `<output-folder>` path. You can monitor all the metrics live using TensorBoard:
```bash
tensorboard --logdir <output-folder>
```

The following script tests a model on the 1k MS-COCO test set (you can download our best model from here; it is obtained with the `alad-alignment-and-matching-distill.yaml` configuration):
```bash
cd alad
python test.py --data_dir <data-folder>/coco_ir \
    --img_feat_file <features-folder>/features.tsv \
    --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 \
    --max_seq_length 50 \
    --max_img_seq_length 34 \
    --eval_img_keys_file test_img_keys_1k.tsv \
    --load_checkpoint <path/to/checkpoint.pth.tar>
```

To test on the 5k test set, simply set `--eval_img_keys_file test_img_keys.tsv`.
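Image-text retrieval on these test sets is commonly evaluated with Recall@K. As a rough sketch of what such an evaluation computes (a hypothetical helper, not the repository's `test.py`, which additionally handles the five-captions-per-image layout of MS-COCO):

```python
import numpy as np


def recall_at_k(scores, ks=(1, 5, 10)):
    """Recall@K for image-to-text retrieval from a similarity matrix.

    scores[i, j]: similarity between image i and caption j. For simplicity,
    the ground-truth caption of image i is assumed to sit at column i.
    """
    ranks = []
    for i, row in enumerate(scores):
        order = np.argsort(row)[::-1]              # captions sorted by decreasing similarity
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the ground-truth caption
    ranks = np.asarray(ranks)
    # fraction of queries whose ground truth appears in the top K results
    return {k: float((ranks < k).mean()) for k in ks}
```

For example, a perfectly diagonal similarity matrix yields Recall@1 = 1.0.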
If you found this code useful, please cite the following paper:
```
@inproceedings{messina2022aladin,
  title={ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval},
  author={Messina, Nicola and Stefanini, Matteo and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Amato, Giuseppe and Cucchiara, Rita},
  booktitle={International Conference on Content-based Multimedia Indexing},
  pages={64--70},
  year={2022}
}
```