Official PyTorch implementation of the following paper:
Towards Scalable Language-Image Pre-training for 3D Medical Imaging
University of Michigan
We propose Hierarchical attention for Language-Image Pre-training (HLIP), inspired by the natural hierarchy of radiology data: slice, scan, and study. With this lightweight attention mechanism, HLIP can be trained directly on uncurated clinical datasets, enabling scalable language-image pre-training in 3D medical imaging. For real-world clinical use, HLIP can be applied to studies containing either a single scan (e.g., chest CT) or multiple scans (e.g., brain MRI).
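For intuition, the hierarchy can be realized by restricting self-attention to tokens from the same slice or scan, simply by folding those groups into the batch dimension, while the study level attends across all tokens. Below is a minimal sketch of the idea (toy shapes and a shared attention module for brevity; not the exact implementation in this repository):

```python
import torch
import torch.nn as nn

def grouped_attention(tokens, attn, group_size):
    """Self-attention restricted to non-overlapping groups of tokens.
    tokens: (batch, num_tokens, dim); group_size must divide num_tokens."""
    B, N, D = tokens.shape
    x = tokens.reshape(B * N // group_size, group_size, D)  # fold groups into the batch dim
    x, _ = attn(x, x, x, need_weights=False)
    return x.reshape(B, N, D)

# Toy study: 2 scans x 4 slices x 196 tokens per slice, ViT-Base width.
tokens = torch.randn(1, 2 * 4 * 196, 768)
attn = nn.MultiheadAttention(768, num_heads=12, batch_first=True)

x = grouped_attention(tokens, attn, group_size=196)       # slice level: mix tokens within a slice
x = grouped_attention(x, attn, group_size=4 * 196)        # scan level: mix tokens within a scan
x, _ = attn(x, x, x, need_weights=False)                  # study level: mix tokens across the study
```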
- (2025-06) Completed the initial release of the HLIP repository.
- (2025-05) Released HLIP models trained on chest CT and brain MRI; feel free to try our demos.
```bash
python3 -m venv env
source env/bin/activate
pip install -U pip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
git clone git@github.com:mlfoundations/open_clip.git
cd open_clip
make install
make install-training
```

| Data | Attention | Patch Size | Model |
|---|---|---|---|
| CT-RATE-20K | slice + scan | 8, 24, 24 | ViT-Base |
| BrainMRI220K | scan + study | 16, 16, 16 | ViT-Base |
| BrainMRI220K | scan + study | 8, 16, 16 | ViT-Base |
| BrainMRI220K | slice + scan + study | 8, 16, 16 | ViT-Base |
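For intuition on the Patch Size column: it is the (depth, height, width) of each 3D patch, so a 48 × 224 × 224 scan tokenized with an 8 × 16 × 16 patch yields 6 × 14 × 14 = 1176 tokens, matching the token1176 suffix in the model names. A minimal sketch of such a patch embedding, assuming a standard Conv3d stem (illustrative, not necessarily the exact stem used here):

```python
import torch
import torch.nn as nn

# Hypothetical 3D patch embedding with an anisotropic patch size (a Conv3d stem,
# as in standard 3D ViTs); shapes follow the BrainMRI220K rows above.
patch_size = (8, 16, 16)                    # (depth, height, width)
embed = nn.Conv3d(1, 768, kernel_size=patch_size, stride=patch_size)

volume = torch.randn(1, 1, 48, 224, 224)    # one scan at a 48 x 224 x 224 input size
tokens = embed(volume).flatten(2).transpose(1, 2)
print(tokens.shape)                         # (1, 1176, 768): 6 x 14 x 14 = 1176 patch tokens
```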
Chest CT: an example from the external Rad-ChestCT dataset.
```bash
python inference_rad_chestct.py \
    --model vit_base_singlescan_h2_token1176 \
    --resume /path/to/vit_base_chestct_h2_token1176.pt \
    --data /docs/tst32751/tst32751.pt
```

Brain MRI: an example from the external BraTS23 dataset.
```bash
python inference_pub_brain_5.py \
    --model vit_base_multiscan_h2_token1176 \
    --resume /path/to/vit_base_brainmri_h2_token1176.pt \
    --patch-size 8 16 16 \
    --num-slices 72 \
    --data /docs/BraTS-GLI-00459-000/
```

Visualize the activation with `--interpret`.
CT-RATE
```bash
python zeroshot_ct_rate.py \
    --model vit_base_singlescan_h2_token2744 \
    --resume /path/to/vit_base_chestct_h2_token2744.pt \
    --ct-rate-root /data/ct_rate/valid/ \
    --zeroshot-template volume
```

Rad-ChestCT
```bash
python zeroshot_rad_chestct.py \
    --model vit_base_singlescan_h2_token2744 \
    --resume /path/to/vit_base_chestct_h2_token2744.pt \
    --rad-chestct-root /data/rad_chestct/ \
    --zeroshot-template volume
```

Brain MRI
```bash
python pub_brain_5_embed.py \
    --model vit_base_multiscan_h2_token1176 \
    --resume /path/to/vit_base_brainmri_h2_token1176.pt \
    --num-slices 144
```

```bash
python zeroshot_pub_brain_5.py \
    --model vit_base_multiscan_h2_token1176 \
    --resume /path/to/vit_base_brainmri_h2_token1176.pt \
    --num-slices 144 \
    --zeroshot_prompt prompt \
    --zeroshot_template template
```

As there are ~18K studies in the Pub-Brain-5 dataset, evaluation may take ~30 minutes. We first extract the embedding for each study, followed by zero-shot classification; this procedure facilitates the evaluation of prompt engineering. Although we use a fixed input size of 48 × 224 × 224, `--num-slices` is set to 144 during evaluation, as we found that HLIP can directly transfer to and benefit from higher-resolution inputs at test time.
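In other words, the two-step protocol caches one embedding per study and then scores it against text embeddings of the class prompts, so prompts can be iterated on without re-encoding any images. A minimal sketch with prompt ensembling, using random tensors in place of the cached HLIP outputs:

```python
import torch
import torch.nn.functional as F

def zeroshot_classify(study_embeds, class_text_embeds):
    """Score cached study embeddings against per-class text embeddings.
    study_embeds: (num_studies, dim); class_text_embeds: list of (num_prompts, dim) tensors."""
    class_embeds = []
    for emb in class_text_embeds:
        emb = F.normalize(emb, dim=-1).mean(0)        # average ("ensemble") the prompts of a class
        class_embeds.append(F.normalize(emb, dim=-1))
    class_embeds = torch.stack(class_embeds)          # (num_classes, dim)
    logits = F.normalize(study_embeds, dim=-1) @ class_embeds.t()
    return logits.argmax(dim=-1)                      # predicted class index per study

# Toy usage with random embeddings standing in for the cached outputs.
studies = torch.randn(8, 512)
classes = [torch.randn(3, 512) for _ in range(5)]     # 5 classes, 3 prompts each
print(zeroshot_classify(studies, classes))
```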
Our training implementation closely follows open_clip, allowing us to leverage features such as patch dropout and SigLIP. Below, we provide a training demo for chest CT. Training on CT-RATE for 20 epochs takes ~6 hours on a node with 4 A40 GPUs.
```bash
torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 main.py \
    --json-root ../../data/ct_rate/files/ --data-root /path/to/data/ct_rate/ \
    --train-data raw_annotation --input-info -1150 350 crop \
    --zeroshot-ct-rate ../../data/ct_rate/metafiles/valid_labels.csv --zeroshot-template volume \
    --zeroshot-frequency 1 \
    --save-frequency 1 \
    --report-to wandb \
    --wandb-project-name chest_ct \
    --warmup 377 \
    --batch-size 16 \
    --accum-batch 1 \
    --lr=1e-5 \
    --wd=0.2 \
    --epochs=20 \
    --precision amp \
    --workers 4 \
    --grad-checkpointing \
    --model vit_base_singlescan_h2_token2744 \
    --use-cxr-bert \
    --lock-text
```

Use the following flags for patch dropout:
```bash
    --force-patch-dropout 0.5 \
    --beta2 0.95
```
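For reference, patch dropout (as in FLIP) randomly keeps only a fraction of the patch tokens during training, cutting attention compute and memory. A minimal sketch of the idea (not the open_clip implementation):

```python
import torch

def patch_dropout(tokens, keep_ratio=0.5):
    """FLIP-style patch dropout: randomly keep a subset of patch tokens during training.
    tokens: (batch, num_patches, dim), excluding any class token."""
    B, N, D = tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # Per-sample random permutation of patch indices; keep the first num_keep.
    keep = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_keep]
    return tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, D))

x = torch.randn(2, 1176, 768)
print(patch_dropout(x, keep_ratio=0.5).shape)   # torch.Size([2, 588, 768])
```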
Use the following flag for SigLIP:

```bash
    --siglip
```
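For reference, `--siglip` replaces the softmax contrastive loss with SigLIP's pairwise sigmoid loss, where every image-text pair is treated as an independent binary example. A minimal single-device sketch of that loss (the open_clip implementation additionally handles distributed negatives):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, logit_scale, logit_bias):
    """Pairwise sigmoid loss over all image-text pairs in the batch."""
    logits = logit_scale * image_features @ text_features.t() + logit_bias
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1   # +1 matched, -1 unmatched
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

img = F.normalize(torch.randn(16, 512), dim=-1)
txt = F.normalize(torch.randn(16, 512), dim=-1)
print(siglip_loss(img, txt, logit_scale=10.0, logit_bias=-10.0))
```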
If you find this repository helpful, please consider citing:

```bibtex
@article{zhao2025towards,
  title={Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author={Zhao, Chenhui and Lyu, Yiwei and Chowdury, Asadur and Harake, Edward and Kondepudi, Akhil and Rao, Akshay and Hou, Xinhai and Lee, Honglak and Hollon, Todd},
  journal={arXiv preprint arXiv:2505.21862},
  year={2025}
}
```