Official PyTorch implementation of the following paper:
Towards Scalable Language-Image Pre-training for 3D Medical Imaging
University of Michigan
We propose Hierarchical attention for Language-Image Pre-training (HLIP), inspired by the natural hierarchy of radiology data: slice, scan, and study. With this lightweight attention mechanism, HLIP can be trained directly on uncurated clinical datasets, enabling scalable language-image pre-training in 3D medical imaging. For real-world clinical use, HLIP can be applied to studies containing either a single scan (e.g., chest CT) or multiple scans (e.g., brain MRI).
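For intuition, the hierarchy can be realized by restricting self-attention to tokens from the same slice or scan, simply by folding those groups into the batch dimension, while the study level attends across all tokens. Below is a minimal sketch of the idea (toy shapes and a shared attention module for brevity; not the exact implementation in this repository):

```python
import torch
import torch.nn as nn

def grouped_attention(tokens, attn, group_size):
    """Self-attention restricted to non-overlapping groups of tokens.
    tokens: (batch, num_tokens, dim); group_size must divide num_tokens."""
    B, N, D = tokens.shape
    x = tokens.reshape(B * N // group_size, group_size, D)  # fold groups into the batch dim
    x, _ = attn(x, x, x, need_weights=False)
    return x.reshape(B, N, D)

# Toy study: 2 scans x 4 slices x 196 tokens per slice, ViT-Base width.
tokens = torch.randn(1, 2 * 4 * 196, 768)
attn = nn.MultiheadAttention(768, num_heads=12, batch_first=True)

x = grouped_attention(tokens, attn, group_size=196)       # slice level: mix tokens within a slice
x = grouped_attention(x, attn, group_size=4 * 196)        # scan level: mix tokens within a scan
x, _ = attn(x, x, x, need_weights=False)                  # study level: mix tokens across the study
```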
- (2025-06) Completed the initial release of the HLIP repository.
- (2025-05) Released HLIP models trained on chest CT and brain MRI; feel free to try our demos.
```bash
python3 -m venv env
source env/bin/activate
pip install -U pip
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
git clone git@github.com:mlfoundations/open_clip.git
cd open_clip
make install
make install-training
```

| Data | Attention | Patch Size | Model |
|---|---|---|---|
| CT-RATE-20K | slice + scan | 8, 24, 24 | ViT-Base |
| BrainMRI220K | scan + study | 16, 16, 16 | ViT-Base |
| BrainMRI220K | scan + study | 8, 16, 16 | ViT-Base |
| BrainMRI220K | slice + scan + study | 8, 16, 16 | ViT-Base |
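For intuition on the Patch Size column: it is the (depth, height, width) of each 3D patch, so a 48 × 224 × 224 scan tokenized with an 8 × 16 × 16 patch yields 6 × 14 × 14 = 1176 tokens, matching the token1176 suffix in the model names. A minimal sketch of such a patch embedding, assuming a standard Conv3d stem (illustrative, not necessarily the exact stem used here):

```python
import torch
import torch.nn as nn

# Hypothetical 3D patch embedding with an anisotropic patch size (a Conv3d stem,
# as in standard 3D ViTs); shapes follow the BrainMRI220K rows above.
patch_size = (8, 16, 16)                    # (depth, height, width)
embed = nn.Conv3d(1, 768, kernel_size=patch_size, stride=patch_size)

volume = torch.randn(1, 1, 48, 224, 224)    # one scan at a 48 x 224 x 224 input size
tokens = embed(volume).flatten(2).transpose(1, 2)
print(tokens.shape)                         # (1, 1176, 768): 6 x 14 x 14 = 1176 patch tokens
```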
Chest CT: an example from the external Rad-ChestCT dataset.
```bash
python inference_rad_chestct.py \
    --model vit_base_singlescan_h2_token1176 \
    --resume /path/to/vit_base_chestct_h2_token1176.pt \
    --data /docs/tst32751/tst32751.pt
```

Brain MRI: an example from the external BraTS23 dataset.
```bash
python inference_pub_brain_5.py \
    --model vit_base_multiscan_h2_token1176 \
    --resume /path/to/vit_base_brainmri_h2_token1176.pt \
    --patch-size 8 16 16 \
    --num-slices 72 \
    --data /docs/BraTS-GLI-00459-000/
```

Visualize the activation with `--interpret`.
CT-RATE
```bash
python zeroshot_ct_rate.py \
    --model vit_base_singlescan_h2_token2744 \
    --resume /path/to/vit_base_chestct_h2_token2744.pt \
    --ct-rate-root /data/ct_rate/valid/ \
    --zeroshot-template volume
```

Rad-ChestCT
```bash
python zeroshot_rad_chestct.py \
    --model vit_base_singlescan_h2_token2744 \
    --resume /path/to/vit_base_chestct_h2_token2744.pt \
    --rad-chestct-root /data/rad_chestct/ \
    --zeroshot-template volume
```

Brain MRI
```bash
python pub_brain_5_embed.py \
    --model vit_base_multiscan_h2_token1176 \
    --resume /path/to/vit_base_brainmri_h2_token1176.pt \
    --num-slices 144
```

```bash
python zeroshot_pub_brain_5.py \
    --model vit_base_multiscan_h2_token1176 \
    --resume /path/to/vit_base_brainmri_h2_token1176.pt \
    --num-slices 144 \
    --zeroshot_prompt prompt \
    --zeroshot_template template
```

As there are ~18K studies in the Pub-Brain-5 dataset, evaluation may take ~30 minutes. We first extract the embedding for each study, followed by zero-shot classification; this procedure facilitates the evaluation of prompt engineering. Although we use a fixed input size of 48 × 224 × 224, `--num-slices` is set to 144 during evaluation, as we found that HLIP can directly transfer to and benefit from higher-resolution inputs at test time.
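In other words, the two-step protocol caches one embedding per study and then scores it against text embeddings of the class prompts, so prompts can be iterated on without re-encoding any images. A minimal sketch with prompt ensembling, using random tensors in place of the cached HLIP outputs:

```python
import torch
import torch.nn.functional as F

def zeroshot_classify(study_embeds, class_text_embeds):
    """Score cached study embeddings against per-class text embeddings.
    study_embeds: (num_studies, dim); class_text_embeds: list of (num_prompts, dim) tensors."""
    class_embeds = []
    for emb in class_text_embeds:
        emb = F.normalize(emb, dim=-1).mean(0)        # average ("ensemble") the prompts of a class
        class_embeds.append(F.normalize(emb, dim=-1))
    class_embeds = torch.stack(class_embeds)          # (num_classes, dim)
    logits = F.normalize(study_embeds, dim=-1) @ class_embeds.t()
    return logits.argmax(dim=-1)                      # predicted class index per study

# Toy usage with random embeddings standing in for the cached outputs.
studies = torch.randn(8, 512)
classes = [torch.randn(3, 512) for _ in range(5)]     # 5 classes, 3 prompts each
print(zeroshot_classify(studies, classes))
```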
Our training implementation closely follows open_clip, allowing us to leverage features such as patch dropout and SigLIP. Below, we provide a training demo for chest CT. Training on CT-RATE for 20 epochs takes ~6 hours on a node with 4 A40 GPUs.
```bash
torchrun --rdzv_endpoint=localhost:29500 --nproc_per_node 4 main.py \
    --json-root ../../data/ct_rate/files/ --data-root /path/to/data/ct_rate/ \
    --train-data raw_annotation --input-info -1150 350 crop \
    --zeroshot-ct-rate ../../data/ct_rate/metafiles/valid_labels.csv --zeroshot-template volume \
    --zeroshot-frequency 1 \
    --save-frequency 1 \
    --report-to wandb \
    --wandb-project-name chest_ct \
    --warmup 377 \
    --batch-size 16 \
    --accum-batch 1 \
    --lr=1e-5 \
    --wd=0.2 \
    --epochs=20 \
    --precision amp \
    --workers 4 \
    --grad-checkpointing \
    --model vit_base_singlescan_h2_token2744 \
    --use-cxr-bert \
    --lock-text
```

Use the following flags for patch dropout:
```bash
    --force-patch-dropout 0.5 \
    --beta2 0.95
```
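For reference, patch dropout (as in FLIP) randomly keeps only a fraction of the patch tokens during training, cutting attention compute and memory. A minimal sketch of the idea (not the open_clip implementation):

```python
import torch

def patch_dropout(tokens, keep_ratio=0.5):
    """FLIP-style patch dropout: randomly keep a subset of patch tokens during training.
    tokens: (batch, num_patches, dim), excluding any class token."""
    B, N, D = tokens.shape
    num_keep = max(1, int(N * keep_ratio))
    # Per-sample random permutation of patch indices; keep the first num_keep.
    keep = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_keep]
    return tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, D))

x = torch.randn(2, 1176, 768)
print(patch_dropout(x, keep_ratio=0.5).shape)   # torch.Size([2, 588, 768])
```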
Use the following flag for SigLIP:

```bash
    --siglip
```
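For reference, `--siglip` replaces the softmax contrastive loss with SigLIP's pairwise sigmoid loss, where every image-text pair is treated as an independent binary example. A minimal single-device sketch of that loss (the open_clip implementation additionally handles distributed negatives):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_features, text_features, logit_scale, logit_bias):
    """Pairwise sigmoid loss over all image-text pairs in the batch."""
    logits = logit_scale * image_features @ text_features.t() + logit_bias
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1   # +1 matched, -1 unmatched
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

img = F.normalize(torch.randn(16, 512), dim=-1)
txt = F.normalize(torch.randn(16, 512), dim=-1)
print(siglip_loss(img, txt, logit_scale=10.0, logit_bias=-10.0))
```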
If you find this repository helpful, please consider citing:

```bibtex
@article{zhao2025towards,
  title={Towards Scalable Language-Image Pre-training for 3D Medical Imaging},
  author={Zhao, Chenhui and Lyu, Yiwei and Chowdury, Asadur and Harake, Edward and Kondepudi, Akhil and Rao, Akshay and Hou, Xinhai and Lee, Honglak and Hollon, Todd},
  journal={arXiv preprint arXiv:2505.21862},
  year={2025}
}
```