DeepCORO_CLIP is a deep learning model for echocardiography video interpretation using contrastive learning. It leverages a Multiscale Vision Transformer (mVIT) for video encoding and BioMedBERT for text encoding, trained on millions of video-report pairs.
- Contrastive Learning: Train on video-report pairs using CLIP-style contrastive learning
- Single video mode: Process one video per study
- Multi-video mode: Process multiple videos per study with aggregation
- Linear Probing: Fine-tune the model for specific tasks using linear probing
- Multi-GPU Training: Support for distributed training across multiple GPUs
- Hyperparameter Optimization: Built-in support for Weights & Biases sweeps
- Automatic Mixed Precision: Optimized training with AMP
- Distributed Data Parallel: Efficient multi-GPU training
- Patch- vs. Video-level Reasoning: Expose all patch tokens, a single token per video, or a single token per study with two simple flags (`aggregate` and `per_video_pool`) in the `VideoEncoder`.
- CUDA-capable GPU
- Python 3.11+
1. Clone the repository:

```bash
git clone https://github.com/HeartWise-AI/DeepCORO_CLIP.git
cd DeepCORO_CLIP
```

2. Set up the virtual environment:

```bash
pip install uv
uv sync
```

3. Activate the virtual environment:

```bash
source .venv/bin/activate
```

4. Install `yq` (required to run `scripts/run_sweep.sh`):

```bash
wget https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64 -O /usr/bin/yq && \
  chmod +x /usr/bin/yq
```

5. Log into Weights & Biases (required for sweeps):

```bash
wandb login
```

6. Make sure FFmpeg 4.4.x is installed (required for sweeps):

```bash
which ffmpeg
conda remove ffmpeg   # remove if /opt/conda/bin/ffmpeg exists
sudo apt update
sudo apt install ffmpeg
sudo apt install libavcodec-extra
ffmpeg -version
```
The project uses configuration files located in the `config/` directory:

- CLIP Training (`config/clip/base_config.yaml`):
  - Training parameters (epochs, batch size, learning rate)
  - Model architecture settings
  - Data loading parameters
  - Optimization settings
  - Video mode settings (single/multi)
  - Video aggregation parameters

- Linear Probing (`config/linear_probing/base_config.yaml`):
  - Task-specific parameters
  - Head structure configuration
  - Loss function settings
  - Backbone freezing options

Sweep configurations:

- CLIP Training (`config/clip/sweep_config_*.yaml`, `config/clip/sweep_siglip_output_dataset_*.yaml`):
  - Hyperparameter search space for CLIP training
  - Supports both single- and multi-video training
  - `sweep_siglip_output_dataset_config.yaml` surfaces conservative learning-rate, grad-clipping, and SigLIP weighting knobs for unstable SigLIP runs

- Linear Probing (`config/linear_probing/sweep_config.yaml`):
  - Hyperparameter optimization for linear probing tasks
  - Task-specific parameter ranges
Train: process video-report pairs (multiple videos per study with aggregation) - a couple of example combinations:

```bash
# Single GPU training without logging results to wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/clip/base_config.yaml --selected_gpus 0 --use_wandb false --run_mode train

# Multi-GPU training with results logging on wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/clip/base_config.yaml --selected_gpus 0,1 --use_wandb true --run_mode train

# Multi-GPU hyperparameter fine-tuning - run_mode and use_wandb are forced to train and true respectively (see scripts/run_sweep.sh)
bash scripts/run_sweep.sh --base_config config/clip/base_config.yaml --sweep_config config/clip/sweep_config_single_video.yaml --selected_gpus 3 --count 5

# SigLIP output dataset stability sweep (tunes grad clipping, AMP, and SigLIP weighting)
bash scripts/run_sweep.sh --base_config config/clip/siglip_output_dataset_config.yaml --sweep_config config/clip/sweep_siglip_output_dataset_config.yaml --selected_gpus 0,1 --count 10
```

Val: Not supported

Inference: process data from the input CSV (rows where `Split == 'inference'`) - working on single GPU only:

```bash
bash scripts/runner.sh --selected_gpus 0 --base_config config/clip/base_config.yaml --run_mode inference --use_wandb false
```

Test: Not supported
Fine-tune the model for specific tasks using linear probing - a couple of example combinations:

```bash
# Single GPU training without logging results to wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/linear_probing/base_config.yaml --selected_gpus 0 --use_wandb false --run_mode train

# Multi-GPU training with results logging on wandb (see scripts/runner.sh)
bash scripts/runner.sh --base_config config/linear_probing/base_config.yaml --selected_gpus 0,1 --use_wandb true --run_mode train

# Multi-GPU hyperparameter fine-tuning - run_mode and use_wandb are forced to train and true respectively (see scripts/run_sweep.sh)
bash scripts/run_sweep.sh --base_config config/linear_probing/base_config.yaml --sweep_config config/linear_probing/sweep_config.yaml --selected_gpus 0,1 --count 5
```

Process validation data from the input CSV (rows where `Split == 'val'`) and compute confidence intervals for each head:

```bash
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode val --selected_gpus 1,2,3
```

Process test data from the input CSV (rows where `Split == 'test'`) and compute confidence intervals for each head:

```bash
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode test --selected_gpus 1,2,3
```

Process inference data from the input CSV (rows where `Split == 'inference'`):

```bash
bash scripts/runner.sh --use_wandb false --base_config config/linear_probing/stenosis/base_config_stenosis_2vue.yaml --run_mode inference --selected_gpus 1,2,3
```

Optionally, you can build a Docker container to run the training, validation, and inference pipelines. For the validation pipeline, set up your Hugging Face API key in `api_key.json`, as weights will be publicly available only upon publication.
```bash
docker build -t deepcoro_clip-docker .
```

Requirements:

- Make sure your CSV file is in the `data` folder; `$(pwd)/data` can be replaced by the absolute path to that folder
- Create a `results` folder; `$(pwd)/results` can be replaced by the absolute path to that folder
- Make sure you have a `FileName` column with the root path defined as `/app/videos`; `$(pwd)/videos` can be replaced by the absolute path to your base video folder

```bash
docker run -it --gpus all --shm-size=32g --memory=64g --ipc=host --network=host \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/results:/app/results \
  -v $(pwd)/videos:/app/videos \
  deepcoro_clip-docker
```

Once connected to the Docker container:
- For validation and inference: follow step 3 only from the Environment Setup section above
- For training: follow steps 3, 5, and 6 from the Environment Setup section above
- Download pretrained weights:

```bash
python utils/download_pretrained_weights.py
```

The pretrained weights will be saved in `/app/pretrained_models`.

- Run your pipeline: select the appropriate command from the Run Modes section above
- Multiscale Vision Transformer (mVIT) backbone
- Configurable number of heads and layers
- Support for pretrained weights
- Optional backbone freezing
- New flags for fine-grained control over the output:
  - `aggregate=True` (default) → returns one study-level vector `[B, D]`
  - `aggregate=False, per_video_pool=True` → returns one token per video `[B, N, D]`, ready for MIL / linear probing heads
  - `aggregate=False, per_video_pool=False` → returns all patch tokens `[B, N·L, D]` (the only setting that preserves all tokens) for the most detailed downstream reasoning
Example (video-level MIL):

```python
from models.video_encoder import VideoEncoder
from models.multi_instance_linear_probing import MultiInstanceLinearProbing

encoder = VideoEncoder(
    backbone="mvit",
    aggregate=False,               # skip internal aggregator
    aggregate_videos_tokens=True,  # one token per video
)
probe = MultiInstanceLinearProbing(
    embedding_dim=encoder.embedding_dim,
    head_structure={"severity": 4},
    pooling_mode="attention",
)

video_batch = ...                  # [B, N, T, H, W, C]
feats = encoder(video_batch)       # [B, N, D]
logits = probe(feats)              # dict with head outputs
```

- BioMedBERT for medical text encoding
- Configurable freezing ratio
- Contrastive learning with video features
- Task-specific classification heads
- Configurable dropout and architecture
- Support for multiple output classes per head
We use pre-commit hooks to ensure code quality and consistency:
```bash
# Install pre-commit
uv pip install pre-commit
pre-commit install

# Run hooks manually
pre-commit run --all-files
```

| GPU Memory | Recommended Batch Size | Command |
|---|---|---|
| 8GB | 4-8 | --batch-size 8 |
| 12GB | 8-16 | --batch-size 16 |
| 16GB | 16-24 | --batch-size 24 |
| 24GB+ | 24-32 | --batch-size 32 |
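The mapping in the table above can be expressed as a small helper. This is purely illustrative; `recommended_batch_size` is a hypothetical function, not part of the repo.

```python
def recommended_batch_size(gpu_memory_gb: float) -> int:
    """Map available GPU memory (GB) to the batch sizes recommended above."""
    if gpu_memory_gb >= 24:
        return 32
    if gpu_memory_gb >= 16:
        return 24
    if gpu_memory_gb >= 12:
        return 16
    return 8

print(recommended_batch_size(16))  # -> 24
```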
- Batch Size Selection:
  - Start with smaller batch sizes and increase if memory allows
  - Larger batch sizes generally allow faster training
  - Reduce if you get OOM errors

- Number of Workers:
  - Rule of thumb: `num_workers = 4 * num_gpus`
  - Reduce if you get memory or file handle errors
  - Example: `--num-workers 2` for slower storage systems
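The worker rule of thumb can be sketched as follows; capping at the machine's CPU count is an extra assumption on my part, not something the repo specifies.

```python
import os

def suggested_num_workers(num_gpus: int) -> int:
    """Rule of thumb: 4 workers per GPU, capped at the CPU count (assumption)."""
    cpu_count = os.cpu_count() or 1
    return min(4 * num_gpus, cpu_count)

print(suggested_num_workers(2))
```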
- Learning Rate:
  - Default (1e-4) works well for most cases
  - For larger batch sizes: `lr = 1e-4 * (batch_size / 32)`
  - Example: `--lr 2e-4` for batch size 64
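The linear scaling rule for the learning rate can be written as a tiny helper (illustrative only; `scaled_lr` is not a repo function):

```python
def scaled_lr(batch_size: int, base_lr: float = 1e-4, base_batch: int = 32) -> float:
    """Linearly scale the learning rate with batch size."""
    return base_lr * (batch_size / base_batch)

print(scaled_lr(64))  # -> 0.0002, i.e. --lr 2e-4 for batch size 64
```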
- Number of Epochs:
  - Default (50) is good for most cases
  - Increase for better performance: `--epochs 100`
  - Decrease for quick experiments: `--epochs 10`
- Out of Memory (OOM):
  - Reduce batch size
  - Use gradient accumulation
  - Force single GPU mode
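Gradient accumulation keeps the effective batch size large while fitting smaller micro-batches in memory. A generic PyTorch sketch of the pattern, not the repo's actual training loop:

```python
import torch

def train_with_accumulation(model, optimizer, loss_fn, batches, accum_steps=4):
    """Accumulate gradients over several micro-batches before each optimizer step."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        # Scale the loss so accumulated gradients average over the micro-batches
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

With `accum_steps=4` and a micro-batch of 8, the optimizer sees gradients equivalent to a batch of 32.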
- GPU Selection:
  - Use `CUDA_VISIBLE_DEVICES` to select specific GPUs
  - Monitor GPU usage with `nvidia-smi`
- Training Speed:
  - Multi-GPU isn't always faster due to overhead
  - Start with a single GPU and scale up if needed
- GPU Memory Usage:

```bash
nvidia-smi -l 1  # Monitor GPU usage every second
```

- Training Progress:
  - Progress bar shows current epoch and batch
  - Loss values are printed every 10 batches
  - Checkpoints are saved every 5 epochs

- WandB Logging:
  - Training metrics are logged to Weights & Biases
  - Includes loss, learning rate, and batch size
  - Access via the WandB dashboard
```
heartwise-ai-deepcoro_clip/
├── config/              # Configuration files
│   ├── clip/            # CLIP training configs
│   └── linear_probing/  # Linear probing configs
├── dataloaders/         # Data loading modules
├── dataset_creation/    # How MHI dataset was built
├── docs/                # Documentation on CLS-Token implementation
├── models/              # Neural network models
├── projects/            # Project implementations
├── runners/             # Training runners
├── scripts/             # Training scripts
├── utils/               # Utility functions
└── tests/               # Unit test pipeline
```

Contributions to the DeepCORO_CLIP repository are welcome! Please follow these steps to contribute:
- Fork the repository
- Create a new branch for your feature or bug fix
- Make your changes and commit them with clear, descriptive messages
- Push your changes to your fork
- Submit a pull request to the main repository
If you find this repository useful, please cite our work:
```bibtex
@article{,
  title={},
  author={},
  journal={},
  year={},
  publisher={}
}
```