# MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models via Reinforcement Learning
## Setup

Run the setup script to configure the environment:

```bash
bash setup.sh
```

This script will:
- Create the conda environment `medvlm-r1`
- Install the necessary dependencies
- Configure the open-r1-multimodal framework
## Demo

Use the Jupyter notebook to quickly try out the model:

```bash
jupyter notebook demo.ipynb
```

The demo includes:
- Model loading
- Medical image VQA examples
- An inference process demonstration
## Example Output

The model generates a structured reasoning process:

```
<think> The image is a magnetic resonance imaging (MRI) scan of a knee joint. The scan shows a chondral abnormality, which is a type of cartilage damage. This is evident from the irregular shape and the presence of a defect in the cartilage. </think> <answer>A</answer>
```

## Dataset

Download the HuatuoGPT-Vision dataset via the Hugging Face CLI:
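The `<think>`/`<answer>` tags can be pulled out of a response with a small regex helper. This is an illustrative sketch of our own (the helper name and parsing rules are not from the repository's code):

```python
import re

def parse_medvlm_output(text: str) -> dict:
    """Extract the reasoning trace and final answer from a
    MedVLM-R1-style response: <think> ... </think> <answer>X</answer>."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
    }

output = ("<think> The image is a magnetic resonance imaging (MRI) scan "
          "of a knee joint. </think> <answer>A</answer>")
parsed = parse_medvlm_output(output)
print(parsed["answer"])  # -> A
```

Falling back to `None` when a tag is missing makes downstream scoring robust to malformed generations.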
```bash
# 1) Install the Hugging Face CLI (if not already installed)
pip install -U "huggingface_hub[cli]"

# 2) (Optional) Log in if the dataset requires authentication
# huggingface-cli login

# 3) Download the dataset to a local directory
# Replace <TARGET_DIR> with your local path, e.g., /data/datasets/PubMedVision
hf download FreedomIntelligence/PubMedVision \
  --repo-type dataset \
  --local-dir <TARGET_DIR> \
  --local-dir-use-symlinks False \
  --include "*"

# After the download, set <DATASET_PATH_ROOT>=<TARGET_DIR> in your scripts
```

The dataset contains:
- MRI, CT, X-ray medical images
- Corresponding visual question-answer pairs
- Multi-modal medical reasoning tasks
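Once downloaded, individual VQA records can be inspected with plain Python. The field names below are purely illustrative mock data (check the actual JSON schema of the downloaded files before relying on any key names):

```python
import json

# A mock sample in the spirit of a multiple-choice medical VQA record;
# the real PubMedVision schema may use different field names.
sample = {
    "image": "images/mri_knee_0001.png",
    "question": "What abnormality is shown in this MRI scan?",
    "options": {"A": "Chondral abnormality", "B": "Meniscal tear",
                "C": "Ligament rupture", "D": "No abnormality"},
    "answer": "A",
}

# Round-trip through JSON exactly as a dataset file would store it
record = json.loads(json.dumps(sample))
print(record["answer"], record["options"][record["answer"]])
```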
## Training

Run the training script:

```bash
bash train_script.sh
```

Note: please update the following placeholders in the script:

- `<DATASET_NAME>`: Dataset name
- `<GPU_NUM>`: Number of GPUs
- `<LOG_PATH>`: Log output path
- `<HF_CACHE_DIR>`: Hugging Face cache directory
- `<WANDB_ENTITY>`: Weights & Biases entity
- `<WANDB_PROJECT>`: Weights & Biases project name
- `<OUTPUT_DIR_ROOT>`: Output directory root path
- `<MODEL_REPO_OR_DIR>`: Model path
- `<DATASET_PATH_ROOT>`: Dataset root path
- `<MASTER_ADDR>`: Master node address
- `<MASTER_PORT>`: Master node port
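A missed placeholder is an easy way to crash a long training launch. A quick sanity check like the one below (our own helper, not part of the repository) flags any unreplaced `<...>` tokens before you run the script:

```python
import re

def find_placeholders(script_text: str) -> list:
    """Return any unreplaced <PLACEHOLDER> tokens left in a script."""
    return re.findall(r"<[A-Z_]+>", script_text)

# Example on an in-memory snippet; in practice read the real file, e.g.:
#   text = open("train_script.sh").read()
text = "torchrun --nproc_per_node=<GPU_NUM> --master_addr=127.0.0.1"
missing = find_placeholders(text)
print(missing)  # -> ['<GPU_NUM>']
```

If the list is non-empty, edit the script before launching.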
## Testing

Run the testing script:

```bash
bash test_script.sh
```

Note: please update the following placeholders in the script:

- `<HF_CACHE_DIR>`: Hugging Face cache directory
- `<CUDA_DEVICES>`: CUDA devices
- `<MODEL_REPO_OR_DIR>`: Model path
- `<DATASET_PATH_ROOT>`: Dataset root path
- `<OUTPUT_DIR>`: Output directory
The testing script supports the following parameters:

- `MODALITY`: Modality type (`MRI`, `CT`, `Ultrasound`, `Xray`, `Dermoscopy`, `Microscopy`, `Fundus`)
- `PROMPT_TYPE`: Prompt type (`simple` or `complex`)
- `BSZ`: Batch size
- `MAX_NEW_TOKENS`: Maximum number of new tokens to generate
- `DO_SAMPLE`: Whether to sample during decoding
- `TEMPERATURE`: Sampling temperature
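For multiple-choice VQA, evaluation reduces to comparing the letter inside the `<answer>` tag with the ground truth. The sketch below illustrates that idea only; the repository's actual evaluation code lives in `src/eval/test_qwen2vl_med.py`:

```python
import re

def extract_choice(response: str):
    """Pull the choice letter out of an <answer>X</answer> tag,
    returning None if the tag is missing or malformed."""
    m = re.search(r"<answer>\s*([A-D])\s*</answer>", response)
    return m.group(1) if m else None

predictions = ["<think>...</think> <answer>A</answer>",
               "<think>...</think> <answer>C</answer>",
               "no valid tag here"]
ground_truth = ["A", "B", "C"]

correct = sum(extract_choice(p) == g for p, g in zip(predictions, ground_truth))
accuracy = correct / len(ground_truth)
print(f"accuracy = {accuracy:.2f}")  # -> accuracy = 0.33
```

Treating a missing tag as incorrect (rather than skipping it) penalizes outputs that fail to follow the required format.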
## Project Structure

```
r1-v-med/
├── demo.ipynb                        # Demo notebook
├── setup.sh                          # Setup script
├── train_script.sh                   # Training script
├── test_script.sh                    # Testing script
├── MRI_CT_XRAY_300each_dataset.json  # Test dataset
├── images/                           # Example images
│   ├── successful_cases/             # Successful cases
│   └── failure_cases/                # Failure cases
└── src/
    ├── eval/                         # Evaluation code
    │   └── test_qwen2vl_med.py       # Testing script
    ├── distill_r1/                   # R1 distillation code
    └── open-r1-multimodal/           # Underlying framework
        └── src/open_r1/
            ├── grpo.py               # GRPO training code
            └── trainer/
                └── grpo_trainer.py   # GRPO trainer
```

## Citation

If you find our work helpful, please cite:
```bibtex
@article{pan2025medvlm,
  title={MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning},
  author={Pan, Jiazhen and Liu, Che and Wu, Junde and Liu, Fenglin and Zhu, Jiayuan and Li, Hongwei Bran and Chen, Chen and Ouyang, Cheng and Rueckert, Daniel},
  journal={arXiv preprint arXiv:2502.19634},
  year={2025}
}
```

## Acknowledgements

Our code is based on the following open-source projects:
- open-r1-multimodal: https://github.com/EvolvingLMMs-Lab/
- R1-V: https://github.com/StarsfieldAI/R1-V
Thanks to these excellent open-source projects for providing a solid foundation for our research.
## License

This project is licensed under the Apache 2.0 License.