[Model] Add Gemma3 GGUF multimodal support#27772
vllm-bot merged 53 commits into vllm-project:main
Conversation
Code Review
This pull request adds support for Gemma3 GGUF multimodal models, which is a significant feature enhancement. The implementation correctly handles the detection of multimodal GGUF files and extends the model loader to accommodate vision tower weights from separate `mmproj.gguf` files. The changes are well-isolated to minimize impact on existing models.

However, there are several areas for improvement. A critical issue was found where a function is missing a return statement, which could lead to runtime errors. Additionally, there are multiple instances of hardcoded configurations for the vision tower and image processor, as well as code duplication for model detection logic. These reduce maintainability and make the implementation brittle to future changes. I've also identified a bug in `runai_safetensors_weights_iterator` that will cause an incorrect progress bar. Addressing these points will improve the robustness and maintainability of this new feature.
This pull request has merge conflicts that must be resolved before it can be merged.
Isotr0py left a comment
I think this PR introduces too much gemma3-mm GGUF-specific hardcoded code.
Besides this, I'm a bit confused why we need to use GGUFQuantMethod for a totally unquantized mm_proj checkpoint as well.
Isotr0py left a comment
Unfortunately, I think this PR still suffers from poor design, because there is too much gemma3-specific code outside the model implementation...
This commit enhances GGUF model loading with two key improvements:

1. Dynamic `vocab_size` extraction and override:
- Added `extract_vocab_size_from_gguf()` to read `vocab_size` from the `token_embd.weight` tensor shape
- Updated `maybe_patch_hf_config_from_gguf()` to automatically override `vocab_size` from the GGUF file for all GGUF models
- Enables loading models with extended vocabularies (e.g., Unsloth) without manual config editing

2. Improved automatic weight mapping for multimodal models:
- Enhanced `find_hf_name_in_tensor_map()` to handle naming-convention differences between HF and gguf-py:
  - Strips the `language_model.` prefix for multimodal models
  - Converts the `_weight` suffix to `.weight` (Gemma3 compatibility)
- Removed hardcoded Gemma3-specific weight mappings in favor of automatic mapping via gguf-py's `vision_name_map`
- Removed redundant prefix fallback logic

This will:
- add automatic compatibility with models using extended vocabularies
- make the code more maintainable, with no model-specific hardcoded mappings
- introduce robust handling of HF/gguf-py naming-convention mismatches
- work for all multimodal GGUF models, not just Gemma3

Technical details:
- GGUF embedding tensor format: `[hidden_size, vocab_size]`
- `vocab_size` is extracted from `shape[1]` as the source of truth
- Uses `get_text_config()` to handle both regular and multimodal configs
- Graceful fallback if `vocab_size` extraction fails

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
- Change the gemma3-transformers test to use `model_impl='auto'` instead of `'transformers'` to avoid the missing `generate_attention_masks()` method in `TransformersMultiModalForCausalLM`
- Change the quantized-models test to use `dtype='bfloat16'` instead of `'half'`, since gemma3_text models don't support float16 due to numerical instability

These are test-configuration fixes for pre-existing bugs unrelated to the GGUF changes.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Remove `extract_vocab_size_from_gguf()` and the `vocab_size` override logic, as transformers now handles this internally via `modeling_gguf_pytorch_utils`. Also fixed `tests/models/multimodal/generation/test_common.py` to use the HuggingFace implementation for Gemma3 testing.

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Prevent an `AttributeError` when calling `generate_attention_masks()` on models that don't implement this method (e.g., `TransformersMultiModalForCausalLM`). This fixes the errors in the multi-modal-models CI test.

The condition now checks:
1. the `uses_custom_attention_masks` flag is set
2. multimodal features are present
3. the model has a `generate_attention_masks` method

This ensures the method is only called on models that support custom attention masks (e.g., `Gemma3ForConditionalGeneration`).

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Hey @Isotr0py - all tests are passing, with 2 exceptions not related to Gemma3 or the PR work:

- One test is failing with an access error for the test image to be used.
- One test is failing for EAGLE-LLAMA3.1-Instruct-8B (before even getting to any Gemma3 test), due to GPU OOM errors.

Can we proceed with the merge?
Yeah, they are not related, let's merge.
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Due to the latest changes from upstream, gemma3 is failing to compile on HPU (vllm-project/vllm#27772, vllm-project/vllm#28842):
- replace `unfold` with `view`/`reshape`
- replace text embedding to avoid dynamic shapes
- remove the `merge_multimodal` replacement, since the `masked_scatter` issue is fixed
- re-enable the gemma3 model test

Signed-off-by: Jimin Ha <jimin.ha@intel.com>
[Model] Add Gemma3 GGUF multimodal support
Summary
This PR enables full multimodal inference support for Gemma3 models in GGUF format, allowing users to run quantized Gemma3 multimodal models with both text-only and image+text prompts. The implementation adds automatic detection and loading of `mmproj.gguf` files for vision tower weights, while maintaining complete isolation to ensure zero impact on existing Gemma3 HuggingFace models or other model architectures.

Resolves: #22497
New Functionality
1. Automatic GGUF Multimodal Detection
The model configuration now automatically detects when a Gemma3 GGUF model has an accompanying `mmproj.gguf` file and switches to the multimodal architecture:

- `vllm/config/model.py`: `_detect_gguf_multimodal_gemma3()` checks for `mmproj-*.gguf` in the model directory
- Sets `model_cls = "Gemma3ForConditionalGeneration"` when a multimodal projector is found
- Falls back to `Gemma3ForCausalLM` for text-only GGUF models

Example usage:
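The filesystem check that drives the switch can be sketched as a standalone helper, with a small usage demo at the bottom (hypothetical helper names, not vLLM's exact implementation):

```python
import tempfile
from pathlib import Path


def detect_gguf_multimodal(model_dir: str) -> bool:
    """True if an mmproj-*.gguf vision projector sits next to the model
    (sketch of the _detect_gguf_multimodal_gemma3() behavior)."""
    return any(Path(model_dir).glob("mmproj-*.gguf"))


def select_model_cls(model_dir: str) -> str:
    """Pick the architecture as described above."""
    if detect_gguf_multimodal(model_dir):
        return "Gemma3ForConditionalGeneration"
    return "Gemma3ForCausalLM"


# Usage demo with a throwaway directory:
with tempfile.TemporaryDirectory() as d:
    assert select_model_cls(d) == "Gemma3ForCausalLM"  # no projector: text-only
    (Path(d) / "mmproj-F16.gguf").touch()              # drop in a projector file
    assert select_model_cls(d) == "Gemma3ForConditionalGeneration"
```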
2. GGUF Vision Tower Weight Loading
Enhanced the GGUF loader to handle multimodal weights from separate `mmproj.gguf` files:

- Files: `vllm/model_executor/model_loader/gguf_loader.py`, `weight_utils.py`
- Tensor name mapping:
  - `v.*.attn.{q,k,v,out}` → vision attention weights
  - `v.*.mlp.{fc1,fc2}` → vision FFN weights
  - `v.post_ln.{weight,bias}` → post-layernorm weights
  - `mm.0.{weight,bias}` → multimodal projector weights
- Vision weights are kept in `float16`/`bfloat16` (GGUF spec)
- Loaded tensors are marked with `is_gguf_weight = True`

3. GGUF Processor Loading
Processors now correctly load from tokenizer path for GGUF models:
- `vllm/transformers_utils/processor.py`: handles `.gguf` file paths

4. V1 Engine Multimodal Support
Added multimodal embedding gathering in V1 GPU model runner:
- `vllm/v1/worker/gpu_model_runner.py`: `_gather_mm_embeddings()` extracts vision embeddings after encoder execution
- Calls `get_multimodal_embeddings()` for proper merging

Isolation & Safety Guarantees
All changes are strictly scoped to Gemma3 GGUF multimodal models only. No other models or formats are affected.
Isolation Verification
All GGUF-specific branches sit behind explicit guards:
- `if quantization == "gguf"`
- `if quantization in ("gguf", "inc")`
- `if self.model.endswith('.gguf')`
- `if self._is_gguf`

Non-Regression Tests
Comprehensive testing confirms zero impact on other model types:
No changes to:
Testing
Test Environment
- `google/gemma-3-1b-it` (HF text-only)
- `google/gemma-3-4b-it` (HF text-only & multimodal)
- `google/gemma-3-4b-it-q4_0-gguf` (GGUF text-only & multimodal)

Backward Compatibility
100% backward compatible - No breaking changes to existing functionality:
Checklist
- [Model] Add Gemma3 GGUF multimodal support

Additional Context
This PR builds upon the recently merged text-only GGUF support for Gemma3 and extends it to support multimodal inference. The implementation carefully preserves all upstream bugfixes, including the recent newline token ordering fix (#27538).
Tested on: vLLM main branch (commit `b5d90f740`, Oct 29 2025)

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>