DRAFT: Mistral Large 3 Extended Blackwell Support #29884
Purpose
Improve performance and support of Mistral Large 3 on Blackwell.
Details
benchmarks/kernels/benchmark_moe.py
vllm/benchmarks/throughput.py
Best Performance Usage
FP8 Checkpoint on DGX B200 (8 devices)
The FP8 model will fit on a single node.
At low concurrencies, deploy with TP8:
At higher concurrencies (128 concurrent requests and above), deploy with DP8 and expert parallelism:
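The exact launch commands are not reproduced in this description, so the following is only a minimal sketch of the TP8 versus DP8 + expert-parallel choice described above. It assumes the standard `vllm serve` flags `--tensor-parallel-size`, `--data-parallel-size`, and `--enable-expert-parallel`. The helper name `build_serve_command` and the checkpoint path are hypothetical, and any additional flags the authors used are not shown.

```python
import shlex
from typing import List

# Threshold from the text above: 128 concurrent requests and above -> DP8 + EP.
HIGH_CONCURRENCY_THRESHOLD = 128


def build_serve_command(model: str, max_concurrency: int, num_gpus: int = 8) -> List[str]:
    """Build a `vllm serve` argument list for the expected concurrency.

    Hypothetical helper: it only encodes the parallelism choice described in
    the text and omits any other flags the deployment may need.
    """
    cmd = ["vllm", "serve", model]
    if max_concurrency >= HIGH_CONCURRENCY_THRESHOLD:
        # High concurrency: data parallelism across all GPUs plus expert parallelism.
        cmd += ["--data-parallel-size", str(num_gpus), "--enable-expert-parallel"]
    else:
        # Low concurrency: tensor parallelism across all GPUs.
        cmd += ["--tensor-parallel-size", str(num_gpus)]
    return cmd


if __name__ == "__main__":
    # Placeholder checkpoint path; substitute the real FP8 checkpoint location.
    model = "path/to/mistral-large-3-fp8"
    print(shlex.join(build_serve_command(model, max_concurrency=32)))
    print(shlex.join(build_serve_command(model, max_concurrency=256)))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without GPUs. The resulting list can be passed to `subprocess.run` once the flags have been checked against the installed vLLM version.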
NVFP4
For NVFP4 checkpoints, add the following to leverage the optimized kernels from FlashInfer:
With a version of FlashInfer >0.5.3:
A bug in the auto-tuner fixed recently (flashinfer-ai/flashinfer#2140) allows using flashinfer-cudnn.
GB200 P/D Disaggregated Dynamo Deployment
There are two options to set up a Dynamo P/D disaggregated deployment of this model. The first one is available immediately and relies on the processing pipeline of Dynamo. The second is pending a PR on Dynamo to enable delegating pre-processing to the vLLM backend.
For compatibility with top-of-tree (ToT) vLLM, you might need to include some changes that are not currently in upstream Dynamo:
With Dynamo request processing
Start by copying config.json from Ministral to your model directory.
With delegated request processing
Pending some changes (TODO LINK) in Dynamo, you will be able to skip the file-copying step above.
Next steps
We have identified further optimizations which will be part of other PRs:
Contributors
@dbari, @DanBlanaru, @evezhier, @hypdeb