A lightweight vLLM-style inference engine built from scratch, designed to be minimal, readable, and easy to customize.
- Pure Python implementation with clean structure
- Supports tensor parallelism, prefix caching, and CUDA Graph capture
- Compatible with HuggingFace configs and weights
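To make the prefix-caching feature concrete, here is an illustrative toy sketch of the general idea behind vLLM-style prefix caching: prompts that share a common prefix of token blocks reuse the same cached KV blocks instead of recomputing them. All names and the block size here are hypothetical and do not reflect Mini-vLLM's actual internals.

```python
# Toy prefix cache keyed by (parent block, token block) pairs.
# Illustrative only; not Mini-vLLM's real data structures.

BLOCK_SIZE = 4  # tokens per cache block (real engines use larger blocks)

class ToyPrefixCache:
    def __init__(self):
        self._blocks = {}   # (parent_key, block_tokens) -> block id
        self._next_id = 0

    def match_or_insert(self, token_ids):
        """Return (cached_block_ids, new_block_ids) for a prompt.

        Only full blocks participate; a trailing partial block is ignored,
        mirroring how block-granular caches work.
        """
        cached, new = [], []
        parent = None
        full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for i in range(0, full_len, BLOCK_SIZE):
            key = (parent, tuple(token_ids[i:i + BLOCK_SIZE]))
            if key in self._blocks:
                cached.append(self._blocks[key])  # prefix hit: reuse block
            else:
                self._blocks[key] = self._next_id  # miss: allocate new block
                new.append(self._next_id)
                self._next_id += 1
            parent = key  # chain keys so identical blocks at different depths differ
        return cached, new
```

Two prompts that share their first eight tokens will reuse the first two blocks; only the divergent tail allocates new ones.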
For local development, install in editable mode:
```bash
pip install -e .
```

Example for Qwen3-0.6B:
```bash
huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
```

Run example.py directly, or use this minimal snippet:
```python
from minivllm import LLM, SamplingParams

llm = LLM("~/huggingface/Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Mini-vLLM."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```

See bench.py for the benchmark script. Adjust its parameters based on your hardware and model size.
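The headline number such benchmarks report is throughput in generated tokens per second. The sketch below shows how that metric might be computed; the timed sleep is a stand-in for a real `llm.generate` call so the example is self-contained, and it is not the actual logic of bench.py.

```python
import time

def throughput(num_generated_tokens: int, elapsed_seconds: float) -> float:
    """Tokens per second: total generated tokens divided by wall-clock time."""
    return num_generated_tokens / elapsed_seconds

# Simulated run: pretend 4 prompts each produced 256 tokens.
start = time.perf_counter()
time.sleep(0.01)  # placeholder for llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = 4 * 256
print(f"{throughput(total_tokens, elapsed):.1f} tok/s")
```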
- Default examples use the Qwen3 family; add or replace models in minivllm/models
- Ensure CUDA and dependencies are properly installed before running