Mini-vLLM

A lightweight vLLM-style inference engine built from scratch, designed to be minimal, readable, and easy to customize.

Key Features

  • Pure Python implementation with clean structure
  • Supports tensor parallelism, prefix caching, and CUDA Graph capture
  • Compatible with HuggingFace configs and weights

Installation

For local development, install in editable mode:

pip install -e .

Model Download

Example for Qwen3-0.6B:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False

Quick Start

Run example.py directly, or use this minimal snippet:

from minivllm import LLM, SamplingParams

llm = LLM("~/huggingface/Qwen3-0.6B", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Mini-vLLM."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])

Benchmark

See bench.py for the benchmark script. Adjust parameters based on your hardware and model size.
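As a rough illustration of what such a benchmark measures, the sketch below times a batch of generations and reports tokens per second. It is only an assumption-laden example: it reuses the LLM/SamplingParams API from Quick Start and guesses that each output dict carries a token_ids field; bench.py in the repo remains the authoritative script.

```python
import time

def tokens_per_second(total_tokens: int, elapsed_s: float) -> float:
    # Throughput: generated tokens divided by wall-clock seconds.
    return total_tokens / elapsed_s

def run_benchmark(model_path: str, num_prompts: int = 8, max_tokens: int = 128) -> None:
    # Hypothetical sketch; field names and constructor arguments are
    # assumed from the Quick Start snippet, not verified against bench.py.
    from minivllm import LLM, SamplingParams

    llm = LLM(model_path, enforce_eager=True, tensor_parallel_size=1)
    params = SamplingParams(temperature=0.6, max_tokens=max_tokens)
    prompts = ["Benchmark prompt."] * num_prompts

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start

    total = sum(len(o["token_ids"]) for o in outputs)  # "token_ids" is assumed
    print(f"{tokens_per_second(total, elapsed):.1f} tok/s")
```

Increasing num_prompts and max_tokens stresses batching and KV-cache capacity, which is where tensor parallelism and CUDA Graph capture matter most.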

Notes

  • Default examples use the Qwen3 family; add or replace models in minivllm/models
  • Ensure CUDA and dependencies are properly installed before running
