Wenyueh/MinivLLM

| English | 简体中文 |

miniVLLM

A custom implementation of the vLLM inference engine with attention-mechanism benchmarks, based on Nano-vLLM but with self-contained paged attention and flash attention implementations.

Benchmarks are provided for flash attention in the prefilling phase and for paged attention in the decoding phase.

New to vLLM? Check out HowToApproachvLLM.md for a step-by-step implementation guide covering layers, models, paged attention, CUDA graphs, and scheduling.

Quickstart

# Install the uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Sync dependencies
uv sync

# Run the main inference engine
uv run python main.py

# Run the prefilling benchmark
uv run python benchmark_prefilling.py

# Run the decoding benchmark
uv run python benchmark_decoding.py

To run in a multi-GPU setting, change world_size to n > 1 in the config in main.py.
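As a hypothetical sketch of that change (the field names below are assumed for illustration, not copied from this repo's main.py):

```python
from dataclasses import dataclass

# Hypothetical sketch — Config and its fields are assumptions, not the
# repo's actual definitions in main.py.
@dataclass
class Config:
    model: str = "qwen3-small"
    world_size: int = 1  # number of GPUs; n > 1 enables the multi-GPU path

cfg = Config(world_size=2)  # run with 2 GPUs
```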

What Each Script Does

uv run python main.py

This is the main inference engine demo. It demonstrates the complete LLM inference pipeline using a custom engine implementation:

  • Creates a small version of Qwen3 with random initialization
  • Creates 60 chat prompts (2 base prompts repeated 30 times each)
  • Processes them through the custom LLM engine with batch processing
  • Uses paged attention and KV cache management for efficient inference
  • Generates up to 256 tokens per prompt with temperature sampling

This showcases how the custom vLLM implementation handles batched text generation with memory-efficient attention.
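The temperature-sampling step mentioned above can be illustrated in plain Python. This is a didactic sketch, not the repo's sampler; a real engine does this on GPU logit tensors:

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    # Scale logits by 1/temperature, convert to a softmax distribution,
    # then draw one token id. Lower temperature sharpens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

At very low temperature this approaches greedy (argmax) decoding; at temperature 1.0 it samples from the model's raw distribution.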

uv run python benchmark_prefilling.py

This is the prefilling phase comparison. It compares three attention implementations during the prefilling phase (processing input prompts):

  1. PyTorch Standard (O(N²) memory): Traditional attention that materializes full attention matrix
  2. Naive Triton (O(N²) memory): GPU kernel that also uses O(N²) memory, limited by shared memory constraints (≤128 tokens)
  3. Flash Attention (O(N) memory): Memory-efficient online softmax algorithm that processes attention in blocks
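The online-softmax idea behind implementation 3 can be sketched for a single query row in plain Python. This is an educational sketch of the block-wise algorithm, not the repo's Triton kernel:

```python
import math

def online_softmax_weighted_sum(scores, values, block_size=2):
    # One pass over blocks, keeping only a running max (m), running
    # normalizer (l), and running weighted sum (acc) — the trick that
    # gives flash attention O(N) memory instead of materializing the
    # full attention row.
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        # Rescale previous partial results to the new running max.
        l = l * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l
```

The result matches the direct softmax-weighted sum, but no O(N²) score matrix (or even a full O(N) score row, in the blocked kernel) is ever stored.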

uv run python benchmark_decoding.py

This is the decoding phase comparison. It compares three implementations during the decoding phase (generating output tokens one at a time):

  1. Naive PyTorch: Simple loop-based implementation using paged KV cache
  2. Optimized PyTorch: Vectorized implementation with batch gathering and masking
  3. Triton Kernel: Custom GPU kernel optimized for paged attention decode
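The paged gather that all three implementations share can be sketched for one query vector with tiny Python lists. This is a didactic sketch of paged-attention decode, not the repo's tensor code:

```python
import math

def paged_decode_attention(q, block_table, kv_blocks, seq_len, block_size=2):
    # Decode-time attention for a single new query token: the KV cache
    # lives in fixed-size physical blocks, and the block table maps each
    # logical position to (physical block, offset).
    scores, vals = [], []
    for pos in range(seq_len):
        blk = block_table[pos // block_size]
        off = pos % block_size
        k, v = kv_blocks[blk][off]
        scores.append(sum(qi * ki for qi, ki in zip(q, k)))
        vals.append(v)
    # Standard softmax over the gathered scores, then weight the values.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(vals[0])
    return [sum(wi * vi[j] for wi, vi in zip(w, vals)) / z
            for j in range(dim)]
```

The naive PyTorch version loops like this per sequence; the optimized version batches the gather; the Triton kernel fuses gather, dot products, and softmax on the GPU.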

Project Structure

myvllm/
├── src/
│   └── myvllm/                   # Core vLLM implementation
│       ├── models/               # Model implementations
│       ├── engine/               # LLM engine: sequence definition for input prompts,
│       │                         #   block management for the GPU KV cache, iteration-based
│       │                         #   scheduling of sequences, a runner that executes
│       │                         #   prefilling and decoding, and the generation API
│       ├── layers/               # Building-block layers used by models/
│       ├── utils/                # Context utilities
│       └── sampling_parameters.py
├── main.py                       # Full inference demo
├── benchmark_prefilling.py       # Prefilling attention comparison
└── benchmark_decoding.py         # Decoding attention comparison
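The iteration-based scheduling mentioned for engine/ can be sketched in a few lines. This is a hypothetical simplification for intuition only (dict-based sequences and a single token budget are assumptions, not the repo's scheduler):

```python
from collections import deque

def schedule_step(waiting, running, max_batch_tokens=8):
    # Each engine step either admits waiting sequences for prefill
    # (as long as their prompt lengths fit the token budget) or, if
    # nothing was admitted, advances every running sequence by one
    # decode token.
    batch, budget = [], max_batch_tokens
    while waiting and waiting[0]["len"] <= budget:
        seq = waiting.popleft()
        budget -= seq["len"]
        batch.append(seq)
        running.append(seq)
    if batch:
        return "prefill", batch
    return "decode", list(running)
```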

Requirements

  • Python ≥3.11, <3.12
  • CUDA-capable GPU
  • Dependencies: transformers, torch, xxhash (managed by uv)

