High-Performance GPU Kernels for Inference

FlashInfer is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

Why FlashInfer?

State-of-the-art Performance: Optimized kernels for prefill, decode, and mixed batching scenarios
Multiple Backends: Automatically selects the best backend for your hardware and workload
Modern Architecture Support: Support for SM75 (Turing) and later (through Blackwell)
Low-Precision Compute: FP8 and FP4 quantization for attention, GEMM, and MoE operations
Production-Ready: CUDAGraph and torch.compile compatible for low-latency serving

Core Features

Attention Kernels

Paged and Ragged KV-Cache: Efficient memory management for dynamic batch serving
Decode, Prefill, and Append: Optimized kernels for all attention phases
MLA Attention: Native support for DeepSeek's Multi-Latent Attention
Cascade Attention: Memory-efficient hierarchical KV-Cache for shared prefixes
Sparse Attention: Block-sparse and variable block-sparse patterns
POD-Attention: Fused prefill+decode for mixed batching

GEMM & Linear Operations

FP8 GEMM: Per-tensor and groupwise scaling
FP4 GEMM: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
Grouped GEMM: Efficient batched matrix operations for LoRA and multi-expert routing

Mixture of Experts (MoE)

Fused MoE Kernels
Multiple Routing Methods: DeepSeek-V3, Llama-4, and standard top-k routing
Quantized MoE: FP8 and FP4 expert weights with block-wise scaling

Sampling & Decoding

Sorting-Free Sampling: Efficient Top-K, Top-P, and Min-P without sorting
Speculative Decoding: Chain speculative sampling support

Communication

AllReduce: Custom implementations
Multi-Node NVLink: MNNVL support for multi-node inference
NVSHMEM Integration: For distributed memory operations

Other Operators

RoPE: LLaMA-style rotary position embeddings (including LLaMA 3.1)
Normalization: RMSNorm, LayerNorm, Gemma-style fused operations
Activations: SiLU, GELU with fused gating

GPU Support

Architecture	Compute Capability	Example GPUs
Turing	SM 7.5	T4, RTX 20 series
Ampere	SM 8.0, 8.6	A100, A10, RTX 30 series
Ada Lovelace	SM 8.9	L4, L40, RTX 40 series
Hopper	SM 9.0	H100, H200
Blackwell	SM 10.0, 10.3	B200, B300
Blackwell	SM 12.0, 12.1	RTX 50 series, DGX Spark, Jetson Thor

News

Latest:

Notable updates:

[2025-10-08] Blackwell support added in v0.4.0
[2025-03-10] Blog Post Sorting-Free GPU Kernels for LLM Sampling, which explains the design of sampling kernels in FlashInfer.

Getting Started

Installation

Quickstart:

pip install flashinfer-python

Package Options:

flashinfer-python: Core package that compiles/downloads kernels on first use
flashinfer-cubin: Pre-compiled kernel binaries for all supported GPU architectures
flashinfer-jit-cache: Pre-built kernel cache for specific CUDA versions

For faster initialization and offline usage, install the optional packages to have most kernels pre-compiled:

pip install flashinfer-python flashinfer-cubin # JIT cache (replace cu129 with your CUDA version) pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129

Verify Installation

flashinfer show-config

Basic Usage

import torch import flashinfer # Single decode attention q = torch.randn(32, 128, device="cuda", dtype=torch.float16) # [num_qo_heads, head_dim] k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16) # [kv_len, num_kv_heads, head_dim] v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16) output = flashinfer.single_decode_with_kv_cache(q, k, v)

See documentation for comprehensive API reference and tutorials.

Install from Source

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive cd flashinfer python -m pip install -v .

For development, install in editable mode:

python -m pip install --no-build-isolation -e . -v

Build optional packages:

# flashinfer-cubin cd flashinfer-cubin python -m build --no-isolation --wheel python -m pip install dist/*.whl

# flashinfer-jit-cache (customize for your target GPUs) export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f" cd flashinfer-jit-cache python -m build --no-isolation --wheel python -m pip install dist/*.whl

For more details, see the Install from Source documentation.

Nightly Builds

pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps pip install flashinfer-python # Install dependencies from PyPI pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/ # JIT cache (replace cu129 with your CUDA version) pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129

CLI Tools

FlashInfer provides several CLI commands for configuration, module management, and development:

# Verify installation and view configuration flashinfer show-config # List and inspect modules flashinfer list-modules flashinfer module-status # Manage artifacts and cache flashinfer download-cubin flashinfer clear-cache # For developers: generate compile_commands.json for IDE integration flashinfer export-compile-commands [output_path]

For complete documentation, see the CLI reference.

API Logging

FlashInfer provides comprehensive API logging for debugging. Enable it using environment variables:

# Enable logging (levels: 0=off (default), 1=basic, 3=detailed, 5=statistics) export FLASHINFER_LOGLEVEL=3 # Set log destination (stdout (default), stderr, or file path) export FLASHINFER_LOGDEST=stdout

For detailed information about logging levels, configuration, and advanced features, see Logging in our documentation.

Custom Attention Variants

Users can customize their own attention variants with additional parameters. For more details, refer to our JIT examples.

CUDA Support

Supported CUDA Versions: 12.6, 12.8, 13.0, 13.1

Note: FlashInfer strives to follow PyTorch's supported CUDA versions plus the latest CUDA release.

Adoption

FlashInfer powers inference in:

Acknowledgement

FlashInfer is inspired by FlashAttention, vLLM, stream-K, CUTLASS, and AITemplate.

Citation

If you find FlashInfer helpful in your project or research, please consider citing our paper:

@article{ye2025flashinfer, title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving}, author = {  Ye, Zihao and  Chen, Lequn and  Lai, Ruihang and  Lin, Wuwei and  Zhang, Yineng and  Wang, Stephanie and  Chen, Tianqi and  Kasikci, Baris and  Grover, Vinod and  Krishnamurthy, Arvind and  Ceze, Luis  }, journal = {arXiv preprint arXiv:2501.01005}, year = {2025}, url = {https://arxiv.org/abs/2501.01005} }

Name		Name	Last commit message	Last commit date
Latest commit History 1,899 Commits
.claude/skills		.claude/skills
.devcontainer		.devcontainer
.github		.github
3rdparty		3rdparty
benchmarks		benchmarks
ci		ci
csrc		csrc
docker		docker
docs		docs
flashinfer-cubin		flashinfer-cubin
flashinfer-jit-cache		flashinfer-jit-cache
flashinfer		flashinfer
include/flashinfer		include/flashinfer
licenses		licenses
profiler		profiler
scripts		scripts
tests		tests
.clang-format		.clang-format
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
Jenkinsfile		Jenkinsfile
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
build_backend.py		build_backend.py
build_utils.py		build_utils.py
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements.txt		requirements.txt
version.txt		version.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

High-Performance GPU Kernels for Inference

Why FlashInfer?

Core Features

Attention Kernels

GEMM & Linear Operations

Mixture of Experts (MoE)

Sampling & Decoding

Communication

Other Operators

GPU Support

News

Getting Started

Installation

Verify Installation

Basic Usage

Install from Source

Nightly Builds

CLI Tools

API Logging

Custom Attention Variants

CUDA Support

Adoption

Acknowledgement

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

High-Performance GPU Kernels for Inference

Why FlashInfer?

Core Features

Attention Kernels

GEMM & Linear Operations

Mixture of Experts (MoE)

Sampling & Decoding

Communication

Other Operators

GPU Support

News

Getting Started

Installation

Verify Installation

Basic Usage

Install from Source

Nightly Builds

CLI Tools

API Logging

Custom Attention Variants

CUDA Support

Adoption

Acknowledgement

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages