
Conversation


@sfc-gh-goliaro sfc-gh-goliaro commented Nov 30, 2025

📌 Description

Summary by CodeRabbit

Release Notes

  • New Features

    • Added sparse Mixture of Experts (MoE) routing and execution for efficient token routing to expert models.
    • Added tensor parallelism support for distributed LLM inference with optional custom communication optimization.
    • Added core linear operations including projections, matrix multiplications, and embeddings.
    • Added end-to-end LLM inference examples with optional FP8 quantization support.
    • Added tensor-parallel inference examples with distributed model inference patterns.
  • Documentation

    • Added comprehensive examples demonstrating LLM inference workflows and model loading.
    • Added documentation for tensor-parallel and MoE-based inference configurations and usage patterns.


🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes


coderabbitai bot commented Nov 30, 2025

Walkthrough

Introduces new modules for sparse Mixture of Experts (MoE), tensor parallelism, and linear operations. Adds end-to-end LLM inference examples for standard and distributed tensor-parallel configurations. Expands public API surface with initialization routines, communication primitives, and layer implementations. Includes FP8 dequantization and weight-loading utilities.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Version Control Configuration**<br>`.gitignore` | Adds an ignore pattern for wheel distribution files (`*.whl`). |
| **Documentation**<br>`examples/README.md` | Introduces comprehensive end-to-end FlashInfer usage examples covering linear operations, sparse MoE, tensor parallelism, and LLM inference (standard and distributed). Includes architecture diagrams, supported models, CLI options, and detailed usage patterns. |
| **Example Scripts**<br>`examples/llm_inference.py`, `examples/llm_inference_tp.py` | Provides two complete LLM inference pipelines: standard single-GPU inference (`llm_inference.py`) and tensor-parallel distributed inference (`llm_inference_tp.py`). Both implement embedding, linear layers, attention, MLP/MoE routing, and generation with sampling. They include FP8 dequantization, model loading from HuggingFace, and support for the Llama and Qwen model families. |
| **Core Module: Sparse MoE**<br>`flashinfer/sparse_moe.py` | Introduces a `sparse_moe_forward()` function for top-k expert routing and aggregation, and a `SparseMoeBlock` module encapsulating per-expert MLP computation with optional biases and normalization. Includes router gating, token-to-expert mapping, and output accumulation. |
| **Core Module: Tensor Parallelism**<br>`flashinfer/tensor_parallel.py` | Establishes Megatron-style tensor parallelism infrastructure with initialization (`init_tensor_parallel`), state accessors, and communication primitives (`all_reduce`, `all_gather`, `reduce_scatter`) supporting FlashInfer's custom all-reduce with an NCCL fallback. Provides column/row-parallel linear layers, a vocabulary-parallel embedding, tensor-parallel MLP and sparse MoE blocks, and autograd-compatible region wrappers for communication. |
| **Core Module: Linear Operations**<br>`flashinfer/linear.py` | Adds PyTorch-backed linear algebra functions: `linear`, `linear_with_bias`, `bmm`, `matmul`, and `embedding`, each supporting an optional in-place output tensor (see the sketch after this table). Includes `torch.compile` compatibility hooks and type hints. |
| **Public API Exports**<br>`flashinfer/__init__.py` | Expands the public surface with sparse MoE (`sparse_moe_forward`, `SparseMoeBlock`), tensor parallelism initialization and communication (`init_tensor_parallel`, `all_reduce`, `all_gather`, `reduce_scatter`, accessors), tensor-parallel layer classes (`ColumnParallelLinear`, `RowParallelLinear`, `MergedColumnParallelLinear`, `VocabParallelEmbedding`, `TensorParallelMLP`, `TensorParallelSparseMoeBlock`), and linear operations (`linear`, `linear_with_bias`, `bmm`, `matmul`, `embedding`). Includes backward-compatible aliases for the communication functions. |
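The table does not pin down exact signatures, but the optional in-place output pattern described for `flashinfer/linear.py` typically looks like the following. This is a minimal sketch in plain PyTorch, assuming a `linear(x, weight, out=None)` convention with `weight` shaped `[out_features, in_features]`; the actual FlashInfer functions may differ.

```python
from typing import Optional

import torch


def linear(x: torch.Tensor, weight: torch.Tensor,
           out: Optional[torch.Tensor] = None) -> torch.Tensor:
    # y = x @ weight.T, optionally written into a preallocated buffer.
    if out is None:
        return torch.matmul(x, weight.t())
    torch.matmul(x, weight.t(), out=out)
    return out


x = torch.randn(4, 512)
w = torch.randn(1024, 512)       # [out_features, in_features]
buf = torch.empty(4, 1024)
y = linear(x, w, out=buf)        # reuses buf instead of allocating a new tensor
```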

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Token as Token Sequence
    participant Router as Router Gate
    participant TopK as Top-K Selection
    participant Expert as Experts Pool
    participant Agg as Output Aggregator
    Token->>Router: Flatten & compute logits
    Router->>TopK: Apply softmax, select top-k
    TopK->>TopK: Normalize probabilities (optional)
    TopK->>Expert: Route tokens to selected experts
    loop Per Token, Per Selected Expert
        Expert->>Expert: MLP computation (gate/up/down)
    end
    Expert->>Agg: Expert outputs + routing weights
    Agg->>Agg: Weighted aggregation via index_add_
    Agg->>Token: Output reshaped to [batch, seq_len, hidden_size]
```
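As a rough illustration of the routing flow in the diagram above, here is a minimal sketch of top-k expert routing and weighted aggregation in plain PyTorch. The function name, the `norm_topk_prob` flag, and the expert list are illustrative assumptions, not the actual `flashinfer.sparse_moe` implementation.

```python
import torch
import torch.nn.functional as F


def sparse_moe_forward_sketch(hidden, gate_weight, experts, top_k=2,
                              norm_topk_prob=True):
    # hidden: [batch, seq_len, hidden_size]; experts: list of per-expert MLP callables.
    batch, seq_len, hidden_size = hidden.shape
    tokens = hidden.reshape(-1, hidden_size)                  # flatten tokens

    logits = tokens @ gate_weight.t()                         # router gate
    probs = F.softmax(logits, dim=-1, dtype=torch.float32)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)          # top-k selection
    if norm_topk_prob:
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(tokens)
    for expert_id, expert in enumerate(experts):
        token_ids, slot = (topk_idx == expert_id).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        expert_out = expert(tokens[token_ids])                # per-expert MLP
        weight = topk_probs[token_ids, slot].unsqueeze(-1).to(expert_out.dtype)
        out.index_add_(0, token_ids, expert_out * weight)     # weighted aggregation
    return out.reshape(batch, seq_len, hidden_size)
```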
```mermaid
sequenceDiagram
    participant Rank0 as Rank 0
    participant Rank1 as Rank 1
    participant TP as TP Collective
    Rank0->>Rank0: Column-parallel linear (local compute)
    Rank1->>Rank1: Column-parallel linear (local compute)
    Rank0->>TP: All-reduce or All-gather
    Rank1->>TP: All-reduce or All-gather
    TP->>Rank0: Combined result
    TP->>Rank1: Combined result
    Rank0->>Rank0: Row-parallel linear (reduce output)
    Rank1->>Rank1: Row-parallel linear (reduce output)
    Rank0->>TP: Reduce-scatter
    Rank1->>TP: Reduce-scatter
    TP->>Rank0: Local shard
    TP->>Rank1: Local shard
```
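The column-parallel/row-parallel pattern in the diagram above can be sketched with standard `torch.distributed` collectives. This assumes a process group is already initialized (e.g. via `torch.distributed.init_process_group`) and uses plain NCCL collectives rather than FlashInfer's custom all-reduce; it illustrates the communication pattern only, not the `flashinfer.tensor_parallel` implementation.

```python
import torch
import torch.distributed as dist


def column_parallel_linear(x, weight_shard, gather_output=True):
    # Each rank holds a column shard: weight_shard is [out_features // tp, in_features].
    y_local = x @ weight_shard.t()
    if not gather_output:
        return y_local                       # keep the shard (e.g. feeds a row-parallel layer)
    shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, y_local)         # gather output shards across TP ranks
    return torch.cat(shards, dim=-1)


def row_parallel_linear(x_shard, weight_shard):
    # Each rank holds a row shard: weight_shard is [out_features, in_features // tp].
    y_partial = x_shard @ weight_shard.t()
    dist.all_reduce(y_partial)               # sum partial results across TP ranks
    return y_partial
```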

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Sparse MoE routing logic: Token-to-expert mapping, top-k selection, probability normalization, and per-expert computation aggregation need careful validation
  • Tensor parallelism communication: All-reduce/all-gather/reduce-scatter primitives with FlashInfer custom AR fallback; backward-pass behavior in autograd wrappers (_CopyToTensorParallelRegion, etc.)
  • Weight loading and FP8 dequantization: Block-wise FP8 unpacking and shard distribution across tensor-parallel ranks (see the sketch after this list)
  • End-to-end inference flow: Interaction between embedding, attention (RoPE), MLP/MoE, sampling, and generation in both single-GPU and distributed contexts
  • Module composition: Verify correct layer ordering, normalization placement, and optional feature flags (bias, FP8, model-specific quirks for Qwen/Llama)
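For context on the FP8 item above, block-wise FP8 dequantization generally means scaling fixed-size tiles of a `torch.float8_e4m3fn` weight by per-tile scales before casting to the compute dtype. The sketch below assumes a 128x128 tile layout and a 2-D scale tensor, which are common in FP8 checkpoints; the block size and scale layout used by these examples may differ.

```python
import torch


def dequantize_fp8_blockwise(w_fp8: torch.Tensor, scales: torch.Tensor,
                             block_size: int = 128) -> torch.Tensor:
    # w_fp8: [out, in] in torch.float8_e4m3fn.
    # scales: one float per (block_size x block_size) tile, shape [ceil(out/bs), ceil(in/bs)].
    out_f, in_f = w_fp8.shape
    w = w_fp8.to(torch.float32)
    for i in range(0, out_f, block_size):
        for j in range(0, in_f, block_size):
            scale = scales[i // block_size, j // block_size]
            w[i:i + block_size, j:j + block_size] *= scale
    return w.to(torch.bfloat16)
```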

Suggested reviewers

  • aleozlx
  • djmmoss
  • yzh119
  • cyx-6
  • joker-eph
  • wenscarl
  • jiahanc

Poem

🐰 A hop through the code, new experts aligned,
Tensors in parallel, no rank left behind,
Sparse routing magic, weights dequantized,
FlashInfer's power fully realized!
From embeddings to samples, the inference flies,
With custom all-reduce beneath the skies.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b14408b and 0010caa.

📒 Files selected for processing (8)
  • .gitignore (1 hunks)
  • examples/README.md (1 hunks)
  • examples/llm_inference.py (1 hunks)
  • examples/llm_inference_tp.py (1 hunks)
  • flashinfer/__init__.py (2 hunks)
  • flashinfer/linear.py (1 hunks)
  • flashinfer/sparse_moe.py (1 hunks)
  • flashinfer/tensor_parallel.py (1 hunks)


.gitignore (outdated diff)

```gitignore
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
*.whl
LICENSE.cutlass.txt
```
Collaborator commented:
Why ignoring them?

Author (@sfc-gh-goliaro) replied:
@yzh119 Just removed the licenses from the .gitignore. Should we leave the *.whl, or is there any use case where we want to commit a .whl file?

Remove specific license files from .gitignore.
@sfc-gh-goliaro (Author):

@coderabbitai summary


coderabbitai bot commented Dec 4, 2025

✅ Actions performed

Summary regeneration triggered.
