
Conversation


@sfc-gh-goliaro sfc-gh-goliaro commented Nov 30, 2025

📌 Description

Summary by CodeRabbit

Release Notes

  • New Features

    • Added sparse Mixture of Experts (MoE) routing and execution for efficient token routing to expert models.
    • Added tensor parallelism support for distributed LLM inference with optional custom communication optimization.
    • Added core linear operations including projections, matrix multiplications, and embeddings.
    • Added end-to-end LLM inference examples with optional FP8 quantization support.
    • Added tensor-parallel inference examples with distributed model inference patterns.
  • Documentation

    • Added comprehensive examples demonstrating LLM inference workflows and model loading.
    • Added documentation for tensor-parallel and MoE-based inference configurations and usage patterns.


🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes


coderabbitai bot commented Nov 30, 2025

Walkthrough

Introduces new modules for sparse Mixture of Experts (MoE), tensor parallelism, and linear operations. Adds end-to-end LLM inference examples for standard and distributed tensor-parallel configurations. Expands public API surface with initialization routines, communication primitives, and layer implementations. Includes FP8 dequantization and weight-loading utilities.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Version Control Configuration**<br>`.gitignore` | Adds an ignore pattern for wheel distribution files (`*.whl`). |
| **Documentation**<br>`examples/README.md` | Introduces comprehensive end-to-end FlashInfer usage examples covering linear operations, sparse MoE, tensor parallelism, and LLM inference (standard and distributed). Includes architecture diagrams, supported models, CLI options, and detailed usage patterns. |
| **Example Scripts**<br>`examples/llm_inference.py`, `examples/llm_inference_tp.py` | Provides two complete LLM inference pipelines: standard single-GPU inference (`llm_inference.py`) and tensor-parallel distributed inference (`llm_inference_tp.py`). Both implement embedding, linear layers, attention, MLP/MoE routing, and generation with sampling. They include FP8 dequantization, model loading from HuggingFace, and support for the Llama and Qwen model families. |
| **Core Module: Sparse MoE**<br>`flashinfer/sparse_moe.py` | Introduces a `sparse_moe_forward()` function for top-k expert routing and aggregation, and a `SparseMoeBlock` module encapsulating per-expert MLP computation with optional biases and normalization. Includes router gating, token-to-expert mapping, and output accumulation. |
| **Core Module: Tensor Parallelism**<br>`flashinfer/tensor_parallel.py` | Establishes Megatron-style tensor parallelism infrastructure with initialization (`init_tensor_parallel`), state accessors, and communication primitives (`all_reduce`, `all_gather`, `reduce_scatter`) supporting FlashInfer's custom all-reduce with an NCCL fallback. Provides column/row-parallel linear layers, a vocabulary-parallel embedding, tensor-parallel MLP and sparse MoE blocks, and autograd-compatible region wrappers for communication. |
| **Core Module: Linear Operations**<br>`flashinfer/linear.py` | Adds PyTorch-backed linear algebra functions: `linear`, `linear_with_bias`, `bmm`, `matmul`, and `embedding`, each supporting an optional in-place output tensor (see the sketch after this table). Includes `torch.compile` compatibility hooks and type hints. |
| **Public API Exports**<br>`flashinfer/__init__.py` | Expands the public surface with sparse MoE (`sparse_moe_forward`, `SparseMoeBlock`), tensor parallelism initialization and communication (`init_tensor_parallel`, `all_reduce`, `all_gather`, `reduce_scatter`, accessors), tensor-parallel layer classes (`ColumnParallelLinear`, `RowParallelLinear`, `MergedColumnParallelLinear`, `VocabParallelEmbedding`, `TensorParallelMLP`, `TensorParallelSparseMoeBlock`), and linear operations (`linear`, `linear_with_bias`, `bmm`, `matmul`, `embedding`). Includes backward-compatible aliases for the communication functions. |
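The table does not pin down exact signatures, but the optional in-place output pattern described for `flashinfer/linear.py` typically looks like the following. This is a minimal sketch in plain PyTorch, assuming a `linear(x, weight, out=None)` convention with `weight` shaped `[out_features, in_features]`; the actual FlashInfer functions may differ.

```python
from typing import Optional

import torch


def linear(x: torch.Tensor, weight: torch.Tensor,
           out: Optional[torch.Tensor] = None) -> torch.Tensor:
    # y = x @ weight.T, optionally written into a preallocated buffer.
    if out is None:
        return torch.matmul(x, weight.t())
    torch.matmul(x, weight.t(), out=out)
    return out


x = torch.randn(4, 512)
w = torch.randn(1024, 512)       # [out_features, in_features]
buf = torch.empty(4, 1024)
y = linear(x, w, out=buf)        # reuses buf instead of allocating a new tensor
```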

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Token as Token Sequence
    participant Router as Router Gate
    participant TopK as Top-K Selection
    participant Expert as Experts Pool
    participant Agg as Output Aggregator
    Token->>Router: Flatten & compute logits
    Router->>TopK: Apply softmax, select top-k
    TopK->>TopK: Normalize probabilities (optional)
    TopK->>Expert: Route tokens to selected experts
    loop Per Token, Per Selected Expert
        Expert->>Expert: MLP computation (gate/up/down)
    end
    Expert->>Agg: Expert outputs + routing weights
    Agg->>Agg: Weighted aggregation via index_add_
    Agg->>Token: Output reshaped to [batch, seq_len, hidden_size]
```
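As a rough illustration of the routing flow in the diagram above, here is a minimal sketch of top-k expert routing and weighted aggregation in plain PyTorch. The function name, the `norm_topk_prob` flag, and the expert list are illustrative assumptions, not the actual `flashinfer.sparse_moe` implementation.

```python
import torch
import torch.nn.functional as F


def sparse_moe_forward_sketch(hidden, gate_weight, experts, top_k=2,
                              norm_topk_prob=True):
    # hidden: [batch, seq_len, hidden_size]; experts: list of per-expert MLP callables.
    batch, seq_len, hidden_size = hidden.shape
    tokens = hidden.reshape(-1, hidden_size)                  # flatten tokens

    logits = tokens @ gate_weight.t()                         # router gate
    probs = F.softmax(logits, dim=-1, dtype=torch.float32)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)          # top-k selection
    if norm_topk_prob:
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(tokens)
    for expert_id, expert in enumerate(experts):
        token_ids, slot = (topk_idx == expert_id).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue
        expert_out = expert(tokens[token_ids])                # per-expert MLP
        weight = topk_probs[token_ids, slot].unsqueeze(-1).to(expert_out.dtype)
        out.index_add_(0, token_ids, expert_out * weight)     # weighted aggregation
    return out.reshape(batch, seq_len, hidden_size)
```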
```mermaid
sequenceDiagram
    participant Rank0 as Rank 0
    participant Rank1 as Rank 1
    participant TP as TP Collective
    Rank0->>Rank0: Column-parallel linear (local compute)
    Rank1->>Rank1: Column-parallel linear (local compute)
    Rank0->>TP: All-reduce or All-gather
    Rank1->>TP: All-reduce or All-gather
    TP->>Rank0: Combined result
    TP->>Rank1: Combined result
    Rank0->>Rank0: Row-parallel linear (reduce output)
    Rank1->>Rank1: Row-parallel linear (reduce output)
    Rank0->>TP: Reduce-scatter
    Rank1->>TP: Reduce-scatter
    TP->>Rank0: Local shard
    TP->>Rank1: Local shard
```
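The column-parallel/row-parallel pattern in the diagram above can be sketched with standard `torch.distributed` collectives. This assumes a process group is already initialized (e.g. via `torch.distributed.init_process_group`) and uses plain NCCL collectives rather than FlashInfer's custom all-reduce; it illustrates the communication pattern only, not the `flashinfer.tensor_parallel` implementation.

```python
import torch
import torch.distributed as dist


def column_parallel_linear(x, weight_shard, gather_output=True):
    # Each rank holds a column shard: weight_shard is [out_features // tp, in_features].
    y_local = x @ weight_shard.t()
    if not gather_output:
        return y_local                       # keep the shard (e.g. feeds a row-parallel layer)
    shards = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, y_local)         # gather output shards across TP ranks
    return torch.cat(shards, dim=-1)


def row_parallel_linear(x_shard, weight_shard):
    # Each rank holds a row shard: weight_shard is [out_features, in_features // tp].
    y_partial = x_shard @ weight_shard.t()
    dist.all_reduce(y_partial)               # sum partial results across TP ranks
    return y_partial
```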

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Sparse MoE routing logic: Token-to-expert mapping, top-k selection, probability normalization, and per-expert computation aggregation need careful validation
  • Tensor parallelism communication: All-reduce/all-gather/reduce-scatter primitives with FlashInfer custom AR fallback; backward-pass behavior in autograd wrappers (_CopyToTensorParallelRegion, etc.)
  • Weight loading and FP8 dequantization: Block-wise FP8 unpacking and shard distribution across tensor-parallel ranks (see the sketch after this list)
  • End-to-end inference flow: Interaction between embedding, attention (RoPE), MLP/MoE, sampling, and generation in both single-GPU and distributed contexts
  • Module composition: Verify correct layer ordering, normalization placement, and optional feature flags (bias, FP8, model-specific quirks for Qwen/Llama)
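For context on the FP8 item above, block-wise FP8 dequantization generally means scaling fixed-size tiles of a `torch.float8_e4m3fn` weight by per-tile scales before casting to the compute dtype. The sketch below assumes a 128x128 tile layout and a 2-D scale tensor, which are common in FP8 checkpoints; the block size and scale layout used by these examples may differ.

```python
import torch


def dequantize_fp8_blockwise(w_fp8: torch.Tensor, scales: torch.Tensor,
                             block_size: int = 128) -> torch.Tensor:
    # w_fp8: [out, in] in torch.float8_e4m3fn.
    # scales: one float per (block_size x block_size) tile, shape [ceil(out/bs), ceil(in/bs)].
    out_f, in_f = w_fp8.shape
    w = w_fp8.to(torch.float32)
    for i in range(0, out_f, block_size):
        for j in range(0, in_f, block_size):
            scale = scales[i // block_size, j // block_size]
            w[i:i + block_size, j:j + block_size] *= scale
    return w.to(torch.bfloat16)
```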

Suggested reviewers

  • aleozlx
  • djmmoss
  • yzh119
  • cyx-6
  • joker-eph
  • wenscarl
  • jiahanc

Poem

🐰 A hop through the code, new experts aligned,
Tensors in parallel, no rank left behind,
Sparse routing magic, weights dequantized,
FlashInfer's power fully realized!
From embeddings to samples, the inference flies,
With custom all-reduce beneath the skies.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b14408b and 0010caa.

📒 Files selected for processing (8)
  • .gitignore (1 hunks)
  • examples/README.md (1 hunks)
  • examples/llm_inference.py (1 hunks)
  • examples/llm_inference_tp.py (1 hunks)
  • flashinfer/__init__.py (2 hunks)
  • flashinfer/linear.py (1 hunks)
  • flashinfer/sparse_moe.py (1 hunks)
  • flashinfer/tensor_parallel.py (1 hunks)


.gitignore (outdated diff)

```gitignore
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
*.whl
LICENSE.cutlass.txt
```
Collaborator commented:
Why ignoring them?

Author (@sfc-gh-goliaro) replied:
@yzh119 Just removed the licenses from the .gitignore. Should we leave the *.whl, or is there any use case where we want to commit a .whl file?

Remove specific license files from .gitignore.
@sfc-gh-goliaro (Author):

@coderabbitai summary


coderabbitai bot commented Dec 4, 2025

✅ Actions performed

Summary regeneration triggered.
