[Flashinfer-Bench integration] HF end-to-end inference #2151
Conversation
Walkthrough

Introduces new modules for sparse Mixture of Experts (MoE), tensor parallelism, and linear operations. Adds end-to-end LLM inference examples for standard and distributed tensor-parallel configurations. Expands the public API surface with initialization routines, communication primitives, and layer implementations. Includes FP8 dequantization and weight-loading utilities.
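For the FP8 dequantization utilities mentioned in the walkthrough, the core operation is scaled dequantization: the quantized tensor is multiplied back by a per-tensor (or per-channel) scale. Below is a rough NumPy sketch of the per-tensor case; the function name, the use of float32 as a stand-in for an FP8 dtype, and the E4M3 range constant are assumptions for illustration — the actual utilities presumably operate on torch.float8_e4m3fn weights and cast to bf16/fp16.

```python
import numpy as np

def dequantize_fp8(w_q: np.ndarray, scale: float) -> np.ndarray:
    """Per-tensor scaled dequantization: w ~= w_q * scale.

    w_q stands in for FP8 values (float32 here holds the quantized data);
    a real kernel would cast torch.float8_e4m3fn weights up instead.
    """
    return w_q.astype(np.float32) * scale

# Quantize a toy weight matrix into the FP8 E4M3 dynamic range (max ~448),
# then dequantize to recover an approximation of the original weights.
w = np.random.randn(4, 8).astype(np.float32)
scale = float(np.abs(w).max() / 448.0)
w_q = np.round(w / scale)           # stand-in for the FP8 cast
w_dq = dequantize_fp8(w_q, scale)   # approximate reconstruction of w
```

Since rounding is to the nearest quantization step, the elementwise reconstruction error is bounded by half a step (scale / 2).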
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Token as Token Sequence
    participant Router as Router Gate
    participant TopK as Top-K Selection
    participant Expert as Experts Pool
    participant Agg as Output Aggregator
    Token->>Router: Flatten & compute logits
    Router->>TopK: Apply softmax, select top-k
    TopK->>TopK: Normalize probabilities (optional)
    TopK->>Expert: Route tokens to selected experts
    loop Per Token, Per Selected Expert
        Expert->>Expert: MLP computation (gate/up/down)
    end
    Expert->>Agg: Expert outputs + routing weights
    Agg->>Agg: Weighted aggregation via index_add_
    Agg->>Token: Output reshaped to [batch, seq_len, hidden_size]
```

```mermaid
sequenceDiagram
    participant Rank0 as Rank 0
    participant Rank1 as Rank 1
    participant TP as TP Collective
    Rank0->>Rank0: Column-parallel linear (local compute)
    Rank1->>Rank1: Column-parallel linear (local compute)
    Rank0->>TP: All-reduce or All-gather
    Rank1->>TP: All-reduce or All-gather
    TP->>Rank0: Combined result
    TP->>Rank1: Combined result
    Rank0->>Rank0: Row-parallel linear (reduce output)
    Rank1->>Rank1: Row-parallel linear (reduce output)
    Rank0->>TP: Reduce-scatter
    Rank1->>TP: Reduce-scatter
    TP->>Rank0: Local shard
    TP->>Rank1: Local shard
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
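The first diagram (router logits → softmax → top-k → per-expert MLPs → weighted aggregation) can be sketched in plain NumPy. This is a minimal single-process sketch, not the PR's implementation: the function name moe_forward and the callable-expert interface are hypothetical, and the real module presumably runs in PyTorch and aggregates expert outputs with index_add_.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, gate_w, experts, top_k=2, renormalize=True):
    """Sparse-MoE forward pass mirroring the diagram above.

    x: [batch, seq_len, hidden]; gate_w: [hidden, n_experts];
    experts: list of callables mapping [n, hidden] -> [n, hidden].
    """
    b, s, h = x.shape
    tokens = x.reshape(-1, h)                    # flatten token sequence
    probs = softmax(tokens @ gate_w)             # router logits -> softmax
    topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]      # select top-k
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    if renormalize:
        topk_w = topk_w / topk_w.sum(axis=-1, keepdims=True)
    out = np.zeros_like(tokens)
    for e_id, expert in enumerate(experts):
        tok_ids, slot = np.nonzero(topk_idx == e_id)       # tokens routed here
        if tok_ids.size:
            # weighted aggregation of expert outputs back into token slots
            out[tok_ids] += topk_w[tok_ids, slot, None] * expert(tokens[tok_ids])
    return out.reshape(b, s, h)                  # [batch, seq_len, hidden_size]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8))
gate_w = rng.standard_normal((8, 4))
experts = [lambda t, W=rng.standard_normal((8, 8)): t @ W for _ in range(4)]
y = moe_forward(x, gate_w, experts)
```

Each token contributes to at most top_k experts, so per-expert batches stay small and the weighted sum reconstructs the full [batch, seq_len, hidden_size] output.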
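The tensor-parallel flow in the second sequence diagram can be simulated in a single process by treating array splits as per-rank shards, concatenation as all-gather, and summation as all-reduce. Everything below is a NumPy stand-in for illustration; the actual code would use torch.distributed communication primitives across real ranks.

```python
import numpy as np

rng = np.random.default_rng(0)
tp = 2                                   # simulated tensor-parallel world size
x = rng.standard_normal((4, 8))          # [tokens, hidden]
W1 = rng.standard_normal((8, 16))        # column-parallel weight
W2 = rng.standard_normal((16, 8))        # row-parallel weight

# Column-parallel linear: each rank holds a column shard of W1
# and computes its output shard locally.
W1_shards = np.split(W1, tp, axis=1)
h_shards = [x @ w for w in W1_shards]    # local compute per rank

# All-gather: concatenate the activation shards across ranks.
h = np.concatenate(h_shards, axis=1)

# Row-parallel linear: each rank holds a row shard of W2 and a matching
# slice of the input; the partial outputs are summed, which is exactly
# what the all-reduce in the diagram performs.
W2_shards = np.split(W2, tp, axis=0)
h_in_shards = np.split(h, tp, axis=1)
partials = [hi @ wi for hi, wi in zip(h_in_shards, W2_shards)]
y = sum(partials)                        # all-reduce (sum) stand-in

# The sharded computation matches the unsharded reference.
y_ref = (x @ W1) @ W2
```

The column split followed by the row split is the classic Megatron-style pairing: the all-gather after the first linear and the reduction after the second keep each rank holding only 1/tp of each weight matrix.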
.gitignore (outdated):

```
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
*.whl
LICENSE.cutlass.txt
```
Why ignoring them?
@yzh119 Just removed the licenses from the .gitignore. Should we leave the *.whl entry? Or is there any use case where we would want to commit a .whl file?
Remove specific license files from .gitignore.
📌 Description
Summary by CodeRabbit
Release Notes
New Features
Documentation
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [ ] I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- [ ] I have installed the hooks with pre-commit install.
- [ ] I have run the hooks with pre-commit run --all-files and fixed any reported issues.

🧪 Tests
- [ ] All tests are passing (unittest, etc.).

Reviewer Notes