Stars
GitHub Copilot CLI brings the power of Copilot coding agent directly to your terminal.
ROCm / cupy
Forked from cupy/cupy: a NumPy-compatible array library accelerated by CUDA
MSCCL++: A GPU-driven communication stack for scalable AI applications
gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
KV cache store for distributed LLM inference
Distributed Compiler based on Triton for Parallel Systems
Efficient and easy multi-instance LLM serving
A PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7)
A Datacenter Scale Distributed Inference Serving Framework
A list of ICs and IPs for AI, Machine Learning and Deep Learning.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient Multi-head Latent Attention Kernels
A PyTorch native platform for training generative AI models
Dynamic Memory Management for Serving LLMs without PagedAttention
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
DashInfer is a native LLM inference engine that aims to deliver industry-leading performance across hardware architectures, including CUDA, x86, and ARMv9.
Puzzles for learning Triton
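For context on what those puzzles build toward, here is a minimal sketch of an element-wise add kernel in Triton's Python DSL; the function names (`add_kernel`, `add`) and the block size are my own choices for illustration, not taken from the puzzles repo.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Running this requires a CUDA GPU and `pip install triton`; the puzzles themselves work up from exactly this load/mask/store pattern.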
Helpful tools and examples for working with flex-attention
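For orientation, a minimal sketch of the `flex_attention` API (in PyTorch 2.5+ under `torch.nn.attention.flex_attention`) that these tools target: a `score_mod` callable rewrites each attention score, here expressing a plain causal mask. The tensor shapes and the `causal` helper name are illustrative assumptions.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Keep the score where the query may attend to the key; mask out the rest.
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# (batch, heads, seq_len, head_dim) — arbitrary example sizes.
q, k, v = (torch.randn(1, 8, 128, 64, device="cuda") for _ in range(3))
out = flex_attention(q, k, v, score_mod=causal)
```

In practice you would wrap the call with `torch.compile` so the score modification is fused into a single attention kernel rather than run eagerly.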
How to optimize common algorithms in CUDA.

