wangraying
Organizations: @microsoft @BaguaSys @MicrosoftCopilot
GitHub Copilot CLI brings the power of Copilot coding agent directly to your terminal.
Shell · 9,498 stars · 1,294 forks · Updated Mar 20, 2026

Python · 24 stars · 18 forks · Updated Mar 5, 2026

A NumPy-compatible array library accelerated by CUDA
Python · 7 stars · 4 forks · Updated Feb 27, 2026

MSCCL++: A GPU-driven communication stack for scalable AI applications
C++ · 492 stars · 90 forks · Updated Mar 21, 2026

A hyperparameter optimization framework
Python · 13,728 stars · 1,281 forks · Updated Mar 19, 2026
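Hyperparameter optimization frameworks of this kind revolve around a suggest-and-evaluate loop: the framework proposes trial parameters, scores an objective, and keeps the best result. A minimal pure-Python random-search sketch of that loop (the function names here are illustrative, not the framework's API):

```python
import random

def objective(params):
    # Toy objective to minimize: (x - 2)^2 + (y + 1)^2, optimum at (2, -1).
    return (params["x"] - 2) ** 2 + (params["y"] + 1) ** 2

def random_search(n_trials, seed=0):
    # Each trial: sample candidate parameters, evaluate, track the best.
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {"x": rng.uniform(-10, 10), "y": rng.uniform(-10, 10)}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search(n_trials=200)
```

Real frameworks replace the uniform sampler with smarter strategies (e.g. Bayesian or evolutionary search) behind the same suggest/evaluate interface.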

gpt-oss-120b and gpt-oss-20b are two open-weight language models by OpenAI
Python · 19,931 stars · 2,064 forks · Updated Jan 13, 2026

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).
Python · 2,231 stars · 264 forks · Updated Feb 20, 2026
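Speculative decoding of the EAGLE/Medusa family drafts several tokens cheaply, then verifies them against the target model in a single pass, accepting the longest matching prefix. A toy pure-Python sketch of that verification step (the token lists stand in for real model outputs; this is not EAGLE's actual interface):

```python
def accept_draft(draft_tokens, target_tokens):
    # Greedy verification: accept draft tokens while they match what the
    # target model would have produced at each position; on the first
    # mismatch, keep the target model's token instead and stop.
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # the target's correction still counts as one token
            break
    return accepted

# Draft model proposed 4 tokens; the target agrees on the first two.
print(accept_draft([5, 9, 3, 7], [5, 9, 4, 7]))  # [5, 9, 4]
```

The speedup comes from the target model scoring all draft positions in one forward pass instead of one pass per token.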

KV cache store for distributed LLM inference
C++ · 399 stars · 34 forks · Updated Nov 13, 2025
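A KV cache store for distributed inference maps a token prefix to the attention key/value tensors already computed for it, so a new request sharing that prefix can skip part of prefill. A toy sketch of prefix-keyed lookup in pure Python (the hashing scheme and LRU eviction here are illustrative assumptions, not this project's design):

```python
import hashlib
from collections import OrderedDict

class PrefixKVStore:
    """Maps a token-id prefix to its cached KV blob, with LRU eviction."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._store = OrderedDict()  # prefix hash -> kv blob

    @staticmethod
    def _key(tokens):
        # Stable hash of the token-id prefix.
        return hashlib.sha256(str(tuple(tokens)).encode("utf-8")).hexdigest()

    def put(self, tokens, kv_blob):
        k = self._key(tokens)
        self._store[k] = kv_blob
        self._store.move_to_end(k)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

    def longest_prefix_hit(self, tokens):
        # Walk from the full sequence down to shorter prefixes and return
        # the longest cached one, so prefill can resume right after it.
        for end in range(len(tokens), 0, -1):
            k = self._key(tokens[:end])
            if k in self._store:
                self._store.move_to_end(k)
                return end, self._store[k]
        return 0, None

store = PrefixKVStore()
store.put([1, 2, 3], "kv-for-123")
hit_len, blob = store.longest_prefix_hit([1, 2, 3, 4, 5])
print(hit_len, blob)  # 3 kv-for-123
```

Production stores add block-granular keys and network transport so cache hits work across serving instances, but the prefix-match contract is the same.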

Kernels, of the mega variety :)
Python · 693 stars · 46 forks · Updated Mar 22, 2026

Distributed Compiler based on Triton for Parallel Systems
Python · 1,394 stars · 132 forks · Updated Mar 11, 2026

Efficient and easy multi-instance LLM serving
Python · 535 stars · 46 forks · Updated Mar 12, 2026

PTX-Tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7)
TeX · 66 stars · 2 forks · Updated Mar 24, 2025

A Datacenter Scale Distributed Inference Serving Framework
Rust · 6,367 stars · 944 forks · Updated Mar 22, 2026

A list of ICs and IPs for AI, Machine Learning and Deep Learning.
PHP · 1,702 stars · 279 forks · Updated Jun 5, 2024

A fast communication-overlapping library for tensor/expert parallelism on GPUs.
C++ · 1,271 stars · 98 forks · Updated Aug 28, 2025

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
7,974 stars · 288 forks · Updated May 15, 2025

Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Python · 165 stars · 9 forks · Updated Oct 13, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
CUDA · 6,273 stars · 840 forks · Updated Mar 22, 2026
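Fine-grained scaling means each small block of a tensor gets its own scale factor before casting to FP8, which preserves dynamic range much better than a single per-tensor scale. A pure-Python sketch of per-block scale/quantize/dequantize (the block size and the FP8 E4M3 maximum of 448 follow the common convention; the clamping below is a simplification, not DeepGEMM's kernel logic):

```python
FP8_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def quantize_blockwise(values, block_size=4):
    # Split into blocks; each block stores scaled codes plus its own scale.
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        amax = max(abs(v) for v in block) or 1.0
        scale = amax / FP8_MAX
        # "Cast to FP8" stand-in: clamp to the representable range after scaling
        # (real kernels also round to the nearest FP8 value here).
        codes = [max(-FP8_MAX, min(FP8_MAX, v / scale)) for v in block]
        blocks.append((codes, scale))
    return blocks

def dequantize_blockwise(blocks):
    out = []
    for codes, scale in blocks:
        out.extend(c * scale for c in codes)
    return out

# A tiny block stays precise even next to a huge one, because each
# block carries its own scale.
x = [0.001, -0.002, 0.003, 1000.0, -2000.0, 1500.0]
rt = dequantize_blockwise(quantize_blockwise(x, block_size=3))
```

With one scale for the whole tensor, the 0.001-range block would collapse toward zero; per-block scales keep both ranges representable.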

DeepEP: an efficient expert-parallel communication library
CUDA · 9,060 stars · 1,128 forks · Updated Feb 9, 2026

FlashMLA: Efficient Multi-head Latent Attention Kernels
C++ · 12,527 stars · 1,005 forks · Updated Feb 6, 2026

Machine Learning Systems
JavaScript · 22,862 stars · 2,719 forks · Updated Mar 22, 2026

A PyTorch native platform for training generative AI models
Python · 5,171 stars · 754 forks · Updated Mar 22, 2026

Dynamic Memory Management for Serving LLMs without PagedAttention
C · 466 stars · 39 forks · Updated May 30, 2025

Serving multiple LoRA finetuned LLMs as one
Python · 1,148 stars · 61 forks · Updated May 8, 2024
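Serving many LoRA-finetuned models as one works because every request shares the same base weights and differs only by a low-rank update: y = Wx + B(Ax). A toy plain-Python sketch of batching requests with different adapters over one base weight (the adapter names, ranks, and shapes are illustrative, not this project's API):

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vecadd(a, b):
    return [x + y for x, y in zip(a, b)]

# Shared base weight W (2x2), used by every request in the batch.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Two rank-1 LoRA adapters, each stored as a pair (A, B) with y += B @ (A @ x).
adapters = {
    "math": ([[2.0, 0.0]], [[1.0], [0.0]]),  # A is 1x2, B is 2x1
    "chat": ([[0.0, 3.0]], [[0.0], [1.0]]),
}

def serve_batch(batch):
    # One pass over the batch: every request reuses the base matmul,
    # then adds its own adapter's low-rank correction.
    outs = []
    for adapter_name, x in batch:
        A, B = adapters[adapter_name]
        base = matvec(W, x)
        delta = matvec(B, matvec(A, x))  # low-rank update B @ (A @ x)
        outs.append(vecadd(base, delta))
    return outs

print(serve_batch([("math", [1.0, 1.0]), ("chat", [1.0, 1.0])]))
# [[3.0, 1.0], [1.0, 4.0]]
```

Because the adapters are tiny relative to W, dozens of finetuned variants can share one copy of the base model in GPU memory.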

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Jupyter Notebook · 2,722 stars · 194 forks · Updated Jun 25, 2024

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
C · 274 stars · 28 forks · Updated Aug 6, 2025

Material for gpu-mode lectures
Jupyter Notebook · 5,866 stars · 587 forks · Updated Feb 1, 2026

Puzzles for learning Triton
Jupyter Notebook · 2,343 stars · 207 forks · Updated Mar 18, 2026

Helpful tools and examples for working with flex-attention
Python · 1,161 stars · 75 forks · Updated Feb 8, 2026

How to optimize some algorithms in CUDA.
CUDA · 2,881 stars · 264 forks · Updated Mar 22, 2026