Ryan Spring rdspring1

I contribute to PyTorch Compilers.

100 followers · 48 following

Stars

pytorch / ao

PyTorch native quantization and sparsity for training and inference

Python 2,741 465 Updated Mar 22, 2026

meta-pytorch / MSLK

MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI training and inference, such as FP8 row-wise quantization and …

Python 87 34 Updated Mar 22, 2026

gpu-mode / reference-kernels

Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!

Python 228 135 Updated Mar 18, 2026

Infatoshi / grokking-megakernels

Companion code for Grokking Megakernels: fuse an entire LLM forward pass into a single CUDA kernel

Cuda 18 1 Updated Feb 9, 2026

NVIDIA / TileGym

Helpful kernel tutorials and examples for tile-based GPU programming

Python 682 55 Updated Mar 22, 2026

zasdfgbnm / tma-ptws

Python 2 Updated Feb 5, 2026

flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving

Python 5,199 820 Updated Mar 22, 2026

NVIDIA / nvshmem

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 484 68 Updated Mar 10, 2026

yifuwang / symm-mem-recipes

Python 163 16 Updated Dec 27, 2024

meta-pytorch / kraken

Triton-based Symmetric Memory operators and examples

Python 94 13 Updated Jan 15, 2026

NVIDIA / cuda-tile

CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…

MLIR 879 64 Updated Mar 17, 2026

NVIDIA / cutile-python

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,985 128 Updated Mar 21, 2026

karpathy / nanochat

The best ChatGPT that $100 can buy.

Python 49,963 6,544 Updated Mar 17, 2026

beowolx / rensa

High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datasets

Rust 237 21 Updated Mar 10, 2026

ByteDance-Seed / Triton-distributed

Distributed Compiler based on Triton for Parallel Systems

Python 1,394 132 Updated Mar 11, 2026

microsoft / dion

Dion optimizer algorithm

Python 456 53 Updated Jan 16, 2026

meta-pytorch / attention-gym

Helpful tools and examples for working with flex-attention

Python 1,161 75 Updated Feb 8, 2026

apaz-cli / MuonClip

A reference implementation of MuonClip

Python 1 Updated Jul 24, 2025

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 73,966 14,633 Updated Mar 22, 2026

Dao-AILab / quack

A Quirky Assortment of CuTe Kernels

Python 863 98 Updated Mar 22, 2026

DeMoriarty / custom_matmul_kernels

Customized matrix multiplication kernels

Jupyter Notebook 57 6 Updated Mar 5, 2022

simveit / effective_transpose

Effective transpose on Hopper GPU

Cuda 28 3 Updated Sep 6, 2025

LeanModels / DFloat11

DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference

Python 616 38 Updated Nov 24, 2025

a2aproject / A2A

Agent2Agent (A2A) is an open protocol enabling communication and interoperability between opaque agentic applications.

Shell 22,730 2,308 Updated Mar 16, 2026

bertmaher / simplegemm

Cuda 132 16 Updated Mar 19, 2026

pranjalssh / fast.cu

Fastest kernels written from scratch

Cuda 561 69 Updated Sep 18, 2025

NVIDIA / warp

A Python framework for accelerated simulation, data generation and spatial computing.

Python 6,399 468 Updated Mar 22, 2026

ravaghi / kaggle-notebooks

Kaggle Notebooks

Jupyter Notebook 11 2 Updated Jan 14, 2026

srush / prof8

Experimental paper writing linter.

TeX 35 Updated Sep 2, 2024

Ligo-Biosciences / AlphaFold3

Open source implementation of AlphaFold3

Python 1,048 89 Updated Oct 7, 2024