rdspring1

Organizations

@RUSH-LAB


PyTorch native quantization and sparsity for training and inference

Python 2,741 465 Updated Mar 22, 2026

MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI training and inference, such as FP8 row-wise quantization and …

Python 87 34 Updated Mar 22, 2026

Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!

Python 228 135 Updated Mar 18, 2026

Companion code for Grokking Megakernels: fuse an entire LLM forward pass into a single CUDA kernel

Cuda 18 1 Updated Feb 9, 2026

Helpful kernel tutorials and examples for tile-based GPU programming

Python 682 55 Updated Mar 22, 2026
Python 2 Updated Feb 5, 2026

FlashInfer: Kernel Library for LLM Serving

Python 5,199 820 Updated Mar 22, 2026

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 484 68 Updated Mar 10, 2026
Python 163 16 Updated Dec 27, 2024

Triton-based Symmetric Memory operators and examples

Python 94 13 Updated Jan 15, 2026

CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA te…

MLIR 879 64 Updated Mar 17, 2026

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,985 128 Updated Mar 21, 2026

The best ChatGPT that $100 can buy.

Python 49,963 6,544 Updated Mar 17, 2026

High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datasets

Rust 237 21 Updated Mar 10, 2026

Distributed Compiler based on Triton for Parallel Systems

Python 1,394 132 Updated Mar 11, 2026

Dion optimizer algorithm

Python 456 53 Updated Jan 16, 2026

Helpful tools and examples for working with flex-attention

Python 1,161 75 Updated Feb 8, 2026

A reference implementation of MuonClip

Python 1 Updated Jul 24, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 73,966 14,633 Updated Mar 22, 2026

A Quirky Assortment of CuTe Kernels

Python 863 98 Updated Mar 22, 2026

Customized matrix multiplication kernels

Jupyter Notebook 57 6 Updated Mar 5, 2022

Effective transpose on Hopper GPU

Cuda 28 3 Updated Sep 6, 2025

DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference

Python 616 38 Updated Nov 24, 2025

Agent2Agent (A2A) is an open protocol enabling communication and interoperability between opaque agentic applications.

Shell 22,730 2,308 Updated Mar 16, 2026
Cuda 132 16 Updated Mar 19, 2026

Fastest kernels written from scratch

Cuda 561 69 Updated Sep 18, 2025

A Python framework for accelerated simulation, data generation and spatial computing.

Python 6,399 468 Updated Mar 22, 2026

Kaggle Notebooks

Jupyter Notebook 11 2 Updated Jan 14, 2026

Experimental paper writing linter.

TeX 35 Updated Sep 2, 2024

Open source implementation of AlphaFold3

Python 1,048 89 Updated Oct 7, 2024