Stars
RAFT contains fundamental, widely used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing …
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
Tile primitives for speedy kernels
Machine Learning Engineering Open Book
CUDA Templates and Python DSLs for High-Performance Linear Algebra



