Kernels are written in the CuTe-DSL.
# For CUDA 12.9: pip install quack-kernels # For CUDA 13.1: pip install 'quack-kernels[cu13]' --extra-index-url https://download.pytorch.org/whl/cu130 # Or using uv (faster): uv pip install 'quack-kernels[cu13]' # Optional: install NVIDIA matmul heuristics for better untuned GEMM configs pip install 'quack-kernels[heuristics]'- H100 or B200/B300 GPU
- CUDA toolkit 12.9+
- Python 3.12
- 🦆 RMSNorm forward + backward
- 🦆 Softmax forward + backward
- 🦆 Cross entropy forward + backward
- 🦆 Layernorm forward
- 🦆 Hopper gemm + epilogue
- 🦆 Blackwell gemm + epilogue
from quack import rmsnorm, softmax, cross_entropy [2025-07-10] We have a comprehensive blogpost on how to get memory-bound kernels to speed-of-light, right in the comfort of Python thanks to the CuTe-DSL.
See our blogpost for the details.
To set up the development environment:
pip install -e '.[dev]' pre-commit install # For CUDA 13.1: pip install 'quack-kernels[dev,cu13]' --extra-index-url https://download.pytorch.org/whl/cu130 # Or using uv: uv pip install 'quack-kernels[dev,cu13]'