PyTorch SDK for TurboQuant — near-optimal vector quantization with zero metadata overhead.
Based on the TurboQuant paper from Google Research (ICLR 2026).
Reference implementation: mlx-vlm PR #858
TurboQuant compresses vectors (KV cache embeddings, search indices) to 2-4 bits with:
- Zero calibration — no training data needed (data-oblivious)
- Zero metadata overhead — a 4-byte seed replaces all per-block scales/zero-points
- Near-optimal compression — within 2.7x of information-theoretic lower bound
- Unbiased inner products — critical for attention score accuracy
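The core scheme (normalize, rotate by a seeded orthogonal matrix, scalar-quantize) can be sketched in a few lines of NumPy. This is a minimal illustration, not the SDK API: it substitutes a uniform quantizer for the precomputed Lloyd-Max codebook and omits the QJL sign-bit correction, and the `tq_encode`/`tq_decode` names are invented for the sketch.

```python
import numpy as np

def tq_encode(x, bits=4, seed=42):
    """Illustrative encoder: normalize -> rotate -> scalar-quantize."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # [1] Normalize; the norm is the only per-vector scalar kept
    norm = np.linalg.norm(x)
    x_hat = x / norm
    # [2] Rotate with a random orthogonal matrix derived from the seed
    R, _ = np.linalg.qr(rng.standard_normal((d, d)))
    x_rot = R @ x_hat
    # [3] Uniform (b-1)-bit scalar quantizer (stand-in for Lloyd-Max);
    #     rotated coordinates of a unit vector have std 1/sqrt(d),
    #     so the range +/- 3/sqrt(d) covers nearly all of them
    levels = 2 ** (bits - 1)
    lo = -3.0 / np.sqrt(d)
    step = (6.0 / np.sqrt(d)) / levels
    q = np.clip(np.floor((x_rot - lo) / step), 0, levels - 1).astype(np.int8)
    return norm, q, (R, lo, step)

def tq_decode(norm, q, aux):
    R, lo, step = aux
    x_rot = lo + (q + 0.5) * step      # reconstruct at bin midpoints
    return norm * (R.T @ x_rot)        # undo rotation, restore scale

x = np.random.default_rng(0).standard_normal(128)
norm, q, aux = tq_encode(x)
x_rec = tq_decode(norm, q, aux)
cos = float(x @ x_rec / (np.linalg.norm(x) * np.linalg.norm(x_rec)))
```

Because the rotated coordinates follow a known distribution, the quantizer grid is fixed in advance from the dimension alone; nothing is learned from data, which is the data-oblivious property above.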
How it works:

```
Input vector x (e.g., a KV cache key, dim=128)
        │
        ▼
[1] Normalize: x̂ = x / ‖x‖               (store ‖x‖ as scalar)
        │
        ▼
[2] Rotate: x' = R · x̂                   (R = random orthogonal from seed)
        │    After rotation, coordinates ~ Beta distribution
        │    (known analytically → no calibration needed)
        ▼
[3] Scalar quantize: qᵢ = LloydMax(x'ᵢ)   (precomputed codebook, b-1 bits)
        │
        ▼
[4] QJL correction: (1 extra bit for unbiased inner products)
        residual = x̂ - dequant(q)
        signs = sign(P · residual)        (P = random Gaussian from seed)
        │
        ▼
Stored: ‖x‖ (32-bit) + q (b-1 bits × d) + ‖residual‖ (32-bit) + signs (d bits)

Total overhead: 0.5 bits/element (vs 1.5-2 bits for GPTQ/KIVI)
```

Install the dependencies:

```
pip install torch numpy
```

No other dependencies. Clone and use directly:
```
git clone https://github.com/GenauraApp/TurboQuant.git
cd TurboQuant
python tests/test_real.py   # Verify everything works
```

Quantize and reconstruct with `MSECodec`:

```python
from turboquant.core import MSECodec

codec = MSECodec(dim=128, bits=4, seed=42)

# Quantize
state = codec.quantize(vectors)          # vectors: [n, 128] float tensor

# Reconstruct
reconstructed = codec.dequantize(state)  # [n, 128] — 99.5% cosine similarity
```

Score directly against quantized vectors with `ProdCodec`:

```python
from turboquant.core import ProdCodec

codec = ProdCodec(dim=128, bits=4, seed=42)

# Quantize keys
state = codec.quantize(keys)             # keys: [n, 128]

# Compute attention scores directly from quantized keys
scores = codec.score(queries, state)     # [n_queries, n_keys] — unbiased
```

KV cache quantization:

```python
import torch
import math
from turboquant.core import ProdCodec, MSECodec

dim, bits = 128, 4
key_codec = ProdCodec(dim, bits, seed=0)
val_codec = MSECodec(dim, bits, seed=1)

# Quantize KV cache
key_state = key_codec.quantize(keys)      # keys: [seq_len, dim]
val_state = val_codec.quantize(values)    # values: [seq_len, dim]

# Attention (no full dequantize needed for keys)
scale = 1.0 / math.sqrt(dim)
scores = key_codec.score(queries, key_state) * scale
weights = torch.softmax(scores, dim=-1)
output = weights @ val_codec.dequantize(val_state)
```

Hugging Face integration sketch:

```python
from transformers import AutoModelForCausalLM
from turboquant.core import ProdCodec, MSECodec

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Patch the KV cache after prefill
for layer in model.model.layers:
    attn = layer.self_attn
    # After the model generates keys/values, quantize them:
    #   key_state = ProdCodec(dim=128, bits=4).quantize(keys)
    #   val_state = MSECodec(dim=128, bits=4).quantize(values)
    # This replaces the FP16 cache with a 3.56x smaller quantized state
```

For a complete KV cache integration, see the mlx-vlm reference, which implements TurboQuantKVCache as a drop-in replacement.
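The unbiased scoring above rests on a standard identity for Gaussian projections: for g ~ N(0, I), E[sign(⟨g, x⟩) · ⟨g, y⟩] = √(2/π) · ⟨x, y⟩ / ‖x‖, so 1-bit signs of random projections recover inner products in expectation after a √(π/2) rescale. A quick NumPy check of this identity (not the SDK API; `m` is an illustrative number of projections):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 20000
x = rng.standard_normal(d)
y = x + 0.5 * rng.standard_normal(d)   # correlated pair, like query/key

# Store only the 1-bit signs of the projections of x ...
P = rng.standard_normal((m, d))
signs = np.sign(P @ x)

# ... and estimate <x, y> from them: unbiased after the sqrt(pi/2) rescale
est = np.linalg.norm(x) * np.sqrt(np.pi / 2) / m * float(signs @ (P @ y))
true = float(x @ y)
```

In the codec this trick is applied only to the small quantization residual, which is why a single extra bit per element is enough to remove the bias of the scalar-quantized inner product.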
Vector search (asymmetric scoring):

```python
import torch
from turboquant.core import ProdCodec

codec = ProdCodec(dim=200, bits=4, seed=42)

# Index building: O(n) — no k-means training needed
states = codec.quantize(database_vectors)

# Search: asymmetric scoring (query is full precision)
scores = codec.score(query.unsqueeze(0), states)
top_k = torch.topk(scores.squeeze(0), k=10)
```

Choosing a codec:

| Use case | Codec | Why |
|---|---|---|
| KV cache keys | ProdCodec | Unbiased inner products for attention scores |
| KV cache values | MSECodec | Best reconstruction (weighted sum after softmax) |
| Vector search index | ProdCodec | Unbiased scoring for retrieval |
| Embedding compression | MSECodec | Minimum reconstruction error |
All benchmark numbers below were verified via Qwen CLI, independent scripts, and Claude python-reviewer.
| Metric | Result |
|---|---|
| MSE distortion (4-bit) | 0.009 relative (within 2.3x of theoretical limit) |
| Inner product bias | 0.012% (effectively zero) |
| 4-bit cosine similarity | 99.5% reconstruction, 96-98% attention fidelity |
| 3-bit cosine similarity | 98.3% reconstruction, 91-95% attention fidelity |
| Compression vs FP16 | 3.56x (4-bit) to 6.40x (2-bit) |
| Encode throughput (CPU) | 700K-1M vectors/sec (d=128) |
| Calibration required | None |
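The compression ratios in the table follow directly from the storage layout shown earlier (norm, codes, residual norm, sign bits); a quick check:

```python
def compressed_bits(d, b):
    # 32-bit norm + (b-1)-bit codes per element + 32-bit residual norm
    # + 1 QJL sign bit per element
    return 32 + (b - 1) * d + 32 + d

d = 128
ratio_4bit = 16 * d / compressed_bits(d, 4)   # vs FP16: 2048 / 576 bits
ratio_2bit = 16 * d / compressed_bits(d, 2)   # vs FP16: 2048 / 320 bits
```

This reproduces the 3.56x (4-bit) and 6.40x (2-bit) figures, with the two 32-bit scalars amortized over d = 128 elements accounting for the 0.5 bits/element overhead.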
Citations:

```bibtex
@inproceedings{zandieh2026turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={ICLR},
  year={2026}
}

@inproceedings{zandieh2024qjl,
  title={QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead},
  author={Zandieh, Amir and Daliri, Majid and Han, Insu},
  booktitle={AAAI},
  year={2024}
}

@inproceedings{han2026polarquant,
  title={PolarQuant: Quantizing KV Caches with Polar Transformation},
  author={Han, Insu and Kacham, Praneeth and Karbasi, Amin and Mirrokni, Vahab and Zandieh, Amir},
  booktitle={AISTATS},
  year={2026}
}
```

License: Apache 2.0