Skip to content

GenauraApp/TurboQuant

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TurboQuant

PyTorch SDK for TurboQuant — near-optimal vector quantization with zero metadata overhead.

Based on Google Research (ICLR 2026):

Reference implementation: mlx-vlm PR #858

What It Does

TurboQuant compresses vectors (KV cache embeddings, search indices) to 2-4 bits with:

  • Zero calibration — no training data needed (data-oblivious)
  • Zero metadata overhead — a 4-byte seed replaces all per-block scales/zero-points
  • Near-optimal compression — within 2.7x of information-theoretic lower bound
  • Unbiased inner products — critical for attention score accuracy

How It Works

Input vector x (e.g., a KV cache key, dim=128) │ ▼ [1] Normalize: x̂ = x / ‖x‖ (store ‖x‖ as scalar) │ ▼ [2] Rotate: x' = R · x̂ (R = random orthogonal from seed) │ After rotation, coordinates ~ Beta distribution │ (known analytically → no calibration needed) ▼ [3] Scalar quantize: qᵢ = LloydMax(x'ᵢ) (precomputed codebook, b-1 bits) │ ▼ [4] QJL correction: (1 extra bit for unbiased inner products) residual = x̂ - dequant(q) signs = sign(P · residual) (P = random Gaussian from seed) │ ▼ Stored: ‖x‖ (32-bit) + q (b-1 bits × d) + ‖residual‖ (32-bit) + signs (d bits) Total overhead: 0.5 bits/element (vs 1.5-2 bits for GPTQ/KIVI) 

Install

pip install torch numpy

No other dependencies. Clone and use directly:

git clone https://github.com/GenauraApp/TurboQuant.git cd TurboQuant python tests/test_real.py # Verify everything works

Quick Start

Compress and reconstruct vectors (values)

from turboquant.core import MSECodec codec = MSECodec(dim=128, bits=4, seed=42) # Quantize state = codec.quantize(vectors) # vectors: [n, 128] float tensor # Reconstruct reconstructed = codec.dequantize(state) # [n, 128] — 99.5% cosine similarity

Compress with unbiased inner products (keys)

from turboquant.core import ProdCodec codec = ProdCodec(dim=128, bits=4, seed=42) # Quantize keys state = codec.quantize(keys) # keys: [n, 128] # Compute attention scores directly from quantized keys scores = codec.score(queries, state) # [n_queries, n_keys] — unbiased

Full attention with quantized KV cache

import torch import math from turboquant.core import ProdCodec, MSECodec dim, bits = 128, 4 key_codec = ProdCodec(dim, bits, seed=0) val_codec = MSECodec(dim, bits, seed=1) # Quantize KV cache key_state = key_codec.quantize(keys) # keys: [seq_len, dim] val_state = val_codec.quantize(values) # values: [seq_len, dim] # Attention (no full dequantize needed for keys) scale = 1.0 / math.sqrt(dim) scores = key_codec.score(queries, key_state) * scale weights = torch.softmax(scores, dim=-1) output = weights @ val_codec.dequantize(val_state)

How to Apply This SDK

On top of HuggingFace Transformers

from transformers import AutoModelForCausalLM from turboquant.core import ProdCodec, MSECodec model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct") # Patch the KV cache after prefill for layer in model.model.layers: attn = layer.self_attn # After model generates keys/values, quantize them: # key_state = ProdCodec(dim=128, bits=4).quantize(keys) # val_state = MSECodec(dim=128, bits=4).quantize(values) # This replaces the FP16 cache with 3.56x smaller quantized state

For a complete KV cache integration, see the mlx-vlm reference which implements TurboQuantKVCache as a drop-in replacement.

On top of FAISS / vector search

from turboquant.core import ProdCodec codec = ProdCodec(dim=200, bits=4, seed=42) # Index building: O(n) — no k-means training needed states = codec.quantize(database_vectors) # Search: asymmetric scoring (query is full precision) scores = codec.score(query.unsqueeze(0), states) top_k = torch.topk(scores.squeeze(0), k=10)

When to use which codec

Use case Codec Why
KV cache keys ProdCodec Unbiased inner products for attention scores
KV cache values MSECodec Best reconstruction (weighted sum after softmax)
Vector search index ProdCodec Unbiased scoring for retrieval
Embedding compression MSECodec Minimum reconstruction error

Verified Results

All verified via Qwen CLI + independent scripts + Claude python-reviewer.

Metric Result
MSE distortion (4-bit) 0.009 relative (within 2.3x of theoretical limit)
Inner product bias 0.012% (effectively zero)
4-bit cosine similarity 99.5% reconstruction, 96-98% attention fidelity
3-bit cosine similarity 98.3% reconstruction, 91-95% attention fidelity
Compression vs FP16 3.56x (4-bit) to 6.40x (2-bit)
Encode throughput (CPU) 700K-1M vectors/sec (d=128)
Calibration required None

Papers

@inproceedings{zandieh2026turboquant, title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate}, author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab}, booktitle={ICLR}, year={2026} } @inproceedings{zandieh2024qjl, title={QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead}, author={Zandieh, Amir and Daliri, Majid and Han, Insu}, booktitle={AAAI}, year={2024} } @inproceedings{han2026polarquant, title={PolarQuant: Quantizing KV Caches with Polar Transformation}, author={Han, Insu and Kacham, Praneeth and Karbasi, Amin and Mirrokni, Vahab and Zandieh, Amir}, booktitle={AISTATS}, year={2026} }

License

Apache 2.0

Releases

No releases published

Packages

 
 
 

Contributors

Languages