# langchain-turboquant

The first LangChain integration for TurboQuant - Google Research's training-free vector compression algorithm (ICLR 2026).

Drop-in replacement for any LangChain vector store with ~6x memory reduction and near-zero accuracy loss. No GPU required.

Python 3.9+ · License: MIT · Tests: 296 passed

Korean README (한국어)


## Why langchain-turboquant?

Large-scale RAG pipelines store millions of embedding vectors in memory. At 1536 dimensions (OpenAI text-embedding-3-small), each vector takes 6 KB. A million vectors = 6 GB just for embeddings.

TurboQuant compresses these vectors to ~1 KB each (3-bit quantization), cutting memory by 6x while preserving search accuracy. Unlike Product Quantization (PQ) or IVFPQ, TurboQuant requires no codebook training - it works out of the box on any embedding.

| Feature | langchain-turboquant | FAISS (PQ) | Chroma |
|---|---|---|---|
| Compression ratio | ~6x (3-bit) | ~4x (8-bit PQ) | 1x (none) |
| Training required | No | Yes (codebook) | N/A |
| Drop-in LangChain | Yes | Partial | Yes |
| GPU required | No | Optional | No |
| Asymmetric search | Yes | Yes | N/A |

## How It Works

TurboQuant implements the two-stage compression algorithm from Google Research (ICLR 2026):

### Stage 1: PolarQuant (MSE-optimal scalar quantization)

  1. Random orthogonal rotation: Multiply the vector by a random orthogonal matrix. This "isotropizes" the coordinates so each one follows the same distribution (the hypersphere marginal).
  2. Lloyd-Max quantization: Quantize each rotated coordinate independently using a pre-computed optimal codebook for the hypersphere marginal PDF.

The codebook is computed analytically from the distribution - no training data needed.
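As a rough illustration of these two steps, here is a NumPy sketch with a toy uniform codebook (the library's actual codebook is the analytically computed Lloyd-Max optimum for the hypersphere marginal; every name here is illustrative, not the package API):

```python
import numpy as np

rng = np.random.default_rng(42)
d, bits = 16, 3

# Step 1: random orthogonal rotation (QR of a Gaussian matrix is one standard way)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)   # unit vector; coordinates of Q @ x share one marginal
rotated = Q @ x

# Step 2: quantize each coordinate against a single shared codebook.
# Toy uniform codebook here; TurboQuant uses the Lloyd-Max optimum computed
# analytically from the hypersphere marginal PDF (no training data).
levels = 2 ** bits
codebook = np.linspace(-2.5 / np.sqrt(d), 2.5 / np.sqrt(d), levels)

codes = np.abs(rotated[:, None] - codebook[None, :]).argmin(axis=1)  # 3 bits each
reconstruction = Q.T @ codebook[codes]  # rotate back to compare against x
```

Only `codes` (plus per-vector scalars) would be stored; the rotation matrix is regenerated from the seed.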

### Stage 2: QJL (Quantized Johnson-Lindenstrauss residual correction)

  1. Compute the quantization residual (difference between original and Stage 1 reconstruction).
  2. Project the residual through a random Gaussian matrix.
  3. Store only the sign bits (1 bit per dimension) of the projection.

At query time, an asymmetric estimator computes approximate inner products directly on compressed data - the query stays in full precision while stored vectors remain compressed.
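A toy sketch of the residual path and the asymmetric estimate (assumptions: `recon` stands in for a Stage 1 reconstruction, and the `sqrt(pi/2)` factor is the standard rescaling for a 1-bit JL sign estimator; this is not the package's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = m = 128

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
recon = np.round(x * 8) / 8        # stand-in for the Stage 1 reconstruction
residual = x - recon

# Project the residual through a random Gaussian matrix; keep only the signs
S = rng.standard_normal((m, d))
sign_bits = np.sign(S @ residual)   # 1 bit per projected dimension

# Asymmetric estimate: the query stays in full precision
q = rng.standard_normal(d)
q /= np.linalg.norm(q)
gamma = np.linalg.norm(residual)    # residual norm, stored per vector
residual_term = gamma * np.sqrt(np.pi / 2) / m * (sign_bits @ (S @ q))

approx = q @ recon + residual_term  # estimate of <q, x> from compressed data
exact = q @ x
```

Only `sign_bits` and `gamma` are kept per vector; `S` is regenerated from the seed at query time.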

### Compression Math

For dimension d with b-bit quantization and QJL dimension m:

```
Compressed bits per vector = d*b + m*1 + 32 + 32 = d*(b+1) + 64   (with m = d, the default)
Original bits per vector   = d * 32
Compression ratio          = 32d / (d*(b+1) + 64)
```

At d=1536, b=3: ratio = 7.7x (theoretical) / ~6x (practical with uint8 storage)

## Installation

```bash
pip install langchain-turboquant
```

Or install from source:

```bash
git clone https://github.com/wjddusrb03/langchain-turboquant.git
cd langchain-turboquant
pip install -e ".[dev]"
```

### Dependencies

  • Python >= 3.9
  • NumPy >= 1.21
  • SciPy >= 1.7
  • LangChain Core >= 0.3

## Quick Start

```python
from langchain_turboquant import TurboQuantVectorStore
from langchain_openai import OpenAIEmbeddings

# Create a compressed vector store (3-bit = ~6x compression)
store = TurboQuantVectorStore(embedding=OpenAIEmbeddings(), bits=3)

# Add documents - just like any LangChain vector store
store.add_texts(
    [
        "TurboQuant compresses vectors by 6x",
        "LangChain is a framework for LLM applications",
        "RAG combines retrieval with generation",
    ],
    metadatas=[{"topic": "compression"}, {"topic": "framework"}, {"topic": "rag"}],
)

# Search
results = store.similarity_search("How does compression work?", k=2)
for doc in results:
    print(doc.page_content)

# Check memory savings
print(store.memory_stats())
# {'num_documents': 3, 'dimension': 1536, 'bits': 3,
#  'compression_ratio': '7.7x', 'memory_saved_pct': '87.0%'}
```

## Use as a LangChain Retriever

```python
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

retriever = store.as_retriever(search_kwargs={"k": 3})

# Use in a RAG chain (`store` and `prompt` are defined elsewhere,
# e.g. the store from Quick Start and any ChatPromptTemplate)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI()
)
```

## No API Key Demo

Run the included demo with fake embeddings (no API key needed):

```bash
python examples/rag_demo.py
```

## API Reference

### TurboQuantVectorStore

```python
TurboQuantVectorStore(
    embedding: Embeddings,   # Any LangChain embedding model
    bits: int = 3,           # Quantization bits (1-4, recommended: 3)
    qjl_dim: int = None,     # QJL dimensions (default: same as embedding dim)
    seed: int = 42,          # Random seed for reproducibility
)
```

Methods:

| Method | Description |
|---|---|
| `add_texts(texts, metadatas, ids)` | Embed, compress, and store texts |
| `similarity_search(query, k)` | Return top-k most similar documents |
| `similarity_search_with_score(query, k)` | Return top-k with cosine similarity scores |
| `similarity_search_by_vector(vector, k)` | Search by a pre-computed embedding vector |
| `from_texts(texts, embedding, ...)` | Class method to create and populate a store |
| `delete(ids)` | Delete documents by ID |
| `get_by_ids(ids)` | Retrieve documents by ID |
| `as_retriever(**kwargs)` | Convert to a LangChain Retriever |
| `save(path)` | Persist the store to disk |
| `load(path, embedding)` | Load a store from disk |
| `memory_stats()` | Get compression statistics |

### TurboQuantizer (Low-level API)

```python
from langchain_turboquant import TurboQuantizer

quantizer = TurboQuantizer(dim=1536, bits=3)

# Compress vectors
compressed = quantizer.quantize(vectors)  # (n, 1536) -> CompressedVectors

# Asymmetric search (query in full precision, database compressed)
scores = quantizer.cosine_scores(query_vector, compressed)

# Reconstruct (for evaluation)
reconstructed = quantizer.dequantize(compressed)
```

## Compression Ratios by Configuration

| Dimension | Bits | Theoretical Ratio | Memory Saved |
|---|---|---|---|
| 384 | 3 | 5.8x | 82.8% |
| 768 | 3 | 6.8x | 85.3% |
| 1536 | 3 | 7.3x | 86.3% |
| 3072 | 3 | 7.7x | 87.0% |
| 1536 | 2 | 9.5x | 89.5% |
| 1536 | 4 | 6.1x | 83.6% |

Higher dimensions benefit more from compression (the fixed 64-bit overhead for norms/gammas becomes negligible).

## Testing

The project includes 296 comprehensive tests covering:

  • Mathematical correctness (83 tests): Lloyd-Max codebook properties, rotation matrix orthogonality, MSE bounds, PDF integration, centroid conditions
  • Edge cases (35 tests): NaN/Inf vectors, empty arrays, Unicode text, dim=1/2/3, zero vectors, large batches
  • Search recall (44 tests): Top-k recall at various k/n/dim/bits, cluster discrimination, asymmetric estimator statistics, Pearson correlation
  • Persistence (29 tests): Save/load roundtrips, serialization formats, state consistency after add/delete cycles
  • Rigorous validation (68 tests): Compression ratios, performance benchmarks, score ordering, reconstruction quality
  • Core functionality (37 tests): VectorStore CRUD, quantizer operations, LangChain integration

```bash
# Run all tests
pytest tests/ -v

# Run specific test suite
pytest tests/test_math_stress.py -v       # Mathematical properties
pytest tests/test_recall_extensive.py -v  # Search recall
pytest tests/test_edge_cases.py -v        # Edge cases
```
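For flavor, a property test in the spirit of the rotation-orthogonality checks above can be written in a few lines (a standalone sketch, not the repo's actual test code):

```python
import numpy as np

def random_rotation(d: int, seed: int = 42) -> np.ndarray:
    # QR of a Gaussian matrix yields a random orthogonal matrix
    Q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return Q

def test_rotation_is_orthogonal():
    Q = random_rotation(64)
    assert np.allclose(Q @ Q.T, np.eye(64), atol=1e-8)

def test_rotation_preserves_norms():
    rng = np.random.default_rng(1)
    Q = random_rotation(64)
    x = rng.standard_normal(64)
    assert np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))
```

Orthogonality matters because it guarantees the rotation changes neither norms nor inner products, so quantization error is the only source of inaccuracy.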

## Architecture

```
langchain-turboquant/
├── src/langchain_turboquant/
│   ├── __init__.py              # Package exports
│   ├── lloyd_max.py             # Lloyd-Max optimal codebook computation
│   ├── quantizer.py             # TurboQuantizer (PolarQuant + QJL)
│   └── vectorstore.py           # LangChain VectorStore integration
├── tests/
│   ├── test_quantizer.py        # Core quantizer tests
│   ├── test_vectorstore.py      # VectorStore API tests
│   ├── test_rigorous.py         # Rigorous validation
│   ├── test_math_stress.py      # Mathematical properties
│   ├── test_edge_cases.py       # Edge cases
│   ├── test_recall_extensive.py # Search recall
│   └── test_persistence.py      # Persistence tests
├── examples/
│   └── rag_demo.py              # Working RAG demo (no API key needed)
├── pyproject.toml
├── LICENSE
└── README.md
```

## References

  • TurboQuant: Zandieh et al., "TurboQuant: Redefining Efficiency of KV Cache Compression for Large Language Models" (ICLR 2026). arXiv:2504.19874
  • PolarQuant: Zandieh et al., "PolarQuant: Achieving High-Fidelity Vector Quantization via Polar Coordinates" (AISTATS 2026). arXiv:2502.02617
  • QJL: Zandieh et al., "QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead" (AAAI 2025). arXiv:2406.03482
  • LangChain: langchain.com

## Contributing

Contributions are welcome! If you find a bug, have a feature request, or want to improve the code:

  1. Open an Issue describing the problem or idea
  2. Fork the repo and create a branch
  3. Write tests for your changes
  4. Submit a Pull Request

Please report any problems or suggestions in the Issues tab. All feedback is appreciated!

## License

MIT License - see LICENSE for details.
