Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025). A novel KV cache compression method that organizes cache at sentence level using semantic similarity.
This app leverages Semantic Caching to minimize inference latency and reduce API costs by reusing semantically similar prompt responses.
Semantic caching demo with real-time streaming and a cost & sizing calculator, powered by Azure Managed Redis and Azure OpenAI.
Rust Local Token Compression Proxy for coding agents, built solo for GenAI Genesis 2026. 🏆 1st Google Sustainability Hack
Evaluate how a semantic cache performs on your dataset by computing key KPIs over a threshold sweep and producing plots/CSVs.
LLM cost monitoring and optimization toolkit
LLMOps API Gateway in Go. Optimizes GenAI workloads with Qdrant semantic caching, Redis rate-limiting, and OpenTelemetry metrics.
Semantic caching for LLM responses using Redis Vector DB, LangChain, and HuggingFace embeddings; parses PDFs, generates FAQs with Groq, and serves similarity-based answers without redundant LLM calls.
Semantic LLM Gateway featuring intelligent prompt routing (basic MoE), L1/L2 semantic caching (Redis + pgvector), fault-tolerant model fallbacks, and real-time streaming telemetry. Built to reduce AI inference latency and optimize API compute costs.
Simple RAG implementation with semantic caching using Redis and Langchain
Intelligent LLM agent cost optimization runtime.
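The pattern these projects share (reuse a stored response when a new query's embedding is close enough to a previously cached one) can be sketched in plain Python. The `SemanticCache` class, the 0.9 similarity threshold, and the in-memory entry list below are illustrative assumptions, not any listed project's API; production gateways back this with a vector store such as Redis or pgvector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Toy semantic cache: return a stored response when a new
    query embedding is within `threshold` cosine similarity of a
    cached one; otherwise signal a miss so the caller invokes the LLM."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, embedding):
        best_resp, best_sim = None, -1.0
        for cached_emb, resp in self.entries:
            sim = cosine(embedding, cached_emb)
            if sim > best_sim:
                best_resp, best_sim = resp, sim
        return best_resp if best_sim >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

A near-duplicate query (an embedding pointing almost the same direction as a cached one) is served from the cache, while an unrelated query misses and would fall through to the model, which is how these gateways cut latency and API spend.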