Improve LLM Performance Using Semantic Cache with Cosmos DB

Explore the advantages of leveraging Azure Cosmos DB Semantic Cache to boost LLM retrieval performance with OpenAI, resulting in cost savings and enhanced response times. This article offers a detailed walkthrough for seamlessly integrating the semantic cache into your current web application, showcasing its effects on LLM retrieval time, API expenses, and scalability.

Using a semantic cache can improve query performance by as much as 80% and reduce costs by limiting the number of API calls. This article explores Azure Cosmos DB Semantic Cache and its impact on LLM retrieval performance with OpenAI. To demonstrate these benefits, we will integrate the semantic cache into an existing web application, developed as part of the article series LangChain RAG with React, FastAPI, Cosmos DB Vectors. This implementation highlights how semantic caching mechanisms fit into real-world scenarios, providing valuable insight into optimizing data retrieval and processing.

Through a step-by-step approach, we will explore how the Semantic Cache can be seamlessly integrated into existing infrastructure, bolstering the application’s performance and responsiveness.

Why Semantic Cache

Semantic caching improves the performance of Large Language Models (LLMs) by storing previously computed representations of text segments along with their semantic meanings. When the model later encounters a similar text segment, it can retrieve the representation from the cache instead of recomputing it. This approach reduces the computational load and accelerates inference, particularly for tasks such as natural language understanding and generation. Semantic caching optimizes data retrieval, reduces expenses, and improves scalability for LLM-based applications.

Diagram of Azure Cosmos DB for MongoDB Semantic Cache to improve LLM Q&A Response
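The idea can be illustrated with a few lines of Python. This is a minimal, illustrative sketch of the lookup step only (not the Azure Cosmos DB implementation used later in this article): the cache stores prompt embeddings alongside their responses, and a new prompt is answered from the cache when its embedding is close enough to a stored one.

import numpy as np

def semantic_lookup(prompt_embedding, cache, threshold=0.99):
    """Return a cached response whose prompt is semantically close enough, else None."""
    for cached_embedding, cached_response in cache:
        # Cosine similarity between the new prompt and a previously cached prompt
        similarity = np.dot(prompt_embedding, cached_embedding) / (
            np.linalg.norm(prompt_embedding) * np.linalg.norm(cached_embedding)
        )
        if similarity >= threshold:
            return cached_response  # cache hit: no LLM call needed
    return None  # cache miss: call the LLM, then add (embedding, response) to the cache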

Key features:

  • Reduced API Costs
  • Faster Response Times
  • Scalability

By leveraging the semantic cache capabilities in Azure Cosmos DB, I was able to decrease the LLM retrieval time by 84% for the project highlighted in the article series ‘LangChain RAG with React, FastAPI, Cosmos DB Vectors’, which will be the focus of our modifications and work in this article.

LLM Retrieval with Cache Disabled

With the semantic cache disabled (or prior to the caching of the prompt), the response time for LLM retrieval is 2.7 seconds.

Research Helper LLM RAG response without using semantic caching

LLM Retrieval with Cache Enabled

When the semantic cache is enabled and the prompt is available in the cache, the LLM retrieval response time drops to 0.43 seconds.

Research Helper LLM RAG response using semantic caching

Download the Project

For this project, the code and sample datasets are available to you on my GitHub.


Download the project LLM-Performance-with-Azure-Cosmos-DB-Semantic-Cache from my GitHub repository.

Load Documents into Azure Cosmos DB Vector Store

The demo_loader directory from the GitHub project contains the code to load the sample documents into Azure Cosmos DB for MongoDB vCore.

Setting Up the Environment for Loader

In this section, we’ll use Python and LangChain to load the document vectors into Azure Cosmos DB for MongoDB vCore.

First, set up your Python virtual environment in the demo_loader directory.

python -m venv venv

Activate your environment and install dependencies in the demo_loader directory:

venv\Scripts\activate
python -m pip install -r requirements.txt

Create a file named ‘.env’ in the demo_loader directory to store the following environment variables.

OPENAI_API_KEY="**Your Open AI Key**"
MONGO_CONNECTION_STRING="mongodb+srv:**your connection string from Azure Cosmos DB**"
AZURE_STORAGE_CONNECTION_STRING="**"
  • OPENAI_API_KEY: The key to connect to the OpenAI API. If you do not have an OpenAI API key, you can obtain one by following the guidelines outlined here.
  • MONGO_CONNECTION_STRING: The connection string for Azure Cosmos DB for MongoDB vCore, demonstrated here.
  • AZURE_STORAGE_CONNECTION_STRING: The connection string for the Azure Storage Account (see here).

Using LangChain Loader to load Cosmos DB Vector Store

The Python script vectorstoreloader.py serves as the primary entry point for loading data into the Cosmos DB vector store and the Azure Storage Account. The provided code snippet loads the document data into Azure Cosmos DB, along with the corresponding images into an Azure Blob Storage container. This process involves the systematic handling of each document within the provided list of file names.

vectorstoreloader.py

import json
import base64

# CosmosDBLoader and BlobLoader are the loader classes from the demo_loader project
from cosmosdbloader import CosmosDBLoader  # module name assumed; see the repository
from blobloader import BlobLoader

file_names = ['documents/Rocket_Propulsion_Elements_with_images.json',
              'documents/Introduction_To_Rocket_Science_And_Engineering_with_images.json']

for file_name in file_names:

    # Load the document data (text and embeddings) into the Cosmos DB vector store
    CosmosDBLoader(f"{file_name}").load()

    image_loader = BlobLoader()

    with open(file_name) as file:
        data = json.load(file)

    resource_id = data['resource_id']
    for page in data['pages']:

        base64_string = page['image'].replace("b'", "").replace("'", "")

        # Decode the Base64 string into bytes
        decoded_bytes = base64.b64decode(base64_string)

        # Upload the page image to the 'images' container as <resource_id>/<page_id>.png
        image_loader.load_binay_data(decoded_bytes, f"{resource_id}/{page['page_id']}.png", "images")

VectorstoreLoader Breakdown

  1. file_names contains the two sample JSON files, each representing documents with associated images, roughly 160 document pages.
  2. It iterates through each file name in the file_names list.
  3. For each file name, it:
    • Loads the document data from the JSON file into a CosmosDB using CosmosDBLoader – this process is covered in the article: LangChain Vector Search with Cosmos DB for MongoDB.
    • Initializes an image_loader object of BlobLoader class.
    • Opens the JSON file and loads its data into a Python dictionary named data.
    • Extracts the resource_id from the document data.
    • Iterates through the pages in the document data.
    • Extracts the base64-encoded image data from each page and decodes it into bytes using base64.b64decode.
    • Loads the decoded image bytes into the image_loader object, specifying the image’s destination path and container type.

Loading Azure Storage Account with BlobLoader

The BlobLoader is straightforward: it stores the base64-decoded images from the JSON documents as bytes in the Azure Storage Account container (‘images’) that is passed in to the ‘load_binary_data’ function. The following code defines the BlobLoader class, which uploads binary data to an Azure Blob Storage container using the Azure Storage connection string retrieved from the environment variables.

The code in the GitHub repository currently uses the Azure Storage Account container named images, as shown in vectorstoreloader.py: image_loader.load_binay_data(decoded_bytes, f"{resource_id}/{page['page_id']}.png", "images"). To proceed, create an images container or adjust this code to match your pre-existing container.
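If the images container does not exist yet, one way to create it is with a short Python snippet using the same connection string from the .env file. This is a minimal sketch, not part of the repository code.

from os import environ
from dotenv import load_dotenv
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

load_dotenv(override=True)

client = BlobServiceClient.from_connection_string(environ.get("AZURE_STORAGE_CONNECTION_STRING"))
try:
    client.create_container("images")  # container name expected by vectorstoreloader.py
except ResourceExistsError:
    pass  # the container already exists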

blobloader.py

from os import environ
from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient

load_dotenv(override=True)


class BlobLoader():

    def __init__(self):
        connection_string = environ.get("AZURE_STORAGE_CONNECTION_STRING")

        # Create the BlobServiceClient object
        self.blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    def load_binay_data(self, data, blob_name: str, container_name: str):

        blob_client = self.blob_service_client.get_blob_client(container=container_name, blob=blob_name)

        # Upload the blob data - default blob type is BlockBlob
        blob_client.upload_blob(data, overwrite=True)

BlobLoader Breakdown

  1. load_dotenv is used to load environment variables from the .env file (described above) into the script’s environment.
  2. The class BlobLoader serves as a wrapper for uploading binary data to an Azure Blob Storage container.
  3. Inside the __init__ method of the BlobLoader class:
    • It retrieves the Azure Storage connection string from the environment variables using environ.get("AZURE_STORAGE_CONNECTION_STRING").
    • It initializes a BlobServiceClient object using the retrieved connection string.
  4. Inside the load_binary_data method of the BlobLoader class:
    • It takes binary data, a blob_name, and a container_name as input parameters.
    • It obtains a blob_client object for the specified blob and container using the get_blob_client method of the BlobServiceClient.
    • It uploads the binary data to the specified blob in the Azure Blob Storage container using the upload_blob method of the blob_client. The overwrite=True parameter indicates that if a blob with the same name already exists, it should be overwritten.
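
For reference, a minimal usage sketch of the class follows; the blob name and payload are hypothetical.

from blobloader import BlobLoader

loader = BlobLoader()

# Upload a small test payload to the 'images' container
# (method name spelled exactly as defined in blobloader.py)
loader.load_binay_data(b"hello blob", "test/sample.png", "images")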

Load Documents into Cosmos DB Vector Store and Images into Storage Account

Now that we have reviewed the code, let’s load the documents by simply executing the following command from the demo_loader directory.

python vectorstoreloader.py

Verify the successful loading of documents using MongoDB Compass (or a similar tool).

After LangChain Python code execution, loaded document results in MongoDB Compass with vector content
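
You can also spot-check the load from code. Here is a quick sketch with pymongo, assuming the database and collection names the API uses later (research and resources):

from os import environ
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv(override=True)

client = MongoClient(environ.get("MONGO_CONNECTION_STRING"))
collection = client["research"]["resources"]

# Count the loaded document pages and peek at one to confirm it was written
print("documents loaded:", collection.count_documents({}))
print(collection.find_one({}, {"_id": 1}))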

Please verify the successful upload of the ‘png’ images to your Azure Storage Account’s ‘images’ container.

Azure Storage Account images container - resource_id folders

Open one of the resource_id folders to display a list of the page images.

Azure Storage Account images container, page images

To verify the bytes were decoded correctly, sample one or two images.

Azure Storage Account displaying selected image
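
You can also list the uploaded blobs programmatically; a minimal sketch using the same storage connection string:

from os import environ
from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient

load_dotenv(override=True)

client = BlobServiceClient.from_connection_string(environ.get("AZURE_STORAGE_CONNECTION_STRING"))
container = client.get_container_client("images")

# Each blob name is <resource_id>/<page_id>.png, as written by the loader
for blob in container.list_blobs():
    print(blob.name)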

Implementing Azure Cosmos DB Semantic Cache

We will incorporate the semantic caching logic into the Python FastAPI code initially created in the article: LangChain RAG with React, FastAPI, Cosmos DB Vector: Part 2. This article will solely address the code modifications necessary for implementing semantic caching. For further information on the development of the API, please refer to the original article.

Setting Up the Environment for the API

Set up your Python virtual environment in the demo_api directory.

python -m venv venv

Activate your environment and install dependencies using the requirements file in the demo_api directory:

venv\Scripts\activate
python -m pip install -r requirements.txt

Create a file named ‘.env’ in the demo_api directory to store your environment variables.

OPENAI_API_KEY="**Your Open AI Key**"
MONGO_CONNECTION_STRING="mongodb+srv:**your connection string from Azure Cosmos DB**"
AZURE_STORAGE_CONNECTION_STRING="**"
AZURE_STORAGE_CONTAINER="images"
  • OPENAI_API_KEY: The key to connect to the OpenAI API. If you do not have an OpenAI API key, you can obtain one by following the guidelines outlined here.
  • MONGO_CONNECTION_STRING: The connection string for Azure Cosmos DB for MongoDB vCore (see here).
  • AZURE_STORAGE_CONNECTION_STRING: The connection string for the Azure Storage Account (see here).
  • AZURE_STORAGE_CONTAINER: The container name used above; defaults to ‘images’.

With the environment configured and variables set up, we are ready to start the FastAPI server. Run the following command from the demo_api directory.

python main.py

The FastAPI server listens on the loopback address 127.0.0.1, port 8000, by default. You can access the Swagger documentation at: http://127.0.0.1:8000/docs

Further information on conducting vector search testing with the API is available here.

Walkthrough of Code for Semantic Cache

This article focuses on the components that have been modified for semantic caching, primarily within the model and service layers. For a comprehensive walkthrough of the code, please refer to the FastAPI article here.

Model Layer

The Pydantic models serve as structured containers for the application’s data. The data within these models is transmitted to the web layer for delivery to the API requestor.

model/airesults.py

from pydantic import BaseModel
from typing import List, Optional, Union
from model.resource import Resource

class AIResults(BaseModel):
    text: str
    ResourceCollection: list[Resource]
    ResponseSeconds: float

The ResponseSeconds field has been added to the model to report the LLM retrieval time in seconds.
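
For illustration, here is a hypothetical instance of the model as the web layer might return it; the values are made up.

from model.airesults import AIResults

# Illustrative payload returned to the API caller
answer = AIResults(
    text="Rocket propulsion systems generate thrust by expelling mass at high velocity...",
    ResourceCollection=[],
    ResponseSeconds=0.43,
)
print(answer)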

Service Layer

The service layer houses the core business logic for this use case, acting as the home for the LangChain code and the semantic caching setup.

service/init.py

from dotenv import load_dotenv
from langchain.globals import set_llm_cache
from langchain_openai import ChatOpenAI
from data.mongodb.init import semantic_cache


load_dotenv(override=True)


llm: ChatOpenAI | None = None

def LLM_init():
    global llm
    llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)
    set_llm_cache(semantic_cache)  # comment this line to turn off the cache

LLM_init()

A singleton pattern is employed to initialize the connection to the Azure Cosmos DB Semantic Cache when the service launches. The global variable llm is configured to use the AzureCosmosDBSemanticCache, which is established as the ChatOpenAI cache by calling set_llm_cache and passing in the semantic_cache from data.mongodb.init.

The existing service code does not require any modifications to utilize the cache; it can simply make use of the global ChatOpenAI variable llm in the following manner.

from .init import llm

llm_chain = LLMChain(llm=llm, prompt=prompt)
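
The way ResponseSeconds is populated is not reproduced here, but a hedged sketch of a service function that times the chain call might look like the following; the prompt template and question are hypothetical.

import time
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from .init import llm  # the globally cached ChatOpenAI instance

prompt = PromptTemplate.from_template("Answer the question: {question}")
llm_chain = LLMChain(llm=llm, prompt=prompt)

start = time.perf_counter()
response = llm_chain.invoke({"question": "What is specific impulse?"})
elapsed = round(time.perf_counter() - start, 2)  # surfaced as ResponseSeconds

print(response["text"], elapsed)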

Data Layer

The data layer establishes connectivity to Azure Cosmos DB and executes vector search operations. Furthermore, the code employs a singleton pattern, with the init.py file being the sole file that was modified to support semantic caching.

data/init.py

from os import environ
from dotenv import load_dotenv
from pymongo import MongoClient
from pymongo.collection import Collection
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.azure_cosmos_db import AzureCosmosDBVectorSearch
from langchain_community.cache import AzureCosmosDBSemanticCache
from langchain_community.vectorstores.azure_cosmos_db import (
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType,
)


load_dotenv(override=True)


collection: Collection | None = None
vector_store: AzureCosmosDBVectorSearch | None = None
semantic_cache: AzureCosmosDBSemanticCache | None = None

def mongodb_init():
    MONGO_CONNECTION_STRING = environ.get("MONGO_CONNECTION_STRING")
    DB_NAME = "research"
    COLLECTION_NAME = "resources"
    INDEX_NAME = "vectorSearchIndex"

    global collection, vector_store, semantic_cache
    client = MongoClient(MONGO_CONNECTION_STRING)
    db = client[DB_NAME]
    collection = db[COLLECTION_NAME]
    vector_store = AzureCosmosDBVectorSearch.from_connection_string(
        MONGO_CONNECTION_STRING,
        DB_NAME + "." + COLLECTION_NAME,
        OpenAIEmbeddings(disallowed_special=()),
        index_name=INDEX_NAME)

    semantic_cache = AzureCosmosDBSemanticCache(
        cosmosdb_connection_string=MONGO_CONNECTION_STRING,
        cosmosdb_client=None,
        embedding=OpenAIEmbeddings(),
        database_name=DB_NAME,
        collection_name=DB_NAME + '_CACHE',
        num_lists=1,  # for a small demo, start with num_lists=1 to perform a brute-force search across all vectors
        similarity=CosmosDBSimilarityType.COS,
        kind=CosmosDBVectorSearchType.VECTOR_IVF,
        dimensions=1536,
        m=16,
        ef_construction=64,
        ef_search=40,
        score_threshold=.99)


mongodb_init()

The modification to the code is the addition of AzureCosmosDBSemanticCache. The parameters passed in correspond to the values used in demo_loader and those specified in the LangChain docs for Azure Cosmos DB Semantic Cache. In testing, setting score_threshold=.99 produced the intended behavior, unlike the value recommended in the documentation. Setting the collection name to DB_NAME + '_CACHE' results in the creation of a new Cosmos DB collection for the cache store named ‘research_CACHE’.
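
To see the cache in action outside the API, here is a small hedged sketch: ask the same (or a semantically similar) question twice and compare timings. The second call should be served from the research_CACHE collection instead of OpenAI.

import time
from langchain.globals import set_llm_cache
from langchain_openai import ChatOpenAI
from data.mongodb.init import semantic_cache  # module path as imported in service/init.py

set_llm_cache(semantic_cache)
llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

for attempt in (1, 2):
    start = time.perf_counter()
    llm.invoke("What are the main components of a liquid rocket engine?")
    print(f"attempt {attempt}: {time.perf_counter() - start:.2f}s")
# Expect the first attempt to take a few seconds and the second to return in a fraction of a second.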

React Web User Interface

In this concluding section, we will establish the connection between the React JS web user interface and our updated Python FastAPI endpoint in order to utilize the semantic caching capabilities of Azure Cosmos DB.

Install Node.js

You can follow the steps outlined here to download and install Node.js.

Set-up React Web User Interface

With Node.js installed, install the project dependencies before testing out the React interface.

Run the following command from the demo_web directory to perform a clean install of the project dependencies; this may take some time.

npm ci
added 1599 packages, and audited 1600 packages in 7m

Next, create a file named ‘.env’ in the demo_web directory to store environment variables, and add the following details to the newly created ‘.env’ file.

REACT_APP_API_HOST=http://127.0.0.1:8000
  • REACT_APP_API_HOST: The URL of our FastAPI server. Defaults to our local machine: http://127.0.0.1:8000

Now we can run the following command from the demo_web directory to start the React web user interface.

npm start

Walkthrough of the React Project

The React project requires only one modification: displaying the LLM response time on the search page. A comprehensive walkthrough of the entire React project can be found at this link.

The Search component serves as the primary interface for interacting with the FastAPI RAG Q&A endpoints and associated React functions.

Search/Search.js

{results !== '' && (
  <Stack direction="column" spacing={2}>
    <SearchAnswer results={results} />
  </Stack>
)}

The Search component has been updated to pass the full JSON results to SearchAnswer, rather than only the answer text. This adjustment allows the SearchAnswer component to handle parsing the response time.

Search Answer

The SearchAnswer component presents the response (answer text) provided by the LangChain RAG FastAPI endpoint and the response time in seconds for the LLM retrieval process (or cache).

Search/SearchAnswer.js

import React from 'react'

import { Box, Paper, Stack, Typography } from '@mui/material'

export default function SearchAnswer(results) {
  return (
    <Paper sx={{ p: 2 }}>
      <Stack direction="column" spacing={2} useFlexGap flexWrap="wrap">
        <Typography
          variant="subtitle1"
          sx={{ color: 'grey', fontSize: '12pt' }}
        >
          Answer:
        </Typography>
        <Box
          sx={{
            border: 1,
            borderColor: 'lightgray',
            borderRadius: 3,
            p: 1,
            fontSize: 14,
          }}
        >
          {results.results.text}
        </Box>

        <Stack>
          <Typography
            variant="caption"
            sx={{ color: 'grey', fontSize: '10pt' }}
          >
            Response Time (seconds)
          </Typography>
          <Typography
            variant="caption"
            sx={{ color: 'grey', fontSize: '10pt' }}
          >
            {results.results.ResponseSeconds}
          </Typography>
        </Stack>
      </Stack>
    </Paper>
  )
}

The code remains largely unchanged; the only modifications are changing the parameter to ‘results’ and displaying ResponseSeconds beneath the answer text.

Research Helper React JS web user interface showing LLM answer and response time.

Congratulations on successfully loading data into Azure Cosmos DB for MongoDB for vector search. Additionally, you have implemented semantic caching with LangChain and Azure Cosmos DB for MongoDB in FastAPI, and connected a web user interface using React JS. Hopefully this article demonstrated how easy it is to add semantic caching to an existing application. If you would like to build on this project and deliver scalable and secure AI solutions, check out: The Perfect AI Team: Azure Cosmos DB and Azure App Service
