Using a semantic cache can improve query performance by more than 80% and reduce costs by limiting the number of API calls. This article explores Azure Cosmos DB Semantic Cache and its impact on LLM retrieval performance with OpenAI. To demonstrate these benefits, we will integrate the semantic cache into an existing web application, developed as part of the article series LangChain RAG with React, FastAPI, Cosmos DB Vectors. This implementation highlights how semantic caching mechanisms fit into real-world scenarios, providing practical insights into optimizing data retrieval and processing.
Through a step-by-step approach, we will explore how the semantic cache can be integrated into existing infrastructure, bolstering the application's performance and responsiveness.
In this article
- Prerequisites
- Why Semantic Cache
- Download the Project
- Load Documents into Azure Cosmos DB Vector Store
- Implementing Azure Cosmos DB Semantic Cache
- React Web User Interface
Prerequisites
- If you don’t have an Azure subscription, create an Azure free account before you begin.
- Set up an account for the OpenAI API – Overview – OpenAI API
- Create an Azure Cosmos DB for MongoDB vCore by following this QuickStart.
- An IDE for development, such as VS Code
- Python 3.11.4 installed on your development environment.
Why Semantic Cache
Semantic caching improves the performance of Large Language Model (LLM) applications by storing previously computed responses along with the semantic representations of their prompts. When the model later encounters a semantically similar prompt, it can retrieve the cached result instead of recomputing it. This approach reduces the computational load and accelerates inference, particularly for tasks such as natural language understanding and generation. Semantic caching optimizes data retrieval, reduces expenses, and improves scalability for LLM-based applications.
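To make the mechanism concrete, here is a minimal, self-contained sketch of the lookup logic. The bag-of-words embedding and the ToySemanticCache class are illustrative stand-ins, not this project's code; Azure Cosmos DB Semantic Cache applies the same idea with real embeddings and vector search:

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model (e.g., OpenAI embeddings)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToySemanticCache:
    def __init__(self, score_threshold: float = 0.7):
        self.entries = []  # (prompt_embedding, cached_response) pairs
        self.score_threshold = score_threshold

    def lookup(self, prompt: str):
        query = toy_embed(prompt)
        for embedding, response in self.entries:
            if cosine(query, embedding) >= self.score_threshold:
                return response  # cache hit: the LLM call is skipped entirely
        return None

    def update(self, prompt: str, response: str):
        self.entries.append((toy_embed(prompt), response))

cache = ToySemanticCache()
cache.update("what is the thrust of a rocket engine", "Thrust is the force ...")
print(cache.lookup("what is a rocket engine's thrust"))  # similar wording -> cache hit
```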

Key features:
- Reduced API Costs
- Faster Response Times
- Scalability
By leveraging the semantic cache capabilities in Azure Cosmos DB, I was able to decrease the LLM retrieval time by 84% for the project highlighted in the article series ‘LangChain RAG with React, FastAPI, Cosmos DB Vectors’, which will be the focus of our modifications and work in this article.
LLM Retrieval with Cache Disabled
With the semantic cache disabled (or prior to the caching of the prompt), the response time for LLM retrieval is 2.7 seconds.

LLM Retrieval with Cache Enabled
When the semantic cache is enabled and the prompt is available in the cache, the LLM retrieval response time drops to 0.43 seconds.

Download the Project
For this project, the code and sample datasets are available to you on my GitHub.

Download the project LLM-Performance-with-Azure-Cosmos-DB-Semantic-Cache from my GitHub repository.
Load Documents into Azure Cosmos DB Vector Store
The demo_loader directory from the GitHub project contains the code to load the sample documents into Azure Cosmos DB for MongoDB vCore.
Setting Up the Environment for Loader
In this section, we'll use Python and LangChain to load document vectors into Azure Cosmos DB for MongoDB vCore.
First, set up your Python virtual environment in the demo_loader directory.

```
python -m venv venv
```

Activate your environment and install dependencies in the demo_loader directory:

```
venv\Scripts\activate
python -m pip install -r requirements.txt
```

Create a file named '.env' in the demo_loader directory to store the following environment variables.
OPENAI_API_KEY="**Your Open AI Key**"
MONGO_CONNECTION_STRING="mongodb+srv:**your connection string from Azure Cosmos DB**"
AZURE_STORAGE_CONNECTION_STRING="**" | Environment Variable | Description |
|---|---|
| OPENAI_API_KEY | The key to connect to OpenAI API. If you do not possess an API key for Open AI, you can proceed by following the guidelines outlined here. |
| MONGO_CONNECTION_STRING | The Connection string for Azure Cosmos DB for MongoDB vCore – demonstrated here. |
| AZURE_STORAGE_CONNECTION_STRING | The Connection string for Azure Storage Account (see here) |
Using the LangChain Loader to Load the Cosmos DB Vector Store
The Python script vectorstoreloader.py serves as the primary entry point for loading data into the Cosmos DB vector store and the Azure Storage Account. The code below loads the document data into Cosmos DB, along with the corresponding images into an Azure Blob Storage container, systematically handling each document in the provided list of file names.
vectorstoreloader.py

```python
import base64
import json

# Loaders from this project (import paths may differ slightly in the repository)
from cosmosdbloader import CosmosDBLoader
from blobloader import BlobLoader

file_names = ['documents/Rocket_Propulsion_Elements_with_images.json',
              'documents/Introduction_To_Rocket_Science_And_Engineering_with_images.json']

for file_name in file_names:
    # Load the document text and embeddings into the Cosmos DB vector store
    CosmosDBLoader(f"{file_name}").load()

    # Upload each page's image to the Azure Storage Account
    image_loader = BlobLoader()
    with open(file_name) as file:
        data = json.load(file)
        resource_id = data['resource_id']
        for page in data['pages']:
            base64_string = page['image'].replace("b'", "").replace("'", "")
            # Decode the Base64 string into bytes
            decoded_bytes = base64.b64decode(base64_string)
            image_loader.load_binary_data(decoded_bytes, f"{resource_id}/{page['page_id']}.png", "images")
```

VectorstoreLoader Breakdown
- `file_names` contains the two sample JSON files, each representing documents with associated images, roughly 160 document pages.
- It iterates through each file name in the `file_names` list.
- For each file name, it:
  - Loads the document data from the JSON file into Cosmos DB using `CosmosDBLoader` – this process is covered in the article: LangChain Vector Search with Cosmos DB for MongoDB.
  - Initializes an `image_loader` object of the `BlobLoader` class.
  - Opens the JSON file and loads its data into a Python dictionary named `data`.
  - Extracts the `resource_id` from the document data.
  - Iterates through the pages in the document data.
  - Extracts the base64-encoded image data from each page and decodes it into bytes using `base64.b64decode`.
  - Loads the decoded image bytes into the `image_loader` object, specifying the image's destination path and container.
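Based on the fields referenced above (`resource_id`, `pages`, `page_id`, `image`), each sample JSON document is shaped roughly like the following. The values here are placeholders, and the real files may contain additional fields:

```json
{
  "resource_id": "example-resource-id",
  "pages": [
    {
      "page_id": "1",
      "image": "b'iVBORw0KGgoAAAANSUhEUgAA...'"
    }
  ]
}
```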
Loading Azure Storage Account with BlobLoader
The BlobLoader is straightforward: it stores the base64 images from the JSON documents, converted to bytes, in the Azure Storage Account container ('images') passed in to the 'load_binary_data' function. The following code initializes a BlobLoader class to upload binary data to an Azure Blob Storage container using the Azure Storage connection string retrieved from environment variables.
The code in the GitHub repository currently uses the Azure Storage Account container images, as seen in vectorstoreloader.py: image_loader.load_binary_data(decoded_bytes, f"{resource_id}/{page['page_id']}.png", "images"). To proceed, you need to create an images container or adjust this code to align with your pre-existing container.
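If you need to create the images container, one option is the Azure CLI (assuming it is installed, and using the same connection string stored in your .env file):

```
az storage container create --name images --connection-string "<your AZURE_STORAGE_CONNECTION_STRING>"
```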
blobloader.py

```python
from os import environ

from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient

load_dotenv(override=True)

class BlobLoader():
    def __init__(self):
        connection_string = environ.get("AZURE_STORAGE_CONNECTION_STRING")
        # Create the BlobServiceClient object
        self.blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    def load_binary_data(self, data, blob_name: str, container_name: str):
        blob_client = self.blob_service_client.get_blob_client(container=container_name, blob=blob_name)
        # Upload the blob data - default blob type is BlockBlob
        blob_client.upload_blob(data, overwrite=True)
```

BlobLoader Breakdown
- `load_dotenv` is used to load environment variables from the `.env` file (described above) into the script's environment.
- The `BlobLoader` class serves as a wrapper for uploading binary data to an Azure Blob Storage container.
- Inside the `__init__` method of the `BlobLoader` class:
  - It retrieves the Azure Storage connection string from the environment variables using `environ.get("AZURE_STORAGE_CONNECTION_STRING")`.
  - It initializes a `BlobServiceClient` object using the retrieved connection string.
- Inside the `load_binary_data` method of the `BlobLoader` class:
  - It takes binary `data`, a `blob_name`, and a `container_name` as input parameters.
  - It obtains a `blob_client` object for the specified blob and container using the `get_blob_client` method of the `BlobServiceClient`.
  - It uploads the binary `data` to the specified blob in the Azure Blob Storage container using the `upload_blob` method of the `blob_client`. The `overwrite=True` parameter indicates that if a blob with the same name already exists, it should be overwritten.
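As a quick usage sketch (assuming the .env described above and an existing images container), the class can be exercised on its own; the blob name below is a made-up example following the loader's `<resource_id>/<page_id>.png` convention:

```python
from blobloader import BlobLoader

# Upload a small test payload to the 'images' container
BlobLoader().load_binary_data(b"test bytes", "test-resource/test-page.png", "images")
```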
Load Documents into Cosmos DB Vector Store and Images into Storage Account
Now that we have reviewed the code, let’s load the documents by simply executing the following command from the demo_loader directory.
```
python vectorstoreloader.py
```

Verify the successful loading of documents using MongoDB Compass (or a similar tool).
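If you would rather verify from code than from Compass, a minimal sketch using pymongo (with the research database and resources collection names used later in this article) might look like:

```python
from os import environ

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv(override=True)

client = MongoClient(environ.get("MONGO_CONNECTION_STRING"))
collection = client["research"]["resources"]

# Roughly 160 document pages should have been loaded from the two sample files
print("documents loaded:", collection.count_documents({}))
```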

Please verify the successful upload of the 'png' images to your Azure Storage Account's 'images' container.

Access the directory to display a list of the images.

To verify the bytes were decoded correctly, sample one or two images.
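To list the uploaded blobs programmatically rather than through the portal, a short sketch using the same connection string is:

```python
from os import environ

from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient

load_dotenv(override=True)

service = BlobServiceClient.from_connection_string(environ.get("AZURE_STORAGE_CONNECTION_STRING"))

# Blob names follow the loader's "<resource_id>/<page_id>.png" convention
for blob in service.get_container_client("images").list_blobs():
    print(blob.name, blob.size)
```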

Implementing Azure Cosmos DB Semantic Cache
We will incorporate the semantic caching logic into the Python FastAPI code initially created in the article: LangChain RAG with React, FastAPI, Cosmos DB Vector: Part 2. This article will solely address the code modifications necessary for implementing semantic caching. For further information on the development of the API, please refer to the original article.
Setting Up the Environment for the API
Set up your Python virtual environment in the demo_api directory.

```
python -m venv venv
```

Activate your environment and install dependencies using the requirements file in the demo_api directory:

```
venv\Scripts\activate
python -m pip install -r requirements.txt
```

Create a file named '.env' in the demo_api directory to store your environment variables.
OPENAI_API_KEY="**Your Open AI Key**"
MONGO_CONNECTION_STRING="mongodb+srv:**your connection string from Azure Cosmos DB**"
AZURE_STORAGE_CONNECTION_STRING="**"
AZURE_STORAGE_CONTAINER="images" | Environment Variable | Description |
|---|---|
| OPENAI_API_KEY | The key to connect to OpenAI API. If you do not possess an API key for Open AI, you can proceed by following the guidelines outlined here. |
| MONGO_CONNECTION_STRING | The Connection string for Azure Cosmos DB for MongoDB vCore (see here) |
| AZURE_STORAGE_CONNECTION_STRING | The Connection string for Azure Storage Account (see here) |
| AZURE_STORAGE_CONTAINER | The container name used from above defaults to ‘images‘. |
With the environment configured and variables set up, we are ready to initiate the FastAPI server. Run the following command from the demo_api directory to initiate the server.
```
python main.py
```

The FastAPI server launches on the localhost loopback 127.0.0.1, port 8000, by default. You can access the Swagger documentation at: http://127.0.0.1:8000/docs
Further information on conducting vector search testing with the API is available here.
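As a quick smoke test from Python, a call returning the AIResults JSON (described in the model layer below) might look like the following. The route and payload here are illustrative; check the Swagger docs above for the actual endpoint exposed by the API:

```python
import requests

# Hypothetical route and payload; confirm the real ones at http://127.0.0.1:8000/docs
response = requests.post(
    "http://127.0.0.1:8000/search",
    json={"question": "What is specific impulse?"},
)
body = response.json()
print(body["text"], body["ResponseSeconds"])  # fields from the AIResults model below
```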
Walkthrough of Code for Semantic Cache
This article focuses on the components that have been modified for semantic caching, primarily within the model and service layers. For a comprehensive walkthrough of the code, please refer to the FastAPI article here.
Model Layer
The Pydantic models serve as structured repositories for the application's data. The data within these models is transmitted to the web layer for delivery to the API requestor.
model/airesults.py

```python
from pydantic import BaseModel
from typing import List, Optional, Union

from model.resource import Resource

class AIResults(BaseModel):
    text: str
    ResourceCollection: list[Resource]
    ResponseSeconds: float
```

The ResponseSeconds metric has been incorporated into the model to communicate the retrieval time of the LLM call in seconds.
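The API populates that value by timing the LLM call. The following is a minimal sketch of the pattern; the function name and empty ResourceCollection are illustrative, not the repository's exact code:

```python
import time

from model.airesults import AIResults

def answer_with_timing(llm_chain, question: str) -> AIResults:
    # Time the retrieval; a semantic-cache hit returns in a fraction of the uncached time
    start = time.perf_counter()
    answer = llm_chain.run(question)
    elapsed = time.perf_counter() - start
    return AIResults(text=answer, ResourceCollection=[], ResponseSeconds=elapsed)
```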
Service Layer
The service layer houses the core business logic for this use case, acting as the home for the LangChain code and the semantic caching.
service/init.py

```python
from dotenv import load_dotenv
from langchain.globals import set_llm_cache
from langchain_openai import ChatOpenAI

from data.mongodb.init import semantic_cache

load_dotenv(override=True)

llm: ChatOpenAI | None = None

def LLM_init():
    global llm
    llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)
    set_llm_cache(semantic_cache)  # comment this line to turn off the cache

LLM_init()
```

A singleton pattern is employed to initialize the connection to the Azure Cosmos DB Semantic Cache when the service launches. The global variable llm is configured to use the AzureCosmosDBSemanticCache, which is established as our ChatOpenAI cache by calling set_llm_cache and passing in the semantic_cache from data.mongodb.init.
The existing service code does not require any modifications to use the cache; it can simply reference the global ChatOpenAI variable llm in the following manner.

```python
from .init import llm

llm_chain = LLMChain(llm=llm, prompt=prompt)
```

Data Layer
The data layer establishes connectivity to Azure Cosmos DB and executes vector search operations. Furthermore, the code employs a singleton pattern, with the init.py file being the sole file that was modified to support semantic caching.
data/mongodb/init.py

```python
from os import environ

from dotenv import load_dotenv
from pymongo import MongoClient
from pymongo.collection import Collection
from langchain_openai import OpenAIEmbeddings
from langchain_community.cache import AzureCosmosDBSemanticCache
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType,
)

load_dotenv(override=True)

collection: Collection | None = None
vector_store: AzureCosmosDBVectorSearch | None = None
semantic_cache: AzureCosmosDBSemanticCache | None = None

def mongodb_init():
    MONGO_CONNECTION_STRING = environ.get("MONGO_CONNECTION_STRING")
    DB_NAME = "research"
    COLLECTION_NAME = "resources"
    INDEX_NAME = "vectorSearchIndex"
    global collection, vector_store, semantic_cache

    client = MongoClient(MONGO_CONNECTION_STRING)
    db = client[DB_NAME]
    collection = db[COLLECTION_NAME]

    vector_store = AzureCosmosDBVectorSearch.from_connection_string(
        MONGO_CONNECTION_STRING,
        DB_NAME + "." + COLLECTION_NAME,
        OpenAIEmbeddings(disallowed_special=()),
        index_name=INDEX_NAME,
    )

    semantic_cache = AzureCosmosDBSemanticCache(
        cosmosdb_connection_string=MONGO_CONNECTION_STRING,
        cosmosdb_client=None,
        embedding=OpenAIEmbeddings(),
        database_name=DB_NAME,
        collection_name=DB_NAME + '_CACHE',
        num_lists=1,  # for a small demo, num_lists=1 performs a brute-force search across all vectors
        similarity=CosmosDBSimilarityType.COS,
        kind=CosmosDBVectorSearchType.VECTOR_IVF,
        dimensions=1536,
        m=16,
        ef_construction=64,
        ef_search=40,
        score_threshold=0.99,
    )

mongodb_init()
```

The modification to the code involves the addition of AzureCosmosDBSemanticCache. The parameters passed in correspond to the values used in the demo_loader and those specified in the LangChain docs for Azure Cosmos DB Semantic Cache. In testing, setting score_threshold=0.99 produced the intended effect, unlike the value recommended in the documentation. Setting the collection name to DB_NAME + '_CACHE' results in the creation of a new Cosmos DB collection for the cache store named 'research_CACHE'.
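To watch the cache fill up, or to reset it between benchmark runs, you can inspect or drop the research_CACHE collection directly. A small pymongo sketch, using the database and collection names configured above:

```python
from os import environ

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv(override=True)

client = MongoClient(environ.get("MONGO_CONNECTION_STRING"))
cache = client["research"]["research_CACHE"]

print("cached entries:", cache.count_documents({}))
cache.drop()  # clear the cache to re-measure the uncached retrieval time
```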
React Web User Interface
In this concluding section, we will establish the connection between the React JS web user interface and our updated Python FastAPI endpoint in order to utilize the semantic caching capabilities of Azure Cosmos DB.
Install Node.js
You can follow the steps outlined here to download and install Node.js.
Set Up the React Web User Interface
With Node.js installed, install the project dependencies before testing the React interface.
Run the following command from the demo_web directory to perform a clean install of the project dependencies; this may take some time.

```
npm ci
```

added 1599 packages, and audited 1600 packages in 7m
Next, create a file named '.env' within the demo_web directory to store environment variables, and add the following details to it.

```
REACT_APP_API_HOST=http://127.0.0.1:8000
```

| Environment Variable | Description |
|---|---|
| REACT_APP_API_HOST | URL of our FastAPI server. Defaults to our local machine: http://127.0.0.1:8000 |
Now we can execute the following command from the demo_web directory to start the React web user interface.

```
npm start
```

Walkthrough of the React Project
The React project requires only one modification: displaying the LLM response time on the search page. A comprehensive walkthrough of the entire React project can be found at this link.
Search
The Search component serves as the primary interface for interacting with the FastAPI RAG Q&A endpoints and associated React functions.
Search/Search.js
```jsx
{results !== '' && (
  <Stack direction="column" spacing={2}>
    <SearchAnswer results={results} />
  </Stack>
)}
```

The search has been updated to pass the JSON results to SearchAnswer, rather than solely providing the answer. This adjustment allows the SearchAnswer component to manage parsing the response time.
Search Answer
The SearchAnswer component presents the response (answer text) provided by the LangChain RAG FastAPI endpoint and the response time in seconds for the LLM retrieval process (or cache).
Search/SearchAnswer.js
```jsx
import React from 'react'
import { Box, Paper, Stack, Typography } from '@mui/material'
export default function SearchAnswer(results) {
return (
<Paper sx={{ p: 2 }}>
<Stack direction="column" spacing={2} useFlexGap flexWrap="wrap">
<Typography
variant="subtitle1"
sx={{ color: 'grey', fontSize: '12pt' }}
>
Answer:
</Typography>
<Box
sx={{
border: 1,
borderColor: 'lightgray',
borderRadius: 3,
p: 1,
fontSize: 14,
}}
>
{results.results.text}
</Box>
<Stack>
<Typography
variant="caption"
sx={{ color: 'grey', fontSize: '10pt' }}
>
Response Time (seconds)
</Typography>
<Typography
variant="caption"
sx={{ color: 'grey', fontSize: '10pt' }}
>
{results.results.ResponseSeconds}
</Typography>
</Stack>
</Stack>
</Paper>
)
}
```

The code remains largely unchanged, except for changing the parameter to 'results' and displaying ResponseSeconds beneath the answer text.

Congratulations on successfully loading data into Azure Cosmos DB for MongoDB for vector search. Additionally, you have implemented semantic caching with LangChain and Azure Cosmos DB for MongoDB in FastAPI, and connected a web user interface built with React JS. Hopefully this article demonstrated how easy it is to add semantic caching to an existing application. If you would like to build on this project and deliver scalable and secure AI solutions, check out: The Perfect AI Team: Azure Cosmos DB and Azure App Service
