The first part of the LangChain RAG Pattern with React, FastAPI, and Cosmos DB Vector Store series is based on the article LangChain Vector Search with Cosmos DB for MongoDB. This article explains how to load documents into an Azure Cosmos DB for MongoDB vCore vector store using LangChain. It is part 1 of 3 and establishes the foundation for the entire series; part 2 builds the LangChain RAG and API with FastAPI. The tutorial walks through each step of the project so you understand every component and process, and by the end of the series you will know how to integrate these technologies into a robust, efficient end-to-end solution.

In this article
- Prerequisites
- Retrieval Augmented Generation (RAG): What Is It?
- Download the Project
- Setting Up Python
- Using LangChain Loader to load Cosmos DB Vector Store
- Loading Azure Storage Account with BlobLoader
- Load Documents into Cosmos DB Vector Store and Images into Storage Account
Prerequisites
- If you don’t have an Azure subscription, create an Azure free account before you begin.
- Set up an account for the OpenAI API (Overview – OpenAI API)
- Create an Azure Cosmos DB for MongoDB vCore by following this QuickStart.
- An IDE for Development, such as VS Code
Retrieval Augmented Generation (RAG): What Is It?
Large Language Model (LLM) systems understand a wide array of subjects, but their knowledge is limited to publicly available information up to a certain cutoff date. If you need to build AI tools that understand confidential data or newer information, you must supply that data to the model at query time. This process is commonly referred to as Retrieval Augmented Generation (RAG), and is also described as ‘grounding’ a model.
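To make the pattern concrete, here is a minimal conceptual sketch of the RAG flow. This is illustrative only, not this series’ implementation: `vector_store` and `llm` are placeholders, and the `similarity_search`/`invoke` method names follow LangChain conventions.

```python
# A minimal sketch of the RAG flow. The vector_store and llm objects are
# placeholders; method names follow LangChain conventions.
def answer_with_rag(question: str, vector_store, llm) -> str:
    # 1. Retrieve: find stored documents semantically similar to the question.
    docs = vector_store.similarity_search(question, k=3)

    # 2. Augment: ground the prompt in the retrieved content.
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # 3. Generate: the LLM answers from the grounded prompt.
    return llm.invoke(prompt)
```

This article focuses on the first prerequisite of that flow: getting documents into the vector store.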
Download the Project
For this project, the code and sample datasets are available to you on my GitHub.

Clone the Document Loader (demo_loader) GitHub repo.

Setting Up Python
In this tutorial we’ll be working in Python, so it needs to be set up on your computer. We’ll use Python and LangChain to ingest vectors into Azure Cosmos DB for MongoDB vCore and to run a similarity search. Python 3.11.4 was used during the development and testing of this walkthrough.
First, set up your Python virtual environment in the demo_loader directory:

```
python -m venv venv
```

Activate your environment and install dependencies in the demo_loader directory:

```
venv\Scripts\activate
python -m pip install -r requirements.txt
```

Create a file named ‘.env’ in the demo_loader directory to store your environment variables.
```
OPENAI_API_KEY="**Your OpenAI Key**"
MONGO_CONNECTION_STRING="mongodb+srv:**your connection string from Azure Cosmos DB**"
AZURE_STORAGE_CONNECTION_STRING="**"
```

| Environment Variable | Description |
|---|---|
| OPENAI_API_KEY | The key used to connect to the OpenAI API. If you do not have an OpenAI API key, you can get one by following the guidelines outlined here. |
| MONGO_CONNECTION_STRING | The connection string for Azure Cosmos DB for MongoDB vCore (see below). |
| AZURE_STORAGE_CONNECTION_STRING | The connection string for the Azure Storage Account (see below). |
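Once the ‘.env’ file is in place, a quick sanity check confirms that all three variables load. This is a minimal sketch using python-dotenv, which the project already depends on:

```python
from os import environ

from dotenv import load_dotenv

load_dotenv(override=True)

# Report whether each required variable was picked up from .env
for name in ("OPENAI_API_KEY", "MONGO_CONNECTION_STRING",
             "AZURE_STORAGE_CONNECTION_STRING"):
    print(name, "->", "set" if environ.get(name) else "MISSING")
```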
Azure Cosmos DB for MongoDB vCore Connection String
The environment variable MONGO_CONNECTION_STRING from the ‘.env’ file will contain the Azure Cosmos DB for MongoDB vCore connection string. Obtain this value by selecting “Connection strings” from the Azure portal for your Cosmos DB instance. It may be necessary to include your username and password within the designated fields.
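To confirm the connection string works before loading any data, you can ping the cluster. This is a minimal sketch assuming pymongo is available in your virtual environment (the loader’s requirements should already provide it):

```python
from os import environ

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv(override=True)

# Round-trip to the cluster; this raises an exception if the
# connection string (or username/password) is wrong.
client = MongoClient(environ.get("MONGO_CONNECTION_STRING"))
client.admin.command("ping")
print("Connected to Azure Cosmos DB for MongoDB vCore")
```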

Azure Storage Account Connection String
The environment variable AZURE_STORAGE_CONNECTION_STRING from the ‘.env’ file will contain the Azure Storage Account connection string. Obtain this value by selecting “Access keys” from the Azure portal for your Storage Account instance.

Using LangChain Loader to load Cosmos DB Vector Store
The Python script vectorstoreloader.py serves as the primary entry point for loading data into the Cosmos DB vector store and the Azure Storage Account. The code snippet below loads the document data into Cosmos DB and the corresponding images into an Azure Blob Storage container, handling each document in the provided list of file names in turn.
vectorstoreloader.py
```python
import base64
import json

# Project modules from the demo_loader repo; module names are assumed
# to match their file names.
from CosmosDBLoader import CosmosDBLoader
from BlobLoader import BlobLoader

file_names = ['documents/Rocket_Propulsion_Elements_with_images.json',
              'documents/Introduction_To_Rocket_Science_And_Engineering_with_images.json']

for file_name in file_names:
    # Load the document text and embeddings into the Cosmos DB vector store.
    CosmosDBLoader(f"{file_name}").load()

    image_loader = BlobLoader()
    with open(file_name) as file:
        data = json.load(file)

    resource_id = data['resource_id']
    for page in data['pages']:
        # Strip the Python bytes-literal wrapper (b'...') from the stored string.
        base64_string = page['image'].replace("b'", "").replace("'", "")
        # Decode the Base64 string into bytes
        decoded_bytes = base64.b64decode(base64_string)
        image_loader.load_binary_data(decoded_bytes, f"{resource_id}/{page['page_id']}.png", "images")
```

VectorstoreLoader Breakdown
- `file_names` contains the two sample JSON files, each representing a document with associated images, roughly 160 document pages in total.
- The script iterates through each file name in the `file_names` list. For each file name, it:
  - Loads the document data from the JSON file into Cosmos DB using `CosmosDBLoader` – this process is covered in the article LangChain Vector Search with Cosmos DB for MongoDB.
  - Initializes an `image_loader` object of the `BlobLoader` class.
  - Opens the JSON file and loads its data into a Python dictionary named `data`.
  - Extracts the `resource_id` from the document data.
  - Iterates through the pages in the document data.
  - Extracts the base64-encoded image data from each page and decodes it into bytes using `base64.b64decode`.
  - Passes the decoded image bytes to the `image_loader` object, specifying the image’s destination path and container name.

The JSON document shape this loop implies is sketched below.
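This sketch is inferred from the loader code above; the values are hypothetical placeholders, and the exact text fields that `CosmosDBLoader` vectorizes are covered in the companion article.

```python
# Document shape implied by vectorstoreloader.py (placeholder values):
sample_document = {
    "resource_id": "rocket_propulsion_elements",  # becomes the blob path prefix
    "pages": [
        {
            "page_id": "page_001",          # blob name: <resource_id>/<page_id>.png
            "image": "b'iVBORw0KGgo...'",   # base64-encoded PNG, stored as a string
            # ...plus the page text that CosmosDBLoader embeds and stores
        },
    ],
}
```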
Loading Azure Storage Account with BlobLoader
The BlobLoader is straightforward: it stores the images decoded from the JSON documents’ base64 strings as bytes in the Azure Storage Account container (‘images’) passed to the `load_binary_data` function. The following code defines a BlobLoader class that uploads binary data to an Azure Blob Storage container, using the Azure Storage connection string retrieved from environment variables.
The code in the GitHub repository currently uses the Azure Storage Account container `images`, as called in vectorstoreloader.py: `image_loader.load_binary_data(decoded_bytes, f"{resource_id}/{page['page_id']}.png", "images")`. Before proceeding, you will need to create an `images` container (a sketch of this follows below) or adjust this code to match your pre-existing container.
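You can create the container in the portal, or from Python. A minimal sketch using the same azure-storage-blob package the project already depends on:

```python
from os import environ

from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv

load_dotenv(override=True)

blob_service_client = BlobServiceClient.from_connection_string(
    environ.get("AZURE_STORAGE_CONNECTION_STRING")
)
try:
    # Create the container the loader expects; raises if it already exists.
    blob_service_client.create_container("images")
    print("Created container 'images'")
except ResourceExistsError:
    print("Container 'images' already exists")
```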
BlobLoader.py
```python
from os import environ

from dotenv import load_dotenv
from azure.storage.blob import BlobServiceClient

load_dotenv(override=True)


class BlobLoader():
    def __init__(self):
        connection_string = environ.get("AZURE_STORAGE_CONNECTION_STRING")
        # Create the BlobServiceClient object
        self.blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    def load_binary_data(self, data, blob_name: str, container_name: str):
        blob_client = self.blob_service_client.get_blob_client(container=container_name, blob=blob_name)
        # Upload the blob data - default blob type is BlockBlob
        blob_client.upload_blob(data, overwrite=True)
```

BlobLoader Breakdown
- `load_dotenv` is used to load environment variables from the `.env` file (described above) into the script’s environment.
- The class `BlobLoader` serves as a wrapper for uploading binary data to an Azure Blob Storage container.
- Inside the `__init__` method of the `BlobLoader` class:
  - It retrieves the Azure Storage connection string from the environment variables using `environ.get("AZURE_STORAGE_CONNECTION_STRING")`.
  - It initializes a `BlobServiceClient` object using the retrieved connection string.
- Inside the `load_binary_data` method of the `BlobLoader` class:
  - It takes binary `data`, a `blob_name`, and a `container_name` as input parameters.
  - It obtains a `blob_client` object for the specified blob and container using the `get_blob_client` method of the `BlobServiceClient`.
  - It uploads the binary `data` to the specified blob in the Azure Blob Storage container using the `upload_blob` method of the `blob_client`. The `overwrite=True` parameter means that if a blob with the same name already exists, it is overwritten.

A minimal usage sketch follows below.
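This uploads a small test payload to confirm your credentials and container are set up correctly. The import assumes the class lives in BlobLoader.py as shown above; adjust the module name if yours differs.

```python
from BlobLoader import BlobLoader  # module name assumed to match BlobLoader.py

# Upload a tiny test blob; find it afterwards under smoke-test/ in 'images'.
loader = BlobLoader()
loader.load_binary_data(b"hello, blob storage", "smoke-test/hello.txt", "images")
```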
Load Documents into Cosmos DB Vector Store and Images into Storage Account
Now that we have reviewed the code, let’s load the documents by executing the following command from the demo_loader directory:

```
python vectorstoreloader.py
```

Verify the successful loading of documents using MongoDB Compass (or a similar tool).
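If you prefer to check from Python rather than Compass, a minimal sketch with pymongo follows. The database and collection names here are hypothetical placeholders; substitute the ones configured in your CosmosDBLoader.

```python
from os import environ

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv(override=True)

client = MongoClient(environ.get("MONGO_CONNECTION_STRING"))
# "research" and "documents" are hypothetical names; substitute the
# database/collection your CosmosDBLoader writes to.
collection = client["research"]["documents"]
print("Documents loaded:", collection.count_documents({}))
```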

Verify that the ‘png’ images were successfully uploaded to your Azure Storage Account’s ‘images’ container.

Access the directory to display a list of the images.
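You can also list the uploaded blobs programmatically. A minimal sketch using azure-storage-blob:

```python
from os import environ

from azure.storage.blob import BlobServiceClient
from dotenv import load_dotenv

load_dotenv(override=True)

container_client = BlobServiceClient.from_connection_string(
    environ.get("AZURE_STORAGE_CONNECTION_STRING")
).get_container_client("images")

# Each name follows the <resource_id>/<page_id>.png convention.
for blob in container_client.list_blobs():
    print(blob.name)
```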

Sample an image or two to verify the bytes were decoded correctly.

Congratulations! You have used LangChain to import documents from JSON files into Azure Cosmos DB for MongoDB vCore, and uploaded the decoded images to an Azure Storage Account for use in Part 2 and Part 3 of this guide.
