Important
Disclaimer: The code has been tested on:
- Ubuntu 22.04.2 LTS running on a Lenovo Legion 5 Pro with a 12th Gen Intel® Core™ i7-12700H (20 threads) and an NVIDIA GeForce RTX 3060.
- macOS Sonoma 14.3.1 running on a MacBook Pro M1 (2020).
If you are using another operating system or different hardware and you can't load the models, please take a look at the official llama-cpp-python GitHub issues.
Warning
- llama_cpp_python doesn't use the GPU on M1 if you are running an x86 version of Python. More info here.
- It's important to note that large language models sometimes generate hallucinations or false information.
Note
To decide which hardware to use or buy to host your local LLMs, we recommend reading these great benchmarks:
Decision model:
- Memory capacity is the main limit. Check if your model fits in memory (with quantization): https://www.canirun.ai/.
- Memory bandwidth mostly determines speed (tokens/sec). Check if the bandwidth gives you acceptable speed.
- If not, upgrade the hardware or optimize the model.
For instance, it seems better to buy a second-hand or refurbished Mac Studio M2 Max with at least 64GB RAM, since it has 400GB/s of memory bandwidth compared to the M4 Pro, which has just 273GB/s.
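The bandwidth limit can be turned into a rough back-of-the-envelope estimate: during decoding, every generated token must stream all model weights from memory once, so memory bandwidth divided by model size gives an upper bound on tokens/sec. A minimal sketch (an approximation that ignores compute, the KV cache, and caching effects):

```python
def est_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bits: int) -> float:
    """Upper-bound decode speed: each token streams all weights from memory once."""
    model_gb = params_b * bits / 8  # e.g. a 7B model at 4-bit is ~3.5 GB
    return bandwidth_gb_s / model_gb

# M2 Max (400 GB/s) vs M4 Pro (273 GB/s) on a 7B model quantized to 4-bit:
print(round(est_tokens_per_sec(400, 7, 4)))  # → 114
print(round(est_tokens_per_sec(273, 7, 4)))  # → 78
```

Real throughput will be lower, but the ratio between two machines is a useful first-order comparison.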
- Introduction
- Prerequisites
- Bootstrap Environment
- Using the Open-Source Models Locally
- Supported Response Synthesis strategies
- Build the memory index
- Run the Chatbot
- References
This project combines the power of llama.cpp and Chroma to build:
- a Conversation-aware Chatbot (ChatGPT like experience).
- a RAG (Retrieval-augmented generation) ChatBot.
The RAG Chatbot works by taking a collection of Markdown files as input and, when asked a question, provides the corresponding answer based on the context provided by those files.
Note
We decided to grab and refactor the RecursiveCharacterTextSplitter class from LangChain to effectively chunk Markdown files without adding LangChain as a dependency.
The Memory Builder component of the project loads Markdown pages from the docs folder. It then divides these pages into smaller sections, calculates the embeddings (a numerical representation) of these sections with the all-MiniLM-L6-v2 sentence-transformer, and saves them in an embedding database called Chroma for later use.
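To illustrate the chunking step, here is a simplified sketch of what a recursive character splitter does: try coarse separators (paragraphs) first and fall back to finer ones. This is not the project's refactored `RecursiveCharacterTextSplitter` (notably, it omits chunk overlap):

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=1000):
    """Split text on the coarsest separator that appears, recursing with finer
    separators for any piece that is still too large."""
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > chunk_size:
                        # Piece still too big: recurse with finer separators.
                        chunks.extend(recursive_split(part, separators[i + 1:], chunk_size))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator left: hard cut.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
```

Markdown-aware splitters extend this idea by putting headings and code fences first in the separator list, so chunks respect document structure.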
When a user asks a question, the RAG ChatBot retrieves the most relevant sections from the embedding database. Since the original question isn't always optimal for retrieval, we first prompt an LLM to rewrite it, then conduct retrieval-augmented reading. The most relevant sections are then used as context to generate the final answer using a local large language model (LLM). Additionally, the chatbot is designed to remember previous interactions: it saves the chat history and considers the relevant context from previous conversations to provide more accurate answers.
To deal with context overflows, we implemented two approaches:
- Create and Refine the Context: synthesize a response sequentially through all retrieved contents.
- Hierarchical Summarization of Context: generate an answer for each relevant section independently, and then hierarchically combine the answers.
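The hierarchical strategy can be sketched as a reduction tree. In the sketch below, the stub `summarize` stands in for a real LLM call (here it just joins its inputs); this is an illustration of the combining logic, not the project's implementation:

```python
def summarize(texts: list[str]) -> str:
    """Stub for an LLM summarization call; a real one would prompt the model."""
    return " | ".join(texts)

def hierarchical_summarize(answers: list[str], fan_in: int = 2) -> str:
    """Combine per-section answers fan_in at a time until one answer remains."""
    while len(answers) > 1:
        answers = [summarize(answers[i:i + fan_in])
                   for i in range(0, len(answers), fan_in)]
    return answers[0]
```

Each combining call only sees `fan_in` intermediate answers, so no single prompt has to fit all retrieved sections into the context window at once.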
- Python 3.12+
- GPU supporting CUDA 12.4+
- Poetry 2.3.0
For the UI:
- Node 22.12+
- Yarn 1.22+
Install Poetry with pipx by following this link.
You must use the current adopted version of Poetry defined here.
If you already have Poetry installed and it is not the right version, you can downgrade (or upgrade) it with:

```shell
poetry self update <version>
```

or with pipx:

```shell
pipx install poetry==<version> --force
```

To easily install the dependencies, we created a Makefile.
Important
Run Setup as your init command (or after Clean).
- Check: `make check`
  - Use it to verify that `which pip3` and `which python3` point to the right path.
- Setup:
  - With NVIDIA CUDA acceleration: `make setup_cuda`
    - Creates an environment and installs all dependencies with NVIDIA CUDA acceleration.
  - With Metal GPU acceleration: `make setup_metal`
    - Creates an environment and installs all dependencies with Metal GPU acceleration (macOS only).
- Start: `make start`
  - Starts both the backend and frontend, ensuring the backend is running and ready before launching the frontend.
- Update: `make update`
  - Updates the environment and installs all updated dependencies.
- Tidy up the code: `make tidy`
  - Runs Ruff check and format.
- Clean: `make clean`
  - Removes the environment and all cached files.
- Test: `make test`
  - Runs all tests using pytest.
Copy `.env.example` to `.env` and fill it in.
We utilize the open-source library llama-cpp-python, a binding for llama.cpp that allows us to use it within a Python environment. llama.cpp serves as a C++ backend designed to work efficiently with transformer-based models. Running these LLMs at full precision on a typical local PC is impractical due to their large number of parameters (~7 billion). This library enables us to run them on either a CPU or a GPU. Additionally, we use quantization at 4-bit precision to reduce the number of bits required to represent each weight. The quantized models are stored in the GGML/GGUF format.
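To see why quantization matters, compare the memory needed just to hold the weights at different precisions. A quick sketch (weights only; activations and the KV cache add more on top):

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# A ~7B-parameter model:
print(model_memory_gb(7e9, 16))  # float16 → 14.0 GB
print(model_memory_gb(7e9, 4))   # 4-bit   → 3.5 GB
```

At 4-bit precision the same model fits comfortably in the RAM of a consumer laptop, which is what makes local inference feasible.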
| 🤖 Model | Supported | Model Size | Max Context Window | Notes and link to the model card |
|---|---|---|---|---|
| qwen-3.5:0.8b - Qwen 3.5 0.8B | ✅ | 0.8B | 256k | Tiny and fast multimodal, great for edge devices - Card |
| qwen-3.5:2b - Qwen 3.5 2B | ✅ | 2B | 256k | Multimodal for lightweight agents (small tool calls) - Card |
| qwen-3.5:4b - Qwen 3.5 4B | ✅ | 4B | 256k | Doesn't drift from tasks as badly as the 2B - Card |
| qwen-3.5:9b - Qwen 3.5 9B | ✅ | 9B | 256k | Recommended model - Can handle more complex tasks and competes with larger models like gpt-oss 120B - Card |
| qwen-2.5:3b - Qwen2.5 Instruct | ✅ | 3B | 128k | Card |
| qwen-2.5:3b-math-reasoning - Qwen2.5 Instruct Math Reasoning | ✅ | 3B | 128k | Card |
| llama-3.2:1b - Meta Llama 3.2 Instruct | ✅ | 1B | 128k | Optimized to run locally on a mobile or edge device - Card |
| llama-3.2 - Meta Llama 3.2 Instruct | ✅ | 3B | 128k | Optimized to run locally on a mobile or edge device - Card |
| llama-3.1 - Meta Llama 3.1 Instruct | ✅ | 8B | 128k | Recommended model - Card |
| deep-seek-r1:7b - DeepSeek R1 Distill Qwen 7B | ✅ | 7B | 128k | Experimental - Card |
| openchat-3.6 - OpenChat 3.6 | ✅ | 8B | 8192 | Card |
| openchat-3.5 - OpenChat 3.5 | ✅ | 7B | 8192 | Card |
| starling - Starling Beta | ✅ | 7B | 8192 | Trained from OpenChat-3.5-0106. Recommended if you prefer more verbosity over OpenChat - Card |
| phi-3.5 - Phi-3.5 Mini Instruct | ✅ | 3.8B | 128k | Card |
| stablelm-zephyr - StableLM Zephyr OpenOrca | ✅ | 3B | 4096 | Card |
| ✨ Response Synthesis strategy | Supported | Notes |
|---|---|---|
| create-and-refine - Create and Refine | ✅ | |
| tree-summarization - Tree Summarization | ✅ | Recommended |
You could download some Markdown pages from the Blendle Employee Handbook and put them under docs.
Then run:
```shell
python chatbot/memory_builder.py --chunk-size 1000 --chunk-overlap 50
```

The Chatbot has a UI built with Vite, React and TypeScript, and a backend built with FastAPI that serves the LLMs through llama-cpp-python.
To install the UI dependencies, run:
```shell
cd frontend
nvm use
yarn
# Create .env file
echo "VITE_API_URL=http://localhost:8000" > .env
```

To start the backend, type:

```shell
cd backend && PYTHONPATH=.:../chatbot uvicorn main:app --reload
```

To start the frontend (in a new terminal):

```shell
cd frontend && yarn dev
```

or, to start both while ensuring that the backend is running and ready before launching the frontend, just run:

```shell
make start
```

The application will be available at http://localhost:5173, with the backend API at http://localhost:8000.
You can enable the RAG Mode feature in the UI to ask questions based on the context provided by the Markdown files you loaded and indexed in the previous step:
You can also upload a Markdown file using the file uploader. The document management section shows the uploaded and indexed documents. Once you upload one or multiple files, they will be: uploaded → chunked → embedded → upserted to Chroma.
- Large Language Models (LLMs):
- LLM Frameworks:
- llama.cpp:
- DeepEval - A framework for evaluating LLMs:
- Structured Outputs
- LLM Datasets:
- Agents:
- Agent Frameworks:
- PydanticAI
- Atomic Agents
- agno - a lightweight, high-performance library for building Agents.
- Embeddings:
- To find the best embedding models for the retrieval task in your language, go to the Massive Text Embedding Benchmark (MTEB) Leaderboard
- all-MiniLM-L6-v2
- This is a sentence-transformers model: it maps sentences & paragraphs to a 384-dimensional dense vector space (max tokens: 512) and can be used for tasks like classification or semantic search.
- Vector Databases:
- Indexing algorithms:
- There are many algorithms for building indexes to optimize vector search. Most vector databases implement Hierarchical Navigable Small World (HNSW) and/or Inverted File Index (IVF). Here are some great articles explaining them, and the trade-off between speed, memory and quality:
  - Nearest Neighbor Indexes for Similarity Search
  - Hierarchical Navigable Small World (HNSW)
  - From NVIDIA - Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT
  - From NVIDIA - Accelerating Vector Search: Fine-Tuning GPU Index Algorithms
  - PS: Flat indexes (i.e. no optimisation) can be used to maintain 100% recall and precision, at the expense of speed.
- Chroma
- Qdrant:
- Indexing algorithms:
- Retrieval Augmented Generation (RAG):
- Building A Generative AI Platform
- Rewrite-Retrieve-Read
  - "Because the original query can not be always optimal to retrieve for the LLM, especially in the real world, we first prompt an LLM to rewrite the queries, then conduct retrieval-augmented reading."
- Rerank
- Building Response Synthesis from Scratch
- Conversational awareness
- RAG is Dead, Again?
- Chatbot UI:
- Text Processing and Cleaning:
- Inspirational Open Source Repositories:





