NEET Knowledge RAG

A Retrieval-Augmented Generation (RAG) system for NEET aspirants in India. It ingests PDFs, YouTube videos, audio/video files, text notes, Markdown, and HTML pages into a searchable knowledge base that you can query from the CLI or a Streamlit web UI.

Features

Multi-format Support: Process PDFs, YouTube videos, audio/video files, text files, Markdown, and HTML
Flexible RAG Pipeline: Uses LangChain for robust document processing and retrieval
Local Embeddings: HuggingFace all-MiniLM-L6-v2 (free, no API key needed)
LLM via OpenRouter: Gemini 2.0 Flash by default (or any OpenAI-compatible provider)
Vector Storage: FAISS for efficient similarity search
YouTube Audio Fallback: When subtitles are unavailable or blocked, downloads the audio track and transcribes it with Gemini's multimodal input
Streamlit Frontend: Web UI for managing sources and chatting with your knowledge base
CLI: Full command-line interface for ingestion, querying, and source management
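
The YouTube audio fallback listed above can be pictured as a two-step chain. The following is a minimal sketch, not the project's actual code: `transcribe_with_fallback` and its three callable parameters are hypothetical names introduced here, and the real logic in `youtube_processor.py` may be organised differently.

```python
from typing import Callable, Optional


def transcribe_with_fallback(
    video_url: str,
    fetch_subtitles: Callable[[str], Optional[str]],
    download_audio: Callable[[str], str],
    transcribe_audio: Callable[[str], str],
) -> tuple[str, str]:
    """Return (transcript, method), where method is "subtitles" or "audio".

    All three callables are hypothetical hooks: one fetches subtitles,
    one downloads the audio and returns a file path, one transcribes that file.
    """
    try:
        subtitles = fetch_subtitles(video_url)
    except Exception:
        # Assumption: any failure while fetching subtitles (blocked, missing,
        # network error) is treated as "no subtitles" and triggers the fallback.
        subtitles = None

    if subtitles and subtitles.strip():
        return subtitles, "subtitles"

    # Fallback path: download the audio track, then transcribe it.
    audio_path = download_audio(video_url)
    return transcribe_audio(audio_path), "audio"
```

Passing the three steps in as callables keeps the chain testable without network access; in the project, the transcription step would presumably be the Gemini multimodal call.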

Architecture

    ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │  Content        │────▶│  Processing      │────▶│  Vector Store   │
    │  Sources        │     │  (Chunking)      │     │  (FAISS)        │
    │  (YT/PDF/Text)  │     └──────────────────┘     └────────┬────────┘
    └─────────────────┘                                       │
                                                              ▼
    ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │  User Query     │────▶│  Retrieval       │────▶│  LLM Response   │
    │ (CLI/Streamlit) │     │  (Similarity)    │     │  (Generation)   │
    └─────────────────┘     └──────────────────┘     └─────────────────┘
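
To make the data flow in the diagram concrete, here is a toy, standard-library-only sketch of the same stages: chunking, embedding, similarity retrieval, and prompt assembly. It is an illustration under stated assumptions, not the project's implementation. The bag-of-words `embed` function stands in for all-MiniLM-L6-v2, `ToyVectorStore` stands in for the FAISS index, and every name in the block is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 12, overlap: int = 4) -> list[str]:
    """Split text into overlapping word windows (stand-in for a real text splitter)."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; the project uses a sentence-embedding model instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for the FAISS index: stores (chunk, vector) pairs."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        self.entries.extend((chunk, embed(chunk)) for chunk in chunks)

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the text that would be sent to the LLM for generation."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    store = ToyVectorStore()
    store.add(chunk_text("Newton's first law states that a body stays at rest or in "
                         "uniform motion unless acted on by a net external force."))
    store.add(chunk_text("Mitochondria are the site of aerobic respiration in eukaryotic cells."))
    question = "Where does aerobic respiration happen in cells?"
    print(build_prompt(question, store.search(question, k=1)))
```

The top half of the diagram corresponds to `chunk_text` and `ToyVectorStore.add`; the bottom half corresponds to `search` and `build_prompt`, with the final LLM call left out.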

Installation

    # Clone the repository
    git clone <repository-url>
    cd neet_knowledge_project

    # Create virtual environment
    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    # venv\Scripts\activate   # Windows

    # Install dependencies
    pip install -r requirements.txt

    # Copy and configure environment
    cp .env.example .env
    # Edit .env with your API keys

Configuration

Edit .env with your API keys:

    # Required: OpenRouter or OpenAI API key
    OPENAI_API_KEY=sk-or-v1-your-key-here
    OPENAI_BASE_URL=https://openrouter.ai/api/v1
    OPENAI_MODEL_NAME=google/gemini-2.0-flash-001

    # Optional: YouTube Data API for better metadata
    YOUTUBE_API_KEY=your-youtube-api-key
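
As an illustration of how these variables might be consumed, the sketch below reads them from a mapping and fails early when the required key is missing. `LLMSettings` and `load_settings` are hypothetical names, and the fallback values simply mirror the example above; the defaults actually used by the project may differ. The sketch also assumes the `.env` file has already been loaded into the process environment by some other means.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    base_url: str
    model_name: str
    youtube_api_key: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> LLMSettings:
    """Build settings from environment variables, failing fast if the key is missing."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required; set it in .env")
    return LLMSettings(
        api_key=api_key,
        # Assumed fallbacks, copied from the example .env values above.
        base_url=env.get("OPENAI_BASE_URL", "https://openrouter.ai/api/v1"),
        model_name=env.get("OPENAI_MODEL_NAME", "google/gemini-2.0-flash-001"),
        # Optional: an empty string is treated the same as "not set".
        youtube_api_key=env.get("YOUTUBE_API_KEY") or None,
    )
```

Accepting the environment as a parameter makes the loader easy to exercise in tests with a plain dictionary.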

Quick Start

Option 1: Streamlit Web UI (Recommended for Demo)

streamlit run app.py

This opens a web interface where you can:

Add YouTube URLs or PDF paths via the sidebar
Click "Update/Ingest All Sources" to process them
Chat with your knowledge base in the main panel

Option 2: CLI

    # Ingest test data (text files)
    python -m src.main ingest ./tests/test_data/text/physics_notes.txt
    python -m src.main ingest ./tests/test_data/text/chemistry_notes.md
    python -m src.main ingest ./tests/test_data/html/biology_cell.html

    # Ingest a YouTube video
    python -m src.main ingest "https://www.youtube.com/watch?v=VIDEO_ID"

    # Query the knowledge base
    python -m src.main query "What are Newton's laws of motion?"

    # Interactive chat
    python -m src.main chat

    # Check system stats
    python -m src.main stats

Source Management (CLI)

    # Add a YouTube source for periodic updates
    python -m src.main source add youtube "https://youtube.com/watch?v=..." --title "Physics Lecture"

    # List all tracked sources
    python -m src.main source list

    # Update all sources that need refresh
    python -m src.main source update

    # Remove a source
    python -m src.main source remove <source_id>
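
For readers who want to see how a command layout like this is typically wired up, the following sketch reproduces the subcommands shown above with the standard library's `argparse`. It is an assumption about shape only: the README does not say which CLI library `src/main.py` uses, and `build_parser` and the argument names `source_type`, `url`, and `source_id` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the commands shown in this README."""
    parser = argparse.ArgumentParser(prog="python -m src.main")
    commands = parser.add_subparsers(dest="command", required=True)

    ingest = commands.add_parser("ingest", help="Ingest a file path or URL")
    ingest.add_argument("source")

    query = commands.add_parser("query", help="Ask a single question")
    query.add_argument("question")

    commands.add_parser("chat", help="Start an interactive chat")
    commands.add_parser("stats", help="Show system stats")

    # Nested subcommands: source add | list | update | remove
    source = commands.add_parser("source", help="Manage tracked sources")
    actions = source.add_subparsers(dest="action", required=True)

    add = actions.add_parser("add", help="Track a new source")
    add.add_argument("source_type")  # e.g. "youtube"
    add.add_argument("url")
    add.add_argument("--title")

    actions.add_parser("list", help="List tracked sources")
    actions.add_parser("update", help="Refresh sources that need it")

    remove = actions.add_parser("remove", help="Stop tracking a source")
    remove.add_argument("source_id")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Running `python sketch.py source add youtube "https://youtube.com/watch?v=..." --title "Physics Lecture"` against this sketch (saved under the hypothetical file name `sketch.py`) prints the parsed namespace.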

Supported Content Types

Type	Extension/Format	Processing
Text	.txt	Direct text extraction
Markdown	.md, .markdown	Section-based extraction
HTML	.html, .htm	Main content extraction
PDF	.pdf	Text extraction + OCR fallback
YouTube	URL	Subtitles → Audio download → Gemini transcription
Video	.mp4, .avi, .mov	Audio transcription
Audio	.mp3, .wav	Speech-to-text
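
The table above implies a routing step that maps each source to a processor. A minimal sketch of such a router follows; the extension map is taken directly from the table, while `detect_content_type`, the list of YouTube hostnames, and the handling of non-YouTube URLs by file suffix are assumptions made for illustration and are not taken from the project's `unified.py`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension-to-type mapping, copied from the table above.
EXTENSION_TO_TYPE = {
    ".txt": "text",
    ".md": "markdown", ".markdown": "markdown",
    ".html": "html", ".htm": "html",
    ".pdf": "pdf",
    ".mp4": "video", ".avi": "video", ".mov": "video",
    ".mp3": "audio", ".wav": "audio",
}

# Assumed set of hostnames that identify a YouTube URL.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def detect_content_type(source: str) -> str:
    """Return the content type for a local path or URL, or raise ValueError."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        if parsed.hostname in YOUTUBE_HOSTS:
            return "youtube"
        # Assumption: other URLs are classified by the suffix of their path.
        suffix = PurePosixPath(parsed.path).suffix.lower()
    else:
        # Local path; normalise Windows separators before taking the suffix.
        suffix = PurePosixPath(source.replace("\\", "/")).suffix.lower()
    try:
        return EXTENSION_TO_TYPE[suffix]
    except KeyError:
        raise ValueError(f"Unsupported source: {source!r}") from None
```

For example, `detect_content_type("./tests/test_data/text/chemistry_notes.md")` returns `"markdown"`, and an unrecognised extension raises `ValueError` rather than being silently skipped.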

Project Structure

    neet_knowledge_project/
    ├── app.py                      # Streamlit web frontend
    ├── config.yaml                 # Configuration file
    ├── requirements.txt            # Python dependencies
    ├── .env.example                # Environment variable template
    ├── src/
    │   ├── main.py                 # CLI entry point
    │   ├── processors/             # Content processors
    │   │   ├── pdf_processor.py
    │   │   ├── youtube_processor.py
    │   │   ├── text_processor.py
    │   │   ├── html_processor.py
    │   │   ├── video_processor.py
    │   │   └── unified.py          # ContentProcessor router
    │   ├── rag/                    # RAG system
    │   │   ├── vector_store.py     # FAISS vector store manager
    │   │   ├── llm_manager.py      # LLM provider abstraction
    │   │   └── neet_rag.py         # Main RAG orchestrator
    │   └── utils/
    │       ├── config.py           # YAML config loader
    │       └── content_manager.py  # Source tracking + auto-updater
    ├── tests/
    │   ├── test_rag.py             # Unit tests
    │   └── test_data/              # Sample NEET content
    │       ├── text/               # Physics & Chemistry notes
    │       └── html/               # Biology notes
    └── data/
        ├── sources.json            # Tracked content sources
        ├── faiss_index/            # FAISS vector database (generated)
        └── audio/                  # Cached audio downloads (generated)

Running Tests

python tests/test_rag.py

Environment Variables

Variable	Required	Description
`OPENAI_API_KEY`	Yes	OpenRouter or OpenAI API key
`OPENAI_BASE_URL`	No	API base URL (default: OpenRouter)
`OPENAI_MODEL_NAME`	No	LLM model (default: gemini-2.0-flash)
`YOUTUBE_API_KEY`	No	YouTube Data API v3 key for metadata

License

MIT License
