A comprehensive Retrieval-Augmented Generation (RAG) system for NEET aspirants in India. It supports multiple content types, including PDFs, YouTube videos, text notes, and HTML pages.
- Multi-format Support: Process PDFs, YouTube videos, audio/video files, text files, Markdown, and HTML
- Flexible RAG Pipeline: Uses LangChain for robust document processing and retrieval
- Local Embeddings: HuggingFace all-MiniLM-L6-v2 (free, no API key needed)
- LLM via OpenRouter: Gemini Flash 2.0 (or any OpenAI-compatible provider)
- Vector Storage: FAISS for efficient similarity search
- YouTube Audio Fallback: When subtitles are blocked, downloads audio and transcribes via Gemini multimodal
- Streamlit Frontend: Web UI for managing sources and chatting with your knowledge base
- CLI: Full command-line interface for ingestion, querying, and source management
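
The chunk, embed, and similarity-search steps named in the list above can be illustrated with a small standard-library sketch. This is not the project's code: the real pipeline uses LangChain, all-MiniLM-L6-v2 embeddings, and FAISS, while the bag-of-words `embed` function below is a deliberately simple stand-in, and the function names and chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Stand-in embedding: a sparse bag-of-words count vector.

    Assumption: the real system would call a sentence-embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]
```

In a full RAG flow, the chunks returned by `retrieve` would be placed into the LLM prompt as context for answering the user's question.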

Architecture overview:

    ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │  Content        │────▶│  Processing      │────▶│  Vector Store   │
    │  Sources        │     │  (Chunking)      │     │  (FAISS)        │
    │  (YT/PDF/Text)  │     └──────────────────┘     └────────┬────────┘
    └─────────────────┘                                       │
                                                              ▼
    ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │  User Query     │────▶│  Retrieval       │────▶│  LLM Response   │
    │ (CLI/Streamlit) │     │  (Similarity)    │     │  (Generation)   │
    └─────────────────┘     └──────────────────┘     └─────────────────┘

Installation:

    # Clone the repository
    git clone <repository-url>
    cd neet_knowledge_project

    # Create virtual environment
    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    # venv\Scripts\activate   # Windows

    # Install dependencies
    pip install -r requirements.txt

    # Copy and configure environment
    cp .env.example .env

Edit .env with your API keys:

    # Required: OpenRouter or OpenAI API key
    OPENAI_API_KEY=sk-or-v1-your-key-here
    OPENAI_BASE_URL=https://openrouter.ai/api/v1
    OPENAI_MODEL_NAME=google/gemini-2.0-flash-001

    # Optional: YouTube Data API for better metadata
    YOUTUBE_API_KEY=your-youtube-api-key

Start the Streamlit frontend:

    streamlit run app.py

This opens a web interface where you can:
- Add YouTube URLs or PDF paths via the sidebar
- Click "Update/Ingest All Sources" to process them
- Chat with your knowledge base in the main panel

Command-line usage:

    # Ingest test data (text files)
    python -m src.main ingest ./tests/test_data/text/physics_notes.txt
    python -m src.main ingest ./tests/test_data/text/chemistry_notes.md
    python -m src.main ingest ./tests/test_data/html/biology_cell.html

    # Ingest a YouTube video
    python -m src.main ingest "https://www.youtube.com/watch?v=VIDEO_ID"

    # Query the knowledge base
    python -m src.main query "What are Newton's laws of motion?"

    # Interactive chat
    python -m src.main chat

    # Check system stats
    python -m src.main stats

Source management:

    # Add a YouTube source for periodic updates
    python -m src.main source add youtube "https://youtube.com/watch?v=..." --title "Physics Lecture"

    # List all tracked sources
    python -m src.main source list

    # Update all sources that need refresh
    python -m src.main source update

    # Remove a source
    python -m src.main source remove <source_id>

Supported content types:

| Type | Extension/Format | Processing |
|---|---|---|
| Text | .txt | Direct text extraction |
| Markdown | .md, .markdown | Section-based extraction |
| HTML | .html, .htm | Main content extraction |
| PDF | .pdf | Text extraction + OCR fallback |
| YouTube | URL | Subtitles → Audio download → Gemini transcription |
| Video | .mp4, .avi, .mov | Audio transcription |
| Audio | .mp3, .wav | Speech-to-text |
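
The routing implied by the table above, and the YouTube fallback order (subtitles first, then audio download and transcription), can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `unified.py` or `youtube_processor.py`: the names `detect_content_type`, `transcribe_with_fallback`, `EXTENSION_TYPES`, and `YOUTUBE_HOSTS` are hypothetical, and the transcription strategies are passed in as plain callables rather than real API calls.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extension -> content type, following the table above.
EXTENSION_TYPES = {
    ".txt": "text",
    ".md": "markdown",
    ".markdown": "markdown",
    ".html": "html",
    ".htm": "html",
    ".pdf": "pdf",  # assumed extension for PDF sources
    ".mp4": "video",
    ".avi": "video",
    ".mov": "video",
    ".mp3": "audio",
    ".wav": "audio",
}

# Assumption: hostnames treated as YouTube links in this sketch.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def detect_content_type(source):
    """Map a file path or URL to one of the content types in the table."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        if parsed.hostname in YOUTUBE_HOSTS:
            return "youtube"
        raise ValueError(f"unsupported URL: {source}")
    suffix = Path(source).suffix.lower()
    if suffix not in EXTENSION_TYPES:
        raise ValueError(f"unsupported file type: {source}")
    return EXTENSION_TYPES[suffix]


def transcribe_with_fallback(url, strategies):
    """Try each (name, callable) in order; return the first non-empty result.

    Each callable takes the URL and returns transcript text, or raises.
    """
    errors = []
    for name, fetch in strategies:
        try:
            text = fetch(url)
        except Exception as exc:  # a failed strategy should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if text:
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

Passing the strategies as an ordered list keeps the fallback order explicit and makes each step easy to replace with a stub in tests.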

Project structure:

    neet_knowledge_project/
    ├── app.py                      # Streamlit web frontend
    ├── config.yaml                 # Configuration file
    ├── requirements.txt            # Python dependencies
    ├── .env.example                # Environment variable template
    ├── src/
    │   ├── main.py                 # CLI entry point
    │   ├── processors/             # Content processors
    │   │   ├── pdf_processor.py
    │   │   ├── youtube_processor.py
    │   │   ├── text_processor.py
    │   │   ├── html_processor.py
    │   │   ├── video_processor.py
    │   │   └── unified.py          # ContentProcessor router
    │   ├── rag/                    # RAG system
    │   │   ├── vector_store.py     # FAISS vector store manager
    │   │   ├── llm_manager.py      # LLM provider abstraction
    │   │   └── neet_rag.py         # Main RAG orchestrator
    │   └── utils/
    │       ├── config.py           # YAML config loader
    │       └── content_manager.py  # Source tracking + auto-updater
    ├── tests/
    │   ├── test_rag.py             # Unit tests
    │   └── test_data/              # Sample NEET content
    │       ├── text/               # Physics & Chemistry notes
    │       └── html/               # Biology notes
    └── data/
        ├── sources.json            # Tracked content sources
        ├── faiss_index/            # FAISS vector database (generated)
        └── audio/                  # Cached audio downloads (generated)

Run the tests:

    python tests/test_rag.py

Environment variables:

| Variable | Required | Description |
|---|---|---|
| OPENAI_API_KEY | Yes | OpenRouter or OpenAI API key |
| OPENAI_BASE_URL | No | API base URL (default: OpenRouter) |
| OPENAI_MODEL_NAME | No | LLM model (default: gemini-2.0-flash) |
| YOUTUBE_API_KEY | No | YouTube Data API v3 key for metadata |
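
One way the variables above could be read at startup is sketched below. It is an illustration rather than the project's actual loader: `LLMSettings` and `load_settings` are hypothetical names, and the default values are assumptions taken from the sample .env entries (the OpenRouter base URL and the `google/gemini-2.0-flash-001` model ID).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed defaults, taken from the sample .env values in this README.
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"
DEFAULT_MODEL_NAME = "google/gemini-2.0-flash-001"


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    base_url: str
    model_name: str
    youtube_api_key: Optional[str] = None


def load_settings(env: Optional[Mapping[str, str]] = None) -> LLMSettings:
    """Build settings from environment variables, applying the defaults."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # The only required variable: fail early with a clear message.
        raise RuntimeError("OPENAI_API_KEY is required")
    return LLMSettings(
        api_key=api_key,
        base_url=env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL,
        model_name=env.get("OPENAI_MODEL_NAME") or DEFAULT_MODEL_NAME,
        youtube_api_key=env.get("YOUTUBE_API_KEY") or None,
    )
```

Accepting the environment as a parameter keeps the function easy to test with a plain dictionary instead of the real process environment.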
MIT License