GoodEd/neet_knowledge_project

 
 


NEET Knowledge RAG

A Retrieval-Augmented Generation (RAG) system for NEET aspirants in India. It supports multiple content types, including PDFs, YouTube videos, text notes, and HTML pages.

Features

  • Multi-format Support: Process PDFs, YouTube videos, audio/video files, text files, Markdown, and HTML
  • Flexible RAG Pipeline: Uses LangChain for robust document processing and retrieval
  • Local Embeddings: HuggingFace all-MiniLM-L6-v2 (free, no API key needed)
  • LLM via OpenRouter: Gemini 2.0 Flash (or any model from an OpenAI-compatible provider)
  • Vector Storage: FAISS for efficient similarity search
  • YouTube Audio Fallback: When subtitles are blocked, downloads audio and transcribes via Gemini multimodal
  • Streamlit Frontend: Web UI for managing sources and chatting with your knowledge base
  • CLI: Full command-line interface for ingestion, querying, and source management
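The YouTube audio fallback in the list above can be pictured as a small try-then-fall-back chain. The sketch below is only an illustration under stated assumptions: `transcribe_with_fallback` and the three injected callables (`fetch_subtitles`, `download_audio`, `transcribe_audio`) are hypothetical names, not functions from this repository, and the real processor may handle errors differently.

```python
from typing import Callable, Optional, Tuple


def transcribe_with_fallback(
    url: str,
    fetch_subtitles: Callable[[str], Optional[str]],
    download_audio: Callable[[str], str],
    transcribe_audio: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (transcript, method) for a video URL.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    try:
        subtitles = fetch_subtitles(url)
    except Exception:
        # Treat any subtitle failure (blocked, disabled, network) as "no subtitles".
        subtitles = None
    if subtitles:
        return subtitles, "subtitles"

    # Fallback path: download the audio track, then transcribe it.
    audio_path = download_audio(url)
    return transcribe_audio(audio_path), "audio"
```

Passing the steps in as callables keeps the chain testable without network access; a caller would plug in its own subtitle fetcher, downloader, and transcription call.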

Architecture

    ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │ Content         │────▶│ Processing       │────▶│ Vector Store    │
    │ Sources         │     │ (Chunking)       │     │ (FAISS)         │
    │ (YT/PDF/Text)   │     └──────────────────┘     └────────┬────────┘
    └─────────────────┘                                       │
                                                              ▼
    ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │ User Query      │────▶│ Retrieval        │────▶│ LLM Response    │
    │ (CLI/Streamlit) │     │ (Similarity)     │     │ (Generation)    │
    └─────────────────┘     └──────────────────┘     └─────────────────┘
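To make the chunking and similarity-retrieval stages of the diagram concrete, here is a toy, standard-library-only sketch. It is not the project's implementation: word-count vectors and a linear scan stand in for the all-MiniLM-L6-v2 embeddings and the FAISS index, and every function name (`chunk_text`, `embed`, `cosine`, `retrieve`, `build_prompt`) is hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Score every chunk against the query and return the k best."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(query: str, chunks: List[str], k: int = 2) -> str:
    """Assemble the retrieved context and the question into one LLM prompt."""
    context = "\n".join(c for _, c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In the real pipeline the retrieved chunks would come from the vector store and the prompt would be sent to the configured LLM; only the overall shape (chunk, embed, rank, prompt) carries over from this sketch.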

Installation

    # Clone the repository
    git clone <repository-url>
    cd neet_knowledge_project

    # Create virtual environment
    python -m venv venv
    source venv/bin/activate    # Linux/Mac
    # venv\Scripts\activate     # Windows

    # Install dependencies
    pip install -r requirements.txt

    # Copy and configure environment
    cp .env.example .env
    # Edit .env with your API keys

Configuration

Edit .env with your API keys:

    # Required: OpenRouter or OpenAI API key
    OPENAI_API_KEY=sk-or-v1-your-key-here
    OPENAI_BASE_URL=https://openrouter.ai/api/v1
    OPENAI_MODEL_NAME=google/gemini-2.0-flash-001

    # Optional: YouTube Data API for better metadata
    YOUTUBE_API_KEY=your-youtube-api-key
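As a rough illustration of how these variables might be consumed, the sketch below reads them from the process environment and applies defaults. It rests on several assumptions: the values have already been loaded into the environment (for example by a tool such as python-dotenv), the default model string is copied from the sample above rather than from the project's code, and `load_llm_settings` is a hypothetical helper, not a function in this repository.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"
DEFAULT_MODEL = "google/gemini-2.0-flash-001"  # assumption: taken from the sample .env


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Collect LLM settings from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # The key is the only required variable, so fail early and clearly.
        raise RuntimeError("OPENAI_API_KEY is required; set it in .env")
    return {
        "api_key": api_key,
        "base_url": env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL,
        "model": env.get("OPENAI_MODEL_NAME") or DEFAULT_MODEL,
        "youtube_api_key": env.get("YOUTUBE_API_KEY") or None,
    }
```

Accepting the mapping as a parameter makes the function easy to exercise with a plain dictionary instead of the real environment.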

Quick Start

Option 1: Streamlit Web UI (Recommended for Demo)

    streamlit run app.py

This opens a web interface where you can:

  • Add YouTube URLs or PDF paths via the sidebar
  • Click "Update/Ingest All Sources" to process them
  • Chat with your knowledge base in the main panel

Option 2: CLI

    # Ingest test data (text files)
    python -m src.main ingest ./tests/test_data/text/physics_notes.txt
    python -m src.main ingest ./tests/test_data/text/chemistry_notes.md
    python -m src.main ingest ./tests/test_data/html/biology_cell.html

    # Ingest a YouTube video
    python -m src.main ingest "https://www.youtube.com/watch?v=VIDEO_ID"

    # Query the knowledge base
    python -m src.main query "What are Newton's laws of motion?"

    # Interactive chat
    python -m src.main chat

    # Check system stats
    python -m src.main stats

Source Management (CLI)

    # Add a YouTube source for periodic updates
    python -m src.main source add youtube "https://youtube.com/watch?v=..." --title "Physics Lecture"

    # List all tracked sources
    python -m src.main source list

    # Update all sources that need refresh
    python -m src.main source update

    # Remove a source
    python -m src.main source remove <source_id>
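The commands shown in the CLI sections form a two-level command tree. One plausible way to express that tree is with `argparse` subparsers, sketched below. This is a guess at the shape, not the contents of `src/main.py`: the argument names (`source`, `question`, `source_type`, `url`, `source_id`) are hypothetical, and the real CLI may accept more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented commands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="python -m src.main")
    commands = parser.add_subparsers(dest="command", required=True)

    ingest = commands.add_parser("ingest", help="Ingest a file path or URL")
    ingest.add_argument("source")

    query = commands.add_parser("query", help="Ask a single question")
    query.add_argument("question")

    commands.add_parser("chat", help="Start an interactive chat")
    commands.add_parser("stats", help="Show system stats")

    # Second level: "source add|list|update|remove"
    source = commands.add_parser("source", help="Manage tracked sources")
    actions = source.add_subparsers(dest="action", required=True)

    add = actions.add_parser("add", help="Track a new source")
    add.add_argument("source_type")  # e.g. "youtube"
    add.add_argument("url")
    add.add_argument("--title", default=None)

    actions.add_parser("list", help="List tracked sources")
    actions.add_parser("update", help="Refresh sources that need it")

    remove = actions.add_parser("remove", help="Stop tracking a source")
    remove.add_argument("source_id")
    return parser
```

For example, `build_parser().parse_args(["source", "add", "youtube", "https://youtube.com/watch?v=...", "--title", "Physics Lecture"])` yields a namespace with `command="source"`, `action="add"`, and `title="Physics Lecture"`.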

Supported Content Types

Type       Extension/Format    Processing
---------  ------------------  -------------------------------------------------
Text       .txt                Direct text extraction
Markdown   .md, .markdown      Section-based extraction
HTML       .html, .htm         Main content extraction
PDF        .pdf                Text extraction + OCR fallback
YouTube    URL                 Subtitles → Audio download → Gemini transcription
Video      .mp4, .avi, .mov    Audio transcription
Audio      .mp3, .wav          Speech-to-text
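The table above implies a routing step that maps each input to a content type. A minimal sketch of such a lookup follows. It assumes routing is driven purely by file extension and URL host, which may not match the logic in `unified.py`. The names `EXTENSION_TYPES`, `YOUTUBE_HOSTS`, and `detect_content_type` are hypothetical, as is the list of YouTube hostnames.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension-to-type mapping copied from the table above.
EXTENSION_TYPES = {
    ".txt": "text",
    ".md": "markdown", ".markdown": "markdown",
    ".html": "html", ".htm": "html",
    ".pdf": "pdf",
    ".mp4": "video", ".avi": "video", ".mov": "video",
    ".mp3": "audio", ".wav": "audio",
}
# Assumption: these hostnames identify YouTube links.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def detect_content_type(source: str) -> str:
    """Return the content type for a file path or URL, or raise ValueError."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        if parsed.hostname in YOUTUBE_HOSTS:
            return "youtube"
        path = parsed.path  # other URLs fall through to the extension check
    else:
        path = source
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return EXTENSION_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported content source: {source!r}") from None
```

A router could then dispatch on the returned string to the matching processor and report unsupported inputs through the `ValueError`.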

Project Structure

    neet_knowledge_project/
    ├── app.py                      # Streamlit web frontend
    ├── config.yaml                 # Configuration file
    ├── requirements.txt            # Python dependencies
    ├── .env.example                # Environment variable template
    ├── src/
    │   ├── main.py                 # CLI entry point
    │   ├── processors/             # Content processors
    │   │   ├── pdf_processor.py
    │   │   ├── youtube_processor.py
    │   │   ├── text_processor.py
    │   │   ├── html_processor.py
    │   │   ├── video_processor.py
    │   │   └── unified.py          # ContentProcessor router
    │   ├── rag/                    # RAG system
    │   │   ├── vector_store.py     # FAISS vector store manager
    │   │   ├── llm_manager.py      # LLM provider abstraction
    │   │   └── neet_rag.py         # Main RAG orchestrator
    │   └── utils/
    │       ├── config.py           # YAML config loader
    │       └── content_manager.py  # Source tracking + auto-updater
    ├── tests/
    │   ├── test_rag.py             # Unit tests
    │   └── test_data/              # Sample NEET content
    │       ├── text/               # Physics & Chemistry notes
    │       └── html/               # Biology notes
    └── data/
        ├── sources.json            # Tracked content sources
        ├── faiss_index/            # FAISS vector database (generated)
        └── audio/                  # Cached audio downloads (generated)

Running Tests

    python tests/test_rag.py

Environment Variables

Variable            Required   Description
------------------  ---------  ------------------------------------
OPENAI_API_KEY      Yes        OpenRouter or OpenAI API key
OPENAI_BASE_URL     No         API base URL (default: OpenRouter)
OPENAI_MODEL_NAME   No         LLM model (default: gemini-2.0-flash)
YOUTUBE_API_KEY     No         YouTube Data API v3 key for metadata

License

MIT License
