Complete self-hosted AI web scraping stack with ZERO external API costs!
This project combines powerful open-source tools to create a comprehensive web scraping and AI extraction system that runs entirely locally. No OpenAI, no Pinecone, no external APIs - just pure local AI power.
- Web Scraping: Extract content from any website
- AI Extraction: Structure data using local LLMs
- Semantic Search: Find similar content across scraped data
- Embedding Storage: Auto-generate and store vectors
- MCP Integration: Works with Claude Code and AI assistants
```bash
# 1. Clone this repository
git clone https://github.com/Maheidem/zero-cost-ai-scraper.git
cd zero-cost-ai-scraper

# 2. Copy environment template
cp .env.example .env

# 3. Start everything
docker-compose up -d

# 4. Install Ollama models (takes 15-30 minutes)
./scripts/install-models.sh

# 5. Verify setup
./scripts/health-check.sh
```

That's it! Your zero-cost AI scraper is ready.
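If you would rather verify the stack from Python than from the shell, a rough sketch like the one below probes each service port. It is not a replacement for `./scripts/health-check.sh`, whose actual checks are not shown in this README. Only port 3002 (Firecrawl) appears in this document; 11434 and 6333 are the upstream defaults for Ollama and Qdrant and are assumptions about what `docker-compose.yml` publishes. The names `SERVICES`, `port_is_open`, and `check_stack` are hypothetical.

```python
import socket

# Assumed host ports. Only 3002 (Firecrawl) appears in this README;
# 11434 and 6333 are upstream defaults for Ollama and Qdrant and may
# differ from what docker-compose.yml actually publishes.
SERVICES = {
    "firecrawl": 3002,
    "ollama": 11434,
    "qdrant": 6333,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_stack(host: str = "localhost", services: dict = SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in services.items()}
```

Calling `check_stack()` after `docker-compose up -d` returns a dictionary such as `{"firecrawl": True, ...}`. An open port only shows that something is listening, not that the service behind it is healthy.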
- Firecrawl: Web scraping engine (builds from source with patches)
- Ollama: Local LLM processing (qwen3-coder:30b, nomic-embed-text)
- Qdrant: Vector database for embeddings
- SearXNG: Privacy-focused web search
- MCP Server: 6 tools for AI assistant integration
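To illustrate how the Ollama component can produce embeddings, here is a minimal sketch that calls Ollama's `/api/embeddings` endpoint directly with the `nomic-embed-text` model listed above. It assumes Ollama is reachable on its default port 11434, which this README does not confirm, and it is not how the patched Firecrawl build necessarily talks to Ollama. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default port


def build_embedding_request(text: str, model: str = "nomic-embed-text",
                            base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str) -> list:
    """Send the request and return the embedding vector as a list of floats."""
    with urllib.request.urlopen(build_embedding_request(text), timeout=30) as resp:
        return json.load(resp)["embedding"]
```

With the stack running and the model installed, `embed("hello world")` should return one vector for the input text.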
- Zero External API Costs
- Privacy-First (everything runs locally)
- MCP Compatible (works with Claude Code)
- Auto-Embeddings (builds searchable knowledge base)
- Semantic Search (find similar content)
- One-Command Setup
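For readers new to semantic search, the idea behind "find similar content" can be sketched in a few lines: each document is stored as an embedding vector, and a query is matched against those vectors by a similarity score. In this stack Qdrant performs the search; the pure-Python version below is only an illustration using cosine similarity, and `cosine_similarity` and `top_matches` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_matches(query_vec, documents, limit=5):
    """Rank (doc_id, vector) pairs by similarity to query_vec, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

A vector database does the same ranking with an index, so it does not have to compare the query against every stored vector.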
| Operation | Time | Cost |
|---|---|---|
| AI Extraction | 15-20s | $0.00 |
| Web Scraping | 2-5s | $0.00 |
| Embedding Generation | 1-2s | $0.00 |
| Similarity Search | <500ms | $0.00 |
| Total Monthly Cost | - | $0.00 |
Compare this to cloud solutions that cost $50-200/month!
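The latencies in the table translate into a rough throughput ceiling. The sketch below assumes requests run strictly one after another with no concurrency, which is a simplification and not a measured property of this stack. The function names are illustrative.

```python
def pages_per_hour(seconds_per_page: float) -> float:
    """Sequential throughput for a given per-page latency."""
    return 3600 / seconds_per_page


# (fastest, slowest) latency in seconds, taken from the table above.
LATENCY_RANGES = {
    "AI Extraction": (15, 20),
    "Web Scraping": (2, 5),
}


def throughput_ranges(latencies=LATENCY_RANGES):
    """Return {operation: (min_pages_per_hour, max_pages_per_hour)}."""
    return {
        op: (pages_per_hour(slow), pages_per_hour(fast))
        for op, (fast, slow) in latencies.items()
    }
```

Under that assumption, AI extraction handles about 180 to 240 pages per hour and plain scraping about 720 to 1800.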
```text
AI Assistant (Claude Code) --> MCP Server (6 tools) --> Firecrawl (Web Scraping)

Backing services:
  SearXNG  (Web Search)
  Ollama   (qwen3-coder:30b, nomic-embed-text)
  Qdrant   (Vector DB)
```

```python
import requests

# Scrape and auto-store embeddings
response = requests.post("http://localhost:3002/v1/scrape", json={
    "url": "https://python.org/about",
    "formats": ["markdown"]
})

# Extract structured data using local AI
response = requests.post("http://localhost:3002/v1/extract", json={
    "urls": ["https://news.ycombinator.com"],
    "prompt": "Extract top stories and their scores",
    "schema": {
        "stories": {
            "type": "array",
            "items": {
                "title": "string",
                "score": "number",
                "url": "string"
            }
        }
    }
})
# Notice: "llmUsage": 0 (local processing!)

# Find similar content across all scraped data
response = requests.post("http://localhost:3002/v1/similarity-search", json={
    "query": "machine learning tutorials",
    "limit": 5
})
```

```javascript
// Use any of the 6 MCP tools:
// - firecrawl_scrape
// - firecrawl_extract
// - firecrawl_similarity_search
// - firecrawl_search
// - firecrawl_crawl
// - firecrawl_map

// Example: Extract data from Reddit
mcp__firecrawl-local__firecrawl_extract({
  "urls": ["https://reddit.com/r/programming"],
  "prompt": "Extract trending programming topics",
  "schema": {"topics": ["string"], "engagement": "number"}
})
```

```text
zero-cost-ai-scraper/
├── mcp-server/          # MCP server (our code, MIT license)
├── searxng/             # Search engine configuration
├── configs/             # Service configurations
├── scripts/             # Setup and maintenance scripts
├── patches/             # Firecrawl modification patches
├── examples/            # Usage examples
├── docker-compose.yml   # Complete stack definition
└── docs/                # Comprehensive documentation
```
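The `schema` in the extraction request shown earlier in this README uses a shorthand form. If the `/v1/extract` endpoint expects standard JSON Schema, which is an assumption this README neither confirms nor rules out, the same shape could be written out as sketched below. The `required` lists and the helper names are illustrative choices, not part of the project.

```python
import json


def stories_schema() -> dict:
    """JSON Schema for {"stories": [{"title", "score", "url"}, ...]}."""
    story = {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "score": {"type": "number"},
            "url": {"type": "string"},
        },
        # Illustrative choice: score is optional because not every
        # listing shows one.
        "required": ["title", "url"],
    }
    return {
        "type": "object",
        "properties": {"stories": {"type": "array", "items": story}},
        "required": ["stories"],
    }


def extract_payload(url: str, prompt: str) -> dict:
    """Assemble a request body in the shape used by the extract example."""
    return {"urls": [url], "prompt": prompt, "schema": stories_schema()}
```

The resulting dictionary can be passed as `json=extract_payload(...)` to `requests.post`, in place of the inline dictionary in the example.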
Original Projects (we build upon):
AI Assistants:
- Claude Code (via MCP)
- Any MCP-compatible tool
- Direct API integration
- Research Automation: Scrape academic papers, documentation, tutorials
- Content Aggregation: Build knowledge bases from web sources
- Competitive Intelligence: Monitor competitor websites and news
- Documentation Extraction: Pull API docs, guides, examples
- News Monitoring: Track industry news and developments
- No Data Leaves Your Machine: Everything runs locally
- No User Tracking: SearXNG doesn't store search queries
- Open Source: All code is auditable
- Self-Hosted: You control your data
- Docker & Docker Compose: For containerization
- 8GB+ RAM: For running Ollama models
- 50GB+ Disk: For model storage
- Internet: For initial setup and model download
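Before starting the stack, a quick preflight check against these requirements might look like the sketch below. It covers only Docker and disk space; RAM is left out because the Python standard library has no portable way to query it. The function names and the choice to check the current directory are assumptions.

```python
import shutil

MIN_FREE_DISK_GB = 50  # from the requirements list above


def free_disk_gb(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1024**3


def preflight(path: str = ".") -> dict:
    """Report whether Docker is on PATH and enough disk space is free."""
    return {
        "docker_on_path": shutil.which("docker") is not None,
        "disk_ok": free_disk_gb(path) >= MIN_FREE_DISK_GB,
    }
```

Docker may keep images and volumes on a different filesystem than the repository, so pass the path of Docker's data directory when that is the case.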
This project builds on amazing open-source work:
- Firecrawl by Sideguide Technologies Inc. (AGPL v3)
- SearXNG by SearXNG team (AGPL v3)
- Our integrations are MIT licensed
See CONTRIBUTING.md for guidelines.
- Our Code: MIT License (MCP server, configurations, scripts)
- Dependencies: Various (see individual project licenses)
- Patches: AGPL v3 compatible modifications
If this helps you save money on AI APIs, please ⭐ star this repository!
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See docs/ directory
Built with ❤️ for the open-source community
Making enterprise-level AI accessible to everyone, at zero cost.