Enterprise-ready Natural Language to SQL converter with multi-provider support
Convert natural language queries to optimized SQL using multiple AI providers. Built with Clean Architecture principles for enterprise-scale applications handling 1000+ table databases.
- Multiple AI Providers: OpenAI, Anthropic Claude, Google Gemini - no vendor lock-in
- Database Support: PostgreSQL, Amazon Redshift
- Large Schema Handling: Vector embeddings and intelligent filtering for 1000+ tables
- Smart Caching: Query and schema embedding caching for improved performance
- Async Support: Full async/await support
- Clean Architecture: Ports & Adapters pattern for maintainability
| Document | Description |
|---|---|
| Architecture | Component diagram and data flow |
| API Reference | Python API and CLI command reference |
| Configuration | Environment variables and schema filters |
| Enterprise Guide | Large-scale deployment and migration |
| Redshift Support | Amazon Redshift setup and examples |
| Contributing | Contribution guidelines |
```bash
# With UV (recommended)
uv add nlp2sql

# With pip
pip install nlp2sql

# With specific providers
pip install nlp2sql[anthropic,gemini]
pip install nlp2sql[all-providers]

# With embeddings
pip install nlp2sql[embeddings-local]   # Local embeddings (free)
pip install nlp2sql[embeddings-openai]  # OpenAI embeddings
```

Set an API key for your provider:

```bash
export OPENAI_API_KEY="your-openai-key"  # or ANTHROPIC_API_KEY, GOOGLE_API_KEY
```

```python
import asyncio

import nlp2sql
from nlp2sql import ProviderConfig


async def main():
    nlp = await nlp2sql.connect(
        "postgresql://user:pass@localhost:5432/mydb",
        provider=ProviderConfig(provider="openai", api_key="sk-..."),
    )
    result = await nlp.ask("Show me all active users")
    print(result.sql)
    print(result.confidence)
    print(result.is_valid)


asyncio.run(main())
```

`connect()` auto-detects the database type from the URL, loads the schema, and builds the FAISS embedding index. Subsequent `ask()` calls reuse everything from disk cache.
To seed few-shot examples, pass a list of dicts; `connect()` handles embedding and indexing automatically:
```python
nlp = await nlp2sql.connect(
    "redshift://user:pass@host:5439/db",
    provider=ProviderConfig(provider="openai", api_key="sk-..."),
    schema="dwh_data_share_llm",
    examples=[
        {
            "question": "Total revenue last month?",
            "sql": "SELECT SUM(revenue) FROM sales WHERE date >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')",
            "database_type": "redshift",
        },
    ],
)
result = await nlp.ask("Show me total sales this quarter")
```

Limit schema discovery with filters:

```python
nlp = await nlp2sql.connect(
    "postgresql://localhost/enterprise",
    provider=ProviderConfig(provider="anthropic", api_key="sk-ant-..."),
    schema_filters={
        "include_schemas": ["sales", "finance"],
        "exclude_system_tables": True,
    },
)
```

Tune the model and generation parameters via `ProviderConfig`:

```python
nlp = await nlp2sql.connect(
    "postgresql://localhost/mydb",
    provider=ProviderConfig(
        provider="openai",
        api_key="sk-...",
        model="gpt-4o",
        temperature=0.0,
        max_tokens=4000,
    ),
)
```

The same workflow is available from the CLI:

```bash
nlp2sql query \
  --database-url postgresql://user:pass@localhost:5432/mydb \
  --question "Show all active users" \
  --explain

nlp2sql inspect --database-url postgresql://localhost/mydb
```

For full control over the lifecycle, the lower-level API is still available:
```python
from nlp2sql import create_and_initialize_service, ProviderConfig, DatabaseType

service = await create_and_initialize_service(
    database_url="postgresql://localhost/mydb",
    provider_config=ProviderConfig(provider="openai", api_key="sk-..."),
    database_type=DatabaseType.POSTGRES,
)
result = await service.generate_sql("Count total users", database_type=DatabaseType.POSTGRES)
```

```
Question ──► Cache check ──► Schema retrieval ──► Relevance filtering ──► Context building ──► AI generation ──► Validation
                                    │                      │                      │
                            SchemaRepository     FAISS + TF-IDF hybrid    Reuses precomputed
                            (+ disk cache)       + batch scoring          relevance scores
```

- Schema retrieval -- Fetches tables from the database via `SchemaRepository` (with disk cache for Redshift)
- Relevance filtering -- FAISS dense search + TF-IDF sparse search (50/50 hybrid) finds candidate tables; batch scoring refines with precomputed embeddings
- Context building -- Builds optimized schema context within token limits, reusing scores from step 2 (zero additional embedding calls)
- SQL generation -- AI provider (OpenAI, Anthropic, or Gemini) generates SQL from question + schema context
- Validation -- SQL syntax and safety checks before returning results
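The 50/50 hybrid in the relevance-filtering step can be sketched as a min-max-normalized fusion of dense and sparse scores. The helper names (`minmax`, `fuse_scores`) and the sample scores below are illustrative assumptions, not nlp2sql's API:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Normalize scores to [0, 1] so dense and sparse ranks are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}


def fuse_scores(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5
) -> list[tuple[str, float]]:
    """Blend FAISS (dense) and TF-IDF (sparse) scores; alpha=0.5 is the 50/50 hybrid."""
    d, s = minmax(dense), minmax(sparse)
    tables = set(d) | set(s)
    fused = {t: alpha * d.get(t, 0.0) + (1 - alpha) * s.get(t, 0.0) for t in tables}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical per-table scores for one question
dense = {"sales": 0.92, "users": 0.40, "logs": 0.10}
sparse = {"sales": 3.1, "orders": 2.0}
ranked = fuse_scores(dense, sparse)
print(ranked[0][0])  # → sales
```

A table that ranks well in both searches dominates; normalization keeps either signal from drowning out the other.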
See Architecture for the detailed flow with method references and design decisions.
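The safety side of the validation step can be approximated with a small deny-list regex pass. The pattern set below is a hypothetical sketch, not nlp2sql's actual `RegexQueryValidator`:

```python
import re

# Hypothetical deny-list: reject statements that mutate data or schema.
DENYLIST = re.compile(
    r"\b(drop|delete|truncate|alter|update|insert|grant)\b", re.IGNORECASE
)


def is_safe(sql: str) -> bool:
    """Allow read-only queries; reject anything matching the deny-list."""
    return DENYLIST.search(sql) is None


print(is_safe("SELECT * FROM users WHERE active"))  # read-only → True
print(is_safe("DROP TABLE users"))                  # mutating  → False
```

Word boundaries matter here: column names like `last_updated` must not trip the `update` pattern.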
| Provider | Default Model | Context Size | Best For |
|---|---|---|---|
| OpenAI | gpt-4o-mini | 128K | Cost-effective, fast |
| Anthropic | claude-sonnet-4-20250514 | 200K | Large schemas |
| Google Gemini | gemini-2.0-flash | 1M | High volume |
All models are configurable via `ProviderConfig(model="...")`. See Configuration for details.
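As a back-of-the-envelope check on why context size matters at 1000+ tables: with a rough 4-characters-per-token heuristic you can estimate how many table definitions fit in each provider's window. The helper and the 80% headroom figure below are illustrative assumptions, not part of nlp2sql:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English/SQL text.
    return max(1, len(text) // 4)


ddl = (
    "CREATE TABLE sales (id BIGINT, customer_id BIGINT, "
    "revenue NUMERIC(12,2), created_at TIMESTAMP);"
)
per_table = approx_tokens(ddl)

for provider, window in [("128K window", 128_000), ("200K window", 200_000), ("1M window", 1_000_000)]:
    budget = int(window * 0.8)  # leave headroom for the question, examples, and answer
    print(provider, "~", budget // per_table, "small tables")
```

Even a 1M-token window cannot hold every wide table of a large warehouse with comments and sample values, which is why relevance filtering runs before context building.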
Clean Architecture (Ports & Adapters) with three layers: core entities, port interfaces, and adapter implementations. The schema management layer uses FAISS + TF-IDF hybrid search for relevance filtering at scale.
```
nlp2sql/
├── client.py     # DSL: connect() + NLP2SQL class (recommended entry point)
├── core/         # Pure Python: entities, ProviderConfig, QueryResult, sql_safety, sql_keywords
├── ports/        # Interfaces: AIProviderPort, SchemaRepositoryPort, EmbeddingProviderPort,
│                 #   ExampleRepositoryPort, QuerySafetyPort, QueryValidatorPort, CachePort
├── adapters/     # Implementations: OpenAI, Anthropic, Gemini, PostgreSQL, Redshift,
│                 #   RegexQueryValidator
├── services/     # Orchestration: QueryGenerationService
├── schema/       # Schema management: SchemaManager, SchemaAnalyzer, SchemaEmbeddingManager,
│                 #   ExampleStore
├── config/       # Pydantic Settings (centralized defaults)
└── exceptions/   # Exception hierarchy (NLP2SQLException -> 8 subclasses)
```

See Architecture for the full component diagram, data flow, and design decisions.
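The Ports & Adapters split can be illustrated with a minimal `typing.Protocol` sketch. The simplified one-method port and the `EchoProvider` toy adapter are assumptions for illustration; the real interfaces live in `ports/`:

```python
import asyncio
from typing import Protocol


class AIProviderPort(Protocol):
    """Port: the only surface the orchestration layer needs from an AI provider."""

    async def generate(self, prompt: str) -> str: ...


class EchoProvider:
    """Toy adapter satisfying the port; real adapters wrap the OpenAI/Anthropic/Gemini SDKs."""

    async def generate(self, prompt: str) -> str:
        return f"SELECT 1 -- for: {prompt}"


async def generate_sql(provider: AIProviderPort, question: str) -> str:
    # The service depends only on the port, so providers are swappable without code changes.
    return await provider.generate(question)


print(asyncio.run(generate_sql(EchoProvider(), "count users")))
```

Because `Protocol` uses structural typing, any adapter with a matching `generate` signature satisfies the port; no inheritance is required.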
```bash
# Clone and install
git clone https://github.com/luiscarbonel1991/nlp2sql.git
cd nlp2sql
uv sync

# Start test databases
cd docker && docker-compose up -d

# Run tests
uv run pytest

# Code quality
uv run ruff format .
uv run ruff check .
uv run mypy src/nlp2sql
```

nlp2sql includes a Model Context Protocol (MCP) server for AI assistant integration.
```json
{
  "mcpServers": {
    "nlp2sql": {
      "command": "python",
      "args": ["/path/to/nlp2sql/mcp_server/server.py"],
      "env": {
        "OPENAI_API_KEY": "${OPENAI_API_KEY}",
        "NLP2SQL_DEFAULT_DB_URL": "postgresql://user:pass@localhost:5432/mydb"
      }
    }
  }
}
```

Tools: `ask_database`, `explore_schema`, `run_sql`, `list_databases`, `explain_sql`
See mcp_server/README.md for complete setup.
We welcome contributions. See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE.
Luis Carbonel - @luiscarbonel1991
