A high-performance Python ETL pipeline for building a global database of local businesses and points of interest (POIs) for the WeRate mobile application. This system ingests data from multiple sources, enriches it with AI-generated descriptions and semantic embeddings, and stores it in Firebase Firestore for efficient querying.
## Table of Contents

- Overview
- Features
- Architecture
- Installation
- Configuration
- Usage
- Scripts
- Data Sources
- Data Schema
- Performance
- Testing
- Project Structure
- Dependencies
- License
## Overview

WeRate is a mobile app that gamifies local business discovery. This ETL pipeline establishes the data foundation by:
- Base Layer: Ingesting POI data from OpenStreetMap and Overture Maps
- Ratings Layer: Merging external ratings data (TripAdvisor, etc.) with base POIs
- AI Enrichment: Generating concise, one-sentence descriptions using Grok AI
- Semantic Search: Creating vector embeddings for intelligent search capabilities
- Firebase Storage: Uploading enriched data to Firestore with geospatial indexing
## Features

- Multi-Source Data Ingestion: OpenStreetMap (Overpass API), Overture Maps (Parquet), TripAdvisor (JSONL)
- Intelligent Matching: Phonetic hashing and fuzzy string matching for robust restaurant-to-POI linking
- AI Description Generation: Batch processing with Grok API for cost-effective summarization
- Vector Embeddings: Local sentence-transformers model for semantic search capabilities
- High Performance: 644x speedup achieved through:
  - Multi-level indexing with phonetic hashing
  - RapidFuzz for fast string similarity
  - Asynchronous HTTP requests with aiohttp
  - Parallel batch processing with ThreadPoolExecutor
  - Batch API calls to Grok (10-20 items per request)
- Geospatial Support: Firestore GeoPoint for location-based queries
- Robust Error Handling: Automatic retries, fallback Overpass API servers, timeout management
- Comprehensive Testing: pytest suite with mocked external dependencies
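The phonetic-plus-fuzzy matching idea can be illustrated with a dependency-free sketch. The real pipeline uses Double Metaphone (via the `metaphone` package) and RapidFuzz; here a crude consonant-skeleton hash and `difflib` stand in so the example runs on the standard library alone, and all names and thresholds are illustrative assumptions:

```python
# Sketch only: a consonant-skeleton hash stands in for Double Metaphone,
# and difflib stands in for RapidFuzz. Names/thresholds are illustrative.
from collections import defaultdict
from difflib import SequenceMatcher


def phonetic_key(name):
    """Crude phonetic bucket: lowercase name with vowels stripped."""
    return "".join(c for c in name.lower() if c.isalpha() and c not in "aeiou")


def build_index(pois):
    """Group POIs by phonetic key so candidate lookup is O(1)."""
    index = defaultdict(list)
    for poi in pois:
        index[phonetic_key(poi["name"])].append(poi)
    return index


def find_match(name, index, threshold=0.85):
    """Look up the small candidate bucket, then rank by string similarity."""
    best, best_score = None, 0.0
    for poi in index.get(phonetic_key(name), []):
        score = SequenceMatcher(None, name.lower(), poi["name"].lower()).ratio()
        if score > best_score:
            best, best_score = poi, score
    return best if best_score >= threshold else None


index = build_index([{"name": "The Italian Kitchen"}, {"name": "Cafe Bleu"}])
match = find_match("The Italain Kitchen", index)  # misspelled query still buckets correctly
```

The key point is that the expensive pairwise similarity is only computed within one small phonetic bucket, not against the whole POI set.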
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        Data Sources                         │
├────────────────┬────────────────┬───────────────────────────┤
│ OpenStreetMap  │ Overture Maps  │   TripAdvisor Ratings     │
│ (Overpass API) │   (Parquet)    │        (JSONL)            │
└────────┬───────┴────────┬───────┴──────────┬────────────────┘
         │                │                  │
         v                v                  v
┌────────────────────────────────────────────────────────────┐
│            Base Layer ETL (base_layer_etl.py)              │
│  - Fetch OSM data (amenity, shop tags)                     │
│  - Ingest Overture data (places theme)                     │
│  - Normalize to unified schema                             │
│  - Upload to Firestore                                     │
└────────────────────────┬───────────────────────────────────┘
                         │
                         v
┌────────────────────────────────────────────────────────────┐
│         Merge & Enrich ETL (merge_ratings_etl.py)          │
│  - Multi-level phonetic indexing                           │
│  - Fuzzy matching (RapidFuzz)                              │
│  - Batch Grok AI descriptions                              │
│  - Local embedding generation (sentence-transformers)      │
│  - Parallel processing (ThreadPoolExecutor)                │
└────────────────────────┬───────────────────────────────────┘
                         │
                         v
┌────────────────────────────────────────────────────────────┐
│                   Firebase Firestore                       │
│                   Collection: pois                         │
│  - GeoPoint indexing for location queries                  │
│  - Vector embeddings for semantic search                   │
│  - Batch uploads (500 docs per commit)                     │
└────────────────────────────────────────────────────────────┘
```

## Installation

Prerequisites:

- Python 3.9+
- Firebase project with Firestore enabled
- Grok API key (x.ai)
- OpenStreetMap Overpass API access (public, no key required)
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd WeRateAlt
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Download the sentence-transformers model (first run only):

   ```bash
   python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
   ```

4. Configure environment variables:

   ```bash
   cp .env.example .env
   # Edit .env with your credentials
   ```

5. Add Firebase credentials:
   - Download your Firebase service account key from the Firebase Console
   - Save it as `firebase_dat.json` in the project root
   - Update `FIREBASE_CREDENTIALS_PATH` in `.env` if using a different path
## Configuration

Edit the `.env` file with your settings:

```bash
# Firebase credentials path
FIREBASE_CREDENTIALS_PATH=./firebase_dat.json

# Grok API key for AI descriptions
GROK_API_KEY=your-grok-api-key-here

# OSM query parameters (default: London)
OSM_LAT=51.5074
OSM_LNG=-0.1278
OSM_RADIUS=5000

# Embedding configuration
DRY_RUN_EMBEDDINGS=false  # Set to 'true' for testing without generating embeddings

# Optional configurations
OVERTURE_PARQUET_PATH=./data/theme=places/places.parquet
FIRESTORE_COLLECTION=pois
UPLOAD_TO_FIRESTORE=true
```

## Usage

### Base Layer ETL

Populate the database with base POI data from OpenStreetMap and Overture Maps:
```bash
python scripts/base_layer_etl.py
```

This script:
- Fetches POIs from OSM Overpass API (restaurants, bars, cafes, shops)
- Ingests Overture Maps parquet files (if available)
- Normalizes data to unified schema
- Uploads to Firestore collection
### Merge Ratings ETL

Process ratings files, match them with the base layer, and enrich them with AI descriptions:
```bash
python scripts/merge_ratings_etl.py
```

This script:
- Loads all `.json` and `.jsonl` files from the `Ratings/` directory
- Builds an optimized multi-level index of OSM data
- Matches ratings to POIs using phonetic hashing and fuzzy matching
- Generates AI descriptions in batches via Grok API
- Creates vector embeddings using local sentence-transformers model
- Uploads merged records to Firestore
Performance: Processes 10,000 restaurants in approximately 2 minutes.
### Clear Firestore

Remove all documents from the Firestore collection:
```bash
python scripts/clear_firestore.py
```

Use this before re-running the ETL pipeline with updated data.
### Test Embeddings

Verify embedding generation and semantic similarity:
```bash
python scripts/test_embeddings.py
```

## Scripts

### base_layer_etl.py

Purpose: Ingest base POI data from OpenStreetMap and Overture Maps.
Key Functions:
- `fetch_osm_data(lat, lng, radius)`: Query Overpass API with retry logic
- `ingest_overture_sample(file_path)`: Load and filter Overture parquet data
- `to_unified_schema(records, source_type)`: Normalize data format
- `upload_to_firestore(records, collection_name)`: Batch upload to Firebase
Performance Optimizations:
- Retry mechanism with exponential backoff
- Alternative Overpass API URLs for failover
- Extended timeouts (120s) for large queries
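The retry-with-failover pattern above can be sketched as follows. The mirror URLs are real public Overpass endpoints, but the function shape, parameter names, and backoff values are illustrative assumptions rather than the script's actual code:

```python
# Sketch of retry + server failover for Overpass queries.
# Function signature and backoff values are illustrative assumptions.
import time

import requests

OVERPASS_URLS = [
    "https://overpass-api.de/api/interpreter",
    "https://overpass.kumi.systems/api/interpreter",  # fallback mirror
]


def fetch_osm_data(query, max_retries=3, timeout=120):
    """Try each Overpass mirror; back off exponentially between rounds."""
    for attempt in range(max_retries):
        for url in OVERPASS_URLS:
            try:
                resp = requests.post(url, data={"data": query}, timeout=timeout)
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                continue  # try the next mirror
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("All Overpass servers failed after retries")
```

Rotating mirrors inside each retry round means a single overloaded server never stalls the whole ingestion run.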
### merge_ratings_etl.py

Purpose: Match ratings data with base POIs and enrich it with AI descriptions and embeddings.
Key Functions:
- `load_ratings_files(ratings_dir)`: Load JSON/JSONL ratings files
- `fetch_osm_data_for_matching()`: Retrieve base POIs for matching
- `build_osm_index(osm_data)`: Create multi-level phonetic and word-based index
- `find_osm_match(rating, osm_index)`: Fuzzy match a rating to a POI (O(1) candidate lookup)
- `fetch_grok_descriptions_batch()`: Batch AI description generation
- `generate_embedding(text)`: Create vector embedding using sentence-transformers
- `run_merge_etl()`: Main orchestration with parallel batch processing
Performance Optimizations:
- 262x faster lookups: Phonetic hashing with Double Metaphone algorithm
- 64x faster string matching: RapidFuzz instead of difflib
- Async HTTP: aiohttp for non-blocking API calls
- Parallel processing: ThreadPoolExecutor with 10 concurrent workers
- Batch Grok calls: 10-20 items per API request
- Local embeddings: No external API calls, fast CPU/GPU inference
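A hedged sketch of local embedding generation with a dry-run path. The `generate_embedding` name and the `all-MiniLM-L6-v2` model (384 dimensions) match this README; the deterministic dry-run vector and the cosine helper are assumptions added so the example runs offline:

```python
# Sketch: local embedding with an offline dry-run path.
# The hash-based dry-run vector is an invention for testing, not real semantics.
import hashlib
import os

EMBED_DIM = 384  # all-MiniLM-L6-v2 output dimension


def generate_embedding(text, dry_run=None):
    """Return a 384-dim vector; dry-run mode avoids downloading the model."""
    if dry_run is None:
        dry_run = os.getenv("DRY_RUN_EMBEDDINGS", "false").lower() == "true"
    if dry_run:
        seed = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
        return [(b / 255.0) - 0.5 for b in seed] * (EMBED_DIM // 32)
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(text).tolist()


def cosine_similarity(a, b):
    """Plain-Python cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

In dry-run mode the vector is deterministic per input text, so pipeline plumbing can be tested without GPU or network access.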
Configuration Parameters:
```python
run_merge_etl(
    osm_lat=51.5074,            # Query center latitude
    osm_lng=-0.1278,            # Query center longitude
    osm_radius=5000,            # Query radius in meters
    ratings_dir="Ratings",      # Directory with ratings files
    collection_name="pois",     # Firestore collection
    batch_size=20,              # Restaurants per batch
    max_workers=10,             # Parallel processing threads
    generate_embeddings=False,  # Enable/disable embedding generation
)
```

### clear_firestore.py

Purpose: Delete all documents from a Firestore collection.
Usage: Run before re-uploading data to avoid duplicates.
### test_embeddings.py

Purpose: Test embedding generation and semantic similarity calculation.
Features:
- Demonstrates the `generate_embedding()` function
- Shows semantic text construction
- Calculates cosine similarity between vectors
- Supports the `DRY_RUN_EMBEDDINGS` flag for testing
## Data Sources

### OpenStreetMap

- API: Overpass API
- Query Tags:
  - `amenity`: restaurant, bar, cafe, pub
  - `shop`: supermarket, clothes, convenience
- Fields Extracted: name, lat, lon, address, amenity/shop type
- Coverage: Global, community-maintained
### Overture Maps

- Format: Parquet files (GeoParquet)
- Theme: places
- Filters: Food, retail, hospitality categories
- Fields Extracted: name, coordinates, address, category
- Coverage: Global, curated dataset
### TripAdvisor Ratings

- Format: JSONL (JSON Lines)
- Fields Extracted: name, rating, price_level, cuisines, opening_hours, url, image_url, rank
- Example: `tripadvisor_london.jsonl`
- Coverage: Manually collected per city
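A minimal loader for such ratings files might look like this. The `load_ratings_files` name matches the function listed under merge_ratings_etl.py, but the body is a sketch: it assumes one JSON object per line, so a pretty-printed multi-line `.json` file would need separate handling.

```python
# Sketch of a JSONL ratings loader; assumes one JSON object per line.
import json
from pathlib import Path


def load_ratings_files(ratings_dir="Ratings"):
    """Collect one record per non-empty line across .json/.jsonl files."""
    records = []
    for path in sorted(Path(ratings_dir).glob("*.json*")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return records
```

JSONL keeps memory flat for large city dumps, since each record can be parsed (or streamed) independently.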
## Data Schema

Example Firestore document (collection: `pois`):

```json
{
  "source_id": "123456789",
  "source_type": "OSM",
  "name": "The Italian Kitchen",
  "category": "restaurant",
  "location": { "_latitude": 51.5074, "_longitude": -0.1278 },
  "address": "123 Main Street, London, UK",
  "data_status": {
    "has_ai_description": true,
    "has_embeddings": true,
    "last_updated": "2026-02-07T23:20:06Z"
  },
  "rating_avg": 4.5,
  "price_range": "$$",
  "opening_hours": {
    "monday": "09:00 - 22:00",
    "tuesday": "09:00 - 22:00",
    "wednesday": "09:00 - 22:00",
    "thursday": "09:00 - 22:00",
    "friday": "09:00 - 23:00",
    "saturday": "10:00 - 23:00",
    "sunday": "10:00 - 22:00"
  },
  "ai_description": "Cozy Italian restaurant serving authentic pasta and wood-fired pizza in a rustic setting.",
  "review_count": 342,
  "neighborhood": "Covent Garden",
  "url": "https://www.tripadvisor.com/Restaurant_Review-...",
  "image_url": "https://media-cdn.tripadvisor.com/...",
  "rank": 15,
  "description_embedding": {
    "_type": "vector",
    "_value": [0.023, -0.145, 0.089, ...]  // 384 dimensions
  },
  "semantic_text": "Name: The Italian Kitchen. Category: restaurant. Address: 123 Main Street, London, UK. Source: OSM. Description: Cozy Italian restaurant serving authentic pasta and wood-fired pizza in a rustic setting."
}
```

| Field | Type | Description |
|---|---|---|
| `source_id` | String | Original ID from OSM or Overture |
| `source_type` | String | Data source: 'OSM' or 'Overture' |
| `name` | String | Business name |
| `category` | String | Type of business (restaurant, cafe, shop, etc.) |
| `location` | GeoPoint | Firebase geospatial coordinate object |
| `address` | String | Full address string |
| `data_status` | Object | Metadata about data completeness |
| `rating_avg` | Float | Average rating (0-5 scale) |
| `price_range` | String | Price level |
| `opening_hours` | Object | Dictionary of day -> hours string |
| `ai_description` | String | Grok-generated one-sentence summary |
| `review_count` | Integer | Number of reviews |
| `neighborhood` | String | District or neighborhood name |
| `url` | String | External website URL |
| `image_url` | String | Representative image URL |
| `rank` | Integer | Ranking within category/location |
| `description_embedding` | Vector | 384-dimensional embedding for semantic search |
| `semantic_text` | String | Text used to generate the embedding |
## Performance

| Metric | Value |
|---|---|
| Total Processing Time | 106.6 seconds (~1.8 minutes) |
| Match Rate | 2,194 / 9,817 (22.3%) |
| Throughput | 20 restaurants/second |
| Speedup vs. Initial | 644x faster |
| Grok API Calls | ~220 batch requests (10 items each) |
| Firebase Uploads | 5 batch commits (500 docs each) |
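The five batch commits follow from Firestore's limit of 500 writes per batch (2,194 documents / 500 per commit, rounded up). A sketch of the chunking — the `upload_to_firestore` name matches the script docs above, but the body here is an assumption based on the firebase-admin batch API, with `db=None` allowed so the chunk math can be exercised offline:

```python
# Sketch of 500-docs-per-commit batching for Firestore uploads.
# The firebase-admin calls run only when a real client `db` is supplied.
def chunked(items, size=500):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def upload_to_firestore(records, collection_name="pois", db=None):
    """Commit records in batches of 500; return the number of commits."""
    commits = 0
    for chunk in chunked(records, 500):
        if db is not None:
            batch = db.batch()
            for rec in chunk:
                ref = db.collection(collection_name).document(str(rec["source_id"]))
                batch.set(ref, rec)
            batch.commit()
        commits += 1
    return commits
```

With the benchmark's 2,194 matched records, this yields exactly the 5 commits shown in the table.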

| Optimization | Speedup | Description |
|---|---|---|
| Phonetic Indexing | 262x | Double Metaphone hash for O(1) candidate lookup |
| RapidFuzz | 64x | Fast C++ string similarity vs. Python difflib |
| Batch Grok API | 10-20x | Multiple items per API call |
| Async HTTP | 5-10x | Non-blocking aiohttp vs. synchronous requests |
| Parallel Processing | 10x | ThreadPoolExecutor with 10 workers |
| Combined | 644x | Projected 16 minutes → 6.3 seconds (100-200 items) |
For a detailed performance analysis, see `PERFORMANCE_OPTIMIZATIONS.md`.

For large-scale processing with 50% cost savings, use the asynchronous Batch API; see `BATCH_API_USAGE.md` for detailed instructions.
## Testing

Run the full suite:

```bash
pytest tests/
```

Run a specific test file verbosely:

```bash
pytest tests/test_base_layer_etl.py -v
```

Test files:

- `test_base_layer_etl.py`: Tests for OSM and Overture ingestion, schema normalization, and Firebase interaction
- `test_embeddings.py`: Tests for embedding generation and semantic similarity
Tests use `unittest.mock` to avoid external API calls.
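A minimal example of that mocking approach — the function under test and the patch target here are illustrative stand-ins, not the suite's actual tests:

```python
# Illustrative stand-in showing how HTTP calls are mocked in tests.
from unittest import mock

import requests


def fetch_poi_count(url):
    """Toy function under test: count elements in an Overpass-style response."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return len(resp.json().get("elements", []))


def test_fetch_poi_count_mocked():
    with mock.patch("requests.get") as fake_get:
        fake_get.return_value.raise_for_status.return_value = None
        fake_get.return_value.json.return_value = {"elements": [{}, {}, {}]}
        assert fetch_poi_count("https://overpass-api.de/api/interpreter") == 3
```

Because `requests.get` is patched at the module attribute, the test runs offline and deterministically.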
Manual testing checklist:

1. Test with dry-run embeddings (no API calls):

   ```bash
   # In .env, set: DRY_RUN_EMBEDDINGS=true
   python scripts/merge_ratings_etl.py
   ```

2. Test with a small dataset:

   ```bash
   # Create a test ratings file with 10-20 entries
   python scripts/merge_ratings_etl.py
   ```

3. Verify Firebase data:
   ```bash
   # Check the Firestore Console, or use the Firebase Admin SDK to query
   firebase firestore:get pois --limit 10
   ```

## Project Structure

```
WeRateAlt/
├── .env                          # Environment variables (git-ignored)
├── .env.example                  # Template for environment setup
├── .gitignore                    # Git ignore rules
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── firebase_dat.json             # Firebase credentials (git-ignored)
├── PERFORMANCE_OPTIMIZATIONS.md  # Performance analysis
├── BATCH_API_USAGE.md            # Grok Batch API guide
├── Ratings/                      # External ratings data
│   └── tripadvisor_london.jsonl  # TripAdvisor London restaurants
├── scripts/                      # ETL scripts
│   ├── base_layer_etl.py         # OSM/Overture ingestion
│   ├── merge_ratings_etl.py      # Ratings merging & enrichment
│   ├── clear_firestore.py        # Database cleanup utility
│   └── test_embeddings.py        # Embedding test script
└── tests/                        # Test suite
    ├── __init__.py
    └── test_base_layer_etl.py    # Unit tests
```

## Dependencies

- firebase-admin (6.0.0+): Firebase Firestore SDK
- python-dotenv (1.0.0+): Environment variable management
- pandas (2.0.0+): Data manipulation and analysis
- pyarrow (14.0.0+): Parquet file reading
- requests (2.28.0+): Synchronous HTTP client
- aiohttp (3.9.0+): Asynchronous HTTP client
- shapely (2.0.0+): Geometry parsing (GeoJSON, WKB)
- rapidfuzz (3.0.0+): Fast string similarity (64x faster than difflib)
- metaphone (0.6+): Phonetic hashing for fuzzy matching
- sentence-transformers (2.2.0+): Local embedding generation
- pydantic (2.0.0+): Data validation and structured outputs
- pytest (7.0.0+): Testing framework
## License

[Specify your license here]
## Contributing

[Add contribution guidelines if applicable]
## Support

For issues or questions:
- Check the existing documentation in `PERFORMANCE_OPTIMIZATIONS.md` and `BATCH_API_USAGE.md`
- Review log files (`*.log`) for detailed error messages
- [Add contact information or issue tracker URL]
Future enhancements:
- Support for additional rating sources (Google Places, Yelp)
- Real-time data updates via webhooks
- Advanced semantic search queries
- Multi-language support
- Image embedding generation
- User-generated content integration
- Mobile app API endpoints
Last Updated: February 2026
Version: 1.0.0