A high-performance Python ETL pipeline for building a global database of local businesses and points of interest (POIs) for the WeRate mobile application. This system ingests data from multiple sources, enriches it with AI-generated descriptions and semantic embeddings, and stores it in Firebase Firestore for efficient querying.
## Table of Contents

- Overview
- Features
- Architecture
- Installation
- Configuration
- Usage
- Scripts
- Data Sources
- Data Schema
- Performance
- Testing
- Project Structure
- Dependencies
- License
## Overview

WeRate is a mobile app that gamifies local business discovery. This ETL pipeline establishes the data foundation by:
- Base Layer: Ingesting POI data from OpenStreetMap and Overture Maps
- Ratings Layer: Merging external ratings data (TripAdvisor, etc.) with base POIs
- AI Enrichment: Generating concise, one-sentence descriptions using Grok AI
- Semantic Search: Creating vector embeddings for intelligent search capabilities
- Firebase Storage: Uploading enriched data to Firestore with geospatial indexing
## Features

- Multi-Source Data Ingestion: OpenStreetMap (Overpass API), Overture Maps (Parquet), TripAdvisor (JSONL)
- Intelligent Matching: Phonetic hashing and fuzzy string matching for robust restaurant-to-POI linking
- AI Description Generation: Batch processing with Grok API for cost-effective summarization
- Vector Embeddings: Local sentence-transformers model for semantic search capabilities
- High Performance: 644x speedup achieved through:
  - Multi-level indexing with phonetic hashing
  - RapidFuzz for fast string similarity
  - Asynchronous HTTP requests with aiohttp
  - Parallel batch processing with ThreadPoolExecutor
  - Batch API calls to Grok (10-20 items per request)
- Geospatial Support: Firestore GeoPoint for location-based queries
- Robust Error Handling: Automatic retries, fallback Overpass API servers, timeout management
- Comprehensive Testing: pytest suite with mocked external dependencies
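The phonetic-plus-fuzzy matching idea can be illustrated with a dependency-free sketch. The real pipeline uses Double Metaphone (via the `metaphone` package) and RapidFuzz; here a crude consonant-skeleton hash and `difflib` stand in so the example runs on the standard library alone, and all names and thresholds are illustrative assumptions:

```python
# Sketch only: a consonant-skeleton hash stands in for Double Metaphone,
# and difflib stands in for RapidFuzz. Names/thresholds are illustrative.
from collections import defaultdict
from difflib import SequenceMatcher


def phonetic_key(name):
    """Crude phonetic bucket: lowercase name with vowels stripped."""
    return "".join(c for c in name.lower() if c.isalpha() and c not in "aeiou")


def build_index(pois):
    """Group POIs by phonetic key so candidate lookup is O(1)."""
    index = defaultdict(list)
    for poi in pois:
        index[phonetic_key(poi["name"])].append(poi)
    return index


def find_match(name, index, threshold=0.85):
    """Look up the small candidate bucket, then rank by string similarity."""
    best, best_score = None, 0.0
    for poi in index.get(phonetic_key(name), []):
        score = SequenceMatcher(None, name.lower(), poi["name"].lower()).ratio()
        if score > best_score:
            best, best_score = poi, score
    return best if best_score >= threshold else None


index = build_index([{"name": "The Italian Kitchen"}, {"name": "Cafe Bleu"}])
match = find_match("The Italain Kitchen", index)  # misspelled query still buckets correctly
```

The key point is that the expensive pairwise similarity is only computed within one small phonetic bucket, not against the whole POI set.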
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        Data Sources                         │
├────────────────┬────────────────┬───────────────────────────┤
│ OpenStreetMap  │ Overture Maps  │   TripAdvisor Ratings     │
│ (Overpass API) │   (Parquet)    │        (JSONL)            │
└────────┬───────┴────────┬───────┴──────────┬────────────────┘
         │                │                  │
         v                v                  v
┌────────────────────────────────────────────────────────────┐
│            Base Layer ETL (base_layer_etl.py)              │
│  - Fetch OSM data (amenity, shop tags)                     │
│  - Ingest Overture data (places theme)                     │
│  - Normalize to unified schema                             │
│  - Upload to Firestore                                     │
└────────────────────────┬───────────────────────────────────┘
                         │
                         v
┌────────────────────────────────────────────────────────────┐
│         Merge & Enrich ETL (merge_ratings_etl.py)          │
│  - Multi-level phonetic indexing                           │
│  - Fuzzy matching (RapidFuzz)                              │
│  - Batch Grok AI descriptions                              │
│  - Local embedding generation (sentence-transformers)      │
│  - Parallel processing (ThreadPoolExecutor)                │
└────────────────────────┬───────────────────────────────────┘
                         │
                         v
┌────────────────────────────────────────────────────────────┐
│                   Firebase Firestore                       │
│                   Collection: pois                         │
│  - GeoPoint indexing for location queries                  │
│  - Vector embeddings for semantic search                   │
│  - Batch uploads (500 docs per commit)                     │
└────────────────────────────────────────────────────────────┘
```

## Installation

Prerequisites:

- Python 3.9+
- Firebase project with Firestore enabled
- Grok API key (x.ai)
- OpenStreetMap Overpass API access (public, no key required)
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd WeRateAlt
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Download the sentence-transformers model (first run only):

   ```bash
   python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
   ```

4. Configure environment variables:

   ```bash
   cp .env.example .env
   # Edit .env with your credentials
   ```

5. Add Firebase credentials:
   - Download your Firebase service account key from the Firebase Console
   - Save it as `firebase_dat.json` in the project root
   - Update `FIREBASE_CREDENTIALS_PATH` in `.env` if using a different path
## Configuration

Edit the `.env` file with your settings:

```bash
# Firebase credentials path
FIREBASE_CREDENTIALS_PATH=./firebase_dat.json

# Grok API key for AI descriptions
GROK_API_KEY=your-grok-api-key-here

# OSM query parameters (default: London)
OSM_LAT=51.5074
OSM_LNG=-0.1278
OSM_RADIUS=5000

# Embedding configuration
DRY_RUN_EMBEDDINGS=false  # Set to 'true' for testing without generating embeddings

# Optional configurations
OVERTURE_PARQUET_PATH=./data/theme=places/places.parquet
FIRESTORE_COLLECTION=pois
UPLOAD_TO_FIRESTORE=true
```

## Usage

### Base Layer ETL

Populate the database with base POI data from OpenStreetMap and Overture Maps:
```bash
python scripts/base_layer_etl.py
```

This script:
- Fetches POIs from OSM Overpass API (restaurants, bars, cafes, shops)
- Ingests Overture Maps parquet files (if available)
- Normalizes data to unified schema
- Uploads to Firestore collection
### Merge Ratings ETL

Process ratings files, match them with the base layer, and enrich them with AI descriptions:
```bash
python scripts/merge_ratings_etl.py
```

This script:
- Loads all `.json` and `.jsonl` files from the `Ratings/` directory
- Builds an optimized multi-level index of OSM data
- Matches ratings to POIs using phonetic hashing and fuzzy matching
- Generates AI descriptions in batches via Grok API
- Creates vector embeddings using local sentence-transformers model
- Uploads merged records to Firestore
Performance: Processes 10,000 restaurants in approximately 2 minutes.
### Clear Firestore

Remove all documents from the Firestore collection:
```bash
python scripts/clear_firestore.py
```

Use this before re-running the ETL pipeline with updated data.
### Test Embeddings

Verify embedding generation and semantic similarity:
```bash
python scripts/test_embeddings.py
```

## Scripts

### base_layer_etl.py

Purpose: Ingest base POI data from OpenStreetMap and Overture Maps.
Key Functions:
- `fetch_osm_data(lat, lng, radius)`: Query Overpass API with retry logic
- `ingest_overture_sample(file_path)`: Load and filter Overture parquet data
- `to_unified_schema(records, source_type)`: Normalize data format
- `upload_to_firestore(records, collection_name)`: Batch upload to Firebase
Performance Optimizations:
- Retry mechanism with exponential backoff
- Alternative Overpass API URLs for failover
- Extended timeouts (120s) for large queries
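The retry-with-failover pattern above can be sketched as follows. The mirror URLs are real public Overpass endpoints, but the function shape, parameter names, and backoff values are illustrative assumptions rather than the script's actual code:

```python
# Sketch of retry + server failover for Overpass queries.
# Function signature and backoff values are illustrative assumptions.
import time

import requests

OVERPASS_URLS = [
    "https://overpass-api.de/api/interpreter",
    "https://overpass.kumi.systems/api/interpreter",  # fallback mirror
]


def fetch_osm_data(query, max_retries=3, timeout=120):
    """Try each Overpass mirror; back off exponentially between rounds."""
    for attempt in range(max_retries):
        for url in OVERPASS_URLS:
            try:
                resp = requests.post(url, data={"data": query}, timeout=timeout)
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                continue  # try the next mirror
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("All Overpass servers failed after retries")
```

Rotating mirrors inside each retry round means a single overloaded server never stalls the whole ingestion run.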
### merge_ratings_etl.py

Purpose: Match ratings data with base POIs and enrich it with AI descriptions and embeddings.
Key Functions:
- `load_ratings_files(ratings_dir)`: Load JSON/JSONL ratings files
- `fetch_osm_data_for_matching()`: Retrieve base POIs for matching
- `build_osm_index(osm_data)`: Create multi-level phonetic and word-based index
- `find_osm_match(rating, osm_index)`: Fuzzy match a rating to a POI (O(1) candidate lookup)
- `fetch_grok_descriptions_batch()`: Batch AI description generation
- `generate_embedding(text)`: Create vector embedding using sentence-transformers
- `run_merge_etl()`: Main orchestration with parallel batch processing
Performance Optimizations:
- 262x faster lookups: Phonetic hashing with Double Metaphone algorithm
- 64x faster string matching: RapidFuzz instead of difflib
- Async HTTP: aiohttp for non-blocking API calls
- Parallel processing: ThreadPoolExecutor with 10 concurrent workers
- Batch Grok calls: 10-20 items per API request
- Local embeddings: No external API calls, fast CPU/GPU inference
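A hedged sketch of local embedding generation with a dry-run path. The `generate_embedding` name and the `all-MiniLM-L6-v2` model (384 dimensions) match this README; the deterministic dry-run vector and the cosine helper are assumptions added so the example runs offline:

```python
# Sketch: local embedding with an offline dry-run path.
# The hash-based dry-run vector is an invention for testing, not real semantics.
import hashlib
import os

EMBED_DIM = 384  # all-MiniLM-L6-v2 output dimension


def generate_embedding(text, dry_run=None):
    """Return a 384-dim vector; dry-run mode avoids downloading the model."""
    if dry_run is None:
        dry_run = os.getenv("DRY_RUN_EMBEDDINGS", "false").lower() == "true"
    if dry_run:
        seed = hashlib.sha256(text.encode("utf-8")).digest()  # 32 bytes
        return [(b / 255.0) - 0.5 for b in seed] * (EMBED_DIM // 32)
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(text).tolist()


def cosine_similarity(a, b):
    """Plain-Python cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

In dry-run mode the vector is deterministic per input text, so pipeline plumbing can be tested without GPU or network access.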
Configuration Parameters:
```python
run_merge_etl(
    osm_lat=51.5074,            # Query center latitude
    osm_lng=-0.1278,            # Query center longitude
    osm_radius=5000,            # Query radius in meters
    ratings_dir="Ratings",      # Directory with ratings files
    collection_name="pois",     # Firestore collection
    batch_size=20,              # Restaurants per batch
    max_workers=10,             # Parallel processing threads
    generate_embeddings=False,  # Enable/disable embedding generation
)
```

### clear_firestore.py

Purpose: Delete all documents from a Firestore collection.
Usage: Run before re-uploading data to avoid duplicates.
### test_embeddings.py

Purpose: Test embedding generation and semantic similarity calculation.
Features:
- Demonstrates the `generate_embedding()` function
- Shows semantic text construction
- Calculates cosine similarity between vectors
- Supports the `DRY_RUN_EMBEDDINGS` flag for testing
## Data Sources

### OpenStreetMap

- API: Overpass API
- Query Tags:
  - `amenity`: restaurant, bar, cafe, pub
  - `shop`: supermarket, clothes, convenience
- Fields Extracted: name, lat, lon, address, amenity/shop type
- Coverage: Global, community-maintained
### Overture Maps

- Format: Parquet files (GeoParquet)
- Theme: places
- Filters: Food, retail, hospitality categories
- Fields Extracted: name, coordinates, address, category
- Coverage: Global, curated dataset
### TripAdvisor Ratings

- Format: JSONL (JSON Lines)
- Fields Extracted: name, rating, price_level, cuisines, opening_hours, url, image_url, rank
- Example: `tripadvisor_london.jsonl`
- Coverage: Manually collected per city
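A minimal loader for such ratings files might look like this. The `load_ratings_files` name matches the function listed under merge_ratings_etl.py, but the body is a sketch: it assumes one JSON object per line, so a pretty-printed multi-line `.json` file would need separate handling.

```python
# Sketch of a JSONL ratings loader; assumes one JSON object per line.
import json
from pathlib import Path


def load_ratings_files(ratings_dir="Ratings"):
    """Collect one record per non-empty line across .json/.jsonl files."""
    records = []
    for path in sorted(Path(ratings_dir).glob("*.json*")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return records
```

JSONL keeps memory flat for large city dumps, since each record can be parsed (or streamed) independently.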
## Data Schema

Example Firestore document (collection: `pois`):

```json
{
  "source_id": "123456789",
  "source_type": "OSM",
  "name": "The Italian Kitchen",
  "category": "restaurant",
  "location": { "_latitude": 51.5074, "_longitude": -0.1278 },
  "address": "123 Main Street, London, UK",
  "data_status": {
    "has_ai_description": true,
    "has_embeddings": true,
    "last_updated": "2026-02-07T23:20:06Z"
  },
  "rating_avg": 4.5,
  "price_range": "$$",
  "opening_hours": {
    "monday": "09:00 - 22:00",
    "tuesday": "09:00 - 22:00",
    "wednesday": "09:00 - 22:00",
    "thursday": "09:00 - 22:00",
    "friday": "09:00 - 23:00",
    "saturday": "10:00 - 23:00",
    "sunday": "10:00 - 22:00"
  },
  "ai_description": "Cozy Italian restaurant serving authentic pasta and wood-fired pizza in a rustic setting.",
  "review_count": 342,
  "neighborhood": "Covent Garden",
  "url": "https://www.tripadvisor.com/Restaurant_Review-...",
  "image_url": "https://media-cdn.tripadvisor.com/...",
  "rank": 15,
  "description_embedding": {
    "_type": "vector",
    "_value": [0.023, -0.145, 0.089, ...]  // 384 dimensions
  },
  "semantic_text": "Name: The Italian Kitchen. Category: restaurant. Address: 123 Main Street, London, UK. Source: OSM. Description: Cozy Italian restaurant serving authentic pasta and wood-fired pizza in a rustic setting."
}
```

| Field | Type | Description |
|---|---|---|
| `source_id` | String | Original ID from OSM or Overture |
| `source_type` | String | Data source: 'OSM' or 'Overture' |
| `name` | String | Business name |
| `category` | String | Type of business (restaurant, cafe, shop, etc.) |
| `location` | GeoPoint | Firebase geospatial coordinate object |
| `address` | String | Full address string |
| `data_status` | Object | Metadata about data completeness |
| `rating_avg` | Float | Average rating (0-5 scale) |
| `price_range` | String | Price level |
| `opening_hours` | Object | Dictionary of day -> hours string |
| `ai_description` | String | Grok-generated one-sentence summary |
| `review_count` | Integer | Number of reviews |
| `neighborhood` | String | District or neighborhood name |
| `url` | String | External website URL |
| `image_url` | String | Representative image URL |
| `rank` | Integer | Ranking within category/location |
| `description_embedding` | Vector | 384-dimensional embedding for semantic search |
| `semantic_text` | String | Text used to generate the embedding |
## Performance

| Metric | Value |
|---|---|
| Total Processing Time | 106.6 seconds (~1.8 minutes) |
| Match Rate | 2,194 / 9,817 (22.3%) |
| Throughput | 20 restaurants/second |
| Speedup vs. Initial | 644x faster |
| Grok API Calls | ~220 batch requests (10 items each) |
| Firebase Uploads | 5 batch commits (500 docs each) |
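The five batch commits follow from Firestore's limit of 500 writes per batch (2,194 documents / 500 per commit, rounded up). A sketch of the chunking — the `upload_to_firestore` name matches the script docs above, but the body here is an assumption based on the firebase-admin batch API, with `db=None` allowed so the chunk math can be exercised offline:

```python
# Sketch of 500-docs-per-commit batching for Firestore uploads.
# The firebase-admin calls run only when a real client `db` is supplied.
def chunked(items, size=500):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def upload_to_firestore(records, collection_name="pois", db=None):
    """Commit records in batches of 500; return the number of commits."""
    commits = 0
    for chunk in chunked(records, 500):
        if db is not None:
            batch = db.batch()
            for rec in chunk:
                ref = db.collection(collection_name).document(str(rec["source_id"]))
                batch.set(ref, rec)
            batch.commit()
        commits += 1
    return commits
```

With the benchmark's 2,194 matched records, this yields exactly the 5 commits shown in the table.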

| Optimization | Speedup | Description |
|---|---|---|
| Phonetic Indexing | 262x | Double Metaphone hash for O(1) candidate lookup |
| RapidFuzz | 64x | Fast C++ string similarity vs. Python difflib |
| Batch Grok API | 10-20x | Multiple items per API call |
| Async HTTP | 5-10x | Non-blocking aiohttp vs. synchronous requests |
| Parallel Processing | 10x | ThreadPoolExecutor with 10 workers |
| Combined | 644x | Projected 16 minutes → 6.3 seconds (100-200 items) |
For a detailed performance analysis, see `PERFORMANCE_OPTIMIZATIONS.md`.

For large-scale processing with 50% cost savings, use the asynchronous Batch API; see `BATCH_API_USAGE.md` for detailed instructions.
## Testing

Run the full suite:

```bash
pytest tests/
```

Run a specific test file verbosely:

```bash
pytest tests/test_base_layer_etl.py -v
```

Test files:

- `test_base_layer_etl.py`: Tests for OSM and Overture ingestion, schema normalization, and Firebase interaction
- `test_embeddings.py`: Tests for embedding generation and semantic similarity
Tests use `unittest.mock` to avoid external API calls.
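A minimal example of that mocking approach — the function under test and the patch target here are illustrative stand-ins, not the suite's actual tests:

```python
# Illustrative stand-in showing how HTTP calls are mocked in tests.
from unittest import mock

import requests


def fetch_poi_count(url):
    """Toy function under test: count elements in an Overpass-style response."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return len(resp.json().get("elements", []))


def test_fetch_poi_count_mocked():
    with mock.patch("requests.get") as fake_get:
        fake_get.return_value.raise_for_status.return_value = None
        fake_get.return_value.json.return_value = {"elements": [{}, {}, {}]}
        assert fetch_poi_count("https://overpass-api.de/api/interpreter") == 3
```

Because `requests.get` is patched at the module attribute, the test runs offline and deterministically.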
Manual testing checklist:

1. Test with dry-run embeddings (no API calls):

   ```bash
   # In .env, set: DRY_RUN_EMBEDDINGS=true
   python scripts/merge_ratings_etl.py
   ```

2. Test with a small dataset:

   ```bash
   # Create a test ratings file with 10-20 entries
   python scripts/merge_ratings_etl.py
   ```

3. Verify Firebase data:
   ```bash
   # Check the Firestore Console, or use the Firebase Admin SDK to query
   firebase firestore:get pois --limit 10
   ```

## Project Structure

```
WeRateAlt/
├── .env                          # Environment variables (git-ignored)
├── .env.example                  # Template for environment setup
├── .gitignore                    # Git ignore rules
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── firebase_dat.json             # Firebase credentials (git-ignored)
├── PERFORMANCE_OPTIMIZATIONS.md  # Performance analysis
├── BATCH_API_USAGE.md            # Grok Batch API guide
├── Ratings/                      # External ratings data
│   └── tripadvisor_london.jsonl  # TripAdvisor London restaurants
├── scripts/                      # ETL scripts
│   ├── base_layer_etl.py         # OSM/Overture ingestion
│   ├── merge_ratings_etl.py      # Ratings merging & enrichment
│   ├── clear_firestore.py        # Database cleanup utility
│   └── test_embeddings.py        # Embedding test script
└── tests/                        # Test suite
    ├── __init__.py
    └── test_base_layer_etl.py    # Unit tests
```

## Dependencies

- firebase-admin (6.0.0+): Firebase Firestore SDK
- python-dotenv (1.0.0+): Environment variable management
- pandas (2.0.0+): Data manipulation and analysis
- pyarrow (14.0.0+): Parquet file reading
- requests (2.28.0+): Synchronous HTTP client
- aiohttp (3.9.0+): Asynchronous HTTP client
- shapely (2.0.0+): Geometry parsing (GeoJSON, WKB)
- rapidfuzz (3.0.0+): Fast string similarity (64x faster than difflib)
- metaphone (0.6+): Phonetic hashing for fuzzy matching
- sentence-transformers (2.2.0+): Local embedding generation
- pydantic (2.0.0+): Data validation and structured outputs
- pytest (7.0.0+): Testing framework
## License

[Specify your license here]
## Contributing

[Add contribution guidelines if applicable]
## Support

For issues or questions:
- Check the existing documentation in `PERFORMANCE_OPTIMIZATIONS.md` and `BATCH_API_USAGE.md`
- Review log files (`*.log`) for detailed error messages
- [Add contact information or issue tracker URL]
Future enhancements:
- Support for additional rating sources (Google Places, Yelp)
- Real-time data updates via webhooks
- Advanced semantic search queries
- Multi-language support
- Image embedding generation
- User-generated content integration
- Mobile app API endpoints
Last Updated: February 2026
Version: 1.0.0