Dacron101/WeRateAlt
WeRate - Point of Interest Database ETL

A high-performance Python ETL pipeline for building a global database of local businesses and points of interest (POIs) for the WeRate mobile application. This system ingests data from multiple sources, enriches it with AI-generated descriptions and semantic embeddings, and stores it in Firebase Firestore for efficient querying.

Table of Contents

  • Overview
  • Features
  • Architecture
  • Installation
  • Configuration
  • Usage
  • Scripts
  • Data Sources
  • Data Schema
  • Performance
  • Testing
  • Project Structure
  • Dependencies
  • License
  • Contributing
  • Support
  • Roadmap

Overview

WeRate is a mobile app that gamifies local business discovery. This ETL pipeline establishes the foundation by:

  1. Base Layer: Ingesting POI data from OpenStreetMap and Overture Maps
  2. Ratings Layer: Merging external ratings data (TripAdvisor, etc.) with base POIs
  3. AI Enrichment: Generating concise, one-sentence descriptions using Grok AI
  4. Semantic Search: Creating vector embeddings for intelligent search capabilities
  5. Firebase Storage: Uploading enriched data to Firestore with geospatial indexing

Features

  • Multi-Source Data Ingestion: OpenStreetMap (Overpass API), Overture Maps (Parquet), TripAdvisor (JSONL)
  • Intelligent Matching: Phonetic hashing and fuzzy string matching for robust restaurant-to-POI linking
  • AI Description Generation: Batch processing with Grok API for cost-effective summarization
  • Vector Embeddings: Local sentence-transformers model for semantic search capabilities
  • High Performance: 644x speedup achieved through:
    • Multi-level indexing with phonetic hashing
    • RapidFuzz for fast string similarity
    • Asynchronous HTTP requests with aiohttp
    • Parallel batch processing with ThreadPoolExecutor
    • Batch API calls to Grok (10-20 items per request)
  • Geospatial Support: Firestore GeoPoint for location-based queries
  • Robust Error Handling: Automatic retries, fallback Overpass API servers, timeout management
  • Comprehensive Testing: pytest suite with mocked external dependencies

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Data Sources                         │
├────────────────┬────────────────┬───────────────────────────┤
│ OpenStreetMap  │ Overture Maps  │   TripAdvisor Ratings     │
│ (Overpass API) │   (Parquet)    │        (JSONL)            │
└────────┬───────┴────────┬───────┴──────────┬────────────────┘
         │                │                  │
         v                v                  v
┌────────────────────────────────────────────────────────────┐
│            Base Layer ETL (base_layer_etl.py)              │
│   - Fetch OSM data (amenity, shop tags)                    │
│   - Ingest Overture data (places theme)                    │
│   - Normalize to unified schema                            │
│   - Upload to Firestore                                    │
└────────────────────────┬───────────────────────────────────┘
                         │
                         v
┌────────────────────────────────────────────────────────────┐
│          Merge & Enrich ETL (merge_ratings_etl.py)         │
│   - Multi-level phonetic indexing                          │
│   - Fuzzy matching (RapidFuzz)                             │
│   - Batch Grok AI descriptions                             │
│   - Local embedding generation (sentence-transformers)     │
│   - Parallel processing (ThreadPoolExecutor)               │
└────────────────────────┬───────────────────────────────────┘
                         │
                         v
┌────────────────────────────────────────────────────────────┐
│                    Firebase Firestore                      │
│                    Collection: pois                        │
│   - GeoPoint indexing for location queries                 │
│   - Vector embeddings for semantic search                  │
│   - Batch uploads (500 docs per commit)                    │
└────────────────────────────────────────────────────────────┘

Installation

Prerequisites

  • Python 3.9+
  • Firebase project with Firestore enabled
  • Grok API key (x.ai)
  • OpenStreetMap Overpass API access (public, no key required)

Setup

  1. Clone the repository:
git clone <repository-url>
cd WeRateAlt
  2. Install dependencies:
pip install -r requirements.txt
  3. Download the sentence-transformers model (first run only):
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
  4. Configure environment variables:
cp .env.example .env
# Edit .env with your credentials
  5. Add Firebase credentials:
    • Download your Firebase service account key from Firebase Console
    • Save it as firebase_dat.json in the project root
    • Update FIREBASE_CREDENTIALS_PATH in .env if using a different path

Configuration

Edit .env file with your settings:

# Firebase credentials path
FIREBASE_CREDENTIALS_PATH=./firebase_dat.json

# Grok API key for AI descriptions
GROK_API_KEY=your-grok-api-key-here

# OSM query parameters (default: London)
OSM_LAT=51.5074
OSM_LNG=-0.1278
OSM_RADIUS=5000

# Embedding configuration
DRY_RUN_EMBEDDINGS=false  # Set to 'true' for testing without generating embeddings

# Optional configurations
OVERTURE_PARQUET_PATH=./data/theme=places/places.parquet
FIRESTORE_COLLECTION=pois
UPLOAD_TO_FIRESTORE=true

Usage

1. Base Layer Ingestion (Optional)

Populate the database with base POI data from OpenStreetMap and Overture Maps:

python scripts/base_layer_etl.py

This script:

  • Fetches POIs from OSM Overpass API (restaurants, bars, cafes, shops)
  • Ingests Overture Maps parquet files (if available)
  • Normalizes data to unified schema
  • Uploads to Firestore collection

2. Merge Ratings and Enrich

Process ratings files, match with base layer, and enrich with AI descriptions:

python scripts/merge_ratings_etl.py

This script:

  • Loads all .json and .jsonl files from Ratings/ directory
  • Builds optimized multi-level index of OSM data
  • Matches ratings to POIs using phonetic hashing and fuzzy matching
  • Generates AI descriptions in batches via Grok API
  • Creates vector embeddings using local sentence-transformers model
  • Uploads merged records to Firestore

Performance: Processes 10,000 restaurants in approximately 2 minutes.
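The loading step above can be sketched as follows. The real implementation lives in load_ratings_files in merge_ratings_etl.py; this stdlib-only version is an illustrative stand-in:

```python
import json
from pathlib import Path

def load_ratings_files(ratings_dir="Ratings"):
    """Load every .json and .jsonl file from the ratings directory."""
    records = []
    for path in sorted(Path(ratings_dir).iterdir()):
        if path.suffix not in (".json", ".jsonl"):
            continue
        with path.open(encoding="utf-8") as f:
            if path.suffix == ".jsonl":
                # JSON Lines: one record per non-empty line
                records.extend(json.loads(line) for line in f if line.strip())
            else:
                data = json.load(f)
                records.extend(data if isinstance(data, list) else [data])
    return records
```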

3. Clear Firestore Collection

Remove all documents from the Firestore collection:

python scripts/clear_firestore.py

Use this before re-running the ETL pipeline with updated data.

4. Test Embeddings

Verify embedding generation and semantic similarity:

python scripts/test_embeddings.py

Scripts

base_layer_etl.py

Purpose: Ingest base POI data from OpenStreetMap and Overture Maps.

Key Functions:

  • fetch_osm_data(lat, lng, radius): Query Overpass API with retry logic
  • ingest_overture_sample(file_path): Load and filter Overture parquet data
  • to_unified_schema(records, source_type): Normalize data format
  • upload_to_firestore(records, collection_name): Batch upload to Firebase

Performance Optimizations:

  • Retry mechanism with exponential backoff
  • Alternative Overpass API URLs for failover
  • Extended timeouts (120s) for large queries
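The retry-and-failover behavior can be sketched as below. The endpoint list and function names are illustrative (the script's exact API may differ); the fetch callable is injected so the logic is testable without network access:

```python
import time

# Fallback Overpass endpoints (illustrative; the script may configure others)
OVERPASS_URLS = [
    "https://overpass-api.de/api/interpreter",
    "https://overpass.kumi.systems/api/interpreter",
]

def fetch_with_retry(fetch, urls=OVERPASS_URLS, max_retries=3, base_delay=1.0):
    """Try each endpoint in turn; back off exponentially between full rounds.

    `fetch(url)` should return parsed data or raise on failure.
    """
    last_error = None
    for attempt in range(max_retries):
        for url in urls:
            try:
                return fetch(url)
            except Exception as exc:  # timeouts, HTTP errors, bad payloads
                last_error = exc
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"All Overpass endpoints failed: {last_error}")
```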

merge_ratings_etl.py

Purpose: Match ratings data with base POIs, enrich with AI descriptions and embeddings.

Key Functions:

  • load_ratings_files(ratings_dir): Load JSON/JSONL ratings files
  • fetch_osm_data_for_matching(): Retrieve base POIs for matching
  • build_osm_index(osm_data): Create multi-level phonetic and word-based index
  • find_osm_match(rating, osm_index): Fuzzy match rating to POI (O(1) lookup)
  • fetch_grok_descriptions_batch(): Batch AI description generation
  • generate_embedding(text): Create vector embedding using sentence-transformers
  • run_merge_etl(): Main orchestration with parallel batch processing

Performance Optimizations:

  • 262x faster lookups: Phonetic hashing with Double Metaphone algorithm
  • 64x faster string matching: RapidFuzz instead of difflib
  • Async HTTP: aiohttp for non-blocking API calls
  • Parallel processing: ThreadPoolExecutor with 10 concurrent workers
  • Batch Grok calls: 10-20 items per API request
  • Local embeddings: No external API calls, fast CPU/GPU inference
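The bucket-then-rank matching strategy can be illustrated with a self-contained sketch. The real pipeline uses metaphone's Double Metaphone and RapidFuzz (see requirements.txt); here a naive consonant-skeleton key and stdlib difflib stand in so the example runs without dependencies:

```python
import difflib
import re
from collections import defaultdict

def phonetic_key(name):
    """Naive stand-in for Double Metaphone: lowercase consonant skeleton."""
    return re.sub(r"[aeiou\W\d_]", "", name.lower())

def build_index(pois):
    """Bucket POIs by phonetic key so candidate lookup is O(1)."""
    index = defaultdict(list)
    for poi in pois:
        index[phonetic_key(poi["name"])].append(poi)
    return index

def find_match(rating_name, index, threshold=0.8):
    """O(1) bucket lookup, then fuzzy ranking within the small bucket."""
    candidates = index.get(phonetic_key(rating_name), [])
    best, best_score = None, threshold
    for poi in candidates:
        score = difflib.SequenceMatcher(
            None, rating_name.lower(), poi["name"].lower()
        ).ratio()
        if score >= best_score:
            best, best_score = poi, score
    return best
```

The key idea is that fuzzy scoring only runs inside one small bucket instead of against every POI, which is where the claimed lookup speedup comes from.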

Configuration Parameters:

run_merge_etl(
    osm_lat=51.5074,            # Query center latitude
    osm_lng=-0.1278,            # Query center longitude
    osm_radius=5000,            # Query radius in meters
    ratings_dir="Ratings",      # Directory with ratings files
    collection_name="pois",     # Firestore collection
    batch_size=20,              # Restaurants per batch
    max_workers=10,             # Parallel processing threads
    generate_embeddings=False   # Enable/disable embedding generation
)

clear_firestore.py

Purpose: Delete all documents from a Firestore collection.

Usage: Run before re-uploading data to avoid duplicates.

test_embeddings.py

Purpose: Test embedding generation and semantic similarity calculation.

Features:

  • Demonstrates generate_embedding() function
  • Shows semantic text construction
  • Calculates cosine similarity between vectors
  • Supports DRY_RUN_EMBEDDINGS flag for testing
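The similarity calculation reduces to plain vector math. A dependency-free helper equivalent to what test_embeddings.py computes on real 384-dimensional vectors might look like:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# With the real model (downloaded during setup), the vectors would come from:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vec = model.encode("Cozy Italian restaurant ...")  # 384-dim vector
```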

Data Sources

OpenStreetMap (OSM)

  • API: Overpass API
  • Query Tags:
    • amenity: restaurant, bar, cafe, pub
    • shop: supermarket, clothes, convenience
  • Fields Extracted: name, lat, lon, address, amenity/shop type
  • Coverage: Global, community-maintained
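A query for the tags above could be composed roughly like this (an illustrative sketch; the actual Overpass QL is built inside fetch_osm_data and may differ):

```python
def build_overpass_query(lat, lng, radius):
    """Compose an Overpass QL query for the amenity/shop tags listed above."""
    return f"""
[out:json][timeout:120];
(
  node["amenity"~"^(restaurant|bar|cafe|pub)$"](around:{radius},{lat},{lng});
  node["shop"~"^(supermarket|clothes|convenience)$"](around:{radius},{lat},{lng});
);
out body;
"""
```

The 120-second timeout mirrors the extended timeout the base layer script uses for large queries.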

Overture Maps

  • Format: Parquet files (GeoParquet)
  • Theme: places
  • Filters: Food, retail, hospitality categories
  • Fields Extracted: name, coordinates, address, category
  • Coverage: Global, curated dataset

TripAdvisor Ratings

  • Format: JSONL (JSON Lines)
  • Fields Extracted: name, rating, price_level, cuisines, opening_hours, url, image_url, rank
  • Example: tripadvisor_london.jsonl
  • Coverage: Manually collected per city

Data Schema

Firestore Document Structure

{
  "source_id": "123456789",
  "source_type": "OSM",
  "name": "The Italian Kitchen",
  "category": "restaurant",
  "location": { "_latitude": 51.5074, "_longitude": -0.1278 },
  "address": "123 Main Street, London, UK",
  "data_status": {
    "has_ai_description": true,
    "has_embeddings": true,
    "last_updated": "2026-02-07T23:20:06Z"
  },
  "rating_avg": 4.5,
  "price_range": "$$",
  "opening_hours": {
    "monday": "09:00 - 22:00",
    "tuesday": "09:00 - 22:00",
    "wednesday": "09:00 - 22:00",
    "thursday": "09:00 - 22:00",
    "friday": "09:00 - 23:00",
    "saturday": "10:00 - 23:00",
    "sunday": "10:00 - 22:00"
  },
  "ai_description": "Cozy Italian restaurant serving authentic pasta and wood-fired pizza in a rustic setting.",
  "review_count": 342,
  "neighborhood": "Covent Garden",
  "url": "https://www.tripadvisor.com/Restaurant_Review-...",
  "image_url": "https://media-cdn.tripadvisor.com/...",
  "rank": 15,
  "description_embedding": {
    "_type": "vector",
    "_value": [0.023, -0.145, 0.089, ...]  // 384 dimensions
  },
  "semantic_text": "Name: The Italian Kitchen. Category: restaurant. Address: 123 Main Street, London, UK. Source: OSM. Description: Cozy Italian restaurant serving authentic pasta and wood-fired pizza in a rustic setting."
}

Field Descriptions

Field | Type | Description
source_id | String | Original ID from OSM or Overture
source_type | String | Data source: 'OSM' or 'Overture'
name | String | Business name
category | String | Type of business (restaurant, cafe, shop, etc.)
location | GeoPoint | Firebase geospatial coordinate object
address | String | Full address string
data_status | Object | Metadata about data completeness
rating_avg | Float | Average rating (0-5 scale)
price_range | String | Price level ($, $$, $$$, $$$$)
opening_hours | Object | Dictionary of day -> hours string
ai_description | String | Grok-generated one-sentence summary
review_count | Integer | Number of reviews
neighborhood | String | District or neighborhood name
url | String | External website URL
image_url | String | Representative image URL
rank | Integer | Ranking within category/location
description_embedding | Vector | 384-dimensional embedding for semantic search
semantic_text | String | Text used to generate embedding
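Building semantic_text from the other fields is straightforward string assembly. A minimal sketch matching the format shown in the example document (the field names are taken from the schema; the helper name is hypothetical):

```python
def build_semantic_text(poi):
    """Join key fields into the text that gets embedded."""
    parts = [
        f"Name: {poi['name']}.",
        f"Category: {poi['category']}.",
        f"Address: {poi['address']}.",
        f"Source: {poi['source_type']}.",
    ]
    if poi.get("ai_description"):
        parts.append(f"Description: {poi['ai_description']}")
    return " ".join(parts)
```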

Performance

Benchmark Results (9,817 restaurants)

Metric | Value
Total Processing Time | 106.6 seconds (~1.8 minutes)
Match Rate | 2,194 / 9,817 (22.3%)
Throughput | 20 restaurants/second
Speedup vs. Initial | 644x faster
Grok API Calls | ~220 batch requests (10 items each)
Firebase Uploads | 5 batch commits (500 docs each)
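The "500 docs per commit" figure reflects Firestore's per-batch write limit. The chunking can be sketched with a plain helper; the commented Firestore calls are a hedged illustration and assume an initialized firebase-admin client named `db`:

```python
def chunked(records, size=500):
    """Split records into Firestore-sized batches (max 500 writes per commit)."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

# Hedged usage with firebase-admin (requires credentials; names illustrative):
#   for batch_records in chunked(records):
#       batch = db.batch()
#       for rec in batch_records:
#           batch.set(db.collection("pois").document(), rec)
#       batch.commit()
```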

Optimization Summary

Optimization | Speedup | Description
Phonetic Indexing | 262x | Double Metaphone hash for O(1) candidate lookup
RapidFuzz | 64x | Fast C++ string similarity vs. Python difflib
Batch Grok API | 10-20x | Multiple items per API call
Async HTTP | 5-10x | Non-blocking aiohttp vs. synchronous requests
Parallel Processing | 10x | ThreadPoolExecutor with 10 workers
Combined | 644x | Projected 16 minutes → 6.3 seconds (100-200 items)

For detailed performance analysis, see PERFORMANCE_OPTIMIZATIONS.md.

Grok Batch API

For large-scale processing with 50% cost savings, use the asynchronous Batch API. See BATCH_API_USAGE.md for detailed instructions.

Testing

Run All Tests

pytest tests/

Run Specific Test Module

pytest tests/test_base_layer_etl.py -v

Test Coverage

  • test_base_layer_etl.py: Tests for OSM and Overture ingestion, schema normalization, Firebase interaction
  • test_embeddings.py: Tests for embedding generation and semantic similarity

Tests use unittest.mock to avoid external API calls.

Manual Testing

  1. Test with dry run embeddings (no API calls):
# In .env, set DRY_RUN_EMBEDDINGS=true
python scripts/merge_ratings_etl.py
  2. Test with a small dataset:
# Create a test ratings file with 10-20 entries
python scripts/merge_ratings_etl.py
  3. Verify Firebase data:
# Check Firestore Console or use Firebase Admin SDK to query
firebase firestore:get pois --limit 10

Project Structure

WeRateAlt/
├── .env                          # Environment variables (git-ignored)
├── .env.example                  # Template for environment setup
├── .gitignore                    # Git ignore rules
├── README.md                     # This file
├── requirements.txt              # Python dependencies
├── firebase_dat.json             # Firebase credentials (git-ignored)
├── PERFORMANCE_OPTIMIZATIONS.md  # Performance analysis
├── BATCH_API_USAGE.md            # Grok Batch API guide
├── Ratings/                      # External ratings data
│   └── tripadvisor_london.jsonl  # TripAdvisor London restaurants
├── scripts/                      # ETL scripts
│   ├── base_layer_etl.py         # OSM/Overture ingestion
│   ├── merge_ratings_etl.py      # Ratings merging & enrichment
│   ├── clear_firestore.py        # Database cleanup utility
│   └── test_embeddings.py        # Embedding test script
└── tests/                        # Test suite
    ├── __init__.py
    └── test_base_layer_etl.py    # Unit tests

Dependencies

Core Libraries

  • firebase-admin (6.0.0+): Firebase Firestore SDK
  • python-dotenv (1.0.0+): Environment variable management
  • pandas (2.0.0+): Data manipulation and analysis
  • pyarrow (14.0.0+): Parquet file reading
  • requests (2.28.0+): Synchronous HTTP client
  • aiohttp (3.9.0+): Asynchronous HTTP client
  • shapely (2.0.0+): Geometry parsing (GeoJSON, WKB)

Performance & Matching

  • rapidfuzz (3.0.0+): Fast string similarity (64x faster than difflib)
  • metaphone (0.6+): Phonetic hashing for fuzzy matching

AI & Embeddings

  • sentence-transformers (2.2.0+): Local embedding generation
  • pydantic (2.0.0+): Data validation and structured outputs

Testing

  • pytest (7.0.0+): Testing framework

License

[Specify your license here]

Contributing

[Add contribution guidelines if applicable]

Support

For issues or questions:

  1. Check existing documentation in PERFORMANCE_OPTIMIZATIONS.md and BATCH_API_USAGE.md
  2. Review log files (*.log) for detailed error messages
  3. [Add contact information or issue tracker URL]

Roadmap

Future enhancements:

  • Support for additional rating sources (Google Places, Yelp)
  • Real-time data updates via webhooks
  • Advanced semantic search queries
  • Multi-language support
  • Image embedding generation
  • User-generated content integration
  • Mobile app API endpoints

Last Updated: February 2026
Version: 1.0.0
