# An ETL for the Mozilla Organization Firefox repositories
This repository contains a Python-based ETL (Extract, Transform, Load) script that processes pull request data from Mozilla Organization Firefox repositories on GitHub and loads it into Google BigQuery. The application runs in a Docker container for easy deployment and isolation.
## Features

- Containerized: Runs in a Docker container using the latest stable Python
- Secure: Runs as a non-root user (`app`) inside the container
- Streaming Architecture: Processes pull requests in chunks of 100 for memory efficiency
- BigQuery Integration: Loads data directly into BigQuery using the Python client library
- Rate Limit Handling: Automatically handles GitHub API rate limits (see the sketch after this list)
- Comprehensive Logging: Detailed logging for monitoring and debugging
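The rate-limit handling works roughly like the following sketch: inspect GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers and, when the quota is exhausted, sleep until it resets. This is an illustrative sketch, not the script's actual code; `get_with_rate_limit` is a hypothetical helper.

```python
import time
import requests

def get_with_rate_limit(url: str, headers: dict) -> requests.Response:
    """Illustrative sketch: GET a GitHub API URL, waiting out rate limits."""
    while True:
        resp = requests.get(url, headers=headers, timeout=30)
        # GitHub reports the remaining quota and the reset time (Unix epoch)
        # in these standard response headers.
        remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
        if resp.status_code == 403 and remaining == 0:
            reset_at = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
            wait = max(reset_at - time.time(), 0) + 1  # small safety margin
            time.sleep(wait)
            continue  # retry the same request after the quota resets
        resp.raise_for_status()
        return resp
```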
## Prerequisites

- GitHub App (recommended; required for authenticated runs): Create a GitHub App with read access to the target repositories, then note the numeric App ID and download a private key (PEM format); a sketch of the resulting token flow follows this list. Without these, the ETL runs unauthenticated (low rate-limit quota, suitable for testing only).
- Google Cloud Project: Set up a GCP project with BigQuery enabled
- BigQuery Dataset: Create a dataset in your GCP project
- Authentication: Configure GCP credentials (see Authentication section below)
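For reference, GitHub App authentication boils down to signing a short-lived JWT with the private key and exchanging it for an installation access token. A minimal sketch, assuming the `PyJWT` and `requests` packages; `installation_token` and the first-installation lookup are illustrative, not this project's actual code:

```python
import time
import jwt        # PyJWT
import requests

def installation_token(app_id: str, private_key_pem: str) -> str:
    """Sketch: exchange a GitHub App JWT for an installation access token."""
    now = int(time.time())
    # App JWTs are signed with the App's RSA private key (RS256) and may
    # live for at most ten minutes.
    app_jwt = jwt.encode(
        {"iat": now - 60, "exp": now + 540, "iss": app_id},
        private_key_pem,
        algorithm="RS256",
    )
    headers = {"Authorization": f"Bearer {app_jwt}",
               "Accept": "application/vnd.github+json"}
    # Look up the App's first installation, then mint a token for it.
    installs = requests.get("https://api.github.com/app/installations",
                            headers=headers, timeout=30)
    installs.raise_for_status()
    install_id = installs.json()[0]["id"]
    resp = requests.post(
        f"https://api.github.com/app/installations/{install_id}/access_tokens",
        headers=headers, timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["token"]
```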
## Quick Start

```bash
docker build -t github-etl .
```

Create an env file (do not commit it):
```
# github-etl.env
GITHUB_REPOS=mozilla-firefox/firefox
GITHUB_APP_ID=your_github_app_id
GITHUB_PRIVATE_KEY=<paste PEM contents here, with real newline characters (do not use "\n" escape sequences)>
BIGQUERY_PROJECT=your-gcp-project
BIGQUERY_DATASET=your_dataset
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

Then run the container using `--env-file` to avoid exposing secrets in shell history or via `/proc/<pid>/environ`:
```bash
docker run --rm \
  --env-file github-etl.env \
  -v /local/path/to/credentials.json:/path/to/credentials.json \
  github-etl
```

Note: Never pass the private key inline with `-e GITHUB_PRIVATE_KEY="$(cat ...)"`: that leaks the key into your shell history and makes it visible to other processes via `ps`/`/proc`. Use `--env-file`, Docker secrets, or a secret manager that injects `GITHUB_PRIVATE_KEY` as an environment variable instead.
## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `GITHUB_REPOS` | Yes | - | Comma-separated repositories in `owner/repo` format (e.g., `mozilla/firefox`) |
| `GITHUB_APP_ID` | No* | - | GitHub App numeric ID (found on the App's settings page). Required for authenticated access. |
| `GITHUB_PRIVATE_KEY` | No* | - | RSA private key in PEM format for the GitHub App. Required for authenticated access. |
| `BIGQUERY_PROJECT` | Yes | - | Google Cloud project ID |
| `BIGQUERY_DATASET` | Yes | - | BigQuery dataset ID |
| `GOOGLE_APPLICATION_CREDENTIALS` | Yes* | - | Path to a GCP service account JSON key file (*or use Workload Identity) |

\* `GITHUB_APP_ID` and `GITHUB_PRIVATE_KEY` must be set together for authenticated access.
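As an illustration of how these variables might be consumed at startup (the actual parsing in `main.py` may differ; `load_config` is a hypothetical helper):

```python
import os

def load_config() -> dict:
    """Sketch: read and validate the ETL's environment variables."""
    required = ["GITHUB_REPOS", "BIGQUERY_PROJECT", "BIGQUERY_DATASET"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise SystemExit(f"Missing required environment variables: {missing}")
    return {
        # "owner/repo,owner/repo" -> ["owner/repo", ...]
        "repos": [r.strip() for r in os.environ["GITHUB_REPOS"].split(",")],
        "app_id": os.environ.get("GITHUB_APP_ID"),            # optional: unauthenticated if absent
        "private_key": os.environ.get("GITHUB_PRIVATE_KEY"),  # must accompany GITHUB_APP_ID
        "project": os.environ["BIGQUERY_PROJECT"],
        "dataset": os.environ["BIGQUERY_DATASET"],
    }
```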
## Project Structure

- `main.py`: The main ETL script containing the business logic
- `requirements.txt`: Python dependencies
- `Dockerfile`: Container configuration
## Docker Image

- Base Image: `python:3.14-slim` (latest stable Python)
- User: `app` (uid: 1000, gid: 1000)
- Working Directory: `/app`
- Ownership: All files in `/app` are owned by the `app` user
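A Dockerfile along the following lines produces this layout. It is a sketch consistent with the points above, not necessarily the repository's exact file:

```dockerfile
FROM python:3.14-slim

# Create the non-root user (uid/gid 1000) the container runs as.
RUN groupadd --gid 1000 app && useradd --uid 1000 --gid 1000 --create-home app

WORKDIR /app

# Install dependencies first so code changes don't invalidate this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application and hand ownership to the app user.
COPY --chown=app:app . .
USER app

CMD ["python3", "main.py"]
```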
## Data Pipeline

The pipeline uses a streaming/chunked architecture that processes pull requests in batches of 100; a condensed sketch follows the list:
1. Extract: A generator yields chunks of 100 PRs from the GitHub API
   - Implements pagination and rate limit handling
   - Fetches all pull requests (open, closed, merged) sorted by creation date
2. Transform: Flattens and structures PR data for BigQuery
   - Extracts key fields (number, title, state, timestamps, user info)
   - Flattens nested objects (user, head/base branches)
   - Converts arrays (labels, assignees) to JSON strings
3. Load: Inserts transformed data into BigQuery
   - Uses the BigQuery Python client library
   - Adds a `snapshot_date` timestamp to all rows
   - Inserts each chunk immediately after it is transformed
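A condensed sketch of the three stages (field and function names are illustrative; the real `main.py` is more thorough):

```python
import json
from datetime import datetime, timezone

import requests
from google.cloud import bigquery

CHUNK_SIZE = 100

def extract(repo: str, token: str):
    """Extract: yield chunks of up to 100 PRs, paginating through the API."""
    url = f"https://api.github.com/repos/{repo}/pulls"
    params = {"state": "all", "sort": "created", "per_page": CHUNK_SIZE, "page": 1}
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            return
        yield page
        params["page"] += 1

def transform(prs: list[dict]) -> list[dict]:
    """Transform: flatten nested objects and serialize arrays to JSON strings."""
    snapshot = datetime.now(timezone.utc).isoformat()
    return [
        {
            "number": pr["number"],
            "title": pr["title"],
            "state": pr["state"],
            "created_at": pr["created_at"],
            "user_login": (pr.get("user") or {}).get("login"),  # flattened nested object
            "base_ref": pr["base"]["ref"],
            "head_ref": pr["head"]["ref"],
            "labels": json.dumps([l["name"] for l in pr.get("labels", [])]),
            "assignees": json.dumps([a["login"] for a in pr.get("assignees", [])]),
            "snapshot_date": snapshot,                           # added to every row
        }
        for pr in prs
    ]

def load(client: bigquery.Client, table_id: str, rows: list[dict]) -> None:
    """Load: insert each transformed chunk into BigQuery immediately."""
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
```

A driver loop then calls `load(...)` on `transform(chunk)` for each chunk that `extract(...)` yields, so at most one chunk of 100 PRs is held in memory at a time.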
Benefits of Chunked Processing:
- Memory-efficient for large repositories
- Incremental progress visibility
- Early failure detection
- Supports streaming data pipelines
## Authentication

The script uses the BigQuery Python client library, which supports multiple authentication methods:
1. Service Account Key File (Recommended for local development):

   ```bash
   export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
   ```

2. Workload Identity (Recommended for Kubernetes):
   - Configure Workload Identity on your GKE cluster
   - No explicit credentials file needed

3. Application Default Credentials (For local development):

   ```bash
   gcloud auth application-default login
   ```
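Whichever method you choose, the client library resolves credentials automatically; a minimal example of constructing the client:

```python
from google.cloud import bigquery

# Credentials are discovered via Application Default Credentials:
# GOOGLE_APPLICATION_CREDENTIALS, Workload Identity, or
# `gcloud auth application-default login`.
client = bigquery.Client(project="your-gcp-project")
```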
## Running Locally

Set up environment variables and run the script:

```bash
export GITHUB_REPOS="mozilla/firefox"
export GITHUB_APP_ID="your_github_app_id"
# Load the PEM from a file to avoid the key appearing in shell history
export GITHUB_PRIVATE_KEY="$(< your_private_key.pem)"
export BIGQUERY_PROJECT="your-gcp-project"
export BIGQUERY_DATASET="your_dataset"
python3 main.py
```

## Local Development with Docker Compose

For local development and testing, you can use Docker Compose to run the ETL with mocked services (no GitHub API rate limits or GCP credentials required):
```bash
# Start all services (mock GitHub API, BigQuery emulator, and ETL)
docker-compose up --build

# View logs
docker-compose logs -f github-etl

# Stop services
docker-compose down
```

This setup includes:
- Mock GitHub API: Generates 250 sample pull requests
- BigQuery Emulator: Local BigQuery instance for testing
- ETL Service: Configured to use both mock services
## Adding Dependencies

Add new Python packages to `requirements.txt` and rebuild the Docker image.
## License

This project is licensed under the Mozilla Public License Version 2.0. See the LICENSE file for details.