# An ETL for the Mozilla Organization Firefox repositories
This repository contains a Python-based ETL (Extract, Transform, Load) script that processes pull request data from Mozilla Organization Firefox repositories on GitHub and loads it into Google BigQuery. The application runs in a Docker container for easy deployment and isolation.
## Features

- Containerized: Runs in a Docker container using the latest stable Python
- Secure: Runs as a non-root user (`app`) inside the container
- Streaming Architecture: Processes pull requests in chunks of 100 for memory efficiency
- BigQuery Integration: Loads data directly into BigQuery using the Python client library
- Rate Limit Handling: Automatically handles GitHub API rate limits (see the sketch after this list)
- Comprehensive Logging: Detailed logging for monitoring and debugging
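The rate-limit handling works roughly like the following sketch: inspect GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers and, when the quota is exhausted, sleep until it resets. This is an illustrative sketch, not the script's actual code; `get_with_rate_limit` is a hypothetical helper.

```python
import time
import requests

def get_with_rate_limit(url: str, headers: dict) -> requests.Response:
    """Illustrative sketch: GET a GitHub API URL, waiting out rate limits."""
    while True:
        resp = requests.get(url, headers=headers, timeout=30)
        # GitHub reports the remaining quota and the reset time (Unix epoch)
        # in these standard response headers.
        remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
        if resp.status_code == 403 and remaining == 0:
            reset_at = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
            wait = max(reset_at - time.time(), 0) + 1  # small safety margin
            time.sleep(wait)
            continue  # retry the same request after the quota resets
        resp.raise_for_status()
        return resp
```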
## Prerequisites

- GitHub App (recommended; required for authenticated runs): Create a GitHub App with read access to the target repositories, then note the numeric App ID and download a private key (PEM format); a sketch of the resulting token flow follows this list. Without these, the ETL runs unauthenticated (low rate-limit quota, suitable for testing only).
- Google Cloud Project: Set up a GCP project with BigQuery enabled
- BigQuery Dataset: Create a dataset in your GCP project
- Authentication: Configure GCP credentials (see Authentication section below)
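For reference, GitHub App authentication boils down to signing a short-lived JWT with the private key and exchanging it for an installation access token. A minimal sketch, assuming the `PyJWT` and `requests` packages; `installation_token` and the first-installation lookup are illustrative, not this project's actual code:

```python
import time
import jwt        # PyJWT
import requests

def installation_token(app_id: str, private_key_pem: str) -> str:
    """Sketch: exchange a GitHub App JWT for an installation access token."""
    now = int(time.time())
    # App JWTs are signed with the App's RSA private key (RS256) and may
    # live for at most ten minutes.
    app_jwt = jwt.encode(
        {"iat": now - 60, "exp": now + 540, "iss": app_id},
        private_key_pem,
        algorithm="RS256",
    )
    headers = {"Authorization": f"Bearer {app_jwt}",
               "Accept": "application/vnd.github+json"}
    # Look up the App's first installation, then mint a token for it.
    installs = requests.get("https://api.github.com/app/installations",
                            headers=headers, timeout=30)
    installs.raise_for_status()
    install_id = installs.json()[0]["id"]
    resp = requests.post(
        f"https://api.github.com/app/installations/{install_id}/access_tokens",
        headers=headers, timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["token"]
```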
## Quick Start

```bash
docker build -t github-etl .
```

Create an env file (do not commit it):
```
# github-etl.env
GITHUB_REPOS=mozilla-firefox/firefox
GITHUB_APP_ID=your_github_app_id
GITHUB_PRIVATE_KEY=<paste PEM contents here, with real newline characters (do not use "\n" escape sequences)>
BIGQUERY_PROJECT=your-gcp-project
BIGQUERY_DATASET=your_dataset
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

Then run the container using `--env-file` to avoid exposing secrets in shell history or via `/proc/<pid>/environ`:
```bash
docker run --rm \
  --env-file github-etl.env \
  -v /local/path/to/credentials.json:/path/to/credentials.json \
  github-etl
```

Note: Never pass the private key inline with `-e GITHUB_PRIVATE_KEY="$(cat ...)"`: that leaks the key into your shell history and makes it visible to other processes via `ps`/`/proc`. Use `--env-file`, Docker secrets, or a secret manager that injects `GITHUB_PRIVATE_KEY` as an environment variable instead.
## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `GITHUB_REPOS` | Yes | - | Comma-separated repositories in `owner/repo` format (e.g., `mozilla/firefox`) |
| `GITHUB_APP_ID` | No* | - | GitHub App numeric ID (found on the App's settings page). Required for authenticated access. |
| `GITHUB_PRIVATE_KEY` | No* | - | RSA private key in PEM format for the GitHub App. Required for authenticated access. |
| `BIGQUERY_PROJECT` | Yes | - | Google Cloud project ID |
| `BIGQUERY_DATASET` | Yes | - | BigQuery dataset ID |
| `GOOGLE_APPLICATION_CREDENTIALS` | Yes* | - | Path to a GCP service account JSON key file (*or use Workload Identity) |

\* `GITHUB_APP_ID` and `GITHUB_PRIVATE_KEY` must be set together for authenticated access.
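As an illustration of how these variables might be consumed at startup (the actual parsing in `main.py` may differ; `load_config` is a hypothetical helper):

```python
import os

def load_config() -> dict:
    """Sketch: read and validate the ETL's environment variables."""
    required = ["GITHUB_REPOS", "BIGQUERY_PROJECT", "BIGQUERY_DATASET"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise SystemExit(f"Missing required environment variables: {missing}")
    return {
        # "owner/repo,owner/repo" -> ["owner/repo", ...]
        "repos": [r.strip() for r in os.environ["GITHUB_REPOS"].split(",")],
        "app_id": os.environ.get("GITHUB_APP_ID"),            # optional: unauthenticated if absent
        "private_key": os.environ.get("GITHUB_PRIVATE_KEY"),  # must accompany GITHUB_APP_ID
        "project": os.environ["BIGQUERY_PROJECT"],
        "dataset": os.environ["BIGQUERY_DATASET"],
    }
```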
## Project Structure

- `main.py`: The main ETL script containing the business logic
- `requirements.txt`: Python dependencies
- `Dockerfile`: Container configuration
## Docker Image

- Base Image: `python:3.14-slim` (latest stable Python)
- User: `app` (uid: 1000, gid: 1000)
- Working Directory: `/app`
- Ownership: All files in `/app` are owned by the `app` user
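A Dockerfile along the following lines produces this layout. It is a sketch consistent with the points above, not necessarily the repository's exact file:

```dockerfile
FROM python:3.14-slim

# Create the non-root user (uid/gid 1000) the container runs as.
RUN groupadd --gid 1000 app && useradd --uid 1000 --gid 1000 --create-home app

WORKDIR /app

# Install dependencies first so code changes don't invalidate this layer.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application and hand ownership to the app user.
COPY --chown=app:app . .
USER app

CMD ["python3", "main.py"]
```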
## Data Pipeline

The pipeline uses a streaming/chunked architecture that processes pull requests in batches of 100; a condensed sketch follows the list:
1. Extract: A generator yields chunks of 100 PRs from the GitHub API
   - Implements pagination and rate limit handling
   - Fetches all pull requests (open, closed, merged) sorted by creation date
2. Transform: Flattens and structures PR data for BigQuery
   - Extracts key fields (number, title, state, timestamps, user info)
   - Flattens nested objects (user, head/base branches)
   - Converts arrays (labels, assignees) to JSON strings
3. Load: Inserts transformed data into BigQuery
   - Uses the BigQuery Python client library
   - Adds a `snapshot_date` timestamp to all rows
   - Inserts each chunk immediately after it is transformed
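A condensed sketch of the three stages (field and function names are illustrative; the real `main.py` is more thorough):

```python
import json
from datetime import datetime, timezone

import requests
from google.cloud import bigquery

CHUNK_SIZE = 100

def extract(repo: str, token: str):
    """Extract: yield chunks of up to 100 PRs, paginating through the API."""
    url = f"https://api.github.com/repos/{repo}/pulls"
    params = {"state": "all", "sort": "created", "per_page": CHUNK_SIZE, "page": 1}
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            return
        yield page
        params["page"] += 1

def transform(prs: list[dict]) -> list[dict]:
    """Transform: flatten nested objects and serialize arrays to JSON strings."""
    snapshot = datetime.now(timezone.utc).isoformat()
    return [
        {
            "number": pr["number"],
            "title": pr["title"],
            "state": pr["state"],
            "created_at": pr["created_at"],
            "user_login": (pr.get("user") or {}).get("login"),  # flattened nested object
            "base_ref": pr["base"]["ref"],
            "head_ref": pr["head"]["ref"],
            "labels": json.dumps([l["name"] for l in pr.get("labels", [])]),
            "assignees": json.dumps([a["login"] for a in pr.get("assignees", [])]),
            "snapshot_date": snapshot,                           # added to every row
        }
        for pr in prs
    ]

def load(client: bigquery.Client, table_id: str, rows: list[dict]) -> None:
    """Load: insert each transformed chunk into BigQuery immediately."""
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
```

A driver loop then calls `load(...)` on `transform(chunk)` for each chunk that `extract(...)` yields, so at most one chunk of 100 PRs is held in memory at a time.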
Benefits of Chunked Processing:
- Memory-efficient for large repositories
- Incremental progress visibility
- Early failure detection
- Supports streaming data pipelines
## Authentication

The script uses the BigQuery Python client library, which supports multiple authentication methods:
1. Service Account Key File (Recommended for local development):

   ```bash
   export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
   ```

2. Workload Identity (Recommended for Kubernetes):
   - Configure Workload Identity on your GKE cluster
   - No explicit credentials file needed

3. Application Default Credentials (For local development):

   ```bash
   gcloud auth application-default login
   ```
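Whichever method you choose, the client library resolves credentials automatically; a minimal example of constructing the client:

```python
from google.cloud import bigquery

# Credentials are discovered via Application Default Credentials:
# GOOGLE_APPLICATION_CREDENTIALS, Workload Identity, or
# `gcloud auth application-default login`.
client = bigquery.Client(project="your-gcp-project")
```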
## Running Locally

Set up environment variables and run the script:

```bash
export GITHUB_REPOS="mozilla/firefox"
export GITHUB_APP_ID="your_github_app_id"
# Load the PEM from a file to avoid the key appearing in shell history
export GITHUB_PRIVATE_KEY="$(< your_private_key.pem)"
export BIGQUERY_PROJECT="your-gcp-project"
export BIGQUERY_DATASET="your_dataset"
python3 main.py
```

## Local Development with Docker Compose

For local development and testing, you can use Docker Compose to run the ETL with mocked services (no GitHub API rate limits or GCP credentials required):
```bash
# Start all services (mock GitHub API, BigQuery emulator, and ETL)
docker-compose up --build

# View logs
docker-compose logs -f github-etl

# Stop services
docker-compose down
```

This setup includes:
- Mock GitHub API: Generates 250 sample pull requests
- BigQuery Emulator: Local BigQuery instance for testing
- ETL Service: Configured to use both mock services
## Adding Dependencies

Add new Python packages to `requirements.txt` and rebuild the Docker image.
## License

This project is licensed under the Mozilla Public License Version 2.0. See the LICENSE file for details.