This package provides a scalable, cost-effective solution for downloading videos from YouTube and Vimeo using Google Cloud Platform (GCP) spot instances and Amazon SQS for job queuing. The system automatically manages worker instances based on queue workload and handles video downloads with error recovery.
The system consists of four main components:
- SQS Message Publisher (`publish_sqs.py`): Publishes video IDs to the SQS queue
- Orchestrator (`orchestrate.py`): Manages GCP spot instances based on queue load
- Worker Script (`spot_download.py`): Downloads videos on spot instances
- Automation Script (`publish_and_orchestrate_download.sh`): Coordinates the entire process
- AWS account with SQS access
- AWS credentials configured (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
- SQS queue created in the desired region
Create an SQS queue using AWS CLI:
```bash
# Standard queue (recommended for most use cases)
aws sqs create-queue --queue-name test-queue \
  --region us-west-2 \
  --attributes VisibilityTimeout=600

# Get the queue URL (save this for later use)
aws sqs get-queue-url \
  --queue-name test-queue \
  --region us-west-2
```

Queue Configuration Recommendations:

- `VisibilityTimeout`: 600 seconds (10 minutes) - the time a worker has to process a message before it becomes visible to other workers again. Ten minutes is sufficient for video-crawling purposes.
- GCP project with Compute Engine API enabled
- Service account with necessary permissions:
- Compute Engine Admin
- Storage Admin
- Service Account User
- GCS buckets created in multiple regions for storage
- Google Cloud SDK (`gcloud`) installed and configured
- Python 3.7+
- Required Python packages (see `requirements.txt`)
- `yt-dlp` for YouTube downloads
- `ffmpeg` for video processing
Create JSON files containing lists of YouTube video IDs:
```json
["dQw4w9WgXcQ", "jNQXAC9IVRw", "kJQP7kiw5Fk"]
```

Set the required environment variables and run the automation script:

```bash
export AWS_ACCESS_KEY_ID="your_aws_key"
export AWS_SECRET_ACCESS_KEY="your_aws_secret"
export MESSAGE_FILES="/path/to/video_ids1.json /path/to/video_ids2.json"
export ROOT_BLOB_DIR="your-video-folder"
export QUEUE_URL="https://sqs.us-west-2.amazonaws.com/your-account/your-queue"
export DOWNLOAD_VIDEO="true"     # Set to download video files
export DOWNLOAD_METADATA="true"  # Set to download metadata

./scripts/publish_and_orchestrate_download.sh
```

Publishes video IDs from JSON files to an Amazon SQS queue for processing by worker instances.
Usage:
```bash
python -m yt_crawl.publish_sqs \
  --sqs_queue_url https://sqs.us-west-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/yt-download-queue \
  --message_files /path/to/file1.json /path/to/file2.json \
  --aws_region us-west-2 \
  --max_workers 20
```

Arguments:

- `--sqs_queue_url`: SQS queue URL for video download jobs
- `--message_files`: Paths to JSON files containing video IDs (supports multiple files)
- `--aws_region`: AWS region where the SQS queue is located
- `--max_workers`: Number of concurrent threads for publishing messages
Features:
- Multithreaded publishing for faster queue population
- Automatic duplicate removal across files
- Progress tracking with progress bars
- Error handling and retry logic
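The deduplication and batching portion of the publisher can be sketched in pure Python. The function names below are illustrative rather than the module's actual API; each batch would then be handed to a thread calling boto3's `send_message_batch`:

```python
import json

def load_unique_ids(paths):
    """Merge video IDs from several JSON files, dropping duplicates
    across files while preserving first-seen order."""
    seen, unique = set(), []
    for path in paths:
        with open(path) as f:
            for vid in json.load(f):
                if vid not in seen:
                    seen.add(vid)
                    unique.append(vid)
    return unique

def batches(ids, size=10):
    """SQS send_message_batch accepts at most 10 entries per call,
    so IDs are chunked before being fanned out across threads."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]
```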
Manages GCP spot instances dynamically based on SQS queue load. Creates and destroys instances as needed to maintain optimal processing capacity.
Usage:
```bash
python -m yt_crawl.download.orchestrate \
  --project_id YOUR_GCP_PROJECT \
  --bucket YOUR_GCS_BUCKET \
  --n_spot_instances 2000 \
  --machine_type e2-small \
  --root_blob_dir videos \
  --download_video \
  --download_metadata \
  --delete_vm \
  --sqs_data_queue_url https://sqs.us-west-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/yt-download-queue \
  --pull_type sqs \
  --max_height 720
```

Key Arguments:

- `--project_id`: GCP project ID
- `--bucket`: GCS bucket name for storing videos
- `--n_spot_instances`: Maximum number of spot instances to create
- `--machine_type`: GCP machine type (e.g., e2-small, e2-medium)
- `--root_blob_dir`: Directory in the GCS bucket for organizing downloads
- `--sqs_data_queue_url`: SQS queue URL containing video IDs
- `--delete_vm`: Automatically delete VMs when they finish processing
- `--max_height`: Maximum video resolution to download (e.g., 720, 1080)
Features:
- Smart Scaling: Creates instances based on queue size and current running instances
- Multi-Region Support: Automatically distributes instances across US regions
- Regional Bucket Mapping: Uses region-appropriate GCS buckets to minimize egress costs
- Zone Rotation: Distributes load across availability zones
- Automatic Cleanup: Removes instances when queue is empty
- Error Handling: Manages instance creation failures and retries
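As a rough illustration of the smart-scaling decision, the orchestrator can be thought of as targeting one worker per outstanding message, capped by the instance limit. This is a hypothetical simplification; the actual policy lives in `orchestrate.py`:

```python
def instances_to_create(visible, in_flight, running, max_instances):
    """Number of new spot instances to request: cover the visible
    backlog plus in-flight messages, capped at --n_spot_instances,
    minus what is already running (never negative)."""
    outstanding = visible + in_flight
    target = min(outstanding, max_instances)
    return max(target - running, 0)
```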
Regional Bucket Strategy: The orchestrator automatically maps GCP zones to appropriate regional buckets:
- `us-west1-*` → `YOUR_GCS_BUCKET`
- `us-central1-*` → `YOUR_GCS_BUCKET-central`
- `us-east1-*` → `YOUR_GCS_BUCKET-east`
- `us-east4-*` → `YOUR_GCS_BUCKET-east-4`
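The zone-to-bucket lookup amounts to stripping the zone suffix and indexing a region map. A minimal sketch with placeholder bucket names (the real mapping lives in `REGION_TO_BUCKET` in `orchestrate.py`):

```python
# Placeholder bucket names mirroring the mapping above.
REGION_TO_BUCKET = {
    "us-west1": "YOUR_GCS_BUCKET",
    "us-central1": "YOUR_GCS_BUCKET-central",
    "us-east1": "YOUR_GCS_BUCKET-east",
    "us-east4": "YOUR_GCS_BUCKET-east-4",
}

def bucket_for_zone(zone, default="YOUR_GCS_BUCKET"):
    """A zone like 'us-central1-a' belongs to region 'us-central1';
    unknown regions fall back to the default bucket."""
    region = zone.rsplit("-", 1)[0]
    return REGION_TO_BUCKET.get(region, default)
```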
Runs on each spot instance to download videos from YouTube based on SQS messages.
Core Functionality:
- Fetches video IDs from SQS queue
- Downloads videos, audio, or metadata using `yt-dlp`
- Stores files directly to mounted GCS buckets via `gcsfuse`
- Handles various error conditions and rate limiting
- Self-destructs when work is complete or errors accumulate
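The poll/process/self-destruct loop can be sketched as follows, with `fetch_batch` and `process` standing in for the SQS receive call and the yt-dlp download step (names and the poll threshold are illustrative, not the script's actual API):

```python
def run_worker(fetch_batch, process, max_empty_polls=20):
    """Keep pulling message batches until the queue looks drained:
    after max_empty_polls consecutive empty receives, return so the
    instance can delete itself (the --delete_vm behaviour)."""
    empty = 0
    while empty < max_empty_polls:
        batch = fetch_batch()
        if not batch:
            empty += 1
            continue
        empty = 0
        for video_id in batch:
            process(video_id)
```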
Download Options:
- `--download_video`: Download video files
- `--download_audio`: Download audio only
- `--download_metadata`: Download video metadata (title, description, etc.)
- `--download_mp4`: Force MP4 format (slower but more compatible)
- `--basic_metadata`: Download only essential metadata fields
Error Handling: The script includes sophisticated error handling for common YouTube issues:
- Ban Detection: Automatically shuts down instance if bot detection occurs
- Rate Limiting: Implements sleep intervals between downloads
- Content Filtering: Skips live streams and overly long videos
- Retry Logic: Attempts multiple retries for transient failures
- Error Logging: Saves detailed error information to GCS
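Error classification boils down to substring matching against known identifier sets. The identifiers below are invented examples; the real sets are `BAN_IDENTIFIERS` and `UNAVAILABLE_IDENTIFIERS` in `spot_download.py`:

```python
# Example identifier strings only -- customise against the real sets
# in spot_download.py.
BAN_IDENTIFIERS = {"sign in to confirm", "http error 403"}
UNAVAILABLE_IDENTIFIERS = {"video unavailable", "private video"}

def classify_error(message):
    """Bucket a yt-dlp error message so the worker can decide between
    shutting down (ban), skipping (unavailable), or retrying."""
    lowered = message.lower()
    if any(s in lowered for s in BAN_IDENTIFIERS):
        return "ban"
    if any(s in lowered for s in UNAVAILABLE_IDENTIFIERS):
        return "unavailable"
    return "retry"
```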
Quality Control:
- Maximum video length filtering (default: 2 hours)
- Resolution limiting (--max_height parameter)
- Live stream detection and skipping
- Private/unavailable video handling
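These checks can be applied before downloading by inspecting the yt-dlp info dict, which exposes `duration` (seconds) and `is_live` fields. A sketch of the filter, assuming the README's 2-hour default cap:

```python
MAX_DURATION_S = 2 * 60 * 60  # default length cap: 2 hours

def should_download(info, max_duration=MAX_DURATION_S):
    """Return False for live streams and overly long videos, based on
    the 'is_live' and 'duration' fields of a yt-dlp info dict."""
    if info.get("is_live"):
        return False
    duration = info.get("duration")
    if duration is not None and duration > max_duration:
        return False
    return True
```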
Bash script that coordinates the entire download pipeline from start to finish.
Environment Variables:
- `PUBLISH_WORKERS`: Number of threads for publishing messages (default: 32)
- `QUEUE_URL`: SQS queue URL
- `DOWNLOAD_VIDEO`: Set to "true" to download video files
- `DOWNLOAD_METADATA`: Set to "true" to download metadata
- `MESSAGE_FILES`: Space-separated paths to JSON files with video IDs
- `ROOT_BLOB_DIR`: Directory name in the GCS bucket
- `MAX_HEIGHT`: Maximum video resolution (default: 720)
- `PURGE_QUEUE`: Set to "true" to clear the queue before publishing
- `SKIP_PUBLISH`: Set to "true" to skip the publishing step
Workflow:
1. Optionally purges the existing SQS queue
2. Publishes video IDs to the SQS queue using multiple workers
3. Starts the orchestrator to manage spot instances
4. Monitors progress and scales instances automatically
Videos are organized in GCS buckets with the following structure:
```
bucket-name/
├── your-blob-dir/
│   ├── video_id1/
│   │   ├── video_id1.mp4
│   │   └── video_id1.json
│   └── video_id2/
│       ├── video_id2.mp4
│       └── video_id2.json
└── errors/
    ├── failed_video_id1.txt
    └── failed_video_id2.txt
```

- Uses preemptible GCP spot instances (up to 80% cost savings)
- Automatic termination handling
- Fault-tolerant design with message persistence
- Distributes instances across multiple US regions
- Uses regional GCS buckets to minimize data transfer costs
- Automatic zone rotation for resource availability
- Creates instances only when needed based on queue size
- Terminates instances when queue is empty
- Configurable maximum instance limits
- Each instance logs to `/path/to/data/vid-llm/yt_crawl/logs/`
- Detailed error tracking with instance identification
- Download progress and performance metrics
- Failed downloads logged with error details
- Instance information included for debugging
- Categorized error types (ban, unavailable, timeout, etc.)
- Real-time queue depth monitoring
- In-flight message tracking
- Automatic scaling based on workload
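Queue depth and in-flight counts come from the SQS GetQueueAttributes call, whose `ApproximateNumberOfMessages` and `ApproximateNumberOfMessagesNotVisible` attributes report the visible backlog and in-flight messages respectively. A small helper to interpret the returned attribute dict (the helper name is illustrative):

```python
def queue_load(attributes):
    """Parse an SQS GetQueueAttributes result: visible backlog and
    in-flight (received but not yet deleted) message counts."""
    visible = int(attributes.get("ApproximateNumberOfMessages", 0))
    in_flight = int(attributes.get("ApproximateNumberOfMessagesNotVisible", 0))
    return visible, in_flight

# With boto3 the dict would come from:
# attrs = boto3.client("sqs").get_queue_attributes(
#     QueueUrl=QUEUE_URL,
#     AttributeNames=["ApproximateNumberOfMessages",
#                     "ApproximateNumberOfMessagesNotVisible"],
# )["Attributes"]
```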
"No messages received for a long time"
- Check SQS queue URL and AWS credentials
- Verify queue has messages
- Check AWS region configuration
"Failed to create instances in all zones"
- Check GCP quotas and limits
- Verify service account permissions
- Try different machine types or regions
"Bot detection / 403 Forbidden errors"
- Implement sleep intervals between downloads
- Use different IP ranges or regions
- Consider rotating user agents
"Instance shuts down immediately"
- Check startup script execution
- Verify GCS bucket permissions
- Review instance logs in GCP console
For High Throughput:
- Increase `--n_spot_instances`
- Use larger machine types (e2-medium, e2-standard-2)
- Distribute across more regions
For Cost Optimization:
- Use smaller machine types (e2-micro, e2-small)
- Implement longer sleep intervals
- Enable `--basic_metadata`-only mode
For Quality:
- Set appropriate `--max_height` limits
- Enable `--download_mp4` for consistency
- Use `--sleep_interval` to avoid rate limiting
The orchestrator generates startup scripts dynamically based on configuration. You can modify the `STARTUP_SCRIPT_TEMPLATE` in `orchestrate.py` for custom requirements.
The system supports multiple GCP projects through the `REGION_TO_BUCKET` mapping. Add your project configuration:

```python
REGION_TO_BUCKET = {
    "your-project": {
        "us-west1": "your-bucket-west",
        "us-central1": "your-bucket-central",
        # ... other regions
    }
}
```

Modify the `BAN_IDENTIFIERS` and `UNAVAILABLE_IDENTIFIERS` sets in `spot_download.py` to customize error detection patterns.
- AWS credentials are passed as environment variables to instances
- GCP service accounts follow principle of least privilege
- No credentials are stored in GCS or logged
- Instances automatically terminate to prevent runaway costs
- Error logs exclude sensitive information
```bash
# Metadata-only crawl
export DOWNLOAD_METADATA="true"
export DOWNLOAD_VIDEO="false"
export MESSAGE_FILES="/data/video_ids_100k.json"
export ROOT_BLOB_DIR="metadata-100k"
./scripts/publish_and_orchestrate_download.sh
```

```bash
# High-resolution video crawl
export DOWNLOAD_VIDEO="true"
export MAX_HEIGHT="1080"
export MESSAGE_FILES="/data/hq_video_ids.json"
export ROOT_BLOB_DIR="hq-videos"
./scripts/publish_and_orchestrate_download.sh
```

```bash
# Audio-only crawl
export DOWNLOAD_AUDIO="true"
export DOWNLOAD_VIDEO="false"
export MESSAGE_FILES="/data/audio_video_ids.json"
export ROOT_BLOB_DIR="audio-dataset"
./scripts/publish_and_orchestrate_download.sh
```

For issues, feature requests, or contributions, please refer to the main project repository. When reporting issues, include:
- Complete error messages and logs
- Configuration parameters used
- GCP project and region information
- Sample video IDs that failed (if applicable)