This package provides a scalable, cost-effective solution for downloading videos from YouTube and Vimeo using Google Cloud Platform (GCP) spot instances and Amazon SQS for job queuing. The system automatically manages worker instances based on queue workload and handles video downloads with error recovery.
The system consists of four main components:
- SQS Message Publisher (`publish_sqs.py`): Publishes video IDs to the SQS queue
- Orchestrator (`orchestrate.py`): Manages GCP spot instances based on queue load
- Worker Script (`spot_download.py`): Downloads videos on spot instances
- Automation Script (`publish_and_orchestrate_download.sh`): Coordinates the entire process
- AWS account with SQS access
- AWS credentials configured (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`)
- SQS queue created in the desired region
Create an SQS queue using AWS CLI:
```bash
# Standard queue (recommended for most use cases)
aws sqs create-queue --queue-name test-queue \
  --region us-west-2 \
  --attributes VisibilityTimeout=600

# Get the queue URL (save this for later use)
aws sqs get-queue-url \
  --queue-name test-queue \
  --region us-west-2
```

Queue Configuration Recommendations:

- `VisibilityTimeout`: 600 seconds (10 minutes) - the time a worker has to process a message before it becomes visible to other workers again. Ten minutes is sufficient for video-crawling purposes.
- GCP project with Compute Engine API enabled
- Service account with necessary permissions:
- Compute Engine Admin
- Storage Admin
- Service Account User
- GCS buckets created in multiple regions for storage
- Google Cloud SDK (`gcloud`) installed and configured
- Python 3.7+
- Required Python packages (see `requirements.txt`)
- `yt-dlp` for YouTube downloads
- `ffmpeg` for video processing
Create JSON files containing lists of YouTube video IDs:
```json
["dQw4w9WgXcQ", "jNQXAC9IVRw", "kJQP7kiw5Fk"]
```

Set the required environment variables and run the automation script:

```bash
export AWS_ACCESS_KEY_ID="your_aws_key"
export AWS_SECRET_ACCESS_KEY="your_aws_secret"
export MESSAGE_FILES="/path/to/video_ids1.json /path/to/video_ids2.json"
export ROOT_BLOB_DIR="your-video-folder"
export QUEUE_URL="https://sqs.us-west-2.amazonaws.com/your-account/your-queue"
export DOWNLOAD_VIDEO="true"     # Set to download video files
export DOWNLOAD_METADATA="true"  # Set to download metadata

./scripts/publish_and_orchestrate_download.sh
```

Publishes video IDs from JSON files to an Amazon SQS queue for processing by worker instances.
Usage:
```bash
python -m yt_crawl.publish_sqs \
  --sqs_queue_url https://sqs.us-west-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/yt-download-queue \
  --message_files /path/to/file1.json /path/to/file2.json \
  --aws_region us-west-2 \
  --max_workers 20
```

Arguments:

- `--sqs_queue_url`: SQS queue URL for video download jobs
- `--message_files`: Paths to JSON files containing video IDs (supports multiple files)
- `--aws_region`: AWS region where the SQS queue is located
- `--max_workers`: Number of concurrent threads for publishing messages
Features:
- Multithreaded publishing for faster queue population
- Automatic duplicate removal across files
- Progress tracking with progress bars
- Error handling and retry logic
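The deduplication and batching portion of the publisher can be sketched in pure Python. The function names below are illustrative rather than the module's actual API; each batch would then be handed to a thread calling boto3's `send_message_batch`:

```python
import json

def load_unique_ids(paths):
    """Merge video IDs from several JSON files, dropping duplicates
    across files while preserving first-seen order."""
    seen, unique = set(), []
    for path in paths:
        with open(path) as f:
            for vid in json.load(f):
                if vid not in seen:
                    seen.add(vid)
                    unique.append(vid)
    return unique

def batches(ids, size=10):
    """SQS send_message_batch accepts at most 10 entries per call,
    so IDs are chunked before being fanned out across threads."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]
```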
Manages GCP spot instances dynamically based on SQS queue load. Creates and destroys instances as needed to maintain optimal processing capacity.
Usage:
```bash
python -m yt_crawl.download.orchestrate \
  --project_id YOUR_GCP_PROJECT \
  --bucket YOUR_GCS_BUCKET \
  --n_spot_instances 2000 \
  --machine_type e2-small \
  --root_blob_dir videos \
  --download_video \
  --download_metadata \
  --delete_vm \
  --sqs_data_queue_url https://sqs.us-west-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/yt-download-queue \
  --pull_type sqs \
  --max_height 720
```

Key Arguments:

- `--project_id`: GCP project ID
- `--bucket`: GCS bucket name for storing videos
- `--n_spot_instances`: Maximum number of spot instances to create
- `--machine_type`: GCP machine type (e.g., e2-small, e2-medium)
- `--root_blob_dir`: Directory in the GCS bucket for organizing downloads
- `--sqs_data_queue_url`: SQS queue URL containing video IDs
- `--delete_vm`: Automatically delete VMs when they finish processing
- `--max_height`: Maximum video resolution to download (e.g., 720, 1080)
Features:
- Smart Scaling: Creates instances based on queue size and current running instances
- Multi-Region Support: Automatically distributes instances across US regions
- Regional Bucket Mapping: Uses region-appropriate GCS buckets to minimize egress costs
- Zone Rotation: Distributes load across availability zones
- Automatic Cleanup: Removes instances when queue is empty
- Error Handling: Manages instance creation failures and retries
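As a rough illustration of the smart-scaling decision, the orchestrator can be thought of as targeting one worker per outstanding message, capped by the instance limit. This is a hypothetical simplification; the actual policy lives in `orchestrate.py`:

```python
def instances_to_create(visible, in_flight, running, max_instances):
    """Number of new spot instances to request: cover the visible
    backlog plus in-flight messages, capped at --n_spot_instances,
    minus what is already running (never negative)."""
    outstanding = visible + in_flight
    target = min(outstanding, max_instances)
    return max(target - running, 0)
```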
Regional Bucket Strategy: The orchestrator automatically maps GCP zones to appropriate regional buckets:
- `us-west1-*` → `YOUR_GCS_BUCKET`
- `us-central1-*` → `YOUR_GCS_BUCKET-central`
- `us-east1-*` → `YOUR_GCS_BUCKET-east`
- `us-east4-*` → `YOUR_GCS_BUCKET-east-4`
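The zone-to-bucket lookup amounts to stripping the zone suffix and indexing a region map. A minimal sketch with placeholder bucket names (the real mapping lives in `REGION_TO_BUCKET` in `orchestrate.py`):

```python
# Placeholder bucket names mirroring the mapping above.
REGION_TO_BUCKET = {
    "us-west1": "YOUR_GCS_BUCKET",
    "us-central1": "YOUR_GCS_BUCKET-central",
    "us-east1": "YOUR_GCS_BUCKET-east",
    "us-east4": "YOUR_GCS_BUCKET-east-4",
}

def bucket_for_zone(zone, default="YOUR_GCS_BUCKET"):
    """A zone like 'us-central1-a' belongs to region 'us-central1';
    unknown regions fall back to the default bucket."""
    region = zone.rsplit("-", 1)[0]
    return REGION_TO_BUCKET.get(region, default)
```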
Runs on each spot instance to download videos from YouTube based on SQS messages.
Core Functionality:
- Fetches video IDs from SQS queue
- Downloads videos, audio, or metadata using `yt-dlp`
- Stores files directly to mounted GCS buckets via `gcsfuse`
- Handles various error conditions and rate limiting
- Self-destructs when work is complete or errors accumulate
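The poll/process/self-destruct loop can be sketched as follows, with `fetch_batch` and `process` standing in for the SQS receive call and the yt-dlp download step (names and the poll threshold are illustrative, not the script's actual API):

```python
def run_worker(fetch_batch, process, max_empty_polls=20):
    """Keep pulling message batches until the queue looks drained:
    after max_empty_polls consecutive empty receives, return so the
    instance can delete itself (the --delete_vm behaviour)."""
    empty = 0
    while empty < max_empty_polls:
        batch = fetch_batch()
        if not batch:
            empty += 1
            continue
        empty = 0
        for video_id in batch:
            process(video_id)
```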
Download Options:
- `--download_video`: Download video files
- `--download_audio`: Download audio only
- `--download_metadata`: Download video metadata (title, description, etc.)
- `--download_mp4`: Force MP4 format (slower but more compatible)
- `--basic_metadata`: Download only essential metadata fields
Error Handling: The script includes sophisticated error handling for common YouTube issues:
- Ban Detection: Automatically shuts down instance if bot detection occurs
- Rate Limiting: Implements sleep intervals between downloads
- Content Filtering: Skips live streams and overly long videos
- Retry Logic: Attempts multiple retries for transient failures
- Error Logging: Saves detailed error information to GCS
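Error classification boils down to substring matching against known identifier sets. The identifiers below are invented examples; the real sets are `BAN_IDENTIFIERS` and `UNAVAILABLE_IDENTIFIERS` in `spot_download.py`:

```python
# Example identifier strings only -- customise against the real sets
# in spot_download.py.
BAN_IDENTIFIERS = {"sign in to confirm", "http error 403"}
UNAVAILABLE_IDENTIFIERS = {"video unavailable", "private video"}

def classify_error(message):
    """Bucket a yt-dlp error message so the worker can decide between
    shutting down (ban), skipping (unavailable), or retrying."""
    lowered = message.lower()
    if any(s in lowered for s in BAN_IDENTIFIERS):
        return "ban"
    if any(s in lowered for s in UNAVAILABLE_IDENTIFIERS):
        return "unavailable"
    return "retry"
```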
Quality Control:
- Maximum video length filtering (default: 2 hours)
- Resolution limiting (--max_height parameter)
- Live stream detection and skipping
- Private/unavailable video handling
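These checks can be applied before downloading by inspecting the yt-dlp info dict, which exposes `duration` (seconds) and `is_live` fields. A sketch of the filter, assuming the README's 2-hour default cap:

```python
MAX_DURATION_S = 2 * 60 * 60  # default length cap: 2 hours

def should_download(info, max_duration=MAX_DURATION_S):
    """Return False for live streams and overly long videos, based on
    the 'is_live' and 'duration' fields of a yt-dlp info dict."""
    if info.get("is_live"):
        return False
    duration = info.get("duration")
    if duration is not None and duration > max_duration:
        return False
    return True
```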
Bash script that coordinates the entire download pipeline from start to finish.
Environment Variables:
- `PUBLISH_WORKERS`: Number of threads for publishing messages (default: 32)
- `QUEUE_URL`: SQS queue URL
- `DOWNLOAD_VIDEO`: Set to "true" to download video files
- `DOWNLOAD_METADATA`: Set to "true" to download metadata
- `MESSAGE_FILES`: Space-separated paths to JSON files with video IDs
- `ROOT_BLOB_DIR`: Directory name in the GCS bucket
- `MAX_HEIGHT`: Maximum video resolution (default: 720)
- `PURGE_QUEUE`: Set to "true" to clear the queue before publishing
- `SKIP_PUBLISH`: Set to "true" to skip the publishing step
Workflow:
1. Optionally purges the existing SQS queue
2. Publishes video IDs to the SQS queue using multiple workers
3. Starts the orchestrator to manage spot instances
4. Monitors progress and scales instances automatically
Videos are organized in GCS buckets with the following structure:
```
bucket-name/
├── your-blob-dir/
│   ├── video_id1/
│   │   ├── video_id1.mp4
│   │   └── video_id1.json
│   └── video_id2/
│       ├── video_id2.mp4
│       └── video_id2.json
└── errors/
    ├── failed_video_id1.txt
    └── failed_video_id2.txt
```

- Uses preemptible GCP spot instances (up to 80% cost savings)
- Automatic termination handling
- Fault-tolerant design with message persistence
- Distributes instances across multiple US regions
- Uses regional GCS buckets to minimize data transfer costs
- Automatic zone rotation for resource availability
- Creates instances only when needed based on queue size
- Terminates instances when queue is empty
- Configurable maximum instance limits
- Each instance logs to `/path/to/data/vid-llm/yt_crawl/logs/`
- Detailed error tracking with instance identification
- Download progress and performance metrics
- Failed downloads logged with error details
- Instance information included for debugging
- Categorized error types (ban, unavailable, timeout, etc.)
- Real-time queue depth monitoring
- In-flight message tracking
- Automatic scaling based on workload
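Queue depth and in-flight counts come from the SQS GetQueueAttributes call, whose `ApproximateNumberOfMessages` and `ApproximateNumberOfMessagesNotVisible` attributes report the visible backlog and in-flight messages respectively. A small helper to interpret the returned attribute dict (the helper name is illustrative):

```python
def queue_load(attributes):
    """Parse an SQS GetQueueAttributes result: visible backlog and
    in-flight (received but not yet deleted) message counts."""
    visible = int(attributes.get("ApproximateNumberOfMessages", 0))
    in_flight = int(attributes.get("ApproximateNumberOfMessagesNotVisible", 0))
    return visible, in_flight

# With boto3 the dict would come from:
# attrs = boto3.client("sqs").get_queue_attributes(
#     QueueUrl=QUEUE_URL,
#     AttributeNames=["ApproximateNumberOfMessages",
#                     "ApproximateNumberOfMessagesNotVisible"],
# )["Attributes"]
```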
"No messages received for a long time"
- Check SQS queue URL and AWS credentials
- Verify queue has messages
- Check AWS region configuration
"Failed to create instances in all zones"
- Check GCP quotas and limits
- Verify service account permissions
- Try different machine types or regions
"Bot detection / 403 Forbidden errors"
- Implement sleep intervals between downloads
- Use different IP ranges or regions
- Consider rotating user agents
"Instance shuts down immediately"
- Check startup script execution
- Verify GCS bucket permissions
- Review instance logs in GCP console
For High Throughput:
- Increase `--n_spot_instances`
- Use larger machine types (e2-medium, e2-standard-2)
- Distribute across more regions
For Cost Optimization:
- Use smaller machine types (e2-micro, e2-small)
- Implement longer sleep intervals
- Enable `--basic_metadata`-only mode
For Quality:
- Set appropriate `--max_height` limits
- Enable `--download_mp4` for consistency
- Use `--sleep_interval` to avoid rate limiting
The orchestrator generates startup scripts dynamically based on configuration. You can modify the `STARTUP_SCRIPT_TEMPLATE` in `orchestrate.py` for custom requirements.
The system supports multiple GCP projects through the `REGION_TO_BUCKET` mapping. Add your project configuration:

```python
REGION_TO_BUCKET = {
    "your-project": {
        "us-west1": "your-bucket-west",
        "us-central1": "your-bucket-central",
        # ... other regions
    }
}
```

Modify the `BAN_IDENTIFIERS` and `UNAVAILABLE_IDENTIFIERS` sets in `spot_download.py` to customize error detection patterns.
- AWS credentials are passed as environment variables to instances
- GCP service accounts follow principle of least privilege
- No credentials are stored in GCS or logged
- Instances automatically terminate to prevent runaway costs
- Error logs exclude sensitive information
```bash
# Metadata-only crawl
export DOWNLOAD_METADATA="true"
export DOWNLOAD_VIDEO="false"
export MESSAGE_FILES="/data/video_ids_100k.json"
export ROOT_BLOB_DIR="metadata-100k"
./scripts/publish_and_orchestrate_download.sh
```

```bash
# High-resolution video crawl
export DOWNLOAD_VIDEO="true"
export MAX_HEIGHT="1080"
export MESSAGE_FILES="/data/hq_video_ids.json"
export ROOT_BLOB_DIR="hq-videos"
./scripts/publish_and_orchestrate_download.sh
```

```bash
# Audio-only crawl
export DOWNLOAD_AUDIO="true"
export DOWNLOAD_VIDEO="false"
export MESSAGE_FILES="/data/audio_video_ids.json"
export ROOT_BLOB_DIR="audio-dataset"
./scripts/publish_and_orchestrate_download.sh
```

For issues, feature requests, or contributions, please refer to the main project repository. When reporting issues, include:
- Complete error messages and logs
- Configuration parameters used
- GCP project and region information
- Sample video IDs that failed (if applicable)