vid-nexus: Video Download System with GCP Spot Instances

This package provides a scalable, cost-effective solution for downloading videos from YouTube and Vimeo using Google Cloud Platform (GCP) spot instances and Amazon SQS for job queuing. The system automatically manages worker instances based on queue workload and handles video downloads with error recovery.

Architecture Overview

The system consists of four main components:

  1. SQS Message Publisher (publish_sqs.py) - Publishes video IDs to SQS queue
  2. Orchestrator (orchestrate.py) - Manages GCP spot instances based on queue load
  3. Worker Script (spot_download.py) - Downloads videos on spot instances
  4. Automation Script (publish_and_orchestrate_download.sh) - Coordinates the entire process

Prerequisites

AWS Setup

  • AWS account with SQS access
  • AWS credentials configured (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)
  • SQS queue created in desired region

Creating an SQS Queue

Create an SQS queue using AWS CLI:

# Standard queue (recommended for most use cases)
aws sqs create-queue --queue-name test-queue \
  --region us-west-2 \
  --attributes VisibilityTimeout=600

# Get the queue URL (save this for later use)
aws sqs get-queue-url \
  --queue-name test-queue \
  --region us-west-2

Queue Configuration Recommendations:

  • VisibilityTimeout: 600 seconds (10 minutes) - the time a worker has to process a message before it becomes visible to other consumers again. 10 minutes is usually sufficient for video crawling purposes.

GCP Setup

  • GCP project with Compute Engine API enabled
  • Service account with necessary permissions:
    • Compute Engine Admin
    • Storage Admin
    • Service Account User
  • GCS buckets created in multiple regions for storage
  • Google Cloud SDK (gcloud) installed and configured

Dependencies

  • Python 3.7+
  • Required Python packages (see requirements.txt)
  • yt-dlp for YouTube downloads
  • ffmpeg for video processing

Quick Start

1. Prepare Video IDs

Create JSON files containing lists of YouTube video IDs:

["dQw4w9WgXcQ", "jNQXAC9IVRw", "kJQP7kiw5Fk"]

2. Set Environment Variables

export AWS_ACCESS_KEY_ID="your_aws_key"
export AWS_SECRET_ACCESS_KEY="your_aws_secret"
export MESSAGE_FILES="/path/to/video_ids1.json /path/to/video_ids2.json"
export ROOT_BLOB_DIR="your-video-folder"
export QUEUE_URL="https://sqs.us-west-2.amazonaws.com/your-account/your-queue"
export DOWNLOAD_VIDEO="true"     # Set to download video files
export DOWNLOAD_METADATA="true"  # Set to download metadata

3. Run the Complete Pipeline

./scripts/publish_and_orchestrate_download.sh

Detailed Component Documentation

SQS Message Publisher (publish_sqs.py)

Publishes video IDs from JSON files to an Amazon SQS queue for processing by worker instances.

Usage:

python -m yt_crawl.publish_sqs \
  --sqs_queue_url https://sqs.us-west-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/yt-download-queue \
  --message_files /path/to/file1.json /path/to/file2.json \
  --aws_region us-west-2 \
  --max_workers 20

Arguments:

  • --sqs_queue_url: SQS queue URL for video download jobs
  • --message_files: Paths to JSON files containing video IDs (supports multiple files)
  • --aws_region: AWS region where the SQS queue is located
  • --max_workers: Number of concurrent threads for publishing messages

Features:

  • Multithreaded publishing for faster queue population
  • Automatic duplicate removal across files
  • Progress tracking with progress bars
  • Error handling and retry logic
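The deduplicate-then-fan-out flow can be sketched as below. `load_unique_ids` and `publish_all` are illustrative names, not the repo's actual functions, and `send_fn` stands in for a boto3 `sqs.send_message` call so the sketch stays self-contained:

```python
import json
from concurrent.futures import ThreadPoolExecutor

def load_unique_ids(paths):
    """Merge JSON files of video IDs, dropping duplicates across files
    while keeping first-seen order (sketch of the dedup step above)."""
    seen, unique = set(), []
    for path in paths:
        with open(path) as f:
            for vid in json.load(f):
                if vid not in seen:
                    seen.add(vid)
                    unique.append(vid)
    return unique

def publish_all(ids, send_fn, max_workers=20):
    """Fan messages out over a thread pool; in the real publisher
    send_fn would wrap sqs.send_message(QueueUrl=..., MessageBody=vid)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_fn, ids))
```

A thread pool is a good fit here because `send_message` calls are I/O-bound; 20-32 workers typically saturates a single machine's publish throughput.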

Orchestrator (orchestrate.py)

Manages GCP spot instances dynamically based on SQS queue load. Creates and destroys instances as needed to maintain optimal processing capacity.

Usage:

python -m yt_crawl.download.orchestrate \
  --project_id YOUR_GCP_PROJECT \
  --bucket YOUR_GCS_BUCKET \
  --n_spot_instances 2000 \
  --machine_type e2-small \
  --root_blob_dir videos \
  --download_video \
  --download_metadata \
  --delete_vm \
  --sqs_data_queue_url https://sqs.us-west-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/yt-download-queue \
  --pull_type sqs \
  --max_height 720

Key Arguments:

  • --project_id: GCP project ID
  • --bucket: GCS bucket name for storing videos
  • --n_spot_instances: Maximum number of spot instances to create
  • --machine_type: GCP machine type (e.g., e2-small, e2-medium)
  • --root_blob_dir: Directory in GCS bucket for organizing downloads
  • --sqs_data_queue_url: SQS queue URL containing video IDs
  • --delete_vm: Automatically delete VMs when they finish processing
  • --max_height: Maximum video resolution to download (e.g., 720, 1080)

Features:

  • Smart Scaling: Creates instances based on queue size and current running instances
  • Multi-Region Support: Automatically distributes instances across US regions
  • Regional Bucket Mapping: Uses region-appropriate GCS buckets to minimize egress costs
  • Zone Rotation: Distributes load across availability zones
  • Automatic Cleanup: Removes instances when queue is empty
  • Error Handling: Manages instance creation failures and retries
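The smart-scaling decision reduces to simple arithmetic over the queue depth and current fleet size. The function below is a guess at the rule described above, not the repo's exact code:

```python
def instances_to_create(visible, in_flight, running, max_instances):
    """How many new spot instances to launch: aim for one instance per
    outstanding message, capped by --n_spot_instances, minus what is
    already running. Never returns a negative count."""
    backlog = visible + in_flight
    target = min(backlog, max_instances)
    return max(0, target - running)
```

For example, with 100 queued messages, 30 running instances, and a cap of 2000, the orchestrator would launch 70 more; with an empty queue it launches none.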

Regional Bucket Strategy: The orchestrator automatically maps GCP zones to appropriate regional buckets:

  • us-west1-* → YOUR_GCS_BUCKET
  • us-central1-* → YOUR_GCS_BUCKET-central
  • us-east1-* → YOUR_GCS_BUCKET-east
  • us-east4-* → YOUR_GCS_BUCKET-east-4
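A minimal sketch of that mapping, assuming the bucket suffixes listed above (the repo's lookup table may differ):

```python
# Region -> bucket-name suffix, per the mapping above.
REGION_TO_SUFFIX = {
    "us-west1": "",
    "us-central1": "-central",
    "us-east1": "-east",
    "us-east4": "-east-4",
}

def bucket_for_zone(zone, base_bucket):
    """Map a GCP zone like 'us-central1-b' to its regional bucket by
    stripping the trailing zone letter to recover the region."""
    region = zone.rsplit("-", 1)[0]
    return base_bucket + REGION_TO_SUFFIX[region]
```

Keeping each instance's writes in its own region is what avoids inter-region egress charges on the upload path.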

Worker Script (spot_download.py)

Runs on each spot instance to download videos from YouTube based on SQS messages.

Core Functionality:

  • Fetches video IDs from SQS queue
  • Downloads videos, audio, or metadata using yt-dlp
  • Stores files directly to mounted GCS buckets via gcsfuse
  • Handles various error conditions and rate limiting
  • Self-destructs when work is complete or errors accumulate
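The yt-dlp invocation a worker builds for one message can be sketched like this. The helper name and exact flag set are illustrative; the real worker's flags may differ, though `-f`, `-o`, and `--write-info-json` are standard yt-dlp options:

```python
def ytdlp_args(video_id, max_height=720, download_metadata=True):
    """Assemble a yt-dlp command line for one video: cap the resolution
    via a format selector and write output under <video_id>/."""
    selector = (f"bestvideo[height<={max_height}]+bestaudio"
                f"/best[height<={max_height}]")
    args = [
        "yt-dlp",
        f"https://www.youtube.com/watch?v={video_id}",
        "-f", selector,
        "-o", f"{video_id}/{video_id}.%(ext)s",
    ]
    if download_metadata:
        args.append("--write-info-json")  # sidecar JSON metadata
    return args
```

The argument list would then be handed to `subprocess.run`, with the output directory sitting on the gcsfuse mount so files land directly in GCS.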

Download Options:

  • --download_video: Download video files
  • --download_audio: Download audio only
  • --download_metadata: Download video metadata (title, description, etc.)
  • --download_mp4: Force MP4 format (slower but more compatible)
  • --basic_metadata: Download only essential metadata fields

Error Handling: The script includes sophisticated error handling for common YouTube issues:

  • Ban Detection: Automatically shuts down instance if bot detection occurs
  • Rate Limiting: Implements sleep intervals between downloads
  • Content Filtering: Skips live streams and overly long videos
  • Retry Logic: Attempts multiple retries for transient failures
  • Error Logging: Saves detailed error information to GCS
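The retry behavior for transient failures amounts to a backoff loop like the following sketch (illustrative, not the repo's exact code):

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=2.0):
    """Call fn, retrying transient failures with jittered exponential
    backoff; re-raise after the final attempt so the caller can log it."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            # 2s, 4s, 8s, ... scaled by random jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** i) * random.uniform(0.5, 1.0))
```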

Quality Control:

  • Maximum video length filtering (default: 2 hours)
  • Resolution limiting (--max_height parameter)
  • Live stream detection and skipping
  • Private/unavailable video handling
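Those checks can run against the metadata dict yt-dlp extracts before any download starts. A minimal predicate, assuming the standard `is_live` and `duration` info-dict fields:

```python
def should_skip(info, max_duration=7200):
    """Return a skip reason for a video's info dict, or None to proceed.
    Defaults mirror the 2-hour length cap described above."""
    if info.get("is_live"):
        return "live stream"
    if (info.get("duration") or 0) > max_duration:
        return "too long"
    return None
```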

Automation Script (publish_and_orchestrate_download.sh)

Bash script that coordinates the entire download pipeline from start to finish.

Environment Variables:

  • PUBLISH_WORKERS: Number of threads for publishing messages (default: 32)
  • QUEUE_URL: SQS queue URL
  • DOWNLOAD_VIDEO: Set to "true" to download video files
  • DOWNLOAD_METADATA: Set to "true" to download metadata
  • MESSAGE_FILES: Space-separated paths to JSON files with video IDs
  • ROOT_BLOB_DIR: Directory name in GCS bucket
  • MAX_HEIGHT: Maximum video resolution (default: 720)
  • PURGE_QUEUE: Set to "true" to clear queue before publishing
  • SKIP_PUBLISH: Set to "true" to skip publishing step

Workflow:

  1. Optionally purges existing SQS queue
  2. Publishes video IDs to SQS queue using multiple workers
  3. Starts orchestrator to manage spot instances
  4. Monitors progress and scales instances automatically
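How the `PURGE_QUEUE` and `SKIP_PUBLISH` toggles shape that workflow can be summarized as (hypothetical helper, for illustration only):

```python
def pipeline_steps(purge=False, skip_publish=False):
    """Ordered steps the automation script runs for a given toggle combo."""
    steps = []
    if purge:
        steps.append("purge_queue")
    if not skip_publish:
        steps.append("publish")
    steps.append("orchestrate")
    return steps
```

Setting `SKIP_PUBLISH` is useful when re-running the orchestrator against a queue that already holds messages from an earlier, interrupted run.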

Storage Organization

Videos are organized in GCS buckets with the following structure:

bucket-name/
├── your-blob-dir/
│   ├── video_id1/
│   │   ├── video_id1.mp4
│   │   └── video_id1.json
│   └── video_id2/
│       ├── video_id2.mp4
│       └── video_id2.json
└── errors/
    ├── failed_video_id1.txt
    └── failed_video_id2.txt
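A tiny helper following that layout (illustrative name, not from the repo):

```python
def blob_paths(root_blob_dir, video_id):
    """GCS object keys for one video under the per-video-ID layout."""
    base = f"{root_blob_dir}/{video_id}/{video_id}"
    return {"video": base + ".mp4", "metadata": base + ".json"}
```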

Cost Optimization Features

Spot Instances

  • Uses preemptible GCP spot instances (up to 80% cost savings)
  • Automatic termination handling
  • Fault-tolerant design with message persistence

Regional Distribution

  • Distributes instances across multiple US regions
  • Uses regional GCS buckets to minimize data transfer costs
  • Automatic zone rotation for resource availability

Smart Scaling

  • Creates instances only when needed based on queue size
  • Terminates instances when queue is empty
  • Configurable maximum instance limits

Monitoring and Logging

Instance-Level Logging

  • Each instance logs to /path/to/data/vid-llm/yt_crawl/logs/
  • Detailed error tracking with instance identification
  • Download progress and performance metrics

Error Tracking

  • Failed downloads logged with error details
  • Instance information included for debugging
  • Categorized error types (ban, unavailable, timeout, etc.)

Queue Monitoring

  • Real-time queue depth monitoring
  • In-flight message tracking
  • Automatic scaling based on workload
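boto3's `sqs.get_queue_attributes` returns these counts as strings under the `Attributes` key; the queue depth the scaler acts on is their sum. A minimal sketch of that parsing step:

```python
def queue_depth(attrs):
    """Total outstanding work from an SQS GetQueueAttributes response:
    visible messages plus in-flight (not-visible) messages."""
    return (int(attrs["ApproximateNumberOfMessages"])
            + int(attrs["ApproximateNumberOfMessagesNotVisible"]))
```

In the real monitor, `attrs` would come from `sqs.get_queue_attributes(QueueUrl=..., AttributeNames=["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"])["Attributes"]`.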

Troubleshooting

Common Issues

"No messages received for a long time"

  • Check SQS queue URL and AWS credentials
  • Verify queue has messages
  • Check AWS region configuration

"Failed to create instances in all zones"

  • Check GCP quotas and limits
  • Verify service account permissions
  • Try different machine types or regions

"Bot detection / 403 Forbidden errors"

  • Implement sleep intervals between downloads
  • Use different IP ranges or regions
  • Consider rotating user agents

"Instance shuts down immediately"

  • Check startup script execution
  • Verify GCS bucket permissions
  • Review instance logs in GCP console

Performance Tuning

For High Throughput:

  • Increase --n_spot_instances
  • Use larger machine types (e2-medium, e2-standard-2)
  • Distribute across more regions

For Cost Optimization:

  • Use smaller machine types (e2-micro, e2-small)
  • Implement longer sleep intervals
  • Enable --basic_metadata only mode

For Quality:

  • Set appropriate --max_height limits
  • Enable --download_mp4 for consistency
  • Use --sleep_interval to avoid rate limiting

Advanced Configuration

Custom Startup Scripts

The orchestrator generates startup scripts dynamically based on configuration. You can modify the STARTUP_SCRIPT_TEMPLATE in orchestrate.py for custom requirements.

Multi-Project Support

The system supports multiple GCP projects through the REGION_TO_BUCKET mapping. Add your project configuration:

REGION_TO_BUCKET = {
    "your-project": {
        "us-west1": "your-bucket-west",
        "us-central1": "your-bucket-central",
        # ... other regions
    }
}

Custom Error Handling

Modify the BAN_IDENTIFIERS and UNAVAILABLE_IDENTIFIERS sets in spot_download.py to customize error detection patterns.
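The pattern-matching this enables looks roughly like the sketch below. The identifier strings here are assumptions for illustration; the repo's actual `BAN_IDENTIFIERS` and `UNAVAILABLE_IDENTIFIERS` sets may contain different entries:

```python
# Hypothetical substrings matched against yt-dlp error output.
BAN_IDENTIFIERS = {"Sign in to confirm you're not a bot", "HTTP Error 403"}
UNAVAILABLE_IDENTIFIERS = {"Video unavailable", "Private video"}

def classify_error(message):
    """Bucket a yt-dlp error message so the worker can decide whether
    to shut down, skip the video permanently, or retry."""
    if any(s in message for s in BAN_IDENTIFIERS):
        return "ban"          # this instance's IP is likely flagged
    if any(s in message for s in UNAVAILABLE_IDENTIFIERS):
        return "unavailable"  # skip; retrying will not help
    return "transient"        # retry with backoff
```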

Security Considerations

  • AWS credentials are passed as environment variables to instances
  • GCP service accounts follow principle of least privilege
  • No credentials are stored in GCS or logged
  • Instances automatically terminate to prevent runaway costs
  • Error logs exclude sensitive information

Examples

Download 100K Videos with Metadata Only

export DOWNLOAD_METADATA="true"
export DOWNLOAD_VIDEO="false"
export MESSAGE_FILES="/data/video_ids_100k.json"
export ROOT_BLOB_DIR="metadata-100k"
./scripts/publish_and_orchestrate_download.sh

High-Quality Video Download

export DOWNLOAD_VIDEO="true"
export MAX_HEIGHT="1080"
export MESSAGE_FILES="/data/hq_video_ids.json"
export ROOT_BLOB_DIR="hq-videos"
./scripts/publish_and_orchestrate_download.sh

Audio-Only Extraction

export DOWNLOAD_AUDIO="true"
export DOWNLOAD_VIDEO="false"
export MESSAGE_FILES="/data/audio_video_ids.json"
export ROOT_BLOB_DIR="audio-dataset"
./scripts/publish_and_orchestrate_download.sh

Support and Contributing

For issues, feature requests, or contributions, please refer to the main project repository. When reporting issues, include:

  • Complete error messages and logs
  • Configuration parameters used
  • GCP project and region information
  • Sample video IDs that failed (if applicable)

About

A versatile video crawling engine tested on crawling a petabyte of video data.
