Get Started with Video Curation#
This guide shows how to install Curator and run your first video curation pipeline.
The example pipeline processes a list of videos, splitting each into 10‑second clips using a fixed stride. It then generates clip‑level embeddings for downstream tasks such as duplicate removal and similarity search.
Overview#
This quickstart guide demonstrates how to:
Install NeMo Curator with video processing support
Set up FFmpeg with GPU-accelerated encoding
Configure embedding models (Cosmos-Embed1 or InternVideo2)
Process videos through a complete splitting and embedding pipeline
Generate outputs ready for duplicate removal, captioning, and model training
What you’ll build: A video processing pipeline that:
Splits videos into 10-second clips using fixed stride or scene detection
Generates clip-level embeddings for similarity search and deduplication
Optionally creates captions and preview images
Outputs results in formats compatible with multimodal training workflows
Prerequisites#
System Requirements#
To use NeMo Curator’s video curation capabilities, ensure your system meets these requirements:
Operating System#
Ubuntu 24.04, 22.04, or 20.04 (required for GPU-accelerated video processing)
Other Linux distributions may work but are not officially supported
Python Environment#
Python 3.10, 3.11, or 3.12
uv package manager for dependency management
Git for model and repository dependencies
GPU Requirements#
NVIDIA GPU required (CPU-only mode not supported for video processing)
Architecture: Volta™ or newer (compute capability 7.0+)
Examples: V100, T4, RTX 2080+, A100, H100
CUDA: Version 12.0 or above
VRAM: Minimum requirements by configuration:
Basic splitting + embedding: ~16GB VRAM
Full pipeline (splitting + embedding + captioning): ~38GB VRAM
Reduced configuration (lower batch sizes, FP8): ~21GB VRAM
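To confirm your environment meets these requirements before installing, a quick check from Python can help (a minimal sketch; assumes PyTorch is already installed):

```python
import sys

import torch

# Python 3.10-3.12 is required.
assert (3, 10) <= sys.version_info[:2] <= (3, 12), f"Unsupported Python {sys.version}"

# A CUDA-capable GPU with compute capability 7.0+ (Volta or newer) is required.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
major, minor = torch.cuda.get_device_capability(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Compute capability: {major}.{minor} (need 7.0+)")
print(f"VRAM: {vram_gb:.1f} GB (~16 GB needed for splitting + embedding)")
```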
Software Dependencies#
FFmpeg 7.0+ with H.264 encoding support
GPU encoder: h264_nvenc (recommended for performance)
CPU encoders: libopenh264 or libx264 (fallback options)
Tip
If you don’t have uv installed, refer to the Installation Guide for setup instructions, or install it quickly with:
```bash
curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh
source $HOME/.local/bin/env
```

Install#
Create and activate a virtual environment, then choose an install option:
Note
Cosmos-Embed1 (the default) generally outperforms InternVideo2 on most video embedding tasks. Use cosmos-embed1-224p unless you have a specific requirement for InternVideo2.
PyPI install (Cosmos-Embed1):

```bash
uv pip install torch wheel_stub psutil setuptools setuptools_scm
uv pip install --no-build-isolation "nemo-curator[video_cuda12]"
```

Source install (Cosmos-Embed1):

```bash
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
uv sync --extra video_cuda12 --all-groups
source .venv/bin/activate
```

PyPI install (InternVideo2):

```bash
# Install base dependencies
uv pip install torch wheel_stub psutil setuptools setuptools_scm
uv pip install --no-build-isolation "nemo-curator[video_cuda12]"

# Clone and set up InternVideo2
git clone https://github.com/OpenGVLab/InternVideo.git
cd InternVideo
git checkout 09d872e5093296c6f36b8b3a91fc511b76433bf7

# Download and apply NeMo Curator patch
curl -fsSL https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/external/intern_video2_multimodal.patch -o intern_video2_multimodal.patch
patch -p1 < intern_video2_multimodal.patch
cd ..

# Add InternVideo2 to the environment
uv pip install InternVideo/InternVideo2/multi_modality
```

Source install (InternVideo2):

```bash
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
uv sync --extra video_cuda12 --all-groups
bash external/intern_video2_installation.sh
uv add InternVideo/InternVideo2/multi_modality
source .venv/bin/activate
```

Container:

NeMo Curator is available as a standalone container:
```bash
# Pull the container
docker pull nvcr.io/nvidia/nemo-curator:25.09

# Run the container
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:25.09
```

See also
For details on container environments and configurations, see Container Environments.
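Whichever install path you choose, you can verify that the package is installed in the active environment and report its version (a quick sanity check using the standard importlib.metadata lookup):

```python
from importlib.metadata import version

# Confirms the nemo-curator distribution is installed and prints its version.
print("nemo-curator", version("nemo-curator"))
```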
Install FFmpeg and Encoders#
Curator’s video pipelines rely on FFmpeg for decoding and encoding. If you plan to encode clips (for example, using --transcode-encoder libopenh264 or h264_nvenc), install FFmpeg with the corresponding encoders.
Use the maintained script in the repository to build and install FFmpeg with libopenh264 and NVIDIA NVENC support. The script enables --enable-libopenh264, --enable-cuda-nvcc, and --enable-libnpp.
Script source: docker/common/install_ffmpeg.sh
```bash
curl -fsSL https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/docker/common/install_ffmpeg.sh -o install_ffmpeg.sh
chmod +x install_ffmpeg.sh
sudo bash install_ffmpeg.sh
```

Confirm that FFmpeg is on your PATH and that at least one H.264 encoder is available:
```bash
ffmpeg -hide_banner -version | head -n 5
ffmpeg -encoders | grep -E "h264_nvenc|libopenh264|libx264" | cat
```

If encoders are missing, reinstall FFmpeg with the required options or use the Debian/Ubuntu script above.
Refer to Clip Encoding to choose encoders and verify NVENC support on your system.
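If you would rather script this check, for example as part of a setup validation step, the same verification can be done from Python (a sketch; the encoder names match those above):

```python
import shutil
import subprocess

# Verify ffmpeg is on PATH and that at least one supported H.264 encoder is built in.
assert shutil.which("ffmpeg"), "ffmpeg not found on PATH"
encoders = subprocess.run(
    ["ffmpeg", "-hide_banner", "-encoders"],
    capture_output=True, text=True, check=True,
).stdout
found = [e for e in ("h264_nvenc", "libopenh264", "libx264") if e in encoders]
print("H.264 encoders available:", ", ".join(found) or "none (reinstall FFmpeg)")
```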
Available Models#
Embeddings convert each video clip into a numeric vector that captures visual and semantic content. Curator uses these vectors to:
Remove near-duplicate clips during duplicate removal
Enable similarity search and clustering
Support downstream analysis such as caption verification
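For intuition, duplicate removal and similarity search both reduce to comparing these vectors, typically with cosine similarity (an illustrative sketch, not Curator's internal API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two clip embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Clips whose embeddings score above a threshold (for example 0.95)
# are candidates for near-duplicate removal.
```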
NeMo Curator supports two embedding model families:
Cosmos-Embed1 (Recommended)#
Cosmos-Embed1 (default): Available in three variants—cosmos-embed1-224p, cosmos-embed1-336p, and cosmos-embed1-448p—which differ in input resolution and accuracy/VRAM tradeoff. All variants are automatically downloaded to MODEL_DIR on first run.
| Model Variant | Resolution | VRAM Usage | Speed | Accuracy | Best For |
|---|---|---|---|---|---|
| cosmos-embed1-224p | 224×224 | ~8GB | Fastest | Good | Large-scale processing, initial curation |
| cosmos-embed1-336p | 336×336 | ~12GB | Medium | Better | Balanced performance and quality |
| cosmos-embed1-448p | 448×448 | ~16GB | Slower | Best | High-quality embeddings, fine-grained matching |
InternVideo2 (IV2)#
Open model that requires the IV2 checkpoint and BERT model files to be available locally; higher VRAM usage.
For this quickstart, we’re going to set up support for Cosmos-Embed1-224p.
Prepare Model Weights#
For most use cases, you only need to create a model directory. The required model files will be downloaded automatically on first run.
Create a model directory ($MODEL_DIR is defined under Set Up Data Directories below):
```bash
mkdir -p "$MODEL_DIR"
```
Tip
You can reuse the same $MODEL_DIR across runs. No additional setup is required; the model is downloaded automatically when first used.
Set Up Data Directories#
Organize input videos and output locations before running the pipeline.
Local: For local file processing. Define paths like:
```bash
DATA_DIR=/path/to/videos
OUT_DIR=/path/to/output_clips
MODEL_DIR=/path/to/models
```
S3: For cloud storage (AWS S3, MinIO, etc.). Configure credentials in ~/.aws/credentials and use s3:// paths for --video-dir and --output-clip-path.
S3 usage notes:
Input videos can be read from S3 paths
Output clips can be written to S3 paths
Model directory should remain local for performance
Ensure IAM permissions allow read/write access to specified buckets
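To confirm your credentials and IAM permissions before a long run, a quick listing of the input prefix can help (a sketch using boto3; the bucket and prefix names are hypothetical):

```python
import boto3

# Credentials are read from ~/.aws/credentials by default.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="raw_videos/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```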
Run the Splitting Pipeline Example#
Use the following example script to read videos, split into clips, and write outputs. This runs a Ray pipeline with XennaExecutor under the hood.
```bash
python -m nemo_curator.examples.video.video_split_clip_example \
  --video-dir "$DATA_DIR" \
  --model-dir "$MODEL_DIR" \
  --output-clip-path "$OUT_DIR" \
  --splitting-algorithm fixed_stride \
  --fixed-stride-split-duration 10.0 \
  --embedding-algorithm cosmos-embed1-224p \
  --transcode-encoder libopenh264 \
  --verbose
```

What this command does:
Reads all video files from $DATA_DIR
Splits each video into 10-second clips using fixed stride
Generates embeddings using the Cosmos-Embed1-224p model
Encodes clips using the libopenh264 codec
Writes output clips and metadata to $OUT_DIR
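If you want to drive the example from Python instead of the shell, for instance to sweep clip durations across runs, a thin subprocess wrapper works; this sketch reuses exactly the flags shown above and assumes DATA_DIR, MODEL_DIR, and OUT_DIR are exported in the environment:

```python
import os
import subprocess

def run_split_pipeline(video_dir: str, model_dir: str, out_dir: str,
                       clip_seconds: float = 10.0) -> None:
    """Run the splitting example with a fixed-stride clip duration."""
    subprocess.run(
        [
            "python", "-m", "nemo_curator.examples.video.video_split_clip_example",
            "--video-dir", video_dir,
            "--model-dir", model_dir,
            "--output-clip-path", out_dir,
            "--splitting-algorithm", "fixed_stride",
            "--fixed-stride-split-duration", str(clip_seconds),
            "--embedding-algorithm", "cosmos-embed1-224p",
            "--transcode-encoder", "libopenh264",
            "--verbose",
        ],
        check=True,
    )

run_split_pipeline(os.environ["DATA_DIR"], os.environ["MODEL_DIR"], os.environ["OUT_DIR"])
```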
Configuration Options Reference#
| Option | Values | Description |
|---|---|---|
| Splitting | | |
| --splitting-algorithm | fixed_stride, transnetv2 | Method for dividing videos into clips |
| --fixed-stride-split-duration | Float (seconds) | Clip length for fixed stride (default: 10.0) |
| --transnetv2-frame-decoder-mode | | Frame decoding method for TransNetV2 |
| Embedding | | |
| --embedding-algorithm | cosmos-embed1-224p, cosmos-embed1-336p, cosmos-embed1-448p, internvideo2 | Embedding model to use |
| Encoding | | |
| --transcode-encoder | libopenh264, h264_nvenc, libx264 | Video encoder for output clips |
| --transcode-use-hwaccel | Flag | Enable hardware acceleration for encoding |
| Optional Features | | |
| --generate-captions | Flag | Generate text captions for each clip |
| --generate-previews | Flag | Create preview images for each clip |
| --verbose | Flag | Enable detailed logging output |
Tip
To use InternVideo2 instead, set --embedding-algorithm internvideo2.
Understanding Pipeline Output#
After successful execution, the output directory will contain:
```
$OUT_DIR/
├── clips/
│   ├── video1_clip_0000.mp4
│   ├── video1_clip_0001.mp4
│   └── ...
├── embeddings/
│   ├── video1_clip_0000.npy
│   ├── video1_clip_0001.npy
│   └── ...
├── metadata/
│   └── manifest.jsonl
└── previews/   (if --generate-previews enabled)
    ├── video1_clip_0000.jpg
    └── ...
```

File descriptions:
clips/: Encoded video clips (MP4 format)
embeddings/: Numpy arrays containing clip embeddings (for similarity search)
metadata/manifest.jsonl: JSONL file with clip metadata (paths, timestamps, embedding references)
previews/: Thumbnail images for each clip (optional)
Example manifest entry:
{ "video_path": "/data/input_videos/video1.mp4", "clip_path": "/data/output_clips/clips/video1_clip_0000.mp4", "start_time": 0.0, "end_time": 10.0, "embedding_path": "/data/output_clips/embeddings/video1_clip_0000.npy", "preview_path": "/data/output_clips/previews/video1_clip_0000.jpg" } Best Practices#
Best Practices#
Data Preparation#
Validate input videos: Ensure videos are not corrupted before processing
Consistent formats: Convert videos to a standard format (MP4 with H.264) for consistent results
Organize by content: Group similar videos together for efficient processing
Model Selection#
Start with Cosmos-Embed1-224p: Best balance of speed and quality for initial experiments
Upgrade resolution as needed: Use 336p or 448p only when higher precision is required
Monitor VRAM usage: Check GPU memory with nvidia-smi during processing
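If you prefer to sample GPU memory from inside a script rather than watching nvidia-smi, the NVML bindings work (a sketch; assumes the nvidia-ml-py package is installed):

```python
import pynvml  # provided by the nvidia-ml-py package

# Report current memory usage on GPU 0.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU 0 memory: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GB used")
pynvml.nvmlShutdown()
```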
Pipeline Configuration#
Enable verbose logging: Use the --verbose flag for debugging and monitoring
Test on small subset: Run the pipeline on 5-10 videos before processing large datasets (see the sketch after this list)
Use GPU encoding: Enable NVENC for significant performance improvements
Save intermediate results: Keep embeddings and metadata for downstream tasks
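One easy way to set up the small-subset test above is to copy a random sample of videos into a scratch directory and point --video-dir at it (an illustrative sketch; the paths are hypothetical):

```python
import random
import shutil
from pathlib import Path

# Hypothetical paths; adjust to your layout.
src = Path("/path/to/videos")
dst = Path("/path/to/videos_sample")
dst.mkdir(parents=True, exist_ok=True)

# Copy 5 randomly chosen MP4s for a quick trial run.
for video in random.sample(sorted(src.glob("*.mp4")), k=5):
    shutil.copy2(video, dst / video.name)
```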
Infrastructure#
Use shared storage: Mount shared filesystem for multi-node processing
Allocate sufficient VRAM: Plan for peak usage (captioning + embedding)
Monitor GPU utilization: Use nvidia-smi dmon to track GPU usage during processing
Schedule long-running jobs: Process large video datasets in batch jobs overnight
Next Steps#
Explore the Video Curation documentation. For encoding guidance, refer to Clip Encoding.