Feature: Batch- and axis-aware image/video/volume operations #313

@ternaus

Describe what you are looking for

Feature request: Batch- and axis-aware image/video/volume operations

Summary

We need batch- and axis-aware versions of common image operations so that videos and volumes can be processed without Python loops over frames/slices. The same APIs should accept the shapes we choose:

  • Single image: (H, W, C)
  • Video / batch of frames: (N, H, W, C) — N frames (or batch size)
  • Volume (3D stack): (D, H, W, C) — D depth slices

The general form is one leading dimension (D or N) + (H, W, C): both (N, H, W, C) and (D, H, W, C) should be supported so that video (N frames) and volume (D slices) are first-class. We may also want an arbitrary batch/axis (e.g. apply along axis 0 for any ndarray). Operations should work on the exact shapes we pass, with no need to reshape or loop in Python.

Critical point: cv2.flip (and similarly other cv2 functions) does not work on videos and volumes: OpenCV expects a single image (H, W, C). For video (N, H, W, C) or volume (D, H, W, C) we must loop over frames/slices in Python, which is slow and prevents using optimized batch paths. We need flip (and all operations below) to accept both shapes—and in general one leading dimension (N or D) + (H, W, C)—so a single call can process the whole array.
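For reference, the per-frame loop and a single vectorized call can be compared with NumPy alone. The sketch below uses `np.flip` as a stand-in for the requested batch-aware flip. `flip_per_frame` and `flip_batched` are hypothetical helper names for illustration, not part of any existing API, and the slicing inside the loop stands in for a call such as `cv2.flip(frame, 1)`.

```python
import numpy as np

def flip_per_frame(video: np.ndarray) -> np.ndarray:
    """Horizontal flip done the slow way: one Python iteration per frame.

    Each iteration stands in for a call such as cv2.flip(frame, 1).
    """
    return np.stack([frame[:, ::-1] for frame in video])

def flip_batched(array: np.ndarray, axis) -> np.ndarray:
    """Flip the whole array in one call; works for (N, H, W, C) and (D, H, W, C)."""
    return np.flip(array, axis=axis)

video = np.arange(2 * 3 * 4 * 1).reshape(2, 3, 4, 1)   # (N, H, W, C)
looped = flip_per_frame(video)
batched = flip_batched(video, axis=2)                   # axis 2 is W: horizontal flip

volume = np.arange(2 * 3 * 4 * 1).reshape(2, 3, 4, 1)  # (D, H, W, C)
depth_and_width = flip_batched(volume, axis=(0, 2))     # flip depth and width together
```

Both paths give the same result for the video; the batched call simply avoids the Python-level loop and also covers the multi-axis case for volumes.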


Target semantics

  • Input shape: Caller passes an array of shape we want, e.g.:
    • (H, W, C) — single image
    • (N, H, W, C) — N images (video frames or batch)
    • (D, H, W, C) — D slices (volume, e.g. 3D medical imaging)
    • General form: one leading dimension (N or D) + (H, W, C). Both (N, H, W, C) and (D, H, W, C) should be explicitly supported.
    • Optionally: a generic "batch axis" (e.g. apply op along axis 0 for any ndarray).
  • Output shape: Same as input shape (or explicitly documented, e.g. resize changes H, W).
  • No Python loop: The implementation should process all batch elements (or the chosen axis) in one go, using vectorized/backend code, not a per-frame loop in Python.
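One possible way to get the "same shape in, same shape out" behaviour is to promote a single image to a batch of one and drop the extra axis afterwards. This is only a sketch under that assumption: `apply_with_leading_dim` and `invert` are hypothetical names, and a real backend might dispatch on shape differently.

```python
import numpy as np
from typing import Callable

def apply_with_leading_dim(
    op: Callable[[np.ndarray], np.ndarray], array: np.ndarray
) -> np.ndarray:
    """Run a 4D-only operation on (H, W, C) or (N/D, H, W, C) input.

    A single image is promoted to a batch of one, and the leading
    dimension is dropped again so the output rank matches the input.
    """
    if array.ndim == 3:
        return op(array[np.newaxis])[0]
    if array.ndim == 4:
        return op(array)
    raise ValueError(f"expected (H, W, C) or (N, H, W, C), got shape {array.shape}")

def invert(batch: np.ndarray) -> np.ndarray:
    """Example 4D op: invert uint8 intensities for every batch element at once."""
    return 255 - batch

image = np.zeros((4, 5, 3), dtype=np.uint8)      # (H, W, C)
video = np.zeros((7, 4, 5, 3), dtype=np.uint8)   # (N, H, W, C)
out_image = apply_with_leading_dim(invert, image)
out_video = apply_with_leading_dim(invert, video)
```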

List of operations we need (batch/axis-aware)

  1. Flip (cv2 analogue: cv2.flip)
    Flip along any axis (or multiple axes), similar to numpy.flip(array, axis=...). Batch (N, H, W, C): flip along the spatial axes, where axis 1 (height) gives a vertical flip and axis 2 (width) gives a horizontal flip; for a single image (H, W, C) these are axes 0 and 1. 3D volume (D, H, W, C): flip along axis 0 (depth), 1 (height), 2 (width), or any combination (e.g. axis=(0, 2)). Same semantics as numpy; one API for 2D and 3D.

  2. Warp affine (cv2 analogue: cv2.warpAffine)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H', W', C). Apply 2D affine (2×3 or 3×3 matrix) per frame; one matrix per batch element or one shared. Same N; spatial size can change to (H', W').
    • 3D volume: (D, H, W, C) → (D', H', W', C). Apply 3D affine (4×4 matrix) with trilinear (or nearest) interpolation. Enables true 3D rotation/scaling/translation. No cv2 equivalent; analogous to scipy.ndimage.affine_transform.
  3. Warp perspective (cv2 analogue: cv2.warpPerspective)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H', W', C). Apply 2D perspective (3×3 matrix) per frame; one matrix per batch element or one shared. Same N; spatial size can change to (H', W').
    • 3D volume: (D, H, W, C) → (D', H', W', C). Apply 3D projective transform (4×4 matrix with perspective divide). No cv2 equivalent.
  4. Resize (cv2 analogue: cv2.resize)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H', W', C). Resize spatial dimensions (height, width) per frame; same or per-frame target size. Extends cv2.resize.
    • 3D volume: (D, H, W, C) → (D', H', W', C). Resize in all three spatial directions (depth, height, width). No cv2 equivalent.
  5. Remap (cv2 analogue: cv2.remap)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H', W', C) (or same shape). Generic remap with map_x, map_y; same or per-frame maps. Extends cv2.remap.
    • 3D volume: (D, H, W, C) → (D', H', W', C) (or same shape). 3D remap with coordinate maps for all three axes. No cv2 equivalent.
  6. Copy make border (pad) (cv2 analogue: cv2.copyMakeBorder)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H+top+bottom, W+left+right, C). Pad height and width. Extends cv2.copyMakeBorder.
    • 3D volume: (D, H, W, C) → (D+front+back, H+top+bottom, W+left+right, C). Pad in all three directions. No cv2 equivalent.
  7. Blur (box filter) (cv2 analogue: cv2.blur)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H, W, C). Rectangular kernel blur per frame. Extends cv2.blur.
    • 3D volume: See Blur3D below (3D-only op).
  8. Gaussian blur (cv2 analogue: cv2.GaussianBlur)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H, W, C). Gaussian kernel blur per frame. Extends cv2.GaussianBlur.
    • 3D volume: See GaussianBlur3D below (3D-only op).
  9. Median blur (cv2 analogue: cv2.medianBlur)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H, W, C). Median filter per frame. Extends cv2.medianBlur.
    • 3D volume: See MedianBlur3D below (3D-only op).
  10. Filter2D (2D convolution) (cv2 analogue: cv2.filter2D)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H, W, C). Custom 2D kernel per frame; same or per-element kernel. Extends cv2.filter2D.
    • 3D volume: See Filter3D below (3D-only op).
  11. SepFilter2D (separable 2D filter) (cv2 analogue: cv2.sepFilter2D)
    We need two transforms:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H, W, C). Separable (kernelX, kernelY) per frame. Extends cv2.sepFilter2D.
    • 3D volume: See SepFilter3D below (3D-only op).
  12. Erode / Dilate / MorphologyEx (cv2 analogues: cv2.erode, cv2.dilate, cv2.morphologyEx)
    We need two variants:

    • 2D batch (direct cv2 extension): (N, H, W, C) → (N, H, W, C). Morphological ops per frame with 2D structuring element. Extends cv2 erode/dilate/morphologyEx.
    • 3D volume: See Erode3D / Dilate3D / MorphologyEx3D below (3D-only ops).
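Of the operations listed, padding is the simplest to illustrate, since `np.pad` already accepts one (before, after) pair per axis. The sketch below is one way the 2D-batch and 3D-volume variants of copy make border could share a single code path. `pad_spatial` and its `pads` argument are hypothetical names, and channel-last layout is assumed.

```python
import numpy as np

def pad_spatial(array, pads, mode="constant", **kwargs):
    """Pad the spatial axes of a channel-last array, leaving other axes alone.

    pads: one (before, after) pair per spatial axis, in axis order.
          Two pairs pad (H, W); three pairs pad (D, H, W).
    """
    n_spatial = len(pads)
    channel_axis = array.ndim - 1
    first_spatial = channel_axis - n_spatial
    if first_spatial < 0:
        raise ValueError("more pad pairs than spatial axes")
    pad_width = [(0, 0)] * array.ndim
    for offset, pair in enumerate(pads):
        pad_width[first_spatial + offset] = tuple(pair)
    return np.pad(array, pad_width, mode=mode, **kwargs)

video = np.ones((8, 32, 48, 3), dtype=np.uint8)        # (N, H, W, C)
padded_video = pad_spatial(video, [(2, 2), (4, 4)])     # pad H and W only

volume = np.ones((16, 32, 48, 1), dtype=np.uint8)      # (D, H, W, C)
padded_volume = pad_spatial(volume, [(1, 1), (2, 2), (4, 4)], mode="reflect")
```

For border modes, NumPy's "reflect" corresponds to cv2.BORDER_REFLECT_101 (edge pixel not repeated) and "symmetric" to cv2.BORDER_REFLECT.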

3D-only operations (no cv2 equivalent)

These operate on volumes (D, H, W, C) and have no cv2 equivalent; they are the natural 3D extensions of the 2D ops above.

  • Blur3D — Box filter in 3D. (D, H, W, C) → (D, H, W, C). Rectangular 3D kernel.
  • GaussianBlur3D — Gaussian blur in 3D. (D, H, W, C) → (D, H, W, C). Kernel size and/or sigma per axis.
  • MedianBlur3D — Median filter in 3D. (D, H, W, C) → (D, H, W, C).
  • Filter3D — 3D convolution with a 3D kernel. (D, H, W, C) → (D, H, W, C). Custom 3D kernel.
  • SepFilter3D — Separable 3D filter (e.g. kernelD, kernelH, kernelW). (D, H, W, C) → (D, H, W, C). More efficient than full 3D kernel when separable.
  • Erode3D / Dilate3D / MorphologyEx3D — Morphological ops with a 3D structuring element. (D, H, W, C) → (D, H, W, C).
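To make the separable case concrete, here is a rough NumPy sketch of a SepFilter3D-style operation: three 1D passes, one per spatial axis, with no Python loop over slices. `filter_along_axis` and `sep_filter_3d` are hypothetical names, edge padding is an arbitrary choice for illustration, and the kernel is applied as a correlation (identical to convolution for symmetric kernels). It needs NumPy 1.20 or newer for `sliding_window_view` and is not tuned for memory or speed.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter_along_axis(array, kernel, axis):
    """Correlate `array` with a 1D kernel along one axis, edge-padded, same size out."""
    kernel = np.asarray(kernel, dtype=np.float64)
    if kernel.ndim != 1 or kernel.size % 2 == 0:
        raise ValueError("kernel must be 1D with odd length")
    radius = kernel.size // 2
    pad_width = [(0, 0)] * array.ndim
    pad_width[axis] = (radius, radius)
    padded = np.pad(array.astype(np.float64), pad_width, mode="edge")
    # Windows are appended as a new trailing axis of length kernel.size.
    windows = sliding_window_view(padded, kernel.size, axis=axis)
    return windows @ kernel

def sep_filter_3d(volume, kernel_d, kernel_h, kernel_w):
    """Apply three 1D kernels along D, H and W of a (D, H, W, C) volume."""
    out = filter_along_axis(volume, kernel_d, axis=0)
    out = filter_along_axis(out, kernel_h, axis=1)
    return filter_along_axis(out, kernel_w, axis=2)

box = np.full(3, 1.0 / 3.0)                      # 3-tap box kernel
volume = np.full((5, 6, 7, 2), 9.0)              # constant (D, H, W, C) volume
blurred = sep_filter_3d(volume, box, box, box)   # equivalent to a 3x3x3 box blur
```

With kernels of length k, the three passes cost about 3k multiplications per voxel instead of k³ for a dense 3D kernel, which is the efficiency argument made for SepFilter3D above.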

Why this matters

  • Videos (N, H, W, C): Today we loop over N and call cv2.flip, cv2.resize, cv2.warpAffine, etc. per frame. That loses SIMD/GPU batch optimizations and adds Python overhead.
  • Volumes (D, H, W, C): Same for 3D data: we loop over D and call the same cv2 APIs. We need one call that applies along the leading dimension. Specifying both N (video) and D (volume) makes it clear we need the same semantics for both shapes.
  • Consistency: All of the above operations are used in augmentation pipelines (e.g. AlbumentationsX). Having them in a single backend with consistent “shape we want” semantics would let us support video and volume augmentation without per-frame loops.

Summary table

2D batch = direct cv2 extension: (N, H, W, C) → (N, H', W', C) (or same spatial size). 3D volume = (D, H, W, C) → (D', H', W', C) (or same size).

| Operation | cv2 / current API | 2D batch (N,H,W,C) | 3D volume (D,H,W,C) |
| --- | --- | --- | --- |
| Flip | cv2.flip | Yes (flip along any axis) | Yes (axis 0,1,2 or combo, like numpy.flip) |
| Warp affine | cv2.warpAffine | (N,H,W,C)→(N,H',W',C) | (D,H,W,C)→(D',H',W',C) |
| Warp perspective | cv2.warpPerspective | (N,H,W,C)→(N,H',W',C) | (D,H,W,C)→(D',H',W',C) |
| Resize | cv2.resize | (N,H,W,C)→(N,H',W',C) | (D,H,W,C)→(D',H',W',C) (resize in D,H,W) |
| Remap | cv2.remap | Yes | Yes (3D coordinate maps) |
| Copy make border | cv2.copyMakeBorder | Yes (pad H,W) | Yes (pad D,H,W) |
| Blur | cv2.blur | Yes | Blur3D |
| Gaussian blur | cv2.GaussianBlur | Yes | GaussianBlur3D |
| Median blur | cv2.medianBlur | Yes | MedianBlur3D |
| Filter2D | cv2.filter2D | Yes | Filter3D |
| SepFilter2D | cv2.sepFilter2D | Yes | SepFilter3D |
| Erode / Dilate / MorphologyEx | cv2.erode, dilate, morphologyEx | Yes | Erode3D / Dilate3D / MorphologyEx3D |

3D-only (no cv2 equivalent): Blur3D, GaussianBlur3D, MedianBlur3D, Filter3D, SepFilter3D, Erode3D, Dilate3D, MorphologyEx3D — all on (D, H, W, C).

All 2D batch ops should accept single image (H, W, C) and video (N, H, W, C). All 3D ops should accept volume (D, H, W, C). Process in a single call without Python loops.
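As an illustration of "single call, no Python loop over frames" for a shape-changing op, the following nearest-neighbour resize works on (H, W, C), (N, H, W, C) and (D, H, W, C) alike. It is a minimal sketch, not a proposal for the final API: `resize_nearest` and `out_size` are hypothetical names, and the pixel-centre index mapping used here may pick different source pixels than cv2.resize with INTER_NEAREST. The only loop is over the two or three spatial axes.

```python
import numpy as np

def resize_nearest(array, out_size):
    """Nearest-neighbour resize of the spatial axes of a channel-last array.

    out_size: target length per spatial axis, e.g. (H', W') for
              (H, W, C) / (N, H, W, C), or (D', H', W') for a volume.
    The spatial axes are taken to be the ones just before the channel axis.
    """
    n_spatial = len(out_size)
    first_spatial = array.ndim - 1 - n_spatial
    if first_spatial < 0:
        raise ValueError("more target sizes than spatial axes")
    out = array
    for offset, new_len in enumerate(out_size):
        axis = first_spatial + offset
        old_len = array.shape[axis]
        # Map each output index to the source index at its pixel centre.
        src = np.floor((np.arange(new_len) + 0.5) * old_len / new_len).astype(np.intp)
        src = np.clip(src, 0, old_len - 1)
        out = np.take(out, src, axis=axis)
    return out

video = np.arange(4 * 6 * 8 * 3).reshape(4, 6, 8, 3)    # (N, H, W, C)
small_video = resize_nearest(video, (3, 4))              # resize H and W, keep N

volume = np.arange(6 * 6 * 8 * 1).reshape(6, 6, 8, 1)   # (D, H, W, C)
small_volume = resize_nearest(volume, (3, 3, 4))         # resize D, H and W
```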

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct


Labels: enhancement (New feature or request)
