This is the official implementation of our paper, accepted to CVPR 2026.
Authors: Bowen Wen, Shaurya Dewan, Stan Birchfield
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10× faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods.
[NOTE] This model is designed for real-time applications. For offline computation with the best accuracy, please check out our earlier work FoundationStereo.
- Option 1: Docker
```
docker build --network host -t ffs -f docker/dockerfile .
bash docker/run_container.sh
```
- Option 2: pip
```
conda create -n ffs python=3.12 && conda activate ffs
pip install torch==2.6.0 torchvision==0.21.0 xformers --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```
The repository also includes a modular live object detection entrypoint for stereo cameras under `scripts/detection/`.
Example with RealSense D435i using RGB for YOLO detection while keeping IR stereo streams available:
```
python scripts/run_live_detection.py --camera d435i --detect-source rgb --targets person 'cell phone' --model yolov8n.pt --show-stereo-preview
```
Example with RGB detection and Fast-FoundationStereo distance overlay projected from left IR into RGB view:
```
python scripts/run_live_detection.py --camera d435i --detect-source rgb --targets person 'cell phone' --model yolov8n.pt --enable-distance --show-stereo-preview
```
Example with YOLO-Seg and mask-based distance estimation:
```
python scripts/run_live_detection_seg.py --camera d435i --detect-source rgb --targets person 'cell phone' --enable-distance --show-stereo-preview
```
Example with YOLO pose estimation for head and hands:
```
python scripts/run_live_pose.py --camera d435i --detect-source rgb --enable-distance --show-stereo-preview
```
Example with left-IR detection and Fast-FoundationStereo distance overlay using the bbox center region:
```
python scripts/run_live_detection.py --camera d435i --detect-source left_ir --targets person --model yolov8n.pt --enable-distance --show-stereo-preview
```
Example with a DECXIN stereo camera exposed as a side-by-side V4L2 stream on /dev/video0:
```
python scripts/run_live_detection.py \
  --camera decxin \
  --detect-source left_ir \
  --video-device 0 \
  --video-width 1280 \
  --video-height 480 \
  --calibration-path calibration/v4l2_stereo/stereo_calibration.npz \
  --targets person \
  --enable-distance \
  --show-stereo-preview
```
Key flags:
- `--targets`: one or more YOLO class names to keep.
- `--detect-source`: choose `rgb`, `left_ir`, or `right_ir` as YOLO input.
- `--detect-every`: run YOLO every N frames to trade accuracy for throughput.
- `--enable-distance`: estimate distance for `left_ir` or `rgb` detections using Fast-FoundationStereo.
- `--center-ratio`: use the center portion of each bbox for depth statistics. Smaller values are usually more stable for people.
- `--use-mask-distance`: when the YOLO model outputs segmentation masks, compute distance from the masked pixels inside the center region.
- `--mask-alpha`: control segmentation mask overlay transparency.
- `--disable-color`: disable the RGB stream if you only want IR-based detection.
- `--device-serial`: pick a specific RealSense device when multiple are connected.
- `--video-device`, `--video-width`, `--video-height`, `--video-fourcc`, `--split-layout`: DECXIN V4L2 stereo camera settings.
- `--calibration-path`: path to either a simple `K.txt` file or a full stereo calibration `.npz` file. For DECXIN cameras, prefer the `.npz` output from the calibration script so live frames are rectified before depth inference.
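As a minimal sketch of the frame-skipping idea behind `--detect-every`: the expensive detector runs only on every N-th frame and the last detections are reused in between. The `detect_fn` stand-in below is illustrative; the actual scripts use a YOLO model.

```python
# Illustrative sketch of --detect-every: run the (expensive) detector only
# every N frames and reuse the last detections for the frames in between.
def detect_with_skipping(frames, detect_fn, detect_every=3):
    """Run detect_fn on every `detect_every`-th frame, reusing stale results."""
    results = []
    last_detections = []
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            last_detections = detect_fn(frame)  # fresh detection
        results.append(last_detections)         # stale or fresh
    return results

calls = []
def fake_detector(frame):
    calls.append(frame)
    return [("person", frame)]

out = detect_with_skipping(list(range(10)), fake_detector, detect_every=3)
print(len(calls))  # 4 — the detector ran only on frames 0, 3, 6, 9
```

The trade-off is bounded staleness: detections can lag the video by up to N-1 frames, which is usually acceptable for slowly moving targets.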
Pose-specific notes:
- `scripts/run_live_pose.py` uses a YOLO pose model and draws skeletons instead of object boxes by default.
- The pose entrypoint estimates distances for `head`, `left_hand`, and `right_hand` using keypoint-centered depth patches.
- Use `--show-bbox` in the pose entrypoint if you also want person boxes for debugging.
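The keypoint-centered depth patches mentioned above can be sketched as follows. The patch size and the median statistic here are illustrative assumptions, not the repository's exact implementation:

```python
import statistics

# Hypothetical sketch: estimate a robust distance at a keypoint by taking the
# median of a small depth patch centered on it, skipping invalid (<= 0) depths.
def keypoint_distance(depth, u, v, half=2):
    """depth: 2D list of metric depths; (u, v): keypoint column/row indices."""
    rows, cols = len(depth), len(depth[0])
    samples = []
    for r in range(max(0, v - half), min(rows, v + half + 1)):
        for c in range(max(0, u - half), min(cols, u + half + 1)):
            if depth[r][c] > 0:            # skip holes in the depth map
                samples.append(depth[r][c])
    return statistics.median(samples) if samples else None

# 5x5 depth map with a hole at the keypoint and one outlier pixel;
# the patch median stays robust to both.
depth = [[2.0] * 5 for _ in range(5)]
depth[2][2] = 0.0   # invalid pixel at the keypoint itself
depth[0][0] = 9.0   # outlier
print(keypoint_distance(depth, 2, 2))  # 2.0
```

A median over a patch is a common choice for keypoint depth because single-pixel lookups are fragile at depth discontinuities such as hand silhouettes.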
If your DECXIN stereo camera appears as a single side-by-side V4L2 stream such as /dev/video0, use the calibration tool below.
- Preview the stereo stream and verify the split layout:
```
python scripts/calibrate_v4l2_stereo.py \
  --mode preview \
  --video-device 0 \
  --video-width 1280 \
  --video-height 480 \
  --preview-frames 5 \
  --save-preview calibration/v4l2_stereo/preview.png
```
- Capture stereo chessboard pairs. Move the board around the full field of view and vary angle and distance. Press `c` only when both left and right views show a valid board overlay, and press `q` when you have enough pairs.
```
python scripts/calibrate_v4l2_stereo.py \
  --mode capture \
  --video-device 0 \
  --video-width 1280 \
  --video-height 480 \
  --board-cols 9 \
  --board-rows 6 \
  --square-size-mm 24 \
  --output-dir calibration/v4l2_stereo
```
- Run stereo calibration from the captured pairs:
```
python scripts/calibrate_v4l2_stereo.py \
  --mode calibrate \
  --board-cols 9 \
  --board-rows 6 \
  --square-size-mm 24 \
  --output-dir calibration/v4l2_stereo
```
Outputs:
- `calibration/v4l2_stereo/K.txt`: flattened rectified left intrinsic matrix plus baseline, compatible with the original demo format.
- `calibration/v4l2_stereo/stereo_calibration.npz`: full stereo calibration and rectification parameters. Use this file with `--calibration-path` for the `decxin` camera backend.
- `calibration/v4l2_stereo/rectified_preview.png`: quick visual check for horizontal epipolar alignment after rectification.
Download from here and put the checkpoints under the folder `weights/` (e.g. `./weights/23-36-37`). The table below compares representative models of varying sizes from our trained family. They are sorted from slowest to fastest, with accuracy descending; runtime is profiled on an NVIDIA RTX 3090 GPU at image size 640x480.
To trade off speed and accuracy, there are two options:
- Try with different checkpoints.
- Tune the config flags (see explanations in the "Run demo" section below).
| Checkpoint | valid_iters | Runtime-Pytorch (ms) | Runtime-TRT (ms) | Peak Memory (MB) |
|---|---|---|---|---|
| 23-36-37 | 8 | 49.4 | 23.4 | 653 |
| 23-36-37 | 4 | 41.1 | 18.4 | 653 |
| 20-26-39 | 8 | 43.6 | 19.4 | 651 |
| 20-26-39 | 4 | 37.5 | 16.4 | 651 |
| 20-30-48 | 8 | 38.4 | 16.6 | 646 |
| 20-30-48 | 4 | 29.3 | 14.0 | 646 |
```
python scripts/run_demo.py --model_dir weights/23-36-37/model_best_bp2_serialize.pth --left_file assets/left.png --right_file assets/right.png --intrinsic_file assets/K.txt --out_dir output/ --remove_invisible 0 --denoise_cloud 1 --scale 1 --get_pc 1 --valid_iters 8 --max_disp 192 --zfar 100
```
| Flag | Meaning |
|---|---|
| `--model_dir` | Path to the trained weights/model file |
| `--left_file` | Path to the left image file |
| `--right_file` | Path to the right image file |
| `--intrinsic_file` | Path to the camera intrinsic matrix and baseline file |
| `--out_dir` | Output directory for saving results |
| `--remove_invisible` | Whether to ignore depth in non-overlapping regions (0: no, 1: yes) |
| `--denoise_cloud` | Whether to apply denoising to the point cloud (0: no, 1: yes) |
| `--scale` | Image scaling factor |
| `--get_pc` | Obtain point cloud output (0: no, 1: yes) |
| `--valid_iters` | Number of refinement updates during the forward pass |
| `--max_disp` | Maximum disparity for volume encoding; 192 should be enough unless you need to sense very near objects (e.g. <0.1 m). Increasing it runs slower and uses more memory. |
| `--zfar` | Maximum depth to include in the point cloud |
Refer to scripts/run_demo.py for a comprehensive list of flags.
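As a worked example of how `--max_disp` bounds the nearest sensable depth: for rectified stereo, depth follows z = fx * baseline / disparity, so the closest measurable depth at a given `--max_disp` is fx * b / max_disp. The focal length and baseline below are illustrative values, not the ones in `assets/K.txt`:

```python
# Depth from disparity for rectified stereo: z = fx * baseline / disparity.
# Hence --max_disp caps how near the model can sense: z_min = fx * b / max_disp.
def depth_from_disparity(fx, baseline_m, disparity_px):
    return fx * baseline_m / disparity_px

fx = 640.0        # illustrative focal length in pixels (assumption)
baseline = 0.05   # illustrative 5 cm baseline in meters (assumption)

z_min = depth_from_disparity(fx, baseline, 192)  # nearest depth at max_disp=192
print(round(z_min, 4))  # 0.1667 (meters)
```

With a wider baseline or longer focal length, z_min grows, which is when raising `--max_disp` above 192 becomes necessary for very near objects.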
Tips:
- The input left and right images should be rectified and undistorted: there should be no fisheye-style lens distortion, and the epipolar lines should be horizontal between the left/right images. If you obtain images from stereo cameras such as ZED, they usually handle this for you.
- Do not swap the left and right images. The left image should really come from the left-side camera (objects will appear farther to the right in the left image than in the right image).
- We recommend using PNG files without lossy compression
- Our method works best on stereo RGB images. However, we have also tested it on monochrome or IR stereo images (e.g. from RealSense D4XX series) and it works well too.
- To get a point cloud for your own data, you need to specify the intrinsics. In the intrinsic file passed via args, the 1st line is the flattened 1x9 intrinsic matrix and the 2nd line is the baseline (distance between the left and right cameras) in meters.
- The model performs better for image widths under 1000 pixels. You can run with a smaller scale, e.g. `--scale 0.5`, to downsize the input image, then upsize the output depth to the size you need with nearest-neighbor interpolation.
- For faster inference, you can reduce the input image resolution, e.g. `--scale 0.5`, and reduce the number of refinement iterations, e.g. `--valid_iters 4`.
- Note that the first run is slower due to compilation; for live use, warm up first and then run in a loop.
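The intrinsic-file layout described in the tips can be parsed as follows. This is a sketch of the stated format (flattened 1x9 matrix on line 1, baseline in meters on line 2), not the repository's actual loader:

```python
# Sketch of parsing the intrinsic file format described in the tips above:
# line 1: flattened 1x9 intrinsic matrix (row-major), line 2: baseline (meters).
def parse_intrinsics(text):
    lines = [l for l in text.strip().splitlines() if l.strip()]
    k = [float(x) for x in lines[0].split()]
    assert len(k) == 9, "expected a flattened 3x3 intrinsic matrix"
    K = [k[0:3], k[3:6], k[6:9]]  # reshape to 3x3
    baseline = float(lines[1])
    return K, baseline

# Illustrative file contents (values are assumptions, not assets/K.txt):
sample = """700.0 0 320.0 0 700.0 240.0 0 0 1
0.05"""
K, baseline = parse_intrinsics(sample)
print(K[0][0], baseline)  # 700.0 0.05
```

The focal length `K[0][0]` and the baseline are exactly the two quantities needed to convert the predicted disparity to metric depth.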
Expect to see results like below:
For TRT, we recommend first setting up the environment in Docker.
```
python scripts/make_onnx.py --model_dir weights/23-36-37/model_best_bp2_serialize.pth --save_path output/ --height 448 --width 640 --valid_iters 8 --max_disp 192
```
| Flag | Meaning |
|---|---|
| `--model_dir` | Path to the trained weights/model file |
| `--save_path` | Directory to save ONNX outputs and zip file |
| `--height` | Input image height; should preferably be divisible by 32. Reducing the image size increases speed. |
| `--width` | Input image width; should preferably be divisible by 32. Reducing the image size increases speed. |
| `--valid_iters` | Number of refinement updates during the forward pass; reduce it for faster speed at some cost in quality |
| `--max_disp` | Maximum disparity for volume encoding; 192 should be enough unless you need to sense very near objects (e.g. <0.1 m). Increasing it runs slower and uses more memory. |
Refer to scripts/make_onnx.py for a comprehensive list of available flags. Since some intermediate operations are not supported by TRT conversion, we split around them into two ONNX files.
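Since the export flags above prefer dimensions divisible by 32, a small helper (illustrative, not part of `scripts/make_onnx.py`) can snap an arbitrary resolution to the nearest valid size:

```python
# Helper to snap an export resolution to a multiple of 32, as suggested for
# --height/--width above. Illustrative sketch, not part of the repository.
def round_to_multiple(x, base=32):
    return max(base, int(round(x / base)) * base)

print(round_to_multiple(480))   # 480 (already divisible by 32)
print(round_to_multiple(450))   # 448
print(round_to_multiple(1000))  # 992
```

Exporting at a snapped size avoids padding inside the network and keeps the ONNX input shape consistent with what the TRT engine expects.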
Then convert from ONNX to TRT as below.
```
trtexec --onnx=output/feature_runner.onnx --saveEngine=output/feature_runner.engine --fp16 --useCudaGraph
trtexec --onnx=output/post_runner.onnx --saveEngine=output/post_runner.engine --fp16 --useCudaGraph
```
To use TRT for inference:
```
python scripts/run_demo_tensorrt.py --onnx_dir output/ --left_file assets/left.png --right_file assets/right.png --intrinsic_file assets/K.txt --out_dir output/ --remove_invisible 0 --denoise_cloud 1 --get_pc 1 --zfar 100
```
Real-world data offers greater diversity and realism than synthetic data. However, obtaining real stereo images with ground-truth metric depth annotations is notoriously difficult. To address this challenge, we propose an automatic data curation pipeline to generate pseudo-labels on internet-scale stereo images from the Stereo4D dataset. Top: pseudo-labeling pipeline on in-the-wild internet stereo data. Bottom: visualization of our generated pseudo-labels.
Below are visualizations of the intermediate results in our pseudo-labeling process. In the rightmost column, a green checkmark or red cross denotes whether a sample is kept for training, based on the percentage of positive pixels in the consistency mask. Our data curation process can automatically discover failures in noisy internet data, such as images containing subtitles (bottom row) or mosaic artifacts (second-to-last row), as well as overly challenging samples that are unsuitable for training (top row). The final pseudo-labels can also correct erroneous predictions from FoundationStereo on sky regions (5th row).
The dataset is available at HuggingFace: https://huggingface.co/datasets/nvidia/ffs_stereo4d
```
@article{wen2026fastfoundationstereo,
  title={{Fast-FoundationStereo}: Real-Time Zero-Shot Stereo Matching},
  author={Bowen Wen and Shaurya Dewan and Stan Birchfield},
  journal={CVPR},
  year={2026}
}
```
Please contact Bowen Wen (bowenw@nvidia.com) for questions and commercial inquiries.
We would like to thank Xutong Ren, Karsten Patzwaldt, Yonggan Fu, Saurav Muralidharan, Han Cai, Pavlo Molchanov, Yu Wang, Varun Praveen, Joseph Aribido and Jun Gao for their insightful early discussions on this project. We would also like to thank the NVIDIA Isaac and TAO teams for their engineering support and valuable discussions. Thanks to the authors of FoundationStereo, Selective-IGEV, Stereo4D and RAFT-Stereo for their code releases. Finally, thanks to the CVPR reviewers and AC for their appreciation of this work and constructive feedback.






