This is the official implementation of our paper, accepted to CVPR 2026.
Authors: Bowen Wen, Shaurya Dewan, Stan Birchfield
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10× faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods.
[NOTE] This model is designed for real-time applications. For offline computation with the best accuracy, please check out our earlier work FoundationStereo.
- Option 1: Docker
```
docker build --network host -t ffs -f docker/dockerfile .
bash docker/run_container.sh
```
- Option 2: pip
```
conda create -n ffs python=3.12 && conda activate ffs
pip install torch==2.6.0 torchvision==0.21.0 xformers --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```
The repository also includes a modular live object detection entrypoint for stereo cameras under `scripts/detection/`.
Example with RealSense D435i using RGB for YOLO detection while keeping IR stereo streams available:
```
python scripts/run_live_detection.py --camera d435i --detect-source rgb --targets person 'cell phone' --model yolov8n.pt --show-stereo-preview
```
Example with RGB detection and Fast-FoundationStereo distance overlay projected from left IR into RGB view:
```
python scripts/run_live_detection.py --camera d435i --detect-source rgb --targets person 'cell phone' --model yolov8n.pt --enable-distance --show-stereo-preview
```
Example with YOLO-Seg and mask-based distance estimation:
```
python scripts/run_live_detection_seg.py --camera d435i --detect-source rgb --targets person 'cell phone' --enable-distance --show-stereo-preview
```
Example with YOLO pose estimation for head and hands:
```
python scripts/run_live_pose.py --camera d435i --detect-source rgb --enable-distance --show-stereo-preview
```
Example with left-IR detection and Fast-FoundationStereo distance overlay using the bbox center region:
```
python scripts/run_live_detection.py --camera d435i --detect-source left_ir --targets person --model yolov8n.pt --enable-distance --show-stereo-preview
```
Example with a DECXIN stereo camera exposed as a side-by-side V4L2 stream on /dev/video0:
```
python scripts/run_live_detection.py \
  --camera decxin \
  --detect-source left_ir \
  --video-device 0 \
  --video-width 1280 \
  --video-height 480 \
  --calibration-path calibration/v4l2_stereo/stereo_calibration.npz \
  --targets person \
  --enable-distance \
  --show-stereo-preview
```
Key flags:
- `--targets`: one or more YOLO class names to keep.
- `--detect-source`: choose `rgb`, `left_ir`, or `right_ir` as YOLO input.
- `--detect-every`: run YOLO every N frames to trade accuracy for throughput.
- `--enable-distance`: estimate distance for `left_ir` or `rgb` detections using Fast-FoundationStereo.
- `--center-ratio`: use the center portion of each bbox for depth statistics. Smaller values are usually more stable for people.
- `--use-mask-distance`: when the YOLO model outputs segmentation masks, compute distance from the masked pixels inside the center region.
- `--mask-alpha`: control segmentation mask overlay transparency.
- `--disable-color`: disable the RGB stream if you only want IR-based detection.
- `--device-serial`: pick a specific RealSense device when multiple are connected.
- `--video-device`, `--video-width`, `--video-height`, `--video-fourcc`, `--split-layout`: DECXIN V4L2 stereo camera settings.
- `--calibration-path`: path to either a simple `K.txt` file or a full stereo calibration `.npz` file. For DECXIN cameras, prefer the `.npz` output from the calibration script so live frames are rectified before depth inference.
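As a minimal sketch of the frame-skipping idea behind `--detect-every`: the expensive detector runs only on every N-th frame and the last detections are reused in between. The `detect_fn` stand-in below is illustrative; the actual scripts use a YOLO model.

```python
# Illustrative sketch of --detect-every: run the (expensive) detector only
# every N frames and reuse the last detections for the frames in between.
def detect_with_skipping(frames, detect_fn, detect_every=3):
    """Run detect_fn on every `detect_every`-th frame, reusing stale results."""
    results = []
    last_detections = []
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            last_detections = detect_fn(frame)  # fresh detection
        results.append(last_detections)         # stale or fresh
    return results

calls = []
def fake_detector(frame):
    calls.append(frame)
    return [("person", frame)]

out = detect_with_skipping(list(range(10)), fake_detector, detect_every=3)
print(len(calls))  # 4 — the detector ran only on frames 0, 3, 6, 9
```

The trade-off is bounded staleness: detections can lag the video by up to N-1 frames, which is usually acceptable for slowly moving targets.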
Pose-specific notes:
- `scripts/run_live_pose.py` uses a YOLO pose model and draws skeletons instead of object boxes by default.
- The pose entrypoint estimates distances for `head`, `left_hand`, and `right_hand` using keypoint-centered depth patches.
- Use `--show-bbox` in the pose entrypoint if you also want person boxes for debugging.
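The keypoint-centered depth patches mentioned above can be sketched as follows. The patch size and the median statistic here are illustrative assumptions, not the repository's exact implementation:

```python
import statistics

# Hypothetical sketch: estimate a robust distance at a keypoint by taking the
# median of a small depth patch centered on it, skipping invalid (<= 0) depths.
def keypoint_distance(depth, u, v, half=2):
    """depth: 2D list of metric depths; (u, v): keypoint column/row indices."""
    rows, cols = len(depth), len(depth[0])
    samples = []
    for r in range(max(0, v - half), min(rows, v + half + 1)):
        for c in range(max(0, u - half), min(cols, u + half + 1)):
            if depth[r][c] > 0:            # skip holes in the depth map
                samples.append(depth[r][c])
    return statistics.median(samples) if samples else None

# 5x5 depth map with a hole at the keypoint and one outlier pixel;
# the patch median stays robust to both.
depth = [[2.0] * 5 for _ in range(5)]
depth[2][2] = 0.0   # invalid pixel at the keypoint itself
depth[0][0] = 9.0   # outlier
print(keypoint_distance(depth, 2, 2))  # 2.0
```

A median over a patch is a common choice for keypoint depth because single-pixel lookups are fragile at depth discontinuities such as hand silhouettes.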
If your DECXIN stereo camera appears as a single side-by-side V4L2 stream such as /dev/video0, use the calibration tool below.
- Preview the stereo stream and verify the split layout:
```
python scripts/calibrate_v4l2_stereo.py \
  --mode preview \
  --video-device 0 \
  --video-width 1280 \
  --video-height 480 \
  --preview-frames 5 \
  --save-preview calibration/v4l2_stereo/preview.png
```
- Capture stereo chessboard pairs. Move the board around the full field of view and vary angle and distance. Press `c` only when both left and right views show a valid board overlay, and press `q` when you have enough pairs.
```
python scripts/calibrate_v4l2_stereo.py \
  --mode capture \
  --video-device 0 \
  --video-width 1280 \
  --video-height 480 \
  --board-cols 9 \
  --board-rows 6 \
  --square-size-mm 24 \
  --output-dir calibration/v4l2_stereo
```
- Run stereo calibration from the captured pairs:
```
python scripts/calibrate_v4l2_stereo.py \
  --mode calibrate \
  --board-cols 9 \
  --board-rows 6 \
  --square-size-mm 24 \
  --output-dir calibration/v4l2_stereo
```
Outputs:
- `calibration/v4l2_stereo/K.txt`: flattened rectified left intrinsic matrix plus baseline, compatible with the original demo format.
- `calibration/v4l2_stereo/stereo_calibration.npz`: full stereo calibration and rectification parameters. Use this file with `--calibration-path` for the `decxin` camera backend.
- `calibration/v4l2_stereo/rectified_preview.png`: quick visual check for horizontal epipolar alignment after rectification.
Download from here and put the checkpoints under the folder `weights/` (e.g. `./weights/23-36-37`). The table below compares representative models of varying sizes from our trained family. They are sorted from slowest to fastest, with accuracy descending; runtime is profiled on an NVIDIA RTX 3090 GPU at image size 640x480.
To trade off speed and accuracy, there are two options:
- Try with different checkpoints.
- Tune the config flags (see explanations in the "Run demo" section below).
| Checkpoint | valid_iters | Runtime-Pytorch (ms) | Runtime-TRT (ms) | Peak Memory (MB) |
|---|---|---|---|---|
| 23-36-37 | 8 | 49.4 | 23.4 | 653 |
| 23-36-37 | 4 | 41.1 | 18.4 | 653 |
| 20-26-39 | 8 | 43.6 | 19.4 | 651 |
| 20-26-39 | 4 | 37.5 | 16.4 | 651 |
| 20-30-48 | 8 | 38.4 | 16.6 | 646 |
| 20-30-48 | 4 | 29.3 | 14.0 | 646 |
```
python scripts/run_demo.py --model_dir weights/23-36-37/model_best_bp2_serialize.pth --left_file assets/left.png --right_file assets/right.png --intrinsic_file assets/K.txt --out_dir output/ --remove_invisible 0 --denoise_cloud 1 --scale 1 --get_pc 1 --valid_iters 8 --max_disp 192 --zfar 100
```
| Flag | Meaning |
|---|---|
| `--model_dir` | Path to the trained weights/model file |
| `--left_file` | Path to the left image file |
| `--right_file` | Path to the right image file |
| `--intrinsic_file` | Path to the camera intrinsic matrix and baseline file |
| `--out_dir` | Output directory for saving results |
| `--remove_invisible` | Whether to ignore depth in non-overlapping regions (0: no, 1: yes) |
| `--denoise_cloud` | Whether to apply denoising to the point cloud (0: no, 1: yes) |
| `--scale` | Image scaling factor |
| `--get_pc` | Obtain point cloud output (0: no, 1: yes) |
| `--valid_iters` | Number of refinement updates during the forward pass |
| `--max_disp` | Maximum disparity for volume encoding; 192 should be enough unless you need to sense very near objects (e.g. <0.1 m). Increasing it runs slower and uses more memory. |
| `--zfar` | Maximum depth to include in the point cloud |
Refer to scripts/run_demo.py for a comprehensive list of flags.
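As a worked example of how `--max_disp` bounds the nearest sensable depth: for rectified stereo, depth follows z = fx * baseline / disparity, so the closest measurable depth at a given `--max_disp` is fx * b / max_disp. The focal length and baseline below are illustrative values, not the ones in `assets/K.txt`:

```python
# Depth from disparity for rectified stereo: z = fx * baseline / disparity.
# Hence --max_disp caps how near the model can sense: z_min = fx * b / max_disp.
def depth_from_disparity(fx, baseline_m, disparity_px):
    return fx * baseline_m / disparity_px

fx = 640.0        # illustrative focal length in pixels (assumption)
baseline = 0.05   # illustrative 5 cm baseline in meters (assumption)

z_min = depth_from_disparity(fx, baseline, 192)  # nearest depth at max_disp=192
print(round(z_min, 4))  # 0.1667 (meters)
```

With a wider baseline or longer focal length, z_min grows, which is when raising `--max_disp` above 192 becomes necessary for very near objects.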
Tips:
- The input left and right images should be rectified and undistorted: there should be no fisheye-style lens distortion, and the epipolar lines should be horizontal between the left/right images. If you obtain images from stereo cameras such as ZED, they usually handle this for you.
- Do not swap the left and right images. The left image should really come from the left-side camera (objects will appear farther to the right in the left image than in the right image).
- We recommend using PNG files without lossy compression
- Our method works best on stereo RGB images. However, we have also tested it on monochrome or IR stereo images (e.g. from RealSense D4XX series) and it works well too.
- To get a point cloud for your own data, you need to specify the intrinsics. In the intrinsic file passed via args, the 1st line is the flattened 1x9 intrinsic matrix and the 2nd line is the baseline (distance between the left and right cameras) in meters.
- The model performs better for image widths under 1000 pixels. You can run with a smaller scale, e.g. `--scale 0.5`, to downsize the input image, then upsize the output depth to the size you need with nearest-neighbor interpolation.
- For faster inference, you can reduce the input image resolution, e.g. `--scale 0.5`, and reduce the number of refinement iterations, e.g. `--valid_iters 4`.
- Note that the first run is slower due to compilation; for live use, warm up first and then run in a loop.
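The intrinsic-file layout described in the tips can be parsed as follows. This is a sketch of the stated format (flattened 1x9 matrix on line 1, baseline in meters on line 2), not the repository's actual loader:

```python
# Sketch of parsing the intrinsic file format described in the tips above:
# line 1: flattened 1x9 intrinsic matrix (row-major), line 2: baseline (meters).
def parse_intrinsics(text):
    lines = [l for l in text.strip().splitlines() if l.strip()]
    k = [float(x) for x in lines[0].split()]
    assert len(k) == 9, "expected a flattened 3x3 intrinsic matrix"
    K = [k[0:3], k[3:6], k[6:9]]  # reshape to 3x3
    baseline = float(lines[1])
    return K, baseline

# Illustrative file contents (values are assumptions, not assets/K.txt):
sample = """700.0 0 320.0 0 700.0 240.0 0 0 1
0.05"""
K, baseline = parse_intrinsics(sample)
print(K[0][0], baseline)  # 700.0 0.05
```

The focal length `K[0][0]` and the baseline are exactly the two quantities needed to convert the predicted disparity to metric depth.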
Expect to see results like below:
For TRT, we recommend first setting up the environment in Docker.
```
python scripts/make_onnx.py --model_dir weights/23-36-37/model_best_bp2_serialize.pth --save_path output/ --height 448 --width 640 --valid_iters 8 --max_disp 192
```
| Flag | Meaning |
|---|---|
| `--model_dir` | Path to the trained weights/model file |
| `--save_path` | Directory to save ONNX outputs and zip file |
| `--height` | Input image height; should preferably be divisible by 32. Reducing the image size increases speed. |
| `--width` | Input image width; should preferably be divisible by 32. Reducing the image size increases speed. |
| `--valid_iters` | Number of refinement updates during the forward pass; reduce it for faster speed at some cost in quality |
| `--max_disp` | Maximum disparity for volume encoding; 192 should be enough unless you need to sense very near objects (e.g. <0.1 m). Increasing it runs slower and uses more memory. |
Refer to scripts/make_onnx.py for a comprehensive list of available flags. Since some intermediate operations are not supported by TRT conversion, we split around them into two ONNX files.
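Since the export flags above prefer dimensions divisible by 32, a small helper (illustrative, not part of `scripts/make_onnx.py`) can snap an arbitrary resolution to the nearest valid size:

```python
# Helper to snap an export resolution to a multiple of 32, as suggested for
# --height/--width above. Illustrative sketch, not part of the repository.
def round_to_multiple(x, base=32):
    return max(base, int(round(x / base)) * base)

print(round_to_multiple(480))   # 480 (already divisible by 32)
print(round_to_multiple(450))   # 448
print(round_to_multiple(1000))  # 992
```

Exporting at a snapped size avoids padding inside the network and keeps the ONNX input shape consistent with what the TRT engine expects.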
Then convert from ONNX to TRT as below.
```
trtexec --onnx=output/feature_runner.onnx --saveEngine=output/feature_runner.engine --fp16 --useCudaGraph
trtexec --onnx=output/post_runner.onnx --saveEngine=output/post_runner.engine --fp16 --useCudaGraph
```
To use TRT for inference:
```
python scripts/run_demo_tensorrt.py --onnx_dir output/ --left_file assets/left.png --right_file assets/right.png --intrinsic_file assets/K.txt --out_dir output/ --remove_invisible 0 --denoise_cloud 1 --get_pc 1 --zfar 100
```
Real-world data offers greater diversity and realism than synthetic data. However, obtaining real stereo images with ground-truth metric depth annotations is notoriously difficult. To address this challenge, we propose an automatic data curation pipeline to generate pseudo-labels on internet-scale stereo images from the Stereo4D dataset. Top: pseudo-labeling pipeline on in-the-wild internet stereo data. Bottom: visualization of our generated pseudo-labels.
Below are visualizations of the intermediate results in our pseudo-labeling process. In the rightmost column, a green checkmark or red cross denotes whether a sample is kept for training, based on the percentage of positive pixels in the consistency mask. Our data curation process can automatically discover failures in noisy internet data, such as images containing subtitles (bottom row) or mosaic artifacts (second-to-last row), as well as overly challenging samples that are unsuitable for training (top row). The final pseudo-labels can also correct erroneous predictions from FoundationStereo on sky regions (5th row).
The dataset is available at HuggingFace: https://huggingface.co/datasets/nvidia/ffs_stereo4d
```
@article{wen2026fastfoundationstereo,
  title={{Fast-FoundationStereo}: Real-Time Zero-Shot Stereo Matching},
  author={Bowen Wen and Shaurya Dewan and Stan Birchfield},
  journal={CVPR},
  year={2026}
}
```
Please contact Bowen Wen (bowenw@nvidia.com) for questions and commercial inquiries.
We would like to thank Xutong Ren, Karsten Patzwaldt, Yonggan Fu, Saurav Muralidharan, Han Cai, Pavlo Molchanov, Yu Wang, Varun Praveen, Joseph Aribido and Jun Gao for their insightful early discussions on this project. We would also like to thank the NVIDIA Isaac and TAO teams for their engineering support and valuable discussions. Thanks to the authors of FoundationStereo, Selective-IGEV, Stereo4D and RAFT-Stereo for their code releases. Finally, thanks to the CVPR reviewers and AC for their appreciation of this work and constructive feedback.






