Sicheng Zuo*, Zixun Xie*, Wenzhao Zheng*$\ddagger$, Shaoqing Xu$\dagger$, Fang Li, Shengyin Jiang, Long Chen, Zhi-Xin Yang, Jiwen Lu
* Equal contributions.
DVGT, a universal visual geometry transformer for autonomous driving, directly predicts metric-scaled global 3D point maps from a sequence of unposed multi-view images, eliminating the need for post-alignment with external data.
- [2025/12/19] We have released the paper, inference code, and visualization checkpoints.
DVGT proposes a universal framework for driving geometry perception. Unlike conventional driving models that are tightly coupled to specific sensor setups or require ground-truth poses, our model leverages spatial-temporal attention to process unposed image sequences directly. By decoding global geometry in the ego-coordinate system, DVGT achieves metric-scaled dense reconstruction without LiDAR alignment, offering a robust solution that adapts seamlessly to diverse vehicles and camera configurations.
DVGT significantly outperforms existing models across various scenarios. As shown in the comparison below, our method (red) demonstrates superior accuracy.
First, clone this repository to your local machine and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub). We tested the code with CUDA 12.8, Python 3.11, and torch 2.8.0.
```bash
git clone https://github.com/wzzheng/DVGT.git
cd dvgt
conda create -n dvgt python=3.11
conda activate dvgt
pip install -r requirements.txt
```

Secondly, download the pretrained checkpoint and save it to the `./ckpt` directory.
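If you want to confirm that your environment matches the tested configuration before downloading the checkpoint, a minimal sanity check like the sketch below is enough (it only assumes that PyTorch is installed):

```python
import torch

# Report the installed PyTorch/CUDA versions and the detected GPU.
# The released code was tested with CUDA 12.8 and torch 2.8.0.
print(f"torch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    # bfloat16 autocast (used in the examples below) needs Compute Capability >= 8.0.
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
```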
Now, try the model with just a few lines of code:
```python
import torch
from dvgt.models.dvgt import DVGT
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr

checkpoint_path = 'path to your checkpoint'
device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+)
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

# Initialize the model and load the pretrained weights.
model = DVGT()
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()

# Load and preprocess example images (replace with your own image paths)
image_dir = 'examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir, start_frame=16, end_frame=24).to(device)

with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        # Predict attributes including cameras, depth maps, and point maps.
        predictions = model(images)
```
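The exact contents of `predictions` are defined by the model's prediction heads. Assuming it behaves like a dictionary of tensors (an assumption, not something the example above guarantees), you can inspect what was returned with a few lines:

```python
# Print each predicted attribute and its shape
# (assumes `predictions` is a dict-like container of tensors).
for name, value in predictions.items():
    if torch.is_tensor(value):
        print(f"{name}: shape={tuple(value.shape)}, dtype={value.dtype}")
    else:
        print(f"{name}: {type(value).__name__}")
```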
You can also choose which attributes (branches) to predict, as shown below; this produces the same result as the example above. The example uses a batch size of 1 (a single scene), but it naturally extends to multiple scenes.
```python
import torch
from dvgt.models.dvgt import DVGT
from dvgt.utils.load_fn import load_and_preprocess_images
from iopath.common.file_io import g_pathmgr
from dvgt.utils.pose_enc import pose_encoding_to_ego_pose
from dvgt.utils.geometry import convert_point_in_ego_0_to_ray_depth_in_ego_n

checkpoint_path = 'ckpt/open_ckpt.pt'
device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+)
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

# Initialize the model and load the pretrained weights.
model = DVGT()
with g_pathmgr.open(checkpoint_path, "rb") as f:
    checkpoint = torch.load(f, map_location="cpu")
model.load_state_dict(checkpoint)
model = model.to(device).eval()

# Load and preprocess example images (replace with your own image paths)
image_dir = 'examples/openscene_log-0104-scene-0007'
images = load_and_preprocess_images(image_dir, start_frame=16, end_frame=23).to(device)

with torch.no_grad():
    with torch.amp.autocast(device, dtype=dtype):
        aggregated_tokens_list, ps_idx = model.aggregator(images)

        # Predict the ego_n-to-ego_0 pose encoding
        pose_enc = model.ego_pose_head(aggregated_tokens_list)[-1]
        # Ego pose following the OpenCV convention, relative to the ego frame of the first time step.
        ego_n_to_ego_0 = pose_encoding_to_ego_pose(pose_enc)

        # Predict point maps in the ego frame of the first time step
        point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)

        # The predicted ray depth maps originate from each ego vehicle's position in its corresponding frame.
        ray_depth_in_ego_n = convert_point_in_ego_0_to_ray_depth_in_ego_n(point_map, ego_n_to_ego_0)
```
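If you would rather dump the reconstruction to disk than use the Viser demo described below, a minimal sketch is shown here. It assumes only that `point_map` from the example above stores XYZ coordinates in its last dimension and that `point_conf` holds a matching per-point confidence; the 0.5 threshold and the output file name are arbitrary choices:

```python
import numpy as np

# Flatten the predicted points and keep only confident ones (threshold chosen arbitrarily).
points = point_map.float().cpu().numpy().reshape(-1, 3)
conf = point_conf.float().cpu().numpy().reshape(-1)
points = points[conf > 0.5]

# Write an ASCII PLY file that any standard point-cloud viewer can open.
with open("reconstruction.ply", "w") as f:
    f.write("ply\nformat ascii 1.0\n")
    f.write(f"element vertex {len(points)}\n")
    f.write("property float x\nproperty float y\nproperty float z\n")
    f.write("end_header\n")
    np.savetxt(f, points, fmt="%.4f")
```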
Run the following command to perform reconstruction and visualize the point clouds in Viser. This script requires a path to an image folder formatted as follows:

```
data_dir/
├── frame_0/   (contains view images, e.g., CAM_F.jpg, CAM_B.jpg...)
├── frame_1/
└── ...
```

Note on Data Requirements:
- Consistency: The data must be sampled at 2Hz. All frames must contain the same number of views arranged in a fixed order.
- Capacity: Inference supports up to 24 frames with an arbitrary number of views per frame (a quick sanity-check sketch follows this list).
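The sanity check mentioned above can be as simple as the sketch below. It only inspects the folder layout shown earlier; the 2Hz sampling rate cannot be verified from file names and still has to be guaranteed by your data preparation (the helper name and the example path are purely illustrative):

```python
import os

def check_image_folder(data_dir, max_frames=24):
    # Frame folders are expected to be named frame_0, frame_1, ...
    frames = sorted(
        (d for d in os.listdir(data_dir) if d.startswith("frame_")),
        key=lambda d: int(d.split("_")[1]),
    )
    assert 0 < len(frames) <= max_frames, f"expected 1-{max_frames} frames, got {len(frames)}"

    # Every frame must contain the same views (e.g., CAM_F.jpg, CAM_B.jpg, ...) in a fixed order.
    reference_views = sorted(os.listdir(os.path.join(data_dir, frames[0])))
    for frame in frames:
        views = sorted(os.listdir(os.path.join(data_dir, frame)))
        assert views == reference_views, f"{frame} has views {views}, expected {reference_views}"
    print(f"OK: {len(frames)} frames x {len(reference_views)} views")

check_image_folder("examples/openscene_log-0104-scene-0007")
```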
```bash
python demo_viser.py --image_folder examples/openscene_log-0104-scene-0007
```

- Paper, inference code, and pre-trained weights (for visualization).
- Training suite: includes training code, evaluation scripts, and the data preparation pipeline.
- Dataset release: comprehensive datasets for training and testing.
Our code is based on the following brilliant repositories:
Moge-2, CUT3R, Driv3R, VGGT, MapAnything, Pi3
Many thanks to these authors!
If you find this project helpful, please consider citing the following paper:
```bibtex
@article{zuo2025dvgt,
  title={DVGT: Driving Visual Geometry Transformer},
  author={Zuo, Sicheng and Xie, Zixun and Zheng, Wenzhao and Xu, Shaoqing and Li, Fang and Jiang, Shengyin and Chen, Long and Yang, Zhi-Xin and Lu, Jiwen},
  journal={arXiv preprint arXiv:2512.16919},
  year={2025}
}
```

