VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

⭐ If our project helps you, please give us a star on GitHub to support us!

(Figure: VLA-JEPA overview)


🚧 TODO

  • Partial training code
  • LIBERO evaluation code
  • LIBERO-Plus evaluation code
  • SimplerEnv evaluation code
  • Training codes for custom datasets

βš™οΈ Environment Setup

```shell
# Clone the repository
git clone https://github.com/ginwind/VLA-JEPA
cd VLA-JEPA

# Create conda environment
conda create -n VLA_JEPA python=3.10 -y
conda activate VLA_JEPA

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install project
pip install -e .
```

This repository's codebase builds on starVLA.

πŸ”₯ Training

0️⃣ Pretrained Model Preparation

Download the Qwen3-VL-2B and the V-JEPA2 encoder.

1️⃣ Data Preparation

Download the required training datasets. The robot datasets referenced in this repository include LIBERO, BridgeV2, Fractal, and Droid.

For robot datasets, you need to add a modality.json file under the meta/ subdirectory of each LeRobot dataset. The modality.json files for LIBERO, BridgeV2, Fractal, and Droid are provided under ./examples (BridgeV2 and Fractal are under ./examples/SimplerEnv).
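As a rough illustration of what such a file might contain, the sketch below writes a hypothetical modality.json under a dataset's meta/ subdirectory. The key names and index ranges here are assumptions modeled on similar VLA codebases; in practice, copy the actual files provided under ./examples.

```python
import json
from pathlib import Path

# Hypothetical modality.json for a LeRobot dataset. The exact schema used by
# VLA-JEPA should be taken from the files under ./examples; the keys and
# index ranges below are illustrative only.
modality = {
    "state": {
        "eef_position": {"start": 0, "end": 3},
        "gripper": {"start": 3, "end": 4},
    },
    "action": {
        "delta_eef": {"start": 0, "end": 6},
        "gripper": {"start": 6, "end": 7},
    },
    "video": {
        "image": {"original_key": "observation.images.image"},
    },
}

# Place the file under the dataset's meta/ subdirectory.
meta_dir = Path("my_lerobot_dataset/meta")
meta_dir.mkdir(parents=True, exist_ok=True)
(meta_dir / "modality.json").write_text(json.dumps(modality, indent=2))
```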

2️⃣ Start Training

Depending on whether you are conducting pre-training or post-training, select the appropriate training script and YAML configuration file from the /scripts directory.

Ensure the following configurations are updated in the YAML file:

  • framework.qwenvl.basevlm and framework.vj2_model.base_encoder should be set to the paths of your respective checkpoints.
  • Update datasets.vla_data.data_root_dir, datasets.video_data.video_dir, and datasets.video_data.text_file to match the paths of your datasets.

Once the configurations are updated, you can proceed to start the training process.
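Assuming the dotted key names above correspond to nested YAML mappings (an assumption; check the actual config files under /scripts for the exact structure), the relevant section of a training config might look like:

```yaml
framework:
  qwenvl:
    basevlm: /path/to/Qwen3-VL-2B            # local Qwen3-VL-2B checkpoint
  vj2_model:
    base_encoder: /path/to/vjepa2-encoder    # local V-JEPA2 encoder checkpoint

datasets:
  vla_data:
    data_root_dir: /path/to/lerobot_datasets # robot data root
  video_data:
    video_dir: /path/to/videos               # human video files
    text_file: /path/to/captions.csv         # headerless (index, caption) CSV
```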

3️⃣ Optional: Custom Dataset Training

VLA-JEPA supports training on both robot datasets and human video datasets. You can run custom training by specifying robot data and/or human videos in your configuration.

  • Robot Data: We support training with datasets in the LeRobot v2.1 format. Convert your custom robot dataset to LeRobot v2.1 first.

    • Define a custom robot dataset config class in data_config.py (its video-key fields should match the values predefined in your modality.json), and add a mapping from robot_type to the config class in ROBOT_TYPE_CONFIG_MAP.
    • robot_type is specified by DATASET_NAMED_MIXTURES in mixtures.py: the dict key corresponds to datasets.vla_data.data_mix in the YAML training config, and the value is a tuple of sub-datasets. Each sub-dataset tuple contains three items: subdirectory, version, and robot_type. The robot_type selects the corresponding config for state/action normalization and other field alignment.
    • Finally, update the YAML config accordingly and launch training.
  • Human Video: You can implement your own DataLoader and update the mapping from dataset_py to a dataloader in build_dataloader within ./starVLA/dataloader/__init__.py. Alternatively, use our video dataloader and configure datasets.video_data in the YAML file:

    • dataset_py: use our video dataloader (no change needed)
    • video_dir: directory that contains video files; each file is named by its index, and the suffix is controlled by extensions
    • text_file: a headerless CSV where the first column is index and the second column is the video text description
    • CoT_prompt: prompt template for latent-action training (no change needed)
    • extensions: list of video file extensions
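As a concrete illustration of the datasets.video_data layout described above, the sketch below builds a toy video directory and caption file and pairs them by index. All file and directory names here are hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical video_data layout: videos named by integer index, plus a
# headerless CSV whose first column is the index and second the description.
video_dir = Path("human_videos")
video_dir.mkdir(exist_ok=True)
for idx in range(3):
    (video_dir / f"{idx}.mp4").touch()  # placeholder files; real data are videos

with open("captions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["0", "a person opens a drawer"])
    writer.writerow(["1", "a person pours water into a cup"])
    writer.writerow(["2", "a person folds a towel"])

# A dataloader can then pair each video with its caption by index.
with open("captions.csv", newline="") as f:
    captions = {int(row[0]): row[1] for row in csv.reader(f)}
print(captions[1])
```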

πŸ“Š Evaluation

Download the model checkpoints from Hugging Face: https://huggingface.co/ginwind/VLA-JEPA

Environment: Install the required Python packages into your VLA-JEPA environment:

```shell
pip install tyro matplotlib mediapy websockets msgpack
pip install numpy==1.24.4
```

LIBERO

  • LIBERO setup: Prepare the LIBERO benchmark in a separate conda environment following the official LIBERO instructions: https://github.com/Lifelong-Robot-Learning/LIBERO

  • Configuration: In the downloaded checkpoint folder, update config.json and config.yaml to point the following fields to your local checkpoints:

    • framework.qwenvl.basevlm: path to the Qwen3-VL-2B checkpoint
    • framework.vj2_model.base_encoder: path to the V-JEPA encoder checkpoint
  • Evaluation script: Edit examples/LIBERO/eval_libero.sh and set the LIBERO_HOME environment variable (line 4) to your local LIBERO code path, and set the sim_python variable (line 9) to the Python executable of the LIBERO conda environment. Finally, set the your_ckpt variable (line 11) to the path of the downloaded LIBERO/checkpoints/VLA-JEPA-LIBERO.pt.

  • Run evaluation: Launch the evaluation (the script runs the four task suites in parallel across 4 GPUs):

```shell
bash ./examples/LIBERO/eval_libero.sh
```

LIBERO-Plus

  • LIBERO-Plus setup: Clone the LIBERO-Plus repository: https://github.com/sylvestf/LIBERO-plus. In ./examples/LIBERO-Plus/libero_plus_init.py, update line 121 to point to your LIBERO-Plus/libero/libero/benchmark/task_classification.json. Replace the original LIBERO-Plus/libero/libero/benchmark/__init__.py with the provided modified implementation (see ./examples/LIBERO-Plus/libero_plus_init.py) to enable evaluation over perturbation dimensions. Finally, follow the official LIBERO-Plus installation instructions and build the benchmark in a separate conda environment.

  • Configuration: In the downloaded checkpoint folder, update config.json and config.yaml to point the following fields to your local checkpoints:

    • framework.qwenvl.basevlm: path to the Qwen3-VL-2B checkpoint
    • framework.vj2_model.base_encoder: path to the V-JEPA encoder checkpoint
  • Evaluation script: Edit examples/LIBERO-Plus/eval_libero_plus.sh and set the LIBERO_HOME environment variable (line 4) to your local LIBERO-Plus code path, and set the sim_python variable (line 9) to the Python executable of the LIBERO-Plus conda environment. Finally, set the your_ckpt variable (line 11) to the path of the downloaded LIBERO/checkpoints/VLA-JEPA-LIBERO.pt.

  • Run evaluation: Launch the evaluation (the script runs the seven perturbation dimensions in parallel across 7 GPUs):

```shell
bash ./examples/LIBERO-Plus/eval_libero_plus.sh
```

SimplerEnv

  • SimplerEnv setup: Clone the SimplerEnv repository: https://github.com/simpler-env/SimplerEnv and follow the official SimplerEnv installation instructions and build the benchmark in a separate conda environment.

  • Configuration: In the downloaded checkpoint folder, update config.json and config.yaml to point the following fields to your local checkpoints:

    • framework.qwenvl.basevlm: path to the Qwen3-VL-2B checkpoint
    • framework.vj2_model.base_encoder: path to the V-JEPA encoder checkpoint
  • Evaluation script: Edit examples/SimplerEnv/eval_files/auto_eval_scripts/batch_evaluate.sh and set the SimplerEnv_PATH environment variable to your local SimplerEnv code path, and set the sim_python variable to the Python executable of the SimplerEnv conda environment. Finally, set the MODEL_PATH variable to the path of the downloaded SimplerEnv/checkpoints/VLA-JEPA-Simpler.pt.

  • Run evaluation: Launch the evaluation:

```shell
# Run the evaluation
bash examples/SimplerEnv/eval_files/auto_eval_scripts/batch_evaluate.sh

# Compute success rates.
# <task_suite> must be one of: pick_coke_can | move_near | drawer | long_horizon_apple_in_drawer | bridge_put_on.
# Note: bridge_put_on corresponds to the WidowX robot evaluation; the other four correspond to the Google Robot evaluation.
# <model_path> is the path to `VLA-JEPA-Simpler.pt`, and <log_dir> is the root directory that contains the generated videos
# (by default, saved under `./results` within the evaluation output directory).
bash ./examples/SimplerEnv/eval_files/auto_eval_scripts/calc_success_rate.sh <task_suite> <model_path> <log_dir>
```

Notes: Ensure each process has access to a GPU and verify that all checkpoint paths in the configuration files are correct before running the evaluation. For LIBERO, we evaluate the 4 task suites in parallel on 4 GPUs. For LIBERO-Plus and SimplerEnv, we run evaluations in parallel on 8 GPUs. If you have fewer GPUs available, modify the parallelization logic in the launch scripts accordingly.

🀝 Acknowledgement

We extend our sincere gratitude to the starVLA project and the V-JEPA2 project for their invaluable open-source contributions.

πŸ“ Citation

If you find our code or models useful in your work, please cite our paper:

```bibtex
@misc{vlajepa2026,
  title={VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model},
  author={Jingwen Sun and Wenyao Zhang and Zekun Qi and Shaojie Ren and Zezhi Liu and Hanxin Zhu and Guangzhong Sun and Xin Jin and Zhibo Chen},
  year={2026},
  eprint={2602.10098},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.10098},
}
```
