- Table of Contents
- TODO
- Environment Setup
- Training
- Evaluation
- Acknowledgement
- Citation
- Partial training code
- LIBERO evaluation code
- LIBERO-Plus evaluation code
- SimplerEnv evaluation code
Training code for custom datasets
```shell
# Clone the repository
git clone https://github.com/ginwind/VLA-JEPA
cd VLA-JEPA

# Create conda environment
conda create -n VLA_JEPA python=3.10 -y
conda activate VLA_JEPA

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install project
pip install -e .
```

This repository's code is based on starVLA.
Download the Qwen3-VL-2B and the V-JEPA2 encoder.
Download the following datasets:
For robot datasets, you need to add a modality.json file under the meta/ subdirectory of each LeRobot dataset. The modality.json files for LIBERO, BridgeV2, Fractal, and Droid are provided under ./examples (BridgeV2 and Fractal are under ./examples/SimplerEnv).
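As a rough illustration only — the authoritative schema is the `modality.json` files shipped under `./examples`, so treat every key and field name below as a placeholder — a `modality.json` maps modalities to fields of the LeRobot dataset:

```json
{
  "state":  { "joint_positions": { "start": 0, "end": 7 } },
  "action": { "joint_positions": { "start": 0, "end": 7 } },
  "video":  { "main_view": { "original_key": "observation.images.main" } }
}
```

Copy the example file that matches your benchmark and adapt the field ranges to your dataset rather than writing one from scratch.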
Depending on whether you are conducting pre-training or post-training, select the appropriate training script and YAML configuration file from the /scripts directory.
Ensure the following configurations are updated in the YAML file:
- `framework.qwenvl.basevlm` and `framework.vj2_model.base_encoder` should be set to the paths of your respective checkpoints.
- Update `datasets.vla_data.data_root_dir`, `datasets.video_data.video_dir`, and `datasets.video_data.text_file` to match the paths of your datasets.
Once the configurations are updated, you can proceed to start the training process.
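For reference, the fields above sit in the YAML roughly as follows (all paths are placeholders and the exact nesting may differ slightly in your config file):

```yaml
framework:
  qwenvl:
    basevlm: /path/to/Qwen3-VL-2B            # downloaded Qwen3-VL-2B checkpoint
  vj2_model:
    base_encoder: /path/to/vjepa2-encoder    # downloaded V-JEPA2 encoder
datasets:
  vla_data:
    data_root_dir: /path/to/lerobot_datasets
  video_data:
    video_dir: /path/to/human_videos
    text_file: /path/to/human_videos/captions.csv
```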
VLA-JEPA supports training on both robot datasets and human video datasets. You can run custom training by specifying robot data and/or human videos in your configuration.
- Robot Data: We support training with datasets in the LeRobot v2.1 format. Convert your custom robot dataset to LeRobot v2.1 first.
  - Define a custom robot dataset config class in `data_config.py` (its video-key fields should match the values predefined in `modality.json`), and add a mapping from `robot_type` to the config class in `ROBOT_TYPE_CONFIG_MAP`.
  - `robot_type` is specified by `DATASET_NAMED_MIXTURES` in `mixtures.py`: the dict key corresponds to `datasets.vla_data.data_mix` in the YAML training config, and the value is a tuple of sub-datasets. Each sub-dataset tuple contains three items: subdirectory, version, and `robot_type`. The `robot_type` selects the corresponding config for state/action normalization and other field alignment.
  - Finally, update the YAML config accordingly and launch training.
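For illustration, a `DATASET_NAMED_MIXTURES` entry following the three-item tuple convention above might look like this (all names are hypothetical placeholders):

```python
# Hypothetical mixtures.py entry. The dict key ("my_mix") is what
# datasets.vla_data.data_mix refers to in the YAML config; each sub-dataset
# is a (subdirectory, version, robot_type) tuple, and robot_type must have a
# matching entry in ROBOT_TYPE_CONFIG_MAP in data_config.py.
DATASET_NAMED_MIXTURES = {
    "my_mix": (
        ("my_robot_task_a", "v2.1", "my_robot"),
        ("my_robot_task_b", "v2.1", "my_robot"),
    ),
}
```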
- Human Video: You can implement your own DataLoader and update the mapping from `dataset_py` to a dataloader in `build_dataloader` within `./starVLA/dataloader/__init__.py`. Alternatively, use our video dataloader and configure `datasets.video_data` in the YAML file:
  - `dataset_py`: use our video dataloader (no change needed)
  - `video_dir`: directory that contains video files; each file is named by its `index`, and the suffix is controlled by `extensions`
  - `text_file`: a headerless CSV where the first column is `index` and the second column is the video text description
  - `CoT_prompt`: prompt template for latent-action training (no change needed)
  - `extensions`: list of video file extensions
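Putting those fields together, a `datasets.video_data` block might look like the sketch below; the directory and file names are placeholders, and `dataset_py` / `CoT_prompt` keep whatever defaults ship with the repo:

```yaml
datasets:
  video_data:
    video_dir: /path/to/human_videos   # contains 0.mp4, 1.mp4, ...
    text_file: /path/to/captions.csv   # headerless CSV rows: <index>,<description>
    extensions: ["mp4"]                # suffixes to match in video_dir
    # dataset_py and CoT_prompt: leave at their provided defaults
```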
Download the model checkpoints from Hugging Face: https://huggingface.co/ginwind/VLA-JEPA
- Environment: Install the required Python packages into your VLA-JEPA environment:

  ```shell
  pip install tyro matplotlib mediapy websockets msgpack
  pip install numpy==1.24.4
  ```

- LIBERO setup: Prepare the LIBERO benchmark in a separate conda environment following the official LIBERO instructions: https://github.com/Lifelong-Robot-Learning/LIBERO
- Configuration: In the downloaded checkpoint folder, update `config.json` and `config.yaml` to point the following fields to your local checkpoints:
  - `framework.qwenvl.basevlm`: path to the Qwen3-VL-2B checkpoint
  - `framework.vj2_model.base_encoder`: path to the V-JEPA2 encoder checkpoint
- Evaluation script: Edit `examples/LIBERO/eval_libero.sh`: set the `LIBERO_HOME` environment variable (line 4) to your local LIBERO code path, set the `sim_python` variable (line 9) to the Python executable of the LIBERO conda environment, and set the `your_ckpt` variable (line 11) to the path of the downloaded `LIBERO/checkpoints/VLA-JEPA-LIBERO.pt`.
- Run evaluation: Launch the evaluation (the script runs the four task suites in parallel across 4 GPUs):

  ```shell
  bash ./examples/LIBERO/eval_libero.sh
  ```
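As a rough guide, the three variables edited above might end up looking like this inside `eval_libero.sh` (all paths below are placeholders, not the script's actual defaults):

```shell
# Placeholder values; substitute your own local paths.
export LIBERO_HOME=/path/to/LIBERO                        # line 4: LIBERO code path
sim_python=/path/to/conda/envs/libero/bin/python          # line 9: LIBERO env's Python
your_ckpt=/path/to/LIBERO/checkpoints/VLA-JEPA-LIBERO.pt  # line 11: downloaded checkpoint
```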
- LIBERO-Plus setup: Clone the LIBERO-Plus repository: https://github.com/sylvestf/LIBERO-plus. In `./examples/LIBERO-Plus/libero_plus_init.py`, update line 121 to point to your `LIBERO-Plus/libero/libero/benchmark/task_classification.json`. Replace the original `LIBERO-Plus/libero/libero/benchmark/__init__.py` with the provided modified implementation (see `./examples/LIBERO-Plus/libero_plus_init.py`) to enable evaluation over perturbation dimensions. Finally, follow the official LIBERO-Plus installation instructions and build the benchmark in a separate conda environment.
- Configuration: In the downloaded checkpoint folder, update `config.json` and `config.yaml` to point the following fields to your local checkpoints:
  - `framework.qwenvl.basevlm`: path to the Qwen3-VL-2B checkpoint
  - `framework.vj2_model.base_encoder`: path to the V-JEPA2 encoder checkpoint
- Evaluation script: Edit `examples/LIBERO-Plus/eval_libero_plus.sh`: set the `LIBERO_HOME` environment variable (line 4) to your local LIBERO-Plus code path, set the `sim_python` variable (line 9) to the Python executable of the LIBERO-Plus conda environment, and set the `your_ckpt` variable (line 11) to the path of the downloaded `LIBERO/checkpoints/VLA-JEPA-LIBERO.pt`.
- Run evaluation: Launch the evaluation (the script runs the seven perturbation dimensions in parallel across 7 GPUs):

  ```shell
  bash ./examples/LIBERO-Plus/eval_libero_plus.sh
  ```

- SimplerEnv setup: Clone the SimplerEnv repository (https://github.com/simpler-env/SimplerEnv), then follow the official SimplerEnv installation instructions and build the benchmark in a separate conda environment.
- Configuration: In the downloaded checkpoint folder, update `config.json` and `config.yaml` to point the following fields to your local checkpoints:
  - `framework.qwenvl.basevlm`: path to the Qwen3-VL-2B checkpoint
  - `framework.vj2_model.base_encoder`: path to the V-JEPA2 encoder checkpoint
- Evaluation script: Edit `examples/SimplerEnv/eval_files/auto_eval_scripts/batch_evaluate.sh`: set the `SimplerEnv_PATH` environment variable to your local SimplerEnv code path, set the `sim_python` variable to the Python executable of the SimplerEnv conda environment, and set the `MODEL_PATH` variable to the path of the downloaded `SimplerEnv/checkpoints/VLA-JEPA-Simpler.pt`.
- Run evaluation: Launch the evaluation:

  ```shell
  bash examples/SimplerEnv/eval_files/auto_eval_scripts/batch_evaluate.sh
  ```

- Compute success rates: After the previous step, SimplerEnv will generate evaluation rollout videos for each sub-task. You can then compute task success rates with `examples/SimplerEnv/eval_files/auto_eval_scripts/calc_success_rate.sh` as follows:

  ```shell
  # <task_suite> must be one of: pick_coke_can | move_near | drawer | long_horizon_apple_in_drawer | bridge_put_on.
  # Note: bridge_put_on corresponds to the WidowX robot evaluation; the other four correspond to the Google Robot evaluation.
  # <model_path> is the path to VLA-JEPA-Simpler.pt, and <log_dir> is the root directory that contains the generated videos
  # (by default, this is saved under ./results within the evaluation output directory).
  bash ./examples/SimplerEnv/eval_files/auto_eval_scripts/calc_success_rate.sh <task_suite> <model_path> <log_dir>
  ```

Notes: Ensure each process has access to a GPU, and verify that all checkpoint paths in the configuration files are correct before running the evaluation. For LIBERO, we evaluate the 4 task suites in parallel on 4 GPUs. For LIBERO-Plus and SimplerEnv, we run evaluations in parallel on 8 GPUs. If you have fewer GPUs available, modify the parallelization logic in the launch scripts accordingly.
We extend our sincere gratitude to the starVLA project and the V-JEPA2 project for their invaluable open-source contributions.
If you find our code or models useful in your work, please cite our paper:
```bibtex
@misc{vlajepa2026,
  title={VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model},
  author={Jingwen Sun and Wenyao Zhang and Zekun Qi and Shaojie Ren and Zezhi Liu and Hanxin Zhu and Guangzhong Sun and Xin Jin and Zhibo Chen},
  year={2026},
  eprint={2602.10098},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.10098},
}
```