- Table of Contents
- TODO
- Environment Setup
- Training
- Evaluation
- Acknowledgement
- Citation
- Partial training code
- LIBERO evaluation code
- LIBERO-Plus evaluation code
- SimplerEnv evaluation code
Training code for custom datasets
```shell
# Clone the repository
git clone https://github.com/ginwind/VLA-JEPA
cd VLA-JEPA

# Create conda environment
conda create -n VLA_JEPA python=3.10 -y
conda activate VLA_JEPA

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install project
pip install -e .
```

This repository's code is based on starVLA.
Download the Qwen3-VL-2B and the V-JEPA2 encoder.
Download the following datasets:
For robot datasets, you need to add a modality.json file under the meta/ subdirectory of each LeRobot dataset. The modality.json files for LIBERO, BridgeV2, Fractal, and Droid are provided under ./examples (BridgeV2 and Fractal are under ./examples/SimplerEnv).
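As a rough illustration only — the authoritative schema is the `modality.json` files shipped under `./examples`, so treat every key and field name below as a placeholder — a `modality.json` maps modalities to fields of the LeRobot dataset:

```json
{
  "state":  { "joint_positions": { "start": 0, "end": 7 } },
  "action": { "joint_positions": { "start": 0, "end": 7 } },
  "video":  { "main_view": { "original_key": "observation.images.main" } }
}
```

Copy the example file that matches your benchmark and adapt the field ranges to your dataset rather than writing one from scratch.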
Depending on whether you are conducting pre-training or post-training, select the appropriate training script and YAML configuration file from the /scripts directory.
Ensure the following configurations are updated in the YAML file:
- `framework.qwenvl.basevlm` and `framework.vj2_model.base_encoder` should be set to the paths of your respective checkpoints.
- Update `datasets.vla_data.data_root_dir`, `datasets.video_data.video_dir`, and `datasets.video_data.text_file` to match the paths of your datasets.
Once the configurations are updated, you can proceed to start the training process.
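For reference, the fields above sit in the YAML roughly as follows (all paths are placeholders and the exact nesting may differ slightly in your config file):

```yaml
framework:
  qwenvl:
    basevlm: /path/to/Qwen3-VL-2B            # downloaded Qwen3-VL-2B checkpoint
  vj2_model:
    base_encoder: /path/to/vjepa2-encoder    # downloaded V-JEPA2 encoder
datasets:
  vla_data:
    data_root_dir: /path/to/lerobot_datasets
  video_data:
    video_dir: /path/to/human_videos
    text_file: /path/to/human_videos/captions.csv
```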
VLA-JEPA supports training on both robot datasets and human video datasets. You can run custom training by specifying robot data and/or human videos in your configuration.
- Robot Data: We support training with datasets in the LeRobot v2.1 format. Convert your custom robot dataset to LeRobot v2.1 first.
  - Define a custom robot dataset config class in `data_config.py` (its video-key fields should match the values predefined in `modality.json`), and add a mapping from `robot_type` to the config class in `ROBOT_TYPE_CONFIG_MAP`.
  - `robot_type` is specified by `DATASET_NAMED_MIXTURES` in `mixtures.py`: the dict key corresponds to `datasets.vla_data.data_mix` in the YAML training config, and the value is a tuple of sub-datasets. Each sub-dataset tuple contains three items: subdirectory, version, and `robot_type`. The `robot_type` selects the corresponding config for state/action normalization and other field alignment.
  - Finally, update the YAML config accordingly and launch training.
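For illustration, a `DATASET_NAMED_MIXTURES` entry following the three-item tuple convention above might look like this (all names are hypothetical placeholders):

```python
# Hypothetical mixtures.py entry. The dict key ("my_mix") is what
# datasets.vla_data.data_mix refers to in the YAML config; each sub-dataset
# is a (subdirectory, version, robot_type) tuple, and robot_type must have a
# matching entry in ROBOT_TYPE_CONFIG_MAP in data_config.py.
DATASET_NAMED_MIXTURES = {
    "my_mix": (
        ("my_robot_task_a", "v2.1", "my_robot"),
        ("my_robot_task_b", "v2.1", "my_robot"),
    ),
}
```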
- Human Video: You can implement your own DataLoader and update the mapping from `dataset_py` to a dataloader in `build_dataloader` within `./starVLA/dataloader/__init__.py`. Alternatively, use our video dataloader and configure `datasets.video_data` in the YAML file:
  - `dataset_py`: use our video dataloader (no change needed)
  - `video_dir`: directory that contains video files; each file is named by its `index`, and the suffix is controlled by `extensions`
  - `text_file`: a headerless CSV where the first column is `index` and the second column is the video text description
  - `CoT_prompt`: prompt template for latent-action training (no change needed)
  - `extensions`: list of video file extensions
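Putting those fields together, a `datasets.video_data` block might look like the sketch below; the directory and file names are placeholders, and `dataset_py` / `CoT_prompt` keep whatever defaults ship with the repo:

```yaml
datasets:
  video_data:
    video_dir: /path/to/human_videos   # contains 0.mp4, 1.mp4, ...
    text_file: /path/to/captions.csv   # headerless CSV rows: <index>,<description>
    extensions: ["mp4"]                # suffixes to match in video_dir
    # dataset_py and CoT_prompt: leave at their provided defaults
```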
Download the model checkpoints from Hugging Face: https://huggingface.co/ginwind/VLA-JEPA
- Environment: Install the required Python packages into your VLA-JEPA environment:

  ```shell
  pip install tyro matplotlib mediapy websockets msgpack
  pip install numpy==1.24.4
  ```

- LIBERO setup: Prepare the LIBERO benchmark in a separate conda environment following the official LIBERO instructions: https://github.com/Lifelong-Robot-Learning/LIBERO
- Configuration: In the downloaded checkpoint folder, update `config.json` and `config.yaml` to point the following fields to your local checkpoints:
  - `framework.qwenvl.basevlm`: path to the Qwen3-VL-2B checkpoint
  - `framework.vj2_model.base_encoder`: path to the V-JEPA2 encoder checkpoint
- Evaluation script: Edit `examples/LIBERO/eval_libero.sh`: set the `LIBERO_HOME` environment variable (line 4) to your local LIBERO code path, set the `sim_python` variable (line 9) to the Python executable of the LIBERO conda environment, and set the `your_ckpt` variable (line 11) to the path of the downloaded `LIBERO/checkpoints/VLA-JEPA-LIBERO.pt`.
- Run evaluation: Launch the evaluation (the script runs the four task suites in parallel across 4 GPUs):

  ```shell
  bash ./examples/LIBERO/eval_libero.sh
  ```
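As a rough guide, the three variables edited above might end up looking like this inside `eval_libero.sh` (all paths below are placeholders, not the script's actual defaults):

```shell
# Placeholder values; substitute your own local paths.
export LIBERO_HOME=/path/to/LIBERO                        # line 4: LIBERO code path
sim_python=/path/to/conda/envs/libero/bin/python          # line 9: LIBERO env's Python
your_ckpt=/path/to/LIBERO/checkpoints/VLA-JEPA-LIBERO.pt  # line 11: downloaded checkpoint
```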
- LIBERO-Plus setup: Clone the LIBERO-Plus repository: https://github.com/sylvestf/LIBERO-plus. In `./examples/LIBERO-Plus/libero_plus_init.py`, update line 121 to point to your `LIBERO-Plus/libero/libero/benchmark/task_classification.json`. Replace the original `LIBERO-Plus/libero/libero/benchmark/__init__.py` with the provided modified implementation (see `./examples/LIBERO-Plus/libero_plus_init.py`) to enable evaluation over perturbation dimensions. Finally, follow the official LIBERO-Plus installation instructions and build the benchmark in a separate conda environment.
- Configuration: In the downloaded checkpoint folder, update `config.json` and `config.yaml` to point the following fields to your local checkpoints:
  - `framework.qwenvl.basevlm`: path to the Qwen3-VL-2B checkpoint
  - `framework.vj2_model.base_encoder`: path to the V-JEPA2 encoder checkpoint
- Evaluation script: Edit `examples/LIBERO-Plus/eval_libero_plus.sh`: set the `LIBERO_HOME` environment variable (line 4) to your local LIBERO-Plus code path, set the `sim_python` variable (line 9) to the Python executable of the LIBERO-Plus conda environment, and set the `your_ckpt` variable (line 11) to the path of the downloaded `LIBERO/checkpoints/VLA-JEPA-LIBERO.pt`.
- Run evaluation: Launch the evaluation (the script runs the seven perturbation dimensions in parallel across 7 GPUs):

  ```shell
  bash ./examples/LIBERO-Plus/eval_libero_plus.sh
  ```

- SimplerEnv setup: Clone the SimplerEnv repository (https://github.com/simpler-env/SimplerEnv), then follow the official SimplerEnv installation instructions and build the benchmark in a separate conda environment.
- Configuration: In the downloaded checkpoint folder, update `config.json` and `config.yaml` to point the following fields to your local checkpoints:
  - `framework.qwenvl.basevlm`: path to the Qwen3-VL-2B checkpoint
  - `framework.vj2_model.base_encoder`: path to the V-JEPA2 encoder checkpoint
- Evaluation script: Edit `examples/SimplerEnv/eval_files/auto_eval_scripts/batch_evaluate.sh`: set the `SimplerEnv_PATH` environment variable to your local SimplerEnv code path, set the `sim_python` variable to the Python executable of the SimplerEnv conda environment, and set the `MODEL_PATH` variable to the path of the downloaded `SimplerEnv/checkpoints/VLA-JEPA-Simpler.pt`.
- Run evaluation: Launch the evaluation:

  ```shell
  bash examples/SimplerEnv/eval_files/auto_eval_scripts/batch_evaluate.sh
  ```

- Compute success rates: After the previous step, SimplerEnv will generate evaluation rollout videos for each sub-task. You can then compute task success rates with `examples/SimplerEnv/eval_files/auto_eval_scripts/calc_success_rate.sh` as follows:

  ```shell
  # <task_suite> must be one of: pick_coke_can | move_near | drawer | long_horizon_apple_in_drawer | bridge_put_on.
  # Note: bridge_put_on corresponds to the WidowX robot evaluation; the other four correspond to the Google Robot evaluation.
  # <model_path> is the path to VLA-JEPA-Simpler.pt, and <log_dir> is the root directory that contains the generated videos
  # (by default, this is saved under ./results within the evaluation output directory).
  bash ./examples/SimplerEnv/eval_files/auto_eval_scripts/calc_success_rate.sh <task_suite> <model_path> <log_dir>
  ```

Notes: Ensure each process has access to a GPU, and verify that all checkpoint paths in the configuration files are correct before running the evaluation. For LIBERO, we evaluate the 4 task suites in parallel on 4 GPUs. For LIBERO-Plus and SimplerEnv, we run evaluations in parallel on 8 GPUs. If you have fewer GPUs available, modify the parallelization logic in the launch scripts accordingly.
We extend our sincere gratitude to the starVLA project and the V-JEPA2 project for their invaluable open-source contributions.
If you find our code or models useful in your work, please cite our paper:
```bibtex
@misc{vlajepa2026,
  title={VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model},
  author={Jingwen Sun and Wenyao Zhang and Zekun Qi and Shaojie Ren and Zezhi Liu and Hanxin Zhu and Guangzhong Sun and Xin Jin and Zhibo Chen},
  year={2026},
  eprint={2602.10098},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.10098},
}
```