FLARE: Robot Learning with Implicit World Modeling

Zheng, Ruijie; Wang, Jing; Reed, Scott; Bjorck, Johan; Fang, Yu; Hu, Fengyuan; Jang, Joel; Kundalia, Kaushil; Lin, Zongyu; Magne, Loic; Narayan, Avnish; Tan, You Liang; Wang, Guanzhi; Wang, Qi; Xiang, Jiannan; Xu, Yinzhen; Ye, Seonghyeon; Kautz, Jan; Huang, Furong; Zhu, Yuke; Fan, Linxi

arXiv:2505.15659 (cs)

[Submitted on 21 May 2025]

Abstract: We introduce Future LAtent REpresentation Alignment (FLARE), a novel framework that integrates predictive latent world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, FLARE enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, FLARE requires only minimal architectural modifications -- adding a few tokens to standard vision-language-action (VLA) models -- yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, FLARE achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, FLARE unlocks the ability to co-train with human egocentric video demonstrations without action labels, significantly boosting policy generalization to a novel object with unseen geometry from as few as a single robot demonstration. Our results establish FLARE as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
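
The alignment idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss, so everything here is an assumption made for illustration: the cosine-similarity form of the alignment term, the mean-squared-error stand-in for the policy's action objective, the weight `align_weight`, and all function names are hypothetical and are not taken from the paper. The sketch shows one plausible way to combine an action loss with a future-latent alignment loss, and how a sample without action labels (such as a human egocentric video clip) could still contribute through the alignment term alone.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(predicted: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Assumed form: 1 - mean cosine similarity over the future tokens.

    `predicted` stands for policy features read out at the extra tokens,
    `target` for latent embeddings of a future observation.
    """
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target need the same non-zero length")
    sims = [cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return 1.0 - sum(sims) / len(sims)


def action_loss(pred: Sequence[Vector], true: Sequence[Vector]) -> float:
    """Stand-in action objective (plain MSE); the real policy loss may differ."""
    errors = [(p - t) ** 2 for ps, ts in zip(pred, true) for p, t in zip(ps, ts)]
    if not errors:
        raise ValueError("no action values to compare")
    return sum(errors) / len(errors)


def total_loss(
    pred_future: Sequence[Vector],
    target_future: Sequence[Vector],
    pred_actions: Sequence[Vector],
    true_actions: Optional[Sequence[Vector]] = None,
    align_weight: float = 0.2,  # hypothetical weight, not from the paper
) -> float:
    """Alignment term always applies; action term only when labels exist."""
    loss = align_weight * alignment_loss(pred_future, target_future)
    if true_actions is not None:
        loss += action_loss(pred_actions, true_actions)
    return loss
```

Under these assumptions, passing `true_actions=None` models an action-free video sample: the policy is still trained to predict future latents, while the action term is skipped.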

Comments:	Project Webpage / Blogpost: this https URL
Subjects:	Robotics (cs.RO); Machine Learning (cs.LG)
Cite as:	arXiv:2505.15659 [cs.RO]
	(or arXiv:2505.15659v1 [cs.RO] for this version)
	https://doi.org/10.48550/arXiv.2505.15659

Submission history

From: Ruijie Zheng
[v1] Wed, 21 May 2025 15:33:27 UTC (13,543 KB)
