Trending Papers

by AK and the research community

Submitted by taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

AI at Meta · Nov 20, 2025
Submitted by taesiri

SAM 3D: 3Dfy Anything in Images

SAM 3D is a generative model that reconstructs 3D objects from single images using a multi-stage training framework that includes synthetic pretraining and real-world alignment, achieving high performance in human preference tests.

AI at Meta · Nov 20, 2025
Submitted by AdinaY

Depth Anything 3: Recovering the Visual Space from Any Views

Depth Anything 3 (DA3) uses a plain transformer for geometry prediction from visual inputs, achieving state-of-the-art results in camera pose estimation, any-view geometry, visual rendering, and monocular depth estimation.

ByteDance Seed · Nov 13, 2025
Submitted by kcz358

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

OpenMMReasoner, a two-stage training approach combining supervised fine-tuning and reinforcement learning, enhances multimodal reasoning performance through rigorous data curation and improved training strategies.

LMMs-Lab · Nov 20, 2025
Submitted by nielsr

Back to Basics: Let Denoising Generative Models Denoise

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
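
To make the abstract's central distinction concrete, here is a minimal sketch (not the authors' code) contrasting the usual noise-prediction objective with the clean-data prediction the paper advocates, assuming a standard DDPM-style schedule `alpha_bar`:

```python
import torch

def noise(x0, t, alpha_bar):
    """DDPM-style forward process: x_t = sqrt(a_t)*x0 + sqrt(1-a_t)*eps."""
    a = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

def eps_prediction_loss(model, x0, t, alpha_bar):
    # the common choice: regress the noise, a quantity that does NOT lie
    # on the low-dimensional data manifold
    xt, eps = noise(x0, t, alpha_bar)
    return ((model(xt, t) - eps) ** 2).mean()

def x_prediction_loss(model, x0, t, alpha_bar):
    # what the paper advocates: regress the clean image itself, so the
    # network's target stays on the data manifold
    xt, _ = noise(x0, t, alpha_bar)
    return ((model(xt, t) - x0) ** 2).mean()
```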

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving higher accuracy and faster response times.

5 authors · Oct 8, 2024
Submitted by taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining a NaViT-style visual encoder with the ERNIE-4.5 language model, achieves state-of-the-art performance in document parsing with minimal resource consumption.

PaddlePaddle · Oct 16, 2025
Submitted by dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

6 authors · Oct 26, 2025
Submitted by LibraTree

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

GeoVista, an agentic model integrating tool invocation and reinforcement learning, achieves high geolocalization performance on GeoBench, outperforming open-source models and matching closed-source models.

Tencent Hunyuan · Nov 19, 2025
Submitted by taesiri

MHR: Momentum Human Rig

MHR combines ATLAS's skeleton/shape paradigm with a modern rig to provide expressive, anatomically plausible human animation for AR/VR and graphics.

41 authors · Nov 19, 2025
Submitted by taesiri

PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.

5 authors · Nov 17, 2025
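
The abstract attributes much of PhysX-Anything's gain to a geometry representation compact enough to fit standard VLM token budgets (the 193x reduction is the paper's figure), but the representation itself is not detailed above. Purely as a hypothetical illustration of why coarser geometry tokens shrink sequence length, a toy occupancy-grid comparison:

```python
import numpy as np

def occupancy_tokens(points, res):
    """Quantize points in [-1, 1]^3 to a res^3 grid; one token per occupied cell."""
    idx = np.clip(((points + 1) / 2 * res).astype(int), 0, res - 1)
    flat = (idx[:, 0] * res + idx[:, 1]) * res + idx[:, 2]
    return np.unique(flat)

pts = np.random.uniform(-1, 1, (100_000, 3))       # stand-in for a sampled surface
fine, coarse = occupancy_tokens(pts, 64), occupancy_tokens(pts, 16)
print(len(fine), len(coarse), len(fine) / len(coarse))  # coarse grids need far fewer tokens
```
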
Submitted by fengerhu

MobiAgent: A Systematic Framework for Customizable Mobile Agents

MobiAgent, a comprehensive mobile agent system, achieves state-of-the-art performance in real-world mobile scenarios through its MobiMind-series models, AgentRR framework, and MobiFlow benchmarking suite, while also reducing data annotation costs.

10 authors · Aug 30, 2025

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors · Oct 23, 2024

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors · Dec 28, 2024
Submitted by taesiri

MiMo-Embodied: X-Embodied Foundation Model Technical Report

MiMo-Embodied, a cross-embodied foundation model, achieves state-of-the-art performance in both autonomous driving and embodied AI through multi-stage learning, curated data, and CoT/RL fine-tuning.

Xiaomi MiMo · Nov 20, 2025
Submitted by akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors · Mar 20, 2024
Submitted by giantPanda0906

Step-Audio-R1 Technical Report

Step-Audio-R1, using the Modality-Grounded Reasoning Distillation framework, achieves strong reasoning capabilities in audio, outperforming previous models and demonstrating the transferability of reasoning across modalities.

StepFun · Nov 19, 2025
Submitted by wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors · Sep 27, 2024
Submitted by taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors · Sep 26, 2025
Submitted by taesiri

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Kandinsky 5.0 is a family of state-of-the-art generative models for high-resolution images and short videos, featuring model lineups with varying parameters and enhanced training techniques to achieve superior quality and performance.

25 authors · Nov 19, 2025

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors · Feb 8, 2025
Submitted by taesiri

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker consistently benefits from interactive scaling: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

54 authors · Nov 14, 2025
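
The interaction-scaling loop MiroThinker's abstract describes can be pictured as a plain tool-calling loop with a per-task call budget. Everything in this sketch (the `llm` and `tools` interfaces, the truncation rule) is an assumed placeholder, not MiroThinker's actual API:

```python
def research_agent(task, llm, tools, max_tool_calls=600, context_tokens=256_000):
    history = [{"role": "user", "content": task}]
    for _ in range(max_tool_calls):
        reply = llm(history)                        # model proposes the next action
        history.append(reply)
        if "final_answer" in reply:
            return reply["final_answer"]
        call = reply["tool_call"]                   # e.g. web search, page fetch
        result = tools[call["name"]](**call["args"])
        history.append({"role": "tool", "content": result})
        # environment feedback arrives every turn, letting the model correct
        # errors instead of reasoning longer in isolation
        while sum(len(str(m)) // 4 for m in history) > context_tokens:
            history.pop(1)                          # naive truncation placeholder
    return None                                     # call budget exhausted
```
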
Submitted by zhangshaolei

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B, an agentic LLM, autonomously completes the data science pipeline from raw data to research reports using curriculum-based training and data-grounded trajectory synthesis.

RUC-DataLab · Oct 19, 2025
Submitted by daixufang

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Agent Lightning is a flexible RL framework for training LLMs in various agents, using a hierarchical RL algorithm and decoupling execution from training to handle complex interactions.

8 authors · Aug 5, 2025
Submitted by akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors · Apr 28, 2025
Submitted by richardxp888

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Agent0, a self-evolving framework utilizing multi-step co-evolution and tool integration, enhances LLM reasoning capabilities without human-curated data.

Submitted by huangsiteng

RynnVLA-002: A Unified Vision-Language-Action and World Model

A unified Vision-Language-Action (VLA) and world model, RynnVLA-002, jointly learns environmental dynamics and action planning, outperforming individual models in both simulation and real-world tasks.

DAMO Academy · Nov 21, 2025
Submitted by YuWangX

MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX, a modular multi-agent memory system, enhances language models' memory capabilities by integrating diverse memory types and a dynamic framework, achieving superior performance in multimodal and long-form conversation benchmarks.

2 authors · Jul 10, 2025
Submitted by taesiri

Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Enterprise Deep Research (EDR) is a multi-agent system that automates report generation and real-time data analysis by integrating specialized agents and tools, outperforming existing agentic systems on open benchmarks.

Salesforce · Oct 20, 2025
Submitted by ZeqiangLai

NaTex: Seamless Texture Generation as Latent Color Diffusion

NaTex generates 3D textures directly using latent color diffusion and geometry-aware models, outperforming previous methods in coherence and alignment.

Tencent Hunyuan · Nov 20, 2025
Submitted by Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

47 authors · Apr 14, 2025
Submitted by Yysrc

Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Mantis, a VLA framework with Disentangled Visual Foresight and a diffusion Transformer, improves action prediction, comprehension, and reasoning while reducing training complexity.

DENG Lab @ SJTU · Nov 20, 2025
Submitted by taesiri

Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery

Skyfall-GS creates large-scale, high-quality 3D urban scenes using satellite imagery and diffusion models, offering real-time exploration and improved geometry and texture consistency.

9 authors · Oct 17, 2025
Submitted by YunxinLi

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. over 8 benchmarks), omnimodality understanding (+7% avg. over 4 benchmarks), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.

Lychee Team · Nov 16, 2025
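
A minimal sketch of the shared/routed/null expert pattern the abstract names, assuming top-k token routing; layer sizes, expert shapes, and the routing rule are illustrative guesses, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCapacityMoE(nn.Module):
    def __init__(self, d_model, n_routed=8, top_k=2):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model)           # always-active expert
        self.routed = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed + 1)      # extra slot = null expert
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, d_model)
        out = self.shared(x)                                # shared expert sees every token
        weights = F.softmax(self.router(x), dim=-1)         # (n_tokens, n_routed + 1)
        top_w, top_i = weights.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(out)
        for e, expert in enumerate(self.routed):
            mask = (top_i == e).any(dim=-1)                 # tokens that selected expert e
            if mask.any():
                w = (top_w * (top_i == e)).sum(-1, keepdim=True)[mask]
                routed_out[mask] += w * expert(x[mask])
        # the null expert (router index n_routed) has no parameters: tokens that
        # select it receive no routed computation, so capacity varies per token
        return out + routed_out
```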

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

A sparse Multimodal Mixture of Experts (MoE) model, Uni-MoE, effectively handles multiple data types with efficient training and improved performance through modality-specific encoders, cross-modality alignment, and Low-Rank Adaptation.

8 authors · May 18, 2024

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

5 authors · Jan 20, 2025

Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

Hulu-Med, a transparent medical vision-language model, integrates diverse data modalities and achieves state-of-the-art performance across various clinical tasks with efficient training.

25 authors · Oct 9, 2025

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch Fully Sharded Data Parallel (FSDP) enables efficient and scalable training of large models across hardware configurations.

16 authors · Apr 21, 2023
Submitted by foggyforest

UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

UniMoE-Audio, a unified speech and music generation model using a Dynamic-Capacity Mixture-of-Experts framework, addresses data imbalance and task conflicts, achieving state-of-the-art performance and enhanced cross-domain synergy.

Lychee Team · Oct 15, 2025

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

11 authors · Jun 28, 2020
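
The mechanisms this summary names map onto the public DDP API: `bucket_cap_mb` controls gradient bucketing (letting all-reduce overlap the backward pass), and `no_sync()` implements selective synchronization during gradient accumulation. A minimal single-node sketch with toy data, assuming a `torchrun` launch:

```python
import contextlib
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                 # env vars supplied by torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = DDP(nn.Linear(512, 10).cuda(), bucket_cap_mb=25)  # 25 MB gradient buckets
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 4                                 # sync gradients every 4 micro-batches
for step in range(100):
    x = torch.randn(32, 512, device="cuda")     # toy batch standing in for a loader
    y = torch.randint(0, 10, (32,), device="cuda")
    sync = (step + 1) % accum_steps == 0
    # no_sync() accumulates grads locally; the bucketed all-reduce (overlapped
    # with the backward pass) fires only on sync steps
    ctx = contextlib.nullcontext() if sync else model.no_sync()
    with ctx:
        loss_fn(model(x), y).backward()
    if sync:
        opt.step()
        opt.zero_grad()
```
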
Submitted by Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Submitted by nielsr

DINOv3

DINOv3, a self-supervised learning model, achieves superior performance across various vision tasks by scaling datasets and models, addressing dense feature degradation, and enhancing flexibility with post-hoc strategies.

AI at Meta · Aug 13, 2025
Submitted by taesiri

Code2Video: A Code-centric Paradigm for Educational Video Generation

Code2Video generates educational videos using a code-centric agent framework, improving coherence and interpretability compared to direct code generation.

Show Lab · Oct 1, 2025
Submitted by taesiri

JoyAgent-JDGenie: Technical Report on the GAIA

A generalist agent architecture combining multi-agent planning, hierarchical memory, and a refined tool suite outperforms existing systems in diverse tasks.

jingdong · Oct 1, 2025
Submitted by richardxp888

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

WebWatcher, a multimodal agent with enhanced visual-language reasoning, outperforms existing agents in complex visual and textual information retrieval tasks using synthetic trajectories and reinforcement learning.

Alibaba-NLP · Aug 7, 2025
Submitted by callanwu

WebDancer: Towards Autonomous Information Seeking Agency

The paper proposes a framework for building end-to-end agentic information seeking agents through a combination of data construction, trajectory sampling, supervised fine-tuning, and reinforcement learning, showcasing its effectiveness on information seeking benchmarks.

12 authors · May 28, 2025
Submitted by callanwu

WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

WebShaper, a formalization-driven framework, synthesizes information-seeking datasets using set theory and Knowledge Projections to enhance reasoning structure and achieve top performance in open-sourced benchmarks.

Alibaba-NLP · Jul 20, 2025
Submitted by learn3r

WebSailor: Navigating Super-human Reasoning for Web Agent

WebSailor, a post-training methodology, enhances open-source LLMs with sophisticated reasoning to match proprietary systems in complex information-seeking tasks.

19 authors · Jul 3, 2025
Submitted by callanwu

Scaling Agents via Continual Pre-training

AgentFounder, a deep research agent model incorporating Agentic Continual Pre-training, achieves state-of-the-art performance in agentic tasks while maintaining strong tool-use ability.

22 authors · Sep 16, 2025
Submitted by callanwu

WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

WebSailor-V2, a post-training methodology, enhances open-source models with systematic uncertainty reduction, matching proprietary agents' performance on complex information-seeking tasks.

17 authors · Sep 16, 2025