
Trending Papers

by AK and the research community

Submitted by akhaliq

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

WizardCoder, a Code LLM fine-tuned with complex instructions using Evol-Instruct, outperforms other open-source and closed-source LLMs on several code generation benchmarks.

Microsoft · Jun 14, 2023
Submitted by taesiri

SAM 3: Segment Anything with Concepts

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

AI at Meta · Nov 20, 2025
Submitted by taesiri

HunyuanOCR Technical Report

HunyuanOCR, a lightweight vision-language model, achieves state-of-the-art OCR performance through a unified end-to-end architecture that combines a Vision Transformer with a lightweight LLM, supported by data-driven and RL strategies.

Tencent Hunyuan · Nov 24, 2025
Submitted by taesiri

SAM 3D: 3Dfy Anything in Images

SAM 3D is a generative model that reconstructs 3D objects from single images using a multi-stage training framework that includes synthetic pretraining and real-world alignment, achieving high performance in human preference tests.

AI at Meta · Nov 20, 2025
Submitted by JiaaqiLiu

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Agent0-VL, a self-evolving vision-language agent, incorporates tool usage into both reasoning and self-evaluation, enabling continual improvement through evidence-grounded analysis and reinforcement learning.

Submitted by taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style visual encoder and ERNIE-4.5 language model, achieves state-of-the-art performance in document parsing with minimal resource consumption.

PaddlePaddle · Oct 16, 2025

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving higher accuracy and faster response times.

5 authors · Oct 8, 2024
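The graph-based retrieval idea behind LightRAG can be sketched in a few lines: start from entities matched in the query, expand along the entity graph, and collect the text chunks attached to every entity reached. The data structures below are illustrative stand-ins, not LightRAG's actual storage layer.

```python
def graph_retrieve(graph, entity_chunks, query_entities):
    """Graph-augmented retrieval sketch: expand one hop from the query's
    entities and gather every chunk linked to the reached entities."""
    reached = set(query_entities)
    for e in query_entities:
        reached.update(graph.get(e, ()))        # one-hop neighbors
    results = []
    for e in sorted(reached):
        for chunk in entity_chunks.get(e, ()):
            if chunk not in results:
                results.append(chunk)
    return results

# Toy entity graph extracted from a small document collection.
graph = {"insulin": {"pancreas", "diabetes"}, "diabetes": {"insulin"}}
entity_chunks = {
    "insulin": ["Insulin is produced in the pancreas."],
    "pancreas": ["The pancreas regulates blood sugar."],
    "diabetes": ["Diabetes involves impaired insulin response."],
}
hits = graph_retrieve(graph, entity_chunks, ["insulin"])
```

A plain vector retriever would have to match each chunk independently; the graph hop is what pulls in related context ("pancreas", "diabetes") that never appears verbatim in the query.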
Submitted by richardxp888

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Agent0, a self-evolving framework utilizing multi-step co-evolution and tool integration, enhances LLM reasoning capabilities without human-curated data.

Submitted by lz1001

General Agentic Memory Via Deep Research

GAM, a novel framework that borrows principles from just-in-time (JIT) compilation, improves memory efficiency and task completion by pairing a lightweight memorizer and researcher with reinforcement learning.

Submitted by dyyyyyyyy

FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Flawed-Aware Policy Optimization (FAPO) enhances reinforcement learning with verifiable rewards by penalizing flawed-positive rollouts, improving reasoning capability and training stability in large language models.

6 authors · Oct 26, 2025
Submitted by fengerhu

MobiAgent: A Systematic Framework for Customizable Mobile Agents

MobiAgent, a comprehensive mobile agent system, achieves state-of-the-art performance in real-world mobile scenarios through its MobiMind-series models, AgentRR framework, and MobiFlow benchmarking suite, while also reducing data annotation costs.

10 authors · Aug 30, 2025
Submitted by AdinaY

Depth Anything 3: Recovering the Visual Space from Any Views

Depth Anything 3 (DA3) uses a plain transformer for geometry prediction from visual inputs, achieving state-of-the-art results in camera pose estimation, any-view geometry, visual rendering, and monocular depth estimation.

ByteDance Seed · Nov 13, 2025
Submitted by taesiri

GigaWorld-0: World Models as Data Engine to Empower Embodied AI

GigaWorld-0 is a unified world model framework that integrates video generation and 3D modeling to produce high-quality, diverse, and physically plausible VLA data, enabling strong real-world performance in embodied AI without real-world training.

25 authors · Nov 25, 2025
Submitted by akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors · Mar 20, 2024
Submitted by jiaruz2

Latent Collaboration in Multi-Agent Systems

LatentMAS enables efficient and effective collaboration among LLM agents using latent space representations, enhancing reasoning quality and reducing computational costs.

Princeton-AI · Nov 25, 2025
Submitted by KevinQHLin

Paper2Video: Automatic Video Generation from Scientific Papers

PaperTalker is a multi-agent framework that automates academic presentation video generation by integrating slide generation, layout refinement, subtitling, speech synthesis, and talking-head rendering, outperforming existing methods.

Show Lab · Oct 6, 2025

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors · Dec 28, 2024
Submitted by nielsr

Back to Basics: Let Denoising Generative Models Denoise

Today's denoising diffusion models do not "denoise" in the classical sense, i.e., they do not directly predict clean images. Rather, the neural networks predict noise or a noised quantity. In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. According to the manifold assumption, natural data should lie on a low-dimensional manifold, whereas noised quantities do not. With this assumption, we advocate for models that directly predict clean data, which allows apparently under-capacity networks to operate effectively in very high-dimensional spaces. We show that simple, large-patch Transformers on pixels can be strong generative models: using no tokenizer, no pre-training, and no extra loss. Our approach is conceptually nothing more than "Just image Transformers", or JiT, as we call it. We report competitive results using JiT with large patch sizes of 16 and 32 on ImageNet at resolutions of 256 and 512, where predicting high-dimensional noised quantities can fail catastrophically. With our networks mapping back to the basics of the manifold, our research goes back to basics and pursues a self-contained paradigm for Transformer-based diffusion on raw natural data.
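The distinction the abstract draws between predicting clean data and predicting noise can be made concrete with the standard forward process; the cosine schedule and toy dimensions below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, t, noise):
    """Variance-preserving interpolation between clean data and noise."""
    alpha = np.cos(t * np.pi / 2)   # signal coefficient
    sigma = np.sin(t * np.pi / 2)   # noise coefficient
    return alpha * x0 + sigma * noise

x0 = rng.standard_normal(16)        # a "clean image" (flattened, toy-sized)
noise = rng.standard_normal(16)
t = 0.7
xt = forward_diffuse(x0, t, noise)

# epsilon-prediction: the regression target is the noise itself,
# which does not lie on the low-dimensional data manifold.
eps_target = noise

# x-prediction (the JiT choice): the target is the clean image, which
# the manifold assumption says is effectively low-dimensional.
x_target = x0

# The two parameterizations are linearly related given (alpha, sigma),
# but the paper's point is that they behave very differently for an
# under-capacity network in high dimensions.
alpha, sigma = np.cos(t * np.pi / 2), np.sin(t * np.pi / 2)
x0_from_eps = (xt - sigma * eps_target) / alpha
```

The conversion shows the targets carry the same information in principle; the argument is about which one a limited-capacity Transformer can actually fit.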

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors · Oct 23, 2024
Submitted by daixufang

Agent Lightning: Train ANY AI Agents with Reinforcement Learning

Agent Lightning is a flexible RL framework for training the LLMs within any agent, using a hierarchical RL algorithm and decoupling execution from training to handle complex interactions.

8 authors · Aug 5, 2025
Submitted by jiamingZ

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

SteadyDancer, an Image-to-Video framework, ensures first-frame identity preservation and precise motion control through harmonized conditions, adaptive pose representation, and hierarchical training objectives.

Submitted by taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors · Sep 26, 2025
Submitted by gordonhu

G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

G^2VLM integrates 3D geometry learning with vision-language models to enhance spatial understanding and reasoning, outperforming existing models in these tasks.

Intern Robotics · Nov 26, 2025
Submitted by wanderkid

MinerU: An Open-Source Solution for Precise Document Content Extraction

MinerU is an open-source tool that enhances document content extraction using fine-tuned models and pre/postprocessing rules across diverse document types.

18 authors · Sep 27, 2024
Submitted by YuWangX

MIRIX: Multi-Agent Memory System for LLM-Based Agents

MIRIX, a modular multi-agent memory system, enhances language models' memory capabilities by integrating diverse memory types and a dynamic framework, achieving superior performance in multimodal and long-form conversation benchmarks.

2 authors · Jul 10, 2025

IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS, an enhanced text-to-speech system combining XTTS and Tortoise models, offers improved naturalness, enhanced voice cloning, and controllable usage through hybrid character-pinyin modeling and optimized vector quantization.

5 authors · Feb 8, 2025
Submitted by akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors · Apr 28, 2025
Submitted by zhangshaolei

DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

DeepAnalyze-8B, an agentic LLM, autonomously completes the data science pipeline from raw data to research reports using curriculum-based training and data-grounded trajectory synthesis.

RUC-DataLab · Oct 19, 2025
Submitted by Weiyun1025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3 is a multimodal pre-trained language model that jointly learns from both multimodal data and text, improving performance and scalability through advanced techniques and setting a new state-of-the-art in multimodal tasks.

47 authors · Apr 14, 2025
Submitted by dkliang

Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

ORS3D, a new task requiring language understanding, 3D grounding, and efficient scheduling, is introduced with a large dataset and an embodied multi-modal model named GRANT that uses a scheduling token mechanism for effective task management.

H-EmbodVis · Nov 24, 2025

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

PyTorch Fully Sharded Data Parallel (FSDP) enables efficient and scalable training of large models across hardware configurations.

16 authors · Apr 21, 2023
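The core FSDP memory trade, persistently storing only a shard of each parameter per rank and all-gathering the full tensor just in time for compute, can be simulated without any distributed runtime. This is a sketch of the shard/gather cycle, not PyTorch's implementation.

```python
import numpy as np

def shard(param, world_size):
    """Split a parameter evenly across ranks, padding so it divides."""
    flat = param.ravel()
    pad = (-flat.size) % world_size
    flat = np.concatenate([flat, np.zeros(pad)])
    return np.split(flat, world_size), param.shape, pad

def all_gather(shards, shape, pad):
    """Reconstruct the full parameter for the forward/backward pass;
    in real FSDP it is freed again afterwards, so peak memory per rank
    scales with the shard size, not the full model size."""
    flat = np.concatenate(shards)
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

rng = np.random.default_rng(3)
w = rng.standard_normal((3, 5))                 # 15 elements
shards, shape, pad = shard(w, world_size=4)     # 4 shards of 4 (1 padded)
w_full = all_gather(shards, shape, pad)
```

Each simulated rank holds one entry of `shards`; the round trip through `all_gather` recovers the original weight exactly.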

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

A simple head-specific sigmoid gate applied after Scaled Dot-Product Attention improves performance, stability, and scaling in large models, mitigating 'attention sink' and enhancing long-context extrapolation.

13 authors · May 10, 2025
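The mechanism in the summary is simple enough to sketch for a single head. Computing the gate from the layer input and the weight names used here are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(q, k, v):
    """Scaled dot-product attention for one head; q, k, v: (seq, dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_head(q, k, v, x, w_gate):
    """One head with a head-specific sigmoid gate applied after SDPA.

    x: (seq, d_model) layer input; w_gate: (d_model,) this head's gate
    weights. The gate rescales each position's head output into (0, 1),
    letting a head switch itself off instead of dumping probability mass
    on an attention-sink token.
    """
    out = sdpa(q, k, v)                           # (seq, dim)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))    # (seq,) per-position gate
    return gate[:, None] * out

rng = np.random.default_rng(1)
seq, d_model, d_head = 4, 8, 8
x = rng.standard_normal((seq, d_model))
q, k, v = (rng.standard_normal((seq, d_head)) for _ in range(3))
w_gate = rng.standard_normal(d_model)
y = gated_head(q, k, v, x, w_gate)
```

Since the sigmoid is strictly between 0 and 1, the gated output is always elementwise smaller in magnitude than the raw attention output, which is the non-linearity-plus-sparsity effect named in the title.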

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

The PyTorch distributed data parallel module optimizes large-scale model training using techniques like gradient bucketing, computation-communication overlap, and selective synchronization to achieve near-linear scalability.

11 authors · Jun 28, 2020
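Gradient bucketing, the first of the techniques the summary names, can be sketched with simulated ranks: per-parameter gradients are packed into fixed-capacity buckets so each bucket is all-reduced as one flat tensor. Element-count buckets and two in-process "ranks" are simplifications for illustration.

```python
import numpy as np

def make_buckets(grads, bucket_size):
    """Group per-parameter gradients into buckets of at most bucket_size
    elements, so communication happens in fewer, larger calls."""
    buckets, current, used = [], [], 0
    for g in grads:
        if used + g.size > bucket_size and current:
            buckets.append(current)
            current, used = [], 0
        current.append(g)
        used += g.size
    if current:
        buckets.append(current)
    return buckets

def allreduce_mean(bucket_per_rank):
    """Simulated all-reduce: average one flattened bucket across ranks."""
    flat = [np.concatenate([g.ravel() for g in b]) for b in bucket_per_rank]
    return np.mean(flat, axis=0)

# Two simulated ranks with identical parameter shapes, different grads.
rng = np.random.default_rng(2)
shapes = [(3,), (4,), (2, 2), (5,)]
rank_grads = [[rng.standard_normal(s) for s in shapes] for _ in range(2)]

buckets0 = make_buckets(rank_grads[0], bucket_size=8)
buckets1 = make_buckets(rank_grads[1], bucket_size=8)

# In real DDP each bucket's all-reduce launches as soon as its gradients
# are produced by backward, overlapping communication with the remaining
# computation; here the buckets are simply reduced in order.
reduced = [allreduce_mean(pair) for pair in zip(buckets0, buckets1)]
```

With a bucket capacity of 8 elements, the four gradients (sizes 3, 4, 4, 5) pack into three buckets instead of four separate all-reduce calls.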
Submitted by giantPanda0906

Step-Audio-R1 Technical Report

Step-Audio-R1, using the Modality-Grounded Reasoning Distillation framework, achieves strong reasoning capabilities in audio, outperforming previous models and demonstrating the transferability of reasoning across modalities.

StepFun · Nov 19, 2025

Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild

Gradio is an open-source Python package that facilitates collaboration by enabling non-specialized users to interact with ML models through a visual interface, enhancing accessibility and usability in interdisciplinary settings.

6 authors · Jun 6, 2019
Submitted by Owen777

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

LucidFlux, a caption-free UIR framework using a diffusion transformer, achieves robust image restoration through adaptive conditioning and SigLIP features without text prompts.

W2GenAI Lab · Sep 26, 2025
Submitted by hamishivi

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Reinforcement Learning with Evolving Rubrics (RLER) enables training of deep research models for long-form tasks, outperforming existing models and proprietary systems while being more cost-effective.

RL ReSearch · Nov 24, 2025
Submitted by taesiri

MiMo-Embodied: X-Embodied Foundation Model Technical Report

MiMo-Embodied, a cross-embodied foundation model, achieves state-of-the-art performance in both autonomous driving and embodied AI through multi-stage learning, curated data, and CoT/RL fine-tuning.

Xiaomi MiMo · Nov 20, 2025

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT on the DMR and LongMemEval benchmarks by excelling at dynamic knowledge integration and temporal reasoning, both critical for enterprise use cases.

5 authors · Jan 20, 2025
Submitted by Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Submitted by taesiri

MHR: Momentum Human Rig

MHR combines ATLAS's skeleton/shape paradigm with a modern rig to provide expressive, anatomically plausible human animation for AR/VR and graphics.

41 authors · Nov 19, 2025
Submitted by taesiri

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

54 authors · Nov 14, 2025
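The agent-environment loop that interaction scaling presupposes is easy to sketch: at each step the policy either calls a tool and folds the observation back into its context, or commits to an answer, under a fixed call budget. The `policy` and `tools` interfaces below are stand-ins for illustration, not MiroThinker's actual APIs.

```python
def interactive_loop(policy, tools, task, max_calls=600):
    """Minimal tool-use loop with a per-task call budget.

    Each tool observation is appended to the context, which is the
    "environment feedback" that lets the model correct errors and refine
    its trajectory across turns.
    """
    context = [("task", task)]
    for _ in range(max_calls):
        action, payload = policy(context)
        if action == "answer":
            return payload, context
        observation = tools[action](payload)   # environment feedback
        context.append((action, observation))  # grows the interaction trace
    return None, context                       # budget exhausted

# Toy policy: search once, then answer with whatever the search returned.
def toy_policy(context):
    if len(context) == 1:
        return "search", "capital of France"
    return "answer", context[-1][1]

tools = {"search": lambda query: "Paris"}
answer, trace = interactive_loop(
    toy_policy, tools, "What is the capital of France?"
)
```

Raising `max_calls` is the knob the abstract's "interaction scaling" refers to: the claim is that trained policies convert a larger interaction budget into predictably better research performance.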
Submitted by kcz358

OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

OpenMMReasoner, a two-stage training approach combining supervised fine-tuning and reinforcement learning, enhances multimodal reasoning performance through rigorous data curation and improved training strategies.

LMMs-Lab · Nov 20, 2025
Submitted by Jeff-Wang

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0, a VLA foundation model, uses world model-generated data to enhance cross-task generalization and policy robustness, improving real-world performance on complex manipulation tasks.

GigaAI · Oct 22, 2025
Submitted by nielsr

DINOv3

DINOv3, a self-supervised learning model, achieves superior performance across various vision tasks by scaling datasets and models, addressing dense feature degradation, and enhancing flexibility with post-hoc strategies.

AI at Meta · Aug 13, 2025
Submitted by justinxzhao

Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus

A democratic council framework improves benchmarking of Large Language Models, particularly for subjective tasks like emotional intelligence, by reducing bias and increasing robustness.

3 authors · Jun 12, 2024
Submitted by taesiri

PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image

3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.

5 authors · Nov 17, 2025
Submitted by hiyouga

Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

A framework called Easy Dataset synthesizes fine-tuning data from unstructured documents using a GUI and LLMs, improving domain-specific performance of LLMs while maintaining general knowledge.

7 authors · Jul 5, 2025
Submitted by xhyandwyy

UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning

Semi-online Reinforcement Learning addresses the limitations of offline and online RL by simulating online RL on offline trajectories, achieving state-of-the-art performance in dynamic benchmarks.

11 authors · Sep 15, 2025
Submitted by xhyandwyy

Mobile-Agent-v3: Foundamental Agents for GUI Automation

GUI-Owl and Mobile-Agent-v3 are open-source GUI agent models and frameworks that achieve state-of-the-art performance across various benchmarks using innovations in environment infrastructure, agent capabilities, and scalable reinforcement learning.

15 authors · Aug 21, 2025