Skip to content

yellow-binary-tree/MMDuet2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning


📖 Paper · ⭐ GitHub · 📊 Dataset · 🤗 Checkpoints

Key Features:

  • MMDuet2 is a Video MLLM for proactive interaction, which means that it can not only reply right after the user's turn, but also at any approprite and timely moment during the video playback.

  • With only a 3B model, MMDuet2 is lightweight and fast for real-time interaction.

  • Responses are neither too sparse nor too dense and repetitive, which was a common issue in previous works.

  • Example Videos:

single-question-proactive-interaction.mp4
multi-question-proactive-interaction.mp4

Quick Start: A Real-World Demo with your own laptop camera!

Here we assume you have a GPU server as backend, and a laptop with camera as frontend:

  • On the GPU server, create conda environment and start the backend server:
cd demo/ conda create -n mmduet2-infer python=3.10 conda activate mmduet2-infer pip install -r requirements.txt python api_server.py
  • Download demo/frontend.py to laptop and start the frontend:
pip install requests, opencv-python python frondend.py --server_url http://xxx.xxx.xxx.xxx:8000 # (your server ip)

After starting the frontend, you can type in the terminal to input your text, and type "RESET" to remove all previous frames and messages.

Training and Inference

  • For SFT, follow the instructions in train/README.md

  • For RL, follow the instructions in rl/README.md

  • For proactive inference and evaluation, follow the instructions in proactive_eval/README.md

  • When inference on offline video understanding (Video-MME, LongVideoBench, etc.), MMDuet2 is identical to Qwen2.5-VL-Instruct. You can use frameworks including lmms-eval just like working on Qwen2.5-VL.

Star History

Star History Chart

Acknowledgement

We thank the following projects for their open-source contributions:

About

[ICLR 2026] MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors