📖 Paper · ⭐ GitHub · 📊 Dataset · 🤗 Checkpoints
Key Features:

- MMDuet2 is a Video MLLM for proactive interaction: it can reply not only right after the user's turn, but also at any appropriate and timely moment during video playback.
- With only a 3B model, MMDuet2 is lightweight and fast enough for real-time interaction.
- Responses are neither too sparse nor too dense and repetitive, a common issue in previous works.
-
Example Videos:

- single-question-proactive-interaction.mp4
- multi-question-proactive-interaction.mp4
Here we assume you have a GPU server as the backend and a laptop with a camera as the frontend:
- On the GPU server, create a conda environment and start the backend server:

  ```shell
  cd demo/
  conda create -n mmduet2-infer python=3.10
  conda activate mmduet2-infer
  pip install -r requirements.txt
  python api_server.py
  ```

- Download `demo/frontend.py` to the laptop and start the frontend:

  ```shell
  pip install requests opencv-python
  python frontend.py --server_url http://xxx.xxx.xxx.xxx:8000  # (your server ip)
  ```

  After starting the frontend, you can type in the terminal to input your text, and type "RESET" to remove all previous frames and messages.
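The frontend's job is to package camera frames and user text into messages for the backend. The sketch below illustrates this in a minimal way; the payload keys and helper names here are hypothetical, for illustration only, and are not the actual schema used by `demo/frontend.py` or `api_server.py`:

```python
import base64

def build_frame_payload(jpeg_bytes: bytes, timestamp: float) -> dict:
    """Encode one JPEG-compressed camera frame as a JSON-serializable message.

    Hypothetical schema: the real frontend/backend define their own format.
    """
    return {
        "type": "frame",
        "timestamp": timestamp,
        "image_base64": base64.b64encode(jpeg_bytes).decode("ascii"),
    }

def build_text_payload(text: str) -> dict:
    """Wrap a user text turn; typing "RESET" clears all previous frames and messages."""
    if text.strip() == "RESET":
        return {"type": "reset"}
    return {"type": "text", "text": text}

if __name__ == "__main__":
    frame_msg = build_frame_payload(b"\xff\xd8\xff", timestamp=0.5)
    text_msg = build_text_payload("What is happening now?")
    reset_msg = build_text_payload("RESET")
```

In a real loop, each payload would be POSTed to the backend with `requests` while frames are grabbed from the camera with `opencv-python`.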
- For SFT, follow the instructions in `train/README.md`.
- For RL, follow the instructions in `rl/README.md`.
- For proactive inference and evaluation, follow the instructions in `proactive_eval/README.md`.
- For inference on offline video understanding benchmarks (Video-MME, LongVideoBench, etc.), MMDuet2 behaves identically to Qwen2.5-VL-Instruct, so you can use evaluation frameworks such as lmms-eval just as you would with Qwen2.5-VL.
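Since MMDuet2 is drop-in compatible with Qwen2.5-VL for offline inference, prompts follow the standard Qwen2.5-VL chat-message layout. A minimal sketch of that message structure (the video path and question are placeholders):

```python
def build_video_messages(video_path: str, question: str) -> list:
    # Standard Qwen2.5-VL chat format: a single user turn whose content
    # holds the video entry followed by the text question. A processor
    # pipeline (e.g. AutoProcessor plus qwen-vl-utils) consumes this list.
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_video_messages("demo.mp4", "What happens in this video?")
```

From here, evaluation proceeds exactly as with Qwen2.5-VL-Instruct, whether through a framework like lmms-eval or a hand-rolled transformers script.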
We thank the following projects for their open-source contributions: