📖 Paper · ⭐ GitHub · 📊 Dataset · 🤗 Checkpoints
Key Features:

- MMDuet2 is a Video MLLM for proactive interaction: it can reply not only right after the user's turn, but also at any appropriate and timely moment during video playback.
- With only a 3B model, MMDuet2 is lightweight and fast enough for real-time interaction.
- Responses are neither too sparse nor too dense and repetitive, a common issue in previous works.
-
Example Videos:

- single-question-proactive-interaction.mp4
- multi-question-proactive-interaction.mp4
Here we assume you have a GPU server as the backend and a laptop with a camera as the frontend:
- On the GPU server, create a conda environment and start the backend server:

  ```shell
  cd demo/
  conda create -n mmduet2-infer python=3.10
  conda activate mmduet2-infer
  pip install -r requirements.txt
  python api_server.py
  ```

- Download `demo/frontend.py` to the laptop and start the frontend:

  ```shell
  pip install requests opencv-python
  python frontend.py --server_url http://xxx.xxx.xxx.xxx:8000  # (your server ip)
  ```

  After starting the frontend, you can type in the terminal to input your text, and type "RESET" to remove all previous frames and messages.
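The frontend's job is to package camera frames and user text into messages for the backend. The sketch below illustrates this in a minimal way; the payload keys and helper names here are hypothetical, for illustration only, and are not the actual schema used by `demo/frontend.py` or `api_server.py`:

```python
import base64

def build_frame_payload(jpeg_bytes: bytes, timestamp: float) -> dict:
    """Encode one JPEG-compressed camera frame as a JSON-serializable message.

    Hypothetical schema: the real frontend/backend define their own format.
    """
    return {
        "type": "frame",
        "timestamp": timestamp,
        "image_base64": base64.b64encode(jpeg_bytes).decode("ascii"),
    }

def build_text_payload(text: str) -> dict:
    """Wrap a user text turn; typing "RESET" clears all previous frames and messages."""
    if text.strip() == "RESET":
        return {"type": "reset"}
    return {"type": "text", "text": text}

if __name__ == "__main__":
    frame_msg = build_frame_payload(b"\xff\xd8\xff", timestamp=0.5)
    text_msg = build_text_payload("What is happening now?")
    reset_msg = build_text_payload("RESET")
```

In a real loop, each payload would be POSTed to the backend with `requests` while frames are grabbed from the camera with `opencv-python`.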
- For SFT, follow the instructions in `train/README.md`.
- For RL, follow the instructions in `rl/README.md`.
- For proactive inference and evaluation, follow the instructions in `proactive_eval/README.md`.
- For inference on offline video understanding benchmarks (Video-MME, LongVideoBench, etc.), MMDuet2 behaves identically to Qwen2.5-VL-Instruct, so you can use evaluation frameworks such as lmms-eval just as you would with Qwen2.5-VL.
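Since MMDuet2 is drop-in compatible with Qwen2.5-VL for offline inference, prompts follow the standard Qwen2.5-VL chat-message layout. A minimal sketch of that message structure (the video path and question are placeholders):

```python
def build_video_messages(video_path: str, question: str) -> list:
    # Standard Qwen2.5-VL chat format: a single user turn whose content
    # holds the video entry followed by the text question. A processor
    # pipeline (e.g. AutoProcessor plus qwen-vl-utils) consumes this list.
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_video_messages("demo.mp4", "What happens in this video?")
```

From here, evaluation proceeds exactly as with Qwen2.5-VL-Instruct, whether through a framework like lmms-eval or a hand-rolled transformers script.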
We thank the following projects for their open-source contributions: