[Multimodal Support] Add MiniGPT4 #390
Merged
This PR introduces MiniGPT4, a model that enables large language models to see. It is built on top of the Gradio interface and integrated into the MLC-llm Python package. This PR supports the CUDA backend in the CLI. Further performance optimization, support for other backends (Vulkan, Metal, iPhone, Android, WebGPU, etc.), and edge-case detection will be introduced in follow-up PRs.
Design in this PR:
0. Searching and combining different compiled models
Since multimodal models typically come in several parts (for example, the architecture in this PR consists of the MiniGPT image model and the Vicuna model), we adhere to the convention that they are compiled separately and stored in different folders. When the user specifies MiniGPT with a certain quantization type, the Vicuna model with the corresponding quantization type and the MiniGPT image model with the closest dtype are loaded. Currently no quantization is supported for the MiniGPT image model alone, since the image-uploading stage does not gain much performance from it. Details can be found in the reload_model() function in gradio.py.
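As a rough sketch of this selection step (not the actual reload_model() code; the directory layout and helper name below are assumptions for illustration), the matching could look like:

```python
import os

# Hypothetical sketch of how the Vicuna weights and the MiniGPT image model
# could be located for a requested quantization. The folder naming scheme and
# this helper are assumptions, not the actual reload_model() implementation.
def find_model_dirs(artifact_root: str, quantization: str):
    """Return (vicuna_dir, minigpt_dir) for a MiniGPT request.

    The Vicuna model must match the requested quantization exactly; the
    MiniGPT image model is not quantized on its own, so we pick the folder
    whose dtype is closest to the requested quantization.
    """
    dirs = os.listdir(artifact_root)

    # Vicuna: exact quantization match, e.g. a folder name ending in the
    # requested quantization string.
    vicuna_dir = next(
        d for d in dirs if d.startswith("vicuna") and d.endswith(quantization)
    )

    # MiniGPT image model: choose the folder whose dtype matches the
    # activation dtype implied by the requested quantization.
    dtype = "f16" if "f16" in quantization else "f32"
    minigpt_dir = next(
        d for d in dirs if d.startswith("minigpt") and dtype in d
    )

    return (os.path.join(artifact_root, vicuna_dir),
            os.path.join(artifact_root, minigpt_dir))
```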
1. Isolated Embed Tokens Workflow (only for llama-related models)
In order to insert the image embedding in the middle of the embedding of the tokenized prompt (identified by a placeholder), we found it necessary to isolate embed_tokens() into a separate runtime function called embed. To handle this new runtime function in the MiniGPT use case and minimize the impact on other use cases, I adhere to the following when introducing embed into the workflow (a sketch of the resulting flow is given after this list):
- I modified llama.py to isolate the embed_tokens() function and did not change other models, since MiniGPT only relies on Vicuna models.
- I modified cli_main.cc so that no matter whether the user has old compiled llama-related models or ones newly compiled with the updated llama.py, the CLI handles both workflows by detecting whether an embed function exists.
- Besides the existing PrefillStep workflow, I introduced a new PrefillWithEmbedStep function to handle embeddings produced by EmbedStep. This workflow is only applied when the model is detected to have an embed function, and thus does not affect the original workflow in other use cases.
- EmbedStep takes in a text input, gets the prompt, splits the prompt string by multimodal placeholders, and returns an array of embeddings of the tokenized prompt pieces. It does not concatenate the results because concatenation on TVM runtime NDArray is not supported, and it is easier to handle the concatenation in Python and numpy (in the Gradio case).
🛑 The embed workflow is still not mature enough to be introduced globally, for the following reason: in llm_chat.cc, the index is currently hardcoded for the Vicuna case, but the index could be different in other LMs.
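To make the split-and-concatenate flow concrete, here is a minimal numpy sketch of what the Gradio side could do with the per-chunk embeddings from EmbedStep and the cached image embedding. The placeholder string and function names are illustrative assumptions, not the exact implementation.

```python
import numpy as np

# Minimal sketch (assumptions, not the exact implementation): the prompt is
# split on a multimodal placeholder, each text chunk is embedded separately
# (conceptually what EmbedStep returns as an array of embeddings), and the
# image embedding is spliced in between the chunks with numpy concatenation,
# since concatenation on TVM runtime NDArray is not supported.
PLACEHOLDER = "<Img>"  # assumed placeholder string for illustration

def build_prefill_input(prompt: str, embed_text, vision_embed: np.ndarray) -> np.ndarray:
    """Concatenate text-chunk embeddings with the image embedding.

    embed_text: callable mapping a text chunk to an np.ndarray of shape
                (1, num_tokens, hidden_size); stands in for the runtime
                embed function exposed by the re-compiled llama model.
    vision_embed: np.ndarray of shape (1, num_image_tokens, hidden_size),
                  produced once when the image is uploaded.
    """
    chunks = prompt.split(PLACEHOLDER)
    pieces = []
    for i, chunk in enumerate(chunks):
        if chunk:
            pieces.append(embed_text(chunk))
        # Insert the image embedding at every placeholder position.
        if i < len(chunks) - 1:
            pieces.append(vision_embed)
    # The concatenated embedding would then be consumed by the
    # PrefillWithEmbedStep-style prefill.
    return np.concatenate(pieces, axis=1)
```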
2. Prompt and New Conversation Template

In order to support Vicuna with a prompt different from its original one, I override the conv_template to be "minigpt" when Vicuna is called from the MiniGPT use case. I introduced a new separator style called kAccumRoleMsg in conversation.h to handle the fact that in MiniGPT the prompt accumulates the history of all user inputs and LM responses. I also enabled splitting the prompt string by placeholder by modifying the GetInputTokens() workflow in llm_chat.cc.
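For intuition only, a prompt built under an accumulate-all-history separator style might be assembled as in the sketch below; the system prompt, role names, and separator are assumptions, not the exact "minigpt" template.

```python
# Illustrative sketch of the "accumulate role messages" idea behind
# kAccumRoleMsg: every past user input and LM response stays in the prompt.
# The role names and separator below are assumptions for illustration.
def build_accumulated_prompt(system, history, new_input, sep="###"):
    parts = [system]
    for user_msg, model_msg in history:
        parts.append(f"{sep}Human: {user_msg}")
        parts.append(f"{sep}Assistant: {model_msg}")
    parts.append(f"{sep}Human: {new_input}")
    parts.append(f"{sep}Assistant:")
    return " ".join(parts)

# Example: the second round's prompt still contains the whole first round.
history = [("<Img> What is in the picture?", "A cat sitting on a mat.")]
print(build_accumulated_prompt("Give helpful answers.", history, "What color is it?"))
```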
3. Gradio Workflow

The MiniGPT image model is not loaded into llm_chat.cc since it does not fit into the LM model pipeline; it is instead stored under vision-related attributes in GradioChatModule. When upload_image() is triggered, the image model generates an image embedding, which is stored as vision_embed. When the user resets the chat or removes the image, vision_embed is cleared as well.
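A minimal sketch of this state handling is given below; apart from upload_image() and vision_embed, which are named in this PR, the class, attribute, and method names are assumptions.

```python
# Sketch of the vision-related state kept outside llm_chat.cc. Only
# upload_image() and vision_embed are named in this PR; everything else
# here is an assumption for illustration.
class GradioChatModuleSketch:
    def __init__(self, image_model):
        self.image_model = image_model   # compiled MiniGPT image model
        self.vision_embed = None         # set once an image is uploaded

    def upload_image(self, image):
        # The image model runs once per uploaded image and the resulting
        # embedding is cached for later prefill calls.
        self.vision_embed = self.image_model(image)

    def remove_image(self):
        self.vision_embed = None

    def reset_chat(self):
        # Resetting the chat also clears the cached image embedding.
        self.vision_embed = None
```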
Example commands and demo:
Coming soon!
Progress: