
Conversation

@anibohara2000
Contributor

This PR adds support for the LLaVa-v1.5 model in the serving engine. Use the HF weights and config from https://huggingface.co/llava-hf/llava-1.5-7b-hf.

Passing image input is supported as a URL (reference: https://platform.openai.com/docs/guides/vision).
Example:

data = {
    "model": "dist/llava-1.5-7b-hf-q4f16_1-MLC/params/",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": "https://llava-vl.github.io/static/images/view.jpg",
                },
                {"type": "text", "text": "What does this image represent?"},
            ],
        }
    ],
}
response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=data)
print("Response body:", response.text)
@tqchen
Contributor

tqchen commented Mar 16, 2024

Would be great to also support base64 images as per the reference.

@anibohara2000
Contributor Author

Yes, I will work on supporting base64 images as well. Can you review this PR and merge it in the meantime?
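
For reference, the base64 variant would presumably mirror the URL example from the PR description, with a data URL in the image_url field. This is only a sketch: whether the server accepts data URLs this way depends on the pending base64 support, and the payload shape follows the PR description rather than a finalized API.

import base64

import requests

# Hypothetical sketch: encode a local image as a data URL, following the
# OpenAI vision reference linked above. Acceptance of data URLs in
# "image_url" depends on the base64 support that is still to be added.
with open("view.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

data = {
    "model": "dist/llava-1.5-7b-hf-q4f16_1-MLC/params/",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": f"data:image/jpeg;base64,{b64_image}"},
                {"type": "text", "text": "What does this image represent?"},
            ],
        }
    ],
}
response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=data)
print("Response body:", response.text)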

Member

@MasterJH5574 left a comment


Thank you @anibohara2000, great job! I did a quick first pass.


ObjectRef ImageEmbed(const NDArray& image, ObjectRef* dst, int offset) final {
  CHECK(ft_.image_embed_func_.defined()) << "`image_embed` function is not found in the model. ";
  auto image_dref_or_nd = ft_.CopyToWorker0(image, "image", image.Shape());
Member

Here we want to pass the maximum possible shape as the third argument, which reserves the NDArray at the maximum possible size when allocating. Do all images passed to the image embedding func have the same shape? If that is always true, we can pass image.Shape(); otherwise, we need to pass the maximum shape.

Contributor Author

@anibohara2000 Mar 18, 2024

Yes, in the get_image_from_url function, we process the image to a fixed size. This image is then passed to the model, so all images will be of the same size for a given model.
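
A minimal sketch of the kind of fixed-size preprocessing described here (the helper name is hypothetical, and the 336x336 size is an assumption taken from the CLIP vision tower in the llava-hf config, not from this diff):

from io import BytesIO

import numpy as np
from PIL import Image


def preprocess_image(image_bytes: bytes, image_size: int = 336) -> np.ndarray:
    """Resize an image to the model's fixed input size (hypothetical helper)."""
    image = Image.open(BytesIO(image_bytes)).convert("RGB")
    image = image.resize((image_size, image_size))
    # NCHW float32 array; the real flow also applies CLIP-style normalization.
    return np.asarray(image, dtype="float32").transpose(2, 0, 1)[None, :]

Since the output shape is fixed for a given model, passing image.Shape() to CopyToWorker0 is equivalent to passing the maximum shape.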

Comment on lines 420 to 433
picojson::object vision_config = config["vision_config"].get<picojson::object>();
int image_size = -1;
int patch_size = -1;
if (vision_config.count("image_size")) {
  CHECK(vision_config["image_size"].is<int64_t>());
} else {
  LOG(FATAL) << "Key \"image_size\" not found in vision_config.";
}
if (vision_config.count("patch_size")) {
  CHECK(vision_config["patch_size"].is<int64_t>());
} else {
  LOG(FATAL) << "Key \"patch_size\" not found in vision_config.";
}
this->image_embed_size_ = (image_size * image_size) / (patch_size * patch_size);
Member

Hmmm, did you set the values of image_size and patch_size?

Contributor Author

Thanks for pointing this out. These values are included in mlc-chat-config.json for the LLaVa model, but this was part of an old flow that I am not using right now, so I am removing this code for now.
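
For reference, plugging in the values from the upstream llava-hf vision_config (image_size = 336 and patch_size = 14; these numbers come from the HF config, not from this diff), the formula yields 576 image embeddings:

image_size = 336   # vision_config["image_size"] in the llava-hf config (assumed here)
patch_size = 14    # vision_config["patch_size"]
image_embed_size = (image_size * image_size) // (patch_size * patch_size)  # 24 * 24 = 576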


import fastapi
import requests
import tvm
Member

Let's also lazily import requests and tvm in get_image_from_url, given that not all the functions in this file depend on them.

Contributor Author

Got it!
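
For illustration, the lazy-import pattern being requested might look like this (a sketch only; preprocess_image refers to the hypothetical fixed-size helper sketched earlier in this thread):

def get_image_from_url(url: str):
    """Fetch an image and wrap it for the runtime, importing heavy deps lazily."""
    # Deferred imports: only this helper needs requests and tvm, so importing
    # this module for other entrypoints stays lightweight.
    import requests  # pylint: disable=import-outside-toplevel
    import tvm  # pylint: disable=import-outside-toplevel

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return tvm.nd.array(preprocess_image(response.content))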


_models: Dict[str, async_engine.AsyncThreadedEngine] = {}
_conv_templates: Dict[str, Conversation] = {}
_model_config_paths: Dict[str, str] = {}
Member

Let's load the model config JSON and save the config dictionary in ServerContext, so we don't need to parse the JSON every time in the entrypoint.

Contributor Author

Yes that makes sense. Changed.
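
A sketch of the direction agreed on here (the method names are assumptions; only the idea of caching the parsed dict comes from the discussion):

import json
from typing import Any, Dict


class ServerContext:
    """Sketch: store the parsed model config dict instead of its file path."""

    _model_configs: Dict[str, Dict[str, Any]] = {}

    @staticmethod
    def add_model_config(model: str, config_file_path: str) -> None:
        # Parse the JSON once when the model is registered.
        with open(config_file_path, "r", encoding="utf-8") as file:
            ServerContext._model_configs[model] = json.load(file)

    @staticmethod
    def get_model_config(model: str) -> Dict[str, Any]:
        return ServerContext._model_configs[model]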

Comment on lines 125 to 132
def get_image_embed_size(config_file_path: str) -> int:
    """Get the image embedding size from the model config file."""
    with open(config_file_path, "r", encoding="utf-8") as file:
        config = json.load(file)
    image_size = config["model_config"]["vision_config"]["image_size"]
    patch_size = config["model_config"]["vision_config"]["patch_size"]
    embed_size = (image_size // patch_size) ** 2
    return embed_size
Member

See the other comment: here we can accept the config dict and avoid loading the JSON again.
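
That is, something along these lines (a direct adaptation of the function above; the dict layout follows the keys it already reads):

from typing import Any, Dict


def get_image_embed_size(config: Dict[str, Any]) -> int:
    """Get the image embedding size from an already-parsed model config dict."""
    image_size = config["model_config"]["vision_config"]["image_size"]
    patch_size = config["model_config"]["vision_config"]["patch_size"]
    return (image_size // patch_size) ** 2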

Comment on lines 409 to 420
model_config_path = ServerContext.get_model_config_path(request.model)
image_embed_size = entrypoint_utils.get_image_embed_size(model_config_path)

if content_has_list:
    prompts = entrypoint_utils.process_prompts(
        conv_template.as_prompt_list(image_embed_size=image_embed_size),
        async_engine.tokenizer.encode,
    )
else:
    prompts = entrypoint_utils.process_prompts(
        conv_template.as_prompt(), async_engine.tokenizer.encode
    )
Member

Just want to mark a future todo item. Let's unify as_prompt_list and as_prompt in the future. Could you help add a TODO in the code here?

async_engine.record_event(request_id, event="invoke generate")
finish_reasons: List[Optional[str]] = [None for _ in range(generation_cfg.n)]
async for delta_outputs in async_engine.generate(prompt, generation_cfg, request_id):
async for delta_outputs in async_engine.generate(prompt, generation_cfg, request_id): # type: ignore # pylint: disable=line-too-long
Member

The original line has 85 characters, which fits within our black and pylint limit (100 characters). I assume there is no need to add the type ignore and pylint disable here. Is there anything wrong with your settings?

mlc-llm/pyproject.toml

Lines 22 to 34 in edffce4

[tool.black]
line-length = 100

[tool.mypy]
ignore_missing_imports = true
show_column_numbers = true
show_error_context = true
follow_imports = "skip"
ignore_errors = false
strict_optional = false

[tool.pylint.messages_control]
max-line-length = 100

Contributor Author

I was getting mypy errors; I changed List -> Sequence to handle them. Removing # type: ignore # pylint: disable=line-too-long now.
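
The underlying reason, illustrated generically rather than with the exact signatures from this PR: List is invariant under mypy while Sequence is covariant, so a parameter typed as Sequence accepts lists of more specific element types without a cast.

from typing import List, Sequence, Union

Prompt = Union[str, List[int]]


def process_list(prompts: List[Prompt]) -> int:
    return len(prompts)


def process_seq(prompts: Sequence[Prompt]) -> int:
    return len(prompts)


token_prompts: List[List[int]] = [[1, 2, 3]]
# process_list(token_prompts)  # rejected by mypy: List is invariant.
process_seq(token_prompts)  # accepted: Sequence is covariant in its element type.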

)
async_engine.record_event(request_id, event="invoke generate")
async for delta_outputs in async_engine.generate(prompt, generation_cfg, request_id):
async for delta_outputs in async_engine.generate(prompt, generation_cfg, request_id): # type: ignore # pylint: disable=line-too-long
Member

Ditto

"max_batch_size": int,
"max_total_seq_len": int,
"prefill_chunk_size": int,
"page_size": int,
Member

#1967 supports another parameter support_sliding_window. You may need to rebase to the latest main and add the parameter here and in the definition of create_paged_kv_cache.
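
For reference, a sketch of the extended field set after rebasing (the existing keys are copied from the snippet above; only support_sliding_window is new, and its exact placement and type are assumptions based on this comment):

# Sketch: metadata fields for KV cache creation, extended with the
# parameter introduced by #1967 (placement and type assumed).
kv_cache_metadata_fields = {
    "max_batch_size": int,
    "max_total_seq_len": int,
    "prefill_chunk_size": int,
    "page_size": int,
    "support_sliding_window": int,
}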

Member

And also rebase to resolve the conflict in conversation_protocol.py.

Contributor Author

I did something wrong when rebasing. Closing this PR and opening a new one. Sorry for that

@MasterJH5574
Member

MasterJH5574 commented Mar 17, 2024

Besides the comments above, could you also add test cases for the Llava support? Test cases are helpful since they not only ensure correctness but also enable others to understand the basic flow of how things work and to reproduce your tests.

Specifically, here are two tests that I think would be good to have.

  1. Could you add tests/python/serve/test_serve_engine_image.py to test Llava through Engine.generate? The test can be adapted from test_engine_generate in test_serve_engine.py:

    def test_engine_generate():
        # Initialize model loading info and KV cache config
        model = ModelInfo(
            "dist/Llama-2-7b-chat-hf-q0f16-MLC",
            model_lib_path="dist/Llama-2-7b-chat-hf-q0f16-MLC/Llama-2-7b-chat-hf-q0f16-MLC-cuda.so",
        )
        kv_cache_config = KVCacheConfig(page_size=16, max_total_sequence_length=4096)
        # Create engine
        engine = Engine(model, kv_cache_config)

        num_requests = 10
        max_tokens = 256

        # Generate output.
        output_texts, _ = engine.generate(
            prompts[:num_requests], GenerationConfig(max_tokens=max_tokens)
        )
        for req_id, outputs in enumerate(output_texts):
            print(f"Prompt {req_id}: {prompts[req_id]}")
            if len(outputs) == 1:
                print(f"Output {req_id}:{outputs[0]}\n")
            else:
                for i, output in enumerate(outputs):
                    print(f"Output {req_id}({i}):{output}\n")

  2. Could you add tests/python/serve/server/test_server_image.py to test Llava through the OpenAI API? You can refer to test_server.py for examples. One test covering the basic functionality should be enough, and we can iterate on the test cases and add more in the future; a sketch is included after this list.
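
For the second test, a minimal sketch adapted from the request example in the PR description (the endpoint constant, fixture name, and response-shape assertions are assumptions following the usual test_server.py conventions):

import requests

OPENAI_V1_CHAT_COMPLETION_URL = "http://127.0.0.1:8000/v1/chat/completions"


def test_openai_v1_chat_completion_image(served_model: str):
    """Basic Llava request through the OpenAI chat completion API (sketch)."""
    payload = {
        "model": served_model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": "https://llava-vl.github.io/static/images/view.jpg",
                    },
                    {"type": "text", "text": "What does this image represent?"},
                ],
            }
        ],
        "max_tokens": 128,
    }
    response = requests.post(OPENAI_V1_CHAT_COMPLETION_URL, json=payload, timeout=60)
    assert response.status_code == 200
    choices = response.json()["choices"]
    assert len(choices) == 1
    assert choices[0]["message"]["role"] == "assistant"
    assert isinstance(choices[0]["message"]["content"], str)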

MasterJH5574 and others added 11 commits March 18, 2024 11:22
This PR supports detecting whether FlashInfer is enabled when building TVM, so that FlashInfer won't be enabled when TVM is not built with FlashInfer support.
* small fix * small fix * Update stablelm_model.py
`test_server::is_json_or_json_prefix` is used to check that the output is JSON or a prefix of JSON. It uses json.loads internally. However, json.loads (i.e. json.decode) is token-based rather than character-based: if half a token is left at the end of the string, it cannot be matched. This PR adds another check for the remaining "half a token" if it exists.
This PR migrates the Mistral model to the PagedKVCache interface, which supports sliding window attention with a paged attention kernel written in TensorIR. We thereby introduce a `support_sliding_window` mode for the KV cache, which leaves space for supporting sliding windows for any model at runtime. This PR tests Mistral with both chat and serve. The chat performance of Mistral 7B improves over before, benefiting from the paged attention implementation.
* [Docs][Upd] Server launch, examples for endpoints for MLC Serve * remove v1/completions * add api docs to rest --------- Co-authored-by: Shrey Gupta <shrey2809@gmail.com>
2. Save model config instead of path in ServerContext 3. Sliding window parameter in create_paged_kv_cache 4. Remove pylint line-too-long