I'm not a Data Scientist, so please bear with me.
I have a Google Gemma 3 27B-it LLM running on a Hugging Face Inference Endpoint in AWS, on a machine with an A100 GPU. The endpoint is configured for a Text Generation task on a vLLM container; I didn't change any other settings.
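For context, I call the endpoint roughly like this (simplified sketch; the URL and token are placeholders, and it assumes the vLLM container exposes its OpenAI-compatible /v1/chat/completions route):

```python
import requests

# Placeholder values -- the real endpoint URL and token are configured elsewhere.
ENDPOINT_URL = "https://<my-endpoint>.endpoints.huggingface.cloud/v1/chat/completions"
HF_TOKEN = "hf_..."

def ask_model(prompt: str, max_tokens: int = 256) -> str:
    """Send a single chat completion request to the vLLM-backed endpoint."""
    response = requests.post(
        ENDPOINT_URL,
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "model": "google/gemma-3-27b-it",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```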
I was sending it a large prompt of around 11K tokens, and it took approximately 6 seconds to reply. The prompt was that big because it contained a large RAG context. So I decided to re-rank the retrieved context, keep only the top 3 chunks, and build the prompt from those, which reduced the final prompt to around 1K tokens. I figured this should greatly decrease the inference time, but the inference time stayed largely the same for the 11K-token prompt and the 1K-token prompt.
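The re-ranking step itself looks roughly like this (simplified sketch; the cross-encoder model name is just an example, not necessarily the one I use in production):

```python
from sentence_transformers import CrossEncoder

# Example reranker -- any cross-encoder trained for passage ranking would work here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_context(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Score every retrieved chunk against the question and keep the top_k."""
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

# The final prompt is then built from only the top 3 chunks,
# which is what brought it down from ~11K to ~1K tokens.
```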
I extracted the logs from the machine, and they surprised me even more:
- For the larger prompt, the log is:
Engine 000: Avg prompt throughput: 1114.5 tokens/s, Avg generation throughput: 16.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
- For the smaller prompt, the log is:
Engine 000: Avg prompt throughput: 109.3 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 49.4%
I ran the tests multiple times and in different orders; the throughput stays the same, but the cache behavior varies.
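This is how I measure the latency on my side: plain wall-clock timing around each request, reusing the `ask_model` helper from the sketch above (so the numbers include network overhead, but that should be the same for both prompts):

```python
import statistics
import time

def time_prompt(prompt: str, runs: int = 5) -> None:
    """Measure end-to-end latency for the same prompt over several runs."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        ask_model(prompt)  # helper from the endpoint sketch above
        latencies.append(time.perf_counter() - start)
    print(f"mean: {statistics.mean(latencies):.2f}s, "
          f"min: {min(latencies):.2f}s, max: {max(latencies):.2f}s")

# time_prompt(large_prompt)  # the ~11K-token prompt
# time_prompt(small_prompt)  # the ~1K-token prompt
```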
The question is: why is the prompt throughput for the smaller prompt so low, and how can I make it higher?