
I'm not a data scientist, so please bear with me.

I have Google's Gemma 3 27B-it LLM running on a Hugging Face Inference Endpoint in AWS, on a machine with an A100 GPU. The endpoint is configured to run a Text Generation task on a vLLM container; I didn't change any other settings.
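For reference, this is roughly how I call the endpoint and measure the reply time (a minimal sketch, not my exact code; the endpoint URL and token are placeholders, and I'm assuming the standard `huggingface_hub` client here):

```python
import time
from huggingface_hub import InferenceClient

# Placeholders: the real Inference Endpoint URL and HF token go here.
client = InferenceClient(
    "https://my-endpoint.us-east-1.aws.endpoints.huggingface.cloud",
    token="hf_...",
)

prompt = "..."  # the full prompt, including the RAG context

start = time.perf_counter()
reply = client.text_generation(prompt, max_new_tokens=256)
print(f"latency: {time.perf_counter() - start:.1f} s, reply: {len(reply)} chars")
```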

I was sending it a large prompt of around 11K tokens, and it was taking approximately 6 seconds to reply. The prompt was so big because it contained a large RAG context. So I decided to re-rank the context, extract the top 3 passages, and use those to fetch the information needed for the prompt. This reduced my final prompt to around 1K tokens. I figured this should greatly decrease the inference time; however, the inference time remained largely the same for the 11K-token and the 1K-token prompts.
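The re-ranking step looks roughly like this (again a sketch; I'm using a `sentence-transformers` cross-encoder as an example, the actual re-ranker model is interchangeable):

```python
from sentence_transformers import CrossEncoder

# Placeholder inputs: the user query and the passages retrieved for the RAG context.
query = "..."
passages = ["passage 1 ...", "passage 2 ...", "passage 3 ...", "passage 4 ..."]

# Score every (query, passage) pair and keep only the 3 best-scoring passages,
# which shrinks the final prompt from ~11K tokens down to ~1K tokens.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, p) for p in passages])
top3 = [p for p, _ in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)[:3]]
```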

I extracted the logs from the machine, and they surprised me even more:

  1. For the larger prompt, the log is: Engine 000: Avg prompt throughput: 1114.5 tokens/s, Avg generation throughput: 16.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
  2. For the smaller prompt, the log is: Engine 000: Avg prompt throughput: 109.3 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 49.4%

I ran the tests multiple times and in different orders; the throughput remains the same, but the cache behaves differently.

The question is: why is the throughput on the smaller prompt so low, and how can I make it higher?


1 Answer


It appears that the total processing time for both prompts is nearly the same, which makes the smaller prompt seem to have worse per-token throughput. I suspect this is because the A100 is a very capable GPU: it processes all of a prompt's tokens in parallel during prefill, so even the 11K-token prompt is handled quickly and you don't gain much by making it shorter.
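A back-of-the-envelope check supports this. I'm assuming vLLM averaged those stats over a fixed reporting window of roughly 10 s in both runs (which the numbers themselves suggest); "Avg prompt throughput" is then essentially your prompt size divided by that window, so it drops by ~10x simply because the prompt got ~10x smaller, not because the GPU got slower:

$$1114.5\ \text{tokens/s} \times 10\ \text{s} \approx 11\text{K tokens}, \qquad 109.3\ \text{tokens/s} \times 10\ \text{s} \approx 1\text{K tokens}$$

Meanwhile the "Avg generation throughput" (16–18 tokens/s), which reflects decoding the output tokens one at a time, is nearly identical in both runs, which is consistent with the total latency staying at around 6 seconds.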

Think of it like a restaurant kitchen designed to serve 100 guests at once. If only 10 people show up, the kitchen doesn’t suddenly cook ten times faster — it just ends up being underutilized.

  • Thanks for your reply. I understand that the GPU is underutilized. The question is: can I increase the throughput in any way? Because right now it seems like there is a lower bound on processing time for Gemma 3 on an A100 GPU, and it's quite high. – Commented Jun 21 at 12:36
  • Based on the information I have, it's difficult to say, but my instinct is that improving throughput with the same hardware and model will be challenging. The only ways I can think of would require tweaking how the model decodes, and I'm not sure that is feasible in your use case. – Commented Jun 24 at 7:24
