Benchmark environment
Download the mindie_turbo tar file from the Ascend website and place it in a new directory together with a new Dockerfile.
Dockerfile:
```Dockerfile
FROM quay.io/ascend/vllm-ascend:v0.7.3.post1

COPY ./Ascend-mindie-turbo_2.0.RC1_py310_linux_aarch64.tar.gz /tmp

RUN cd /tmp && \
    tar -xzvf /tmp/Ascend-mindie-turbo_2.0.RC1_py310_linux_aarch64.tar.gz && \
    cd /tmp/Ascend-mindie-turbo_2.0.RC1_py310_linux_aarch64 && \
    pip install --no-deps *.whl && \
    pip cache purge
```
or
```Dockerfile
FROM quay.io/ascend/vllm-ascend:v0.7.3.post1

RUN pip install mindie-turbo==2.0rc1 && pip cache purge
```
Then build the new image, run the container, and perform the tests; a sketch of both steps is shown below.
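For reference, the build and run might look like the following. The image tag, model directory, and device/volume mounts are assumptions based on a typical Ascend NPU container setup, so adjust them to your host:

```bash
# Build the image from the directory containing the Dockerfile
# (the tag name is an assumption).
docker build -t vllm-ascend-mindie-turbo:v0.7.3.post1 .

# Run the container with typical Ascend NPU device/volume mounts
# (device list and model path are assumptions; adjust to your host).
docker run -it --rm \
    --device /dev/davinci0 --device /dev/davinci1 \
    --device /dev/davinci2 --device /dev/davinci3 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /path/to/models:/models \
    -p 32561:32561 -p 32563:32563 \
    vllm-ascend-mindie-turbo:v0.7.3.post1 bash
```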
Test step
Case 1 Qwen3-32B TP4
```bash
vllm serve Qwen3-32B \
    --gpu_memory_utilization=0.92 \
    --port 32561 \
    --rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \
    --max-model-len 131072 \
    -tp 4
```
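Once the server is up, a quick smoke test against the OpenAI-compatible endpoint can look like this (the prompt is arbitrary; the model name must match the served model):

```bash
# Send a small chat request to the Case 1 server on port 32561.
curl -s http://localhost:32561/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3-32B",
          "messages": [{"role": "user", "content": "Hello, who are you?"}],
          "max_tokens": 64
        }'
```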
Case 2 DeepSeek-R1-0528-Qwen3-8B TP1
```bash
vllm serve DeepSeek-R1-0528-Qwen3-8B \
    --gpu_memory_utilization=0.92 \
    --port 32563 \
    --rope-scaling '{"rope_type":"yarn","factor":2,"original_max_position_embeddings":32768}' \
    --max-model-len 65536
```
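A rough way to reproduce a tokens/s figure (not necessarily how the numbers below were measured) is to time a non-streaming completion and divide the reported `completion_tokens` by the wall-clock time; the prompt, token budget, and port here are example values:

```bash
# Rough decode-throughput estimate: completion_tokens / wall time.
time curl -s http://localhost:32563/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "DeepSeek-R1-0528-Qwen3-8B",
          "prompt": "Explain YaRN rope scaling in one paragraph.",
          "max_tokens": 512,
          "temperature": 0
        }' | python3 -c 'import json,sys; print(json.load(sys.stdin)["usage"])'
```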
Results
For Qwen3-32B with 4 NPUs, inference speed increased from 8 tokens/s to 18 tokens/s.
For DeepSeek-R1-0528-Qwen3-8B with 1 NPU, inference speed increased from 20 tokens/s to 34 tokens/s.
However, for DeepSeek-R1-0528-Qwen3-8B, I am not sure whether the model supports RoPE scaling, because when I started the service I received the following message, even though the model runs normally:

```
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}
```
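This warning appears to come from the Hugging Face `rope_scaling` config validation and looks benign, since `attn_factor` is flagged as an extra key rather than a missing one. To see which keys come from the checkpoint itself versus the `--rope-scaling` override, the model's config.json can be inspected (the model path is an assumption):

```bash
# Print the rope_scaling section shipped with the checkpoint.
python3 -c "import json; print(json.load(open('/models/DeepSeek-R1-0528-Qwen3-8B/config.json')).get('rope_scaling'))"
```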
| Model | TP | baseline | v0.7.3.post1 + mindie turbo |
|---|---|---|---|
| Qwen3-32B | 4 | 8 tokens/s | 18 tokens/s |
| DeepSeek-R1-0528-Qwen3-8B | 1 | 20 tokens/s | 34 tokens/s |
- baseline: v0.8.5rc1 without any optimizations
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`