Benchmark environment
Download the mindie_turbo tar file from the Ascend website and place it in a new directory together with a new Dockerfile.
Dockerfile:
```Dockerfile
FROM quay.io/ascend/vllm-ascend:v0.7.3.post1

COPY ./Ascend-mindie-turbo_2.0.RC1_py310_linux_aarch64.tar.gz /tmp

RUN cd /tmp && \
    tar -xzvf /tmp/Ascend-mindie-turbo_2.0.RC1_py310_linux_aarch64.tar.gz && \
    cd /tmp/Ascend-mindie-turbo_2.0.RC1_py310_linux_aarch64 && \
    pip install --no-deps *.whl && \
    pip cache purge
```
or
```Dockerfile
FROM quay.io/ascend/vllm-ascend:v0.7.3.post1

RUN pip install mindie-turbo==2.0rc1 && pip cache purge
```
Then build the new image, run the container, and perform the tests; a sketch of both steps is shown below.
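For reference, the build and run might look like the following. The image tag, model directory, and device/volume mounts are assumptions based on a typical Ascend NPU container setup, so adjust them to your host:

```bash
# Build the image from the directory containing the Dockerfile
# (the tag name is an assumption).
docker build -t vllm-ascend-mindie-turbo:v0.7.3.post1 .

# Run the container with typical Ascend NPU device/volume mounts
# (device list and model path are assumptions; adjust to your host).
docker run -it --rm \
    --device /dev/davinci0 --device /dev/davinci1 \
    --device /dev/davinci2 --device /dev/davinci3 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /path/to/models:/models \
    -p 32561:32561 -p 32563:32563 \
    vllm-ascend-mindie-turbo:v0.7.3.post1 bash
```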
Test step
Case 1 Qwen3-32B TP4
```bash
vllm serve Qwen3-32B \
    --gpu_memory_utilization=0.92 \
    --port 32561 \
    --rope-scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}' \
    --max-model-len 131072 \
    -tp 4
```
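Once the server is up, a quick smoke test against the OpenAI-compatible endpoint can look like this (the prompt is arbitrary; the model name must match the served model):

```bash
# Send a small chat request to the Case 1 server on port 32561.
curl -s http://localhost:32561/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3-32B",
          "messages": [{"role": "user", "content": "Hello, who are you?"}],
          "max_tokens": 64
        }'
```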
Case 2 DeepSeek-R1-0528-Qwen3-8B TP1
```bash
vllm serve DeepSeek-R1-0528-Qwen3-8B \
    --gpu_memory_utilization=0.92 \
    --port 32563 \
    --rope-scaling '{"rope_type":"yarn","factor":2,"original_max_position_embeddings":32768}' \
    --max-model-len 65536
```
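A rough way to reproduce a tokens/s figure (not necessarily how the numbers below were measured) is to time a non-streaming completion and divide the reported `completion_tokens` by the wall-clock time; the prompt, token budget, and port here are example values:

```bash
# Rough decode-throughput estimate: completion_tokens / wall time.
time curl -s http://localhost:32563/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "DeepSeek-R1-0528-Qwen3-8B",
          "prompt": "Explain YaRN rope scaling in one paragraph.",
          "max_tokens": 512,
          "temperature": 0
        }' | python3 -c 'import json,sys; print(json.load(sys.stdin)["usage"])'
```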
Results
For Qwen3-32B with 4 NPUs, inference speed increased from 8 tokens/s to 18 tokens/s.
For DeepSeek-R1-0528-Qwen3-8B with 1 NPU, inference speed increased from 20 tokens/s to 34 tokens/s.
However, for DeepSeek-R1-0528-Qwen3-8B, I am not sure whether the model supports RoPE scaling, because when I started the service I received the following message, even though the model runs normally:

```
Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}
```
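This warning appears to come from the Hugging Face `rope_scaling` config validation and looks benign, since `attn_factor` is flagged as an extra key rather than a missing one. To see which keys come from the checkpoint itself versus the `--rope-scaling` override, the model's config.json can be inspected (the model path is an assumption):

```bash
# Print the rope_scaling section shipped with the checkpoint.
python3 -c "import json; print(json.load(open('/models/DeepSeek-R1-0528-Qwen3-8B/config.json')).get('rope_scaling'))"
```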
| Model | TP | baseline | v0.7.3.post1 + mindie turbo |
|---|---|---|---|
| Qwen3-32B | 4 | 8 tokens/s | 18 tokens/s |
| DeepSeek-R1-0528-Qwen3-8B | 1 | 20 tokens/s | 34 tokens/s |
- baseline: v0.8.5rc1 without any optimizations
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
The output of `python collect_env.py`