
[Doc][0.7.3] Add performance tuning docs #878

Merged
wangxiyuan merged 11 commits into vllm-project:v0.7.3-dev from shen-shanshan:doc
May 20, 2025

Conversation

@shen-shanshan
Collaborator

@shen-shanshan shen-shanshan commented May 15, 2025

What this PR does / why we need it?

Add performance tuning docs.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: 申杉杉 <467638484@qq.com>
@shen-shanshan shen-shanshan marked this pull request as draft May 15, 2025 15:09
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 15, 2025
Signed-off-by: shen-shanshan <467638484@qq.com>
@shen-shanshan shen-shanshan marked this pull request as ready for review May 16, 2025 02:44
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
@Yikun
Member

Yikun commented May 16, 2025

TLDR

https://docs.google.com/spreadsheets/d/1Z6KIp54n2NUhubMPImQrVtKnV5ZaUyu0FwP9NYtXdrA/edit?usp=sharing

Baseline

Prepare

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3
docker run --rm \
    --name yikun-test \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy

# New terminal
export MODEL=Qwen/Qwen2.5-7B-Instruct
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install pandas datasets
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
    --random-input-len 200 --num-prompts 200 --request-rate 1 \
    --save-result --result-dir ./

Results: vLLM Ascend v0.7.3

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  188.91
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.06
Output token throughput (tok/s):         135.52
Total Token throughput (tok/s):          347.26
---------------Time to First Token----------------
Mean TTFT (ms):                          86.58
Median TTFT (ms):                        85.87
P99 TTFT (ms):                           107.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.57
Median TPOT (ms):                        32.63
P99 TPOT (ms):                           36.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.57
Median ITL (ms):                         31.54
P99 ITL (ms):                            65.02
==================================================

Results: vLLM Ascend v0.7.3 + MindIE Turbo

pip install mindie-turbo

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  187.55
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.49
Total Token throughput (tok/s):          349.77
---------------Time to First Token----------------
Mean TTFT (ms):                          64.45
Median TTFT (ms):                        63.93
P99 TTFT (ms):                           84.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.94
Median TPOT (ms):                        21.78
P99 TPOT (ms):                           24.06
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.94
Median ITL (ms):                         21.19
P99 ITL (ms):                            47.65
==================================================

Optimized

Prepare

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the CANN base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
docker run --rm \
    --name yikun-test \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

# Configure mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install OS packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y

# Install Python: download prebuilt libs and packages
mkdir -p /workspace/tmp
cd /workspace/tmp
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch-2.5.1-cp311-cp311-linux_aarch64.whl
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch_npu-2.5.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.* -C /usr/local/
mv /usr/local/py311_bisheng/ /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf /usr/local/python/bin/python3 /usr/bin/python
ln -sf /usr/local/python/bin/python3 /usr/bin/python3
ln -sf /usr/local/python/bin/python3.11 /usr/bin/python3.11
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip3
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip
export PATH=/usr/bin:/usr/local/python/bin:$PATH

cd /workspace/
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install vllm==0.7.3
python3 -c "import vllm; print(vllm.__path__)"  # ['/usr/local/python/lib/python3.11/site-packages/vllm']
pip install vllm-ascend==0.7.3
python3 -c "import vllm_ascend; print(vllm_ascend.__path__)"
python3 -m pip install "modelscope<1.23.0"

export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy

# New terminal
export MODEL=Qwen/Qwen2.5-7B-Instruct
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install pandas datasets
git clone https://gitee.com/mirrors/vllm.git -b v0.7.3 /vllm-workspace/vllm
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
    --random-input-len 200 --num-prompts 200 --request-rate 1 \
    --save-result --result-dir ./

Results: vLLM Ascend v0.7.3 + Optimized Python

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  188.38
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.06
Output token throughput (tok/s):         135.90
Total Token throughput (tok/s):          348.23
---------------Time to First Token----------------
Mean TTFT (ms):                          76.76
Median TTFT (ms):                        75.93
P99 TTFT (ms):                           96.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.73
Median TPOT (ms):                        28.78
P99 TPOT (ms):                           31.20
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.73
Median ITL (ms):                         27.82
P99 ITL (ms):                            59.61
==================================================

Prepare torch_npu and torch

cd /workspace/tmp
pip install ./torch_npu-*.whl --force-reinstall --no-deps
pip install ./torch-2.5.1*.whl --force-reinstall --no-deps

# Ctrl+C and restart vllm
export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy

# New terminal
export MODEL=Qwen/Qwen2.5-7B-Instruct
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install pandas datasets
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
    --random-input-len 200 --num-prompts 200 --request-rate 1 \
    --save-result --result-dir ./

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  187.71
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.38
Total Token throughput (tok/s):          349.47
---------------Time to First Token----------------
Mean TTFT (ms):                          69.44
Median TTFT (ms):                        67.69
P99 TTFT (ms):                           128.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.60
Median TPOT (ms):                        22.67
P99 TPOT (ms):                           24.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.60
Median ITL (ms):                         21.78
P99 ITL (ms):                            52.85
==================================================

Prepare MindIE Turbo

# Not sure why the CANN image env is not working; set it manually
pip install mindie-turbo
export PATH=/usr/bin:/usr/local/python/bin:$PATH
export ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
export LD_LIBRARY_PATH=${ATB_HOME_PATH}/lib:${ATB_HOME_PATH}/examples:${ATB_HOME_PATH}/tests/atbopstest:${LD_LIBRARY_PATH}
export PATH=${ATB_HOME_PATH}/bin:$PATH
export ASDOPS_HOME_PATH=${ATB_HOME_PATH}

# Ctrl+C and restart vllm
export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy

# New terminal
export MODEL=Qwen/Qwen2.5-7B-Instruct
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
    --random-input-len 200 --num-prompts 200 --request-rate 1 \
    --save-result --result-dir ./

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.92
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.96
Total Token throughput (tok/s):          350.96
---------------Time to First Token----------------
Mean TTFT (ms):                          53.90
Median TTFT (ms):                        53.45
P99 TTFT (ms):                           78.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.24
Median TPOT (ms):                        17.33
P99 TPOT (ms):                           18.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.24
Median ITL (ms):                         16.80
P99 ITL (ms):                            42.16
==================================================

Prepare TCMalloc

apt update
apt install libgoogle-perftools4 libgoogle-perftools-dev
find /usr -name "libtcmalloc.so*"
export LD_PRELOAD="$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
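Before restarting the server it can be worth checking that the export actually landed in the environment. A minimal sketch; the `tcmalloc_preloaded` helper and the library path are illustrative, not part of the guide above:

```python
import os

def tcmalloc_preloaded(env=None):
    """Return True if any LD_PRELOAD entry points at libtcmalloc."""
    env = os.environ if env is None else env
    return any("libtcmalloc" in p for p in env.get("LD_PRELOAD", "").split(":"))

# With the export above in effect, the check passes
print(tcmalloc_preloaded({"LD_PRELOAD": ":/usr/lib/aarch64-linux-gnu/libtcmalloc.so"}))  # prints: True
```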

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.91
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.96
Total Token throughput (tok/s):          350.97
---------------Time to First Token----------------
Mean TTFT (ms):                          52.72
Median TTFT (ms):                        52.29
P99 TTFT (ms):                           78.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.16
Median TPOT (ms):                        17.25
P99 TPOT (ms):                           18.73
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.16
Median ITL (ms):                         16.74
P99 ITL (ms):                            42.03
==================================================

Prepare PYTORCH_NPU_ALLOC_CONF

export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250" 

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.91
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.96
Total Token throughput (tok/s):          350.97
---------------Time to First Token----------------
Mean TTFT (ms):                          54.13
Median TTFT (ms):                        52.67
P99 TTFT (ms):                           150.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.21
Median TPOT (ms):                        17.27
P99 TPOT (ms):                           19.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.21
Median ITL (ms):                         16.80
P99 ITL (ms):                            42.09
==================================================

Prepare PYTORCH_NPU_ALLOC_CONF

export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" 

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.92
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.96
Total Token throughput (tok/s):          350.95
---------------Time to First Token----------------
Mean TTFT (ms):                          53.67
Median TTFT (ms):                        52.36
P99 TTFT (ms):                           115.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.17
Median TPOT (ms):                        17.27
P99 TPOT (ms):                           18.54
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.17
Median ITL (ms):                         16.73
P99 ITL (ms):                            41.71
==================================================

Prepare TASK_QUEUE_ENABLE

unset PYTORCH_NPU_ALLOC_CONF
export TASK_QUEUE_ENABLE=2

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + TASK_QUEUE_ENABLE=2

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.88
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.98
Total Token throughput (tok/s):          351.02
---------------Time to First Token----------------
Mean TTFT (ms):                          49.95
Median TTFT (ms):                        47.00
P99 TTFT (ms):                           151.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.96
Median TPOT (ms):                        17.03
P99 TPOT (ms):                           19.19
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.96
Median ITL (ms):                         16.55
P99 ITL (ms):                            41.68
==================================================

Prepare CPU_AFFINITY_CONF

unset TASK_QUEUE_ENABLE
export CPU_AFFINITY_CONF=1

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + CPU_AFFINITY_CONF=1

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.89
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.98
Total Token throughput (tok/s):          351.01
---------------Time to First Token----------------
Mean TTFT (ms):                          50.09
Median TTFT (ms):                        48.15
P99 TTFT (ms):                           106.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.06
Median TPOT (ms):                        17.13
P99 TPOT (ms):                           18.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.06
Median ITL (ms):                         16.64
P99 ITL (ms):                            42.13
==================================================

Prepare Ascend Scheduler

unset CPU_AFFINITY_CONF

# Ctrl+C and restart vllm
export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy \
    --additional-config '{"ascend_scheduler_config":{}}'
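Since `--additional-config` takes a JSON string, a malformed value will fail at server startup. A quick sanity check with the standard-library `json` module before launching (illustrative, not part of the guide):

```python
import json

cfg = '{"ascend_scheduler_config":{}}'
parsed = json.loads(cfg)  # raises json.JSONDecodeError if the string is malformed
print(parsed)             # prints: {'ascend_scheduler_config': {}}
```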

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + TASK_QUEUE_ENABLE=2 + Ascend Scheduler

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.90
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.97
Total Token throughput (tok/s):          350.99
---------------Time to First Token----------------
Mean TTFT (ms):                          48.35
Median TTFT (ms):                        47.85
P99 TTFT (ms):                           71.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.99
Median TPOT (ms):                        17.10
P99 TPOT (ms):                           18.18
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.99
Median ITL (ms):                         16.62
P99 ITL (ms):                            41.84
==================================================
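Reading the mean TPOT figures across the results in this thread, the cumulative effect of the tuning steps can be summarized with a back-of-the-envelope calculation. Only numbers quoted above are used; the variable names are mine:

```python
# Mean TPOT (ms) figures quoted in the benchmark results above
baseline_tpot = 32.57  # vLLM Ascend v0.7.3 baseline
tuned_tpot = 16.99     # + optimized Python/Torch/Torch NPU + MindIE Turbo
                       #   + TCMalloc + TASK_QUEUE_ENABLE=2 + Ascend Scheduler

reduction_pct = (1 - tuned_tpot / baseline_tpot) * 100
print(f"Mean TPOT reduced by {reduction_pct:.1f}%")  # prints: Mean TPOT reduced by 47.8%
```

In other words, per-token latency roughly halves between the baseline and the fully tuned setup, while request throughput stays essentially flat at about 1.07 req/s, as expected for a fixed request rate of 1.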

@wangxiyuan wangxiyuan left a comment


The whole doc is missing the content about why each step is required and how it works. Please add more content. Thanks.


## Optimizations

### 1. Compiler Optimization
Collaborator


This step is not correct? AFAIK, users should install the package first, then run models, then recompile the package again. The guide below just installs the compiled package; is that enough?

Collaborator Author

@shen-shanshan shen-shanshan May 18, 2025


The compilation is too complex and too time-consuming for users, so after discussing with @Yikun, we finally decided to offer the twice-compiled packages to users directly.

@wangxiyuan
Collaborator

@Yikun The benchmark result should be contained in the doc as well, right?

@Yikun
Member

Yikun commented May 17, 2025

@wangxiyuan Yes, results with a significant speedup should be included in the doc; for the others, I believe some notes are enough.

As for the specific data, it differs between hardware series, so maybe we should hide it or just link to this issue? The percentages can be recorded in the doc as a reference, I think.

@shen-shanshan shen-shanshan marked this pull request as draft May 18, 2025 08:31
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
@shen-shanshan shen-shanshan marked this pull request as ready for review May 19, 2025 02:08
@wangxiyuan
Collaborator

I'm fine with this change. Let's merge first. Feel free to update the content if needed later.

@wangxiyuan wangxiyuan merged commit 0d68d8d into vllm-project:v0.7.3-dev May 20, 2025
4 checks passed
@wangxiyuan wangxiyuan changed the title [Doc] Add performance tuning docs [Doc][0.7.3] Add performance tuning docs May 21, 2025
