
[Doc][0.7.3] Add performance tuning docs #878

Merged
wangxiyuan merged 11 commits into vllm-project:v0.7.3-dev from shen-shanshan:doc
May 20, 2025

Conversation

@shen-shanshan
Collaborator

@shen-shanshan shen-shanshan commented May 15, 2025

What this PR does / why we need it?

Add performance tuning docs.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: 申杉杉 <467638484@qq.com>
@shen-shanshan shen-shanshan marked this pull request as draft May 15, 2025 15:09
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 15, 2025
Signed-off-by: shen-shanshan <467638484@qq.com>
@shen-shanshan shen-shanshan marked this pull request as ready for review May 16, 2025 02:44
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
@Yikun
Member

Yikun commented May 16, 2025

TLDR

https://docs.google.com/spreadsheets/d/1Z6KIp54n2NUhubMPImQrVtKnV5ZaUyu0FwP9NYtXdrA/edit?usp=sharing

Baseline

Prepare

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3
docker run --rm \
    --name yikun-test \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy

# New terminal
export MODEL=Qwen/Qwen2.5-7B-Instruct
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install pandas datasets
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
    --random-input-len 200 --num-prompts 200 --request-rate 1 \
    --save-result --result-dir ./

Results: vLLM Ascend v0.7.3

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  188.91
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.06
Output token throughput (tok/s):         135.52
Total Token throughput (tok/s):          347.26
---------------Time to First Token----------------
Mean TTFT (ms):                          86.58
Median TTFT (ms):                        85.87
P99 TTFT (ms):                           107.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.57
Median TPOT (ms):                        32.63
P99 TPOT (ms):                           36.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.57
Median ITL (ms):                         31.54
P99 ITL (ms):                            65.02
==================================================

Results: vLLM Ascend v0.7.3 + MindIE Turbo

pip install mindie-turbo

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  187.55
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.49
Total Token throughput (tok/s):          349.77
---------------Time to First Token----------------
Mean TTFT (ms):                          64.45
Median TTFT (ms):                        63.93
P99 TTFT (ms):                           84.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.94
Median TPOT (ms):                        21.78
P99 TPOT (ms):                           24.06
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.94
Median ITL (ms):                         21.19
P99 ITL (ms):                            47.65
==================================================

Optimized

Prepare

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the CANN base image
export IMAGE=m.daocloud.io/quay.io/ascend/cann:8.1.rc1-910b-ubuntu22.04-py3.10
docker run --rm \
    --name yikun-test \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash

# Configure mirror
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" > /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-updates main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-backports main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list && \
echo "deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu-ports/ jammy-security main restricted universe multiverse" >> /etc/apt/sources.list

# Install OS packages
apt update && apt install wget gcc g++ libnuma-dev git vim -y

# Install Python: download prebuilt libs and packages
mkdir -p /workspace/tmp
cd /workspace/tmp
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libcrypto.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libomp.so
wget https://repo.oepkgs.net/ascend/pytorch/vllm/lib/libssl.so.1.1
wget https://repo.oepkgs.net/ascend/pytorch/vllm/python/py311_bisheng.tar.gz
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch-2.5.1-cp311-cp311-linux_aarch64.whl
wget https://repo.oepkgs.net/ascend/pytorch/vllm/torch/torch_npu-2.5.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
cp ./*.so* /usr/local/lib
tar -zxvf ./py311_bisheng.* -C /usr/local/
mv /usr/local/py311_bisheng/ /usr/local/python
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3
sed -i "1c#\!/usr/local/python/bin/python3.11" /usr/local/python/bin/pip3.11
ln -sf /usr/local/python/bin/python3 /usr/bin/python
ln -sf /usr/local/python/bin/python3 /usr/bin/python3
ln -sf /usr/local/python/bin/python3.11 /usr/bin/python3.11
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip3
ln -sf /usr/local/python/bin/pip3 /usr/bin/pip
export PATH=/usr/bin:/usr/local/python/bin:$PATH

cd /workspace/
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install vllm==0.7.3
python3 -c "import vllm; print(vllm.__path__)"  # ['/usr/local/python/lib/python3.11/site-packages/vllm']
pip install vllm-ascend==0.7.3
python3 -c "import vllm_ascend; print(vllm_ascend.__path__)"
python3 -m pip install "modelscope<1.23.0"

export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy

# New terminal
export MODEL=Qwen/Qwen2.5-7B-Instruct
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install pandas datasets
git clone https://gitee.com/mirrors/vllm.git -b v0.7.3 /vllm-workspace/vllm
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
    --random-input-len 200 --num-prompts 200 --request-rate 1 \
    --save-result --result-dir ./

Results: vLLM Ascend v0.7.3 + Optimized Python

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  188.38
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.06
Output token throughput (tok/s):         135.90
Total Token throughput (tok/s):          348.23
---------------Time to First Token----------------
Mean TTFT (ms):                          76.76
Median TTFT (ms):                        75.93
P99 TTFT (ms):                           96.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.73
Median TPOT (ms):                        28.78
P99 TPOT (ms):                           31.20
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.73
Median ITL (ms):                         27.82
P99 ITL (ms):                            59.61
==================================================

Prepare torch_npu and torch

cd /workspace/tmp
pip install ./torch_npu-*.whl --force-reinstall --no-deps
pip install ./torch-2.5.1*.whl --force-reinstall --no-deps

# Ctrl+C and restart vllm
export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy

# New terminal
export MODEL=Qwen/Qwen2.5-7B-Instruct
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install pandas datasets
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
    --random-input-len 200 --num-prompts 200 --request-rate 1 \
    --save-result --result-dir ./

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  187.71
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.38
Total Token throughput (tok/s):          349.47
---------------Time to First Token----------------
Mean TTFT (ms):                          69.44
Median TTFT (ms):                        67.69
P99 TTFT (ms):                           128.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.60
Median TPOT (ms):                        22.67
P99 TPOT (ms):                           24.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.60
Median ITL (ms):                         21.78
P99 ITL (ms):                            52.85
==================================================

Prepare MindIE Turbo

# Not sure why the CANN image env is not working; set it manually
pip install mindie-turbo
export PATH=/usr/bin:/usr/local/python/bin:$PATH
export ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
export LD_LIBRARY_PATH=${ATB_HOME_PATH}/lib:${ATB_HOME_PATH}/examples:${ATB_HOME_PATH}/tests/atbopstest:${LD_LIBRARY_PATH}
export PATH=${ATB_HOME_PATH}/bin:$PATH
export ASDOPS_HOME_PATH=${ATB_HOME_PATH}

# Ctrl+C and restart vllm
export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy

# New terminal
export MODEL=Qwen/Qwen2.5-7B-Instruct
python3 /vllm-workspace/vllm/benchmarks/benchmark_serving.py --model $MODEL --dataset-name random \
    --random-input-len 200 --num-prompts 200 --request-rate 1 \
    --save-result --result-dir ./

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.92
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.96
Total Token throughput (tok/s):          350.96
---------------Time to First Token----------------
Mean TTFT (ms):                          53.90
Median TTFT (ms):                        53.45
P99 TTFT (ms):                           78.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.24
Median TPOT (ms):                        17.33
P99 TPOT (ms):                           18.41
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.24
Median ITL (ms):                         16.80
P99 ITL (ms):                            42.16
==================================================

Prepare TCMalloc

apt update
apt install libgoogle-perftools4 libgoogle-perftools-dev
find /usr -name "libtcmalloc.so*"
export LD_PRELOAD="$LD_PRELOAD:/usr/lib/aarch64-linux-gnu/libtcmalloc.so"
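Before restarting the server it can be worth checking that the export actually landed in the environment. A minimal sketch; the `tcmalloc_preloaded` helper and the library path are illustrative, not part of the guide above:

```python
import os

def tcmalloc_preloaded(env=None):
    """Return True if any LD_PRELOAD entry points at libtcmalloc."""
    env = os.environ if env is None else env
    return any("libtcmalloc" in p for p in env.get("LD_PRELOAD", "").split(":"))

# With the export above in effect, the check passes
print(tcmalloc_preloaded({"LD_PRELOAD": ":/usr/lib/aarch64-linux-gnu/libtcmalloc.so"}))  # prints: True
```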

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.91
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.96
Total Token throughput (tok/s):          350.97
---------------Time to First Token----------------
Mean TTFT (ms):                          52.72
Median TTFT (ms):                        52.29
P99 TTFT (ms):                           78.39
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.16
Median TPOT (ms):                        17.25
P99 TPOT (ms):                           18.73
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.16
Median ITL (ms):                         16.74
P99 ITL (ms):                            42.03
==================================================

Prepare PYTORCH_NPU_ALLOC_CONF

export PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250" 

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + PYTORCH_NPU_ALLOC_CONF="max_split_size_mb:250"

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.91
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.96
Total Token throughput (tok/s):          350.97
---------------Time to First Token----------------
Mean TTFT (ms):                          54.13
Median TTFT (ms):                        52.67
P99 TTFT (ms):                           150.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.21
Median TPOT (ms):                        17.27
P99 TPOT (ms):                           19.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.21
Median ITL (ms):                         16.80
P99 ITL (ms):                            42.09
==================================================

Prepare PYTORCH_NPU_ALLOC_CONF

export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True" 

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.92
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.96
Total Token throughput (tok/s):          350.95
---------------Time to First Token----------------
Mean TTFT (ms):                          53.67
Median TTFT (ms):                        52.36
P99 TTFT (ms):                           115.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.17
Median TPOT (ms):                        17.27
P99 TPOT (ms):                           18.54
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.17
Median ITL (ms):                         16.73
P99 ITL (ms):                            41.71
==================================================

Prepare TASK_QUEUE_ENABLE

unset PYTORCH_NPU_ALLOC_CONF
export TASK_QUEUE_ENABLE=2

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + TASK_QUEUE_ENABLE=2

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.88
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.98
Total Token throughput (tok/s):          351.02
---------------Time to First Token----------------
Mean TTFT (ms):                          49.95
Median TTFT (ms):                        47.00
P99 TTFT (ms):                           151.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.96
Median TPOT (ms):                        17.03
P99 TPOT (ms):                           19.19
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.96
Median ITL (ms):                         16.55
P99 ITL (ms):                            41.68
==================================================

Prepare CPU_AFFINITY_CONF

unset TASK_QUEUE_ENABLE
export CPU_AFFINITY_CONF=1

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + CPU_AFFINITY_CONF=1

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.89
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.98
Total Token throughput (tok/s):          351.01
---------------Time to First Token----------------
Mean TTFT (ms):                          50.09
Median TTFT (ms):                        48.15
P99 TTFT (ms):                           106.44
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.06
Median TPOT (ms):                        17.13
P99 TPOT (ms):                           18.23
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.06
Median ITL (ms):                         16.64
P99 ITL (ms):                            42.13
==================================================

Prepare Ascend Scheduler

unset CPU_AFFINITY_CONF

# Ctrl+C and restart vllm
export MODEL=Qwen/Qwen2.5-7B-Instruct
VLLM_USE_MODELSCOPE=true python3 -m vllm.entrypoints.openai.api_server --model $MODEL \
    --tensor-parallel-size 1 --swap-space 16 --disable-log-stats \
    --disable-log-requests --load-format dummy \
    --additional-config '{"ascend_scheduler_config":{}}'
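Since `--additional-config` takes a JSON string, a malformed value will fail at server startup. A quick sanity check with the standard-library `json` module before launching (illustrative, not part of the guide):

```python
import json

cfg = '{"ascend_scheduler_config":{}}'
parsed = json.loads(cfg)  # raises json.JSONDecodeError if the string is malformed
print(parsed)             # prints: {'ascend_scheduler_config': {}}
```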

Results: vLLM Ascend v0.7.3 + Optimized Python / Torch / Torch NPU + MindIE Turbo + TCMalloc + TASK_QUEUE_ENABLE=2 + Ascend Scheduler

============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  186.90
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              1.07
Output token throughput (tok/s):         136.97
Total Token throughput (tok/s):          350.99
---------------Time to First Token----------------
Mean TTFT (ms):                          48.35
Median TTFT (ms):                        47.85
P99 TTFT (ms):                           71.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          16.99
Median TPOT (ms):                        17.10
P99 TPOT (ms):                           18.18
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.99
Median ITL (ms):                         16.62
P99 ITL (ms):                            41.84
==================================================
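Reading the mean TPOT figures across the results in this thread, the cumulative effect of the tuning steps can be summarized with a back-of-the-envelope calculation. Only numbers quoted above are used; the variable names are mine:

```python
# Mean TPOT (ms) figures quoted in the benchmark results above
baseline_tpot = 32.57  # vLLM Ascend v0.7.3 baseline
tuned_tpot = 16.99     # + optimized Python/Torch/Torch NPU + MindIE Turbo
                       #   + TCMalloc + TASK_QUEUE_ENABLE=2 + Ascend Scheduler

reduction_pct = (1 - tuned_tpot / baseline_tpot) * 100
print(f"Mean TPOT reduced by {reduction_pct:.1f}%")  # prints: Mean TPOT reduced by 47.8%
```

In other words, per-token latency roughly halves between the baseline and the fully tuned setup, while request throughput stays essentially flat at about 1.07 req/s, as expected for a fixed request rate of 1.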

@wangxiyuan wangxiyuan left a comment


The whole doc is missing the content about why each step is required and how it works. Please add more content. Thanks.


## Optimizations

### 1. Compiler Optimization
Collaborator


This step is not correct? AFAIK, users should install the package first, then run models, then recompile the package again. The guide below just installs the compiled package; is that enough?

Collaborator Author

@shen-shanshan shen-shanshan May 18, 2025


The compilation is too complex and too time-consuming for users, so after discussing with @Yikun, we finally decided to offer the twice-compiled packages to users directly.

@wangxiyuan
Collaborator

@Yikun The benchmark result should be contained in the doc as well, right?

@Yikun
Member

Yikun commented May 17, 2025

@wangxiyuan Yes, results with a significant speedup should be included in the doc; for the others, I believe some notes are enough.

As for the specific data, it differs between hardware series, so maybe we should hide it or just link to this issue? The percentages can be recorded in the doc as a reference, I think.

@shen-shanshan shen-shanshan marked this pull request as draft May 18, 2025 08:31
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
@shen-shanshan shen-shanshan marked this pull request as ready for review May 19, 2025 02:08
@wangxiyuan
Collaborator

I'm fine with this change. Let's merge first. Feel free to update the content if needed later.

@wangxiyuan wangxiyuan merged commit 0d68d8d into vllm-project:v0.7.3-dev May 20, 2025
4 checks passed
@wangxiyuan wangxiyuan changed the title [Doc] Add performance tuning docs [Doc][0.7.3] Add performance tuning docs May 21, 2025
