🚀 The feature, motivation and pitch
1. Obtain a model quantized with llmcompressor
Install llmcompressor:
pip install llmcompressor
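Download the calibration dataset (https://huggingface.co/datasets/HuggingFaceH4/no_robots)
Clone https://github.com/vllm-project/llm-compressor.git
Quantize the model with the script llm-compressor/examples/quantization_w8a8_int8/llama3_example.py
2. Run the quantized model with a vllm offline inference script
Following https://vllm-ascend.readthedocs.io/en/latest/installation.html, build the docker container environment and install vllm and vllm-ascend.
Test with the following inference script: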
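For context, the "W8A8-Dynamic-Per-Token" in the checkpoint name used below refers to int8 weights and int8 activations, with the activation scale computed per token at runtime. The following is only a rough pure-Python illustration of the activation side under that reading, assuming symmetric quantization to the range [-127, 127]; the function names `quantize_per_token` and `dequantize_per_token` are hypothetical and this is not llm-compressor's or vllm's actual implementation.

```python
from typing import List, Tuple


def quantize_per_token(rows: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per row (one row = one token).

    Illustrative sketch only; the range [-127, 127] is an assumption.
    """
    quantized: List[List[int]] = []
    scales: List[float] = []
    for row in rows:
        max_abs = max((abs(v) for v in row), default=0.0)
        # Use a scale of 1.0 for an all-zero row to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_row = [max(-127, min(127, round(v / scale))) for v in row]
        quantized.append(q_row)
        scales.append(scale)
    return quantized, scales


def dequantize_per_token(quantized: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Map int8 values back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


if __name__ == "__main__":
    activations = [[0.3, -1.0, 0.25], [10.0, 0.0, -4.0]]
    q, s = quantize_per_token(activations)
    print("quantized:", q)
    print("scales:", s)
    print("restored:", dequantize_per_token(q, s))
```

Because the scale is derived from each row at call time, no activation calibration statistics need to be stored for this step, which is what "dynamic" is taken to mean here.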
    from vllm import LLM, SamplingParams

    prompts = [
        "Hello, my name is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # The first run will take about 3-5 mins (10 MB/s) to download models
    # llm = LLM(model="/data/models/llama3-8b-instruct")
    llm = LLM(model="/data/models/llama3-8b-instruct-W8A8-Dynamic-Per-Token-llmcompressor")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

3. Result: this is currently not supported.
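To make the report easier to triage, it may help to state which quantization metadata the checkpoint declares. Below is a minimal sketch that reads `quantization_config` from the model directory's `config.json`, assuming the usual Hugging Face checkpoint layout. The key names `quant_method` and `format` are assumptions about what the quantization script writes, and the helper names are hypothetical; the code only reports whatever is present.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def read_quantization_config(model_dir: str) -> Optional[Dict[str, Any]]:
    """Return the `quantization_config` entry of config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


def describe_quantization(model_dir: str) -> str:
    """Summarize the declared quantization metadata in one line."""
    quant = read_quantization_config(model_dir)
    if quant is None:
        return "no quantization_config found"
    # `quant_method` and `format` are assumed key names; missing keys are reported as such.
    method = quant.get("quant_method", "<unspecified>")
    fmt = quant.get("format", "<unspecified>")
    return f"quant_method={method}, format={fmt}"


if __name__ == "__main__":
    # Path taken from the inference script in this issue.
    model_dir = "/data/models/llama3-8b-instruct-W8A8-Dynamic-Per-Token-llmcompressor"
    if (Path(model_dir) / "config.json").exists():
        print(describe_quantization(model_dir))
    else:
        print(f"config.json not found under {model_dir}")
```

Running it against both the original and the quantized checkpoint shows whether the difference that triggers the failure is visible in the config.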

Alternatives
No response
Additional context
No response