When I create an instance of the `LLM` class, initialization fails with:
`AttributeError: 'PyNcclCommunicator' object has no attribute 'device'`, and I don't know how to resolve it.
The code to test it:
```python
from vllm import LLM


def init_model() -> LLM:
    llm = LLM(
        model="Qwen/Qwen2-7B-Instruct",
        tokenizer_mode="auto",
        trust_remote_code=True,
        download_dir="./.cache",
        tensor_parallel_size=2,  # How many GPUs to use
        gpu_memory_utilization=0.85,
        pipeline_parallel_size=1,
        dtype="bfloat16",
        # max_model_len=20480,  # Model context length
        enable_prefix_caching=True,
        enable_chunked_prefill=False,
        num_scheduler_steps=8,
    )
    return llm


if __name__ == "__main__":
    llm = init_model()
    print(llm.generate("Hello, world!"))
```
```
INFO 09-13 00:13:27 config.py:890] Defaulting to use mp for distributed inference
WARNING 09-13 00:13:27 arg_utils.py:880] Enabled BlockSpaceManagerV2 because it is required for multi-step (--num-scheduler-steps > 1)
INFO 09-13 00:13:27 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir='./.cache', load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, use_v2_block_manager=True, num_scheduler_steps=8, enable_prefix_caching=True, use_async_output_proc=True)
WARNING 09-13 00:13:28 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 36 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 09-13 00:13:29 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
WARNING 09-13 00:13:29 registry.py:190] `mm_limits` has already been set for model=Qwen/Qwen2-7B-Instruct, and will be overwritten by the new values.
(VllmWorkerProcess pid=18643) WARNING 09-13 00:13:29 registry.py:190] `mm_limits` has already been set for model=Qwen/Qwen2-7B-Instruct, and will be overwritten by the new values.
(VllmWorkerProcess pid=18643) INFO 09-13 00:13:29 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 09-13 00:13:29 utils.py:977] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=18643) INFO 09-13 00:13:29 utils.py:977] Found nccl from library libnccl.so.2
ERROR 09-13 00:13:29 pynccl_wrapper.py:196] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.4.0-150-generic-x86_64-with-glibc2.27.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
(VllmWorkerProcess pid=18643) ERROR 09-13 00:13:29 pynccl_wrapper.py:196] Failed to load NCCL library from libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-5.4.0-150-generic-x86_64-with-glibc2.27.If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
(VllmWorkerProcess pid=18643) INFO 09-13 00:13:29 custom_all_reduce_utils.py:242] reading GPU P2P access cache from ~/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 09-13 00:13:29 custom_all_reduce_utils.py:242] reading GPU P2P access cache from ~/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 09-13 00:13:29 custom_all_reduce.py:131] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=18643) WARNING 09-13 00:13:29 custom_all_reduce.py:131] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
```
```
INFO 09-13 00:13:29 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fac3ec98730>, local_subscribe_port=46869, remote_subscribe_port=None)
INFO 09-13 00:13:29 model_runner.py:915] Starting to load model Qwen/Qwen2-7B-Instruct...
(VllmWorkerProcess pid=18643) INFO 09-13 00:13:29 model_runner.py:915] Starting to load model Qwen/Qwen2-7B-Instruct...
INFO 09-13 00:13:30 weight_utils.py:236] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=18643) INFO 09-13 00:13:31 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:04, 1.54s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03, 1.71s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:05<00:01, 1.75s/it]
(VllmWorkerProcess pid=18643) INFO 09-13 00:13:38 model_runner.py:926] Loading model weights took 7.1216 GB
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.80s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.76s/it]
INFO 09-13 00:13:38 model_runner.py:926] Loading model weights took 7.1216 GB
INFO 09-13 00:13:44 distributed_gpu_executor.py:57] # GPU blocks: 22639, # CPU blocks: 9362
INFO 09-13 00:13:50 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-13 00:13:50 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=18643) INFO 09-13 00:13:50 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=18643) INFO 09-13 00:13:50 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
[rank0]: Traceback (most recent call last):
[rank0]:   File "~/psp/Reasoning-Carefully/test.py", line 25, in <module>
[rank0]:     llm = init_model()
[rank0]:   File "~/psp/Reasoning-Carefully/test.py", line 6, in init_model
[rank0]:     llm = LLM(
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 177, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 538, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 319, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 461, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 63, in initialize_cache
[rank0]:     self._run_workers("initialize_cache",
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 199, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/worker/worker.py", line 265, in initialize_cache
[rank0]:     self._warm_up_model()
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/worker/worker.py", line 281, in _warm_up_model
[rank0]:     self.model_runner.capture_model(self.gpu_cache)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/worker/multi_step_model_runner.py", line 543, in capture_model
[rank0]:     return self._base_model_runner.capture_model(kv_caches)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1327, in capture_model
[rank0]:     graph_runner.capture(**capture_inputs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1569, in capture
[rank0]:     self.model(
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 361, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 269, in forward
[rank0]:     hidden_states = self.embed_tokens(input_ids)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 406, in forward
[rank0]:     output = tensor_model_parallel_all_reduce(output_parallel)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/distributed/communication_op.py", line 11, in tensor_model_parallel_all_reduce
[rank0]:     return get_tp_group().all_reduce(input_)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 288, in all_reduce
[rank0]:     pynccl_comm.all_reduce(input_)
[rank0]:   File "~/anaconda3/envs/psp/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 113, in all_reduce
[rank0]:     assert tensor.device == self.device, (
[rank0]: AttributeError: 'PyNcclCommunicator' object has no attribute 'device'
INFO 09-13 00:13:54 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x0000000000fadf60)

Current thread 0x00007facf08ba100 (most recent call first):
  <no Python frame>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _brotli, yaml._yaml, msgspec._core, psutil._psutil_linux, psutil._psutil_posix, sentencepiece._sentencepiece, PIL._imaging, PIL._imagingft, gmpy2.gmpy2, regex._regex, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, zmq.backend.cython._zmq (total: 44)
~/anaconda3/envs/psp/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)
```
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
When creating an instance of the `LLM` class, initialization fails with `AttributeError: 'PyNcclCommunicator' object has no attribute 'device'`. The reproduction code and full log are included above.
Before submitting a new issue...