I am trying to run DeepSeek-V3 model inference on a remote machine (accessed via SSH). This machine does not have any GPU, but it has many CPU cores.
1st method:
I tried to run the model inference using the DeepSeek-Infer Demo method:
```
generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
```
This produced the following error message:
```
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
```
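For completeness, a quick check from the same Python environment confirms that PyTorch sees no CUDA device (which is expected, since the machine has no GPU):

```python
import torch

# Sanity check: this machine has no NVIDIA driver/GPU,
# so this prints False here.
print(torch.cuda.is_available())
```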
2nd method:
I then tried a second method, using the Hugging Face Transformers library.
I installed the Transformers Python package v4.51.3 (which supports DeepSeek-V3).
I then implemented the script described in the Transformers/DeepSeek-V3 documentation:
```python
# `run_deepseek_v1.py`
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(30)

tokenizer = AutoTokenizer.from_pretrained("path/to/local/deepseek-v3")

chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

model = AutoModelForCausalLM.from_pretrained("path/to/local/deepseek-v3", device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

import time
start = time.time()
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
print(time.time() - start)
```

I got a similar error message when running this:
```
File "transformers/quantizers/quantizer_finegrained_fp8.py", line 51, in validate_environment
    raise RuntimeError("No GPU found. A GPU is needed for FP8 quantization.")
```
I tried changing `device_map="auto"` to `device_map="cpu"`, but it did not change anything (I still got the same error message).
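From what I understand (I may be wrong), the FP8 quantizer is selected because the released DeepSeek-V3 checkpoint stores its weights in FP8, so `device_map` alone cannot avoid it. Here is a rough sketch of what I imagine a CPU-only load would look like, assuming the weights had first been converted to BF16 (for example with the conversion script shipped in the official DeepSeek-V3 repo, `inference/fp8_cast_bf16.py`, if I read that repo correctly; the local path below is hypothetical):

```python
# Sketch only: assumes a BF16 (non-FP8) copy of the weights already exists
# at this hypothetical local path, e.g. produced by converting the FP8
# checkpoint with the DeepSeek-V3 repo's inference/fp8_cast_bf16.py script.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "path/to/local/deepseek-v3-bf16",  # hypothetical path to BF16 weights
    device_map="cpu",                  # force everything onto the CPU
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("path/to/local/deepseek-v3-bf16")
```

Even if that loads, I realize the 671B parameters would need well over a terabyte of RAM in BF16 (671B × 2 bytes ≈ 1.3 TB), so I am not sure this is practical.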
So my question is the following: is there any way to run DeepSeek-V3 on CPU only (without any GPU), ideally using one of these methods (or another method that I may not know about)?
P.S.: I am new to the Data Science site, so if this question is too oriented toward implementation/runtime environment details, don't hesitate to tell me and I will close this post.