I am trying to configure mpirun and mpiexec to run software called Materials Studio on a 1 node, 2 processor, 12 core cluster. The submission scheme is PBS. I had everything set up properly (with some help) and where I could submit jobs and they would work well but after a few days I ran into issues where I would get this sort of error:
mpiexec_server.org: cannot connect to local mpd (/tmp/mpd2.console_user); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)
It seemed like the daemon for mpd was somehow set up but eventually terminated. I had luck adding this (bold part) to my submission script:
export PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/bin:$PATH export LD_LIBRARY_PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/lib:/data1/opt/MD/Linux-x86_64/IntelMPI/bin:/data1/opt/MD/Linux-x86_64/IntelMKL/lib **mpdboot -n 1 -f ~/mpd.hosts** nohup mpd & /data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel The job now submits and runs properly but times out after 30 minutes or so. I tried adding '-r ssh' without quotes to the end of the mpdboot line but I am not sure if that is the right strategy to take. Also, I am a little confused about why I need to run this daemon in this script and why I need to call a hosts file when I run- I thought that PBS creates that when the job picks up. Could anyone please give me some advice on where to go next? Basically how can I prevent a job that is running from quitting because of something to do with the mpi daemon.
EDIT: Could anyone shed any light on what is involved with running that mpiexec that I have on the last line? If I properly link to the folder where it is, do I need to run a boot command? I must admit that I am confused why I need to run mpdboot/mpd when then whole point of mpiexec is to eliminate the need for mpd (at least according to the mpiexec website).