I am trying to configure mpirun and mpiexec to run software called Materials Studio on a cluster with 1 node, 2 processors, and 12 cores. Jobs are submitted through PBS. With some help I had everything set up properly and submitted jobs ran well, but after a few days I started getting this sort of error:

mpiexec_server.org: cannot connect to local mpd (/tmp/mpd2.console_user); possible causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option)

It seemed that the mpd daemon had been started at some point but eventually terminated. I had some luck adding this (bold part) to my submission script:

    export PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/bin:$PATH
    export LD_LIBRARY_PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/lib:/data1/opt/MD/Linux-x86_64/IntelMPI/bin:/data1/opt/MD/Linux-x86_64/IntelMKL/lib
    **mpdboot -n 1 -f ~/mpd.hosts**
    nohup mpd &
    /data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel

The job now submits and runs properly, but it times out after 30 minutes or so. I tried adding -r ssh to the end of the mpdboot line, but I am not sure that is the right strategy. I am also a little confused about why I need to start this daemon in the script, and why I need to pass a hosts file when I run; I thought PBS creates one when the job starts. Could anyone give me some advice on where to go next? Basically, how can I prevent a running job from quitting because of a problem with the MPI daemon?

EDIT: Could anyone shed any light on what is involved in running the mpiexec on the last line? If I point correctly to the folder where it lives, do I still need to run a boot command? I must admit I am confused about why I need to run mpdboot/mpd when the whole point of mpiexec is to eliminate the need for mpd (at least according to the mpiexec website).
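For reference, one way the pieces in the question could fit together is sketched below. It is not a confirmed fix for the timeout. It assumes an MPD-based MPI with `mpdboot`, `mpdtrace` and `mpdallexit` on the PATH, and it takes the host list from `$PBS_NODEFILE`, which PBS sets for each job, instead of a hand-written `~/mpd.hosts`. The function names and the `mpd.hosts.$PBS_JOBID` file name are made up for illustration.

```shell
#!/bin/bash
# Sketch only: assumes an MPD-based MPI (mpdboot, mpdtrace, mpdallexit on PATH)
# and a PBS job, where $PBS_NODEFILE typically has one line per allocated core.

# Write each allocated host once to the given file and print the host count.
make_mpd_hosts() {
    local nodefile="$1" hostsfile="$2"
    sort -u "$nodefile" > "$hostsfile"
    wc -l < "$hostsfile" | tr -d ' '
}

# Boot the ring, run the given command, and always shut the ring down afterwards.
run_in_mpd_ring() {
    local hostsfile="$1" nhosts="$2"
    shift 2
    mpdboot -n "$nhosts" -f "$hostsfile" -r ssh || return 1
    mpdtrace                      # list ring members as a sanity check
    "$@"
    local status=$?
    mpdallexit                    # stop the daemons started above
    return "$status"
}

# Only act when running inside a PBS job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    hosts="$PBS_O_WORKDIR/mpd.hosts.$PBS_JOBID"   # hypothetical file name
    n=$(make_mpd_hosts "$PBS_NODEFILE" "$hosts")
    run_in_mpd_ring "$hosts" "$n" \
        mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
fi
```

Booting the ring once with `mpdboot` and stopping it with `mpdallexit` inside the same job keeps the daemon's lifetime tied to the job, so a separate `nohup mpd &` line should not be needed in this arrangement.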

  • I guess I am a little confused about why I need to run mpdboot and mpd in the first place. It seems like only the latest Intel compiler suggests doing this. Is there a way to revert to the earlier behavior, as in, say, MPI 3.2, which I am told this code was compiled against? Thanks again! Commented Jun 10, 2013 at 0:44

1 Answer

I'm running an MD simulation, but when I try to run it in DL_POLY, the simulation does not start. I used these commands:

    $ ps aux | grep mpd
    $ nohup mpd > mpd.out 2> mpd.err < /dev/null/ &
    $ mpiexec -n 4 DLPOLY.X >> job.out 2> job.err < /dev/null &
    $ top

When I use the last command to look at the processes, DL_POLY does not appear. Meanwhile, the ll command shows that mpd.out has a size of zero. I don't know why.
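A quick way to narrow this down is to check whether an mpd ring is actually reachable and to look at the error files, since both commands send stderr to `mpd.err` and `job.err` rather than to `mpd.out`. The sketch below assumes `mpdtrace` is available from the same MPI installation; the function names are made up for illustration. One thing worth checking first: if the mpd command was typed exactly as shown, the input redirect `/dev/null/` ends in a slash, which the shell would likely reject as not a directory before mpd ever starts.

```shell
#!/bin/bash
# Sketch only: assumes an MPD-based MPI with mpdtrace on PATH.

# Describe a log file: missing, empty, or show its last few lines.
show_log() {
    local f="$1"
    if [ ! -e "$f" ]; then
        echo "$f: missing"
    elif [ ! -s "$f" ]; then
        echo "$f: empty"
    else
        echo "$f: last lines follow"
        tail -n 5 "$f"
    fi
}

# Succeeds only if mpdtrace exists and an mpd ring answers for the current user.
mpd_is_up() {
    command -v mpdtrace >/dev/null 2>&1 && mpdtrace >/dev/null 2>&1
}

# Report ring status, then summarize the two error files from the commands above.
check_launch() {
    if mpd_is_up; then
        echo "mpd ring is reachable"
    else
        echo "no mpd ring reachable; inspect mpd.err"
    fi
    show_log mpd.err
    show_log job.err
}
```

Running `check_launch` from the directory that holds `mpd.err` and `job.err` prints the ring status followed by a summary of each file.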
