Re: [OMPI users] problems with the -xterm option
jody <jody.xha <at> gmail.com>
2011-05-02 08:34:15 GMT
Hi Ralph
I rebuilt Open MPI 1.4.2 with the debug option on both chefli and squid_0.
The results are interesting!
I wrote a small HelloMPI app which basically calls usleep for a pause
of 5 seconds.
Now, calling it as I did before, no MPI errors appear anymore, only the
display problem:
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
/usr/bin/xterm Xt error: Can't open display: localhost:10.0
When I do the same call *with* the debug options, the xterm appears and
shows the output of HelloMPI!
I attach the output in ompidbg_1.txt. (It also works if I call with
'-np 4' and '--xterm 0,1,2,3'.)
Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
If I use the hold option, the xterm appears with the output of
'hostname' (cf. ompidbg_3.txt).
The xterm opens after the line "launch complete for job..." has been
written (line 59)
I just found that everything works as expected if I use the
'--leave-session-attached' option (without the debug options):
jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
plm_rsh_agent "ssh -Y" --leave-session-attached --xterm 0,1,2,3!
./HelloMPI
The xterms are also opened if I do not use the '!' hold option.
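For anyone who wants to script the working call, here is a minimal Python sketch that assembles the same mpirun argument list as above. The helper name build_mpirun_argv and its parameters are made up for illustration; only the flags themselves are taken from the commands in this mail, and the trailing '!' follows the form I used on the command line.

```python
import shlex


def build_mpirun_argv(program, np, host, xterm_ranks, hold=False,
                      rsh_agent="ssh -Y", leave_attached=True):
    """Assemble the mpirun call shown above (hypothetical helper)."""
    ranks = ",".join(str(r) for r in xterm_ranks)
    if hold:
        # Trailing '!' as in '--xterm 0,1,2,3!': ask for the windows to stay open.
        ranks += "!"
    argv = ["mpirun", "-np", str(np), "-host", host,
            "-mca", "plm_rsh_agent", rsh_agent]
    if leave_attached:
        # The option that made the xterms appear without the debug options.
        argv.append("--leave-session-attached")
    argv += ["--xterm", ranks, program]
    return argv


if __name__ == "__main__":
    argv = build_mpirun_argv("./HelloMPI", 4, "squid_0", range(4), hold=True)
    # shlex.join quotes arguments such as "ssh -Y" so the line can be pasted
    # into a shell; subprocess.run(argv) would work on the list directly.
    print(shlex.join(argv))
```

Keeping the call as a list avoids quoting problems with the "ssh -Y" agent string, which has to reach mpirun as a single argument.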
What does *not* work is
jody@aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
plm_rsh_agent "ssh -Y" --leave-session-attached xterm
xterm Xt error: Can't open display:
xterm: DISPLAY is not set
xterm Xt error: Can't open display:
xterm: DISPLAY is not set
But then again, this call works (i.e. an xterm is opened) if all the
debug-options are used (ompidbg_4.txt).
Here the '--leave-session-attached' is necessary - without it, no xterm.
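To see what each launched process actually gets as DISPLAY, something like the following could be started in place of xterm. It is only a sketch: the script name show_display.py and the function describe_display are made up for illustration, and it assumes python3 is installed on the remote node. OMPI_COMM_WORLD_RANK is the variable that printenv showed in my earlier mails.

```python
import os
import socket


def describe_display(env, hostname):
    """Return one line saying whether DISPLAY is set in the given environment."""
    rank = env.get("OMPI_COMM_WORLD_RANK", "?")
    display = env.get("DISPLAY")
    if not display:
        return f"{hostname} rank {rank}: DISPLAY is not set"
    return f"{hostname} rank {rank}: DISPLAY={display}"


if __name__ == "__main__":
    # Report the environment of this process as the launcher set it up.
    print(describe_display(os.environ, socket.gethostname()))
```

It could be run the same way as the failing call, e.g. with 'python3 show_display.py' instead of 'xterm' at the end of the command line, with and without '--leave-session-attached'.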
From these results I would say that there is no basic mishandling of
'ssh', though I have no idea what internal differences the use of the
'--leave-session-attached' option or the debug options make.
I hope these observations are helpful.
Jody
On Fri, Apr 29, 2011 at 12:08 AM, jody <jody.xha <at> gmail.com> wrote:
> Hi Ralph
>
> Thank you for your suggestions.
> I'll be happy to help you.
> I'm not sure if i'll get around to this tomorrow,
> but i certainly will do so on Monday.
>
> Thanks
> Jody
>
> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>> Hi Jody
>>
>> I'm not sure when I'll get a chance to work on this - got a deadline to meet. I do have a couple of suggestions,
>> if you wouldn't mind helping debug the problem?
>>
>> It looks to me like the problem is that mpirun is crashing or terminating early for some reason - hence the
>> failures to send msgs to it, and the "lifeline lost" error that leads to the termination of the daemon. If
>> you build a debug version of the code (i.e., --enable-debug on configure), you can get a lot of debug info
>> that traces the behavior.
>>
>> If you could then run your program with
>>
>> -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>
>> and send it to me, we'll see what ORTE thinks it is doing.
>>
>> You could also take a look at the code for implementing the xterm option. You'll find it in
>>
>> orte/mca/odls/base/odls_base_default_fns.c
>>
>> around line 1115. The xterm command syntax is defined in
>>
>> orte/mca/odls/base/odls_base_open.c
>>
>> around line 233 and following. Note that we use "xterm -T" as the cmd. Perhaps you can spot an error in the
>> way we treat xterm?
>>
>> Also, remember that you have to specify that you want us to "hold" the xterm window open even after the
>> process terminates. If you don't specify it, the window automatically closes upon completion of the
>> process. So a fast-running cmd like "hostname" might disappear so quickly that it causes a race condition problem.
>>
>> You might want to try a spinner application - i.e., output something and then sit in a loop or sleep for
>> some period of time. Or, use the "hold" option to keep the window open - you designate "hold" by putting a '!'
>> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>
>>
>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>
>>> Hi
>>>
>>> Unfortunately this does not solve my problem.
>>> While i can do
>>> ssh -Y squid_0 xterm
>>> and this will open an xterm on my machine (chefli),
>>> i run into problems with the -xterm option of openmpi:
>>>
>>> jody@chefli ~/share/neander $ mpirun -np 4 -mca plm_rsh_agent "ssh
>>> -Y" -host squid_0 --xterm 1 hostname
>>> squid_0
>>> [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>> lifeline [[35219,0],0] lost
>>> [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>> lifeline [[35219,0],0] lost
>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>
>>> By the way when i look at the DISPLAY variable in the xterm window
>>> opened via squid_0,
>>> i also have the display variable "localhost:11.0"
>>>
>>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>>> the lines with the warnings about "xauth" and "untrusted X" do not
>>> appear:
>>>
>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1 hostname
>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>> squid_0
>>> [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>> lifeline [[34926,0],0] lost
>>> [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>> [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>> lifeline [[34926,0],0] lost
>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>
>>>
>>> I have doubts that the "-Y" is passed correctly:
>>> jody@triops ~/share/neander $ mpirun -np -mca plm_rsh_agent "ssh
>>> -Y" -host squid_0 xterm
>>> xterm Xt error: Can't open display:
>>> xterm: DISPLAY is not set
>>> xterm Xt error: Can't open display:
>>> xterm: DISPLAY is not set
>>>
>>>
>>> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>>> mpirun -np 4 -xterm 1 /usr/bin/printenv
>>> prints everything onto the console.
>>>
>>> Do you have any other suggestions i could try?
>>>
>>> Thank You
>>> Jody
>>>
>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>> Should be able to just set
>>>>
>>>> -mca plm_rsh_agent "ssh -Y"
>>>>
>>>> on your cmd line, I believe
>>>>
>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>
>>>>> Hi Ralph
>>>>>
>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>
>>>>> Thank You
>>>>> Jody
>>>>>
>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody.xha <at> gmail.com> wrote:
>>>>>> Hi Ralph
>>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>> (X11Forwarding was set to 'no').
>>>>>>
>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>> and with 'ssh -X'
>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>
>>>>>> But the xterm option still doesn't work:
>>>>>> jody@chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>> printenv | grep WORLD_RANK
>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>> /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>> [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>> [sd = 8]
>>>>>> [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>> lifeline [[54132,0],0] lost
>>>>>>
>>>>>> So it looks like the two processes from squid_0 can't open the display this way,
>>>>>> but one of them writes the output to the console...
>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>
>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>
>>>>>> Thank You
>>>>>> Jody
>>>>>>
>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>> Cygwin-specific in the answers:
>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>
>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>> already tried -Y.
>>>>>>> Not sure what to suggest...
>>>>>>>
>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>>>>>> result:
>>>>>>>
>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote server
>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>
>>>>>>> When doing something like:
>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>>>>>> got an error message like:
>>>>>>>
>>>>>>>
>>>>>>> jason@badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>> [root@RHEL ~]#
>>>>>>> and any X programs I ran would not display on my local system..
>>>>>>>
>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>
>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>
>>>>>>> and that worked fine.
>>>>>>>
>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>> accommodate.
>>>>>>>
>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>
>>>>>>> Hi Ralph
>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>
>>>>>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>>>
>>>>>>> jody@chefli ~/share/neander $ ssh -Y squid_0
>>>>>>> Last login: Wed Apr 6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display:
>>>>>>> xterm: DISPLAY is not set
>>>>>>> jody@squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>> jody@squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>> jody@squid_0 ~ $ xterm
>>>>>>> xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>> jody@squid_0 ~ $ exit
>>>>>>> logout
>>>>>>>
>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>> as with mpirun:
>>>>>>>
>>>>>>> jody@chefli ~/share/neander $ ssh -X squid_0
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>> generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>> Last login: Wed Apr 6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>
>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>
>>>>>>> Thank You
>>>>>>> Jody
>>>>>>>
>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>
>>>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>>>>>> and hence will abort.
>>>>>>>
>>>>>>> Are you seeing mpirun still alive?
>>>>>>>
>>>>>>>
>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>> it works in "text-mode":
>>>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 printenv | grep WORLD_RANK
>>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>>> OMPI_COMM_WORLD_RANK=1
>>>>>>> OMPI_COMM_WORLD_RANK=2
>>>>>>> OMPI_COMM_WORLD_RANK=3
>>>>>>>
>>>>>>> but when i use the -xterm option to mpirun, it doesn't work
>>>>>>> $ mpirun -np 4 -x DISPLAY -host squid_0 -xterm 1,2 printenv | grep WORLD_RANK
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>> OMPI_COMM_WORLD_RANK=0
>>>>>>> [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>> [sd = 8]
>>>>>>> [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>>>>>>> lifeline [[55607,0],0] lost
>>>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>> /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>
>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>
>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>> the workstation, to the IP address of the workstation or simply to
>>>>>>> ":0.0", it doesn't work
>>>>>>>
>>>>>>> But i do have xauth data (as far as i know):
>>>>>>> On the remote (squid_0):
>>>>>>> jody@squid_0 ~ $ xauth list
>>>>>>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>>>>>>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>> chefli.uzh.ch:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> on the workstation:
>>>>>>> $ xauth list
>>>>>>> chefli/unix:10 MIT-MAGIC-COOKIE-1 5293e179bc7b2036d87cbcdf14891d0c
>>>>>>> chefli/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>> localhost.localdomain/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>> chefli.uzh.ch/unix:0 MIT-MAGIC-COOKIE-1 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>>> I have also done
>>>>>>> xhost + squid_0
>>>>>>> on the workstation.
>>>>>>>
>>>>>>> How can i get the -xterm option running?
>>>>>>>
>>>>>>> Thank You
>>>>>>> Jody
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users <at> open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached --xterm 0 ./HelloMPI
[chefli:02420] mca:base:select:( plm) Querying component [rsh]
[chefli:02420] mca:base:select:( plm) Query of component [rsh] set priority to 10
[chefli:02420] mca:base:select:( plm) Querying component [slurm]
[chefli:02420] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[chefli:02420] mca:base:select:( plm) Selected component [rsh]
[chefli:02420] plm:base:set_hnp_name: initial bias 2420 nodename hash 72192778
[chefli:02420] plm:base:set_hnp_name: final jobfam 40499
[chefli:02420] [[40499,0],0] plm:base:receive start comm
[chefli:02420] mca:base:select:( odls) Querying component [default]
[chefli:02420] mca:base:select:( odls) Query of component [default] set priority to 1
[chefli:02420] mca:base:select:( odls) Selected component [default]
[chefli:02420] [[40499,0],0] plm:rsh: setting up job [40499,1]
[chefli:02420] [[40499,0],0] plm:base:setup_job for job [40499,1]
[chefli:02420] [[40499,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02420] [[40499,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02420] [[40499,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02420] [[40499,0],0] plm:rsh: final template argv:
/usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid 2654142464 -mca orte_ess_vpid
<template> -mca orte_ess_num_procs 2 --hnp-uri "2654142464.0;tcp://192.168.0.14:39093" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"
[chefli:02420] [[40499,0],0] plm:rsh: launching on node squid_0
[chefli:02420] [[40499,0],0] plm:rsh: recording launch of daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0 orted
-mca ess env -mca orte_ess_jobid 2654142464 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2654142464.0;tcp://192.168.0.14:39093" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm
0 -mca plm_rsh_agent "ssh -Y"]
[squid_0:19442] mca:base:select:( odls) Querying component [default]
[squid_0:19442] mca:base:select:( odls) Query of component [default] set priority to 1
[squid_0:19442] mca:base:select:( odls) Selected component [default]
[chefli:02420] [[40499,0],0] plm:base:daemon_callback
[chefli:02420] [[40499,0],0] plm:base:orted_report_launch from daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:orted_report_launch completed for daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:daemon_callback completed
[chefli:02420] [[40499,0],0] plm:base:launch_apps for job [40499,1]
[chefli:02420] [[40499,0],0] plm:base:report_launched for job [40499,1]
[chefli:02420] [[40499,0],0] odls:constructing child list
[chefli:02420] [[40499,0],0] odls:construct_child_list unpacking data to launch job [40499,1]
[chefli:02420] [[40499,0],0] odls:construct_child_list adding new jobdat for job [40499,1]
[chefli:02420] [[40499,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02420] [[40499,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[chefli:02420] [[40499,0],0] odls:construct:child: num_participating 1
[chefli:02420] [[40499,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false
[chefli:02420] [[40499,0],0] odls:launch reporting job [40499,1] launch status
[chefli:02420] [[40499,0],0] odls:launch setting waitpids
[chefli:02420] [[40499,0],0] plm:base:app_report_launch from daemon [[40499,0],0]
[chefli:02420] [[40499,0],0] plm:base:app_report_launch completed processing
[squid_0:19442] [[40499,0],1] odls:constructing child list
[squid_0:19442] [[40499,0],1] odls:construct_child_list unpacking data to launch job [40499,1]
[squid_0:19442] [[40499,0],1] odls:construct_child_list adding new jobdat for job [40499,1]
[squid_0:19442] [[40499,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19442] [[40499,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[squid_0:19442] [[40499,0],1] odls:constructing child list - found proc 0 for me!
[squid_0:19442] [[40499,0],1] odls:construct:child: num_participating 1
[squid_0:19442] [[40499,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false
[squid_0:19442] [[40499,0],1] odls:launch reporting job [40499,1] launch status
[squid_0:19442] [[40499,0],1] odls:launch setting waitpids
[chefli:02420] [[40499,0],0] plm:base:app_report_launch reissuing non-blocking recv
[chefli:02420] [[40499,0],0] plm:base:app_report_launch from daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:app_report_launched for proc [[40499,1],0] from daemon
[[40499,0],1]: pid 19446 state 2 exit 0
[chefli:02420] [[40499,0],0] plm:base:app_report_launch completed processing
[chefli:02420] [[40499,0],0] plm:base:report_launched all apps reported
[chefli:02420] [[40499,0],0] plm:base:launch wiring up iof
[chefli:02420] [[40499,0],0] plm:base:launch completed for job [40499,1]
[squid_0:19442] [[40499,0],1] odls: registering sync on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls:sync nidmap requested for job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending sync ack to child [[40499,1],0] with 144 bytes of data
[squid_0:19442] [[40499,0],1] odls: sending contact info to HNP
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 2
num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 2
num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 15 on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1
num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1
num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 17 on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1
num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1
num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 17 on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: registering sync on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: sending sync ack to child [[40499,1],0] with 0 bytes of data
[chefli:02420] [[40499,0],0] plm:base:receive got message from [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for job [40499,1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for proc [[40499,1],0] curnt
state 4 new state 80 exit_code 0
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,1] - num_terminated 1
num_procs 1
[chefli:02420] [[40499,0],0] plm:base:check_job_completed declared job [40499,1] normally
terminated - checking all jobs
[chefli:02420] [[40499,0],0] plm:base:check_job_completed all jobs terminated - waking up
[chefli:02420] [[40499,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02420] [[40499,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,0] - num_terminated 1
num_procs 2
[squid_0:19442] [[40499,0],1] odls:wait_local_proc child process 19446 terminated
[squid_0:19442] [[40499,0],1] odls:notify_iof_complete for child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody@squid_0_0/2654142465/0/abort
[chefli:02420] [[40499,0],0] plm:base:receive got message from [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for job [40499,0]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for proc [[40499,0],1] curnt
state 4 new state 80 exit_code 0
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,0] - num_terminated 2
num_procs 2
[chefli:02420] [[40499,0],0] plm:base:check_job_completed declared job [40499,0] normally
terminated - checking all jobs
[chefli:02420] [[40499,0],0] plm:base:receive stop comm
[squid_0:19442] [[40499,0],1] odls:waitpid_fired child process [[40499,1],0] terminated normally
[squid_0:19442] [[40499,0],1] odls:proc_complete reporting all procs in [40499,1] terminated
[squid_0:19442] [[40499,0],1] odls:kill_local_proc working on job [WILDCARD]
jody@chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached --xterm 0 hostname
[chefli:02476] mca:base:select:( plm) Querying component [rsh]
[chefli:02476] mca:base:select:( plm) Query of component [rsh] set priority to 10
[chefli:02476] mca:base:select:( plm) Querying component [slurm]
[chefli:02476] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[chefli:02476] mca:base:select:( plm) Selected component [rsh]
[chefli:02476] plm:base:set_hnp_name: initial bias 2476 nodename hash 72192778
[chefli:02476] plm:base:set_hnp_name: final jobfam 40683
[chefli:02476] [[40683,0],0] plm:base:receive start comm
[chefli:02476] mca:base:select:( odls) Querying component [default]
[chefli:02476] mca:base:select:( odls) Query of component [default] set priority to 1
[chefli:02476] mca:base:select:( odls) Selected component [default]
[chefli:02476] [[40683,0],0] plm:rsh: setting up job [40683,1]
[chefli:02476] [[40683,0],0] plm:base:setup_job for job [40683,1]
[chefli:02476] [[40683,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02476] [[40683,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02476] [[40683,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02476] [[40683,0],0] plm:rsh: final template argv:
/usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid 2666201088 -mca orte_ess_vpid
<template> -mca orte_ess_num_procs 2 --hnp-uri "2666201088.0;tcp://192.168.0.14:53879" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"
[chefli:02476] [[40683,0],0] plm:rsh: launching on node squid_0
[chefli:02476] [[40683,0],0] plm:rsh: recording launch of daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0 orted
-mca ess env -mca orte_ess_jobid 2666201088 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2666201088.0;tcp://192.168.0.14:53879" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm
0 -mca plm_rsh_agent "ssh -Y"]
[squid_0:19579] mca:base:select:( odls) Querying component [default]
[squid_0:19579] mca:base:select:( odls) Query of component [default] set priority to 1
[squid_0:19579] mca:base:select:( odls) Selected component [default]
[chefli:02476] [[40683,0],0] plm:base:daemon_callback
[chefli:02476] [[40683,0],0] plm:base:orted_report_launch from daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:orted_report_launch completed for daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:daemon_callback completed
[chefli:02476] [[40683,0],0] plm:base:launch_apps for job [40683,1]
[chefli:02476] [[40683,0],0] plm:base:report_launched for job [40683,1]
[chefli:02476] [[40683,0],0] odls:constructing child list
[chefli:02476] [[40683,0],0] odls:construct_child_list unpacking data to launch job [40683,1]
[chefli:02476] [[40683,0],0] odls:construct_child_list adding new jobdat for job [40683,1]
[chefli:02476] [[40683,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02476] [[40683,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[chefli:02476] [[40683,0],0] odls:construct:child: num_participating 1
[chefli:02476] [[40683,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false
[chefli:02476] [[40683,0],0] odls:launch reporting job [40683,1] launch status
[chefli:02476] [[40683,0],0] odls:launch setting waitpids
[chefli:02476] [[40683,0],0] plm:base:app_report_launch from daemon [[40683,0],0]
[chefli:02476] [[40683,0],0] plm:base:app_report_launch completed processing
[squid_0:19579] [[40683,0],1] odls:constructing child list
[squid_0:19579] [[40683,0],1] odls:construct_child_list unpacking data to launch job [40683,1]
[squid_0:19579] [[40683,0],1] odls:construct_child_list adding new jobdat for job [40683,1]
[squid_0:19579] [[40683,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19579] [[40683,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[squid_0:19579] [[40683,0],1] odls:constructing child list - found proc 0 for me!
[squid_0:19579] [[40683,0],1] odls:construct:child: num_participating 1
[squid_0:19579] [[40683,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false
[squid_0:19579] [[40683,0],1] odls:launch reporting job [40683,1] launch status
[squid_0:19579] [[40683,0],1] odls:launch setting waitpids
[chefli:02476] [[40683,0],0] plm:base:app_report_launch reissuing non-blocking recv
[chefli:02476] [[40683,0],0] plm:base:app_report_launch from daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:app_report_launched for proc [[40683,1],0] from daemon
[[40683,0],1]: pid 19583 state 2 exit 0
[chefli:02476] [[40683,0],0] plm:base:app_report_launch completed processing
[chefli:02476] [[40683,0],0] plm:base:report_launched all apps reported
[chefli:02476] [[40683,0],0] plm:base:launch wiring up iof
[chefli:02476] [[40683,0],0] plm:base:launch completed for job [40683,1]
[squid_0:19579] [[40683,0],1] odls:wait_local_proc child process 19583 terminated
[squid_0:19579] [[40683,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody@squid_0_0/2666201089/0/abort
[squid_0:19579] [[40683,0],1] odls:waitpid_fired child process [[40683,1],0] terminated normally
[squid_0:19579] [[40683,0],1] odls:notify_iof_complete for child [[40683,1],0]
[chefli:02476] [[40683,0],0] plm:base:receive got message from [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for job [40683,1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for proc [[40683,1],0] curnt state 2 new state 80 exit_code 0
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,1] - num_terminated 1 num_procs 1
[chefli:02476] [[40683,0],0] plm:base:check_job_completed declared job [40683,1] normally terminated - checking all jobs
[chefli:02476] [[40683,0],0] plm:base:check_job_completed all jobs terminated - waking up
[chefli:02476] [[40683,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02476] [[40683,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,0] - num_terminated 1 num_procs 2
[squid_0:19579] [[40683,0],1] odls:proc_complete reporting all procs in [40683,1] terminated
[chefli:02476] [[40683,0],0] plm:base:receive got message from [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for job [40683,0]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for proc [[40683,0],1] curnt state 4 new state 80 exit_code 0
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,0] - num_terminated 2 num_procs 2
[chefli:02476] [[40683,0],0] plm:base:check_job_completed declared job [40683,0] normally terminated - checking all jobs
[chefli:02476] [[40683,0],0] plm:base:receive stop comm
[squid_0:19579] [[40683,0],1] odls:kill_local_proc working on job [WILDCARD]
jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached --xterm 0! hostname
[chefli:02487] mca:base:select:( plm) Querying component [rsh]
[chefli:02487] mca:base:select:( plm) Query of component [rsh] set priority to 10
[chefli:02487] mca:base:select:( plm) Querying component [slurm]
[chefli:02487] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[chefli:02487] mca:base:select:( plm) Selected component [rsh]
[chefli:02487] plm:base:set_hnp_name: initial bias 2487 nodename hash 72192778
[chefli:02487] plm:base:set_hnp_name: final jobfam 40688
[chefli:02487] [[40688,0],0] plm:base:receive start comm
[chefli:02487] mca:base:select:( odls) Querying component [default]
[chefli:02487] mca:base:select:( odls) Query of component [default] set priority to 1
[chefli:02487] mca:base:select:( odls) Selected component [default]
[chefli:02487] [[40688,0],0] plm:rsh: setting up job [40688,1]
[chefli:02487] [[40688,0],0] plm:base:setup_job for job [40688,1]
[chefli:02487] [[40688,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02487] [[40688,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02487] [[40688,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02487] [[40688,0],0] plm:rsh: final template argv:
/usr/bin/ssh -Y -X <template> orted -mca ess env -mca orte_ess_jobid 2666528768 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri "2666528768.0;tcp://192.168.0.14:36402" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0! -mca plm_rsh_agent "ssh -Y"
[chefli:02487] [[40688,0],0] plm:rsh: launching on node squid_0
[chefli:02487] [[40688,0],0] plm:rsh: recording launch of daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0 orted -mca ess env -mca orte_ess_jobid 2666528768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2666528768.0;tcp://192.168.0.14:36402" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0! -mca plm_rsh_agent "ssh -Y"]
[squid_0:19613] mca:base:select:( odls) Querying component [default]
[squid_0:19613] mca:base:select:( odls) Query of component [default] set priority to 1
[squid_0:19613] mca:base:select:( odls) Selected component [default]
[chefli:02487] [[40688,0],0] plm:base:daemon_callback
[chefli:02487] [[40688,0],0] plm:base:orted_report_launch from daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:orted_report_launch completed for daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:daemon_callback completed
[chefli:02487] [[40688,0],0] plm:base:launch_apps for job [40688,1]
[chefli:02487] [[40688,0],0] plm:base:report_launched for job [40688,1]
[chefli:02487] [[40688,0],0] odls:constructing child list
[chefli:02487] [[40688,0],0] odls:construct_child_list unpacking data to launch job [40688,1]
[chefli:02487] [[40688,0],0] odls:construct_child_list adding new jobdat for job [40688,1]
[chefli:02487] [[40688,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02487] [[40688,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[chefli:02487] [[40688,0],0] odls:construct:child: num_participating 1
[chefli:02487] [[40688,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false
[chefli:02487] [[40688,0],0] odls:launch reporting job [40688,1] launch status
[chefli:02487] [[40688,0],0] odls:launch setting waitpids
[chefli:02487] [[40688,0],0] plm:base:app_report_launch from daemon [[40688,0],0]
[chefli:02487] [[40688,0],0] plm:base:app_report_launch completed processing
[squid_0:19613] [[40688,0],1] odls:constructing child list
[squid_0:19613] [[40688,0],1] odls:construct_child_list unpacking data to launch job [40688,1]
[squid_0:19613] [[40688,0],1] odls:construct_child_list adding new jobdat for job [40688,1]
[squid_0:19613] [[40688,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19613] [[40688,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[squid_0:19613] [[40688,0],1] odls:constructing child list - found proc 0 for me!
[squid_0:19613] [[40688,0],1] odls:construct:child: num_participating 1
[squid_0:19613] [[40688,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false
[squid_0:19613] [[40688,0],1] odls:launch reporting job [40688,1] launch status
[squid_0:19613] [[40688,0],1] odls:launch setting waitpids
[chefli:02487] [[40688,0],0] plm:base:app_report_launch reissuing non-blocking recv
[chefli:02487] [[40688,0],0] plm:base:app_report_launch from daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:app_report_launched for proc [[40688,1],0] from daemon [[40688,0],1]: pid 19617 state 2 exit 0
[chefli:02487] [[40688,0],0] plm:base:app_report_launch completed processing
[chefli:02487] [[40688,0],0] plm:base:report_launched all apps reported
[chefli:02487] [[40688,0],0] plm:base:launch wiring up iof
[chefli:02487] [[40688,0],0] plm:base:launch completed for job [40688,1]
[squid_0:19613] [[40688,0],1] odls:wait_local_proc child process 19617 terminated
[squid_0:19613] [[40688,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody@squid_0_0/2666528769/0/abort
[squid_0:19613] [[40688,0],1] odls:waitpid_fired child process [[40688,1],0] terminated normally
[squid_0:19613] [[40688,0],1] odls:notify_iof_complete for child [[40688,1],0]
[squid_0:19613] [[40688,0],1] odls:proc_complete reporting all procs in [40688,1] terminated
[chefli:02487] [[40688,0],0] plm:base:receive got message from [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for job [40688,1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for proc [[40688,1],0] curnt state 2 new state 80 exit_code 0
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,1] - num_terminated 1 num_procs 1
[chefli:02487] [[40688,0],0] plm:base:check_job_completed declared job [40688,1] normally terminated - checking all jobs
[chefli:02487] [[40688,0],0] plm:base:check_job_completed all jobs terminated - waking up
[chefli:02487] [[40688,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02487] [[40688,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,0] - num_terminated 1 num_procs 2
[chefli:02487] [[40688,0],0] plm:base:receive got message from [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for job [40688,0]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for proc [[40688,0],1] curnt state 4 new state 80 exit_code 0
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,0] - num_terminated 2 num_procs 2
[chefli:02487] [[40688,0],0] plm:base:check_job_completed declared job [40688,0] normally terminated - checking all jobs
[squid_0:19613] [[40688,0],1] odls:kill_local_proc working on job [WILDCARD]
[chefli:02487] [[40688,0],0] plm:base:receive stop comm
jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached xterm
[chefli:02619] mca:base:select:( plm) Querying component [rsh]
[chefli:02619] mca:base:select:( plm) Query of component [rsh] set priority to 10
[chefli:02619] mca:base:select:( plm) Querying component [slurm]
[chefli:02619] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[chefli:02619] mca:base:select:( plm) Selected component [rsh]
[chefli:02619] plm:base:set_hnp_name: initial bias 2619 nodename hash 72192778
[chefli:02619] plm:base:set_hnp_name: final jobfam 40316
[chefli:02619] [[40316,0],0] plm:base:receive start comm
[chefli:02619] mca:base:select:( odls) Querying component [default]
[chefli:02619] mca:base:select:( odls) Query of component [default] set priority to 1
[chefli:02619] mca:base:select:( odls) Selected component [default]
[chefli:02619] [[40316,0],0] plm:rsh: setting up job [40316,1]
[chefli:02619] [[40316,0],0] plm:base:setup_job for job [40316,1]
[chefli:02619] [[40316,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02619] [[40316,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02619] [[40316,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02619] [[40316,0],0] plm:rsh: final template argv:
/usr/bin/ssh -Y <template> orted -mca ess env -mca orte_ess_jobid 2642149376 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 2 --hnp-uri "2642149376.0;tcp://192.168.0.14:57848" -mca plm_base_verbose 5 -mca odls_base_verbose 5 -mca plm_rsh_agent "ssh -Y"
[chefli:02619] [[40316,0],0] plm:rsh: launching on node squid_0
[chefli:02619] [[40316,0],0] plm:rsh: recording launch of daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y squid_0 orted -mca ess env -mca orte_ess_jobid 2642149376 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2642149376.0;tcp://192.168.0.14:57848" -mca plm_base_verbose 5 -mca odls_base_verbose 5 -mca plm_rsh_agent "ssh -Y"]
[squid_0:20023] mca:base:select:( odls) Querying component [default]
[squid_0:20023] mca:base:select:( odls) Query of component [default] set priority to 1
[squid_0:20023] mca:base:select:( odls) Selected component [default]
[chefli:02619] [[40316,0],0] plm:base:daemon_callback
[chefli:02619] [[40316,0],0] plm:base:orted_report_launch from daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:orted_report_launch completed for daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:daemon_callback completed
[chefli:02619] [[40316,0],0] plm:base:launch_apps for job [40316,1]
[chefli:02619] [[40316,0],0] plm:base:report_launched for job [40316,1]
[chefli:02619] [[40316,0],0] odls:constructing child list
[chefli:02619] [[40316,0],0] odls:construct_child_list unpacking data to launch job [40316,1]
[chefli:02619] [[40316,0],0] odls:construct_child_list adding new jobdat for job [40316,1]
[chefli:02619] [[40316,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02619] [[40316,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[chefli:02619] [[40316,0],0] odls:construct:child: num_participating 1
[chefli:02619] [[40316,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false
[chefli:02619] [[40316,0],0] odls:launch reporting job [40316,1] launch status
[chefli:02619] [[40316,0],0] odls:launch setting waitpids
[chefli:02619] [[40316,0],0] plm:base:app_report_launch from daemon [[40316,0],0]
[chefli:02619] [[40316,0],0] plm:base:app_report_launch completed processing
[squid_0:20023] [[40316,0],1] odls:constructing child list
[squid_0:20023] [[40316,0],1] odls:construct_child_list unpacking data to launch job [40316,1]
[squid_0:20023] [[40316,0],1] odls:construct_child_list adding new jobdat for job [40316,1]
[squid_0:20023] [[40316,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:20023] [[40316,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[squid_0:20023] [[40316,0],1] odls:constructing child list - found proc 0 for me!
[squid_0:20023] [[40316,0],1] odls:construct:child: num_participating 1
[squid_0:20023] [[40316,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false
[chefli:02619] [[40316,0],0] plm:base:app_report_launch reissuing non-blocking recv
[chefli:02619] [[40316,0],0] plm:base:app_report_launch from daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:app_report_launched for proc [[40316,1],0] from daemon [[40316,0],1]: pid 20027 state 2 exit 0
[chefli:02619] [[40316,0],0] plm:base:app_report_launch completed processing
[chefli:02619] [[40316,0],0] plm:base:report_launched all apps reported
[chefli:02619] [[40316,0],0] plm:base:launch wiring up iof
[chefli:02619] [[40316,0],0] plm:base:launch completed for job [40316,1]
[squid_0:20023] [[40316,0],1] odls:launch reporting job [40316,1] launch status
[squid_0:20023] [[40316,0],1] odls:launch setting waitpids
[chefli:02619] [[40316,0],0] plm:base:receive got message from [[40316,0],1]
[squid_0:20023] [[40316,0],1] odls:wait_local_proc child process 20027 terminated
[squid_0:20023] [[40316,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody@squid_0_0/2642149377/0/abort
[squid_0:20023] [[40316,0],1] odls:waitpid_fired child process [[40316,1],0] terminated normally
[squid_0:20023] [[40316,0],1] odls:notify_iof_complete for child [[40316,1],0]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for job [40316,1]
[squid_0:20023] [[40316,0],1] odls:proc_complete reporting all procs in [40316,1] terminated
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for proc [[40316,1],0] curnt state 2 new state 80 exit_code 0
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,1] - num_terminated 1 num_procs 1
[chefli:02619] [[40316,0],0] plm:base:check_job_completed declared job [40316,1] normally terminated - checking all jobs
[chefli:02619] [[40316,0],0] plm:base:check_job_completed all jobs terminated - waking up
[chefli:02619] [[40316,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02619] [[40316,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,0] - num_terminated 1 num_procs 2
[chefli:02619] [[40316,0],0] plm:base:receive got message from [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for job [40316,0]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for proc [[40316,0],1] curnt state 4 new state 80 exit_code 0
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,0] - num_terminated 2 num_procs 2
[chefli:02619] [[40316,0],0] plm:base:check_job_completed declared job [40316,0] normally terminated - checking all jobs
[chefli:02619] [[40316,0],0] plm:base:receive stop comm
[squid_0:20023] [[40316,0],1] odls:kill_local_proc working on job [WILDCARD]