Jack Bryan | 1 May 2011 02:52

[OMPI users] OMPI vs. network socket communication

Hi, All:

What is the relationship between MPI communication and socket communication?

Is network socket programming better than MPI?

I am a newbie at network socket programming.

I do not know which one is better for parallel/distributed computing.

As far as I know, a network socket is a Unix-style, file-based communication channel between a server and a client.

If sockets can also be used for parallel computing, how can MPI work better than them?

As far as I know, MPI is for homogeneous cluster systems, and network sockets are based on Internet TCP/IP.

Any help is really appreciated. 

Thanks   
_______________________________________________
users mailing list
users <at> open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
jody | 2 May 2011 10:34

Re: [OMPI users] problems with the -xterm option

Hi Ralph

I rebuilt Open MPI 1.4.2 with the debug option on both chefli and squid_0.
The results are interesting!

I wrote a small HelloMPI app which basically calls usleep for a pause
of 5 seconds.
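
For readers who want to reproduce this without compiling an MPI program, a rough shell stand-in for such a spinner is sketched here. It is not the HelloMPI source (which is not shown in this thread); the function name `spinner` and its argument are made up for illustration. It relies only on the OMPI_COMM_WORLD_RANK variable that mpirun exports, as seen in the printenv output quoted further down.

```shell
#!/bin/sh
# spinner: hypothetical stand-in for a "say hello, then pause" test program.
# Prints one line identifying the process, then sleeps so that an xterm
# opened for it stays visible for a while.
spinner() {
    pause="${1:-5}"                      # seconds to sleep, default 5
    rank="${OMPI_COMM_WORLD_RANK:-?}"    # set by mpirun; '?' when run by hand
    echo "Hello from rank ${rank} on $(uname -n)"
    sleep "$pause"
}
```

Saved as a script that ends with `spinner "$@"`, this could presumably be launched in place of ./HelloMPI in the mpirun calls shown here.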

Now calling it as i did before, no MPI errors appear anymore, only the
display problems:
  jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
  /usr/bin/xterm Xt error: Can't open display: localhost:10.0

When i do the same call *with* the debug option, the xterm appears and
shows the output of HelloMPI!
I attach the output in ompidbg_1.txt (it also works if i call with
'-np 4' and '--xterm 0,1,2,3').

Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).

If i use the hold-option, the xterm appears with the output of
'hostname' (cf. ompidbg_3.txt)
The xterm opens after the line "launch complete for job..." has been
written (line 59)

I just found that everything works as expected if i use the
'--leave-session-attached' option (without the debug options):
  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
./HelloMPI
The xterms are also opened if i do not use the '!' hold option.
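
The rank list passed to '--xterm' in the call above can be generated rather than typed. The following is only a sketch: `xterm_ranks` is a hypothetical helper name, and it assumes the list format used in this message (comma-separated ranks, with a trailing '!' to hold the windows open).

```shell
#!/bin/sh
# xterm_ranks: hypothetical helper that prints a rank list for --xterm.
#   $1 = number of processes (the value given to -np)
#   $2 = "hold" to append '!' so the windows stay open after exit
xterm_ranks() {
    n="$1"
    list=""
    i=0
    while [ "$i" -lt "$n" ]; do
        # add a comma only when the list already has an entry
        list="${list}${list:+,}${i}"
        i=$((i + 1))
    done
    if [ "${2:-}" = "hold" ]; then
        list="${list}!"
    fi
    printf '%s\n' "$list"
}
```

With this, the working call could be written as
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" --leave-session-attached --xterm "$(xterm_ranks 4 hold)" ./HelloMPI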

What does *not* work is
  jody <at> aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set
  xterm Xt error: Can't open display:
  xterm:  DISPLAY is not set

But then again, this call works (i.e. an xterm is opened) if all the
debug-options are used (ompidbg_4.txt).
Here the '--leave-session-attached' is necessary - without it, no xterm.

From these results i would say that there is no basic mishandling of
'ssh', though i have no idea
what internal differences the use of the '--leave-session-attached'
option or the debug options make.
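
One way to narrow this down might be to look at the DISPLAY value the launched processes actually inherit in each case. The sketch below is an assumption about how that could be checked, not something that was run for this thread; `report_display` is a hypothetical name.

```shell
#!/bin/sh
# report_display: hypothetical diagnostic, meant to be started by mpirun
# in place of xterm. Prints the DISPLAY value the launched process sees,
# or says that it is missing.
report_display() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "$(uname -n): DISPLAY=${DISPLAY}"
    else
        echo "$(uname -n): DISPLAY is not set"
    fi
}
```

Run once with and once without '--leave-session-attached', this would show whether the forwarded display set up by "ssh -Y" reaches the application processes.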

I hope these observations are helpful
  Jody

On Fri, Apr 29, 2011 at 12:08 AM, jody <jody.xha <at> gmail.com> wrote:
> Hi Ralph
>
> Thank you for your suggestions.
> I'll be happy to help  you.
> I'm not sure if i'll get around to this tomorrow,
> but i certainly will do so on Monday.
>
> Thanks
>  Jody
>
> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>> Hi Jody
>>
>> I'm not sure when I'll get a chance to work on this - got a deadline to
>> meet. I do have a couple of suggestions, if you wouldn't mind helping
>> debug the problem?
>>
>> It looks to me like the problem is that mpirun is crashing or terminating
>> early for some reason - hence the failures to send msgs to it, and the
>> "lifeline lost" error that leads to the termination of the daemon. If
>> you build a debug version of the code (i.e., --enable-debug on configure),
>> you can get a lot of debug info that traces the behavior.
>>
>> If you could then run your program with
>>
>>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>
>> and send it to me, we'll see what ORTE thinks it is doing.
>>
>> You could also take a look at the code for implementing the xterm option. You'll find it in
>>
>> orte/mca/odls/base/odls_base_default_fns.c
>>
>> around line 1115. The xterm command syntax is defined in
>>
>> orte/mca/odls/base/odls_base_open.c
>>
>> around line 233 and following. Note that we use "xterm -T" as the cmd.
>> Perhaps you can spot an error in the way we treat xterm?
>>
>> Also, remember that you have to specify that you want us to "hold" the
>> xterm window open even after the process terminates. If you don't specify
>> it, the window automatically closes upon completion of the process. So a
>> fast-running cmd like "hostname" might disappear so quickly that it causes
>> a race condition problem.
>>
>> You might want to try a spinner application - i.e., output something and
>> then sit in a loop or sleep for some period of time. Or, use the "hold"
>> option to keep the window open - you designate "hold" by putting a '!'
>> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>
>>
>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>
>>> Hi
>>>
>>> Unfortunately this does not solve my problem.
>>> While i can do
>>>  ssh -Y squid_0 xterm
>>> and this will open an xterm on my machine (chefli),
>>> i run into problems with the -xterm option of openmpi:
>>>
>>>  jody <at> chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>>> -Y" -host squid_0 --xterm 1 hostname
>>>  squid_0
>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>> lifeline [[35219,0],0] lost
>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>> lifeline [[35219,0],0] lost
>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>
>>> By the way when i look at the DISPLAY variable in the xterm window
>>> opened via squid_0,
>>> i also have the display variable "localhost:11.0"
>>>
>>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>>> the lines with the warnings about "xauth" and "untrusted X" do not
>>> appear:
>>>
>>>  jody <at> chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>  squid_0
>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>> lifeline [[34926,0],0] lost
>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>> [sd = 8]
>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>> lifeline [[34926,0],0] lost
>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>
>>>
>>> I have doubts that the "-Y" is passed correctly:
>>>   jody <at> triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>>> -Y" -host squid_0 xterm
>>>  xterm Xt error: Can't open display:
>>>  xterm:  DISPLAY is not set
>>>  xterm Xt error: Can't open display:
>>>  xterm:  DISPLAY is not set
>>>
>>>
>>> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>>> prints everything onto the console.
>>>
>>> Do you have any other suggestions i could try?
>>>
>>> Thank You
>>> Jody
>>>
>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>> Should be able to just set
>>>>
>>>> -mca plm_rsh_agent "ssh -Y"
>>>>
>>>> on your cmd line, I believe
>>>>
>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>
>>>>> Hi Ralph
>>>>>
>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>
>>>>> Thank You
>>>>>   Jody
>>>>>
>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody.xha <at> gmail.com> wrote:
>>>>>> Hi Ralph
>>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>> (X11Forwarding was set to 'no').
>>>>>>
>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>> and with 'ssh -X'
>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>
>>>>>> But the xterm option still doesn't work:
>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>> printenv | grep WORLD_RANK
>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>> [sd = 8]
>>>>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>> lifeline [[54132,0],0] lost
>>>>>>
>>>>>> So it looks like the two processes from squid_0 can't open the display this way,
>>>>>> but one of them writes the output to the console...
>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>
>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>
>>>>>> Thank You
>>>>>>  Jody
>>>>>>
>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>> Cygwin-specific in the answers:
>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>
>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>> already tried -Y. :-(
>>>>>>> Not sure what to suggest...
>>>>>>>
>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>>>>>> result:
>>>>>>>
>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote server
>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>
>>>>>>> When doing something like:
>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>>>>>> got an error message like:
>>>>>>>
>>>>>>>
>>>>>>> jason <at> badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>> [root <at> RHEL ~]#
>>>>>>> and any X programs I ran would not display on my local system..
>>>>>>>
>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>
>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>
>>>>>>> and that worked fine.
>>>>>>>
>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>> accommodate.
>>>>>>>
>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>
>>>>>>> Hi Ralph
>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>
>>>>>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>>>
>>>>>>>  jody <at> chefli ~/share/neander $ ssh -Y squid_0
>>>>>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>  jody <at> squid_0 ~ $ exit
>>>>>>>  logout
>>>>>>>
>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>> as with mpirun:
>>>>>>>
>>>>>>>  jody <at> chefli ~/share/neander $ ssh -X squid_0
>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>> generated
>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>
>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>
>>>>>>> Thank You
>>>>>>>  Jody
>>>>>>>
>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>
>>>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>>>>>> and hence will abort.
>>>>>>>
>>>>>>> Are you seeing mpirun still alive?
>>>>>>>
>>>>>>>
>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>> it works in "text-mode":
>>>>>>>
>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>  OMPI_COMM_WORLD_RANK=1
>>>>>>>  OMPI_COMM_WORLD_RANK=2
>>>>>>>  OMPI_COMM_WORLD_RANK=3
>>>>>>>
>>>>>>> but when i use the -xterm option to mpirun, it doesn't work
>>>>>>>
>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep WORLD_RANK
>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>> [sd = 8]
>>>>>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>>>>>>> lifeline [[55607,0],0] lost
>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>
>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>
>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>> the workstation,
>>>>>>> to the IP address of the workstation or simply to ":0.0", it doesn't work
>>>>>>>
>>>>>>> But i do have xauth data (as far as i know):
>>>>>>> On the remote (squid_0):
>>>>>>>
>>>>>>>  jody <at> squid_0 ~ $ xauth list
>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> on the workstation:
>>>>>>>
>>>>>>>  $ xauth list
>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>
>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>>> I have also done
>>>>>>>
>>>>>>>   xhost + squid_0
>>>>>>>
>>>>>>> on the workstation.
>>>>>>>
>>>>>>> How can i get the -xterm option running?
>>>>>>>
>>>>>>> Thank You
>>>>>>>  Jody
>>>>>>>
jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached  --xterm 0 ./HelloMPI
[chefli:02420] mca:base:select:(  plm) Querying component [rsh]
[chefli:02420] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[chefli:02420] mca:base:select:(  plm) Querying component [slurm]
[chefli:02420] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[chefli:02420] mca:base:select:(  plm) Selected component [rsh]
[chefli:02420] plm:base:set_hnp_name: initial bias 2420 nodename hash 72192778
[chefli:02420] plm:base:set_hnp_name: final jobfam 40499
[chefli:02420] [[40499,0],0] plm:base:receive start comm
[chefli:02420] mca:base:select:( odls) Querying component [default]
[chefli:02420] mca:base:select:( odls) Query of component [default] set priority to 1
[chefli:02420] mca:base:select:( odls) Selected component [default]
[chefli:02420] [[40499,0],0] plm:rsh: setting up job [40499,1]
[chefli:02420] [[40499,0],0] plm:base:setup_job for job [40499,1]
[chefli:02420] [[40499,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02420] [[40499,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02420] [[40499,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02420] [[40499,0],0] plm:rsh: final template argv:
	/usr/bin/ssh -Y -X <template>  orted -mca ess env -mca orte_ess_jobid 2654142464 -mca orte_ess_vpid
<template> -mca orte_ess_num_procs 2 --hnp-uri "2654142464.0;tcp://192.168.0.14:39093" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"
[chefli:02420] [[40499,0],0] plm:rsh: launching on node squid_0
[chefli:02420] [[40499,0],0] plm:rsh: recording launch of daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0  orted
-mca ess env -mca orte_ess_jobid 2654142464 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2654142464.0;tcp://192.168.0.14:39093" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm
0 -mca plm_rsh_agent "ssh -Y"]
[squid_0:19442] mca:base:select:( odls) Querying component [default]
[squid_0:19442] mca:base:select:( odls) Query of component [default] set priority to 1
[squid_0:19442] mca:base:select:( odls) Selected component [default]
[chefli:02420] [[40499,0],0] plm:base:daemon_callback
[chefli:02420] [[40499,0],0] plm:base:orted_report_launch from daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:orted_report_launch completed for daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:daemon_callback completed
[chefli:02420] [[40499,0],0] plm:base:launch_apps for job [40499,1]
[chefli:02420] [[40499,0],0] plm:base:report_launched for job [40499,1]
[chefli:02420] [[40499,0],0] odls:constructing child list
[chefli:02420] [[40499,0],0] odls:construct_child_list unpacking data to launch job [40499,1]
[chefli:02420] [[40499,0],0] odls:construct_child_list adding new jobdat for job [40499,1]
[chefli:02420] [[40499,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02420] [[40499,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[chefli:02420] [[40499,0],0] odls:construct:child: num_participating 1
[chefli:02420] [[40499,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false
[chefli:02420] [[40499,0],0] odls:launch reporting job [40499,1] launch status
[chefli:02420] [[40499,0],0] odls:launch setting waitpids
[chefli:02420] [[40499,0],0] plm:base:app_report_launch from daemon [[40499,0],0]
[chefli:02420] [[40499,0],0] plm:base:app_report_launch completed processing
[squid_0:19442] [[40499,0],1] odls:constructing child list
[squid_0:19442] [[40499,0],1] odls:construct_child_list unpacking data to launch job [40499,1]
[squid_0:19442] [[40499,0],1] odls:construct_child_list adding new jobdat for job [40499,1]
[squid_0:19442] [[40499,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19442] [[40499,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[squid_0:19442] [[40499,0],1] odls:constructing child list - found proc 0 for me!
[squid_0:19442] [[40499,0],1] odls:construct:child: num_participating 1
[squid_0:19442] [[40499,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false
[squid_0:19442] [[40499,0],1] odls:launch reporting job [40499,1] launch status
[squid_0:19442] [[40499,0],1] odls:launch setting waitpids
[chefli:02420] [[40499,0],0] plm:base:app_report_launch reissuing non-blocking recv
[chefli:02420] [[40499,0],0] plm:base:app_report_launch from daemon [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:app_report_launched for proc [[40499,1],0] from daemon
[[40499,0],1]: pid 19446 state 2 exit 0
[chefli:02420] [[40499,0],0] plm:base:app_report_launch completed processing
[chefli:02420] [[40499,0],0] plm:base:report_launched all apps reported
[chefli:02420] [[40499,0],0] plm:base:launch wiring up iof
[chefli:02420] [[40499,0],0] plm:base:launch completed for job [40499,1]
[squid_0:19442] [[40499,0],1] odls: registering sync on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls:sync nidmap requested for job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending sync ack to child [[40499,1],0] with 144 bytes of data
[squid_0:19442] [[40499,0],1] odls: sending contact info to HNP
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 2
num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 2
num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 15 on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1
num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1
num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 17 on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: collecting data from child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: executing collective
[squid_0:19442] [[40499,0],1] odls: daemon collective called
[squid_0:19442] [[40499,0],1] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1
num_collected 1 num_participating 1 num_contributors 1
[squid_0:19442] [[40499,0],1] odls: daemon collective not the HNP - sending to parent [[40499,0],0]
[squid_0:19442] [[40499,0],1] odls: collective completed
[chefli:02420] [[40499,0],0] odls: daemon collective called
[chefli:02420] [[40499,0],0] odls: daemon collective for job [40499,1] from [[40499,0],1] type 1
num_collected 1 num_participating 1 num_contributors 1
[chefli:02420] [[40499,0],0] odls: daemon collective HNP - xcasting to job [40499,1]
[squid_0:19442] [[40499,0],1] odls: sending message to tag 17 on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: registering sync on child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls: sending sync ack to child [[40499,1],0] with 0 bytes of data
[chefli:02420] [[40499,0],0] plm:base:receive got message from [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for job [40499,1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for proc [[40499,1],0] curnt
state 4 new state 80 exit_code 0
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,1] - num_terminated 1 
num_procs 1
[chefli:02420] [[40499,0],0] plm:base:check_job_completed declared job [40499,1] normally
terminated - checking all jobs
[chefli:02420] [[40499,0],0] plm:base:check_job_completed all jobs terminated - waking up
[chefli:02420] [[40499,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02420] [[40499,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,0] - num_terminated 1 
num_procs 2
[squid_0:19442] [[40499,0],1] odls:wait_local_proc child process 19446 terminated
[squid_0:19442] [[40499,0],1] odls:notify_iof_complete for child [[40499,1],0]
[squid_0:19442] [[40499,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody <at> squid_0_0/2654142465/0/abort
[chefli:02420] [[40499,0],0] plm:base:receive got message from [[40499,0],1]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for job [40499,0]
[chefli:02420] [[40499,0],0] plm:base:receive got update_proc_state for proc [[40499,0],1] curnt
state 4 new state 80 exit_code 0
[chefli:02420] [[40499,0],0] plm:base:check_job_completed for job [40499,0] - num_terminated 2 
num_procs 2
[chefli:02420] [[40499,0],0] plm:base:check_job_completed declared job [40499,0] normally
terminated - checking all jobs
[chefli:02420] [[40499,0],0] plm:base:receive stop comm
[squid_0:19442] [[40499,0],1] odls:waitpid_fired child process [[40499,1],0] terminated normally
[squid_0:19442] [[40499,0],1] odls:proc_complete reporting all procs in [40499,1] terminated
[squid_0:19442] [[40499,0],1] odls:kill_local_proc working on job [WILDCARD]
jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached  --xterm 0 hostname
[chefli:02476] mca:base:select:(  plm) Querying component [rsh]
[chefli:02476] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[chefli:02476] mca:base:select:(  plm) Querying component [slurm]
[chefli:02476] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[chefli:02476] mca:base:select:(  plm) Selected component [rsh]
[chefli:02476] plm:base:set_hnp_name: initial bias 2476 nodename hash 72192778
[chefli:02476] plm:base:set_hnp_name: final jobfam 40683
[chefli:02476] [[40683,0],0] plm:base:receive start comm
[chefli:02476] mca:base:select:( odls) Querying component [default]
[chefli:02476] mca:base:select:( odls) Query of component [default] set priority to 1
[chefli:02476] mca:base:select:( odls) Selected component [default]
[chefli:02476] [[40683,0],0] plm:rsh: setting up job [40683,1]
[chefli:02476] [[40683,0],0] plm:base:setup_job for job [40683,1]
[chefli:02476] [[40683,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02476] [[40683,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02476] [[40683,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02476] [[40683,0],0] plm:rsh: final template argv:
	/usr/bin/ssh -Y -X <template>  orted -mca ess env -mca orte_ess_jobid 2666201088 -mca orte_ess_vpid
<template> -mca orte_ess_num_procs 2 --hnp-uri "2666201088.0;tcp://192.168.0.14:53879" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0 -mca plm_rsh_agent "ssh -Y"
[chefli:02476] [[40683,0],0] plm:rsh: launching on node squid_0
[chefli:02476] [[40683,0],0] plm:rsh: recording launch of daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0  orted
-mca ess env -mca orte_ess_jobid 2666201088 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2666201088.0;tcp://192.168.0.14:53879" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm
0 -mca plm_rsh_agent "ssh -Y"]
[squid_0:19579] mca:base:select:( odls) Querying component [default]
[squid_0:19579] mca:base:select:( odls) Query of component [default] set priority to 1
[squid_0:19579] mca:base:select:( odls) Selected component [default]
[chefli:02476] [[40683,0],0] plm:base:daemon_callback
[chefli:02476] [[40683,0],0] plm:base:orted_report_launch from daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:orted_report_launch completed for daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:daemon_callback completed
[chefli:02476] [[40683,0],0] plm:base:launch_apps for job [40683,1]
[chefli:02476] [[40683,0],0] plm:base:report_launched for job [40683,1]
[chefli:02476] [[40683,0],0] odls:constructing child list
[chefli:02476] [[40683,0],0] odls:construct_child_list unpacking data to launch job [40683,1]
[chefli:02476] [[40683,0],0] odls:construct_child_list adding new jobdat for job [40683,1]
[chefli:02476] [[40683,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02476] [[40683,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[chefli:02476] [[40683,0],0] odls:construct:child: num_participating 1
[chefli:02476] [[40683,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false
[chefli:02476] [[40683,0],0] odls:launch reporting job [40683,1] launch status
[chefli:02476] [[40683,0],0] odls:launch setting waitpids
[chefli:02476] [[40683,0],0] plm:base:app_report_launch from daemon [[40683,0],0]
[chefli:02476] [[40683,0],0] plm:base:app_report_launch completed processing
[squid_0:19579] [[40683,0],1] odls:constructing child list
[squid_0:19579] [[40683,0],1] odls:construct_child_list unpacking data to launch job [40683,1]
[squid_0:19579] [[40683,0],1] odls:construct_child_list adding new jobdat for job [40683,1]
[squid_0:19579] [[40683,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19579] [[40683,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[squid_0:19579] [[40683,0],1] odls:constructing child list - found proc 0 for me!
[squid_0:19579] [[40683,0],1] odls:construct:child: num_participating 1
[squid_0:19579] [[40683,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false
[squid_0:19579] [[40683,0],1] odls:launch reporting job [40683,1] launch status
[squid_0:19579] [[40683,0],1] odls:launch setting waitpids
[chefli:02476] [[40683,0],0] plm:base:app_report_launch reissuing non-blocking recv
[chefli:02476] [[40683,0],0] plm:base:app_report_launch from daemon [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:app_report_launched for proc [[40683,1],0] from daemon
[[40683,0],1]: pid 19583 state 2 exit 0
[chefli:02476] [[40683,0],0] plm:base:app_report_launch completed processing
[chefli:02476] [[40683,0],0] plm:base:report_launched all apps reported
[chefli:02476] [[40683,0],0] plm:base:launch wiring up iof
[chefli:02476] [[40683,0],0] plm:base:launch completed for job [40683,1]
[squid_0:19579] [[40683,0],1] odls:wait_local_proc child process 19583 terminated
[squid_0:19579] [[40683,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody <at> squid_0_0/2666201089/0/abort
[squid_0:19579] [[40683,0],1] odls:waitpid_fired child process [[40683,1],0] terminated normally
[squid_0:19579] [[40683,0],1] odls:notify_iof_complete for child [[40683,1],0]
[chefli:02476] [[40683,0],0] plm:base:receive got message from [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for job [40683,1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for proc [[40683,1],0] curnt
state 2 new state 80 exit_code 0
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,1] - num_terminated 1 
num_procs 1
[chefli:02476] [[40683,0],0] plm:base:check_job_completed declared job [40683,1] normally
terminated - checking all jobs
[chefli:02476] [[40683,0],0] plm:base:check_job_completed all jobs terminated - waking up
[chefli:02476] [[40683,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02476] [[40683,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,0] - num_terminated 1 
num_procs 2
[squid_0:19579] [[40683,0],1] odls:proc_complete reporting all procs in [40683,1] terminated
[chefli:02476] [[40683,0],0] plm:base:receive got message from [[40683,0],1]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for job [40683,0]
[chefli:02476] [[40683,0],0] plm:base:receive got update_proc_state for proc [[40683,0],1] curnt
state 4 new state 80 exit_code 0
[chefli:02476] [[40683,0],0] plm:base:check_job_completed for job [40683,0] - num_terminated 2 
num_procs 2
[chefli:02476] [[40683,0],0] plm:base:check_job_completed declared job [40683,0] normally
terminated - checking all jobs
[chefli:02476] [[40683,0],0] plm:base:receive stop comm
[squid_0:19579] [[40683,0],1] odls:kill_local_proc working on job [WILDCARD]
jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached  --xterm 0! hostname
[chefli:02487] mca:base:select:(  plm) Querying component [rsh]
[chefli:02487] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[chefli:02487] mca:base:select:(  plm) Querying component [slurm]
[chefli:02487] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[chefli:02487] mca:base:select:(  plm) Selected component [rsh]
[chefli:02487] plm:base:set_hnp_name: initial bias 2487 nodename hash 72192778
[chefli:02487] plm:base:set_hnp_name: final jobfam 40688
[chefli:02487] [[40688,0],0] plm:base:receive start comm
[chefli:02487] mca:base:select:( odls) Querying component [default]
[chefli:02487] mca:base:select:( odls) Query of component [default] set priority to 1
[chefli:02487] mca:base:select:( odls) Selected component [default]
[chefli:02487] [[40688,0],0] plm:rsh: setting up job [40688,1]
[chefli:02487] [[40688,0],0] plm:base:setup_job for job [40688,1]
[chefli:02487] [[40688,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02487] [[40688,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02487] [[40688,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02487] [[40688,0],0] plm:rsh: final template argv:
	/usr/bin/ssh -Y -X <template>  orted -mca ess env -mca orte_ess_jobid 2666528768 -mca orte_ess_vpid
<template> -mca orte_ess_num_procs 2 --hnp-uri "2666528768.0;tcp://192.168.0.14:36402" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --xterm 0! -mca plm_rsh_agent "ssh -Y"
[chefli:02487] [[40688,0],0] plm:rsh: launching on node squid_0
[chefli:02487] [[40688,0],0] plm:rsh: recording launch of daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y -X squid_0  orted
-mca ess env -mca orte_ess_jobid 2666528768 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2666528768.0;tcp://192.168.0.14:36402" -mca plm_base_verbose 5 -mca odls_base_verbose 5 --xterm
0! -mca plm_rsh_agent "ssh -Y"]
[squid_0:19613] mca:base:select:( odls) Querying component [default]
[squid_0:19613] mca:base:select:( odls) Query of component [default] set priority to 1
[squid_0:19613] mca:base:select:( odls) Selected component [default]
[chefli:02487] [[40688,0],0] plm:base:daemon_callback
[chefli:02487] [[40688,0],0] plm:base:orted_report_launch from daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:orted_report_launch completed for daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:daemon_callback completed
[chefli:02487] [[40688,0],0] plm:base:launch_apps for job [40688,1]
[chefli:02487] [[40688,0],0] plm:base:report_launched for job [40688,1]
[chefli:02487] [[40688,0],0] odls:constructing child list
[chefli:02487] [[40688,0],0] odls:construct_child_list unpacking data to launch job [40688,1]
[chefli:02487] [[40688,0],0] odls:construct_child_list adding new jobdat for job [40688,1]
[chefli:02487] [[40688,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02487] [[40688,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[chefli:02487] [[40688,0],0] odls:construct:child: num_participating 1
[chefli:02487] [[40688,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false
[chefli:02487] [[40688,0],0] odls:launch reporting job [40688,1] launch status
[chefli:02487] [[40688,0],0] odls:launch setting waitpids
[chefli:02487] [[40688,0],0] plm:base:app_report_launch from daemon [[40688,0],0]
[chefli:02487] [[40688,0],0] plm:base:app_report_launch completed processing
[squid_0:19613] [[40688,0],1] odls:constructing child list
[squid_0:19613] [[40688,0],1] odls:construct_child_list unpacking data to launch job [40688,1]
[squid_0:19613] [[40688,0],1] odls:construct_child_list adding new jobdat for job [40688,1]
[squid_0:19613] [[40688,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:19613] [[40688,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[squid_0:19613] [[40688,0],1] odls:constructing child list - found proc 0 for me!
[squid_0:19613] [[40688,0],1] odls:construct:child: num_participating 1
[squid_0:19613] [[40688,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false
[squid_0:19613] [[40688,0],1] odls:launch reporting job [40688,1] launch status
[squid_0:19613] [[40688,0],1] odls:launch setting waitpids
[chefli:02487] [[40688,0],0] plm:base:app_report_launch reissuing non-blocking recv
[chefli:02487] [[40688,0],0] plm:base:app_report_launch from daemon [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:app_report_launched for proc [[40688,1],0] from daemon
[[40688,0],1]: pid 19617 state 2 exit 0
[chefli:02487] [[40688,0],0] plm:base:app_report_launch completed processing
[chefli:02487] [[40688,0],0] plm:base:report_launched all apps reported
[chefli:02487] [[40688,0],0] plm:base:launch wiring up iof
[chefli:02487] [[40688,0],0] plm:base:launch completed for job [40688,1]
[squid_0:19613] [[40688,0],1] odls:wait_local_proc child process 19617 terminated
[squid_0:19613] [[40688,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody <at> squid_0_0/2666528769/0/abort
[squid_0:19613] [[40688,0],1] odls:waitpid_fired child process [[40688,1],0] terminated normally
[squid_0:19613] [[40688,0],1] odls:notify_iof_complete for child [[40688,1],0]
[squid_0:19613] [[40688,0],1] odls:proc_complete reporting all procs in [40688,1] terminated
[chefli:02487] [[40688,0],0] plm:base:receive got message from [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for job [40688,1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for proc [[40688,1],0] curnt
state 2 new state 80 exit_code 0
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,1] - num_terminated 1 
num_procs 1
[chefli:02487] [[40688,0],0] plm:base:check_job_completed declared job [40688,1] normally
terminated - checking all jobs
[chefli:02487] [[40688,0],0] plm:base:check_job_completed all jobs terminated - waking up
[chefli:02487] [[40688,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02487] [[40688,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,0] - num_terminated 1 
num_procs 2
[chefli:02487] [[40688,0],0] plm:base:receive got message from [[40688,0],1]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for job [40688,0]
[chefli:02487] [[40688,0],0] plm:base:receive got update_proc_state for proc [[40688,0],1] curnt
state 4 new state 80 exit_code 0
[chefli:02487] [[40688,0],0] plm:base:check_job_completed for job [40688,0] - num_terminated 2 
num_procs 2
[chefli:02487] [[40688,0],0] plm:base:check_job_completed declared job [40688,0] normally
terminated - checking all jobs
[squid_0:19613] [[40688,0],1] odls:kill_local_proc working on job [WILDCARD]
[chefli:02487] [[40688,0],0] plm:base:receive stop comm
jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached  xterm
[chefli:02619] mca:base:select:(  plm) Querying component [rsh]
[chefli:02619] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[chefli:02619] mca:base:select:(  plm) Querying component [slurm]
[chefli:02619] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[chefli:02619] mca:base:select:(  plm) Selected component [rsh]
[chefli:02619] plm:base:set_hnp_name: initial bias 2619 nodename hash 72192778
[chefli:02619] plm:base:set_hnp_name: final jobfam 40316
[chefli:02619] [[40316,0],0] plm:base:receive start comm
[chefli:02619] mca:base:select:( odls) Querying component [default]
[chefli:02619] mca:base:select:( odls) Query of component [default] set priority to 1
[chefli:02619] mca:base:select:( odls) Selected component [default]
[chefli:02619] [[40316,0],0] plm:rsh: setting up job [40316,1]
[chefli:02619] [[40316,0],0] plm:base:setup_job for job [40316,1]
[chefli:02619] [[40316,0],0] plm:rsh: local shell: 0 (bash)
[chefli:02619] [[40316,0],0] plm:rsh: assuming same remote shell as local shell
[chefli:02619] [[40316,0],0] plm:rsh: remote shell: 0 (bash)
[chefli:02619] [[40316,0],0] plm:rsh: final template argv:
	/usr/bin/ssh -Y <template>  orted -mca ess env -mca orte_ess_jobid 2642149376 -mca orte_ess_vpid
<template> -mca orte_ess_num_procs 2 --hnp-uri "2642149376.0;tcp://192.168.0.14:57848" -mca
plm_base_verbose 5 -mca odls_base_verbose 5 -mca plm_rsh_agent "ssh -Y"
[chefli:02619] [[40316,0],0] plm:rsh: launching on node squid_0
[chefli:02619] [[40316,0],0] plm:rsh: recording launch of daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh -Y squid_0  orted
-mca ess env -mca orte_ess_jobid 2642149376 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
"2642149376.0;tcp://192.168.0.14:57848" -mca plm_base_verbose 5 -mca odls_base_verbose 5 -mca
plm_rsh_agent "ssh -Y"]
[squid_0:20023] mca:base:select:( odls) Querying component [default]
[squid_0:20023] mca:base:select:( odls) Query of component [default] set priority to 1
[squid_0:20023] mca:base:select:( odls) Selected component [default]
[chefli:02619] [[40316,0],0] plm:base:daemon_callback
[chefli:02619] [[40316,0],0] plm:base:orted_report_launch from daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:orted_report_launch completed for daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:daemon_callback completed
[chefli:02619] [[40316,0],0] plm:base:launch_apps for job [40316,1]
[chefli:02619] [[40316,0],0] plm:base:report_launched for job [40316,1]
[chefli:02619] [[40316,0],0] odls:constructing child list
[chefli:02619] [[40316,0],0] odls:construct_child_list unpacking data to launch job [40316,1]
[chefli:02619] [[40316,0],0] odls:construct_child_list adding new jobdat for job [40316,1]
[chefli:02619] [[40316,0],0] odls:construct_child_list unpacking 1 app_contexts
[chefli:02619] [[40316,0],0] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[chefli:02619] [[40316,0],0] odls:construct:child: num_participating 1
[chefli:02619] [[40316,0],0] odls:launch found 12 processors for 0 children and set oversubscribed to false
[chefli:02619] [[40316,0],0] odls:launch reporting job [40316,1] launch status
[chefli:02619] [[40316,0],0] odls:launch setting waitpids
[chefli:02619] [[40316,0],0] plm:base:app_report_launch from daemon [[40316,0],0]
[chefli:02619] [[40316,0],0] plm:base:app_report_launch completed processing
[squid_0:20023] [[40316,0],1] odls:constructing child list
[squid_0:20023] [[40316,0],1] odls:construct_child_list unpacking data to launch job [40316,1]
[squid_0:20023] [[40316,0],1] odls:construct_child_list adding new jobdat for job [40316,1]
[squid_0:20023] [[40316,0],1] odls:construct_child_list unpacking 1 app_contexts
[squid_0:20023] [[40316,0],1] odls:constructing child list - checking proc 0 on node 1 with daemon 1
[squid_0:20023] [[40316,0],1] odls:constructing child list - found proc 0 for me!
[squid_0:20023] [[40316,0],1] odls:construct:child: num_participating 1
[squid_0:20023] [[40316,0],1] odls:launch found 4 processors for 1 children and set oversubscribed to false
[chefli:02619] [[40316,0],0] plm:base:app_report_launch reissuing non-blocking recv
[chefli:02619] [[40316,0],0] plm:base:app_report_launch from daemon [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:app_report_launched for proc [[40316,1],0] from daemon
[[40316,0],1]: pid 20027 state 2 exit 0
[chefli:02619] [[40316,0],0] plm:base:app_report_launch completed processing
[chefli:02619] [[40316,0],0] plm:base:report_launched all apps reported
[chefli:02619] [[40316,0],0] plm:base:launch wiring up iof
[chefli:02619] [[40316,0],0] plm:base:launch completed for job [40316,1]
[squid_0:20023] [[40316,0],1] odls:launch reporting job [40316,1] launch status
[squid_0:20023] [[40316,0],1] odls:launch setting waitpids
[chefli:02619] [[40316,0],0] plm:base:receive got message from [[40316,0],1]
[squid_0:20023] [[40316,0],1] odls:wait_local_proc child process 20027 terminated
[squid_0:20023] [[40316,0],1] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-jody <at> squid_0_0/2642149377/0/abort
[squid_0:20023] [[40316,0],1] odls:waitpid_fired child process [[40316,1],0] terminated normally
[squid_0:20023] [[40316,0],1] odls:notify_iof_complete for child [[40316,1],0]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for job [40316,1]
[squid_0:20023] [[40316,0],1] odls:proc_complete reporting all procs in [40316,1] terminated
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for proc [[40316,1],0] curnt
state 2 new state 80 exit_code 0
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,1] - num_terminated 1 
num_procs 1
[chefli:02619] [[40316,0],0] plm:base:check_job_completed declared job [40316,1] normally
terminated - checking all jobs
[chefli:02619] [[40316,0],0] plm:base:check_job_completed all jobs terminated - waking up
[chefli:02619] [[40316,0],0] plm:base:orted_cmd sending orted_exit commands
[chefli:02619] [[40316,0],0] odls:kill_local_proc working on job [WILDCARD]
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,0] - num_terminated 1 
num_procs 2
[chefli:02619] [[40316,0],0] plm:base:receive got message from [[40316,0],1]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for job [40316,0]
[chefli:02619] [[40316,0],0] plm:base:receive got update_proc_state for proc [[40316,0],1] curnt
state 4 new state 80 exit_code 0
[chefli:02619] [[40316,0],0] plm:base:check_job_completed for job [40316,0] - num_terminated 2 
num_procs 2
[chefli:02619] [[40316,0],0] plm:base:check_job_completed declared job [40316,0] normally
terminated - checking all jobs
[chefli:02619] [[40316,0],0] plm:base:receive stop comm
[squid_0:20023] [[40316,0],1] odls:kill_local_proc working on job [WILDCARD]
Ralph Castain | 2 May 2011 14:29

Re: [OMPI users] problems with the -xterm option


On May 2, 2011, at 2:34 AM, jody wrote:

> Hi Ralph
> 
> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
> The results are interesting!
> 
> I wrote a small HelloMPI app which basically calls usleep for a pause
> of 5 seconds.
> 
> Now calling it as i did before, no MPI errors appear anymore, only the
> display problems:
>  jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
> 
> When i do the same call *with* the debug option, the xterm appears and
> shows the output of HelloMPI!
> I attach the output in ompidbg_1.txt (It also works if i call with
> '-np 4' and '--xterm 0,1,2,3').

Good!

> 
> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
> 
> If i use the hold-option, the xterm appears with the output of
> 'hostname' (cf. ompidbg_3.txt)
> The xterm opens after the line "launch complete for job..." has been
> written (line 59)

Okay, that's also expected. Like I said, without the "hold", the output is generated so quickly that the
window just flashes at best. I've had similar experiences - hence the "hold" option.
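
To make the '!' convention concrete, here is a minimal shell sketch of how an argument such as "0,1,2,3!" could be split into a rank list and a hold flag. The function name parse_xterm_spec is hypothetical and this is not the Open MPI parsing code; it only assumes the trailing-'!' form used in the commands in this thread.

```shell
# Hypothetical helper, not Open MPI code: split an --xterm argument such as
# "0,1,2,3!" into its rank list and a hold flag (trailing '!' means hold).
parse_xterm_spec() {
    spec=$1
    case "$spec" in
        *'!') hold=yes; ranks=${spec%?} ;;   # strip the trailing '!'
        *)    hold=no;  ranks=$spec ;;
    esac
    printf 'ranks=%s hold=%s\n' "$ranks" "$hold"
}

parse_xterm_spec '0,1,2,3!'   # prints: ranks=0,1,2,3 hold=yes
parse_xterm_spec '0'          # prints: ranks=0 hold=no
```

With hold=yes the window would be kept open after a fast command such as hostname has exited; with hold=no it closes as soon as the process ends.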

> 
> I just found that everything works as expected if i use the
> '--leave-session-attached' option (without the debug options):
>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
> ./HelloMPI
> The xterms are also opened if i do not use the '!' hold option.

Okay, I can understand why. The --leave-session-attached option just tells mpirun to not daemonize the
backend daemons - thus leaving the ssh session alive. The debug options do the same thing, but turn on all
the debug output.

The problem is that if you don't leave the ssh session alive, then the xterm has no way back to your screen. By
daemonizing, we sever that connection.

What I should do (and maybe used to do, but it got removed) is automatically turn "on" the
leave-session-attached option if you give --xterm. I can enter that patch.
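
Until such a patch is in, the same effect can be approximated on the user side. The following is only a sketch under my own assumptions: xterm_safe_cmdline is a hypothetical wrapper, not part of Open MPI, and it merely prints the command line it would use (quoting of arguments such as "ssh -Y" is lost in the printout).

```shell
# Hypothetical wrapper, not part of Open MPI: print an mpirun command line,
# adding --leave-session-attached whenever an xterm option is present.
xterm_safe_cmdline() {
    want_xterm=no
    attached=no
    for arg in "$@"; do
        case "$arg" in
            -xterm|--xterm) want_xterm=yes ;;
            -leave-session-attached|--leave-session-attached) attached=yes ;;
        esac
    done
    # Prepend the option only if xterm was requested and it is not already there.
    if [ "$want_xterm" = yes ] && [ "$attached" = no ]; then
        set -- --leave-session-attached "$@"
    fi
    printf 'mpirun'
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

xterm_safe_cmdline -np 4 -host squid_0 --xterm 0,1,2,3 ./HelloMPI
# prints: mpirun --leave-session-attached -np 4 -host squid_0 --xterm 0,1,2,3 ./HelloMPI
```

The option is prepended rather than appended so that it always lands before the executable name, where mpirun expects its own options.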

Note that this does limit the size of the launch to the number of ssh sessions the system allows you to have
open at the same time. We default to a limit of 128 nodes, which is likely adequate for an xterm-based
debugging session. However, you can increase it using an mca param (see ompi_info) to as high as the system allows.
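
For reference, a small sketch of how the rsh launcher's parameters could be listed. It assumes the 1.4-series syntax "ompi_info --param <type> <component>" and that ompi_info is on your PATH; the function name list_rsh_params is hypothetical.

```shell
# Hypothetical helper: list the MCA parameters of the rsh launcher if
# ompi_info is installed. Assumes the 1.4-series "--param <type> <component>" syntax.
list_rsh_params() {
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info --param plm rsh
    else
        echo "ompi_info not found on PATH"
    fi
}

list_rsh_params
```

In the output, look for the entry describing how many rsh/ssh agent instances are invoked concurrently; that value can then be set on the mpirun command line with -mca like any other parameter.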

Thanks for helping debug this! I'll add you to the patch list so you can track it.

> 
> What does *not* work is
>  jody <at> aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
> plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
>  xterm Xt error: Can't open display:
>  xterm:  DISPLAY is not set
> 
> But then again, this call works (i.e. an xterm is opened) if all the
> debug-options are used (ompidbg_4.txt).
> Here the '--leave-session-attached' is necessary - without it, no xterm.
> 
> From these results i would say that there is no basic mishandling of
> 'ssh', though i have no idea
> what internal differences the use of the '-leave-session-attached'
> option or the debug options make.
> 
> I hope these observations are helpful
>  Jody
> 
> 
> On Fri, Apr 29, 2011 at 12:08 AM, jody <jody.xha <at> gmail.com> wrote:
>> Hi Ralph
>> 
>> Thank you for your suggestions.
>> I'll be happy to help  you.
>> I'm not sure if i'll get around to this tomorrow,
>> but i certainly will do so on Monday.
>> 
>> Thanks
>>  Jody
>> 
>> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>> Hi Jody
>>> 
>>> I'm not sure when I'll get a chance to work on this - got a deadline to meet. I do have a couple of
>>> suggestions, if you wouldn't mind helping debug the problem?
>>> 
>>> It looks to me like the problem is that mpirun is crashing or terminating early for some reason - hence the
>>> failures to send msgs to it, and the "lifeline lost" error that leads to the termination of the daemon. If
>>> you build a debug version of the code (i.e., --enable-debug on configure), you can get a lot of debug info
>>> that traces the behavior.
>>> 
>>> If you could then run your program with
>>> 
>>>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>> 
>>> and send it to me, we'll see what ORTE thinks it is doing.
>>> 
>>> You could also take a look at the code for implementing the xterm option. You'll find it in
>>> 
>>> orte/mca/odls/base/odls_base_default_fns.c
>>> 
>>> around line 1115. The xterm command syntax is defined in
>>> 
>>> orte/mca/odls/base/odls_base_open.c
>>> 
>>> around line 233 and following. Note that we use "xterm -T" as the cmd. Perhaps you can spot an error in the
>>> way we treat xterm?
>>> 
>>> Also, remember that you have to specify that you want us to "hold" the xterm window open even after the
>>> process terminates. If you don't specify it, the window automatically closes upon completion of the
>>> process. So a fast-running cmd like "hostname" might disappear so quickly that it causes a race condition problem.
>>> 
>>> You might want to try a spinner application - i.e., output something and then sit in a loop or sleep for
>>> some period of time. Or, use the "hold" option to keep the window open - you designate "hold" by putting a '!'
>>> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>> 
>>> 
>>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>> 
>>>> Hi
>>>> 
>>>> Unfortunately this does not solve my problem.
>>>> While i can do
>>>>  ssh -Y squid_0 xterm
>>>> and this will open an xterm on my machine (chefli),
>>>> i run into problems with the -xterm option of openmpi:
>>>> 
>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>>>> -Y" -host squid_0 --xterm 1 hostname
>>>>  squid_0
>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>> [sd = 8]
>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>> lifeline [[35219,0],0] lost
>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>> [sd = 8]
>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>> lifeline [[35219,0],0] lost
>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>> 
>>>> By the way when i look at the DISPLAY variable in the xterm window
>>>> opened via squid_0,
>>>> i also have the display variable "localhost:11.0"
>>>> 
>>>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>>>> the lines with the warnings about "xauth" and "untrusted X" do not
>>>> appear:
>>>> 
>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>  squid_0
>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>> [sd = 8]
>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>> lifeline [[34926,0],0] lost
>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>> [sd = 8]
>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>> lifeline [[34926,0],0] lost
>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>> 
>>>> 
>>>> I have doubts that the "-Y" is passed correctly:
>>>>   jody <at> triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>>>> -Y" -host squid_0 xterm
>>>>  xterm Xt error: Can't open display:
>>>>  xterm:  DISPLAY is not set
>>>>  xterm Xt error: Can't open display:
>>>>  xterm:  DISPLAY is not set
>>>> 
>>>> 
>>>> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>>>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>>>> prints everything onto the console.
>>>> 
>>>> Do you have any other suggestions i could try?
>>>> 
>>>> Thank You
>>>> Jody
>>>> 
>>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>> Should be able to just set
>>>>> 
>>>>> -mca plm_rsh_agent "ssh -Y"
>>>>> 
>>>>> on your cmd line, I believe
>>>>> 
>>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>> 
>>>>>> Hi Ralph
>>>>>> 
>>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>> 
>>>>>> Thank You
>>>>>>   Jody
>>>>>> 
>>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody.xha <at> gmail.com> wrote:
>>>>>>> Hi Ralph
>>>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>>> (X11Forwarding was set to 'no').
>>>>>>> 
>>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>>> and with 'ssh -X'
>>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>> 
>>>>>>> But the xterm option still doesn't work:
>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>>> printenv | grep WORLD_RANK
>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>> [sd = 8]
>>>>>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>>> lifeline [[54132,0],0] lost
>>>>>>> 
>>>>>>> So it looks like the two processes from squid_0 can't open the display this way,
>>>>>>> but one of them writes the output to the console...
>>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
>>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>> 
>>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>> 
>>>>>>> Thank You
>>>>>>>  Jody
>>>>>>> 
>>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>>> Cygwin-specific in the answers:
>>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>> 
>>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>> 
>>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>>> already tried -Y. :-(
>>>>>>>> Not sure what to suggest...
>>>>>>>> 
>>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>> 
>>>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>>>>>>> result:
>>>>>>>> 
>>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote server
>>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>> 
>>>>>>>> When doing something like:
>>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>>>>>>> got an error message like:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> jason <at> badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>>> [root <at> RHEL ~]#
>>>>>>>> and any X programs I ran would not display on my local system..
>>>>>>>> 
>>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>> 
>>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>> 
>>>>>>>> and that worked fine.
>>>>>>>> 
>>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>>> accommodate.
>>>>>>>> 
>>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>> 
>>>>>>>> Hi Ralph
>>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>> 
>>>>>>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>>>> 
>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -Y squid_0
>>>>>>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>  jody <at> squid_0 ~ $ exit
>>>>>>>>  logout
>>>>>>>> 
>>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>>> as with mpirun:
>>>>>>>> 
>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -X squid_0
>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>>> generated
>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>> 
>>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>> 
>>>>>>>> Thank You
>>>>>>>>  Jody
>>>>>>>> 
>>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>> 
>>>>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>>>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>>>>>>> and hence will abort.
>>>>>>>> 
>>>>>>>> Are you seeing mpirun still alive?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>> 
>>>>>>>> Hi
>>>>>>>> 
>>>>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>>> it works in "text-mode":
>>>>>>>> 
>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>  OMPI_COMM_WORLD_RANK=1
>>>>>>>>  OMPI_COMM_WORLD_RANK=2
>>>>>>>>  OMPI_COMM_WORLD_RANK=3
>>>>>>>> 
>>>>>>>> but when i use the -xterm option to mpirun, it doesn't work:
>>>>>>>> 
>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep WORLD_RANK
>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>>>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to lifeline [[55607,0],0] lost
>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>> 
>>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>> 
>>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>>> the workstation, to the IP address of the workstation or simply to
>>>>>>>> ":0.0", it doesn't work.
>>>>>>>> 
>>>>>>>> But i do have xauth data (as far as i know):
>>>>>>>> 
>>>>>>>> On the remote (squid_0):
>>>>>>>> 
>>>>>>>>  jody <at> squid_0 ~ $ xauth list
>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>> 
>>>>>>>> on the workstation:
>>>>>>>> 
>>>>>>>>  $ xauth list
>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>> 
>>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>>>> 
>>>>>>>> I have also done
>>>>>>>> 
>>>>>>>>   xhost + squid_0
>>>>>>>> 
>>>>>>>> on the workstation.
>>>>>>>> 
>>>>>>>> How can i get the -xterm option running?
>>>>>>>> 
>>>>>>>> Thank You
>>>>>>>>  Jody
>>>>>>>> 
> <ompidbg_1.txt><ompidbg_2.txt><ompidbg_3.txt><ompidbg_4.txt>
Terry Dontje | 2 May 2011 14:34

Re: [OMPI users] OMPI vs. network socket communcation

On 04/30/2011 08:52 PM, Jack Bryan wrote:
> Hi, All:
>
> What is the relationship between MPI communication and socket 
> communication ?
>
MPI may use sockets to communicate between two processes.  Aside from 
that, the two are used for different purposes.
> Is the network socket programming better than MPI ?
It depends on what you are trying to do.  If you are writing a parallel 
program that may run in multiple environments, each with different 
high-performance protocols available for its use, then MPI is probably better.  
If you are looking to do simple client/server type programming, then 
socket programming might have an advantage.
>
> I am a newbie of network socket programming.
>
> I do not know which one is better for parallel/distributed computing ?
IMO MPI.
>
> I know that network socket is unix-based file communication between 
> server and client.
>
> If they can also be used for parallel computing, how MPI can work 
> better than them ?
There is a lot of stuff that MPI does behind the curtain to make a 
parallel application's life a lot easier.  As far as performance goes, MPI will 
not perform better than sockets if it is using sockets as the underlying 
transport.  However, the performance difference should be negligible, which 
makes all the other stuff MPI does for you a big win.
>
> I know MPI is for homogeneous cluster system and network socket is 
> based on internet TCP/IP.
What do you mean by homogeneous cluster?  There are some MPIs that can 
work among different platforms and even different OSes (though some 
initial setup may be necessary).
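
As a rough illustration of the first point (an MPI job can run over plain TCP sockets), here is a minimal sketch. It assumes an Open MPI build that includes the "tcp" transport; ./my_mpi_app is a hypothetical program name, and tcp_only_cmd is a made-up helper that only prints the command line rather than running it.

```shell
#!/bin/sh
# Print (do not run) an mpirun command that restricts Open MPI to its
# TCP-socket transport ("tcp") plus the loopback transport ("self").
# ./my_mpi_app is a hypothetical program name used only for illustration.
tcp_only_cmd() {
    np=$1
    shift
    echo "mpirun -np $np --mca btl tcp,self $*"
}

tcp_only_cmd 4 ./my_mpi_app
```

Running the printed command should make Open MPI carry the MPI traffic over TCP sockets, while the application code stays exactly the same.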

Hope this helps,

-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje <at> oracle.com

jody | 2 May 2011 15:56

Re: [OMPI users] problems with the -xterm option

Hi Ralph

Thank You for doing the fix.

Do you perhaps also have an idea what is going on when i try to start
xterm (or probably any other X application) on a remote host?
In this case it is not enough to specify the '--leave-session-attached' option.

These calls won't open any xterms:
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 1 xterm
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" --leave-session-attached xterm
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca odls_base_verbose 5 xterm
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca odls_base_verbose 5 --leave-session-attached xterm

But this will open the xterms:
  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y" -mca plm_base_verbose 1 --leave-session-attached xterm

Any verbosity level > 0 will open xterms, but with '-mca
plm_base_verbose 0' there are again no xterms.

Thank You
  Jody
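
The fix discussed in this thread (turning on --leave-session-attached automatically whenever --xterm is given) could be approximated on the user side with a small wrapper. This is a minimal sketch, assuming a POSIX shell; mpirun_xterm_cmd is a hypothetical helper name, not part of Open MPI, and it only prints the resulting command line so it can be checked before use.

```shell
#!/bin/sh
# Print an mpirun command line, adding --leave-session-attached when an
# xterm option is present and the flag was not already given.
# "mpirun_xterm_cmd" is a hypothetical helper name, not part of Open MPI.
# Note: arguments containing spaces (e.g. "ssh -Y") lose their quoting in
# the printed line, so re-quote them before pasting the command.
mpirun_xterm_cmd() {
    has_xterm=no
    has_attach=no
    for arg in "$@"; do
        case $arg in
            -xterm|--xterm) has_xterm=yes ;;
            -leave-session-attached|--leave-session-attached) has_attach=yes ;;
        esac
    done
    if [ "$has_xterm" = yes ] && [ "$has_attach" = no ]; then
        echo "mpirun --leave-session-attached $*"
    else
        echo "mpirun $*"
    fi
}

mpirun_xterm_cmd -np 4 -host squid_0 --xterm 0,1 ./HelloMPI
```

Command lines without an xterm option, or ones that already carry the flag, are printed unchanged.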

On Mon, May 2, 2011 at 2:29 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>
> On May 2, 2011, at 2:34 AM, jody wrote:
>
>> Hi Ralph
>>
>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
>> The results are interesting!
>>
>> I wrote a small HelloMPI app which basically calls usleep for a pause
>> of 5 seconds.
>>
>> Now calling it as i did before, no MPI errors appear anymore, only the
>> display problems:
>>  jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
>>
>> When i do the same call *with* the debug option, the xterm appears and
>> shows the output of HelloMPI!
>> I attach the output in ompidbg_1.txt (It also works if i call with
>> '-np 4' and '--xterm 0,1,2,3')
>
> Good!
>
>>
>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
>>
>> If i use the hold-option, the xterm appears with the output of
>> 'hostname' (cf. ompidbg_3.txt)
>> The xterm opens after the line "launch complete for job..." has been
>> written (line 59)
>
> Okay, that's also expected. Like I said, without the "hold", the output is generated so quickly that the
> window just flashes at best. I've had similar experiences - hence the "hold" option.
>
>>
>> I just found that everything works as expected if i use the
>> '--leave-session-attached' option (without the debug options):
>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>> ./HelloMPI
>> The xterms are also opened if i do not use the '!' hold option.
>
> Okay, I can understand why. The --leave-session-attached option just tells mpirun to not daemonize the
> backend daemons - thus leaving the ssh session alive. The debug options do the same thing, but turn on all
> the debug output.
>
> The problem is that if you don't leave the ssh session alive, then the xterm has no way back to your screen. By
> daemonizing, we sever that connection.
>
> What I should do (and maybe used to do, but it got removed) is automatically turn "on" the
> leave-session-attached option if you give --xterm. I can enter that patch.
>
> Note that this does limit the size of the launch to the number of ssh sessions the system allows you to have
> open at the same time. We default to a limit of 128 nodes, which is likely adequate for an xterm-based
> debugging session. However, you can increase it using an mca param (see ompi_info) to as high as the system allows.
>
> Thanks for helping debug this! I'll add you to the patch list so you can track it.
>
>>
>> What does *not* work is
>>  jody <at> aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
>> plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>  xterm Xt error: Can't open display:
>>  xterm:  DISPLAY is not set
>>
>> But then again, this call works (i.e. an xterm is opened) if all the
>> debug-options are used (ompidbg_4.txt).
>> Here the '--leave-session-attached' is necessary - without it, no xterm.
>>
>> From these results i would say that there is no basic mishandling of
>> 'ssh', though i have no idea
>> what internal differences the use of the '-leave-session-attached'
>> option or the debug options make.
>>
>> I hope these observations are helpful
>>  Jody
>>
>>
>> On Fri, Apr 29, 2011 at 12:08 AM, jody <jody.xha <at> gmail.com> wrote:
>>> Hi Ralph
>>>
>>> Thank you for your suggestions.
>>> I'll be happy to help  you.
>>> I'm not sure if i'll get around to this tomorrow,
>>> but i certainly will do so on Monday.
>>>
>>> Thanks
>>>  Jody
>>>
>>> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>> Hi Jody
>>>>
>>>> I'm not sure when I'll get a chance to work on this - got a deadline to meet. I do have a couple of
>>>> suggestions, if you wouldn't mind helping debug the problem?
>>>>
>>>> It looks to me like the problem is that mpirun is crashing or terminating early for some reason - hence
>>>> the failures to send msgs to it, and the "lifeline lost" error that leads to the termination of the daemon.
>>>> If you build a debug version of the code (i.e., --enable-debug on configure), you can get a lot of debug info
>>>> that traces the behavior.
>>>>
>>>> If you could then run your program with
>>>>
>>>>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>>>
>>>> and send it to me, we'll see what ORTE thinks it is doing.
>>>>
>>>> You could also take a look at the code for implementing the xterm option. You'll find it in
>>>>
>>>> orte/mca/odls/base/odls_base_default_fns.c
>>>>
>>>> around line 1115. The xterm command syntax is defined in
>>>>
>>>> orte/mca/odls/base/odls_base_open.c
>>>>
>>>> around line 233 and following. Note that we use "xterm -T" as the cmd. Perhaps you can spot an error in the
>>>> way we treat xterm?
>>>>
>>>> Also, remember that you have to specify that you want us to "hold" the xterm window open even after the
>>>> process terminates. If you don't specify it, the window automatically closes upon completion of the
>>>> process. So a fast-running cmd like "hostname" might disappear so quickly that it causes a race condition problem.
>>>>
>>>> You might want to try a spinner application - i.e., output something and then sit in a loop or sleep for
>>>> some period of time. Or, use the "hold" option to keep the window open - you designate "hold" by putting a '!'
>>>> before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>>>
>>>>
>>>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Unfortunately this does not solve my problem.
>>>>> While i can do
>>>>>  ssh -Y squid_0 xterm
>>>>> and this will open an xterm on my machine (chefli),
>>>>> i run into problems with the -xterm option of openmpi:
>>>>>
>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>>>>> -Y" -host squid_0 --xterm 1 hostname
>>>>>  squid_0
>>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>> [sd = 8]
>>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>>> lifeline [[35219,0],0] lost
>>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>> [sd = 8]
>>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>>> lifeline [[35219,0],0] lost
>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>
>>>>> By the way when i look at the DISPLAY variable in the xterm window
>>>>> opened via squid_0,
>>>>> i also have the display variable "localhost:11.0"
>>>>>
>>>>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>>>>> the lines with the warnings about "xauth" and "untrusted X" do not
>>>>> appear:
>>>>>
>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>  squid_0
>>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>> [sd = 8]
>>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>>> lifeline [[34926,0],0] lost
>>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>> [sd = 8]
>>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>>> lifeline [[34926,0],0] lost
>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>
>>>>>
>>>>> I have doubts that the "-Y" is passed correctly:
>>>>>   jody <at> triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>>>>> -Y" -host squid_0 xterm
>>>>>  xterm Xt error: Can't open display:
>>>>>  xterm:  DISPLAY is not set
>>>>>  xterm Xt error: Can't open display:
>>>>>  xterm:  DISPLAY is not set
>>>>>
>>>>>
>>>>> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>>>>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>>>>> prints everything onto the console.
>>>>>
>>>>> Do you have any other suggestions i could try?
>>>>>
>>>>> Thank You
>>>>> Jody
>>>>>
>>>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>> Should be able to just set
>>>>>>
>>>>>> -mca plm_rsh_agent "ssh -Y"
>>>>>>
>>>>>> on your cmd line, I believe
>>>>>>
>>>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>>>
>>>>>>> Hi Ralph
>>>>>>>
>>>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>>>
>>>>>>> Thank You
>>>>>>>   Jody
>>>>>>>
>>>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody.xha <at> gmail.com> wrote:
>>>>>>>> Hi Ralph
>>>>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>>>> (X11Forwarding was set to 'no').
>>>>>>>>
>>>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>>>> and with 'ssh -X'
>>>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>>>
>>>>>>>> But the xterm option still doesn't work:
>>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>>>> printenv | grep WORLD_RANK
>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>> [sd = 8]
>>>>>>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>>>> lifeline [[54132,0],0] lost
>>>>>>>>
>>>>>>>> So it looks like the two processes from squid_0 can't open the display this way,
>>>>>>>> but one of them writes the output to the console...
>>>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
>>>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>>>
>>>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>>>
>>>>>>>> Thank You
>>>>>>>>  Jody
>>>>>>>>
>>>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>>>> Cygwin-specific in the answers:
>>>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>>>
>>>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>>>> already tried -Y. :-(
>>>>>>>>> Not sure what to suggest...
>>>>>>>>>
>>>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>>>
>>>>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>>>>>>>> result:
>>>>>>>>>
>>>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote server
>>>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>>>
>>>>>>>>> When doing something like:
>>>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>>>>>>>> got an error message like:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> jason <at> badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>>>> [root <at> RHEL ~]#
>>>>>>>>> and any X programs I ran would not display on my local system..
>>>>>>>>>
>>>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>>>
>>>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>>>
>>>>>>>>> and that worked fine.
>>>>>>>>>
>>>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>>>> accommodate.
>>>>>>>>>
>>>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>>>
>>>>>>>>> Hi Ralph
>>>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>>>
>>>>>>>>> But i also noticed that it is to ssh into squid_0 and open a xterm there:
>>>>>>>>>
>>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -Y squid_0
>>>>>>>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>  jody <at> squid_0 ~ $ exit
>>>>>>>>>  logout
>>>>>>>>>
>>>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>>>> as with mpirun:
>>>>>>>>>
>>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -X squid_0
>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>>>> generated
>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>>>
>>>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>>>
>>>>>>>>> Thank You
>>>>>>>>>  Jody
>>>>>>>>>
>>>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>>>>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>>>>>>>> and hence will abort.
>>>>>>>>>
>>>>>>>>> Are you seeing mpirun still alive?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>>>>
>>>>>>>>> it works in "text-mode":
>>>>>>>>>
>>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>>>>>>>
>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>
>>>>>>>>>  OMPI_COMM_WORLD_RANK=1
>>>>>>>>>
>>>>>>>>>  OMPI_COMM_WORLD_RANK=2
>>>>>>>>>
>>>>>>>>>  OMPI_COMM_WORLD_RANK=3
>>>>>>>>>
>>>>>>>>> but when i use  the -xterm option to mpirun, it doesn't work
>>>>>>>>>
>>>>>>>>> $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep
>>>>>>>>> WORLD_RANK
>>>>>>>>>
>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>>>> generated
>>>>>>>>>
>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>
>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>
>>>>>>>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>>>>>>>>>
>>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>>>
>>>>>>>>> [sd = 8]
>>>>>>>>>
>>>>>>>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>>>>>>>>>
>>>>>>>>> lifeline [[55607,0],0] lost
>>>>>>>>>
>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>
>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>
>>>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>>>
>>>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>>>>
>>>>>>>>> the workstation,
>>>>>>>>>
>>>>>>>>> to the IP-Adress of the workstation or simply to ":0.0", it doesn't work
>>>>>>>>>
>>>>>>>>> But i do have xauth data (as far as i know):
>>>>>>>>>
>>>>>>>>> On the remote (squid_0):
>>>>>>>>>
>>>>>>>>>  jody <at> squid_0 ~ $ xauth list
>>>>>>>>>
>>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>>
>>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>
>>>>>>>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>
>>>>>>>>> on the workstation:
>>>>>>>>>
>>>>>>>>>  $ xauth list
>>>>>>>>>
>>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>>
>>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>
>>>>>>>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
>>>>>>>>>
>>>>>>>>> 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>
>>>>>>>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>
>>>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>>>>>
>>>>>>>>> I have also done
>>>>>>>>>
>>>>>>>>>   xhost + squid_0
>>>>>>>>>
>>>>>>>>> on the workstation.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> How can i get the -xterm option running?
>>>>>>>>>
>>>>>>>>> Thank You
>>>>>>>>>
>>>>>>>>>  Jody
>>>>>>>>>
>> <ompidbg_1.txt><ompidbg_2.txt><ompidbg_3.txt><ompidbg_4.txt>
>
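
The DISPLAY values quoted in this thread fall into a few recognizable shapes. Below is a small sketch for telling them apart, assuming the usual convention that ssh X11 forwarding sets DISPLAY to something like localhost:10.0; classify_display is a hypothetical helper for illustration, not an Open MPI or ssh tool.

```shell
#!/bin/sh
# Classify a DISPLAY value the way it shows up in this thread.
# "classify_display" is a hypothetical helper for illustration only.
classify_display() {
    case $1 in
        "")          echo "unset" ;;          # xterm: DISPLAY is not set
        localhost:*) echo "ssh-forwarded" ;;  # e.g. localhost:10.0
        :*)          echo "local" ;;          # e.g. :0.0
        *:*)         echo "remote-tcp" ;;     # e.g. chefli.uzh.ch:0.0
        *)           echo "malformed" ;;
    esac
}

classify_display "localhost:10.0"
classify_display "chefli.uzh.ch:0.0"
classify_display ""
```

A value of the "remote-tcp" kind bypasses the ssh tunnel, so it typically only works if the X server on the workstation accepts TCP connections.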
Ralph Castain | 2 May 2011 16:08

Re: [OMPI users] problems with the -xterm option


On May 2, 2011, at 7:56 AM, jody wrote:

> Hi Ralph
> 
> Thank You for doing the fix.
> 
> Do you perhaps also have an idea what is going on when i try to start
> xterm (or probably an other X application) on a remote host?
> In this case it is not enough to specify the '--leave-session-attached' option.
> 
> These calls won't open any xterms
>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
> plm_base_verbose 1 xterm
>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"
> --leave-session-attached xterm
>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
> odls_base_verbose 5 xterm
>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
> odls_base_verbose 5 --leave-session-attached xterm
> 
> But this will open the xterms:
>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
> plm_base_verbose 1  --leave-session-attached xterm
> 
> Any verbosity level > 0 will open xterms, but with ' -mca
> plm_base_verbose 0' there are again no xterms.
> 

No earthly idea...this seems to contradict what you had below. You said you were seeing the xterms with this
cmd line:

>>> I just found that everything works as expected if i use the
>>> '--leave-session-attached' option (without the debug options):
>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>> ./HelloMPI
>>> The xterms are also opened if i do not use the '!' hold option.
>> 

Did I miss something?

> Thank You
>  Jody
> 
> On Mon, May 2, 2011 at 2:29 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>> 
>> On May 2, 2011, at 2:34 AM, jody wrote:
>> 
>>> Hi Ralph
>>> 
>>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
>>> The results are interesting!
>>> 
>>> I wrote a small HelloMPI app which basically calls usleep for a pause
>>> of 5 seconds.
>>> 
>>> Now calling it as i did before, no MPI errors appear anymore, only the
>>> display problems:
>>>  jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
>>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>>>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
>>> 
>>> When i do the same call *with* the debug option, the xterm appears and
>>> shows the output of HelloMPI!
>>> I attach the output in ompidbg_1.txt (It also works if i call with
>>> '-np 4' and '--xterm 0,1,2,3')
>> 
>> Good!
>> 
>>> 
>>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
>>> 
>>> If i use the hold-option, the xterm appears with the output of
>>> 'hostname' (cf. ompidbg_3.txt)
>>> The xterm opens after the line "launch complete for job..." has been
>>> written (line 59)
>>
>> Okay, that's also expected. Like I said, without the "hold", the output is generated so quickly that the
>> window just flashes at best. I've had similar experiences - hence the "hold" option.
>> 
>>> 
>>> I just found that everything works as expected if i use the
>>> '--leave-session-attached' option (without the debug options):
>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>> ./HelloMPI
>>> The xterms are also opened if i do not use the '!' hold option.
>> 
>> Okay, I can understand why. The --leave-session-attached option just tells mpirun to not daemonize the
>> backend daemons - thus leaving the ssh session alive. The debug options do the same thing, but turn on all
>> the debug output.
>>
>> The problem is that if you don't leave the ssh session alive, then the xterm has no way back to your screen.
>> By daemonizing, we sever that connection.
>>
>> What I should do (and maybe used to do, but it got removed) is automatically turn "on" the
>> leave-session-attached option if you give --xterm. I can enter that patch.
>>
>> Note that this does limit the size of the launch to the number of ssh sessions the system allows you to have
>> open at the same time. We default to a limit of 128 nodes, which is likely adequate for an xterm-based
>> debugging session. However, you can increase it using an mca param (see ompi_info) to as high as the system allows.
>> 
>> Thanks for helping debug this! I'll add you to the patch list so you can track it.
>> 
>>> 
>>> What does *not* work is
>>>  jody <at> aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
>>>  xterm Xt error: Can't open display:
>>>  xterm:  DISPLAY is not set
>>>  xterm Xt error: Can't open display:
>>>  xterm:  DISPLAY is not set
>>> 
>>> But then again, this call works (i.e. an xterm is opened) if all the
>>> debug-options are used (ompidbg_4.txt).
>>> Here the '--leave-session-attached' is necessary - without it, no xterm.
>>> 
>>> From these results i would say that there is no basic mishandling of
>>> 'ssh', though i have no idea
>>> what internal differences the use of the '-leave-session-attached'
>>> option or the debug options make.
>>> 
>>> I hope these observations are helpful
>>>  Jody
>>> 
>>> 
>>> On Fri, Apr 29, 2011 at 12:08 AM, jody <jody.xha <at> gmail.com> wrote:
>>>> Hi Ralph
>>>> 
>>>> Thank you for your suggestions.
>>>> I'll be happy to help  you.
>>>> I'm not sure if i'll get around to this tomorrow,
>>>> but i certainly will do so on Monday.
>>>> 
>>>> Thanks
>>>>  Jody
>>>> 
>>>> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>> Hi Jody
>>>>> 
>>>>> I'm not sure when I'll get a chance to work on this - got a deadline to meet. I do have a couple of
>>>>> suggestions, if you wouldn't mind helping debug the problem?
>>>>>
>>>>> It looks to me like the problem is that mpirun is crashing or terminating early for some reason - hence
>>>>> the failures to send msgs to it, and the "lifeline lost" error that leads to the termination of the daemon.
>>>>> If you build a debug version of the code (i.e., --enable-debug on configure), you can get a lot of debug info
>>>>> that traces the behavior.
>>>>> 
>>>>> If you could then run your program with
>>>>> 
>>>>>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>>>> 
>>>>> and send it to me, we'll see what ORTE thinks it is doing.
>>>>> 
>>>>> You could also take a look at the code for implementing the xterm option. You'll find it in
>>>>> 
>>>>> orte/mca/odls/base/odls_base_default_fns.c
>>>>> 
>>>>> around line 1115. The xterm command syntax is defined in
>>>>> 
>>>>> orte/mca/odls/base/odls_base_open.c
>>>>> 
>>>>> around line 233 and following. Note that we use "xterm -T" as the cmd. Perhaps you can spot an error in the way we treat xterm?
>>>>> 
>>>>> Also, remember that you have to specify that you want us to "hold" the xterm window open even after the process terminates. If you don't specify it, the window automatically closes upon completion of the process. So a fast-running cmd like "hostname" might disappear so quickly that it causes a race condition problem.
>>>>> 
>>>>> You might want to try a spinner application - i.e., output something and then sit in a loop or sleep for some period of time. Or, use the "hold" option to keep the window open - you designate "hold" by putting a '!' before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>>>> 
>>>>> 
>>>>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>>>> 
>>>>>> Hi
>>>>>> 
>>>>>> Unfortunately this does not solve my problem.
>>>>>> While i can do
>>>>>>  ssh -Y squid_0 xterm
>>>>>> and this will open an xterm on my machine (chefli),
>>>>>> i run into problems with the -xterm option of openmpi:
>>>>>> 
>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>>>>>> -Y" -host squid_0 --xterm 1 hostname
>>>>>>  squid_0
>>>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>> [sd = 8]
>>>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>>>> lifeline [[35219,0],0] lost
>>>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>> [sd = 8]
>>>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>>>> lifeline [[35219,0],0] lost
>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>> 
>>>>>> By the way when i look at the DISPLAY variable in the xterm window
>>>>>> opened via squid_0,
>>>>>> i also have the display variable "localhost:11.0"
>>>>>> 
>>>>>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>>>>>> the lines with the warnings about "xauth" and "untrusted X" do not
>>>>>> appear:
>>>>>> 
>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>  squid_0
>>>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>> [sd = 8]
>>>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>>>> lifeline [[34926,0],0] lost
>>>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>> [sd = 8]
>>>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>>>> lifeline [[34926,0],0] lost
>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>> 
>>>>>> 
>>>>>> I have doubts that the "-Y" is passed correctly:
>>>>>>   jody <at> triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>>>>>> -Y" -host squid_0 xterm
>>>>>>  xterm Xt error: Can't open display:
>>>>>>  xterm:  DISPLAY is not set
>>>>>>  xterm Xt error: Can't open display:
>>>>>>  xterm:  DISPLAY is not set
>>>>>> 
>>>>>> 
>>>>>> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>>>>>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>>>>>> prints everything onto the console.
>>>>>> 
>>>>>> Do you have any other suggestions i could try?
>>>>>> 
>>>>>> Thank You
>>>>>> Jody
>>>>>> 
>>>>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>> Should be able to just set
>>>>>>> 
>>>>>>> -mca plm_rsh_agent "ssh -Y"
>>>>>>> 
>>>>>>> on your cmd line, I believe
>>>>>>> 
>>>>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>>>> 
>>>>>>>> Hi Ralph
>>>>>>>> 
>>>>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>>>> 
>>>>>>>> Thank You
>>>>>>>>   Jody
>>>>>>>> 
>>>>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody.xha <at> gmail.com> wrote:
>>>>>>>>> Hi Ralph
>>>>>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>>>>> (X11Forwarding was set to 'no').
>>>>>>>>> 
>>>>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>>>>> and with 'ssh -X'
>>>>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>>>> 
>>>>>>>>> But the xterm option still doesn't work:
>>>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>>>>> printenv | grep WORLD_RANK
>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>>> [sd = 8]
>>>>>>>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>>>>> lifeline [[54132,0],0] lost
>>>>>>>>> 
>>>>>>>>> So it looks like the two processes from squid_0 can't open the display this way,
>>>>>>>>> but one of them writes the output to the console...
>>>>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
>>>>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>>>> 
>>>>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>>>> 
>>>>>>>>> Thank You
>>>>>>>>>  Jody
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>>>>> Cygwin-specific in the answers:
>>>>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>>>> 
>>>>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>>>> 
>>>>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>>>>> already tried -Y. :-(
>>>>>>>>>> Not sure what to suggest...
>>>>>>>>>> 
>>>>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>>>> 
>>>>>>>>>> Like I said, I'm not expert. However, a quick "google" revealed this
>>>>>>>>>> result:
>>>>>>>>>> 
>>>>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote server
>>>>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>>>> 
>>>>>>>>>> When doing something like:
>>>>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>>>>>>>>> got an error message like:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> jason <at> badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>>>>> [root <at> RHEL ~]#
>>>>>>>>>> and any X programs I ran would not display on my local system..
>>>>>>>>>> 
>>>>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>>>> 
>>>>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>>>> 
>>>>>>>>>> and that worked fine.
>>>>>>>>>> 
>>>>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>>>>> accommodate.
>>>>>>>>>> 
>>>>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Ralph
>>>>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>>>> 
>>>>>>>>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>>>>>> 
>>>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -Y squid_0
>>>>>>>>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>>  jody <at> squid_0 ~ $ exit
>>>>>>>>>>  logout
>>>>>>>>>> 
>>>>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>>>>> as with mpirun:
>>>>>>>>>> 
>>>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -X squid_0
>>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>>>>> generated
>>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>>>> 
>>>>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>>>> 
>>>>>>>>>> Thank You
>>>>>>>>>>  Jody
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>>>>>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>>>>>>>>> and hence will abort.
>>>>>>>>>> 
>>>>>>>>>> Are you seeing mpirun still alive?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi
>>>>>>>>>> 
>>>>>>>>>> On my workstation and  the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>>>>> it works in "text-mode":
>>>>>>>>>> 
>>>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>>  OMPI_COMM_WORLD_RANK=1
>>>>>>>>>>  OMPI_COMM_WORLD_RANK=2
>>>>>>>>>>  OMPI_COMM_WORLD_RANK=3
>>>>>>>>>> 
>>>>>>>>>> but when i use  the -xterm option to mpirun, it doesn't work
>>>>>>>>>> 
>>>>>>>>>> $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep
>>>>>>>>>> WORLD_RANK
>>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>>>>> generated
>>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0]
>>>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>>>> [sd = 8]
>>>>>>>>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to
>>>>>>>>>> lifeline [[55607,0],0] lost
>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>> 
>>>>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>>>> 
>>>>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>>>>> the workstation,
>>>>>>>>>> to the IP-Adress of the workstation or simply to ":0.0", it doesn't work
>>>>>>>>>> 
>>>>>>>>>> But i do have xauth data (as far as i know):
>>>>>>>>>> 
>>>>>>>>>> On the remote (squid_0):
>>>>>>>>>> 
>>>>>>>>>>  jody <at> squid_0 ~ $ xauth list
>>>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>> 
>>>>>>>>>> on the workstation:
>>>>>>>>>> 
>>>>>>>>>>  $ xauth list
>>>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1
>>>>>>>>>> 146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>> 
>>>>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'
>>>>>>>>>> 
>>>>>>>>>> I have also done
>>>>>>>>>> 
>>>>>>>>>>   xhost + squid_0
>>>>>>>>>> 
>>>>>>>>>> on the workstation.
>>>>>>>>>> 
>>>>>>>>>> How can i get the -xterm option running?
>>>>>>>>>> 
>>>>>>>>>> Thank You
>>>>>>>>>>  Jody
>>>>>>>>>> 
>>> <ompidbg_1.txt><ompidbg_2.txt><ompidbg_3.txt><ompidbg_4.txt>
jody | 2 May 2011 16:21

Re: [OMPI users] problems with the -xterm option

Hi
Well, the difference is that in one case i call the application
'HelloMPI' with the '--xterm' option,
whereas in my previous mail i was calling the application 'xterm' itself
(without the '--xterm' option).
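
To make the difference concrete, here is a minimal dry-run sketch. It only prints the two command lines quoted in this thread instead of running them; the function names and the HOST/AGENT variables are made up for illustration and are not part of Open MPI.

```shell
#!/bin/sh
# Dry-run sketch: nothing is launched, the functions only print the
# mpirun command lines. HOST, AGENT and the function names are
# illustrative assumptions; the options are the ones used in this thread.
HOST="squid_0"
AGENT="ssh -Y"

# Case 1: mpirun is asked to wrap the listed ranks of ./HelloMPI in xterms.
cmd_with_xterm_option() {
    printf '%s\n' "mpirun -np 4 -host $HOST -mca plm_rsh_agent \"$AGENT\" --leave-session-attached --xterm 0,1,2,3 ./HelloMPI"
}

# Case 2: xterm itself is the launched application; no --xterm option.
cmd_with_xterm_as_app() {
    printf '%s\n' "mpirun -np 4 -host $HOST -mca plm_rsh_agent \"$AGENT\" --leave-session-attached xterm"
}

cmd_with_xterm_option
cmd_with_xterm_as_app
```

In the first case mpirun starts the xterm windows on behalf of the ranks; in the second case each rank simply is an xterm process that needs a usable DISPLAY of its own.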

Jody

On Mon, May 2, 2011 at 4:08 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>
> On May 2, 2011, at 7:56 AM, jody wrote:
>
>> Hi Ralph
>>
>> Thank You for doing the fix.
>>
>> Do you perhaps also have an idea what is going on when i try to start
>> xterm (or probably another X application) on a remote host?
>> In this case it is not enough to specify the '--leave-session-attached' option.
>>
>> These calls won't open any xterms
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>> plm_base_verbose 1 xterm
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"
>> --leave-session-attached xterm
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>> odls_base_verbose 5 xterm
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>> odls_base_verbose 5 --leave-session-attached xterm
>>
>> But this will open the xterms:
>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>> plm_base_verbose 1  --leave-session-attached xterm
>>
>> Any verbosity level > 0 will open xterms, but with ' -mca
>> plm_base_verbose 0' there are again no xterms.
>>
>
> No earthly idea...this seems to contradict what you had below. You said you were seeing the xterms with this cmd line:
>
>>>> I just found that everything works as expected if i use the
>>>> '--leave-session-attached' option (without the debug options):
>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>>> ./HelloMPI
>>>> The xterms are also opened if i do not use the '!' hold option.
>>>
>
> Did I miss something?
>
>
>> Thank You
>>  Jody
>>
>> On Mon, May 2, 2011 at 2:29 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>
>>> On May 2, 2011, at 2:34 AM, jody wrote:
>>>
>>>> Hi Ralph
>>>>
>>>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
>>>> The results are interesting!
>>>>
>>>> I wrote a small HelloMPI app which basically calls usleep for a pause
>>>> of 5 seconds.
>>>>
>>>> Now calling it as i did before, no MPI errors appear anymore, only the
>>>> display problems:
>>>>  jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
>>>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
>>>>
>>>> When i do the same call *with* the debug option, the xterm appears and
>>>> shows the output of HelloMPI!
>>>> I attach the output in ompidbg_1.txt (It also works if i call with
>>>> '-np 4' and '--xterm 0,1,2,3')
>>>
>>> Good!
>>>
>>>>
>>>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
>>>>
>>>> If i use the hold-option, the xterm appears with the output of
>>>> 'hostname' (cf. ompidbg_3.txt)
>>>> The xterm opens after the line "launch complete for job..." has been
>>>> written (line 59)
>>>
>>> Okay, that's also expected. Like I said, without the "hold", the output is generated so quickly that the window just flashes at best. I've had similar experiences - hence the "hold" option.
>>>
>>>>
>>>> I just found that everything works as expected if i use the
>>>> '--leave-session-attached' option (without the debug options):
>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>>> ./HelloMPI
>>>> The xterms are also opened if i do not use the '!' hold option.
>>>
>>> Okay, I can understand why. The --leave-session-attached option just tells mpirun to not daemonize the backend daemons - thus leaving the ssh session alive. The debug options do the same thing, but turn on all the debug output.
>>>
>>> The problem is that if you don't leave the ssh session alive, then the xterm has no way back to your screen. By daemonizing, we sever that connection.
>>>
>>> What I should do (and maybe used to do, but it got removed) is automatically turn "on" the leave-session-attached option if you give --xterm. I can enter that patch.
>>>
>>> Note that this does limit the size of the launch to the number of ssh sessions the system allows you to have open at the same time. We default to a limit of 128 nodes, which is likely adequate for an xterm-based debugging session. However, you can increase it using an mca param (see ompi_info) to as high as the system allows.
>>>
>>> Thanks for helping debug this! I'll add you to the patch list so you can track it.
>>>
>>>>
>>>> What does *not* work is
>>>>  jody <at> aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
>>>>  xterm Xt error: Can't open display:
>>>>  xterm:  DISPLAY is not set
>>>>  xterm Xt error: Can't open display:
>>>>  xterm:  DISPLAY is not set
>>>>
>>>> But then again, this call works (i.e. an xterm is opened) if all the
>>>> debug-options are used (ompidbg_4.txt).
>>>> Here the '--leave-session-attached' is necessary - without it, no xterm.
>>>>
>>>> From these results i would say that there is no basic mishandling of
>>>> 'ssh', though i have no idea
>>>> what internal differences the use of the '--leave-session-attached'
>>>> option or the debug options make.
>>>>
>>>> I hope these observations are helpful
>>>>  Jody
>>>>
>>>>
>>>> On Fri, Apr 29, 2011 at 12:08 AM, jody <jody.xha <at> gmail.com> wrote:
>>>>> Hi Ralph
>>>>>
>>>>> Thank you for your suggestions.
>>>>> I'll be happy to help  you.
>>>>> I'm not sure if i'll get around to this tomorrow,
>>>>> but i certainly will do so on Monday.
>>>>>
>>>>> Thanks
>>>>>  Jody
>>>>>
>>>>> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>> Hi Jody
>>>>>>
>>>>>> I'm not sure when I'll get a chance to work on this - got a deadline to meet. I do have a couple of suggestions, if you wouldn't mind helping debug the problem?
>>>>>>
>>>>>> It looks to me like the problem is that mpirun is crashing or terminating early for some reason - hence the failures to send msgs to it, and the "lifeline lost" error that leads to the termination of the daemon. If you build a debug version of the code (i.e., --enable-debug on configure), you can get a lot of debug info that traces the behavior.
>>>>>>
>>>>>> If you could then run your program with
>>>>>>
>>>>>>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>>>>>
>>>>>> and send it to me, we'll see what ORTE thinks it is doing.
>>>>>>
>>>>>> You could also take a look at the code for implementing the xterm option. You'll find it in
>>>>>>
>>>>>> orte/mca/odls/base/odls_base_default_fns.c
>>>>>>
>>>>>> around line 1115. The xterm command syntax is defined in
>>>>>>
>>>>>> orte/mca/odls/base/odls_base_open.c
>>>>>>
>>>>>> around line 233 and following. Note that we use "xterm -T" as the cmd. Perhaps you can spot an error in the way we treat xterm?
>>>>>>
>>>>>> Also, remember that you have to specify that you want us to "hold" the xterm window open even after the process terminates. If you don't specify it, the window automatically closes upon completion of the process. So a fast-running cmd like "hostname" might disappear so quickly that it causes a race condition problem.
>>>>>>
>>>>>> You might want to try a spinner application - i.e., output something and then sit in a loop or sleep for some period of time. Or, use the "hold" option to keep the window open - you designate "hold" by putting a '!' before the rank, e.g., "mpirun -np 2 -xterm \!2 hostname"
>>>>>>
>>>>>>
>>>>>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Unfortunately this does not solve my problem.
>>>>>>> While i can do
>>>>>>>  ssh -Y squid_0 xterm
>>>>>>> and this will open an xterm on my machine (chefli),
>>>>>>> i run into problems with the -xterm option of openmpi:
>>>>>>>
>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>>>>>>> -Y" -host squid_0 --xterm 1 hostname
>>>>>>>  squid_0
>>>>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>> [sd = 8]
>>>>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>>>>> lifeline [[35219,0],0] lost
>>>>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>> [sd = 8]
>>>>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>>>>> lifeline [[35219,0],0] lost
>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>
>>>>>>> By the way when i look at the DISPLAY variable in the xterm window
>>>>>>> opened via squid_0,
>>>>>>> i also have the display variable "localhost:11.0"
>>>>>>>
>>>>>>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>>>>>>> the lines with the warnings about "xauth" and "untrusted X" do not
>>>>>>> appear:
>>>>>>>
>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>  squid_0
>>>>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>> [sd = 8]
>>>>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>>>>> lifeline [[34926,0],0] lost
>>>>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>> [sd = 8]
>>>>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>>>>> lifeline [[34926,0],0] lost
>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>
>>>>>>>
>>>>>>> I have doubts that the "-Y" is passed correctly:
>>>>>>>   jody <at> triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>>>>>>> -Y" -host squid_0 xterm
>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>
>>>>>>>
>>>>>>> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>>>>>>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>>>>>>> prints everything onto the console.
>>>>>>>
>>>>>>> Do you have any other suggestions i could try?
>>>>>>>
>>>>>>> Thank You
>>>>>>> Jody
>>>>>>>
>>>>>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>> Should be able to just set
>>>>>>>>
>>>>>>>> -mca plm_rsh_agent "ssh -Y"
>>>>>>>>
>>>>>>>> on your cmd line, I believe
>>>>>>>>
>>>>>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>>>>>
>>>>>>>>> Hi Ralph
>>>>>>>>>
>>>>>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>>>>>
>>>>>>>>> Thank You
>>>>>>>>>   Jody
>>>>>>>>>
>>>>>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody.xha <at> gmail.com> wrote:
>>>>>>>>>> Hi Ralph
>>>>>>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>>>>>> (X11Forwarding was set to 'no').
>>>>>>>>>>
>>>>>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>>>>>> and with 'ssh -X'
>>>>>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>>>>>
>>>>>>>>>> But the xterm option still doesn't work:
>>>>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>>>>>> printenv | grep WORLD_RANK
>>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>>>> [sd = 8]
>>>>>>>>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>>>>>> lifeline [[54132,0],0] lost
>>>>>>>>>>
>>>>>>>>>> So it looks like the two processes from squid_0 can't open the display this way,
>>>>>>>>>> but one of them writes the output to the console...
>>>>>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
>>>>>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>>>>>
>>>>>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>>>>>
>>>>>>>>>> Thank You
>>>>>>>>>>  Jody
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>>>>>> Cygwin-specific in the answers:
>>>>>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>>>>>
>>>>>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>>>>>> already tried -Y. :-(
>>>>>>>>>>> Not sure what to suggest...
>>>>>>>>>>>
>>>>>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>> Like I said, I'm not expert. However, a quick "google" revealed this
>>>>>>>>>>> result:
>>>>>>>>>>>
>>>>>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote server
>>>>>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>>>>>
>>>>>>>>>>> When doing something like:
>>>>>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>>>>>>>>>> got an error message like:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> jason <at> badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>>>>>> [root <at> RHEL ~]#
>>>>>>>>>>> and any X programs I ran would not display on my local system..
>>>>>>>>>>>
>>>>>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>>>>>
>>>>>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>>>>>
>>>>>>>>>>> and that worked fine.
>>>>>>>>>>>
>>>>>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>>>>>> accommodate.
>>>>>>>>>>>
>>>>>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Ralph
>>>>>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>>>>>
>>>>>>>>>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>>>>>>>
>>>>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -Y squid_0
>>>>>>>>>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>>>  jody <at> squid_0 ~ $ exit
>>>>>>>>>>>  logout
>>>>>>>>>>>
>>>>>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>>>>>> as with mpirun:
>>>>>>>>>>>
>>>>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -X squid_0
>>>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not
>>>>>>>>>>> generated
>>>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>>>>>
>>>>>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>>>>>
>>>>>>>>>>> Thank You
>>>>>>>>>>>  Jody
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>>>>>>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>>>>>>>>>> and hence will abort.
>>>>>>>>>>>
>>>>>>>>>>> Are you seeing mpirun still alive?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>>>>>> it works in "text-mode":
>>>>>>>>>>>
>>>>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=1
>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=2
>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=3
>>>>>>>>>>>
>>>>>>>>>>> but when i use the -xterm option to mpirun, it doesn't work:
>>>>>>>>>>>
>>>>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep WORLD_RANK
>>>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>>>>>>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to lifeline [[55607,0],0] lost
>>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>>>
>>>>>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>>>>>
>>>>>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>>>>>> the workstation, to the IP address of the workstation or simply to
>>>>>>>>>>> ":0.0", it doesn't work.
>>>>>>>>>>>
>>>>>>>>>>> But i do have xauth data (as far as i know):
>>>>>>>>>>>
>>>>>>>>>>> On the remote (squid_0):
>>>>>>>>>>>
>>>>>>>>>>>  jody <at> squid_0 ~ $ xauth list
>>>>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>
>>>>>>>>>>> On the workstation:
>>>>>>>>>>>
>>>>>>>>>>>  $ xauth list
>>>>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>
>>>>>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'.
>>>>>>>>>>>
>>>>>>>>>>> I have also done
>>>>>>>>>>>
>>>>>>>>>>>   xhost + squid_0
>>>>>>>>>>>
>>>>>>>>>>> on the workstation.
>>>>>>>>>>>
>>>>>>>>>>> How can i get the -xterm option running?
>>>>>>>>>>>
>>>>>>>>>>> Thank You
>>>>>>>>>>>  Jody
>>>>>>>>>>>
>>>> <ompidbg_1.txt><ompidbg_2.txt><ompidbg_3.txt><ompidbg_4.txt>
Ralph Castain | 2 May 2011 16:30

Re: [OMPI users] problems with the -xterm option


On May 2, 2011, at 8:21 AM, jody wrote:

> Hi
> Well, the difference is that one time i call the application
> 'HelloMPI' with the '--xterm' option,
> whereas in my previous mail i am calling the application 'xterm'
> (without the '--xterm' option)

Ah, well that might explain it. I don't know how xterm would react to just being launched by mpirun onto a
remote platform without any command to run. I can't explain what the plm verbosity has to do with anything, though.
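To make the difference between the two invocations concrete, here is a minimal Python sketch. It only builds and prints the two command lines; it does not run mpirun. The helper name `mpirun_cmd` is hypothetical, and the flags, host name and trailing '!' hold marker are copied from Jody's command lines in this thread rather than checked against the Open MPI documentation.

```python
import shlex


def mpirun_cmd(app, np=4, host="squid_0", xterm_ranks=None, hold=False):
    """Build (but do not run) an mpirun argument list.

    Hypothetical helper: every flag below is copied from the command
    lines quoted in this thread, not from the Open MPI documentation.
    """
    argv = [
        "mpirun", "-np", str(np), "-host", host,
        "-mca", "plm_rsh_agent", "ssh -Y",
        "--leave-session-attached",
    ]
    if xterm_ranks is not None:
        ranks = ",".join(str(r) for r in xterm_ranks)
        # Trailing '!' is the hold marker as Jody typed it.
        argv += ["--xterm", ranks + ("!" if hold else "")]
    argv.append(app)
    return argv


# Case 1: mpirun's own --xterm option wraps the application in xterms.
with_option = mpirun_cmd("./HelloMPI", xterm_ranks=[0, 1, 2, 3], hold=True)

# Case 2: xterm itself is the application, with no --xterm option.
bare_xterm = mpirun_cmd("xterm")

print(shlex.join(with_option))
print(shlex.join(bare_xterm))
```

In the first case mpirun starts the xterm and runs the application inside it; in the second, xterm is just another process launched on the remote node, so it depends on whatever DISPLAY it inherits there.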

> Jody
> 
> On Mon, May 2, 2011 at 4:08 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>> 
>> On May 2, 2011, at 7:56 AM, jody wrote:
>> 
>>> Hi Ralph
>>> 
>>> Thank You for doing the fix.
>>> 
>>> Do you perhaps also have an idea what is going on when i try to start
>>> xterm (or probably another X application) on a remote host?
>>> In this case it is not enough to specify the '--leave-session-attached' option.
>>> 
>>> These calls won't open any xterms
>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>>> plm_base_verbose 1 xterm
>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"
>>> --leave-session-attached xterm
>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>>> odls_base_verbose 5 xterm
>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>>> odls_base_verbose 5 --leave-session-attached xterm
>>> 
>>> But this will open the xterms:
>>>  mpirun -np 4 -host squid_0 -mca plm_rsh_agent "ssh -Y"  -mca
>>> plm_base_verbose 1  --leave-session-attached xterm
>>> 
>>> Any verbosity level > 0 will open xterms, but with ' -mca
>>> plm_base_verbose 0' there are again no xterms.
>>> 
>> 
>> No earthly idea...this seems to contradict what you had below. You said you
>> were seeing the xterms with this cmd line:
>> 
>>>>> I just found that everything works as expected if i use the
>>>>> '--leave-session-attached' option (without the debug options):
>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>>>> ./HelloMPI
>>>>> The xterms are also opened if i do not use the '!' hold option.
>>>> 
>> 
>> Did I miss something?
>> 
>> 
>>> Thank You
>>>  Jody
>>> 
>>> On Mon, May 2, 2011 at 2:29 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>> 
>>>> On May 2, 2011, at 2:34 AM, jody wrote:
>>>> 
>>>>> Hi Ralph
>>>>> 
>>>>> I rebuilt open MPI 1.4.2 with the debug option on both chefli and squid_0.
>>>>> The results are interesting!
>>>>> 
>>>>> I wrote a small HelloMPI app which basically calls usleep for a pause
>>>>> of 5 seconds.
>>>>> 
>>>>> Now calling it as i did before, no MPI errors appear anymore, only the
>>>>> display problems:
>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 1 -host squid_0 -mca
>>>>> plm_rsh_agent "ssh -Y" --xterm 0 ./HelloMPI
>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:10.0
>>>>> 
>>>>> When i do the same call *with* the debug option, the xterm appears and
>>>>> shows the output of HelloMPI!
>>>>> I attach the output in ompidbg_1.txt (It also works if i call with
>>>>> '-np 4' and '--xterm 0,1,2,3'
>>>> 
>>>> Good!
>>>> 
>>>>> 
>>>>> Calling hostname the same way does not open an xterm (cf. ompidbg_2.txt).
>>>>> 
>>>>> If i use the hold-option, the xterm appears with the output of
>>>>> 'hostname' (cf. ompidbg_3.txt)
>>>>> The xterm opens after the line "launch complete for job..." has been
>>>>> written (line 59)
>>>> 
>>>> Okay, that's also expected. Like I said, without the "hold", the output is
>>>> generated so quickly that the window just flashes at best. I've had similar
>>>> experiences - hence the "hold" option.
>>>> 
>>>>> 
>>>>> I just found that everything works as expected if i use the
>>>>> '--leave-session-attached' option (without the debug options):
>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -mca
>>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  --xterm 0,1,2,3!
>>>>> ./HelloMPI
>>>>> The xterms are also opened if i do not use the '!' hold option.
>>>> 
>>>> Okay, I can understand why. The --leave-session-attached option just tells
>>>> mpirun to not daemonize the backend daemons - thus leaving the ssh session
>>>> alive. The debug options do the same thing, but turn on all the debug output.
>>>> 
>>>> The problem is that if you don't leave the ssh session alive, then the xterm
>>>> has no way back to your screen. By daemonizing, we sever that connection.
>>>> 
>>>> What I should do (and maybe used to do, but it got removed) is automatically
>>>> turn "on" the leave-session-attached option if you give --xterm. I can enter
>>>> that patch.
>>>> 
>>>> Note that this does limit the size of the launch to the number of ssh sessions
>>>> the system allows you to have open at the same time. We default to a limit of
>>>> 128 nodes, which is likely adequate for an xterm-based debugging session.
>>>> However, you can increase it using an mca param (see ompi_info) to as high as
>>>> the system allows.
>>>> 
>>>> Thanks for helping debug this! I'll add you to the patch list so you can track it.
>>>> 
>>>>> 
>>>>> What does *not* work is
>>>>>  jody <at> aim-triops ~/share/neander $ mpirun -np 2 -host squid_0 -mca
>>>>> plm_rsh_agent "ssh -Y"  --leave-session-attached  xterm
>>>>>  xterm Xt error: Can't open display:
>>>>>  xterm:  DISPLAY is not set
>>>>>  xterm Xt error: Can't open display:
>>>>>  xterm:  DISPLAY is not set
>>>>> 
>>>>> But then again, this call works (i.e. an xterm is opened) if all the
>>>>> debug-options are used (ompidbg_4.txt).
>>>>> Here the '--leave-session-attached' is necessary - without it, no xterm.
>>>>> 
>>>>> From these results i would say that there is no basic mishandling of
>>>>> 'ssh', though i have no idea
>>>>> what internal differences the use of the '--leave-session-attached'
>>>>> option or the debug options make.
>>>>> 
>>>>> I hope these observations are helpful
>>>>>  Jody
>>>>> 
>>>>> 
>>>>> On Fri, Apr 29, 2011 at 12:08 AM, jody <jody.xha <at> gmail.com> wrote:
>>>>>> Hi Ralph
>>>>>> 
>>>>>> Thank you for your suggestions.
>>>>>> I'll be happy to help  you.
>>>>>> I'm not sure if i'll get around to this tomorrow,
>>>>>> but i certainly will do so on Monday.
>>>>>> 
>>>>>> Thanks
>>>>>>  Jody
>>>>>> 
>>>>>> On Thu, Apr 28, 2011 at 11:53 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>> Hi Jody
>>>>>>> 
>>>>>>> I'm not sure when I'll get a chance to work on this - got a deadline to
>>>>>>> meet. I do have a couple of suggestions, if you wouldn't mind helping debug
>>>>>>> the problem?
>>>>>>> 
>>>>>>> It looks to me like the problem is that mpirun is crashing or terminating
>>>>>>> early for some reason - hence the failures to send msgs to it, and the
>>>>>>> "lifeline lost" error that leads to the termination of the daemon. If you
>>>>>>> build a debug version of the code (i.e., --enable-debug on configure), you
>>>>>>> can get a lot of debug info that traces the behavior.
>>>>>>> 
>>>>>>> If you could then run your program with
>>>>>>> 
>>>>>>>  -mca plm_base_verbose 5 -mca odls_base_verbose 5 --leave-session-attached
>>>>>>> 
>>>>>>> and send it to me, we'll see what ORTE thinks it is doing.
>>>>>>> 
>>>>>>> You could also take a look at the code for implementing the xterm option. You'll find it in
>>>>>>> 
>>>>>>> orte/mca/odls/base/odls_base_default_fns.c
>>>>>>> 
>>>>>>> around line 1115. The xterm command syntax is defined in
>>>>>>> 
>>>>>>> orte/mca/odls/base/odls_base_open.c
>>>>>>> 
>>>>>>> around line 233 and following. Note that we use "xterm -T" as the cmd.
>>>>>>> Perhaps you can spot an error in the way we treat xterm?
>>>>>>> 
>>>>>>> Also, remember that you have to specify that you want us to "hold" the xterm
>>>>>>> window open even after the process terminates. If you don't specify it, the
>>>>>>> window automatically closes upon completion of the process. So a fast-running
>>>>>>> cmd like "hostname" might disappear so quickly that it causes a race
>>>>>>> condition problem.
>>>>>>> 
>>>>>>> You might want to try a spinner application - i.e., output something and then
>>>>>>> sit in a loop or sleep for some period of time. Or, use the "hold" option to
>>>>>>> keep the window open - you designate "hold" by putting a '!' before the rank,
>>>>>>> e.g., "mpirun -np 2 -xterm \!2 hostname"
>>>>>>> 
>>>>>>> 
>>>>>>> On Apr 28, 2011, at 8:38 AM, jody wrote:
>>>>>>> 
>>>>>>>> Hi
>>>>>>>> 
>>>>>>>> Unfortunately this does not solve my problem.
>>>>>>>> While i can do
>>>>>>>>  ssh -Y squid_0 xterm
>>>>>>>> and this will open an xterm on my machine (chefli),
>>>>>>>> i run into problems with the -xterm option of openmpi:
>>>>>>>> 
>>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4  -mca plm_rsh_agent "ssh
>>>>>>>> -Y" -host squid_0 --xterm 1 hostname
>>>>>>>>  squid_0
>>>>>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>> [sd = 8]
>>>>>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>>>>>> lifeline [[35219,0],0] lost
>>>>>>>>  [squid_0:28046] [[35219,0],1]->[[35219,0],0]
>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>> [sd = 8]
>>>>>>>>  [squid_0:28046] [[35219,0],1] routed:binomial: Connection to
>>>>>>>> lifeline [[35219,0],0] lost
>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>> 
>>>>>>>> By the way when i look at the DISPLAY variable in the xterm window
>>>>>>>> opened via squid_0,
>>>>>>>> i also have the display variable "localhost:11.0"
>>>>>>>> 
>>>>>>>> Actually, the difference with using the "-mca plm_rsh_agent" is that
>>>>>>>> the lines with the warnings about "xauth" and "untrusted X" do not
>>>>>>>> appear:
>>>>>>>> 
>>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4   -host squid_0 -xterm 1 hostname
>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>  squid_0
>>>>>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>> [sd = 8]
>>>>>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>>>>>> lifeline [[34926,0],0] lost
>>>>>>>>  [squid_0:28337] [[34926,0],1]->[[34926,0],0]
>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>> [sd = 8]
>>>>>>>>  [squid_0:28337] [[34926,0],1] routed:binomial: Connection to
>>>>>>>> lifeline [[34926,0],0] lost
>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I have doubts that the "-Y" is passed correctly:
>>>>>>>>   jody <at> triops ~/share/neander $ mpirun -np   -mca plm_rsh_agent "ssh
>>>>>>>> -Y" -host squid_0 xterm
>>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ---> as a matter of fact i noticed that the xterm option doesn't work locally:
>>>>>>>>  mpirun -np 4    -xterm 1 /usr/bin/printenv
>>>>>>>> prints everything onto the console.
>>>>>>>> 
>>>>>>>> Do you have any other suggestions i could try?
>>>>>>>> 
>>>>>>>> Thank You
>>>>>>>> Jody
>>>>>>>> 
>>>>>>>> On Thu, Apr 28, 2011 at 3:06 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>> Should be able to just set
>>>>>>>>> 
>>>>>>>>> -mca plm_rsh_agent "ssh -Y"
>>>>>>>>> 
>>>>>>>>> on your cmd line, I believe
>>>>>>>>> 
>>>>>>>>> On Apr 28, 2011, at 12:53 AM, jody wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Ralph
>>>>>>>>>> 
>>>>>>>>>> Is there an easy way i could modify the OpenMPI code so that it would use
>>>>>>>>>> the -Y option for ssh when connecting to remote machines?
>>>>>>>>>> 
>>>>>>>>>> Thank You
>>>>>>>>>>   Jody
>>>>>>>>>> 
>>>>>>>>>> On Thu, Apr 7, 2011 at 4:01 PM, jody <jody.xha <at> gmail.com> wrote:
>>>>>>>>>>> Hi Ralph
>>>>>>>>>>> thank you for your suggestions. After some fiddling, i found that after my
>>>>>>>>>>> last update (gentoo) my sshd_config had been overwritten
>>>>>>>>>>> (X11Forwarding was set to 'no').
>>>>>>>>>>> 
>>>>>>>>>>> After correcting that, i can now open remote terminals with 'ssh -Y'
>>>>>>>>>>> and with 'ssh -X'
>>>>>>>>>>> (but with '-X' i still get those xauth warnings)
>>>>>>>>>>> 
>>>>>>>>>>> But the xterm option still doesn't work:
>>>>>>>>>>>  jody <at> chefli ~/share/neander $ mpirun -np 4 -host squid_0 -xterm 1,2
>>>>>>>>>>> printenv | grep WORLD_RANK
>>>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: localhost:11.0
>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>>>  [aim-squid_0:09856] [[54132,0],1]->[[54132,0],0]
>>>>>>>>>>> mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9)
>>>>>>>>>>> [sd = 8]
>>>>>>>>>>>  [aim-squid_0:09856] [[54132,0],1] routed:binomial: Connection to
>>>>>>>>>>> lifeline [[54132,0],0] lost
>>>>>>>>>>> 
>>>>>>>>>>> So it looks like the two processes from squid_0 can't open the display this way,
>>>>>>>>>>> but one of them writes the output to the console...
>>>>>>>>>>> Surprisingly, they are trying 'localhost:11.0' whereas when i use 'ssh -Y' the
>>>>>>>>>>> DISPLAY variable is set to 'localhost:10.0'
>>>>>>>>>>> 
>>>>>>>>>>> So in what way would OMPI have to be adapted, so -xterm would work?
>>>>>>>>>>> 
>>>>>>>>>>> Thank You
>>>>>>>>>>>  Jody
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 6, 2011 at 8:32 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>>>>> Here's a little more info - it's for Cygwin, but I don't see anything
>>>>>>>>>>>> Cygwin-specific in the answers:
>>>>>>>>>>>> http://x.cygwin.com/docs/faq/cygwin-x-faq.html#q-ssh-no-x11forwarding
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 6, 2011, at 12:30 PM, Ralph Castain wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry Jody - I should have read your note more carefully to see that you
>>>>>>>>>>>> already tried -Y. :-(
>>>>>>>>>>>> Not sure what to suggest...
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 6, 2011, at 12:29 PM, Ralph Castain wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Like I said, I'm no expert. However, a quick "google" revealed this
>>>>>>>>>>>> result:
>>>>>>>>>>>> 
>>>>>>>>>>>> When trying to set up x11 forwarding over an ssh session to a remote server
>>>>>>>>>>>> with the -X switch, I was getting an error like Warning: No xauth
>>>>>>>>>>>> data; using fake authentication data for X11 forwarding.
>>>>>>>>>>>> 
>>>>>>>>>>>> When doing something like:
>>>>>>>>>>>> ssh -Xl root 10.1.1.9 to a remote server, the authentication worked, but I
>>>>>>>>>>>> got an error message like:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> jason <at> badman ~/bin $ ssh -Xl root 10.1.1.9
>>>>>>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>>> Last login: Wed Apr 14 18:18:39 2010 from 10.1.1.5
>>>>>>>>>>>> [root <at> RHEL ~]#
>>>>>>>>>>>> and any X programs I ran would not display on my local system..
>>>>>>>>>>>> 
>>>>>>>>>>>> Turns out the solution is to use the -Y switch instead.
>>>>>>>>>>>> 
>>>>>>>>>>>> ssh -Yl root 10.1.1.9
>>>>>>>>>>>> 
>>>>>>>>>>>> and that worked fine.
>>>>>>>>>>>> 
>>>>>>>>>>>> See if that works for you - if it does, we may have to modify OMPI to
>>>>>>>>>>>> accommodate.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 6, 2011, at 9:19 AM, jody wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Ralph
>>>>>>>>>>>> No, after the above error message mpirun has exited.
>>>>>>>>>>>> 
>>>>>>>>>>>> But i also noticed that it is not possible to ssh into squid_0 and open an xterm there:
>>>>>>>>>>>> 
>>>>>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -Y squid_0
>>>>>>>>>>>>  Last login: Wed Apr  6 17:14:02 CEST 2011 from chefli.uzh.ch on pts/0
>>>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>>>  xterm Xt error: Can't open display:
>>>>>>>>>>>>  xterm:  DISPLAY is not set
>>>>>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=130.60.126.74:0.0
>>>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>>>  xterm Xt error: Can't open display: 130.60.126.74:0.0
>>>>>>>>>>>>  jody <at> squid_0 ~ $ export DISPLAY=chefli.uzh.ch:0.0
>>>>>>>>>>>>  jody <at> squid_0 ~ $ xterm
>>>>>>>>>>>>  xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>>>>  jody <at> squid_0 ~ $ exit
>>>>>>>>>>>>  logout
>>>>>>>>>>>> 
>>>>>>>>>>>> same thing with ssh -X, but here i get the same warning/error message
>>>>>>>>>>>> as with mpirun:
>>>>>>>>>>>> 
>>>>>>>>>>>>  jody <at> chefli ~/share/neander $ ssh -X squid_0
>>>>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>>>  Last login: Wed Apr  6 17:12:31 CEST 2011 from chefli.uzh.ch on ssh
>>>>>>>>>>>> 
>>>>>>>>>>>> So perhaps the whole problem is linked to that xauth-thing.
>>>>>>>>>>>> Do you have a suggestion how this can be solved?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank You
>>>>>>>>>>>>  Jody
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Apr 6, 2011 at 4:41 PM, Ralph Castain <rhc <at> open-mpi.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> If I read your error messages correctly, it looks like mpirun is crashing -
>>>>>>>>>>>> the daemon is complaining that it lost the socket connection back to mpirun,
>>>>>>>>>>>> and hence will abort.
>>>>>>>>>>>> 
>>>>>>>>>>>> Are you seeing mpirun still alive?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Apr 5, 2011, at 4:46 AM, jody wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi
>>>>>>>>>>>> 
>>>>>>>>>>>> On my workstation and the cluster i set up OpenMPI (v 1.4.2) so that
>>>>>>>>>>>> it works in "text-mode":
>>>>>>>>>>>> 
>>>>>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0   printenv | grep WORLD_RANK
>>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=1
>>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=2
>>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=3
>>>>>>>>>>>> 
>>>>>>>>>>>> but when i use the -xterm option to mpirun, it doesn't work:
>>>>>>>>>>>> 
>>>>>>>>>>>>  $ mpirun -np 4  -x DISPLAY -host squid_0 -xterm 1,2  printenv | grep WORLD_RANK
>>>>>>>>>>>>  Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>>>>>>>>>>>>  Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>>>>>>>>>>  OMPI_COMM_WORLD_RANK=0
>>>>>>>>>>>>  [squid_0:05266] [[55607,0],1]->[[55607,0],0] mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor (9) [sd = 8]
>>>>>>>>>>>>  [squid_0:05266] [[55607,0],1] routed:binomial: Connection to lifeline [[55607,0],0] lost
>>>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>>>>  /usr/bin/xterm Xt error: Can't open display: chefli.uzh.ch:0.0
>>>>>>>>>>>> 
>>>>>>>>>>>> (strange: somebody wrote his message to the console)
>>>>>>>>>>>> 
>>>>>>>>>>>> No matter whether i set the DISPLAY variable to the full hostname of
>>>>>>>>>>>> the workstation, to the IP address of the workstation or simply to
>>>>>>>>>>>> ":0.0", it doesn't work.
>>>>>>>>>>>> 
>>>>>>>>>>>> But i do have xauth data (as far as i know):
>>>>>>>>>>>> 
>>>>>>>>>>>> On the remote (squid_0):
>>>>>>>>>>>> 
>>>>>>>>>>>>  jody <at> squid_0 ~ $ xauth list
>>>>>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>>  chefli.uzh.ch:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>> 
>>>>>>>>>>>> On the workstation:
>>>>>>>>>>>> 
>>>>>>>>>>>>  $ xauth list
>>>>>>>>>>>>  chefli/unix:10  MIT-MAGIC-COOKIE-1  5293e179bc7b2036d87cbcdf14891d0c
>>>>>>>>>>>>  chefli/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>>  localhost.localdomain/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>>  chefli.uzh.ch/unix:0  MIT-MAGIC-COOKIE-1  146c7f438fab79deb8a8a7df242b6f4b
>>>>>>>>>>>> 
>>>>>>>>>>>> In sshd_config on the workstation i have 'X11Forwarding yes'.
>>>>>>>>>>>> 
>>>>>>>>>>>> I have also done
>>>>>>>>>>>> 
>>>>>>>>>>>>   xhost + squid_0
>>>>>>>>>>>> 
>>>>>>>>>>>> on the workstation.
>>>>>>>>>>>> 
>>>>>>>>>>>> How can i get the -xterm option running?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank You
>>>>>>>>>>>>  Jody
>>>>>>>>>>>> 
>>>>> <ompidbg_1.txt><ompidbg_2.txt><ompidbg_3.txt><ompidbg_4.txt>
Jack Bryan | 2 May 2011 17:30

Re: [OMPI users] OMPI vs. network socket communcation

Thanks for your reply.

MPI is for academic purposes. How about business applications?

What kinds of parallel/distributed computing environments do financial institutions
use for their high-frequency trading?

Any help is really appreciated. 

Thanks, 

Date: Mon, 2 May 2011 08:34:33 -0400
From: terry.dontje <at> oracle.com
To: users <at> open-mpi.org
Subject: Re: [OMPI users] OMPI vs. network socket communcation

On 04/30/2011 08:52 PM, Jack Bryan wrote:
> Hi, All:
>
> What is the relationship between MPI communication and socket communication?

MPI may use sockets to communicate between two processes.  Aside from that, they are used for different purposes.

> Is the network socket programming better than MPI?

Depends on what you are trying to do.  If you are writing a parallel program that may run in multiple environments with different types of performing protocols available for its use, then MPI is probably better.  If you are looking to do simple client/server type programming, then socket programming might have an advantage.

> I am a newbie of network socket programming.
>
> I do not know which one is better for parallel/distributed computing?

IMO MPI.

> I know that network socket is unix-based file communication between server and client.
>
> If they can also be used for parallel computing, how MPI can work better than them?

There is a lot of stuff that MPI does behind the curtain to make a parallel application's life a lot easier.  As far as performance goes, MPI will not perform better than sockets if it is using sockets as the underlying model.  However, the performance difference should be negligible, which makes all the other stuff MPI does for you a big win.

> I know MPI is for homogeneous cluster system and network socket is based on internet TCP/IP.

What do you mean by homogeneous cluster?  There are some MPIs that can work among different platforms and even different OSes (though some initial setup may be necessary).
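
As a rough illustration of the "stuff behind the curtain", here is a minimal Python sketch of what passing a single message over a raw stream socket involves. It is an assumption-laden toy, not how any MPI implementation works internally: the helper names (`send_msg`, `recv_exact`, `recv_msg`, `demo`) are hypothetical, and a local `socket.socketpair()` stands in for a real network connection between two processes.

```python
import socket
import struct


def send_msg(sock, payload: bytes) -> None:
    # A stream socket has no message boundaries, so prefix each
    # message with its length as a 4-byte big-endian integer.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    # recv() may return fewer bytes than requested; loop until done.
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock) -> bytes:
    # Read the length prefix first, then exactly that many bytes.
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def demo() -> bytes:
    # Two connected endpoints in one process, standing in for two peers.
    a, b = socket.socketpair()
    try:
        send_msg(a, b"hello from rank 0")
        return recv_msg(b)
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(demo())
```

With plain sockets the application also has to handle connection setup, peer addressing and partial reads itself; in MPI the same exchange is a matching MPI_Send / MPI_Recv pair addressed by rank, and the library chooses the transport.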

Hope this helps,


--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje <at> oracle.com




Terry Dontje | 2 May 2011 17:58

Re: [OMPI users] OMPI vs. network socket communcation

On 05/02/2011 11:30 AM, Jack Bryan wrote:
> Thanks for your reply.
>
> MPI is for academic purpose. How about business applications ?
>
There are quite a few non-academic MPI applications.  For example, 
there are many simulation codes from different vendors that 
support MPI (Nastran is one common one).
> What kinds of parallel/distributed computing environment do the 
> financial institutions
> use for their high frequency trading ?
I personally know of a private trading shop that uses MPI, but that's as 
much as I can say.  I am not sure how common it is; however, the direct 
communication with the trading servers is still via sockets or something 
similar, as opposed to MPI.

--td

>
> Any help is really appreciated.
>
> Thanks,
>
> ------------------------------------------------------------------------
> Date: Mon, 2 May 2011 08:34:33 -0400
> From: terry.dontje <at> oracle.com
> To: users <at> open-mpi.org
> Subject: Re: [OMPI users] OMPI vs. network socket communcation
>
> On 04/30/2011 08:52 PM, Jack Bryan wrote:
>
>     Hi, All:
>
>     What is the relationship between MPI communication and socket
>     communication ?
>
> MPI may use socket communications to do communications between two 
> processes.  Aside from that they are used for different purposes.
>
>     Is the network socket programming better than MPI ?
>
> Depends on what you are trying to do.  If you are writing a parallel 
> program that may run in multiple environments with different types of 
> performing protocols available for its use then MPI is probably 
> better.  If you are looking to do simple client/server type 
> programming then socket program might have an advantage.
>
>
>     I am a newbie of network socket programming.
>
>     I do not know which one is better for parallel/distributed
>     computing ?
>
> IMO MPI.
>
>
>     I know that network socket is unix-based file communication
>     between server and client.
>
>     If they can also be used for parallel computing, how MPI can work
>     better than them ?
>
> There is a lot of stuff that MPI does behind the curtain to make a 
> parallel applications life a lot easier.  As far as performance MPI 
> will not perform better than sockets if it is using sockets as the 
> underlying model.  However, the performance difference should be 
> negligible which makes all the other stuff MPI does for you a big win.
>
>
>     I know MPI is for homogeneous cluster system and network socket is
>     based on internet TCP/IP.
>
> What do you mean by homogeneous cluster?  There are some MPIs that can 
> work among different platforms and even different OSes (though some 
> initial setup may be necessary).
>
> Hope this helps,
>
>
> -- 
> Oracle
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle *- Performance Technologies*
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje <at> oracle.com <mailto:terry.dontje <at> oracle.com>
>
>
>
>
> _______________________________________________ users mailing list 
> users <at> open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> _______________________________________________
> users mailing list
> users <at> open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.dontje <at> oracle.com <mailto:terry.dontje <at> oracle.com>

_______________________________________________
users mailing list
users <at> open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
