Lorenzo Donà | 19 Jun 2013 18:26

[OMPI users] error with openmpi on snow leopard



Begin forwarded message:

From: Lorenzo Donà <lorechimica91 <at> hotmail.it>
Date: 19 June 2013 18:14:26 GMT+02:00
Subject: error with openmpi on snow leopard

Hi, I compiled OpenMPI v1.7.1 and earlier versions, but I always get this message:
Cannot open configuration file /Users/lorenzodona/Desktop/openmpi-1.7.1/share/openmpi/opal_wrapper-wrapper-data.txt
Error parsing data file opal_wrapper: Not found
Please can you help me?
Thanks for your patience,
Lorenzo.

_______________________________________________
users mailing list
users <at> open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Sergio Maffioletti | 19 Jun 2013 10:49

[OMPI users] openmpi 1.6.3 fails to identify local host if its IP is 127.0.1.1

Hello,

we have been observing strange behavior with OpenMPI 1.6.3:

    strace -f /share/apps/openmpi/1.6.3/bin/mpiexec -n 2 \
        --nooversubscribe --display-allocation --display-map --tag-output \
        /share/apps/gamess/2011R1/gamess.2011R1.x \
        /state/partition1/rmurri/29515/exam01.F05 -scr \
        /state/partition1/rmurri/29515

    ======================   ALLOCATED NODES   ======================

     Data for node: nh64-1-17.local Num slots: 0    Max slots: 0
     Data for node: nh64-1-17       Num slots: 2    Max slots: 0

    =================================================================

     ========================   JOB MAP   ========================

     Data for node: nh64-1-17       Num procs: 2
            Process OMPI jobid: [37108,1] Process rank: 0
            Process OMPI jobid: [37108,1] Process rank: 1

     =============================================================

As you can see, the host file lists the *unqualified* local host name;
OpenMPI fails to recognize that as the same host where it is running,
and uses `ssh` to spawn a remote `orted`, as use of `strace -f` shows:

    Process 16552 attached
    [pid 16552] execve("//usr/bin/ssh", ["/usr/bin/ssh", "-x",
"nh64-1-17", "OPAL_PREFIX=/share/apps/openmpi/1.6.3 ; export
OPAL_PREFIX; PATH=/share/apps/openmpi/1.6.3/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/share/apps/openmpi/1.6.3/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/share/apps/openmpi/1.6.3/lib:$", "--daemonize",
"-mca", "ess", "env", "-mca", "orte_ess_jobid", "2431909888", "-mca",
"orte_ess_vpid", "1", "-mca", "orte_ess_num_procs", "2", "--hnp-uri",
"\"2431909888.0;tcp://10.1.255.237:33154\"", "-mca", "plm", "rsh"],
["OLI235=/state/partition1/rmurri/29515/exam01.F235", ...

If the machine file lists the FQDNs instead, `mpiexec` spawns the jobs
directly via fork()/exec().

This seems related to the fact that each compute node advertises
127.0.1.1 as the IP address associated to its hostname:

    $ ssh nh64-1-17 getent hosts nh64-1-17
    127.0.1.1    nh64-1-17.local nh64-1-17

Indeed, if I change /etc/hosts so that a compute node associates a
"real" IP with its hostname, `mpiexec` works as expected.

Is this a known feature/bug/easter egg?
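
To check a node for the condition described above, one option is to resolve the node's own hostname and see whether any returned IPv4 address falls in 127.0.0.0/8. The C sketch below does only that. It is not how OpenMPI itself decides whether a host is local, and the helper names are made up for illustration.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return 1 if the dotted-quad string is in 127.0.0.0/8, 0 if it is not,
   and -1 if the string is not a valid IPv4 address. */
int is_loopback_ipv4(const char *text)
{
    struct in_addr addr;

    if (inet_pton(AF_INET, text, &addr) != 1)
        return -1;
    return (ntohl(addr.s_addr) >> 24) == 127;
}

/* Resolve `host` and return 1 if any IPv4 result is a loopback address,
   0 if none is, and -1 if the lookup itself fails. */
int resolves_to_loopback(const char *host)
{
    struct addrinfo hints, *res, *p;
    int found = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;
    for (p = res; p != NULL; p = p->ai_next) {
        struct sockaddr_in *sin = (struct sockaddr_in *)p->ai_addr;
        if ((ntohl(sin->sin_addr.s_addr) >> 24) == 127)
            found = 1;
    }
    freeaddrinfo(res);
    return found;
}

int main(void)
{
    char name[256];

    if (gethostname(name, sizeof name) != 0)
        return 1;
    name[sizeof name - 1] = '\0';
    printf("%s resolves to loopback: %d\n", name, resolves_to_loopback(name));
    return 0;
}
```

Run on a compute node, a result of 1 would match the `getent hosts` output shown earlier, where the hostname maps to 127.0.1.1.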

For the record: using OpenMPI 1.6.3 on Rocks 5.2.

Thanks,
on behalf of the GC3 Team
Sergio :)

GC3: Grid Computing Competence Center
http://www.gc3.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
Tel: +41 44 635 4222
Fax: +41 44 635 6888

Blosch, Edwin L | 19 Jun 2013 00:50

[OMPI users] Application hangs on mpi_waitall

I’m running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never returns.  The case runs fine with MVAPICH.  The logic associated with the communications has been extensively debugged in the past; we don’t think it has errors.   Each process posts non-blocking receives, non-blocking sends, and then does waitall on all the outstanding requests. 

 

The work is broken down into 960 chunks. If I run with 960 processes (60 nodes of 16 cores each), things seem to work.  If I use 160 processes (each process handling 6 chunks of work), then each process is handling 6 times as much communication, and that is the case that hangs with OpenMPI 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to start, diagnostically?  We’re using the openib btl.

 

Thanks,

 

Ed

Claire Williams | 18 Jun 2013 19:14

[OMPI users] Trouble with Sending Multiple messages to the Same Machine

Hi guys ☺!

I'm working with a simple "Hello, World" MPI program with one master. The master sends one message to each worker, receives a message back from each worker, and then sends each worker a new message. Unfortunately, this is not working :(. When the master sends only one message to each worker and then receives the reply, everything works fine, but sending more than one message to each worker causes problems. When that happens, it prints this error:

[[401,1],0][../../../../../openmpi-1.6.3/ompi/mca/btl/tcp/btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.X.X failed: No route to host (113)

I'm wondering how I can go about fixing this. This program is running across multiple Linux nodes, by the way :). 

BTW, I'm a girl.



xu | 18 Jun 2013 06:28

[OMPI users] mpif90 error with different openmpi editions

My code gets this error under OpenMPI 1.6.4:
mpif90 -O2 -m64 -fbounds-check -ffree-line-length-0 -c -o 2dem_mpi.o 2dem_mpi.f90
Fatal Error: Reading module mpi at line 110 column 30: Expected string
If I use mpif90 from Open MPI 1.3.3, it compiles OK. What is the problem here?

Haroogan | 17 Jun 2013 19:50

[OMPI users] Troubles Building OpenMPI on MinGW-w64 (GCC 4.8.0)

Hello,

I'm trying to build OpenMPI with CMake under MinGW-w64 based on GCC 4.8.0 (POSIX Threads), and here is what I get:

In file included from ../opal/threads/mutex_windows.h:36:0,
                 from ../opal/threads/mutex.h:121,
                 from ../opal/event/event.h:161,
                 from ../opal/event/log.c:60:
../opal/include/opal/sys/atomic.h:577:2: error: #error Atomic arithmetic on pointers not supported
 #error Atomic arithmetic on pointers not supported
  ^
ninja: build stopped: subcommand failed.

Any ideas?

[OMPI users] lsb_launch failed: 0

Hi Team,

 

Our users' jobs are exiting with the error below on random nodes. Could you please help us resolve this issue?

 

[root <at> bng1grcdc200 output.228472]# cat user_script.stderr
[bng1grcdc181:08381] [[54933,0],0] ORTE_ERROR_LOG: The specified application failed to start in file plm_lsf_module.c at line 308
[bng1grcdc181:08381] lsb_launch failed: 0
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        bng1grcdc172 - daemon did not report back when launched
        bng1grcdc154 - daemon did not report back when launched
        bng1grcdc198 - daemon did not report back when launched
        bng1grcdc183 - daemon did not report back when launched
        bng1grcdc187 - daemon did not report back when launched
        bng1grcdc196 - daemon did not report back when launched
        bng1grcdc153 - daemon did not report back when launched
        bng1grcdc173 - daemon did not report back when launched
        bng1grcdc193 - daemon did not report back when launched
        bng1grcdc185 - daemon did not report back when launched
        bng1grcdc176 - daemon did not report back when launched
        bng1grcdc190 - daemon did not report back when launched
        bng1grcdc194 - daemon did not report back when launched
        bng1grcdc156 - daemon did not report back when launched

 

 

Thanks,

Bharati Singh


[OMPI users] jobs are hanging with btl_openib_component error

Hi Team,

 

Our users' jobs are hanging, and we see the error below.

 

[[61410,1],65][btl_openib_component.c:3238:handle_wc] from bng1aviationdc22 to: bng1aviationdc26 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 774739584 opcode 1  vendor error 129 qp_idx 0

 

Please find the attached file for more information.

 

Thanks,

Bharati Singh

Attachment (output.14807.zip): application/x-zip-compressed, 115 KiB
Elias Rudberg | 16 Jun 2013 17:54

[OMPI users] MPI_Init_thread hangs in OpenMPI 1.7.1 when using --enable-mpi-thread-multiple

Hello!

I would like to report what seems to be a bug in MPI_Init_thread in  
OpenMPI 1.7.1.

The bug can be reproduced with the following test program  
(test_mpi_thread_support.c):
===========================================
#include <mpi.h>
#include <stdio.h>
int main(int argc, const char* argv[]) {
   int provided = -1;
   printf("Calling MPI_Init_thread...\n");
   MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
   printf("MPI_Init_thread returned, provided = %d\n", provided);
   MPI_Finalize();
   return 0;
}
===========================================

When OpenMPI is configured with --enable-mpi-thread-multiple, the  
program hangs whenever it is run with more than one process.

Steps I use to reproduce this in Ubuntu:

(1) Download openmpi-1.7.1.tar.gz

(2) Configure like this:
./configure --enable-mpi-thread-multiple

(3) make

(4) Compile test program like this:
mpicc test_mpi_thread_support.c

(5) Run like this:
mpirun -np 2 ./a.out
Then you see the following two lines of output:
Calling MPI_Init_thread...
Calling MPI_Init_thread...
And then it hangs.

MPI_Init_thread did not hang in earlier OpenMPI versions (for example  
it worked in 1.5.* and 1.6.*), so it seems like a bug introduced in 1.7.

The description above shows how I reproduce this in Ubuntu on my local  
desktop computer, but the same problem exists for the OpenMPI 1.7.1  
installation at the UPPMAX computer center where I want to run my code  
in the end. I don't know all details about how they installed it  
there, but I know they set --enable-mpi-thread-multiple. So maybe it  
hangs in 1.7.1 on any computer as long as you use MPI_THREAD_MULTIPLE.  
At least I have not seen it work anywhere.

Do you agree that this is a bug, or am I doing something wrong?

Best regards,
Elias
Vanja Z | 15 Jun 2013 19:44

Re: [OMPI users] QLogic HCA random crash after prolonged use

>>  I have seen it recommended to use psm instead of openib for QLogic cards.

> [Tom] 
> Yes.  PSM will perform better and be more stable when running OpenMPI than using 
> verbs.  Intel has acquired the InfiniBand assets of QLogic about a year ago.  
> These SDR HCAs are no longer supported, but should still work.  You can get the 
> driver (ib_qib) and PSM library from OFED 1.5.4.1 or the current release OFED 
> 3.5.
> 
> With the current OFED 3.5 release there are included psm-release notes which 
> start out this way (read down to the OpenMPI build instructions for PSM):

Thanks for the reply (and sorry for my late response). I had already tried 
compiling OpenMPI with the "--with-psm" flag. It compiles but doesn't 
seem to get me much closer to actually using psm.

I've found software packages available from the Intel site,
http://www.intel.com/content/www/us/en/search.html?keyword=qlogic+ofed
It seems like installing these on a supported OS (RHEL5/6 and SLES 10/11) 
is the recommended method for using QLogic/Intel cards. I also found 
this very informative post by Julien Blache explaining how he got it all 
working on Debian Squeeze,
http://swik.net/Debian/Planet+Debian/Julien+Blache%3A+QLogic+QLE73xx+InfiniBand+adapters,+QDR,+ib_qib,+OFED+1.5.2+and+Debian+Squeeze/e56if
It seems like, apart from building OpenMPI with the right flag, there is 
also some configuration required, involving at the very least a utility 
called iba_portconfig.sh and an openibd initscript. I have tried getting 
these utilities from various sources and I can't find a version that 
doesn't segfault on my machines (Debian Wheezy). It's also not clear to 
me what should come from the Debian repos and what should come from the 
Intel package, including what to do about the kernel :S

The more I read online, the more it seems that these cards have 
absolutely no hope of operating stably. With a recent kernel upgrade I'm 
also getting a new MPI fork warning that some searching indicates is 
also connected to QLogic cards. I bought 24 of these cards a few months 
ago and it has turned into the biggest computer-related nightmare I've 
ever experienced. I'm beginning to think I'm better off trying to sell 
them and buy equivalent cards from Mellanox (I have 2 Mellanox cards 
that seem to work fine on Debian out of the box).

Have I got any chance of making these cards work on Debian Wheezy?
Zehan Cui | 15 Jun 2013 04:54

[OMPI users] MPI_Iallgatherv performance

Hi,

OpenMPI 1.7.1 is announced to support MPI-3 functionality such as non-blocking collectives.

I have tested MPI_Iallgatherv on an 8-node cluster, but I got bad performance. MPI_Iallgatherv blocks the program for even longer than the traditional MPI_Allgatherv.

The test pseudo-code and results follow.

===========================

Using MPI_Allgatherv:

for( i=0; i<8; i++ )
{
  // computation
    mytime( t_begin );
    computation;
    mytime( t_end );
    comp_time += (t_end - t_begin);
  
  // communication
    t_begin = t_end;
    MPI_Allgatherv();
    mytime( t_end );
    comm_time += (t_end - t_begin);
}

result:
    comp_time = 811,630 us
    comm_time = 342,284 us

--------------------------------------------

Using MPI_Iallgatherv:

for( i=0; i<8; i++ )
{
  // computation
    mytime( t_begin );
    computation;
    mytime( t_end );
    comp_time += (t_end - t_begin);
  
  // communication
    t_begin = t_end;
    MPI_Iallgatherv();
    mytime( t_end );
    comm_time += (t_end - t_begin);
}

// wait for non-blocking allgather to complete
mytime( t_begin );
for( i=0; i<8; i++ )
    MPI_Wait();
mytime( t_end );
wait_time = t_end - t_begin;

result:
    comp_time = 817,397 us
    comm_time = 1,183,511 us
    wait_time = 1,294,330 us

==============================

From the results, we can tell that MPI_Iallgatherv blocks the program for 1,183,511 us, much longer than MPI_Allgatherv, which takes 342,284 us. Even worse, it still takes 1,294,330 us to wait for the non-blocking MPI_Iallgatherv calls to finish.


- Zehan Cui

