Steven Lo | 25 Nov 23:57 2014

maui crash while parsing command arguments on RHEL 6.6


Hi,

Since the mauiuser mailing list is very quiet, we're hoping this list 
can give us some clues
about the Maui problem that we are having.

Thanks in advance for your help.

Steven.

On 11/25/2014 02:50 PM, Steven Lo wrote:
>
> Hi,
>
> We have compiled maui 3.3.1 on RHEL 6.6 using default gcc 4.4.7. When 
> we try to run it with
> "/usr/local/maui-3.3.1/sbin/maui -C /var/spool/maui/maui.cfg", it 
> crashes with
>
> *** glibc detected *** /usr/local/maui-3.3.1/sbin/maui: free(): 
> invalid pointer: 0x000000349040eefc ***
>
>
> We then used gdb and found that it crashes in the subroutine "MUFree" 
> when it tries to parse the
> command arguments.  It's from main() -> MUStrDup() -> MUFree() -> free()
>
>
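For context, glibc raises "free(): invalid pointer" whenever free() receives an address that malloc() never returned. A minimal sketch of that failure class (the C program below is mine, not maui's code; one common way a routine like MUFree() can hit this is being handed a pointer into argv[] or a string literal rather than a strdup'd copy):

```shell
# Hypothetical reproduction of the abort class, not maui's actual code.
cat > /tmp/badfree.c <<'EOF'
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* argv[] storage was not obtained from malloc(), so glibc's
     * heap integrity checks abort the process. */
    if (argc > 1)
        free(argv[1]);
    return 0;
}
EOF
gcc -o /tmp/badfree /tmp/badfree.c
/tmp/badfree foo 2>/dev/null || echo "the invalid free aborted the process"
```

In gdb, the fix is usually to confirm whether the pointer handed to MUFree() was ever malloc'd on that code path.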

Andrus, Brian Contractor | 25 Nov 18:31 2014

Back from SC14 and prepping for Torque 5.0

Had a great time with everyone at SC14! Thanks for everything and keeping it moving!

So, I am starting to get things ready for our upgrade to Torque 5.0, based on Ken and Nik's accolades about how
smoothly it goes :)

In our shop, I build RPMs to install/manage pretty much everything. Makes life way easier.

That being my situation, my first grouse has already shown up.

Spec files can be picky. They need to be specific (hmmm.. connection to the name perhaps...yes).

I have seen this issue in previous versions of torque as well:

The directory paths inside the spec file do not match the paths in the tar file.

Can we please get some consistency on this?
Case in point: current .tgz file that is downloaded: torque-5.0.1-1_4fa836f5.tgz
In there:
[root SOURCES]# tar tzvf torque-5.0.1-1_4fa836f5.tgz|head -1
drwxr-xr-x 1167/2010         0 2014-11-05 13:45 torque-5.0.1-1_4fa836f5/

So everything gets put in a subdirectory named torque-5.0.1-1_4fa836f5

BUT in the spec file that is contained in said tgz file:
-------------snip---------
### Handle logic for snapshots
%define tarversion 5.0.1
#define snap 0
%if %{?snap}0
%{expand:%%define version %(echo %{tarversion} | sed 's/-snap\..*$//')}
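Until the spec and tarball agree, one workaround is to read the top-level directory name out of the tarball itself and feed it to %setup with -n. A sketch of the mechanics (the throwaway tarball below just mimics the download's layout; on a real system you would run the same pipeline against torque-5.0.1-1_4fa836f5.tgz):

```shell
# Build a throwaway tarball with the same layout as the download, then
# extract its top-level directory name the way rpmbuild needs it.
mkdir -p /tmp/specdemo/torque-5.0.1-1_4fa836f5
tar czf /tmp/specdemo/torque.tgz -C /tmp/specdemo torque-5.0.1-1_4fa836f5
TOPDIR=$(tar tzf /tmp/specdemo/torque.tgz | head -1 | cut -d/ -f1)
echo "%setup -q -n $TOPDIR"    # the line to use in the spec file
```

That keeps the build working whatever the snapshot suffix is, though real consistency in the shipped spec would obviously be better.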

Andrew Mather | 25 Nov 08:04 2014

Re: Great TORQUE BoF and dinner




Date: Wed, 19 Nov 2014 07:31:09 -0700
From: Ken Nielson <knielson <at> adaptivecomputing.com>
Subject: [torqueusers] Great TORQUE BoF and dinner

Hi all,

Thanks to everyone who came to our 6th annual TORQUE BoF here in New
Orleans. I hope our time together is useful to everyone.

The TORQUE dinner was definitely the best attended event since we started
this tradition 6 years ago.

Thanks to Adaptive Computing for sponsoring the dinner. And thanks to Ben
Schmuhl, David Hill, Nate Lyons and Dustin Haliday for joining us and being
available to the community.

I love this community.

Ken


Back Down-Under and with time to go through all the email that mounted up while away in New Orleans, I'd like to thank Ken and the Adaptive folks for the invite and for the BoF !

A great dinner with good company and some interesting stuff in the pipeline !

Andrew
 
--
-
http://surfcoast.redbubble.com | https://picasaweb.google.com/107747436224613508618
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"Unless someone like you, cares a whole awful lot, nothing is going to get better...It's not !" - The Lorax
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
A committee is a cul-de-sac, down which ideas are lured and then quietly strangled.
  Sir Barnett Cocks
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"A mind is like a parachute. It doesn't work if it's not open." :- Frank Zappa
-
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Covington, Cody Lance | 21 Nov 16:51 2014

communication problems?

Hello all

 

I would greatly appreciate any help at this point! I am trying to make a small computing cluster for a class in the fall. I have installed torque-5.0.1, and I am trying to get a 2 computer cluster to work before I add more machines.

 

On the head node (pchem-m-236), I have pbs_server and pbs_sched running, and the compute node (pchem-s1-236) is running pbs_mom.

 

The compute node state is down when everything is started. When I set the node state to free using qmgr, jobs will actually run on the compute node. But after 1 min or so the node goes down.

 

Here’s some info; momctl gives warnings that no messages have been sent to the server:

 

pchem-m-236 teacher # momctl -d 3 -h DHCP-129-59-119-100.n1.vanderbilt.edu

 

Host: pchem-s1-236/pchem-s1-236   Version: 5.0.1   PID: 3828

Server[0]: DHCP-129-59-119-18.n1.vanderbilt.edu (129.59.119.18:15001)

  Last Msg From Server:   739 seconds (CLUSTER_ADDRS)

  WARNING:  no messages sent to server

HomeDirectory:          /var/spool/torque/mom_priv

stdout/stderr spool directory: '/var/spool/torque/spool/' (33917122 blocks available)

NOTE:  syslog enabled

MOM active:             3932 seconds

Check Poll Time:        45 seconds

Server Update Interval: 30 seconds

LogLevel:               5 (use SIGUSR1/SIGUSR2 to adjust)

Communication Model:    TCP

MemLocked:              TRUE  (mlock)

TCP Timeout:            120 seconds

Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)

Alarm Time:             0 of 10 seconds

Trusted Client List:  127.0.0.1:0,127.0.1.1:0,129.59.119.18:0,129.59.119.100:15003:  0

Copy Command:           /usr/bin/scp -rpB

NOTE:  no local jobs detected

 

diagnostics complete

 

But the mom’s log says that a status update was successfully sent:

11/21/2014 09:36:43;0002;   pbs_mom.3828;Svr;pbs_mom;Torque Mom Version = 5.0.1, loglevel = 5

11/21/2014 09:36:43;0002;   pbs_mom.4956;n/a;mom_server_update_stat;status update successfully sent to DHCP-129-59-119-18.n1.vanderbilt.edu

11/21/2014 09:36:43;0008;   pbs_mom.3828;Job;scan_for_terminated;pid 4956 not tracked, statloc=0, exitval=0

11/21/2014 09:37:13;0002;   pbs_mom.6195;n/a;mom_server_update_stat;status update successfully sent to DHCP-129-59-119-18.n1.vanderbilt.edu

11/21/2014 09:37:14;0008;   pbs_mom.3828;Job;scan_for_terminated;pid 6195 not tracked, statloc=0, exitval=0
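One hedged thing to check, given that momctl and the mom log disagree: a node that runs jobs but flips back to down often traces to inconsistent name resolution between server and mom (note the Trusted Client List above mixes DHCP names and bare IPs). The fragment below shows the kind of consistency to aim for; the hostnames and IPs are taken from this thread, while np=4 and the file layout are assumptions:

```shell
# $TORQUE_HOME/server_priv/nodes on the head node (pchem-m-236):
#   pchem-s1-236 np=4
#
# /etc/hosts on BOTH machines should agree on the same names:
#   129.59.119.18   pchem-m-236
#   129.59.119.100  pchem-s1-236
#
# $TORQUE_HOME/mom_priv/config on the compute node should name the
# server exactly the way the nodes file names the mom:
#   $pbsserver pchem-m-236
```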

 

Does anyone have any ideas?

 

Thanks in advance

Cody

 

Ryan Novosielski | 20 Nov 01:06 2014

Admin Job/Reboot?


Can someone share with me the way they are scheduling nodes for
reboot? Occasionally I do OS patching on the nodes and it would be
nice to schedule a reboot in the slot where a job would be, more or
less. Someone must have written a script and it would be nice not to
have to reinvent the wheel.
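I don't know of a canonical community script, but one sketch (the node name, core count, and sudo arrangement here are my assumptions, not a published recipe) is to submit a whole-node job per host whose only work is the reboot:

```shell
#!/bin/bash
# Hypothetical "reboot job": request the entire node by name so the
# reboot lands in a normal scheduling slot, then bounce the box.
#PBS -N reboot-node01
#PBS -l nodes=node01:ppn=8        # ppn = all cores, so nothing co-schedules
#PBS -l walltime=00:15:00
sudo /sbin/shutdown -r now        # needs a sudoers entry for the submitter
```

Marking the node offline first with `pbsnodes -o node01` (and clearing it afterwards with `pbsnodes -c node01`) keeps new work off the host while the reboot job drains in.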

--

-- 
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novosirj <at> rutgers.edu - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
     `'
Paul Tatarsky | 14 Nov 22:07 2014

DRMAA from Torque 5.0.1 and python-drmaa


Forgive my ignorance if this is a frequent question and any other
generic signs of stupidity on my part. First post to this list.

I saw a few threads on DRMAA but not this particular matter.

I compiled the DRMAA library in the recent 5.0.1 tarball with
--enable-drmaa. (I add gperf to the mix here to get it to build)

I add the tpackage it generated to get the library and include file for
drmaa.

I bring down the python-drmaa package from

http://drmaa-python.github.io/

I export:

export DRMAA_LIBRARY_PATH=/opt/torque/lib/libdrmaa.so.1.0

(because it lives there on this cluster)

I build.

python setup.py build

I make an RPM to be tidy.

python setup.py bdist_rpm

I deploy the RPM.

When I attempt to import drmaa I believe I run into C++ mangling
confusion somewhere:

> python
>>> import drmaa
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "drmaa/__init__.py", line 63, in <module>
    from .session import JobInfo, JobTemplate, Session
  File "drmaa/session.py", line 39, in <module>
    from drmaa.helpers import (adapt_rusage, Attribute,
attribute_names_iterator,
  File "drmaa/helpers.py", line 36, in <module>
    from drmaa.wrappers import (drmaa_attr_names_t, drmaa_attr_values_t,
  File "drmaa/wrappers.py", line 56, in <module>
    _lib = CDLL(libpath, mode=RTLD_GLOBAL)
  File "/usr/lib64/python2.6/ctypes/__init__.py", line 353, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /opt/torque/lib/libdrmaa.so.0: undefined symbol:
_Z19drmaa_attrib_lookupPKcj

Because c++filt returns a call which I believe comes from a gperf
produced source file:

> c++filt _Z19drmaa_attrib_lookupPKcj
drmaa_attrib_lookup(char const*, unsigned int)

> grep drmaa_attrib_lookup *.c
attrib.c:        drmaa_attrib_lookup(const char *str, unsigned int len);
attrib.c:        = drmaa_attrib_lookup(drmaa_name, strlen(drmaa_name));
drmaa_attrib.c:/* Command-line: /usr/bin/gperf -L C++ --struct-type
--readonly-tables --includes --hash-function-name=drmaa_attrib_hash
--lookup-function-name=drmaa_attrib_lookup drmaa_attrib.gperf  */
drmaa_attrib.c:  static const struct drmaa_attrib *drmaa_attrib_lookup
(const char *str, unsigned int len);
drmaa_attrib.c:Perfect_Hash::drmaa_attrib_lookup (register const char
*str, register unsigned int len)
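For what it's worth, a quick way to confirm that theory is to demangle the symbol the loader wants and then look for it in the installed library (library path as on this cluster; nm and c++filt are standard binutils tools). If nm shows only the Perfect_Hash::drmaa_attrib_lookup member and not the free function, the gperf-generated C++ object and its C callers disagree on linkage, and regenerating without `-L C++` is one plausible fix (a guess on my part):

```shell
# Demangle the symbol the dynamic loader reported as undefined:
c++filt _Z19drmaa_attrib_lookupPKcj
# Then check whether the installed library actually exports it:
nm -D /opt/torque/lib/libdrmaa.so.1.0 2>/dev/null | c++filt | \
    grep drmaa_attrib_lookup \
    || echo "not exported: gperf output and its callers likely disagree"
```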

What have I forgotten on this Friday and should I just open a beer instead?

Any ideas very much appreciated!

--

-- 
Paul Tatarsky                          |---|
Two Guys and a Cluster            \O/  |--X| \O/
http://www.clusterguys.com/        |   |---|  |
(608) 630-8182 x1              __ /-\ _|X--|_/-\ __
Ken Nielson | 19 Nov 15:31 2014

Great TORQUE BoF and dinner

Hi all,

Thanks to everyone who came to our 6th annual TORQUE BoF here in New Orleans. I hope our time together is useful to everyone.

The TORQUE dinner was definitely the best attended event since we started this tradition 6 years ago.

Thanks to Adaptive Computing for sponsoring the dinner. And thanks to Ben Schmuhl, David Hill, Nate Lyons and Dustin Haliday for joining us and being available to the community.

I love this community.

Ken

--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
Ken Nielson | 14 Nov 17:09 2014

TORQUE dinner for SC14

Hi all,

We will be meeting at Gordon Biersch, 200 Poydras Street, for the TORQUE dinner. We will go there right after the TORQUE BoF.

See you all there

Ken

--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
Steven Lo | 12 Nov 20:29 2014

TORQUEView build questions


Hi,

I was able to get Qt installed and build TORQUEView.  When invoking 
TORQUEView, we get the following
errors, which I don't believe are significant:

libGL error: failed to authenticate magic 25
libGL error: failed to load driver: nouveau

Probably need a Nvidia driver.....

First question: do I need to build TORQUEView on a system which has 
Torque installed?  I thought
it could run on any system and monitor a Torque server remotely. When 
starting TORQUEView
as 'root', a window "TORQUEView Error" pops up with "bash: pbs_server: 
command not found".

After I click OK, a window "TORQUEView Message" pops up with the following:

-----------
TORQUEView is being run without admin rights.  Certain features will not 
be available, such as:

-Issuing "momctl -d3" commands
-Start Mom command
-Stop MOM command
-Submitting commands to qmgr
-Reading or editing the 'nodes' file
-----------

Shouldn't starting TORQUEView as root grant the admin rights?
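A guess about the "pbs_server: command not found" dialog: TORQUEView appears to shell out to the TORQUE commands, so they need to be on the PATH of whoever launches it, and the admin-rights check may depend on those commands succeeding rather than on being root. The install prefix below is an assumption; adjust to your build:

```shell
# Make the TORQUE client/server commands visible to TORQUEView
# (install prefix is an assumption; adjust to your build).
export PATH="$PATH:/usr/local/torque/bin:/usr/local/torque/sbin"
command -v pbs_server || echo "pbs_server still not on PATH"
```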

Thanks in advance for your help.

Steven.
Liam Gretton | 10 Nov 12:09 2014

Job summary other than by email?

Is there a mechanism for Torque to report a job's summary once it finishes
by some means other than using the -m option to send an email?

In many cases, especially for large array jobs, it's not desirable to be
receiving emails for job events.

Ideally there'd be an option to save a file alongside stdout and stderr
with the same information that's sent in the email.

--

-- 
Liam Gretton                                    liam.gretton <at> le.ac.uk
Systems Specialist                           http://www.le.ac.uk/its/
IT Services                                   Tel: +44 (0)116 2522254
University Of Leicester, University Road
Leicestershire LE1 7RH, United Kingdom
Erica Riello | 11 Nov 19:10 2014

mpi jobs

Hi all,

I'd like to know if there's any way to monitor the MPI processes that 
have finished in an MPI job.

Thanks in advance,

--
Erica Riello
