Ken Nielson | 26 Mar 18:32 2015

Torque 4.2.10 available

Hi all,

Torque 4.2.10 is available for download at http://www.adaptivecomputing.com/support/download-center/torque-download/

Thanks to everyone who helped resolve issues and get this release out the door.

Ken

--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Philippe Bourdin | 25 Mar 14:34 2015

pbs_sched fails on a head node with multiple interfaces


	Dear Torque team,

I currently have a severe problem with my head-node configuration.
As is known (but not written in the manual), pbs_sched will only listen
on the interface that the 'hostname' points to. Since we have multiple
interfaces, and only one of them points to the cluster nodes, I need
pbs_sched to listen on this one interface. And clearly, because this
subnet is strictly inaccessible from the outside world, we cannot set
the hostname to that interface; that would break other things.

My current solution is to hack the sources and add another option
'-B' to define a hostname to bind to, if that option is given. The
diffs are rather simple. Starting from the latest release, 5.1.0-1:

1) initialize the host variable in pbs_sched.c (at line 753):
host[0] = '\0';

2) to check for a '-B' option and store it in host, add (at line 849);
   note the capital 'B' (the getopt() option string also needs a "B:"
   entry), and since host is a fixed buffer, copy rather than assign:
       case 'B':
         snprintf(host, sizeof(host), "%s", optarg);
         break;

3) put an if-condition around the current code (around line 957):
   if (host[0] == '\0') {
     [...]
   }
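For reference, the difference the proposed '-B' option is meant to create
(binding to one address rather than to all interfaces) can be shown in
miniature with a plain socket. This is a sketch, not Torque code; Python
is used only for brevity, and 127.0.0.1 stands in for the cluster-facing
address:

```python
# Minimal sketch of binding a listener to one specific address versus to
# all interfaces -- the behavior the proposed '-B' option would select.
import socket

def bind_listener(bind_addr, port=0):
    """Open a TCP listener bound to one address ('' means all interfaces)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((bind_addr, port))  # only this address will accept connections
    s.listen(5)
    return s

all_ifaces = bind_listener('')           # reports 0.0.0.0: every interface
one_iface  = bind_listener('127.0.0.1')  # reachable via this address only
print(all_ifaces.getsockname()[0], one_iface.getsockname()[0])
all_ifaces.close()
one_iface.close()
```

A daemon that does the equivalent of the second call will never even see
connection attempts arriving on the other interfaces.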

Can you confirm that this *should* do the trick for pbs_sched?
And is there a way you might add this to the official sources?

If you don't do this on a multi-interface host, your jobs just hang
in the queue in state 'Q' and nothing happens, except that you find a
note in a logfile which (misleadingly!) says:
"job allocation request exceeds currently available cluster nodes"
even though pbsnodes clearly lists enough free nodes.

Furthermore: how do I restrict pbs_server to bind to only one
interface and not to all? Could someone give me a pointer or a patch?

Thanks and best greetings,

	Philippe Bourdin.
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Ryan Novosielski | 24 Mar 16:44 2015

Re: GeForce Nvidia cards


Maybe providing a way to exclude a user-definable GPU (by PCI
ID or something else) would be a good idea? Though maybe you don't
often need to worry about an nVidia GPU being an on-board display in a
compute cluster; vendors tend to use fairly low-end parts there,
knowing that hardly anyone plans to use the display GPU for anything
at all.
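To make the suggestion concrete, an exclusion list keyed on PCI bus id
might look like the sketch below. This is purely hypothetical -- neither
the config format nor the ids are a Torque feature:

```python
# Hypothetical sketch of a user-definable GPU exclusion list, keyed on
# PCI bus id. Nothing here is an existing Torque option; it only
# illustrates the idea of masking the on-board display GPU.
EXCLUDED_PCI_IDS = {"0000:03:00.0"}   # e.g. the on-board display GPU

def schedulable_gpus(gpus):
    """Filter out excluded devices; `gpus` maps pci bus id -> device name."""
    return {pci: name for pci, name in gpus.items()
            if pci not in EXCLUDED_PCI_IDS}
```

With that in place, only the devices surviving the filter would be
reported to the scheduler.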

On 03/24/2015 11:00 AM, Ken Nielson wrote:
> Brock,
> 
> We have been reporting the display GPU since we have been using the
> nvml library and it has not caused any problems up until now. It
> seemed to us that we would not want to report the display card and
> so we thought why not just get rid of it altogether. The response
> from everyone makes it clear that is not a good idea.
> 
> Apparently it is not broken so we won't fix it.
> 
> Ken
> 
> On Tue, Mar 24, 2015 at 8:53 AM, Brock Palen <brockp <at> umich.edu> wrote:
> 
> Maybe we could mask a PCI device ID?  Or an NVML device ID?  I 
> understand, even if it is uncommon, to not want the display GPU to 
> be scheduled for jobs.
> 
> Brock Palen www.umich.edu/~brockp
> Assoc. Director Advanced Research Computing - TS
> XSEDE Campus Champion
> brockp <at> umich.edu (734)936-1985
> 
> 
> 
>> On Mar 24, 2015, at 10:52 AM, Ken Nielson
>> <knielson <at> adaptivecomputing.com> wrote:
>> 
>> Byron,
>> 
>> Thanks for your input. I am glad I asked the question. It looks
>> like there is a wide range of NVIDIA GPU cards in use out there.
>> We will continue to report all GPUs we find.
>> 
>> Ken
>> 
>> On Mon, Mar 23, 2015 at 8:11 PM, Byron Lengsfield
>> <byron.lengsfieldiii <at> hgst.com> wrote:
>> We use Titans, Titan Blacks and Titan-Z cards in our compute
>> clusters.
>> 
>> These are not Tesla cards.
>> 
>> 
>> 
>> Regards,
>> 
>> Byron Lengsfield PhD
>> 
>> Recording Integration Lab
>> 
>> HGST, a Western Digital Company
>> 
>> San Jose, Ca
>> 
>> 
>> 
>> From: torqueusers-bounces <at> supercluster.org On Behalf Of Ken Nielson
>> Sent: Monday, March 23, 2015 9:43 AM
>> To: torqueusers; Torque Developers mailing list
>> Subject: [torqueusers] GeForce Nvidia cards
>> 
>> 
>> 
>> Hi all,
>> 
>> Currently Torque will display all Nvidia gpu devices reported by
>> the nvml libraries, including the graphics card, which is often a
>> GeForce or Quadro card used for video display. Would it be a
>> problem for anyone if we did not report display gpu devices and
>> only displayed Tesla type gpu devices?
>> 
>> Ken
>> 
>> 
>> --
>> 
>> Ken Nielson Sr. Software Engineer
>> +1 801.717.3700 office    +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
>> www.adaptivecomputing.com
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> 
>> 
>> Ken Nielson Sr. Software Engineer
>> +1 801.717.3700 office    +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
>> www.adaptivecomputing.com
> 
> 
> 
> 
> 
> --
> Adaptive Computing - www.adaptivecomputing.com
> Ken Nielson Sr. Software Engineer
> +1 801.717.3700 office    +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
> 
> 
> 

-- 
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novosirj <at> rutgers.edu - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
     `'
Martin Siegert | 23 Mar 19:48 2015

torque-5.1.0 with --enable-maxint-jobids

Hi,

I am in the process of compiling torque-5.1.0 and am wondering what
effects --enable-maxint-jobids will have: will torque be able to load
existing jobs that are in the queue?

configure issues the following warning:
WARNING: You are enabling job ids to go up to the maximum value of an int
on your system. Depending on the size of your system, this may make existing
jobs impossible to upgrade. BE CERTAIN BEFORE PROCEEDING!

Without --enable-maxint-jobids the maximum jobid appears to be 99999999
and we are reasonably close to that number.

I created a tar-archive of /var/spool/torque. Is it safe to try torque
compiled with --enable-maxint-jobids and, if that causes all jobs to be
discarded, to go back to --disable-maxint-jobids and restore
/var/spool/torque to its previous state?
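For what it's worth, the archive-try-restore plan can be sketched as
below. This is a generic sketch, not official upgrade guidance: paths
are examples, and pbs_server (and trqauthd) must be stopped before
either step:

```python
# Sketch of the rollback plan described above: archive the Torque spool
# before the upgrade; if the maxint-jobids build discards the queued
# jobs, wipe the spool and restore it from the archive. Stop the Torque
# daemons around both operations; paths here are examples only.
import os
import shutil
import tarfile

def backup_spool(spool_dir, archive_path):
    """Archive the whole spool directory (e.g. /var/spool/torque)."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(spool_dir, arcname=os.path.basename(spool_dir))

def restore_spool(archive_path, parent_dir, spool_name):
    """Throw away the current spool and restore it from the archive."""
    target = os.path.join(parent_dir, spool_name)
    if os.path.isdir(target):
        shutil.rmtree(target)  # discard state left by the failed attempt
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(parent_dir)
```

Whether the 5.1.0 server can still load the pre-upgrade serverdb after a
maxint build has touched it is exactly the question for the list; the
restore path above is the safety net if it cannot.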

Cheers,
Martin

-- 
Martin Siegert
WestGrid/ComputeCanada
Simon Fraser University
Burnaby, British Columbia
Ken Nielson | 23 Mar 17:43 2015

GeForce Nvidia cards

Hi all,

Currently Torque will display all Nvidia gpu devices reported by the nvml libraries, including the graphics card, which is often a GeForce or Quadro card used for video display. Would it be a problem for anyone if we did not report display gpu devices and only reported Tesla-type gpu devices?

Ken

--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Eva Hocks | 17 Mar 22:27 2015

torque 4.2.10


I am upgrading my cluster next Monday and found version 4.2.10 on git
but not on the AC download page (the latest on
http://www.adaptivecomputing.com/support/download-center/torque-download/
is 4.2.9.tar.gz)

Will 4.2.10 be available as tar archive or just from git?

Thanks
Eva
Michel Béland | 17 Mar 21:18 2015

Torque upgrade from 2.3 and 2.5 to 5.1

Hi,

I am preparing an upgrade to our various clusters running Torque. We 
have in production Torque 2.3.6, 2.5.3 and 2.5.9. On a small test 
cluster using virtual machines, I played with the upgrade from 2.5.3 to 
5.1 and noticed that if I had jobs still in queue (not running) 
submitted from Torque 2.5.3, I could see these jobs with qstat v5.1 
after the upgrade.

I thought that the format of the server database had changed from 2.5.3 
to 5.1. I planned to have a transient Torque 2.5.3 server on alternate 
ports with some nodes running pbs_mom 2.5.3 in order to run all the 
queued jobs submitted before the upgrade. I am wondering now if this is 
really needed.

Is Torque 5.1 supposed to be able to correctly run jobs submitted with 
2.5.3? I already know about job array changes between Torque 2.3.6 and 
2.5.3. I also know about the -u option to use when upgrading from 2.5.9.

Thanks,

-- 
Michel Béland, analyste en calcul scientifique
michel.beland <at> calculquebec.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone : 514 343-6111 poste 3892     télécopieur : 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.ca)

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Bas van der Vlies | 12 Mar 11:54 2015

ANNOUNCE: New version of torque python interface (4.6.0)

=========== 4.6.0
 * Added support for torque version 5.x. Note: the rm interface does not work for me on debian wheezy.
   Author: Bas van der Vlies

 * PBSQuery has been updated to the new cpuid range for torque 5
   Author: stijnd : deweirdt add ugent dot be 
   Applied by : Bas van der Vlies

==============================================================

the latest stable pbs_python interface is available from:
            ftp://ftp.surfsara.nl/pub/outgoing/pbs_python.tar.gz

Information, documentation and reporting bugs for the package:
            https://oss.trac.surfsara.nl/pbs_python

===== Brief description =========================================

The pbs_python interface is a wrapper class for the TORQUE C LIB API. Now you can
write utilities/extensions in Python instead of C.

--- Testing the package:

The test programs are included as a reference for how to use the pbs
python module. You have to edit some test programs to reflect your
Torque installation.

pbsmon.py               - ascii xpbsmon
rack_pbsmon.py          - ascii xpbsmon by rack layout
pbsnodes-a.py           - pbsnodes -a
pbs_version.py          - print server version
set_property.py         - set some node properties
resmom_info.py          - queries the pbs_mom daemon on the nodes
logpbs.py               - shows the usage of the PBS logging routines
new_interface.py        - example of how to use the PBSQuery module
PBSQuery.py             - python <install_path>/PBSQuery.py (has builtin demo)
sara_nodes.py           - we use this program to set nodes offline/online;
                          with no command line arguments it lists the
                          nodes that are down/offline

For more info:
        https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueExamples
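As a quick taste of the module, here is a sketch in the spirit of
new_interface.py and sara_nodes.py. The dict shape assumed for
getnodes() (node name -> attribute dict with a 'state' list) follows the
package's examples, and the live query at the bottom only works against
a running pbs_server:

```python
# Sketch of PBSQuery usage, modelled on the pbs_python examples.
# The helper works on plain dictionaries, so it can be tried without a
# running pbs_server; the live query is guarded below.
try:
    from PBSQuery import PBSQuery      # part of the pbs_python package
except ImportError:                    # pbs_python not installed here
    PBSQuery = None

def down_or_offline(nodes):
    """Return names of nodes whose state includes 'down' or 'offline'.

    `nodes` maps node name -> attribute dict with a 'state' list, the
    shape PBSQuery.getnodes() returns in the examples."""
    bad = []
    for name, attrs in nodes.items():
        state = ' '.join(attrs.get('state', []))
        if 'down' in state or 'offline' in state:
            bad.append(name)
    return sorted(bad)

if PBSQuery is not None:
    p = PBSQuery()                     # talks to the local pbs_server
    print(down_or_offline(p.getnodes()))
```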
---
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG  Amsterdam
| T +31 (0) 20 800 1300  | bas.vandervlies <at> surfsara.nl | www.surfsara.nl |
Ken Nielson | 10 Mar 01:12 2015

Re: Torque monitoring

If the jobs have been queued in Torque, you can follow them in Torque. Many users use Moab and Maui to submit their jobs to Torque.

On Mon, Mar 9, 2015 at 3:14 AM, Amarinder Singh Thind <mr.thind <at> outlook.com> wrote:
Dear Sir/Ma'am,



Ganglia is not working on the cluster, so I want to monitor jobs using torque. Can I monitor all jobs through torque, even those that were not submitted through torque?


Regards,
Amarinder Singh Thind
Project fellow
IISER-Mohali
M.tech.Bioinformatics
University of Hyderabad




--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Christopher Smowton | 9 Mar 20:30 2015

Zombie Jobs

Hi all,

Our Torque 4.2.6.1 install has got into an interesting situation: a few
jobs were dispatched to nodes' moms, but due to transient network problems
the server never received an acknowledgement. Thus the jobs completed
successfully without ever having been started as far as the server is
concerned.

Now these jobs appear to be occupying node slots in 'qnodes -f' output
(i.e. they are listed in the "jobs = ..." parameter of a node), but they
do not appear *at all* in qstat. It seems likely pbs_server has deleted
the job descriptors (or at least, the keys that qstat iterates over) but
left them referenced by the node object. We're probably quite lucky
pbs_server didn't segfault at that point.

Things we've tried to clean up the spurious jobs:

qdel -p <jobid>: appears to succeed, no effect
momctl -c <jobid> on node in question: appears to succeed, no effect
Restarting the pbs_mom on one such node: no effect

Ideally we would not restart pbs_server since this will lose queued jobs
and needs coordinating with users. Any other ideas to get rid of this
inconsistent internal state?
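The inconsistency described above can at least be detected mechanically:
compare the job ids a node still references against what qstat reports.
A sketch (the parsing assumes the "jobs = 0/123.server,1/124.server"
attribute format that qnodes -f prints; adjust for your site):

```python
# Hedged sketch: find job ids that a node still references but that
# qstat no longer knows about -- the "zombie" state described above.
def parse_node_jobs(jobs_attr):
    """Extract job ids from a node's 'jobs' attribute value."""
    ids = set()
    for entry in jobs_attr.split(','):
        entry = entry.strip()
        if not entry:
            continue
        # each entry looks like "<slot>/<jobid>"
        ids.add(entry.split('/', 1)[-1])
    return ids

def zombie_jobs(node_jobs_attr, qstat_ids):
    """Jobs referenced by the node but unknown to the server's qstat."""
    return sorted(parse_node_jobs(node_jobs_attr) - set(qstat_ids))
```

That gives a definitive list of the stale references, though, as the
message says, actually purging them from pbs_server's node objects
without a restart is the open question.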

Chris

Ken Nielson | 9 Mar 18:09 2015

nvidia-smi and Torque

Hi all,

We are looking to deprecate the use of the GPU utility nvidia-smi in the next major release of Torque. This will make the NVIDIA nvml library a requirement when enabling NVIDIA GPU support.

Is this going to create problems for anyone?

Regards

Ken

--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
