Grigory Shamov | 24 Apr 17:21 2015

How to use epilogue.parallel in Torque 2.5?

Hi All,

Is the epilogue.parallel feature available to users, or is it like the system epilogue? Somehow I
could not find the answer in the Torque documentation.
What would be the qsub syntax to use epilogue.parallel, if it is available to users?
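
For comparison, a per-job user epilogue is normally requested on the qsub command line roughly as in the sketch below (the path is just a placeholder); whether epilogue.parallel has a per-job equivalent like this is exactly the open question:

    qsub -l epilogue=/home/myuser/my_epilogue.sh myjob.sh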

--
Grigory Shamov
Westgrid/University of Manitoba
Avalon Johnson | 24 Apr 02:34 2015

Problem with JOBIDs when hostname in server_name file is a CNAME


Hello all,

We have a multi-homed host for which we would like to use a 'server_name' that 
is one of the host's "aliases".

Although we have the following entry in the 'server_name' file

           hostalias.dom1.usc.edu

(BTW I also changed SERVERHOST in the torque.cfg to no avail)

when jobs are created, Torque appends

           realinterfacename.dom1.usc.edu

It appears that Torque does a "gethostbyname" on either the hostname of the 
machine, the SERVERHOST, or the "-H" parameter.

Is there a way to FORCE Torque to use a specified hostname to append to the 
numeric job ID?

Basically I would like to be able to use one of the aliases for a host as the 
string torque appends to the JOBID.
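
A hedged diagnostic sketch (standard resolver tools, not a Torque fix): if 
the canonical interface name comes back from these lookups, that is 
presumably what pbs_server ends up appending:

           getent hosts hostalias.dom1.usc.edu
           python -c 'import socket; print(socket.getfqdn("hostalias.dom1.usc.edu"))'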

Thanks.

Avalon Johnson
ITS HPCC
USC

Dave Ulrick | 22 Apr 19:20 2015

Issues after implementing a local DNS server

I'm running TORQUE 4.2.9 on a 60-node HPC. Up to now, I've relied on 
/etc/hosts files for host name resolution, but today I've implemented a 
local DNS server for resolving the cluster's private IPs. Making this 
happen involved rolling back some, ahem, prevarications in the /etc/hosts 
files that caused a host name such as "foo.niu.edu" that resolves in 
public DNS to a publicly routable IP to resolve to one of the host's 
private IPs (e.g., 192.168.1.x). Basically, what I want to do is to 
change /var/spool/torque/server_name from:

torque_server.public.domain

to:

torque_server.private.domain

I've changed /var/spool/torque/server_name on all nodes--TORQUE server, 
compute nodes, and login node--to the brand new DNS name for the TORQUE 
server's 192.168.1.x IP. Some things work fine including batch job 
submission and 'qstat', but other things fail: interactive 'qsub' jobs 
hang and 'qdel <job id>' returns without error but takes no action. I can 
work around these issues by adding to /etc/hosts on the login node:

<private_torque_server_ip> torque_server.public.domain

and /etc/hosts on the compute nodes:

<private_login_node_ip> login_node.public.domain
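
A hedged aside on why the workaround seems to help: the failing operations 
depend on the hosts contacting each other by the names they resolve, so it 
is worth comparing what each host actually resolves, plus the relevant 
server lists (standard commands, nothing version-specific assumed):

getent hosts torque_server.private.domain      # on the login node
getent hosts login_node.public.domain          # on a compute node
qmgr -c 'list server acl_hosts'
qmgr -c 'list server submit_hosts'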

I have an /etc/hosts.equiv on the TORQUE server that includes both the 
public and private DNS names for the login node. 'set server acl_hosts' on 
(Continue reading)

Steven Lo | 21 Apr 23:27 2015

pbs_server restart crash when array jobs are running


Hi,

We are running Torque 4.1.5.1 and Maui 3.1.1 on RedHat 5.8 system.

Recently, we noticed that pbs_server crashes with a segfault about 15 
seconds after each restart.  We believe an array job is causing the crash, 
since we noticed that the Torque spool directory has files with the 
extensions .JB, .SC, .TA, and .AR.  After we clean out those files, 
pbs_server starts without crashing.

We found out that 4.2 has a similar problem:

"If bad job array files exit at startup, pbs_server may segfault. If you 
encounter this behavior, move the offending .JB and .AR files out of the 
$TORQUE_HOME/server_priv/jobs and $TORQUE_HOME/server_priv/arrays 
directories, respectively (TRQ-1427). "
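
A hedged sketch of that cleanup, following the release note above (paths 
assume a default /var/spool/torque install; stop pbs_server first and move 
the files aside rather than deleting them):

mkdir -p /root/torque-quarantine                                # any backup location works
mv /var/spool/torque/server_priv/jobs/*.JB    /root/torque-quarantine/
mv /var/spool/torque/server_priv/arrays/*.AR  /root/torque-quarantine/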

Is there a fix for the 4.1 version, or do we have to move up to 4.2 (with 
the fix, of course)?

Thanks.

Steven.
Andrus, Brian Contractor | 21 Apr 20:45 2015

naccesspolicy

Ok, I submitted a job:

qsub -I -X -l nodes=1:gpus=1,naccesspolicy=singleuser,walltime=23:00:00

I expected to get a single node with a GPU for 23 hours and be the only user on it during that time.

Seems that it is good for Resource_List:

    Resource_List.naccesspolicy = singleuser
    Resource_List.neednodes = 1:gpus=1
    Resource_List.nodect = 1
    Resource_List.nodes = 1:gpus=1
    Resource_List.pmem = 1b
    Resource_List.walltime = 23:00:00

Ok. I am on the node … lo and behold… ANOTHER user’s job gets put there too!

SO:

What ARE the options to naccesspolicy and what do they mean?

I have seen mention of:

SHARED
SINGLEJOB
SINGLETASK
SINGLEUSER
UNIQUEUSER

Using Torque 5.1.0 and Moab 8.1.0
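
A hedged aside: naccesspolicy is interpreted by Moab rather than by Torque itself, so the place to check what was actually applied to the job is the scheduler (checkjob is a Moab command; the cluster-wide default, if one is set, would be the NODEACCESSPOLICY parameter in moab.cfg):

    checkjob -v <jobid>        # look for the node access policy Moab applied
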
Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238
Byron Lengsfield | 21 Apr 19:13 2015

FIFO

Hi,

I am using Torque 5.1.0 with GPUs.

The pbs scheduler is not using a FIFO scheduling algorithm, which is a problem for us.

I changed the sched_config file to invoke the option:       strict_fifo:  true   ALL
but the scheduling is still not FIFO.  Can you suggest a fix?

The previous version of sched_config (the default) was similar to the one we are using under Torque 4.2.6, which worked well.
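
A hedged aside for anyone comparing notes: pbs_sched normally only re-reads sched_config when it is restarted or sent a SIGHUP, so after editing the file a reload step along these lines is needed (assuming pbs_sched, not Moab, is the scheduler in use, and a default /var/spool/torque install):

    grep -i strict_fifo /var/spool/torque/sched_priv/sched_config
    kill -HUP $(pgrep pbs_sched)        # or restart the pbs_sched service
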
Thanks,

Byron Lengsfield, PhD
Recording Integration Lab
HGST, A Western Digital Co.
San Jose, CA
Mario Lang | 21 Apr 15:33 2015

Limiting a queue to a certain number of processor cores?

Hi.

We are trying to figure out how to use Torque to create a queue which
would only allow a certain maximum number of processors to be used at
any given time.  On the system where we want to establish this, we have
16-core nodes.  Our goal would be to limit the number of concurrently used
processor cores in that queue to something like 800, on a cluster with a
little more than 2000 cores available in total.  Initial investigation seems to
indicate that the version of Torque we are currently using (2.4.16) does
not support this at all.  The resources_available.procct queue attribute looks
like what I want, but it is only available from Torque 2.5 onwards.  So
we installed Torque 4.2.9 on a test system.  We couldn't quite get it to
work as expected, so we upgraded to 4.2.10.  Here, however, the
resources_available.procct queue attribute no longer seems to be available.  I
can set it with something like

set queue foo resources_available.procct = 40

but when I later invoke "print queue foo" the attribute is not shown.
However, I also do not get a warning that this queue attribute is not
supported.  Looks like a bug to me.
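
A hedged diagnostic sketch: asking qmgr for the attribute explicitly, and
checking qstat's view of the queue, may at least show whether the value was
stored even though "print queue" omits it:

    qmgr -c 'set queue foo resources_available.procct = 800'
    qmgr -c 'list queue foo resources_available.procct'
    qstat -Qf foo | grep -i procct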

The question is: is resources_available.procct gone again?  If so, how are
we supposed to get the (IMHO rather basic) feature described above
otherwise?  Is there any other queue attribute that does an equivalent
job to procct?  Note that I do not want to limit the number of nodes, I
really want to limit the number of total CPU cores in use, across all
nodes of the system.

TIA
-- 
Mario Lang
Graz University of Technology
IT Services - Computing
Steyrergasse 30/1, 8010 Graz, Austria - Europe
Phone: +43 316 873 6897
Mobile: +43 664 60 873 6897
Email: mlang <at> tugraz.at
www.zid.tugraz.at
Taras Shapovalov | 20 Apr 11:10 2015

What stable version of Torque do you recommend?

Hi guys,

Could you advise which version of Torque we should install on a cluster now, in case there will not be any updates for the next 10 months? Let's assume Moab 8 may also be used in the future. I see there are two versions of 5.x on the download page, 5.1.0 and 5.0.1, and both should work with Moab 8, so it's unclear which choice we should make.

Best regards,

Taras
Michel Béland | 14 Apr 22:46 2015

mom_hierarchy change with running jobs

Hello,

I just upgraded our biggest cluster to Torque 5.1 last week. We have a 
problem with some nodes being marked down although we can still log in to 
them and jobs are still running. We want to change the 
server_priv/mom_hierarchy file to see if it helps. Can I just change the 
server_priv/mom_hierarchy file and restart pbs_server, or should I be 
worried about any side effects?

Is there a way to broadcast the server_priv/mom_hierarchy change to the 
compute nodes without restarting pbs_server?

-- 
Michel Béland, analyste en calcul scientifique
michel.beland <at> calculquebec.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone : 514 343-6111 poste 3892     télécopieur : 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.ca)

Brian Yang | 10 Apr 20:34 2015

qsub -k behaviors reverted

Hi all,

I have been using torque-2.5.12 for a while. When I realized that there were millions of STDIN.* files kept in my home directory, I decided to get rid of them and set "-k n" for all the qsub lines. The weird thing is, it still kept the STDIN.* files, so I played with the qsub options; here are the results:

-k eo   (no STDIN.* will be retained)
-k oe   (no STDIN.* will be retained)
-k e     (STDIN.o* will be retained)
-k o     (STDIN.e* will be retained)
-k n     (STDIN.* will be retained)

That's why I said "reverted"...

I am using Linux kernel 2.6.32-220 on an x86_64 system.

I don't want to retain these STDIN.* files; I could definitely use "-k eo" to avoid this, but I really want to know why it happens this way.

Besides, is there a way to set it globally instead of inserting this option into all the qsub lines?
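
A hedged, partial alternative to editing every qsub command line: the -k option can also be given as a directive inside the job script itself, which at least keeps it in one place per script (per the table above, "-k eo" is what actually suppressed the files here); whether torque-2.5.12 offers a true server-wide default is a separate question:

    #PBS -k eo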

Thanks!
马凯 | 8 Apr 17:16 2015

The pbs_server could not get information from the pbs_moms

I have installed pbs_server and pbs_mom on my head node, and pbs_mom on all compute nodes.
After I started pbs_server and pbs_mom on my head node, and pbs_mom on all the compute nodes, I got this:
===========================================================================
gpu-cluster-1
     state = down
     power_state = Running
     np = 1
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 8
...
administrator
     state = free
     power_state = Running
     np = 1
     ntype = cluster
     status = rectime=1428505640,macaddr=48:62:76:ff:b7:f5,cpuclock=Fixed,varattr=,jobs=,state=free,netload=2713443950,gres=,loadave=0.06,ncpus=8,physmem=32727660kb,availmem=47104340kb,totmem=49185384kb,idletime=11224,nusers=3,nsessions=7,sessions=1677 13377 13541 13556 13744 43860 43983,uname=Linux administrator 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
===========================================================================
The administrator node is my head node, and gpu-cluster-1 is one of my compute nodes. Why is the compute node's state always down? How can I solve it?
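
A hedged diagnostic sketch (names, ports, and paths taken from the output above or assumed defaults): a node usually shows state = down with no status line when pbs_server is not receiving updates from that node's pbs_mom, so the usual checks are whether the MOM knows where the server is and whether the hosts can reach each other on the MOM/server ports:

    momctl -d 3 -h gpu-cluster-1                         # run on the head node
    pbsnodes -ln                                         # list nodes currently marked down
    grep pbsserver /var/spool/torque/mom_priv/config     # on gpu-cluster-1
    cat /var/spool/torque/server_name                    # on gpu-cluster-1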

