Richard Young | 25 May 00:21 2016

Sending jobs to another cluster queue

I am wondering if somebody has come across this problem before and can supply some help. I have two clusters,
one running PBSPro and the other running Torque/Maui. I have set up a queue on the PBSPro cluster to send
jobs to the Torque/Maui cluster; however, the jobs just sit in the queue waiting to run.

If I run the command "echo hostname | qsub -q default@hpc-sunadmin-prd-t1" on the PBSPro cluster, it comes
back with the error:

pbs_iff: error returned: 15031
pbs_iff: error returned: 15031
No Permission.
qsub: cannot connect to server hpc-sunadmin-prd-t1 (errno=15007)

Whereas if I run the same command on the Torque/Maui cluster, it returns the correct job number. On the
Torque/Maui cluster I have added the PBSPro login and admin nodes to both the acl_hosts and submit_hosts
options, and this has made no difference. I have also tried adding acl_user entries, but this again has made
no difference.
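
For reference, a minimal sketch of the Torque-side settings I mean, with a hypothetical hostname standing in for our PBSPro login node:

qmgr -c "set server acl_hosts += pbspro-login.example.com"
qmgr -c "set server acl_host_enable = true"
qmgr -c "set server submit_hosts += pbspro-login.example.com"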

Has anybody set up this type of system before, and can you supply some insight into fixing the problem?

Thanks

---------------------------------------------------------------------
Richard A. Young
ICT Services
HPC Systems Engineer
University of Southern Queensland
Toowoomba, Queensland 4350
Australia 
Email: Richard.Young@usq.edu.au   Phone: (07) 46315557
Mob:   0437544370          Fax:   (07) 46312798

Vince Forgetta | 22 May 15:02 2016

Elevated Group Priority on Subset of Nodes

I am using a recent build of Torque/Maui (w/ PBS) to schedule jobs on a cluster with heterogeneous hardware. The hardware includes two sets of 10 nodes to which I would like two groups to have access, with each group getting elevated priority on one of the sets. For example:

    Node set A of 10 nodes has elevated priority for User Group 1
    Node set B of 10 nodes has elevated priority for User Group 2

I am familiar with how this is accomplished for all nodes, which is documented here:

http://docs.adaptivecomputing.com/maui/5.1.3priorityusage.php

However, I am unsure of the best strategy for setting this type of priority on a subset of the cluster. From what I can ascertain from the Maui docs it may be done using partitions, but it is unclear to me how to set group priority on a specific partition using GROUPCFG. For example,

GROUPCFG[group1] PLIST=setA:setB PRIORITY=10000

would seem not to set elevated priority for just setA.

How do I do this?
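
For concreteness, the partition-based layout I can imagine looks something like this (an untested sketch; the node and partition names are hypothetical):

# maui.cfg
NODECFG[nodeA01] PARTITION=setA   # ...and so on for the rest of set A
NODECFG[nodeB01] PARTITION=setB   # ...and so on for the rest of set B
# steer each group toward its preferred node set first
GROUPCFG[group1] PLIST=setA:setB PDEF=setA PRIORITY=10000
GROUPCFG[group2] PLIST=setB:setA PDEF=setB PRIORITY=10000

It is not obvious to me, though, that PLIST ordering or PDEF elevates priority on one set rather than merely changing default placement.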

Thanks

Vince

David Beer | 20 May 23:28 2016

Potential Modification to Epilogue Behavior

All,

The long-standing behavior of epilogue scripts is that if they cannot open the stdout / stderr files for the job, then they do not execute. We've received a request to have the epilogue execute whether or not it can open the output files, and simply log a message if it cannot open them. Obviously, we can add a parameter to make it behave this way if needed.

However, in considering this issue and how I know many sites use epilogue scripts - which sometimes include essential tear-down functionality - it would seem to me that most sites would probably want the epilogue to execute whether or not it can append to the output files.
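
To illustrate, the kind of essential tear-down I have in mind looks roughly like this (a hypothetical sketch; Torque passes the job id to the epilogue as its first argument):

#!/bin/sh
# hypothetical epilogue: tear-down that should arguably run even when the
# job's stdout/stderr spool files cannot be opened
jobid="$1"
rm -rf "/scratch/$jobid"    # release node-local scratch space (path is an assumption)
logger -t epilogue "tear-down complete for job $jobid"
exit 0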

I would welcome your feedback.

Cheers,

--
David Beer | Torque Architect
Adaptive Computing
UFUK YILMAZ | 17 May 09:04 2016

Job disappearing from qstat, but still running in the background and completing without any problem?

Hello,

 

I am running a job, but qstat reports it as completed within a second. The job is in fact still running in the background and completes without any problem.

I am using Torque 5.1.0 with the default Torque scheduler.

tracejob prints:

/cm/shared/apps/torque/var/spool/mom_logs/20160517: No such file or directory

Job: 1920.master.cm.cluster

05/17/2016 09:41:54  A    queue=shortq
05/17/2016 09:41:55  S    Job Modified at request of root@master.cm.cluster
05/17/2016 09:41:55  L    Job Run
05/17/2016 09:41:55  S    Job Run at request of root@master.cm.cluster
05/17/2016 09:41:55  S    child reported success for job after 0 seconds
                          (dest=???), rc=0
05/17/2016 09:41:55  S    preparing to send 'b' mail for job
                          1920.master.cm.cluster to uyilmaz@master.cm.cluster
                          (---)
05/17/2016 09:41:55  S    Not sending email: User does not want mail of this
                          type.
05/17/2016 09:41:55  A    user=uyilmaz group=se account=uyilmaz_eu
                          jobname=pstm_script_all_cube queue=shortq
                          ctime=1463467314 qtime=1463467314 etime=1463467314
                          start=1463467315 owner=uyilmaz@master.cm.cluster
                          exec_host=cn04/0-19+cn05/0-19
                          Resource_List.neednodes=2:ppn=20
                          Resource_List.nodect=2 Resource_List.nodes=2:ppn=20
05/17/2016 09:42:04  S    obit received - updating final job usage info
05/17/2016 09:42:04  S    job exit status 0 handled
05/17/2016 09:42:04  S    Exit_status=0 resources_used.cput=00:00:02
                          resources_used.energy_used=0 resources_used.mem=0kb
                          resources_used.vmem=0kb
                          resources_used.walltime=00:00:03
05/17/2016 09:42:04  S    preparing to send 'e' mail for job
                          1920.master.cm.cluster to uyilmaz@master.cm.cluster
                          (Exit_status=0
05/17/2016 09:42:04  S    Not sending email: User does not want mail of this
                          type.
05/17/2016 09:42:04  S    on_job_exit valid pjob: 1920.master.cm.cluster
                          (substate=50)
05/17/2016 09:42:04  A    user=uyilmaz group=se account=uyilmaz_eu
                          jobname=pstm_script_all_cube queue=shortq
                          ctime=1463467314 qtime=1463467314 etime=1463467314
                          start=1463467315 owner=uyilmaz@master.cm.cluster
                          exec_host=cn04/0-19+cn05/0-19
                          Resource_List.neednodes=2:ppn=20
                          Resource_List.nodect=2 Resource_List.nodes=2:ppn=20
                          session=7530 total_execution_slots=40
                          unique_node_count=2 end=1463467324 Exit_status=0
                          resources_used.cput=00:00:02
                          resources_used.energy_used=0 resources_used.mem=0kb
                          resources_used.vmem=0kb
                          resources_used.walltime=00:00:03

##############################################################
As you can see, there is also a configuration issue: there is no /cm/shared/apps/torque/var/spool/mom_logs/20160517 file, so I collected the mom logs from the compute nodes' local spool directories.
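
Concretely, I gathered them with something like this (a sketch using the node names that appear below):

for n in cn04 cn05; do
    ssh "$n" cat /cm/local/apps/torque/var/spool/mom_logs/20160517 > "mom_log.$n"
done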

 

 

mom_logs on cn04 @ /cm/local/apps/torque/var/spool/mom_logs/20160517

05/17/2016 09:41:55;0001;   pbs_mom.6136;Job;job_nodes;job: 1920.master.cm.cluster numnodes=2 numvnod=40
05/17/2016 09:41:55;0002;   pbs_mom.6136;Job;1920.master.cm.cluster;allocate_demux_sockets: stdout: 9:50387  stderr: 10:32788
05/17/2016 09:41:55;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;im_request:rec req 'ALL_OKAY' (0) for job 1920.master.cm.cluster from 10.51.13.105:888 ev 3896 task 0 cookie 7DE104B1ABA6D5A9144A3BAEC51A2EF7
05/17/2016 09:41:55;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;im_request: all sisters have reported in, launching job locally
05/17/2016 09:41:55;0001;   pbs_mom.6136;Job;1920.master.cm.cluster;phase 2 of job launch successfully completed
05/17/2016 09:41:55;0001;   pbs_mom.6136;Job;TMomFinalizeJob3;Job 1920.master.cm.cluster read start return code=0 session=7530
05/17/2016 09:41:55;0001;   pbs_mom.6136;Job;TMomFinalizeJob3;job 1920.master.cm.cluster started, pid = 7530
05/17/2016 09:41:55;0001;   pbs_mom.6136;Job;1920.master.cm.cluster;exec_job_on_ms:job successfully started
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;scan_for_terminated;pid 7530 harvested for job 1920.master.cm.cluster, task 1, exitcode=0
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: killing pid 7549 task 1 gracefully with sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=R) after sig 15
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;kill_task: process (pid=7549/state=Z) after sig 15
05/17/2016 09:41:58;0080;   pbs_mom.6136;Job;1920.master.cm.cluster;scan_for_terminated: job 1920.master.cm.cluster task 1 terminated, sid=7530
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;job was terminated
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;check_jobs_main_process: master task has exited - sent kill job request to 1 sisters
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;task is dead
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;is_job_state_exiting:job is in non-exiting substate WAIT_SISTER_KILL_CONFIRM, no obit sent at this time
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;im_request:rec req 'ALL_OKAY' (0) for job 1920.master.cm.cluster from 10.51.13.105:450 ev 3897 task 0 cookie 7DE104B1ABA6D5A9144A3BAEC51A2EF7
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;handle_im_kill_job_response:KILL_JOB acknowledgement received
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;handle_im_kill_job_response: ALL DONE, set EXITING job 1920.master.cm.cluster
05/17/2016 09:41:58;0008;   pbs_mom.6136;Job;kill_job;mother_superior_cleanup: sending signal 9, "KILL" to job 1920.master.cm.cluster, reason: local task termination detected
05/17/2016 09:41:58;0080;   pbs_mom.6136;Job;1920.master.cm.cluster;epilog subtask created with pid 7658 - substate set to JOB_SUBSTATE_OBIT - registered post_epilogue
05/17/2016 09:42:04;0080;   pbs_mom.6136;Req;post_epilogue;preparing obit message for job 1920.master.cm.cluster
05/17/2016 09:42:04;0080;   pbs_mom.6136;Job;1920.master.cm.cluster;obit sent to server
05/17/2016 09:42:04;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;attempting to copy file 'head01.cm.cluster:/home/uyilmaz/test/pstm_script_all_cube.o1920'
05/17/2016 09:42:04;0008;   pbs_mom.6136;Job;1920.master.cm.cluster;forking to user, uid: 551  gid: 102  homedir: '/home/uyilmaz'
05/17/2016 09:42:04;0008;   pbs_mom.6136;Job;req_deletejob;1920.master.cm.cluster
05/17/2016 09:42:04;0080;   pbs_mom.6136;Job;mom_deljob;deleting job 1920.master.cm.cluster in state EXITED
05/17/2016 09:42:04;0080;   pbs_mom.6136;Job;1920.master.cm.cluster;removed job script

########################################

mom_logs on cn05 @ /cm/local/apps/torque/var/spool/mom_logs/20160517

05/17/2016 09:41:55;0008;   pbs_mom.6120;Job;1920.master.cm.cluster;im_request:rec req 'JOIN_JOB' (1) for job 1920.master.cm.cluster from 10.51.13.104:447 ev 3896 task 0 cookie 7DE104B1ABA6D5A9144A3BAEC51A2EF7
05/17/2016 09:41:55;0008;   pbs_mom.6120;Job;1920.master.cm.cluster;im_join_job_as_sister: JOIN_JOB 1920.master.cm.cluster node 1
05/17/2016 09:41:55;0001;   pbs_mom.6120;Job;job_nodes;job: 1920.master.cm.cluster numnodes=2 numvnod=40
05/17/2016 09:41:55;0008;   pbs_mom.6120;Job;1920.master.cm.cluster;JOIN JOB as node 1
05/17/2016 09:41:58;0008;   pbs_mom.6120;Job;1920.master.cm.cluster;im_request:rec req 'KILL_JOB' (2) for job 1920.master.cm.cluster from 10.51.13.104:601 ev 3897 task 0 cookie 7DE104B1ABA6D5A9144A3BAEC51A2EF7
05/17/2016 09:41:58;0008;   pbs_mom.6120;Job;kill_job;im_kill_job_as_sister: sending signal 9, "KILL" to job 1920.master.cm.cluster, reason: kill_job message received

Regards,

Ufuk YILMAZ

The information contained in this communication may contain confidential or legally privileged information. Responsibility about sent contents belongs to sender. Turkish Petroleum (TP) doesn't accept any legal responsibility for the contents and attachments of this message. If you are not the intended recipient you are hereby notified that any disclosure, use, copying, distribution or taking any action in reliance on the contents of this information is strictly prohibited. If you have received this communication in error, please notify the sender immediately by responding to this e-mail and then delete it from your system. The sender does not accept any liability for any errors or omissions or any viruses in the context of this message which arise as a result of internet transmission.
Thank you.


A. Upadhyay | 10 May 13:18 2016

pbsnodes: Server has no node list MSG=node list is empty ERROR

Dear torqueusers,

I am a novice to torque and HPC clusters. I have the following configuration on a workstation: 2 CPUs
(octa-core) + 2 Nvidia GPUs (Quadro 4200). I wish to use the CPU (all 16 cores) + GPU
combination for testing a parallel code.

I am configuring torque on openSUSE Leap 42.1. I have configured my machine with a static IP
'192.168.1.1' and hostname 'plasma01'. My torque (torque-6.0.1-1456945733_daea91b) has been
compiled with the following:

./configure --without-tcl --with-nvidia-gpus --enable-nvidia-gpus --prefix=/opt/torque-6.0.1 \
    --with-nvml-include=/usr/include/nvidia/gdk --with-nvml-lib=/usr/local/cuda/lib64 \
    --enable-docs --enable-mom --enable-server --enable-clients --disable-gui \
    --with-default-server=plasma01 --with-scp --enable-cpuset

plasma01:# make
plasma01:# make packages
plasma01:# ./torque-package-server-linux-x86_64.sh --install
plasma01:# ./torque-package-clients-linux-x86_64.sh --install
plasma01:# ./torque-package-mom-linux-x86_64.sh --install
plasma01:# ./torque-package-devel-linux-x86_64.sh --install

and set up torque using the following commands:

plasma01:# cd /opt/torque/sbin
plasma01:# ./pbs_server
plasma01:# ./pbs_sched
plasma01:# ./pbs_mom

plasma01:# ./torque.setup ajitup
initializing TORQUE (admin: ajitup@plasma01)

You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?y
root 8401 1 1 16:22 ? 00:00:00 pbs_server -t create
Max open servers: 9
set server operators += ajitup@plasma01
Max open servers: 9
set server managers += ajitup@plasma01
plasma01:# qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long resources_default.nodes = 1
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = plasma01
set server managers = ajitup@plasma01
set server operators = ajitup@plasma01
set server default_queue = batch
set server log_events = 2047
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 300
set server poll_jobs = True
set server down_on_error = True
set server mom_job_sync = True
set server keep_completed = 300
set server moab_array_compatible = True
set server nppcu = 1
set server timeout_for_job_delete = 120
set server timeout_for_job_requeue = 120

plasma01:# pbsnodes
pbsnodes: Server has no node list MSG=node list is empty - check 'server_priv/nodes' file

Meanwhile, /var/spool/torque/server_name contains 'plasma01', /var/spool/torque/server_priv/nodes
contains 'plasma01 np=16', and /etc/hosts has '192.168.1.1 plasma01'.

I am not able to resolve this error; pbs_connect is unable to connect to the host
(localhost or plasma01, defined as above).
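
One thing I have read (a hedged note on my part, not a confirmed fix) is that pbs_server only reads server_priv/nodes at startup, so the server must be restarted after that file is edited:

qterm -t quick
./pbs_server        # from /opt/torque/sbin, as above
pbsnodes -a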

Kindly help me.

Thanks,
Ajit

Gabriel A. Devenyi | 9 May 21:38 2016

Announce - qbatch 1.0 - Execute shell command lines in parallel (serial farm) on SGE/PBS clusters

We (Gabriel A. Devenyi and Jon Pipitone) would like to announce the 1.0 release of qbatch, a command-line tool for easily running a list of commands in parallel (serial farming) on a compute cluster. The tool takes a list of commands, divides it into batches of arbitrary size, and then submits each batch as a separate job or as part of an array job. qbatch also gives you a consistent interface for submitting commands on PBS and SGE clusters or locally (support for other systems is planned/in testing; PRs welcome), while setting requirements for processors, walltime, memory, and job dependencies. It can be used as a quick interface to spread work out on a cluster, or as the glue for connecting a simple pipeline to a cluster (see https://github.com/CobraLab/antsRegistration-MAGeT for a sample implementation).

The target audience of qbatch is two-fold: it is immediately useful to users of PBS or SGE clusters for simplifying their job construction; in addition, through the use of environment variables, cluster administrators can craft a default qbatch deployment that lets new cluster users quickly submit jobs that honour the cluster's policies.

For more information, check out our GitHub page: http://github.com/pipitone/qbatch
qbatch is also available on PyPI via pip install qbatch
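
A minimal usage sketch (the environment variable name here is from memory of the README; please verify against qbatch --help):

# build a list of shell command lines, one per task
seq 1 100 | sed 's|^|./process_subject |' > joblist
# submit the list in batches; QBATCH_SYSTEM picks the scheduler backend
QBATCH_SYSTEM=pbs qbatch joblist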


Gabriel A. Devenyi B.Eng. Ph.D.
Research Computing Associate
Computational Brain Anatomy Laboratory
Cerebral Imaging Center
Douglas Mental Health University Institute
Affiliate, Department of Psychiatry
McGill University
t: 514.761.6131x4781
e: gdevenyi@gmail.com

Paul Harder II | 6 May 19:29 2016

Bug?

I'm seeing what appears to be a Torque (v6.0.0) bug in the Amazon Elastic Compute Cloud (EC2) environment. My program submits a bunch of individual jobs and job arrays. One of the job arrays has a job that starts to run but gets terminated at a random point when the Amazon cloud server it is on is reclaimed by the cloud. I know it was running, because it managed to write about 20 lines of text to its log file, yet the Torque stdout and stderr files do not exist. The "qstat -a" command says that the job array is in R (running) status. The "qstat -at | grep 3444" command lists all of the individual jobs in job array 3444, but the specific job that wrote that log file does not appear.

My suspicion is that "qstat -a" is not looking at a table of jobs to see how many of them have reported completion, but is instead simply comparing counts. When the job array was submitted, PBS noted that it had 81 jobs. qstat sees that only 80 jobs have reported completion, so it thinks the array is still running. But "qstat -at" does not even know that the missing job ever existed: when the node it was running on went down, PBS simply forgot the job.

Summarizing: The very existence of a job within a job array is forgotten, yet its failure to complete prevents the job array from completing.

 

Paul H. Harder II, Ph.D., AMS / Senior Software Engineer, Meteorologist 
Information Management Services / Flight Services Engineering
2925 Briarpark Dr., 7th Floor, Houston, TX 77042
Phone: 713-430-7081 / Mobile: 281-793-7743 / Fax: 713-430-7074 

Paul.Harder@rockwellcollins.com

 

Sridhar Acharya Malkaram | 3 May 16:33 2016

Help with nodes staying 'down' (torque)

(I am resending this message with the correct subject line. Administrators, please remove my earlier message with the wrong subject.)

The following are the logs on the server and on node01.

ON SERVER:

> tail /var/lib/torque/server_logs/20160503
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job Modified at request of root@XXXXX.cluster
05/03/2016 09:19:04;0040;PBS_Server.2942;Req;node_spec;job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;could not locate requested resources '1:ppn=1' (node_spec failed) job allocation request exceeds currently available cluster nodes, 1 requested, 0 available
05/03/2016 09:19:04;0080;PBS_Server.2942;Req;req_reject;Reject reply code=15046(Resource temporarily unavailable MSG=job allocation request exceeds currently available cluster nodes, 1 requested, 0 available), aux=0, type=RunJob, from root@XXXXX.cluster
05/03/2016 09:19:04;0008;PBS_Server.2942;Job;0.XXXXX.cluster;Job Modified at request of root@XXXXX.cluster
05/03/2016 09:19:48;0002;PBS_Server.2958;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0
05/03/2016 09:24:56;0002;PBS_Server.2940;Svr;PBS_Server;Torque Server Version = 4.2.10, loglevel = 0

> tail -f sched_logs/20160503
05/03/2016 07:48:19;0080; pbs_sched.2825;Svr;main;brk point 98287616
05/03/2016 07:58:24;0080; pbs_sched.2825;Svr;main;brk point 98811904
05/03/2016 08:08:29;0080; pbs_sched.2825;Svr;main;brk point 99336192
05/03/2016 08:18:34;0080; pbs_sched.2825;Svr;main;brk point 99860480
05/03/2016 08:28:39;0080; pbs_sched.2825;Svr;main;brk point 100384768
05/03/2016 08:38:44;0080; pbs_sched.2825;Svr;main;brk point 100909056
05/03/2016 08:48:49;0080; pbs_sched.2825;Svr;main;brk point 101433344
05/03/2016 08:58:54;0080; pbs_sched.2825;Svr;main;brk point 102486016
05/03/2016 09:19:04;0080; pbs_sched.2825;Svr;main;brk point 103010304
05/03/2016 09:29:09;0080; pbs_sched.2825;Svr;main;brk point 103534592



ON node01:
> tail /var/lib/torque/mom_logs/20160503
05/03/2016 09:26:57;0002;   pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for server
05/03/2016 09:26:57;0008;   pbs_mom.15663;Job;scan_for_terminated;entered
05/03/2016 09:26:57;0080;   pbs_mom.15663;Svr;mom_get_sample;proc_array load started
05/03/2016 09:26:57;0002;   pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
05/03/2016 09:26:57;0080;   pbs_mom.15663;n/a;mom_get_sample;proc_array loaded - nproc=0
05/03/2016 09:26:57;0008;   pbs_mom.15663;Job;scan_for_terminated;pid 15682 not tracked, statloc=0, exitval=0
05/03/2016 09:27:42;0002;   pbs_mom.15663;n/a;mom_server_all_update_stat;composing status update for server
05/03/2016 09:27:42;0008;   pbs_mom.15663;Job;scan_for_terminated;entered
05/03/2016 09:27:42;0080;   pbs_mom.15663;Svr;mom_get_sample;proc_array load started
05/03/2016 09:27:42;0002;   pbs_mom.15663;Svr;get_cpuset_pidlist;/dev/cpuset/torque contains 0 PIDs
05/03/2016 09:27:42;0080;   pbs_mom.15663;n/a;mom_get_sample;proc_array loaded - nproc=0
05/03/2016 09:27:42;0008;   pbs_mom.15663;Job;scan_for_terminated;pid 15684 not tracked, statloc=0, exitval=0



On 05/03/2016 09:16 AM, Bidwell, Matt wrote:
> What do the logs in server_logs & mom_logs say?
>
> -Matt

Sridhar Acharya Malkaram | 2 May 20:40 2016

Help with nodes staying 'down' (torque)

Hi,

I'd be grateful if you could offer suggestions on the following problem. I recently upgraded packages, and the torque packages were updated to the latest rpm versions. However, I am unable to get the nodes into the active state.


> qnodes
node01.cluster
     state = down
     np = 12
     properties = allcomp,gpu,compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1

node02.cluster
     state = down
     np = 12
     properties = allcomp,gpu,compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 1
….



> momctl -d 3 -h node01
Host: node01.cluster/node01.cluster   Version: 4.2.10   PID: 12009
Server[0]: XXXXXX.cluster (10.1.1.254:15001)
  WARNING:  no messages received from server
  WARNING:  no messages sent to server
HomeDirectory:          /var/lib/torque/mom_priv
stdout/stderr spool directory: '/var/lib/torque/spool/' (108669845blocks available)
NOTE:  syslog enabled
MOM active:             1755 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               7 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE  (mlock)
TCP Timeout:            60 seconds
Prolog:                 /var/lib/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:  10.1.1.1:0,10.1.1.254:0,127.0.0.1:0:  0
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete
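
As a quick sanity check on raw connectivity, the default Torque ports shown above (15001 for pbs_server, 15002/15003 for the MOM) can be probed with plain netcat:

nc -zv node01 15002                    # server -> MOM service port
ssh node01 nc -zv 10.1.1.254 15001     # node -> pbs_server port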



The nodes and server were active and functioning properly previously.
I have checked all the config files and they seem to be correct.
I can ssh to node01 or to other nodes, and back to the server, without a password.
The hostname of the server is the same in server_name on both the server and client config files, as well as in the /etc/hosts entries.
The munge and trqauthd daemons are running. As can be seen from the momctl command issued on the server, it produces output, but the WARNINGs indicate that the server and client are not communicating. I can't find a clue in the server or mom logs. Could you provide some directions to resolve this?

I am providing the output of other key commands.


> uname -a
Linux stinger.cluster 2.6.32-573.7.1.el6.centos.plus.x86_64 #1 SMP Wed Sep 23 03:02:55 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


> rpm -aq | grep torque
torque-drmaa-devel-4.2.10-9.el6.x86_64
torque-server-4.2.10-9.el6.x86_64
torque-pam-4.2.10-9.el6.x86_64
torque-libs-4.2.10-9.el6.x86_64
torque-scheduler-4.2.10-9.el6.x86_64
torque-drmaa-4.2.10-9.el6.x86_64
torque-debuginfo-5.0.1-1.adaptive.el6.x86_64
torque-mom-4.2.10-9.el6.x86_64
torque-client-4.2.10-9.el6.x86_64
torque-docs-4.2.10-9.el6.noarch
torque-4.2.10-9.el6.x86_64
torque-devel-4.2.10-9.el6.x86_64
torque-gui-4.2.10-9.el6.x86_64

> qstat -q

server: XXXXXX.cluster

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    0   1  6   E R
                                               ----- -----
                                                   0     1
  
> qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 6
set queue batch resources_max.ncpus = 8
set queue batch resources_max.nodes = 1
set queue batch resources_default.ncpus = 1
set queue batch resources_default.neednodes = 1:ppn=1
set queue batch resources_default.walltime = 24:00:00
set queue batch max_user_run = 6
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = XXXXXX.cluster
set server acl_hosts += node01
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server next_job_number = 1
set server authorized_users = root@node01.cluster
set server moab_array_compatible = True
set server nppcu = 1

> cat /var/lib/torque/server_priv/nodes
node01.cluster np=12 gpus=1  allcomp gpu compute
node02.cluster np=12 gpus=1  allcomp gpu compute
node03.cluster np=12 gpus=1  allcomp gpu compute
node04.cluster np=12 gpus=1  allcomp gpu compute
node05.cluster np=12 gpus=1  allcomp gpu compute
node06.cluster np=12 gpus=1  allcomp gpu compute
node07.cluster np=12 gpus=1  allcomp gpu compute
node08.cluster np=12 gpus=1  allcomp gpu compute
node09.cluster np=12 gpus=1  allcomp gpu compute


Sridhar Acharya Malkaram
Assistant Research Professor, Bioinformatics
208 B, Hamblin Hall
Dept. of Biology
West Virginia State University
Institute, WV-25112

Nicholas Lindberg | 29 Apr 22:13 2016

adding extra information to Torque termination e-mails

Hello,

I've researched this a bit and wanted to ask the list quickly before pursuing another route: using the "mail_body_fmt" parameter, is it possible to add things to the termination e-mail (sent upon job completion) such as wall time, number of processors used, and/or custom values? Or is this better suited to an epilogue script solution?

A few of our users have requested that the e-mail they get upon job completion contain some of the accounting information, and while I know I could do this with an epilogue script, I was wondering if there is any way to add it to the normal termination e-mail so that users aren't getting two e-mails per job.
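
For context, the epilogue route I would rather avoid looks roughly like this (a sketch; the positional arguments are the ones described in the Torque prologue/epilogue documentation, with the exit status as the tenth):

#!/bin/sh
# hypothetical epilogue that mails accounting details to the job owner
jobid="$1"; user="$2"; jobname="$4"; used="$7"; status="${10}"
printf 'Job %s (%s) exited with status %s\nresources used: %s\n' \
    "$jobid" "$jobname" "$status" "$used" | mail -s "job $jobid finished" "$user"
exit 0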

Thanks.

--

Nick Lindberg

Director of Engineering

Milwaukee Institute 

414-269-8332 (O)

608-215-3508 (M)


