Martin Siegert | 21 May 22:14 2015

file resource

Hi,

the documentation at
http://docs.adaptivecomputing.com/torque/5-1-0/help.htm#topics/torque/2-jobs/requestingRes.htm
says:
file - The amount of total disk requested for the job.
I am wondering whether that is actually correct. As far as I can tell file
actually requests an amount of disk space per requested processor (core).
E.g., our nodes have about 540GB of disk space available. A job that requests
-l procs=4,file=300gb
gets scheduled on 4 nodes (one core per node). A job that requests
-l nodes=2:ppn=2,file=300gb
does not run at all: checkjob shows that all nodes are "rejected: Disk".
However, a job with
-l nodes=2:ppn=2,file=250gb
runs.

These experiments indicate that the documentation is incorrect.
(this is with torque-5.1.0 and moab-8.1.0)
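As a sanity check, the per-core reading is consistent with the numbers above,
assuming "file" gets multiplied by the cores requested per node (ppn) against
our ~540GB of local disk per node:

   -l procs=4,file=300gb        # 1 core/node: 1 x 300gb = 300gb/node < 540gb -> runs (on 4 nodes)
   -l nodes=2:ppn=2,file=300gb  # 2 cores/node: 2 x 300gb = 600gb/node > 540gb -> rejected: Disk
   -l nodes=2:ppn=2,file=250gb  # 2 cores/node: 2 x 250gb = 500gb/node < 540gb -> runs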

Cheers,
Martin

-- 
Martin Siegert
WestGrid/ComputeCanada
Simon Fraser University
Burnaby, British Columbia
Brown, David M JR | 19 May 22:31 2015

Hey I'm the Fedora/EPEL guy.

Hi,

I'm the new Fedora/EPEL maintainer for torque. ... (waiting to be drilled on changing things...)

I've been pushing some changes to the torque package recently to clear up the backlog of bugs and CVEs that
were present in various versions of EPEL and Fedora. I'm sorry if some of these changes caused problems; I
know what it's like to operate a large compute cluster and how disruptive updates can be.

I've got updates pushed to all currently supported Fedora and EPEL versions and have done some
testing with these packages before pushing. They should show up in the testing repositories soon. I've
been able to at least build an All-In-One cluster and submit jobs on all versions of the software pushed.

If you are familiar with Chef, please check out the cookbook for torque at http://github.com/dmlb2000/torque-cookbook.git.

My plan moving forward is to keep all the versions of torque as close as possible to what's supported by the
primary development team. I'm not going to patch CVEs on old versions of the code to resolve security
issues; I'll just update to the version that fixes the issue. I'll also try to keep them as close to the same
version as possible. This means that running EL5/6/7 compute nodes should be fine and it provides a
path to upgrade your OSs when you get the chance. I'm going to be investigating torque 5.0/5.1 though I'll
be sure to run through my test suite before I push changes, so feel free to ask or keep track of github to see
when version bumps happen.

As always, feel free to email me with questions; I'm new to torque but not new to HPC.

Thanks,
- David Brown
Michel Béland | 14 May 21:09 2015

Job rejected by all possible destinations

Hello,

We are running Torque 5.1 on a Linux cluster. It seems that the error 
message "Job rejected by all possible destinations" is not shown 
anymore, except for interactive jobs:

belandmi <at> briaree2[1012]% qsub -lnodes=5:ppn=12 -lwalltime=10:0:0 -W 
group_list=ghb-221-02 -I
qsub: waiting for job 3481328.egeon2 to start
qsub: PBS: Job rejected by all possible destinations (check syntax, 
queue resources, ...)
belandmi <at> briaree2[1013]%

However, if the job is not interactive, we get this:

belandmi <at> briaree2[1015]% qsub -lnodes=5:ppn=12 -lwalltime=10:0:0 -W 
group_list=ghb-221-02 <<END
? sleep 3600
? END
3481331.egeon2
belandmi <at> briaree2[1016]% qstat -a 3481331.egeon2
qstat: Unknown Job Id Error 3481331.egeon2
belandmi <at> briaree2[1017]%

On the server, we can see the following error message with tracejob:

[root <at> egeon2 maui]# tracejob 3481331.egeon2
/var/spool/pbs/mom_logs/20150514: No such file or directory
/var/spool/pbs/sched_logs/20150514: No such file or directory
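The tracejob output above only shows the missing mom/sched log warnings; a hedged
next step, reusing the spool path and date from those messages, is to grep the
server log for the job directly, since pbs_server should have recorded why it
refused the job:

grep 3481331 /var/spool/pbs/server_logs/20150514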


Sven Schumacher | 13 May 14:00 2015

Torque/Maui-Compatibility up to which version?

Hi,

Up to which version of Torque is Maui compatible?
I'm just setting up a new cluster (our former one was running the 2.5 series
of Torque with Maui) and would like to choose a newer version of Torque
without losing compatibility with Maui.

What benefits does Moab offer compared to Maui?

Thanks

Sven

-- 
Sven Schumacher - System Administrator, Tel: (0511) 762-2753
Leibniz Universitaet Hannover
Institut für Turbomaschinen und Fluid-Dynamik       - TFD
Appelstraße 9 - 30167 Hannover
Institut für Kraftwerkstechnik und Wärmeübertragung - IKW
Callinstraße 36 - 30167 Hannover	

Simon Kainz | 6 May 13:41 2015

starvation & multiple queues

Using the fifo scheduler, is it possible to set up 2 queues,
with only 1 affecting starvation?

What I mean is that I do not want jobs from Queue A to trigger
the starvation process, which would then delay jobs from all queues.

Looking through the source, I see that dedicated queues are excluded
from the starvation handling. So I set one queue to "dedicated" (by
setting an appropriate dedicated_prefix in the scheduler config file),
but now the other queue is stuck, with all jobs having this comment:

"Not Running: Dedicated time conflict"

Obviously I didn't get the dedicated queue concept right.

So, is it possible to have 2 queues with either "per-queue-starvation",
or with only starvation by one queue?
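
For reference, here is a hedged sketch of the stock pbs_sched (fifo) sched_config
knobs involved; the values are illustrative, not a tested configuration:

   help_starving_jobs: true   ALL       # the starvation handling I want to scope per queue
   max_starve: 24:00:00                 # age after which a job counts as starving
   dedicated_prefix: ded                # queues named ded* are tied to dedicated time windows
                                        # (sched_priv/dedicated_time), not simply exempted
                                        # from starvation handling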

Thanks in advance.

-- 
DI Simon Kainz
Graz, University of Technology
Department Computing
Phone: ++43 (0) 316 / 873 6885
W. Graham McCullough | 4 May 19:47 2015

Regarding which torque 5 version to use

Hello all!

We're about to roll out a new cluster here, and I suspect that the best version of torque 5 right now is 5.0.2, but that is confusing the brass around here because torque 5.1.0 has been out for a while. I heard somewhere that 5.0.2 has memory-leak patches that 5.1.0 lacks; is this correct? If so, why the two branches so early in torque 5? What's the end goal for the two packages?

Thanks for your time!

Graham McCullough
HPC Systems Administrator
Purdue University
eatdirt | 1 May 15:18 2015

test

test
eatdirt | 1 May 13:27 2015

prologalarm time not respected on torque-4.2.10

Hi people,
I have an epilog.user script plus a system epilog script running on our local
cluster, allowing some users to use local scratch and get their data back afterwards.

The docs say that the prologalarm time is 5 minutes by default, and indeed
after 5 minutes all these scripts are killed (though the nodes are not marked as
down). We wanted to increase that time and set:

$prologalarm 3600

in blabla/mom_priv/config

Our problem is that this setting is not respected: the scripts are still
killed after 5 minutes instead of 1 hour. momctl reports the new prologalarm
as 3600, but it just does not take effect.
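
A hedged double-check, using only the paths visible in the momctl output below:
confirm the line is in the config the running pbs_mom actually read, and that the
MOM was restarted (or sent a HUP) after the change so the config was re-read:

   grep prologalarm /var/lib/torque/mom_priv/config
   momctl -d 3 | grep -i 'Prolog Alarm'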

That looks like a Torque bug; can anyone help me with this issue? :-/

------

Here is an example.
On the running node, these are the pbs_mom logs:

05/01/2015 12:25:27;0080;   pbs_mom.28100;Svr;preobit_preparation;top
05/01/2015 12:26:07;0008;   pbs_mom.28100;Job;scan_for_terminated;pid
35895 not tracked, statloc=0, exitval=0
05/01/2015 12:26:52;0008;   pbs_mom.28100;Job;scan_for_terminated;pid
35938 not tracked, statloc=0, exitval=0
05/01/2015 12:27:37;0008;   pbs_mom.28100;Job;scan_for_terminated;pid
35991 not tracked, statloc=0, exitval=0
05/01/2015 12:28:22;0008;   pbs_mom.28100;Job;scan_for_terminated;pid
36047 not tracked, statloc=0, exitval=0
05/01/2015 12:28:37;0002;   pbs_mom.28100;Svr;pbs_mom;Torque Mom
Version = 4.2.10, loglevel = 1
05/01/2015 12:29:07;0008;   pbs_mom.28100;Job;scan_for_terminated;pid
36090 not tracked, statloc=0, exitval=0
05/01/2015 12:29:52;0008;   pbs_mom.28100;Job;scan_for_terminated;pid
36138 not tracked, statloc=0, exitval=0
05/01/2015 12:30:24;0080;   pbs_mom.28100;Job;2355.cosmo;obit sent to
server
05/01/2015 12:30:24;0008;   pbs_mom.28100;Job;2355.cosmo;forking to
user, uid: 500  gid: 100  homedir: '/home/chris'
05/01/2015 12:30:26;0080;   pbs_mom.28100;Job;2355.cosmo;removed job
script

The job was terminated at 12:25, preobit started, and pbs_mom sent the obit
for the job at 12:30, 5 minutes later.

momctl -d3 for that very same node reports a prolog alarm of 3600s,
which is thus not respected!

Server[0]: cosmo (192.168.0.1:15001)
   Last Msg From Server:   374 seconds (DeleteJob)
   WARNING:  no messages sent to server
HomeDirectory:          /var/lib/torque/mom_priv
stdout/stderr spool directory: '/var/lib/torque/spool/'
   (104704215 blocks available)
NOTE:  syslog enabled
MOM active:             65834 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               1 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE  (mlock)
TCP Timeout:            60 seconds
Prolog:                 /var/lib/torque/mom_priv/prologue (enabled)
Prolog Alarm Time:      3600 seconds
Alarm Time:             0 of 10 seconds
Trusted Client List:

127.0.0.1:0,192.168.0.1:0,192.168.0.1:15003,192.168.0.2:15003,192.168.0.3:15003,192.168.0.4:15003,192.168.0.5:15003,192.168.0.6:15003,192.168.0.7:15003,192.168.0.8:15003,192.168.0.9:15003,192.168.0.10:15003,192.168.0.11:15003,192.168.0.12:15003,192.168.0.13:15003,192.168.0.14:15003,192.168.0.15:15003,192.168.0.16:15003,192.168.0.17:15003,192.168.0.18:0,192.168.0.18:15003,192.168.0.19:15003,192.168.0.20:15003,192.168.0.21:15003,192.168.0.22:15003,192.168.0.23:15003,192.168.0.24:15003:
   0
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected
Paul Raines | 4 May 15:24 2015

limit jobs/walltime on per day basis


Is there a way with torque/maui to limit the number of jobs (or even
better, the cumulative walltime) a user can use in a particular queue
on a per-day basis?

I want to set up a very high priority queue that users can submit to, but
limit its use by restricting a user's jobs in it to no more than two hours
of walltime each day.  They could have one two-hour job or eight 15-minute
jobs, for example, in that queue over one day.
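
For what it's worth, a hedged sketch of the concurrency half of this with plain
qmgr queue attributes (the queue name "fast" is made up here; this caps what a
user can run at once, not a true per-day budget, which would probably need Maui
fairshare or external accounting on top):

   qmgr -c 'set queue fast resources_max.walltime = 02:00:00'
   qmgr -c 'set queue fast max_user_run = 1'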

---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129	    USA

