Yectli Huerta | 27 Jan 22:21 2015

CUDA issues with torque 5.0.1

Hello,

I wonder if anybody else out there is having issues with the CUDA environment. We just upgraded
to Torque 5.0.1 and Moab 8.0.1, and CUDA code is now core dumping whenever we request more than
one node.

I tried CUDA SDK 6.0 and 6.5, and I also installed the latest CUDA drivers, version 331.62. We
are running CentOS 6.6.

 % qsub -I -l walltime=0:30:00,nodes=2:ppn=2:gpus=4
qsub: waiting for job 5424 to start
qsub: job 5424 ready

/usr/local/cuda-6.5/samples/bin/x86_64/linux/release> ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Segmentation fault (core dumped)

/usr/local/cuda-6.5/samples/bin/x86_64/linux/release> ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Segmentation fault (core dumped)

% qsub -I -l walltime=0:30:00,nodes=1:ppn=2:gpus=4
qsub: waiting for job 5425 to start
qsub: job 5425 ready
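Not from the thread, but one hedged first step under these symptoms is to look at what GPU list Torque actually handed the job before any CUDA call is made. PBS_GPUFILE is the conventional per-job GPU list Torque's pbs_mom writes; the file content below is faked so the sketch runs outside a real job:

```shell
# Sketch: print the GPUs Torque assigned to this job before touching CUDA.
# Inside a real job PBS_GPUFILE is set by pbs_mom; here we fake the file
# so the sketch is runnable anywhere.
GPUFILE="${PBS_GPUFILE:-/tmp/sketch_gpufile}"
if [ ! -f "$GPUFILE" ]; then
    printf 'nodeA-gpu0\nnodeA-gpu1\nnodeB-gpu0\nnodeB-gpu1\n' > "$GPUFILE"
fi
echo "GPUs assigned to this job:"
cat "$GPUFILE"
```

If the two-node case produces a malformed or empty GPU file while the one-node case looks sane, that would point at the batch system rather than the CUDA install.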

David Beer | 26 Jan 18:32 2015

TORQUE On Arm

All,

I'm curious whether anyone out there has tried using TORQUE on the ARM architecture. Has anyone ported it? Also, is there any interest in doing this?

--
David Beer | Senior Software Engineer
Adaptive Computing
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
mjm-www | 26 Jan 18:22 2015

Array job index limits

Using Torque 4.2.8 & Maui 3.2.6 on CentOS 5/6

I have had a user recently start using job arrays on the cluster. He is 
submitting them through a package called Raccoon 2 
(http://autodock.scripps.edu/resources/raccoon2).

It submits 5000-task array jobs to Torque (max_job_array_size is set to 
5000); the whole job has about 150,000 tasks. It uses the array ID to 
keep track of its tasks: the first batch is submitted with array IDs 
1-5000, and once the first batch has completed it submits the second 
batch with array IDs 5001-10000, and so on.

There appears to be an issue when the array ID reaches 100,000.

I can reproduce this on a small scale by asking for

-bash-3.2$ qsub -t 99999-100001 test.sh
qsub: submit error (Bad Job Array Request)
-bash-3.2$
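If the 100,000 ceiling turns out to be hard, one hedged workaround sketch is to keep the array IDs in a fixed 1-5000 window forever and pass the batch offset through the environment instead (TASK_OFFSET below is a made-up variable name, not a Torque builtin):

```shell
# Workaround sketch: array IDs never grow; the real task number is
# reconstructed inside the job script from an offset passed with -v.
BATCH=20                          # 0-based batch counter (example value)
TASK_OFFSET=$((BATCH * 5000))     # this batch covers tasks 100001-105000
echo "qsub -t 1-5000 -v TASK_OFFSET=$TASK_OFFSET test.sh"
# inside test.sh the real task number would be:
#   TASK=$((TASK_OFFSET + PBS_ARRAYID))
```

This keeps every submission well under any internal sub-job limit while preserving unique task numbers across batches.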

I did find (on 
http://docs.adaptivecomputing.com/mwm/Content/commands/msub.html) this 
statement about Moab "Moab enforces an internal limit of 100,000 
sub-jobs that a single array job submission can specify." and wondered 
if this reflected some limit in torque/maui.

Is this a hard limit, or is there a way of increasing it? Or am 
I doing something wrong?

Mark Meenan
IT Services
University of Glasgow

The University of Glasgow, charity number SC004401
Matt Westlake | 22 Jan 23:34 2015

Allow job submission whilst all nodes are marked offline

Hi All,

This is probably something that I have overlooked in the documentation but I've been searching for a while
and haven't found a real answer.

We run a small 10 node cluster for a local research group using Torque 2.5.7 as our resource manager. We have a
fairly simple cluster configuration with a single execution queue and jobs are rarely executed over more
than a single node.

Recently we've had the need to power down the compute nodes in the cluster for power work in the datacenter.
Some time before the scheduled work we offlined the nodes (`pbsnodes -o`) with the aim of allowing running
jobs to complete and stopping further jobs from being allocated.

This then resulted in all queued jobs being removed with "Job Deleted because it would never run". The
following is a job trace of one of the deleted jobs:

01/21/2015 14:57:47  S    Job Queued at request of user <at> cluster, owner = user <at> cluster, job name =
src_+16+0+16+28, queue = batch
01/21/2015 14:57:47  S    Job Modified at request of Scheduler <at> cluster
01/21/2015 14:57:47  L    Not enough of the right type of nodes available
01/21/2015 14:57:47  S    enqueuing into batch, state 1 hop 1
01/21/2015 14:57:47  A    queue=batch
01/21/2015 17:42:00  L    Job Deleted because it would never run
01/21/2015 17:42:00  S    Job deleted at request of Scheduler <at> cluster
01/21/2015 17:42:00  S    dequeuing from batch, state COMPLETE
01/21/2015 17:42:00  A    requestor=Scheduler <at> cluster

Here is a copy of our current queue config:
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
set server scheduling = True
set server acl_hosts = cluster
set server acl_hosts += localhost
set server operators = root <at> *
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 19784

Below is a copy of our nodes file:
node01 np=16 gpus=4
node02 np=16 gpus=4
node03 np=16 gpus=4
node04 np=16 gpus=4
node05 np=16 gpus=4
node06 np=16 gpus=4
node07 np=16 gpus=4
node08 np=16 gpus=4
node09 np=16 gpus=4
node10 np=16 gpus=4

Is there a way to allow job submission whilst all available compute nodes are offline?
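One alternative worth trying, as a hedged sketch (whether the scheduler still purges queued jobs under this setup depends on its own idle-job policy): stop the queue rather than offlining the nodes, so submissions are still accepted but nothing is dispatched.

```
qmgr -c "set queue batch started = False"   # jobs accepted, none dispatched
# ...after the power work:
qmgr -c "set queue batch started = True"
```

With `enabled = True` left alone, qsub keeps working; only dispatch stops.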

Thanks in advance for your help.

Regards
-- 
Matt Westlake
Server and Storage Services | Technology Services | The University of Adelaide
David Beer | 22 Jan 18:57 2015

Mpiexec / Hydra With 4.2.9

I recently heard someone mention difficulty using mpiexec and Hydra on TORQUE 4.2.9. I'm wondering whether anyone else has had issues with this, or if it is likely a configuration problem on the user's end. Any thoughts?

--
David Beer | Senior Software Engineer
Adaptive Computing
Novosielski, Ryan | 21 Jan 17:28 2015

pbsnodes not reporting Maui status

Hi there. Running Torque 2.5.13 with Maui 3.3.1. In Maui, I have MAXLOAD set so that a load of 0.9 or higher
will cause the node to be marked busy and jobs will not be sent there. This is a good idea for a variety of
reasons (e.g., the node is not really idle).

At one point, pbsnodes would report the node status as busy. This appears to have stopped happening. While
Maui's checknode command will report the node as busy and, I'm assuming, still prevent jobs from being
scheduled on those nodes, something has clearly changed here for the worse.

Any idea where to look?

Thanks!

--
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS      |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novosirj <at> rutgers.edu - 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
     `'

Ken Nielson | 19 Jan 20:15 2015

Torque and CUDA_VISIBLE_DEVICES

Hi all,

In Torque 4.2.8 we made it so the environment variable CUDA_VISIBLE_DEVICES would be set by default for NVIDIA GPU jobs. This fixed a problem for some users and broke the current job scripts of others.

We would like to remedy the problem by giving you more control over whether the variable is set, either at job submission time or through a MOM config option.

Here is a proposal. Please let me know if it will work for you or not. If it does not work then please let me know what will.

Turn setting CUDA_VISIBLE_DEVICES off by default, and add a qsub option which indicates that the CUDA_VISIBLE_DEVICES environment variable should be set for this job.

Also add a MOM config option which indicates whether the CUDA_VISIBLE_DEVICES environment variable will always be set or never be set.
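Until options like these exist, a job script can defend itself either way; a minimal sketch (the fallback GPU list 0-3 is an assumption for illustration, not something Torque provides):

```shell
# Guard sketch: honour CUDA_VISIBLE_DEVICES if the batch system set it,
# otherwise fall back to an explicit list (0,1,2,3 is just an example).
if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
    export CUDA_VISIBLE_DEVICES=0,1,2,3
fi
echo "Job will see GPUs: $CUDA_VISIBLE_DEVICES"
```

This way the script behaves the same whether the server version sets the variable or not.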

I look forward to your feedback.

Ken

--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
Jean-Christophe Ducom | 17 Jan 01:45 2015

Resource usage limitation: scratch disk

All-
Is there an existing mechanism in Torque that is similar to Moab's 
RESOURCELIMITPOLICY to enforce the local disk space used (and requested 
with -l file=xxgb)?
Something like ign{cput,mem,vmem,walltime} in pbs_mom: an ignfile?

If not, what mechanism would you suggest? (I'm looking at writing a 
prologue script to create a filesystem image (dd, losetup, mkfs, mount).)
Thank you for any suggestion
JC
Ken Nielson | 16 Jan 18:35 2015

documentation errors in acl_hosts and submit_hosts

Hi all,

We recently fixed a bug that was introduced in the TORQUE 5.0 code and broke part of the access logic to pbs_server. We reverted the patch and it now works like it always has. While fixing this problem we discovered that the documentation was incorrect around the acl_host_enable, acl_hosts, submit_hosts and allow_node_submit parameters. In our upcoming release the documentation will be corrected. In the meantime you can see what the changes will be at 

Regards

Ken
--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
James | 16 Jan 07:13 2015

State R but not running (walltime increase without CPUtime)

Dear Torque users,

When I submit a job I am often getting the following strange behaviour:

36700.master    myjob    user    30:14:2     R    long    (job is ok!)
36701.master    myjob    user    00:00:00    R    long    (job not running!)

Instead of incrementing the time, the job just hangs in "R" status. SSHing to the node and running "top" shows the software is not running.

This is specific to the node: the first job in the list above happened to run on a node which is not having this problem, but the bottom one did. Sometimes rebooting the offending node clears the problem, and sometimes the problem persists even after a reboot, making the node unusable.

This also happens per user: while one user may be having this problem with a node, another may be fine.

Can anyone help in debugging this? Where should I start? I include some information below. You can see that CPU time stays zero while walltime increases.

Thank you,

James

Job Id: 36704.master.domain
    Job_Name = myjob
    Job_Owner = user <at> master.domain
    resources_used.cput = 00:00:00
    resources_used.mem = 15356kb
    resources_used.vmem = 564312kb
    resources_used.walltime = 00:41:18
    job_state = R
    queue = long
    server = head.node.com
    Checkpoint = u
    ctime = Fri Jan 16 14:06:25 2015
    Error_Path = master.domain:/home/user/Simulation/myjob.e36704
    exec_host = comp09/17+comp09/16+comp09/15+comp09/14+comp09/13+comp09/12
    exec_port = 15003+15003+15003+15003+15003+15003
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Jan 16 14:06:25 2015
    Output_Path = master.domain:/home/user/Simulation/myjob.o36704
    Priority = 0
    qtime = Fri Jan 16 14:06:25 2015
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=6
    session_id = 23447
    Variable_List = PBS_O_QUEUE=long,PBS_O_HOST=master.domain,
    PBS_O_HOME=/home/user,PBS_O_LANG=ja_JP.UTF-8,PBS_O_LOGNAME=user,
    PBS_O_PATH= <many directories.....>
    PBS_O_SHELL=/bin/bash,PBS_SERVER=master.domain,
    PBS_O_WORKDIR=/home/user/Simulation
    comment = Job started on Fri Jan 16 at 14:06
    etime = Fri Jan 16 14:06:25 2015
    submit_args = myjob
    start_time = Fri Jan 16 14:06:25 2015
    start_count = 1
    fault_tolerant = False
    submit_host = master.domain
    init_work_dir = /home/user/Simulation

Lee, H. | 15 Jan 00:04 2015

Re: strange behavior of qsub caused by max_user_queuable

Hi Ken,

yes, thanks!  We are indeed trying 5.0.1 in our testbed environment.

Hong

---------------------
dr. Hurng-Chun Lee
ICT manager

Donders Institute for Brain, Cognition and Behaviour, 
Centre for Cognitive Neuroimaging
Radboud University Nijmegen

e-mail: h.lee <at> donders.ru.nl
tel: +31(0) 243610977
web: http://www.ru.nl/donders/

On 14 Jan 2015, at 22:56, Ken Nielson <knielson <at> adaptivecomputing.com> wrote:

Dr. Lee,

This is a race condition that appears to have been introduced with the 4.2 code base. We have a proposed fix for the problem but have not yet implemented it. It should be fixed in the 4.2.10 release, but that is not a promise. In Torque 5.0 we changed the mechanism that counts the queued jobs, and so far we have had only one report of the problem. We are currently diagnosing that report. Even so, the 5.0 code base seems to be working much better than the 4.2 code base.

Is an upgrade to 5.0 something you can do?

Ken

On Wed, Jan 14, 2015 at 12:17 PM, Lee, H. <h.lee <at> donders.ru.nl> wrote:
Hi Ken,

Thanks for the answer.  Since our users are affected by this issue, do you know which version in the 4.x branch does not have this issue?  We may consider running a lower version until the fix is in place.

Thanks again.

Cheers, Hong

---------------------
dr. Hurng-Chun Lee
ICT manager

Donders Institute for Brain, Cognition and Behaviour, 
Centre for Cognitive Neuroimaging
Radboud University Nijmegen

e-mail: h.lee <at> donders.ru.nl

On 13 Jan 2015, at 20:39, Ken Nielson <knielson <at> adaptivecomputing.com> wrote:

Hong,

This is a known issue. For now the only thing you can do is reboot pbs_server to clear the max user count.

We are working on this.

Ken

On Tue, Jan 13, 2015 at 12:22 PM, Lee, H. <h.lee <at> donders.ru.nl> wrote:
Dear all,

(P.S. I am trying to post my issue again to the mailing list, as the first post does not seem to have got through.  Being bounced, maybe?)

We just upgraded Torque from 2.5.11 to 4.2.9 in our cluster.  Maui 3.3.1 is used for the job scheduler.  After a few days of running, we started to encounter a strange behavior.

We have the following setting for a queue called ‘batch’ and restricted the total number of queuable jobs to 2000 per user using max_user_queuable.

===
create queue batch
set queue batch queue_type = Execution
set queue batch max_queuable = 20000
set queue batch max_user_queuable = 2000
set queue batch resources_max.mem = 274877906944b
set queue batch resources_max.walltime = 48:00:00
set queue batch resources_default.mem = 10485760b
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 00:01:00
set queue batch disallowed_types = interactive
set queue batch enabled = True
set queue batch started = True
===  

Our problem is that for some users who do not have any jobs in the queue (i.e. qstat -a shows empty), job submission mysteriously gives an error saying that the user has reached the maximum number of jobs in the queue. For example,

===
winfra <at> mentat001:~
528 $ qstat -a

winfra <at> mentat001:~
529 $ echo '/bin/hostname' | qsub -q batch
qsub: submit error (Maximum number of jobs already in queue for user MSG=total number of current user's jobs exceeds the queue limit: user winfra <at> mentat001.dccn.nl, queue batch)
===
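For what it's worth, one can double-check the server's belief against reality by tallying queued jobs per user from qstat's default listing. The sample below is canned so the sketch runs without a cluster; column 3 being the user and column 5 the state matches the short qstat format, but that layout is an assumption worth checking against your own output:

```shell
# Count queued (state Q) jobs per user from qstat-style output.
# The canned sample stands in for: qstat | tail -n +3
sample='1.srv  job1  winfra  0  Q  batch
2.srv  job2  winfra  0  Q  batch
3.srv  job3  other   0  R  batch'
echo "$sample" | awk '$5 == "Q" { q[$3]++ } END { for (u in q) print u, q[u] }'
```

If the real tally for the affected user is far below 2000 while the server still rejects submissions, that supports the internal-counter-drift explanation rather than an actual limit hit.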

Did anyone encounter similar issue before?

We are also looking for help investigating this issue.  Any suggestion would be appreciated.  Thank you very much in advance.

Please find the full setting dumped from qmgr below (in case it is helpful):

===
Max open servers: 9
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue vgl
#
create queue vgl
set queue vgl queue_type = Execution
set queue vgl max_queuable = 50
set queue vgl max_user_queuable = 5
set queue vgl resources_max.mem = 10737418240b
set queue vgl resources_max.walltime = 08:00:00
set queue vgl resources_default.mem = 4294967296b
set queue vgl resources_default.neednodes = vgl
set queue vgl resources_default.walltime = 02:00:00
set queue vgl max_user_run = 2
set queue vgl enabled = True
set queue vgl started = True
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch max_queuable = 20000
set queue batch max_user_queuable = 2000
set queue batch resources_max.mem = 274877906944b
set queue batch resources_max.walltime = 48:00:00
set queue batch resources_default.mem = 10485760b
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 00:01:00
set queue batch disallowed_types = interactive
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue short
#
create queue short
set queue short queue_type = Execution
set queue short max_queuable = 20000
set queue short max_user_queuable = 2000
set queue short resources_max.mem = 8589934592b
set queue short resources_max.walltime = 02:00:00
set queue short resources_default.mem = 4294967296b
set queue short resources_default.walltime = 01:59:59
set queue short disallowed_types = interactive
set queue short enabled = True
set queue short started = True
#
# Create and define queue verylong
#
create queue verylong
set queue verylong queue_type = Execution
set queue verylong max_queuable = 20000
set queue verylong max_user_queuable = 2000
set queue verylong resources_max.mem = 68719476736b
set queue verylong resources_max.walltime = 72:00:00
set queue verylong resources_default.mem = 4294967296b
set queue verylong resources_default.walltime = 71:59:59
set queue verylong disallowed_types = interactive
set queue verylong enabled = True
set queue verylong started = True
#
# Create and define queue interactive
#
create queue interactive
set queue interactive queue_type = Execution
set queue interactive max_queuable = 2000
set queue interactive max_user_queuable = 4
set queue interactive resources_max.mem = 68719476736b
set queue interactive resources_max.walltime = 72:00:00
set queue interactive resources_default.mem = 1073741824b
set queue interactive resources_default.walltime = 00:01:00
set queue interactive max_user_run = 2
set queue interactive enabled = True
set queue interactive started = True
#
# Create and define queue matlab
#
create queue matlab
set queue matlab queue_type = Execution
set queue matlab max_queuable = 20000
set queue matlab max_user_queuable = 2000
set queue matlab resources_max.mem = 274877906944b
set queue matlab resources_max.walltime = 24:00:00
set queue matlab resources_default.mem = 10485760b
set queue matlab resources_default.neednodes = matlab
set queue matlab resources_default.walltime = 00:01:00
set queue matlab disallowed_types = interactive
set queue matlab enabled = True
set queue matlab started = True
#
# Create and define queue veryshort
#
create queue veryshort
set queue veryshort queue_type = Execution
set queue veryshort max_queuable = 20000
set queue veryshort max_user_queuable = 2000
set queue veryshort resources_max.mem = 8589934592b
set queue veryshort resources_max.walltime = 00:20:00
set queue veryshort resources_default.mem = 4294967296b
set queue veryshort resources_default.walltime = 00:19:59
set queue veryshort disallowed_types = interactive
set queue veryshort enabled = True
set queue veryshort started = True
#
# Create and define queue test
#
create queue test
set queue test queue_type = Execution
set queue test max_queuable = 10
set queue test max_user_queuable = 2
set queue test max_user_run = 2
set queue test enabled = True
set queue test started = True
#
# Create and define queue automatic
#
create queue automatic
set queue automatic queue_type = Route
set queue automatic route_destinations = long
set queue automatic route_destinations += verylong
set queue automatic route_destinations += batch
set queue automatic route_destinations += interactive
set queue automatic enabled = True
set queue automatic started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long max_queuable = 20000
set queue long max_user_queuable = 2000
set queue long resources_max.mem = 8589934592b
set queue long resources_max.walltime = 24:00:00
set queue long resources_default.mem = 4294967296b
set queue long resources_default.walltime = 23:59:59
set queue long disallowed_types = interactive
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = torque.dccn.nl
set server acl_hosts += dccn-l029.dccn.nl
set server default_queue = automatic
set server log_events = 128
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 300
set server poll_jobs = True
set server mom_job_sync = True
set server mail_domain = dccn-l029.dccn.nl
set server keep_completed = 43200
set server submit_hosts = mentat001
set server submit_hosts += mentat002
set server submit_hosts += mentat003
set server submit_hosts += mentat203
set server submit_hosts += mentat204
set server submit_hosts += mentat205
set server submit_hosts += mentat206
set server submit_hosts += mentat207
set server submit_hosts += mentat208
set server submit_hosts += mentat301
set server submit_hosts += mentat302
set server submit_hosts += dccn-c003
set server submit_hosts += dccn-c004
set server submit_hosts += dccn-c005
set server submit_hosts += dccn-c006
set server submit_hosts += dccn-c007
set server submit_hosts += dccn-c008
set server submit_hosts += dccn-c009
set server submit_hosts += dccn-c010
set server submit_hosts += dccn-c011
set server submit_hosts += dccn-c012
set server submit_hosts += dccn-c013
set server submit_hosts += dccn-c014
set server submit_hosts += dccn-c015
set server submit_hosts += dccn-c016
set server submit_hosts += dccn-c017
set server submit_hosts += dccn-c018
set server submit_hosts += dccn-c019
set server submit_hosts += dccn-c020
set server submit_hosts += dccn-c021
set server submit_hosts += dccn-c022
set server submit_hosts += dccn-c023
set server submit_hosts += dccn-c024
set server submit_hosts += dccn-c025
set server submit_hosts += dccn-c026
set server submit_hosts += dccn-c027
set server submit_hosts += dccn-c028
set server submit_hosts += dccn-c349
set server submit_hosts += dccn-c350
set server submit_hosts += dccn-c351
set server submit_hosts += dccn-c352
set server submit_hosts += dccn-c353
set server submit_hosts += dccn-c354
set server submit_hosts += dccn-c355
set server submit_hosts += dccn-c356
set server submit_hosts += dccn-c357
set server submit_hosts += dccn-c358
set server submit_hosts += dccn-c359
set server submit_hosts += dccn-c360
set server submit_hosts += dccn-c361
set server submit_hosts += dccn-c362
set server submit_hosts += dccn-c363
set server submit_hosts += dccn-c364
set server submit_hosts += dccn-c365
set server submit_hosts += dccn622
set server auto_node_np = True
set server next_job_number = 7794350
set server clone_batch_size = 150
set server clone_batch_delay = 10
set server record_job_info = True
set server record_job_script = False
set server job_log_file_max_size = 1024
set server job_log_file_roll_depth = 1000
set server job_log_keep_days = 3
set server moab_array_compatible = True
set server nppcu = 1
===

Cheers, Hong





--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606






--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606

