Ole Holm Nielsen | 5 Feb 08:28 2016

ANNOUNCE: Updated pbsacct-1.4.11 simple Torque accounting statistics tool

Dear Torque users,

There is an updated version 1.4.11 of the very simple accounting 
statistics tool "pbsacct" for the Torque batch system available at
ftp://ftp.fysik.dtu.dk/pub/Torque/pbsacct-1.4.11.tar.gz.
Please report bugs etc. to Ole.H.Nielsen <at> fysik.dtu.dk.

CHANGELOG of Version 1.4.11:

pbsjobs: Fixed an error seen with some Torque versions where the job start 
time is recorded as "start=0" (a Torque bug).
A sensible wait-time is now calculated despite this.

pbsacct: User accounting generalized: A user may now belong to several 
groups.
Usage in each group will be accounted separately.

pbsacct: Sanity check for empty accounting file.

Several scripts now define AWK=/bin/gawk, which lets you select your 
system's awk version.
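
For illustration, the relevant line near the top of the scripts looks something like this (adjust the path to whatever awk your system provides, e.g. /usr/bin/gawk or mawk):

AWK=/bin/gawk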

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark
Daniel Barker | 5 Feb 15:46 2016

qpeek replacement

Does anyone have any suggestions for a qpeek replacement?  My understanding is that qpeek is deprecated. qpeek often leaves a tail process on the compute node after the job completes. I realize that I could use an epilogue to kill the process, but I would prefer that cleanup to be handled by the replacement itself.
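
For what it's worth, the epilogue I have in mind is roughly the sketch below (untested; it assumes Torque passes the job owner's username as the second epilogue argument and that the leftover process matches 'tail -f'):

#!/bin/sh
# sketch: kill any tail processes qpeek left behind for this job's owner
JOBUSER="$2"                       # job owner (Torque epilogue argument 2)
pkill -u "$JOBUSER" -f 'tail -f' 2>/dev/null
exit 0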

--
Dan Barker



_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
John | 4 Feb 13:36 2016

Problems with FIFO behaviour

I am running torque-6.0.0.1 under Arch Linux.  It is configured using all the defaults, so I expect it to use the FIFO scheduler in scheduler.cc.  I only have a 4-processor machine, but I want to be able to schedule batch jobs to run sequentially in the order I submit them.  I only have one queue, called all.q.  Below is the output from qstat:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
100.localhost              STDIN            john                   0 Q all.q         
101.localhost              STDIN            john                   0 Q all.q         
102.localhost              STDIN            john                   0 Q all.q         
103.localhost              STDIN            john                   0 Q all.q         
104.localhost              STDIN            john                   0 Q all.q         
105.localhost              STDIN            john                   0 Q all.q         
107.localhost              STDIN            john                   0 Q all.q         
108.localhost              STDIN            john                   0 R all.q         
109.localhost              STDIN            john                   0 R all.q         
110.localhost              STDIN            john                   0 R all.q         
111.localhost              STDIN            john                   0 R all.q        

They are all single-processor jobs, and it has correctly run 4 of them, but it has chosen to run the last jobs submitted first (LIFO, I think).  I have tried to compile the BASL-based scheduler, but it now seems very out of date and gives a great many compilation errors.

If anyone has any tips on how to proceed I would be grateful.  I have tried changes to the pbs_sched.conf file, such as strict_fifo, but to no avail. My current configuration is dumped below, along with a sketch of the change I have been trying.

Thanks, John

qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue all.q
#
create queue all.q
set queue all.q queue_type = Execution
set queue all.q max_running = 4
set queue all.q resources_max.ncpus = 4
set queue all.q resources_max.nodes = 1
set queue all.q resources_min.ncpus = 1
set queue all.q resources_default.ncpus = 1
set queue all.q resources_default.neednodes = 1:ppn=1
set queue all.q resources_default.nodect = 1
set queue all.q resources_default.nodes = 1
set queue all.q resources_available.ncpus = 4
set queue all.q enabled = True
set queue all.q started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = WEBER-UNIX
set server default_queue = all.q
set server log_events = 2047
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 300
set server poll_jobs = True
set server down_on_error = True
set server mom_job_sync = True
set server next_job_number = 112
set server moab_array_compatible = True
set server nppcu = 1
set server timeout_for_job_delete = 120
set server timeout_for_job_requeue = 120

pbs_sched.conf file
round_robin: False    all
by_queue: True        prime
by_queue: True        non_prime
strict_fifo: false    ALL
fair_share: false    ALL
help_starving_jobs    false    ALL
sort_queues    true    ALL
load_balancing: false    ALL
sort_by: no_sort    ALL
dedicated_prefix: ded
max_starve: 24:00:00
half_life: 24:00:00
unknown_shares: 10
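
For reference, the change in the file above that I have been experimenting with, roughly (I am not certain these settings are sufficient, and I believe pbs_sched has to be restarted to re-read the file):

# force strict submission-order scheduling
strict_fifo: true    ALL
# sort_by is already no_sort, which should leave jobs in submission order
sort_by: no_sort    ALL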

pbsnodes -a
WEBER-UNIX
     state = job-exclusive
     power_state = Running
     np = 4
     ntype = cluster
     jobs = 0/111.localhost.localdomain,1/110.localhost.localdomain,2/109.localhost.localdomain,3/108.localhost.localdomain
     status = rectime=1454589303,cpuclock=Fixed,varattr=,jobs=111.localhost.localdomain(cput=0,energy_used=0,mem=6304kb,vmem=37264kb,walltime=763,session_id=14101) 110.localhost.localdomain(cput=0,energy_used=0,mem=6420kb,vmem=37264kb,walltime=763,session_id=14104) 109.localhost.localdomain(cput=0,energy_used=0,mem=6424kb,vmem=37264kb,walltime=763,session_id=14107) 108.localhost.localdomain(cput=0,energy_used=0,mem=6400kb,vmem=37264kb,walltime=763,session_id=14110),state=free,netload=886706889,gres=,loadave=0.07,ncpus=4,physmem=16391544kb,availmem=25664620kb,totmem=33876340kb,idletime=1987,nusers=2,nsessions=19,sessions=307 461 808 829 849 997 1002 823 1092 1175 4085 8837 8875 8976 14101 14104 14107 14110 23189,uname=Linux WEBER-UNIX 4.4.1-1-ARCH #1 SMP PREEMPT Mon Feb 1 08:16:18 CET 2016 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

sync_time: 1:00:00



_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Lev Lafayette | 1 Feb 04:59 2016

mdiag-n versus pbsnodes -ln

A cluster has just had an outage, with one of its two NFS-attached storage arrays going awry. As
part of the process of bringing it back, I've noticed that there is a difference between the Maui/Moab
command mdiag -n (i.e., mdiag -n | grep "Down") and the PBS command pbsnodes -ln.

I was expecting the two lists of downed nodes to be similar, but they're actually quite different. Am I
right in speculating that they get their information from different sources?

Lev Lafayette, BA (Hons), GradCertTerAdEd (Murdoch), GradCertPM, MBA (Tech Mngmnt) (Chifley)
HPC Support and Training Officer
Department of Infrastructure Services, University of Melbourne
Eva Hocks | 28 Jan 19:14 2016

maui crash when setting reservation


maui 3.3.1 and torque 4.2.9 on CentOS 6.2

maui 3.3.1 crashes (segfault) when I try to set a system reservation:
# setres -s 8:00:00_02/08 -e 08:00:00_02/09 ALL

(gdb) r -d
Starting program: /opt/maui/sbin/maui -d
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x0000003935a83da0 in __memset_sse2 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install maui-3.3.1-4.x86_64
(gdb) bt
#0  0x0000003935a83da0 in __memset_sse2 () from /lib64/libc.so.6
Cannot access memory at address 0x7ffffffd4958

I have set DEFAULT_RES_DEPTH to 48 in msched.h.

Any idea?

Thanks
Eva

PS.  debuginfo-install shows: No debuginfo packages available to install
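
Since there is no debuginfo package, I may try rebuilding maui with debug symbols to get a usable backtrace; roughly like this (a sketch from memory, so the configure/make details may need adjusting for your build):

tar xzf maui-3.3.1.tar.gz && cd maui-3.3.1
./configure --prefix=/opt/maui --with-pbs=/usr
make CFLAGS='-g -O0'      # or edit the generated Makefile to add -g / drop -O
make install
gdb /opt/maui/sbin/maui   # then "run -d", reproduce the setres, and "bt full"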
Mahmoud A. A. Ibrahim | 22 Jan 09:31 2016

Delete line after execution

Dear Torque Users
Sometimes we have a power cut, and every time it happens the submitted jobs restart from the beginning.
Each submitted file has millions of lines, each containing a single command.
So we were wondering whether there is a "fast" way to delete each line from the submitted file after it has executed.
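
What we have in mind is roughly the following (a sketch only; it assumes the job script reads the commands from a separate file, here called commands.txt, rather than holding them inline):

#!/bin/bash
# run commands one per line from commands.txt, deleting each line after it
# finishes, so a restart after a power cut resumes at the first unfinished line
while [ -s commands.txt ]; do
    head -n 1 commands.txt | bash
    sed -i '1d' commands.txt    # drop the line that just ran
done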
Thanks in advance
Sincerely;
M. Ibrahim

--
Mahmoud A. A. Ibrahim
Head of CompChem Lab, Chemistry Department,
Faculty of Science, Minia University, Minia 61519, Egypt.
Email: m.ibrahim <at> compchem.net
            m.ibrahim <at> mu.edu.eg
Website: www.compchem.net
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Glen Beane | 19 Jan 18:20 2016

dependency question

I want a job to run if any one of a number of jobs exits with a non-zero exit status.

The problem with afternotok is that, if I include multiple jobs, the job with the afternotok dependency is aborted as soon as any one of the jobs listed in the dependency finishes successfully.  What I want is a dependency like afternotok, but where the job is only aborted if ALL of the jobs specified in the dependency have finished successfully, and it runs if at least one of them exits with a non-zero value.  Is this possible?
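
To make the current behaviour concrete, the submission looks roughly like this (job IDs are placeholders):

qsub -W depend=afternotok:1234.server:1235.server:1236.server recovery.sh
# today the recovery job is aborted as soon as any one of 1234-1236 exits 0;
# I would like it aborted only when ALL of them exit 0.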
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Andrus, Brian Contractor | 13 Jan 16:09 2016

Killing array ghost 'main' job

So,

 

I (rather) often get a situation where torque is showing the array main job as a running job like:

[root <at> cluster ~]# qstat -tr

 

cluster.internal

                                                                                  Req'd       Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
68[].cluster.internal  bandrus     default  m2p7                --      1      5       1gb 720:00:00 R 720:00:00
68[319].cluster.inter  bandrus     default  m2p7-319          34203     1      5       1gb 720:00:00 R  05:24:14
68[328].cluster.inter  bandrus     default  m2p7-328          33636     1      5       1gb 720:00:00 R  05:42:34
68[330].cluster.inter  bandrus     default  m2p7-330         138372     1      5       1gb 720:00:00 R  03:39:31
68[331].cluster.inter  bandrus     default  m2p7-331         138799     1      5       1gb 720:00:00 R  03:24:38
68[332].cluster.inter  bandrus     default  m2p7-332         139079     1      5       1gb 720:00:00 R  03:15:12
68[333].cluster.inter  bandrus     default  m2p7-333         141469     1      5       1gb 720:00:00 R  03:05:15

 

The problem is that torque somehow sees that one job as using up all of the available array slot limits, so once this happens, NO other jobs in that array will start:

 

qrun: Invalid request MSG=Cannot run job. Array slot limit is 16 and there are already 16 jobs running

 68[334].cluster.internal

 

The only thing I have seen work is to qdel all the remaining jobs and resubmit them; then, when the currently running ones complete, I have to delete the .TA file that still exists in /var/spool/torque/server_priv/jobs and restart pbs_server.
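
Spelled out, that workaround looks roughly like this (the job id is from the qstat output above; the qdel range and the .TA file name are illustrative, so check the directory first):

qdel -t 334-999 '68[]'        # delete the remaining sub-jobs (resubmit them later)
# ...wait for the currently running sub-jobs to finish, then:
qterm -t quick                # stop pbs_server
ls /var/spool/torque/server_priv/jobs/*.TA
rm /var/spool/torque/server_priv/jobs/68*.TA
pbs_server                    # restart the server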

 

So... how do I kill that stupid main job entry so the others will run?

 

I am so close to being so done with torque…

 

Anyone played with slurm much?

 

 

Brian Andrus

ITACS/Research Computing

Naval Postgraduate School

Monterey, California

voice: 831-656-6238

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Andrus, Brian Contractor | 6 Jan 19:30 2016

qdel part of an array not working in torque 6.0.0

So I am trying to delete a range of jobs that are on hold in an array:

qdel -t 314-749 58[]

But nothing changes.

However, if I do:

qdel 58[314]

then job 58[314] shows Complete (as expected).

Looks like yet another array bug…

Grrr…

Brian Andrus

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Troy Baer | 5 Jan 22:41 2016

pbstools 3.1 released

It's been ages since I did a release of pbstools (and the project has 
moved, again...), so I figured it was about time.

https://www.osc.edu/~troy/pbstools
https://www.osc.edu/sites/osc.edu/files/staff_files/troy/pbstools-3.1.tar.gz
http://svn.osc.edu/repos/pbstools/releases/pbstools-3.1

This is a collection of utilities that have been developed at OSC, NICS, 
and elsewhere to aid in the administration and management of PBS 
variants (including OpenPBS, PBS Pro, and TORQUE). They have been 
developed primarily on Linux clusters running TORQUE, but most of them 
should be applicable to virtually any system running a PBS variant.

Changes and additions in this release:

* Most of the components can now be built as RPMs directly from the 
tarball (e.g. "rpmbuild -ta pbstools-3.1.tar.gz").  Note that this has 
been only lightly tested, and feedback is welcome.

* Addition of job-vm-launch.  This allows one to start a VM inside a PBS 
job if the compute node is running libvirtd.

* Addition of pbs-spark-submit.  This is a way to launch Apache Spark 
applications inside a PBS job.  It is described in detail in 
"Integrating Apache Spark into PBS-Based HPC Environments" from the 
proceedings of the XSEDE '15 conference.

* Updates to parallel-command-processor to fix a bug where rank 0 would 
occasionally segfault in MPI_Finalize() due to outstanding communications.

* Updates to reaver to make it work with the XML-based .JB files found 
in newer versions of TORQUE.

* The pbsacct database schema now includes an "sw_app" column that is 
indexed.  That results in a ~200x speedup for software usage queries 
relative to the old approach.  An example indexing program, 
sw_app-index, is included.

* Some of the older, deprecated components (e.g. qpeek, qps, and 
dezombify) have been moved into a "deprecated" directory, but they're 
still available.

* Lots of minor documentation and bug fixes.

Comments and suggestions on this software are welcomed.

     --Troy

-- 

Troy Baer
Senior HPC Systems Engineer
Ohio Supercomputer Center
http://www.osc.edu/
