Andrew Mather | 2 Mar 01:58 2015

Torque/Moab very slow to start jobs.

Hi All

We're running Moab 7.2.9 and Torque 4.2.9. Yes, old, but we have never had the current problem with this or any previous version.

Over the last couple of months, we've found that the rate at which jobs are started on our cluster is very slow. We have some users who send many hundreds, sometimes thousands, of jobs to the queue. The jobs tend to run for 5-10 minutes, sometimes less.

There is no problem with the rate at which these jobs are accepted into the queue.

Previously, when a block of these jobs was submitted, it would take 1-2 polling intervals before all of them were running. Now, even though there are ample idle machines and the user has not hit any running/idle job limits, the system seems to be having trouble keeping up, and there are seldom more than a hundred of these jobs running at a time. What's worse is that it also affects jobs submitted after the ones in question.

At the same time, any attempt to use Moab commands (mdiag, etc.) simply times out.

Load on the management server is generally 1 or less, and neither the PBS nor the Moab processes consume much above 5% CPU.

What this means is that when the queue is full of these short-running jobs, we struggle to get cluster utilisation up to a decent level, and users are asking why their jobs take so long to start while there are idle cores.

Neither the PBS nor the Moab logs show any obvious faults or errors, even at reasonably high log levels; it just seems like the queues can't keep up with these jobs, which seems unusual.
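
For context, these are the standard checks we normally run when jobs sit idle (nothing exotic; at the moment most of the Moab ones simply hang rather than returning anything):

    qstat -q               # queue counts on the Torque side
    showq -i               # Moab's view of the idle/eligible jobs
    checkjob -v <jobid>    # why a specific idle job hasn't started
    mdiag -S               # scheduler state and iteration timing
    mdiag -n               # node availability as Moab sees it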

Any assistance, or advice about where we could start looking would be welcome.

Thanks,
Andrew





--
-
https://picasaweb.google.com/107747436224613508618
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"Unless someone like you, cares a whole awful lot, nothing is going to get better...It's not !" - The Lorax
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
A committee is a cul-de-sac, down which ideas are lured and then quietly strangled.
  Sir Barnett Cocks
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
"A mind is like a parachute. It doesnt work if it's not open." :- Frank Zappa
-
LAHAYE Olivier | 27 Feb 18:37 2015

Hell configuring torque with private nodes.

Hi,

I'm desperately trying to configure torque on the following specific cluster setup without any success so far:

- 2 Ethernet interfaces:
      - eno1 (public with IP registered in company DNS intranet)
      - eno2 (private with IP registered in /etc/hosts only)
- nodes on private side with hostnames registered in /etc/hosts
- private network is not routed on intranet
- /etc/hosts contains all private IPs (head and compute nodes) and is identical on head and compute nodes
- head node is part of the company dnsdomainname
- private IPs only resolve with /etc/hosts
- pbs_server started with -H private-hostname
- /etc/torque/server_name contains private-hostname
- /etc/nsswitch.conf contains hosts: files dns myhostname
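
To make that concrete, the name resolution setup looks roughly like this (the hostnames and addresses below are made up for illustration; only the structure matters):

    # /etc/hosts (identical on the head node and on every compute node)
    10.1.0.1    head-priv     # head node, private interface (eno2)
    10.1.0.11   node01
    10.1.0.12   node02

    # /etc/torque/server_name (head and nodes)
    head-priv

    # pbs_server is started as
    pbs_server -H head-priv

The public interface (eno1) keeps its name in the company DNS, which the compute nodes cannot resolve.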

=> NO way.

# qmgr
Max open servers: 9
Qmgr: p s
qmgr obj= svr=default: Invalid credential MSG=Hosts do not match
Qmgr: q

=> My head node must stay in the company DNS intranet domain name
=> My nodes can't resolve this domain name

Question: is there a way to have pbs_server work on this kind of configuration?
bonus question: is there any progress on that point: http://www.supercluster.org/pipermail/torquedev/2012-June/004134.html

PS: I'm running torque 4.2.9 on CentOS-7
PPS: Attached is the systemd service file for pbs_server (not yet written for the other daemons; I will do so once I can fix my pbs_server issue). The default init script fails to restart the service in compatibility mode; in fact it does nothing, which is why I wrote a systemd service file that works as expected.
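
For the curious, the unit boils down to something like the following (the binary paths are the usual packaging defaults and may differ on your install; the attached file is the authoritative version):

    # /etc/systemd/system/pbs_server.service (sketch)
    [Unit]
    Description=TORQUE pbs_server
    After=network.target

    [Service]
    Type=forking
    # -H as described above; pbs_server daemonizes itself, hence Type=forking
    ExecStart=/usr/sbin/pbs_server -H private-hostname
    ExecStop=/usr/bin/qterm -t quick

    [Install]
    WantedBy=multi-user.target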

Many thanks for any help.

Best regards,

Olivier.

--
   Olivier LAHAYE
   CEA DRT/LIST/DIR
Attachment (pbs_server.service): application/octet-stream, 454 bytes
David Beer | 24 Feb 17:42 2015

Odd Issue With Dual-socket 12 Core Boxes

Hi all,

One of our customers/users is having issues where jobs won't use the second socket of a few boxes. It appears to be an OS problem, but I'm curious if anyone else has this hardware and has experienced this issue.

The hardware is an HP DL385 G7 with AMD Opteron 6234 processors: dual socket, 12 cores per socket, 2.4 GHz. He is running TORQUE without cpusets configured. TORQUE sees 24 cores, and when jobs are launched that use the entire machine the $PBS_NODEFILE has 24 entries, but when observed through top you can see that only 12 CPUs are used.

He noticed that, for some reason, the OS reports the first NUMA node as 0 and the second as 2; there is no NUMA node 1. Also, NUMA node 0 contains CPUs 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22 instead of 0-11, and NUMA node 2 has the odd-numbered CPUs.
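
If anyone wants to compare on similar hardware, the layout he describes would show up in numactl roughly as below (illustrative output, not a capture from the actual box); lstopo from hwloc shows the same thing graphically:

    $ numactl --hardware
    available: 2 nodes (0,2)
    node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
    node 2 cpus: 1 3 5 7 9 11 13 15 17 19 21 23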

Has anyone else seen this and found a workaround? He is considering altering the BIOS to present the host as SMP, but is worried that this might hurt performance as often as it helps. Any suggestions are welcome.

--
David Beer | Senior Software Engineer
Adaptive Computing
Drake, JD | 23 Feb 16:04 2015

pbs_mom restart issue

Hello all,

 

I had previously inquired about this, but it bit us again over the weekend. We had some issues which required us to restart all of our pbs_moms. We do this by issuing a script command of 'service pbs_mom restart' through the Red Hat Satellite server. When this happens, everything appears to go fine from the Satellite server's standpoint, but if we check a particular node and run 'service pbs_mom status', we get the following message: "pbs_mom dead but pid file exists".

 

However, running the pbs_mom script in init.d with the restart argument, both manually and via the Satellite server, was successful.

 

Is there some issue with restarting the MOM via the service command and/or via a Red Hat Satellite server?
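
In case it helps anyone reproduce or work around this, a wrapper along the following lines avoids the stale pid file by doing an explicit stop/verify/start instead of 'restart'. This is a sketch only, not something we have rolled out, and the pid-file path is a guess based on a stock init script, so check what /etc/init.d/pbs_mom actually uses:

    #!/bin/sh
    # stop the mom and wait for the process to actually exit
    service pbs_mom stop
    for i in $(seq 1 30); do
        pgrep -x pbs_mom >/dev/null || break
        sleep 1
    done
    # remove a stale pid file only if the process is really gone
    pgrep -x pbs_mom >/dev/null || rm -f /var/run/pbs_mom.pid
    service pbs_mom start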

m.roth | 20 Feb 16:03 2015

A remote submission question, followup

Well, I've been googling, and not finding, so here's the followup to my original question:
1. Is it the case that there is no provision for qsub submitting jobs to torque queues on *different* clusters with munge authentication? That is, submitting jobs from serverA to serverB, serverC..., each of which is running pbs_sched, pbs_server, and the munge daemon?
2. If my users want to do this, is it the case that all servers providing torque queues MUST have the same munge key?

If this is the case, I'm not sure if this is a suggested enhancement for
torque or munge....

       mark
m.roth | 19 Feb 17:50 2015

A remote submission question

Hi, folks,

   Please feel free to point me to a link - this *has* to have been
discussed before, but I haven't found it googling.

   Running CentOS 6, torque 2.5.7.9. We have two servers; both are running pbs_sched, pbs_server, and pbs_mom. What we now want to do is submit from server to hbs (technical term: honkin' big server). I've used qmgr to add server to submit_hosts and opened port 15001 on hbs, and I no longer get "no route to host"; instead, this is what happens now:

echo "sleep 30" | qsub -q @<hbs>
munge: Error: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
qsub: Invalid credential

Do I have to open another port? Is there another parm to set?
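
For reference, the standard munge sanity checks I'm aware of are below; the socket error above suggests munged is not actually running on the submitting server (or is using a different socket path):

    service munge status            # is munged running locally?
    munge -n | unmunge              # can this host create and decode a credential?
    munge -n | ssh <hbs> unmunge    # can hbs decode a credential from here? (keys must match)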

Thanks in advance.

        mark
Douglas Holt | 18 Feb 16:54 2015

Using NVML prevents GPU reset

Configuring Torque to use NVML instead of nvidia-smi prevents resetting the GPUs, because pbs_mom keeps the devices open.

 

Resetting the GPU is needed, for example, to change ECC modes without rebooting (which would end the job).

 

Is there some method for getting pbs_mom to release the driver other than sending SIGKILL and recovering with -p?
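
For context, the SIGKILL-and-recover route just mentioned is roughly the following (the GPU index and the ECC setting are only examples):

    kill -9 $(pgrep -x pbs_mom)    # pbs_mom is what is holding the driver open
    nvidia-smi -i 0 -e 1           # e.g. enable ECC on GPU 0
    nvidia-smi -i 0 -r             # reset GPU 0; fails if anything still holds the device
    pbs_mom -p                     # restart the mom, preserving running jobs

Something gentler than kill -9 would be welcome.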

 

 

 

(previous subject)

 

David Beer dbeer at adaptivecomputing.com

Fri Oct 31 10:01:06 MDT 2014

Previous message: [torqueusers] pbs_mom segfault caused by unexpected output from nvidia-smi


Douglas,

 

For an immediate workaround, you can configure TORQUE to use nvidia's API instead of the smi command to get the data. This code path does not copy to a fixed-sized buffer and therefore won't segfault on you. The documentation for how to configure this way is here:

http://docs.adaptivecomputing.com/suite/8-0/basic/help.htm#topics/moabWorkloadManager/topics/accelerators/nvidiaGpus.htm?Highlight=nvml

Note: using the API is also faster than using the smi command.

We will also fix the issue of copying to the fixed buffer here, but I would advise anyone to switch to the API version instead of the smi command.

 

Taras Shapovalov | 18 Feb 15:24 2015

How to submit to a queue without nodes

Hi guys,

Do you see any way to allow job submission to a queue without any nodes?

Say, in Slurm I can use the node state FUTURE and the workload manager will not try to resolve the hostname. In Torque, any hostname must be resolvable in order to be added via qmgr (and it will then appear in pbsnodes).
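
(For comparison, in slurm.conf that placeholder is just a line such as the one below; the node names do not have to resolve until the state is changed later. Names and counts here are made up:)

    NodeName=cloud[001-016] CPUs=8 RealMemory=16000 State=FUTURE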

The use case is to put in a "placeholder" with some number of slots and allow users to submit their jobs before any nodes are added (when, say, cloud nodes are created dynamically).

Thanks,

Taras
Galloway, Michael D. | 13 Feb 22:44 2015

remote submit host and firewalls

Good day all,

We are working on getting a remote submit host outside our firewall working with our Torque/Maui test cluster. Submission is working, but we are trying to understand which firewall ports need to be open for results/errors/etc. to get returned to the submit host. We did a tcpdump, and it seemed that the pbs_server only wanted to contact the submit host on TCP/15001, so we opened it, but it's still not completing queue submissions correctly. We've looked for documentation but are struggling a bit. Any help appreciated.
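
For reference, this is the traffic we believe is involved with a default Torque build (no $usecp configured); the port numbers are the stock defaults and may differ on other installs, so corrections are welcome:

    submit host -> pbs_server   TCP 15001   # qsub/qstat; this part already works
    pbs_server  -> submit host  TCP 15001   # what the tcpdump showed
    pbs_mom     -> submit host  TCP 22      # stdout/stderr copied back via scp/rcp by default

As far as we can tell, the mom service and RM ports (TCP 15002/15003 by default) are only needed between the server and the compute nodes, not to the submit host.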

--- michael
Liam Gretton | 11 Feb 23:36 2015

Single epilogue for array jobs?

Is there a way to have an equivalent to an epilogue script for array
jobs that only runs once, when all tasks in the array have completed?

Otherwise, an epilogue script runs at the end of every task.

The use-case is a way of concatenating output files from a bunch of
tasks to a single file on completion. This could be done within each
task, but on our parallel file system that approach means file contents
get interleaved in the single common file.

The only solution I can think of at the moment is having a separate job
for mopping up after the array that uses depend=after:jobid so that it
doesn't start until the array job completes.
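
In other words, something like the following (from memory and unverified; the array-specific dependency types such as afteranyarray may be what is actually needed rather than plain after):

    ARRAYID=$(qsub -t 1-100 task.sh)                      # qsub prints something like 12345[].server
    qsub -W depend=afteranyarray:${ARRAYID} concat.sh     # concat.sh cats the per-task outputs into one file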

-- 
Liam Gretton                                    liam.gretton <at> le.ac.uk
HPC Architect                                http://www.le.ac.uk/its/
IT Services                                   Tel: +44 (0)116 2522254
University Of Leicester, University Road
Leicestershire LE1 7RH, United Kingdom

torque 4.2.6: qdel kill_delay not working

I have the same issue with the kill_delay parameter not working. Is there any update on the topic?

Philipp Funk

 

 

 
