We're running Moab 7.2.9 and Torque 4.2.9. Yes, they're old, but we have never had this problem with these or previous versions.
Over the last couple of months, we've found that the rate at which jobs are started on our cluster is very slow. We have some users who submit many hundreds, sometimes thousands, of jobs to the queue. The jobs tend to run for 5-10 minutes, sometimes less.
There is no problem with the rate at which these jobs are accepted into the queue.
Previously, when a block of these jobs was submitted, it would take 1-2 polling intervals and all jobs would be running. Now, even though there are ample idle machines and the user has not hit any running/idle job limits, the system seems to have trouble keeping up, and there are seldom more than a hundred of these jobs running at a time. What's worse is that it also delays jobs submitted after the ones in question.
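To put a rough number on that, here is a minimal sketch that estimates the job start rate implied by the figures above. It assumes the system is in a steady state, so that Little's law applies (jobs running = start rate x job duration). The function name is made up for illustration; nothing here comes from Moab or Torque.

```python
def start_rate_per_minute(running_jobs, job_minutes):
    """Little's law: steady-state starts per minute = jobs running / job duration."""
    return running_jobs / job_minutes


# Figures from the description above: about 100 jobs running, 5-10 minute runtimes.
low = start_rate_per_minute(100, 10)
high = start_rate_per_minute(100, 5)
print(f"Implied start rate: {low:.0f}-{high:.0f} jobs per minute")
```

If that estimate is roughly right, the scheduler is only starting somewhere around 10-20 jobs per minute, which is far below what a block of several hundred jobs used to achieve within 1-2 polling intervals.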
At the same time, any attempt to use Moab commands (mdiag etc.) simply times out.
Load on the management server is generally 1 or less, and neither the PBS nor the Moab processes consume much above 5% CPU.
What this means is that when the queue is full of these short-running jobs, we struggle to get cluster utilisation up to a decent level, and users are asking why their jobs take so long to start while there are idle cores.
Neither the PBS nor the Moab logs show any obvious faults or errors, even at reasonably high log levels. It just seems like the queues can't keep up with these jobs, and that seems unusual.
Any assistance or advice about where we could start looking would be welcome.