Egor Tur | 2 Oct 11:49 2007

no PBS sched socket connections ready

Hi folks.

When I increase the verbosity of Maui logging to 9 (from 3), I see more and
more of the following messages in the maui.log files:

10/02 11:37:46 MRMCheckEvents()
10/02 11:37:46 INFO:     no PBS sched socket connections ready
10/02 11:37:46 MSUAcceptClient(5,ClientSD,HostName,TCP)
10/02 11:37:46 INFO:     accept call failed, errno: 11 (Resource temporarily unavailable)
10/02 11:37:46 INFO:     all clients connected.  servicing requests

I cannot solve this problem. I am using only the Maui scheduler.
Torque version: 2.1.8
Maui version: 3.2.6p20

Thanx.

Egor Tur | 2 Oct 13:24 2007

time-shared queue.

 Hi,

Is it possible to create a time-shared queue?

For example, I create a queue with 8 CPUs on 4 nodes. Each node has 4 CPUs.
The server_priv/nodes file has the :ts attribute for all nodes and np=2.
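For reference, the nodes file looks roughly like this (a sketch, not a
verbatim copy; node names match the queue definition below):

node001:ts np=2
node002:ts np=2
node003:ts np=2
node004:ts np=2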

The queue has the following attributes:
create queue cpu8
set queue cpu8 queue_type = Execution
set queue cpu8 resources_default.nodect = 8
set queue cpu8 resources_default.nodes = node001:ppn=2+node002:ppn=2+node003:ppn=2+node004:ppn=2
set queue cpu8 enabled = True
set queue cpu8 started = True

When I submit jobs to this cpu8 queue, they always stay in the Q state and never run.

In this case the Maui log shows:
10/02 12:23:42 MPBSJobModify(3730,Resource_List,Resource,node001+node002+node003+node004)
10/02 12:23:42 ERROR:    job '3730' cannot be started: (rc: 15062  errmsg: 'Unknown node  MSG=job allocation request exceeds available cluster nodes, 8 requested, 0 available'  hostlist: 'node001+node002+node003+node004')
10/02 12:23:42 MPBSJobModify(3730,Resource_List,Resource,node001:ppn=2+node002:ppn=2+node003:ppn=2+node004:ppn=2)
10/02 12:23:42 WARNING:  cannot start job '3730' through resource manager
10/02 12:23:42 ALERT:    job '3730' deferred after 1 failed start attempts (API failure on last attempt)

When I remove the :ts attribute from the nodes file (in this case the ntype is
cluster), jobs submit and run.

Daniel.G.Roberts | 2 Oct 17:35 2007

Torque/Maui scheduling to dead nodes?

Hello All,
We are running the following versions of Torque/Maui:
Maui version maui-3.2.6p16-snap.1157560841
/opt/sched/commands/sbin/pbs_server --version
version: 2.1.6
My question is this: our cluster is quite busy and has about 100 nodes.
 
Every once in a great while the system goes haywire in the following way.

We might have hundreds of jobs running without any problems, and then at some point 10 nodes or so become available. Let's say it is nodes 1-10 that could be used. What happens next is that the queued jobs are all scheduled against node1 and fly through the system without any jobs ever being scheduled against the remaining nodes 2-10. When a user calls to say all his jobs have failed and we need to figure out what has happened, we realize at that point that:

node1 is only pingable; we can't rsh into it to see what is going on. When we go to the console of node1, we see that it has perhaps suffered a disk crash and is in a weird state, still somewhat limping along. But from the scheduler's point of view, pbsnodes -a reports node1 as free, and because of that pbsnodes status report all the queued jobs get delivered to node1 and disappear.

How do we get around this problem?  Has this particular issue been addressed in newer versions of Maui/Torque?
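For now the only stopgap I can see is marking such a node offline by hand once we notice it; if I read the pbsnodes man page right, that would be something like this (node name is just an example):

pbsnodes -o node1    # mark node1 offline so the scheduler stops sending jobs to it
pbsnodes -c node1    # clear the OFFLINE state once the node is repaired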
Thanks for any advice
Dan
Miles O'Neal | 2 Oct 18:12 2007

Re: no PBS sched socket connections ready

Egor Tur said...
|
|Hi folk.
|
|When I specify the verbosity of Maui logging to 9 (from 3) I see more and more 
|messages in maui.log files:
|
|10/02 11:37:46 MRMCheckEvents()
|10/02 11:37:46 INFO:     no PBS sched socket connections ready

If you have lots of machines, commands
connecting, and/or a fast polling interval,
you may need to bump up the number of sockets
on the server, and tune the kernel to do things
like release sockets faster.
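For example, something along these lines on the
pbs_server/Maui host (values are purely illustrative;
check the tuning guide for what actually applies to
your kernel and load):

# Illustrative Linux sysctl settings for faster socket turnover
sysctl -w net.ipv4.tcp_tw_reuse=1      # reuse sockets stuck in TIME_WAIT
sysctl -w net.ipv4.tcp_fin_timeout=15  # release half-closed sockets sooner
sysctl -w net.core.somaxconn=1024      # allow a deeper listen backlog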

Have you read the tuning guide and searched
for such things in the archives yet?

-Miles
Joshua Bernstein | 4 Oct 21:51 2007

Jobs stuck in Queue

Hello All,

I'm having a problem running MPI-based jobs linked against MPICH under
TORQUE.

The problem is this: in my job script, I try to start an MPI job the same
way I would outside TORQUE:

---
#PBS -j oe
<code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
exec ./mpijob
---
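(The BEOWULF_JOB_MAP step is elided above; for context it amounts to roughly
the following, where the node-number extraction is purely illustrative and
depends on your node naming scheme:)

---
# Illustrative only: build BEOWULF_JOB_MAP (a colon-separated list of Scyld
# node numbers, one entry per MPI rank) from the hosts TORQUE handed us.
# Assumes node names end in their node number; adjust for your cluster.
export NP=$(wc -l < "$PBS_NODEFILE")
export BEOWULF_JOB_MAP=$(sed 's/.*[^0-9]\([0-9][0-9]*\)$/\1/' "$PBS_NODEFILE" | paste -sd: -)
---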

This of course correctly starts the jobs on the nodes, but if I do a qdel
to kill the job, the job leaves the TORQUE queue while the processes still
stay on the nodes. This behavior has led me to use mpiexec.

So, if I use mpiexec a la:

---
#PBS -j oe
<code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
mpiexec -comm none ./mpijob
---

The jobs again start properly on the nodes (albeit a bit slower), and then
when I do a qdel, the processes get properly cleaned off the nodes. The
trouble here is that the job still shows up in the TORQUE queue marked as
running. The only way to clean up such a job is to remove its entries from
$PBS_HOME/server_priv/jobs and from $PBS_HOME/mom_priv/jobs.

Any ideas to help point me in the right direction?

-Joshua Bernstein
Software Engineer
Penguin Computing
Joshua Bernstein | 4 Oct 22:10 2007

Re: Jobs stuck in Queue

Interesting,

Bill Wichser wrote:
> Are you giving it enough time to clear the data from Torque?  Sometimes 
> it takes a bit.

What would you say is a "bit"? I'd imagine it would clear out within 30
seconds at most, if not right away.

> Also try using qsig instead of qdel for running jobs.

What's the difference? Doesn't a qdel send a SIGKILL?

Also, the jobs are clearly getting the SIGKILL, because a ps on the node 
shows that the jobs don't exist. I'm doing a watch ps, and I can see 
that right after I issue the qdel, the processes begin to clean 
themselves up and eventually disappear from the process table.

-Joshua Bernstein
Software Engineer
Penguin Computing
Bill Wichser | 4 Oct 22:17 2007

Re: Jobs stuck in Queue


Joshua Bernstein wrote:
> Interesting,
> 
> Bill Wichser wrote:
>> Are you giving it enough time to clear the data from Torque?  
>> Sometimes it takes a bit.
> 
> What would you say a "bit"? I'd imagine it would clear out after at 
> least 30 seconds, if not right away.

In my experience with Torque/PBS, a "bit" can be longer than 30 seconds.

> 
>> Also try using qsig instead of qdel for running jobs.
> 
> Whats the difference? Doesn't a qdel send a SIGKILL?
> 
> Also, the jobs are clearly getting the SIGKILL, because a ps on the node 
> shows that the jobs don't exist. I'm doing a watch ps, and I can see 
> that right after I issue the qdel, the processes begin to clean 
> themselves up and eventually disappear from the process table.

I've had much better "luck" sending qsig to running jobs and qdel to those 
not running.  Things may have changed in recent releases, but long ago, in 
the PBS & OpenPBS days, qdel just never seemed to get it all right.
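For what it's worth, the basic usage is just (job ID made up):

qsig -s SIGTERM 3730    # ask the running job to shut down cleanly
qsig -s SIGKILL 3730    # escalate if it ignores SIGTERM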

Bill

> 
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
Joshua Bernstein | 4 Oct 22:35 2007

Re: Jobs stuck in Queue


Bill Wichser wrote:
> 
> 
> Joshua Bernstein wrote:
>> Interesting,
>>
>> Bill Wichser wrote:
>>> Are you giving it enough time to clear the data from Torque?  
>>> Sometimes it takes a bit.
>>
>> What would you say a "bit"? I'd imagine it would clear out after at 
>> least 30 seconds, if not right away.
> 
> In my experience with Torque/PBS, a "bit" can be longer than 30 seconds.

Interesting... You'd think it would happen right away.

>>
>>> Also try using qsig instead of qdel for running jobs.
>>
>> Whats the difference? Doesn't a qdel send a SIGKILL?
>>
>> Also, the jobs are clearly getting the SIGKILL, because a ps on the 
>> node shows that the jobs don't exist. I'm doing a watch ps, and I can 
>> see that right after I issue the qdel, the processes begin to clean 
>> themselves up and eventually disappear from the process table.
> 
> I've had much better "luck" with sending qsig to running jobs and qdel 
> to those not running.  Things may have changed in recent releases but 
> long ago, PBS & openPBS days, qdel just never seemed to get it all right.

Doesn't seem to work with qsig either. Seems my luck isn't as good :-(

-Josh
Garrick Staples | 5 Oct 01:02 2007

Re: Jobs stuck in Queue

On Thu, Oct 04, 2007 at 12:51:07PM -0700, Joshua Bernstein alleged:
> Hello All,
> 
> I'm having a problem handling running MPI based jobs linked against a 
> MPICH under TORQUE
> 
> The problem is this, in my jobs script, I try to start an MPI job in the 
> same why I would outside TORQUE:
> 
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> exec ./mpijob
> ---

exec?  Why replace the top-level shell process?

 
> This of course correctly starts the jobs on the nodes, but if I do a 
> qdel, to kill the job, the job leaves the TORQUE queue, but the 
> processes still stay on the nodes. This behavior has lead me to use mpiexec.

At least the processes on the MS (mother superior) node are killed, right?

 
> So, if I use mpiexec a la:
> 
> ---
> #PBS -j oe
> <code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
> mpiexec -comm none ./mpijob
> ---

comm none?  That's only for non-MPI programs.
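If mpijob really is MPICH-linked, the usual script with this mpiexec would be
just something like the following (assuming its compiled-in default -comm
matches your MPICH build; a sketch, not tested here):

---
#PBS -j oe
<code to set BEOWULF_JOB_MAP based on PBS_NODEFILE>
# let mpiexec take the node allocation from TORQUE and use its default
# communication library rather than -comm none
mpiexec ./mpijob
---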

 
> The jobs again, start properly on the nodes (albeit a bit slower), and 
> then when I do a qdel, the processes get properly cleaned off the nodes. 
> The trouble here is that the job still shows up in the TORQUE queue 
> marked as running. The only way to clean up this job is to remove its 
> entries from $PBS_HOME/server_priv/job and from $PBS_HOME/mom_priv/jobs

First, manually deleting files is bad.  If you really must purge jobs, use
'momctl -c' to clear the job from the node and 'qdel -p' to clear it from the
server.  That said, never use those commands!
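(For completeness, the syntax is roughly the following; the job ID and node
name are made up.  Again: last resort only.)

qdel -p 3730                # force-purge the job record from pbs_server
momctl -h node001 -c 3730   # clear the stale job from the pbs_mom on node001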

If you look in pbs_mom's log file, you'll probably find an error message
related to not being able to talk to the server.

Markus Seto | 4 Oct 17:24 2007

disabling direct access to compute nodes

Hi, I've recently started fiddling with a Torque installation and was wondering if it's possible to disable direct access to the compute nodes from the master node.  I've noticed some users cheating the system by logging directly into compute nodes to run jobs, and I want to force them to use the queue system, but I was told that direct access with ssh keys is needed for Torque to run.  Any ideas?

markus

