hitesh chugani | 18 Apr 01:15 2014

shared compute nodes

Hello,


I have a question: can compute nodes be shared between two pbs_servers? If so, what should the entry in TORQUE_HOME/mom_priv/config look like? Thanks in advance.
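For reference, a pbs_mom can point at more than one server; the usual case is an HA pair, where TORQUE_HOME/mom_priv/config simply lists both servers. A minimal sketch (hostnames are placeholders):

  $pbsserver server1.example.com
  $pbsserver server2.example.com

Whether two fully independent (non-HA) pbs_servers can cleanly share the same compute node is a different question, since each server keeps its own view of the node's state.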



Regards,
Hitesh Chugani.
Zhang,Jun | 15 Apr 21:55 2014

There are CPUs available, but job is queued

Out of my 16 nodes, 3 CPUs seem vacant; they belong to two different nodes. When I submit a job at this point, it just sits in the queue. I am within all the limits of the queue the job is trying to execute in. Can somebody help?
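Assuming Maui or Moab is doing the scheduling here, the quickest way to see why a queued job is not being started is usually to ask the scheduler itself; the job ID below is a placeholder:

  checkjob -v 12345   # scheduler's reason for deferring the job
  showbf              # what backfill capacity the scheduler believes is free
  pbsnodes -a         # node state as TORQUE reports it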

 

Jun

Ken Nielson | 15 Apr 19:14 2014

TORQUE master branch head reset in github

Hi all,

We had a bad merge into the master branch of TORQUE last Friday, April 11, 2014. We have since removed that bad merge by resetting the head of master back to the last commit before the merge. There have been no other commits since the merge, so nothing else has been lost.

We strongly discourage anyone from using the master branch for any production environment and do not support the master branch for such purposes. If you have pulled from the master branch since last Friday, be aware that its history has changed.
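For anyone who already has a clone based on the discarded head, a common way to re-synchronize with the rewritten branch is the following (a sketch, assuming the remote is called origin and there is no local work to keep):

  git fetch origin
  git checkout master
  git reset --hard origin/master   # discard local history based on the bad merge

Local commits made on top of the bad merge would instead need to be rebased onto the new head.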


Please respond with any questions.

Regards

Ken Nielson
--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
Clotho Tsang | 15 Apr 03:40 2014

Torque HA is unable to handle network disconnection?

I am testing Torque's HA feature using the "--ha" option of pbs_server.
Torque version: 4.2.2.
OS: CentOS 6.4 x86_64
The Torque servers and the file server are VMs.
The file server is just an NFS server, for testing purposes.

I set it up by 
  • mounting /var/spool/torque/server_priv/ on an external file server.
  • removing $pbsserver from /var/spool/torque/mom_priv/config
  • adding both server hostnames to /var/spool/torque/server_name
  • setting managers, operators, acl_hosts, lock_file_check_time, and lock_file_update_time via qmgr (see the sketch after this list)
  • setting up separate Maui instances on both hosts, with no HA setup for Maui (for testing purposes; this will change later)
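As a rough illustration of the qmgr step above, the relevant attributes look like this (user names and timer values are illustrative, not the exact values used here):

  qmgr -c 'set server managers += root@m20'
  qmgr -c 'set server managers += root@m30'
  qmgr -c 'set server operators += root@m20'
  qmgr -c 'set server operators += root@m30'
  qmgr -c 'set server acl_hosts += m20'
  qmgr -c 'set server acl_hosts += m30'
  qmgr -c 'set server lock_file_update_time = 3'
  qmgr -c 'set server lock_file_check_time = 9'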
With the setup, I pass the following tests:
  • Both nodes should be able to see the same results from the following commands: pbsnodes, qstat
  • Stopping pbs_server on one node (with the kill command): the commands above keep working.
But I fail the following test:
  • Disconnecting the network of one node: the commands above should keep working.
Instead, pbsnodes / qstat hang forever. Tracing them with strace, I find that they are waiting for a connection to the disconnected host, which of course will fail.

Before I disconnect the network, server.lock stores the PID of server1 (m20).
After I disconnect the network of server1, server.lock stores the PID of server2 (m30).
Its timestamp keeps being updated.

[root <at> m30 torque]# ls -l /var/spool/torque/server_priv/server.lock
-rw------- 1 root root 5 Apr  9 17:01 /var/spool/torque/server_priv/server.lock
[root <at> m30 torque]# date
Wed Apr  9 17:01:44 CST 2014
[root <at> m30 torque]# cat /var/spool/torque/server_priv/server.lock
9339
[root <at> m30 torque]# ps aux | grep pbs_server
root      9339  0.0  3.9 574196 26988 ?        Sl   16:54   0:00 /usr/sbin/pbs_server -d /var/spool/torque --ha
root     11219  0.0  0.1 103236   856 pts/0    R+   16:59   0:00 grep pbs_server
[root <at> m30 torque]# pbsnodes
(hang here)
^C
[root <at> m30 torque]# strace pbsnodes
:
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(3, {sa_family=AF_INET, sin_port=htons(15001), sin_addr=inet_addr("192.168.122.20")}, 16) = -1 EINPROGRESS (Operation now in progress)
select(4, NULL, [3], NULL, {300, 0}

(hang here. 192.168.122.20 is the IP of server1)

Is it the expected behavior that Torque's native HA cannot handle network disconnection?


--
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho <at> clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com
Islam, Sharif | 14 Apr 22:31 2014

nallocpolicy

Our current NODEALLOCATIONPOLICY is set to PRIORITY (we are using Torque 4.2.6 and Moab 7.2.7).

We are testing other policies and were wondering about the "-l nallocpolicy" resource manager extension (http://docs.adaptivecomputing.com/mwm/7-2-7/help.htm#topics/resourceManagers/rmextensions.html#nallocpolicy).

It seems that when I have NODEALLOCATIONPOLICY set to PRIORITY in Moab, it does not honor "-l nallocpolicy=CONTIGUOUS". However, when I have NODEALLOCATIONPOLICY set to CONTIGUOUS, I am able to get a priority allocation with "-l nallocpolicy=PRIORITY". Is this expected? We also have this set in Moab (this is a Cray XE/XK machine): NODECFG[DEFAULT] PRIORITYF='-NODEINDEX'.
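For readers comparing the two knobs: the cluster-wide default lives in moab.cfg, while the per-job request rides on the submission. Roughly (node counts and the script name are placeholders):

  # moab.cfg
  NODEALLOCATIONPOLICY  PRIORITY
  NODECFG[DEFAULT]      PRIORITYF='-NODEINDEX'

  # per-job override at submission time
  qsub -l nodes=4:ppn=16 -l nallocpolicy=CONTIGUOUS job.sh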



—sharif 


-- 
Sharif Islam 
System Engineer 
Blue Waters (http://www.ncsa.illinois.edu/BlueWaters/)

3006 E NCSA, 1205 W. Clark St. Urbana, IL


David Beer | 14 Apr 22:20 2014

Improvements Coming in the June Release of TORQUE

All,

At MoabCon 2 weeks ago we discussed a lot of the scalability and throughput improvements that are being made in Ascent. For those who couldn't come or would like a review, here's a blog entry on the same topic: http://www.adaptivecomputing.com/blog-opensource/update-ascent-torque/

Cheers,

--
David Beer | Senior Software Engineer
Adaptive Computing
Enrico Morelli | 14 Apr 15:43 2014

exceeds available partition procs

Dear all,

I have a problem that I'm unable to understand. I have a server and a node with 24 processors.
I qsub two jobs (non-parallel); one starts running, the other ends up in the IDLE jobs list:


ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

440                  apache    Running     1 99:20:45:27  Mon Apr 14 12:04:01

     1 Active Job        1 of   24 Processors Active (4.17%)
                         1 of    1 Nodes Active      (100.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

441                  apache       Idle     1 99:23:59:59  Mon Apr 14 12:15:54

1 Idle Job

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 2   Active Jobs: 1   Idle Jobs: 1   Blocked Jobs: 0



checkjob shows:

Creds:  user:apache  group:apache  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Mon Apr 14 12:15:54
  (Time Queued  Total: 3:03:34  Eligible: 3:03:17)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '441' (00:00:00 -> 99:23:59:59  Duration: 99:23:59:59)
Messages:  exceeds available partition procs
PE:  1.00  StartPriority:  183
job cannot run in partition DEFAULT (insufficient idle procs available: 0 < 1)


showbf shows:
----------------------------------------------------------------------------------------
backfill window (user: 'root' group: 'root' partition: ALL) Mon Apr 14 15:19:50

no procs available
----------------------------------------------------------------------------------------

mdiag -j

Name                  State Par Proc QOS     WCLimit R  Min     User    Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs       Class Features

440                 Running DEF    1 DEF 99:23:59:59 1    1   apache   apache        -    00:14:08   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]
WARNING:  job '440' utilizes more procs than dedicated (9.99 > 1)
441                    Idle ALL    1 DEF 99:23:59:59 1    1   apache   apache        -     3:04:12   [NONE] [NONE] [NONE]    >=0    >=0    NC0   [batch:1] [NONE]


Total Jobs: 2  Active Jobs: 1



According to showq I'm using 1 processor, but according to mdiag the first job utilizes more processors than dedicated.

I'm going crazy; can someone please explain this behaviour to me?

Thanks

--
-----------------------------------------------------------
  Enrico Morelli
  System Administrator | Programmer | Web Developer

  CERM - Polo Scientifico
  via Sacconi, 6 - 50019 Sesto Fiorentino (FI) - ITALY
------------------------------------------------------------
Amanda Calatrava | 10 Apr 13:28 2014

Fwd: Torque + BLCR: qhold problems in establish_server_connection

Hi everybody,

I'm trying to use Torque 4.2.5 + Maui with BLCR (version 0.8.5), but I'm having problems checkpointing on the worker nodes. On the frontend (pbsserver), I can checkpoint (qchkpt) and hold (qhold) jobs without problems, and restart them as well. But the same does not work on the worker nodes.

I spent a considerable number of hours trying to figure out what is happening, but neither the mom log nor the server log gave me a clue (they don't report any error). When I execute qhold on a job running on a worker node, its state changes from running (R) to complete (C), the job is killed, and the frontend (from where it was launched) receives the partial output the job has generated.

Finally, I found a LOG_ERROR message in /var/log/syslog:

Apr  9 12:19:21 tututu pbs_mom: LOG_DEBUG::blcr_checkpoint_job, checkpoint args: /var/spool/torque/mom_priv/blcr_checkpoint_script 23562 13.pbsserver.localdomain user1 user1 /tmp/13.pbsserver.localdomain.CK ckpt.13.pbsserver.localdomain.1397038761 15 -  2>&1 1>/dev/null

Apr  9 12:19:21 tututu checkpoint_script: Invoked: /var/spool/torque/mom_priv/blcr_checkpoint_script 23562 13.pbsserver.localdomain user1 user1 /tmp/13.pbsserver.localdomain.CK ckpt.13.pbsserver.localdomain.1397038761 15 -

Apr  9 12:19:22 tututu checkpoint_script: Subcommand (cr_checkpoint --signal 15 --tree 23562 --file ckpt.13.pbsserver.localdomain.1397038761) yielded rc=0:

Apr  9 12:19:24 tututu pbs_mom: LOG_ERROR::establish_server_connection, Job 13.pbsserver.localdomain failed 3 times to get connection to pbsserver.localdomain

What is going wrong? After that, the worker node is still able to copy the job's partial output to the pbsserver (frontend). On the worker nodes I only have pbs_mom, installed from the packages generated by the Torque installation on the frontend.

Any suggestion is welcome... I'm going crazy with this. If you need a log file or more information to help me, please ask.
Thanks!



Stevens, Philip | 9 Apr 17:44 2014

Server not responsive

Hi there,

I have a problem. Everything worked fine, except that some jobs were stuck in the queue although they had finished correctly (a former colleague said this might happen once in a while). Anyway, after googling how to get rid of them I tried all of the solutions I found, but nothing helped, so I decided to restart the server using:

"qterm -t quick"

I thought everything would be working fine again (since it is the quick restart option?), but after several minutes this error message keeps popping up when using qstat -a to check the queue:

Error communicating with node1(IP-Adress)
Cannot connect to default server host 'node1' - check pbs_server daemon and/or trqauthd.
qstat: cannot connect to server node1 (errno=111) Connection refused.

I am running pbs_server as root (it prints no message) as well as trqauthd, which reports:

hostname: node1
pbs_server port is: 15001
trqauthd daemonized - port 15005

I still get the same error message.
Comparing the hostname of the server with the name given in /var/spool/torque/server_name, they are exactly the same.

So although I thought it was just a quick restart, the server now does not work at all.

Any suggestions how to fix this are highly appreciated!
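A few generic checks that usually narrow down a "Connection refused" after a restart (paths assume a default /var/spool/torque installation):

  ps aux | grep pbs_server            # is the daemon actually still running?
  netstat -tlnp | grep 15001          # is anything listening on the pbs_server port?
  tail -n 50 /var/spool/torque/server_logs/$(date +%Y%m%d)   # server-side startup errors

errno=111 (connection refused) usually means nothing is listening on port 15001 any more, i.e. pbs_server exited again after being started.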

Thank you.

Phil
Zhang,Jun | 9 Apr 22:41 2014

Strange node status

Below is the output of “pbsnodes nodename”. I can ping this node, but an ssh connection never prompts for a password (it hangs). Notice loadave=582! There are no running jobs. What are these 239 sessions? The console shows a bluish wallpaper and nothing can be done there. This node got into the same situation a few days ago and came back to normal after a power cycle. I wonder if some misbehaving job caused this. Has anybody seen this before?

 

     state = free

     np = 16

     properties = bigmem

     ntype = cluster

     status = rectime=1397075049,varattr=,jobs=,state=free,netload=217308690,gres=,loadave=582.02,ncpus=16,physmem=132112272kb,availmem=179553628kb,totmem=181264264kb,idletime=194992,nusers=1,nsessions=239,sessions=4712 4717 4727 4733 4736 4739 4744 4745 4755 4759 4765 4769 4779 4784 4787 4793 4804 4817 4820 4827 4830 4840 4853 4854 4861 4867 4874 4887 4890 4895 4899 4910 4919 4922 4931 4935 4942 4954 4957 4963 4968 4977 4988 4992 4998 5002 5012 5025 5026 5032 5038 5045 5058 5061 5067 5070 5081 5089 5092 5101 5106 5113 5125 5126 5133 5138 5147 5158 5161 5169 5172 5182 5195 5196 5202 5209 5216 5229 5232 5237 5240 5251 5259 5262 5272 5276 5283 5295 5298 5303 5309 5318 5329 5330 5339 5342 5352 5365 5366 5373 5383 5390 5403 5409 5413 5424 5432 5435 5444 5448 5455 5467 5470 5476 5481 5490 5501 5504 5511 5515 5525 5538 5539 5545 5551 5558 5571 5574 5580 5583 5594 5602 5605 5615 5620 5627 5639 5642 5647 5652 5661 5672 5675 5683 5686 5696 5709 5710 5716 5723 5730 5742 5745 5751 5754 5765 5773 5776 5786 5790 5797 5809 5812 5817 5823 5832 5843 5846 5853 5856 5866 5879 5880 5887 5893 5900 5913 5916 5921 5925 5936 5944 5947 5956 5960 5967 5979 5982 5988 5993 6002 6013 6016 6023 6027 6037 6050 6055 6061 6068 6081 6084 6090 6093 6104 6112 6115 6124 6129 6136 6148 6151 6156 6161 6170 6181 6184 6192 6195 6205 6218 6219 6225 6232 6239 6251 6254 6260 6263 6274 6282 6285 6295 6299 6306 6318 6321 6326 6332,uname=Linux dqshtc14 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64,opsys=linux

     mom_service_port = 15002

     mom_manager_port = 15003

Dr. Henrik Schulz | 9 Apr 13:50 2014

cluster head node without ssh access for the users

Dear all,

I'm currently setting up a new cluster and I want to prevent users from logging in via ssh to the machine which runs pbs_server. Users will have to use a specified submit host.

If I configure my ssh daemon to allow only root logins via ssh, no user can log in to the machine, BUT: will the moms still be able to transfer files to the pbs_server machine? As far as I have observed, the file copy processes happen within the user's context and are not done by root.

Is rcp the only choice then?

Since the users are LDAP users and not local users, I cannot use restricted shells (like rssh or scponly), because the users must keep the ability to ssh to all other machines in the network.

Is there a solution that provides the Torque functionality without ssh?
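One mechanism that may be relevant, depending on the filesystem layout: if the relevant directories are shared (e.g. over NFS), the moms can be told to deliver output with a local cp instead of a remote copy via the $usecp directive in mom_priv/config, which avoids both scp and rcp. A sketch, with hostname and paths as placeholders:

  $usecp headnode.example.com:/home  /home

With such a mapping, output destined for /home on the head node is written to the locally mounted /home on the compute node instead of being copied over the network.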

Thanks in advance!
Henrik

--
Dr. Henrik Schulz
Central Department of Technical Services
Department of Information Technology
Helmholtz-Zentrum Dresden-Rossendorf
Tel: +49 (0351) 260 3268

Board of Directors: Prof. Dr. Dr. h. c. Roland Sauerbrey, Prof. Dr. Dr. h. c. Peter Joehnk
Register of Associations: VR 1693 at the Amtsgericht Dresden






