Austgen, Brent | 23 Oct 21:48 2014

pbsnodes shows node states as down

Hello all,

I am working on developing a TORQUE roll for a Rocks cluster. The roll I’m working on successfully installs the correct rpms (torque & torque-client) on the client nodes, but they are not being set up correctly. I submitted a query (link) to this list recently concerning server-side issues with qmgr permissions. I was able to resolve those issues, so thank you for your help with that. Now it seems like I’m facing some client-side issues.

On the head/server node, I have the server_name set as ‘victory.clusterlab.intel.com’. The pbs_server, pbs_mom, pbs_sched, and trqauthd services are all running on the head node. Also on the server node, the server_priv/nodes file looks like this:

compute-0-0 np=32
compute-0-1 np=32

I have the server_name set as ‘victory.clusterlab.intel.com’ on the client nodes also. When I try to check the status of the client nodes using ‘pbsnodes’, I get the following output:

[root <at> victory ~]# pbsnodes
compute-0-0
     state = down
     np = 32
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003

compute-0-1
     state = down
     np = 32
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003

Here are some things I have tried:

* restart the pbs_mom and trqauthd services on the client nodes
* ‘ping’ and ‘ssh’ to the client nodes (I am able to do both)
* ‘cat /etc/sysconfig/selinux’ to make sure SELinux is disabled
* run ‘netstat -tuplen’ to see that the ports are open and listening
* run ‘pbsnodes -r’ on the head node
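
One further check I have not yet run, in case it is relevant (this assumes the default TORQUE paths and that momctl is available on the head node), would be to query a MOM directly and confirm it knows which server to report to:

# ask the MOM on a compute node to diagnose itself
momctl -d 3 -h compute-0-0

# on the compute node, confirm the MOM is pointed at the server
cat /var/spool/torque/mom_priv/config
#   (should contain a line such as:  $pbsserver victory.clusterlab.intel.com)

# check the most recent MOM log for connection errors
tail -50 /var/spool/torque/mom_logs/$(ls -t /var/spool/torque/mom_logs | head -1)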

 

 

Might someone be able to give me a lead on other things to try or check?

Thanks again for your help.

--Brent

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Drake, Johnathan | 21 Oct 21:08 2014

pbs_mom dead but pid file exists

 
We are currently in the process of installing and configuring Moab Basic 8 and Torque 5. After updating the moab nodes file, all of our pbs_moms died. When we do a 'service pbs_mom status', we get a 'pbs_mom dead but pid file exists' message back. Any ideas what would cause this? We are issuing a service start command via our Red Hat Satellite server and that seems to be bringing the moms back online.
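
To find out why they died in the first place, the next thing we plan to look at (assuming the default TORQUE log location) is the most recent MOM log on an affected node:

# on an affected compute node
tail -50 /var/spool/torque/mom_logs/$(ls -t /var/spool/torque/mom_logs | head -1)
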
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Christopher Smowton | 22 Oct 18:07 2014

All pbs_submit jobs fail immediately; pbs_submit_hash works fine?

Hi all,

Until this morning I was happily using PBSPy to submit jobs to Torque
4.2.6.1. A peek at the source shows it uses the old-fashioned pbs_submit call
to get its work done.

Today, every job submitted through that route is successfully submitted,
but fails immediately with an error like:

"-bash: line 1:
/var/spool/torque/mom_priv/jobs/207602.headnode01.cluster.SC: No such file
or directory"

However, using 'qsub' to submit jobs works fine. Turns out qsub uses
pbs_submit_hash, and I was able to boil the matter down to a minimal
testcase (error checking omitted for space):

Fails (submit works, but then job fails as above):

echo "/bin/sleep 30" > /tmp/test.sh

int fd = pbs_connect(0);                 /* connect to the default server */
char *new_jobname = pbs_submit(fd, 0, "/tmp/test.sh", 0, 0);

Succeeds (job executes correctly):

int fd = pbs_connect(0);                 /* connect to the default server */
char *new_jobname = NULL;

memmgr *mm = NULL;
memmgr_init(&mm, 0);

job_data *job_attrs = NULL;
hash_add_or_exit(&mm, &job_attrs, ATTR_v, "", ENV_DATA);

pbs_submit_hash(fd, &mm, job_attrs, NULL, "/tmp/test.sh", NULL, NULL,
                &new_jobname, NULL);

I had to add the ATTR_v line to work around a bug in libtorque, which
segfaults if you pass no attributes at all; however adding the blank
ATTR_v to the pbs_submit path does not affect the result.

(code to do that:

struct attropl attrib;
attrib.next     = 0;
attrib.name     = ATTR_v;
attrib.resource = 0;
attrib.value    = "";

pbs_submit(fd, &attrib, "/tmp/test.sh", 0, 0);

)

Googling suggests that a .SC "No such file or directory" error usually relates
to a full disk or the like. However, I am able to touch and remove files at the
given location on the login, head and mom nodes, and all have tens of gigabytes
free on all mounted partitions.

The log files for both the pbs_server and pbs_mom that run the job show
nothing out of the ordinary, except that the job exited with code 127
(bash's not-found exit code, as expected).

So. Anyone got a clue wtf is happening? Any hints for how to extract more
information about what either the server or the mom is doing wrong without
disruptive restarts (since other jobs are working properly)?
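
For the record, the only non-disruptive knob I can think of (assuming the server's log_level attribute can be changed on a live pbs_server, which I have not verified) is to raise the logging verbosity temporarily:

# raise pbs_server verbosity without a restart, then put it back afterwards
qmgr -c "set server log_level = 7"
qmgr -c "set server log_level = 0"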

Chris

Mitchel Kagawa | 18 Oct 09:59 2014

Is it possible to define memory per core rather than per node?

I recently built a small 4-node cluster. Each node has 4 sockets, each with an 8-core Xeon chip, so 32 cores per node and 64 GB of RAM per node. I want to limit the physical memory to 2 GB per core so that it is evenly distributed across all cores. Is it possible to set this in the “$torque/server_priv/nodes” file? Thanks for any help!
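
To make the goal concrete, the effect I am after would look roughly like this from the submission side (I am assuming pmem is the right per-process knob and that a queue default could enforce it, which is exactly what I am unsure about; the queue name 'batch' below is just an example):

# request 8 cores and 2 GB of physical memory per process
qsub -l nodes=1:ppn=8 -l pmem=2gb job.sh

# or as a queue-wide default set on the server
qmgr -c "set queue batch resources_default.pmem = 2gb"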

 

~Mitchel

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Viswanath Pasumarthi | 17 Oct 11:42 2014

Batch job on nonidentical nodes

Hi,

How can I use nonidentical nodes to run a batch job with Torque? Some nodes
have 4 processors while others have 8. I would like to assign a batch job to a
combination of nodes with unequal processor counts. I am using Torque 4.2.3.1
on a Linux cluster running RHEL 5.
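
In case it helps to see what I am after, the request I would like to express looks roughly like this (I am not sure whether this '+' node-spec syntax is the right way to do it; job.sh just stands in for the real script):

# e.g. two 8-processor nodes plus three 4-processor nodes for one job
qsub -l nodes=2:ppn=8+3:ppn=4 job.sh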

Thank you,
Viswanath.
Christopher Smowton | 16 Oct 13:32 2014

Allocate memory per node for MPI / multinode jobs?

Hi all,

Perhaps my Googling is weak, but I can't find a way to request N nodes
distributed such that I don't break the physical memory per node budget.

For example, suppose the nodes have 4 processors and 16GB of memory each, and I
want to run an MPI job that will use around 6-7GB per process.

Ideally I would say "qsub -l nodes=10 -l mem=7g" and Torque / Maui would
note that only 2 of those fit on a machine and give me 5 physical nodes, 2
processes per node. However, in fact it gives me all 4 processors on one node,
which will run out of physical memory.

Other failed solutions:

* Hacking around the problem by booking more processors than I need:
nodes=10:ppn=2 (command sketch below). All 20 processors are exposed to MPI,
rather than the desired 10. I could filter the nodelist manually, though, if
that's the only place jobs look to find out where to spawn MPI subtasks?

* Using round-robin scheduling in Maui. I get nodes allocated on different
physical machines, but there are still free slots which other processes
might try to use and break my run.
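
For concreteness, the first workaround above amounts to something like this (run_mpi.sh is just a placeholder for the launcher script):

# book 2 slots on each of 10 physical nodes, then launch only 10 MPI ranks
# by filtering the entries in $PBS_NODEFILE by hand
qsub -l nodes=10:ppn=2 run_mpi.sh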

Any ideas?

Chris

Delphine Ramalingom | 14 Oct 07:09 2014

File permissions and owners

Hi,

Some permissions and owners were modified on my server by a faulty script.
Since then, I can't restart pbs_server:

#/etc/init.d/pbs_server start
Starting TORQUE Server: PBS_Server: LOG_ERROR::get_parent_and_child, Cannot find closing tag

PBS_Server: LOG_ERROR::svr_recov_xml, Error creating attribute resources_assigned
                                                           [  OK  ]
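
The error seems to point at the XML state file under server_priv (serverdb), so before resorting to anything drastic like 'pbs_server -t create' (which, as I understand it, wipes the configuration), I am planning to check the ownership the script may have changed (paths assume a default /var/spool/torque install):

# inspect ownership and permissions of the server state
ls -ld /var/spool/torque/server_priv
ls -l  /var/spool/torque/server_priv/serverdb

# server_priv is normally owned by root; restore ownership if it was changed
chown -R root:root /var/spool/torque/server_priv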

Have you got an idea of how to solve the problem?

thank you.

Delphine
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Austgen, Brent | 13 Oct 18:47 2014

create queue permissions

Hello everyone,

I am interested in building a Torque 4.2.8 roll for a Rocks cluster (CentOS 6.5) I have built.

So far, my process has been to build the Torque RPMs with the .SPEC file included in the source, then to build a Rocks roll with those RPMs. The roll installation seems successful because I notice the RPMs are being installed.

On the head node, I have the following RPMs: torque, torque-devel, torque-server, torque-sched, & torque-client.
On the compute nodes, I have the following RPMs: torque & torque-client.

I know that pbs_mom is running on the compute nodes and that pbs_server, pbs_sched, and pbs_mom are all running on the head node.

The problem is that after I have everything installed, I am unable to create a queue. Here is some output from my screen.

[root <at> tirpitz ~]# qmgr -c "create queue my_queue"
qmgr obj=my_queue svr=default: Unauthorized Request  MSG=error in permissions (PERM_MANAGER)

After browsing through a few forums, I have been led to believe my problem might be in my /etc/hosts file.

[root <at> tirpitz ~]# cat /etc/hosts
# Added by rocks report host #
#        DO NOT MODIFY       #
#  Add any modifications to  #
#    /etc/hosts.local file   #

127.0.0.1       localhost.localdomain   localhost

10.1.255.254    compute-0-0.local       compute-0-0
10.1.255.204    compute-0-0-mic0.local  compute-0-0-mic0
10.1.255.253    compute-0-1.local       compute-0-1
10.1.255.203    compute-0-1-mic0.local  compute-0-1-mic0
10.1.255.252    compute-0-2.local       compute-0-2
10.1.255.202    compute-0-2-mic0.local  compute-0-2-mic0
10.1.255.251    compute-0-3.local       compute-0-3
10.1.255.201    compute-0-3-mic0.local  compute-0-3-mic0
10.1.1.1        tirpitz.local   tirpitz
172.30.170.6    tirpitz.(X).(Y).(Z)

I should also note that the /etc/hosts.local file does not exist.
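
For reference, the check I plan to try next (this assumes the usual requirement that the account running qmgr must appear in the server's managers/operators list under the hostname pbs_server resolves, which I have not confirmed for 4.2.8) is:

# show who currently has manager/operator rights
qmgr -c "print server" | grep -Ei "managers|operators"

# confirm which name this host resolves to
hostname

# if root on this host is missing, add it (the hostname below is mine; adjust as needed)
qmgr -c "set server managers += root@tirpitz.local"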

 

Might someone be able to point out a mistake I’m making?

Kind regards,

Brent


_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Renato Alves | 13 Oct 18:37 2014

Configuring Maui/Torque to map different queues to same nodes based on available resources

Hi everyone,

First of all, apologies if this message is sent to the wrong list but
the Maui mailing list seems dead or abandoned since last July.

I want to set up Maui/Torque so that queues map to specific nodes. I
currently have 2 queues (main, priv) and 2 nodes (node1, node2).

What I would like to achieve is the following:

main -> node1
priv -> node2, node1 (only when node2 is "full", jobs should go to node1)

I've tried to use PARTITIONs and have each queue map to a different
partition. This makes each queue send requests to each node but it's not
sufficient as it seems one cannot have the same node in more than one
partition.

I've also tried to implement reservations and use the "queue to node
mapping" feature but so far I've failed to make it work as intended.
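
For reference, the reservation attempt looked roughly like this in maui.cfg (reproduced from memory, so the exact syntax may be off):

# standing reservation meant to pin the priv class to node2
SRCFG[priv] HOSTLIST=node2 CLASSLIST=priv PERIOD=INFINITY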

My current configuration is:

# maui.cfg
NODEALLOCATIONPOLICY  MINRESOURCE
NODECFG[node1] RACK=1 SLOT=1 PARTITION=MAIN
NODECFG[node2] RACK=2 SLOT=1 PARTITION=PRIVATE

# torque server_priv/nodes
node0 np=32 main
node1 np=24 private

# torque qmgr
set queue main resources_default.neednodes = main
set queue main resources_default.nodes = 1
set queue priv resources_default.neednodes = private
set queue priv resources_default.nodes = 1

Can someone provide some guidance on getting the queue -> node mapping to
work as intended?

Thanks
Renato

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Eduardo A. Suárez | 10 Oct 12:54 2014

problem compiling torque 4.2.x

Hi,

I'm compiling torque 4.2.x (I've tried 4.2.7, 4.2.8 and 4.2.9, all with the
same problem) on Debian Linux with gcc 4.9.1.

I get this error:

...
g++ -DHAVE_CONFIG_H -I. -I../../src/include  -I../../src/include
-DPBS_SERVER_HOME=\"/var/spool/torque\"
-DPBS_ENVIRON=\"/var/spool/torque/pbs_environment\" `xml2-config
--cflags` -g -DGEOMETRY_REQUESTS -DALWAYS_USE_CPUSETS -DNUMA_SUPPORT
-I/usr/include -fstack-protector -Wformat -Wformat-security
-DFORTIFY_SOURCE=2 -MT req_quejob.o -MD -MP -MF .deps/req_quejob.Tpo
-c -o req_quejob.o req_quejob.c

req_quejob.c: In function 'int req_commit(batch_request*)':
req_quejob.c:2002:31: error: 'snprint' was not declared in this scope
make[3]: *** [req_quejob.o] Error 1
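
The only workaround I can think of (assuming the call at req_quejob.c:2002 is simply meant to be snprintf, which I have not confirmed against the sources) would be to patch it in the source tree before running make:

# hypothetical local fix: rename the undeclared call to snprintf
sed -i 's/\bsnprint(/snprintf(/g' src/server/req_quejob.c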

Eduardo.-

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
