Eva Hocks | 29 Jun 20:44 2015

torque 4.2.9 hierarchy file


Can anybody enlighten me on the hierarchy file?

The one created by the server running Torque 4.2.9 has incorrect
entries, such as wrong IP addresses, even though the hostnames of the
MOMs resolve correctly via DNS. It also lists various MOMs on port 0
in addition to port 15003, the default.
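
For reference, a hand-maintained server_priv/mom_hierarchy normally looks something like the sketch below (the hostnames and the two-level layout are placeholders, not taken from this cluster); the server only generates its own hierarchy when this file is absent, and the port suffix is optional, defaulting to 15003:

    <path>
      <level>mom01.example.org,mom02.example.org</level>
      <level>mom03.example.org:15003,mom04.example.org:15003</level>
    </path>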

The defective hierarchy file results in errors such as:

LOG_ERROR::tcp_request, bad connect from 10.1.255.132:528

Any help would be appreciated.

Thanks
Eva
Rui Zhang | 29 Jun 09:10 2015

HA pbs server setup

Hi All,

I would like to ask for some advice on a highly available pbs_server setup. I installed torque-4.2.10 from the EPEL
repository on two different virtual machines acting as pbs servers. I want them to be highly available and load
balanced. I also have two worker nodes. I set up the cluster across the four machines and jobs can be
submitted from either server. But I suspect I have really just created two separate pbs systems, one being server1 +
2 worker nodes and the other server2 + 2 worker nodes, for the following reasons:
1. When I submit jobs from the two servers, the job IDs are generated independently. Each time, I submit three jobs
from server 1 and two jobs from server 2; for example, on the second round the IDs from server 1 are
4, 5, 6 and the IDs from server 2 are 3, 4.
2. I created a queue with the same name on both servers. If I delete the queue from server 2 and then submit jobs again,
it says:
	qsub: submit error (Unknown queue MSG=requested queue not found)
3. server_priv/ is not shared between the two servers; only server_name is shared via NFS. I also set the lock file
path with
	qmgr -c "set server lock_file=/nfs/path"
but it is hard to tell whether it took effect.
Can anyone give me some suggestions on how to check whether I have actually set up a highly available server? I start
the service on both pbs servers with
	pbs_server --ha
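
For comparison, a hedged sketch of the usual --ha layout (the paths and hostnames below are placeholders, not taken from this setup). As documented, Torque's built-in HA is active/standby behind a shared lock file rather than load balanced, so a working pair should hand out job IDs from a single sequence no matter which server you submit to:

    # 1. Both servers mount the same TORQUE_HOME/server_priv (not only server_name)
    #    from shared storage, e.g. /var/spool/torque/server_priv exported over NFS.
    # 2. TORQUE_HOME/server_name on the servers, MOMs and submit hosts lists both:
    #        server1,server2
    # 3. Put the lock file on the shared filesystem and confirm the attribute stuck:
    qmgr -c 'set server lock_file = /var/spool/torque/server_priv/server.lock'
    qmgr -c 'print server' | grep lock_file
    # 4. Start both instances with:
    pbs_server --ha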
Thanks in advance.

Cheers,
Rui
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

Zahid Hossain | 25 Jun 19:44 2015

Torque Web Monitor

Hi, has anyone configured Torque Web Monitor
(http://sourceforge.net/projects/torquemon/) with the free version of
Torque? It expects a command-line tool "pbs_connect", which doesn't
seem to exist.
Svancara, Randall | 24 Jun 18:49 2015

Failing Interactive Jobs using qsub

Hi

Every time I submit an interactive job of the form:

qsub -I -l nodes=1:ppn=24,walltime=2:00:00

The job exits immediately.  I am trying to figure out why.  I have provided logs below.


Relevant PBS_SERVER logs
06/24/2015 09:08:10.549;64;PBS_Server.31538;Req;node_spec;entered spec=1:ppn=24
06/24/2015 09:08:10.549;64;PBS_Server.31538;Req;node_spec;job allocation debug: 1 requested, 664 svr_clnodes, 20 svr_totnodes
06/24/2015 09:08:10.549;01;PBS_Server.31538;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus available on node compute-2-1-ib.local
06/24/2015 09:08:10.549;01;PBS_Server.31538;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus free on node compute-2-1-ib.local
06/24/2015 09:08:10.549;64;PBS_Server.31538;Req;node_spec;job allocation debug(3): returning 1 requested
06/24/2015 09:08:10.550;08;PBS_Server.31538;Job;reply_send_svr;Reply sent for request type QueueJob on socket 9
06/24/2015 09:08:10.551;128;PBS_Server.31538;Req;dis_request_read;decoding command ReadyToCommit from rsvancara
06/24/2015 09:08:10.551;08;PBS_Server.31538;Job;dispatch_request;dispatching request ReadyToCommit on sd=9
06/24/2015 09:08:10.551;08;PBS_Server.31538;Job;62630.mgt2-ib.local;ready to commit job
06/24/2015 09:08:10.551;08;PBS_Server.31538;Job;reply_send_svr;Reply sent for request type ReadyToCommit on socket 9
06/24/2015 09:08:10.551;08;PBS_Server.31538;Job;62630.mgt2-ib.local;ready to commit job completed
06/24/2015 09:08:10.551;128;PBS_Server.31538;Req;dis_request_read;decoding command Commit from rsvancara
06/24/2015 09:08:10.551;08;PBS_Server.31538;Job;dispatch_request;dispatching request Commit on sd=9
06/24/2015 09:08:10.551;08;PBS_Server.31538;Job;62630.mgt2-ib.local;committing job
06/24/2015 09:08:10.552;08;PBS_Server.31538;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from TRANSIT-TRANSICM to QUEUED-QUEUED (1-10)
06/24/2015 09:08:10.552;256;PBS_Server.31538;Job;62630.mgt2-ib.local;enqueuing into batch, state 1 hop 1
06/24/2015 09:08:10.552;02;PBS_Server.31538;Svr;lock_sv_qs_mutex;svr_enquejob: locking sv_qs_mutex
06/24/2015 09:08:10.552;02;PBS_Server.31538;Svr;unlock_sv_qs_mutex;svr_enquejob: unlocking sv_qs_mutex
06/24/2015 09:08:10.552;08;PBS_Server.31538;Job;svr_enquejob;jobs queued job id 62630.mgt2-ib.local for batch
06/24/2015 09:08:10.553;08;PBS_Server.31538;Job;reply_send_svr;Reply sent for request type Commit on socket 9
06/24/2015 09:08:10.553;08;PBS_Server.31538;Job;req_commit;job_id: 62630.mgt2-ib.local
06/24/2015 09:08:10.553;128;PBS_Server.31538;Req;dis_request_read;decoding command Disconnect from rsvancara
06/24/2015 09:08:10.553;02;PBS_Server.31538;node;close_conn;Closing connection 9 and calling its accompanying function on close
06/24/2015 09:08:11.052;128;PBS_Server.18865;Req;dis_request_read;decoding command StatusNode from maui
06/24/2015 09:08:11.052;08;PBS_Server.18865;Job;dispatch_request;dispatching request StatusNode on sd=10
06/24/2015 09:08:11.052;64;PBS_Server.18865;Req;req_stat_node;entered
06/24/2015 09:08:11.053;08;PBS_Server.18865;Job;reply_send_svr;Reply sent for request type StatusNode on socket 10
06/24/2015 09:08:11.054;128;PBS_Server.18865;Req;dis_request_read;decoding command StatusQueue from maui
06/24/2015 09:08:11.054;08;PBS_Server.18865;Job;dispatch_request;dispatching request StatusQueue on sd=10
06/24/2015 09:08:11.054;08;PBS_Server.18865;Job;reply_send_svr;Reply sent for request type StatusQueue on socket 10
06/24/2015 09:08:11.056;128;PBS_Server.18865;Req;dis_request_read;decoding command StatusJob from maui
06/24/2015 09:08:11.056;08;PBS_Server.18865;Job;dispatch_request;dispatching request StatusJob on sd=10
06/24/2015 09:08:11.056;08;PBS_Server.18865;Job;req_stat_job;note
06/24/2015 09:08:11.060;08;PBS_Server.18865;Job;reply_send_svr;Reply sent for request type StatusJob on socket 10
06/24/2015 09:08:11.060;02;PBS_Server.18865;Job;req_statjob;Successfully returned the status of queued jobs
06/24/2015 09:08:11.064;128;PBS_Server.18865;Req;dis_request_read;decoding command RunJob from maui
06/24/2015 09:08:11.064;08;PBS_Server.18865;Job;dispatch_request;dispatching request RunJob on sd=10
06/24/2015 09:08:11.064;64;PBS_Server.18865;Req;set_nodes;allocating nodes for job 62630.mgt2-ib.local with node expression 'compute-2-9-ib.local:ppn=24'
06/24/2015 09:08:11.064;64;PBS_Server.18865;Req;node_spec;entered spec=compute-2-9-ib.local:ppn=24
06/24/2015 09:08:11.064;64;PBS_Server.18865;Req;node_spec;job allocation debug: 1 requested, 664 svr_clnodes, 20 svr_totnodes
06/24/2015 09:08:11.064;01;PBS_Server.18865;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus available on node compute-2-9-ib.local
06/24/2015 09:08:11.064;01;PBS_Server.18865;Svr;PBS_Server;LOG_DEBUG::gpu_count, Counted 0 gpus free on node compute-2-9-ib.local
06/24/2015 09:08:11.064;64;PBS_Server.18865;Req;node_spec;job allocation debug(3): returning 1 requested
06/24/2015 09:08:11.064;64;PBS_Server.18865;Req;set_nodes;job 62630.mgt2-ib.local allocated 1 nodes (nodelist=compute-2-9-ib.local/28-51)
06/24/2015 09:08:11.064;08;PBS_Server.18865;Job;62630.mgt2-ib.local;Job Run at request of maui <at> mgt2-ib.local
06/24/2015 09:08:11.064;08;PBS_Server.18865;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from QUEUED-QUEUED to RUNNING-PRERUN (4-40)
06/24/2015 09:08:11.065;04;PBS_Server.18865;Svr;svr_connect;attempting connect to host 10.50.1.209 port 15002
06/24/2015 09:08:11.124;08;PBS_Server.18865;Job;62630.mgt2-ib.local;entering finish_sendmom
06/24/2015 09:08:11.124;02;PBS_Server.18865;Job;62630.mgt2-ib.local;child reported success for job after 0 seconds (dest=???), rc=0
06/24/2015 09:08:11.124;08;PBS_Server.18865;Job;reply_send_svr;Reply sent for request type RunJob on socket 10
06/24/2015 09:08:11.124;08;PBS_Server.18865;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from RUNNING-TRNOUTCM to RUNNING-RUNNING (4-42)
06/24/2015 09:08:11.125;13;PBS_Server.18865;Job;62630.mgt2-ib.local;preparing to send 'b' mail for job 62630.mgt2-ib.local to rsvancara <at> login2.local (---)
06/24/2015 09:08:11.125;13;PBS_Server.18865;Job;62630.mgt2-ib.local;Not sending email: User does not want mail of this type.
06/24/2015 09:08:11.228;128;PBS_Server.18885;Req;dis_request_read;decoding command JobObituary from pbs_mom
06/24/2015 09:08:11.228;08;PBS_Server.18885;Job;dispatch_request;dispatching request JobObituary on sd=11
06/24/2015 09:08:11.228;09;PBS_Server.18885;Job;62630.mgt2-ib.local;obit received - updating final job usage info
06/24/2015 09:08:11.228;08;PBS_Server.18885;Job;62630.mgt2-ib.local;attr resources_used modified
06/24/2015 09:08:11.228;08;PBS_Server.18885;Job;62630.mgt2-ib.local;attr Error_Path modified
06/24/2015 09:08:11.228;08;PBS_Server.18885;Job;62630.mgt2-ib.local;attr Output_Path modified
06/24/2015 09:08:11.229;08;PBS_Server.18885;Job;reply_send_svr;Reply sent for request type JobObituary on socket 11
06/24/2015 09:08:11.229;13;PBS_Server.18885;Job;62630.mgt2-ib.local;preparing to send 'a' mail for job 62630.mgt2-ib.local to rsvancara <at> login2.local (Job cannot be executed
06/24/2015 09:08:11.229;13;PBS_Server.18885;Job;62630.mgt2-ib.local;Updated mailto from job owner: 'rsvancara <at> login2.local'
06/24/2015 09:08:11.229;09;PBS_Server.18885;Job;62630.mgt2-ib.local;job exit status -1 handled
06/24/2015 09:08:11.248;16;PBS_Server.18885;Job;62630.mgt2-ib.local;Exit_status=-1 resources_used.cput=00:00:00 resources_used.energy_used=0 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00 Error_Path=/dev/pts/4 Output_Path=/dev/pts/4
06/24/2015 09:08:11.248;08;PBS_Server.18885;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from RUNNING-RUNNING to EXITING-EXITING (5-50)
06/24/2015 09:08:11.248;09;PBS_Server.18885;Job;62630.mgt2-ib.local;on_job_exit task assigned to job
06/24/2015 09:08:11.248;09;PBS_Server.18885;Job;62630.mgt2-ib.local;req_jobobit completed
06/24/2015 09:08:11.248;08;PBS_Server.18885;Job;62630.mgt2-ib.local;calling on_job_exit from req_jobobit
06/24/2015 09:08:11.249;08;PBS_Server.12611;Job;62630.mgt2-ib.local;on_job_exit valid pjob: 62630.mgt2-ib.local (substate=50)
06/24/2015 09:08:11.249;08;PBS_Server.12611;Job;handle_exiting_or_abort_substate;62630.mgt2-ib.local; JOB_SUBSTATE_EXITING
06/24/2015 09:08:11.249;08;PBS_Server.12611;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from EXITING-EXITING to EXITING-RETURNSTD (5-70)
06/24/2015 09:08:11.250;128;PBS_Server.18885;Req;dis_request_read;decoding command Disconnect from pbs_mom
06/24/2015 09:08:11.250;08;PBS_Server.12611;Job;62630.mgt2-ib.local;no spool files to return
06/24/2015 09:08:11.250;08;PBS_Server.12611;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from EXITING-RETURNSTD to EXITING-STAGEOUT (5-52)
06/24/2015 09:08:11.250;08;PBS_Server.12611;Job;handle_stageout;JOB_SUBSTATE_STAGE_OUT: 62630.mgt2-ib.local
06/24/2015 09:08:11.250;08;PBS_Server.12611;Job;62630.mgt2-ib.local;no files to copy
06/24/2015 09:08:11.251;08;PBS_Server.12611;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from EXITING-STAGEOUT to EXITING-STAGEDEL (5-53)
06/24/2015 09:08:11.251;08;PBS_Server.12611;Job;handle_stagedel;JOB_SUBSTATE_STAGEDEL: 62630.mgt2-ib.local
06/24/2015 09:08:11.251;08;PBS_Server.12611;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from EXITING-STAGEDEL to EXITING-EXITED (5-54)
06/24/2015 09:08:11.251;08;PBS_Server.12611;Job;handle_exited;62630.mgt2-ib.local; JOB_SUBSTATE_EXITED
06/24/2015 09:08:11.251;08;PBS_Server.12611;Job;mom_comm;62630.mgt2-ib.local
06/24/2015 09:08:11.251;04;PBS_Server.12611;Svr;svr_connect;attempting connect to host 10.50.1.209 port 15002
06/24/2015 09:08:11.258;64;PBS_Server.12611;Req;free_nodes;freeing nodes for job 62630.mgt2-ib.local
06/24/2015 09:08:11.258;64;PBS_Server.12611;Req;remove_job_from_node;increased execution slot free count to 36 of 64
06/24/2015 09:08:11.258;08;PBS_Server.12611;Job;svr_setjobstate;svr_setjobstate: setting job 62630.mgt2-ib.local state from EXITING-EXITED to COMPLETE-COMPLETE (6-59)
06/24/2015 09:08:11.258;08;PBS_Server.12611;Job;svr_setjobstate;jobs queued job id 62630.mgt2-ib.local for users
06/24/2015 09:08:11.258;08;PBS_Server.12611;Job;svr_setjobstate;jobs queued job id 62630.mgt2-ib.local for queue batch
06/24/2015 09:08:11.258;08;PBS_Server.12611;Job;62630.mgt2-ib.local;JOB_SUBSTATE_COMPLETE
06/24/2015 09:08:11.258;08;PBS_Server.12611;Job;62630.mgt2-ib.local;adding job to completed_jobs_map from handle_complete_first_time
06/24/2015 09:08:12.009;128;PBS_Server.18865;Req;dis_request_read;decoding command StatusNode from maui
06/24/2015 09:08:12.009;08;PBS_Server.18865;Job;dispatch_request;dispatching request StatusNode on sd=10
06/24/2015 09:08:12.009;64;PBS_Server.18865;Req;req_stat_node;entered
06/24/2015 09:08:12.010;08;PBS_Server.18865;Job;reply_send_svr;Reply sent for request type StatusNode on socket 10
06/24/2015 09:08:12.011;128;PBS_Server.18865;Req;dis_request_read;decoding command StatusQueue from maui
06/24/2015 09:08:12.011;08;PBS_Server.18865;Job;dispatch_request;dispatching request StatusQueue on sd=10
06/24/2015 09:08:12.011;08;PBS_Server.18865;Job;reply_send_svr;Reply sent for request type StatusQueue on socket 10
06/24/2015 09:08:12.013;128;PBS_Server.18865;Req;dis_request_read;decoding command StatusJob from maui
06/24/2015 09:08:12.013;08;PBS_Server.18865;Job;dispatch_request;dispatching request StatusJob on sd=10
06/24/2015 09:08:12.013;08;PBS_Server.18865;Job;req_stat_job;note

Relevant PBS_MOM logs

06/24/2015 09:08:11.121;01;   pbs_mom.63713;Job;exec_job_on_ms;ALERT:  job failed phase 3 start - jobid 62630.mgt2-ib.local
06/24/2015 09:08:11.121;01;   pbs_mom.63713;Job;exec_bail;bailing on job 62630.mgt2-ib.local code -1
06/24/2015 09:08:11.121;08;   pbs_mom.63713;Req;send_sisters;sending ABORT to sisters for job 62630.mgt2-ib.local
06/24/2015 09:08:11.165;128;   pbs_mom.63713;Svr;preobit_preparation;top
06/24/2015 09:08:11.228;128;   pbs_mom.63713;Job;62630.mgt2-ib.local;obit sent to server
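
Not part of the original post, but a hedged checklist of standard Torque tools that usually narrows down a "job failed phase 3 start" on interactive jobs (the hostname and job id are taken from the logs above):

    # Ask the MOM that ran the job for its own diagnostics
    momctl -d 3 -h compute-2-9-ib.local

    # Replay the server/MOM/accounting records for the job
    tracejob -n 2 62630

    # Interactive jobs also need the MOM to connect back to the submit host
    # (qsub -I listens on an ephemeral port), so check for firewalls between
    # the compute node and the submit host, then retry with a small request:
    qsub -I -l nodes=1:ppn=1,walltime=00:05:00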

MANUEL MORICHE GUERRERO | 24 Jun 11:08 2015

Re: all jobs sent to one node

Thank you David, you were right. Torque did allocate the nodes properly; the problem was in how the processes were being launched from the mother superior.

The solution is to call mpirun with:

mpirun -np 36 -hostfile $PBS_NODEFILE ./path/to/executable

PBS_NODEFILE is created by Torque, so there is no need to create the environment variable ourselves.

This differs from our former cluster configuration, where "-hostfile $PBS_NODEFILE" did not need to be specified.
Also, -lnodes=36 does not work anymore; we need to define -lnodes=3:ppn=12.
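
For anyone finding this thread later, a minimal job-script sketch of the working setup described above (the program path and resource numbers are placeholders):

    #!/bin/bash
    #PBS -N mpi_test
    #PBS -l nodes=3:ppn=12,walltime=01:00:00
    cd $PBS_O_WORKDIR
    NP=$(wc -l < $PBS_NODEFILE)      # one line per slot, so 36 for nodes=3:ppn=12
    mpirun -np $NP -hostfile $PBS_NODEFILE ./path/to/executable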

So the original problem is solved, but if someone could help with this (minor) configuration issue, it would be appreciated.

Manuel.

2015-06-22 17:59 GMT+02:00 <torqueusers-request <at> supercluster.org>:


Today's Topics:

   1. What is a 'task' (Andrus, Brian Contractor)
   2. Re: What is a 'task' (Rick McKay)
   3. all jobs sent to one node (MANUEL MORICHE GUERRERO)
   4. Re: all jobs sent to one node (David Beer)
   5. CPULOAD with EXACTNODE makes a job queued indefinitely
      (Zahid Hossain)
   6. Re: Critical bug on restart with job dependencies with Torque
      5.1 or 5.1.0.h2 (David Beer)


----------------------------------------------------------------------

Message: 1
Date: Mon, 15 Jun 2015 17:13:19 +0000
From: "Andrus, Brian Contractor" <bdandrus <at> nps.edu>
Subject: [torqueusers] What is a 'task'
To: "Torque Users Mailing List (torqueusers <at> supercluster.org)"
        <torqueusers <at> supercluster.org>
Message-ID:
        <ADC981242279AD408816CB7141A2789DC533BD52 <at> GROWLER.ern.nps.edu>
Content-Type: text/plain; charset="us-ascii"

Ok, this has been bugging me a bit.

What, in the context of Torque/Moab/Maui is a 'task'?

The way it is referenced, it seems to sometimes be a job that can run, a single processor core or even a node.

The same confusion could be said for 'class' as well, but I think that is mostly a synonym for 'queue'

I'd suggest cleaning up the docs AND help files for consistency. My users keep getting confused.


Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238



------------------------------

Message: 2
Date: Mon, 15 Jun 2015 11:23:26 -0600
From: Rick McKay <rmckay <at> adaptivecomputing.com>
Subject: Re: [torqueusers] What is a 'task'
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
        <CA+O7FjXhJGtiP40XA6ttHTMc683U6GQuxR84-qcAbBEpoHmE=w <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi Brian,

Can you break down which parts of the definition you'd suggest we improve?

http://docs.adaptivecomputing.com/8-0-2/basic/help.htm#topics/moabWorkloadManager/topics/schedBasics/environment.html#taskdefinition

I'll be happy to work with documentation to make it better.

Rick
------
Rick McKay | Technical Support Engineer
rmckay <at> adaptivecomputing.com



On Mon, Jun 15, 2015 at 11:13 AM, Andrus, Brian Contractor <bdandrus <at> nps.edu
> wrote:

>  Ok, this has been bugging me a bit.
>
>
>
> What, in the context of Torque/Moab/Maui is a 'task'?
>
>
>
> The way it is referenced, it seems to sometimes be a job that can run, a
> single processor core or even a node.
>
>
>
> The same confusion could be said for 'class' as well, but I think that is
> mostly a synonym for 'queue'
>
>
>
> I'd suggest cleaning up the docs AND help files for consistency. My users
> keep getting confused.
>
>
>
>
>
> Brian Andrus
>
> ITACS/Research Computing
>
> Naval Postgraduate School
>
> Monterey, California
>
> voice: 831-656-6238
>
>
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers <at> supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>

------------------------------

Message: 3
Date: Sun, 14 Jun 2015 16:15:17 +0200
From: MANUEL MORICHE GUERRERO <mmoriche <at> ing.uc3m.es>
Subject: [torqueusers] all jobs sent to one node
To: torqueusers <at> supercluster.org
Message-ID:
        <CAEfxEd7HOAo_LxH2z0_-Z+xD5bqo22PD3dz4jbTP2wV7=MTJzQ <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi all,

We have a cluster installed with rocks 6.2, including roll torque.

We are having the following problem when submitting jobs: if a job is
scheduled on more than one node, say -l nodes=3:ppn=12, all 36 processes
go to one compute node and the other two nodes have no processes on them.

qstat says that the three nodes are being used, 12 procs each, but logging
into the nodes and running ps shows the processes are all on one node.

Also, if a job is already running with only 12 procs and we send another
one with 12 procs without specifying the node (-lnodes=12), the former
node can be overloaded with 24 procs.

We are using Torque 4.2.7 and Maui 3.3.1.

Any help will be appreciated.

Thanks in advance

Manuel

--

------------------------------

Message: 4
Date: Mon, 15 Jun 2015 16:58:23 -0600
From: David Beer <dbeer <at> adaptivecomputing.com>
Subject: Re: [torqueusers] all jobs sent to one node
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
        <CAFUQeZ3tfyEfxWNQ8V_1FMr8OM3kVwjGibyuN30pvze5g7MWbQ <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

How is your script launching the processes? Torque reserves the 3 nodes for
you, but it only starts the job script on the mother superior (the first
host listed in exec_host, which you can find by running qstat -f). Your
script then has to launch all the other processes that it'd like to run.
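
For example (the job id below is a placeholder), the mother superior can be read straight off the job's full status, and the reserved node list is what the script should feed its launcher:

    qstat -f 1234 | grep exec_host    # first host listed is the mother superior
    cat $PBS_NODEFILE                 # inside the job: one line per reserved slot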

On Sun, Jun 14, 2015 at 8:15 AM, MANUEL MORICHE GUERRERO <
mmoriche <at> ing.uc3m.es> wrote:

> Hi all,
>
> We have a cluster installed with rocks 6.2, including roll torque.
>
> We are having the following problem when submitting jobs: If the job is
> scheduled for more than one node, let's say -l nodes=3:ppn=12, all
> processes go to one compute node (36) and the other two nodes have no
> process in them.
>
> qstat says that the three nodes are being used, 12 procs each, but
> entering the nodes and doing ps, the process are all in one node.
>
> Also, if a job is already running with only 12 procs, and we sent another
> one with 12 procs without specifying the node (-lnodes=12) the former node
> can be overloaded with 24 procs.
>
> We are using torque 4.2.7 maui 3.3.1
>
> Any help will be appreciated.
>
> Thanks in advance
>
> Manuel
>
> --
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers <at> supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>


--
David Beer | Senior Software Engineer
Adaptive Computing

------------------------------

Message: 5
Date: Tue, 16 Jun 2015 15:54:22 -0700
From: Zahid Hossain <zhossain <at> stanford.edu>
Subject: [torqueusers] CPULOAD with EXACTNODE makes a job queued
        indefinitely
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID: <D72CE3D1-0DF8-42FA-AA44-170919E4B0FB <at> stanford.edu>
Content-Type: text/plain; charset=utf-8

Hi I have the following:

NODEALLOCATIONPOLICY    CPULOAD
JOBNODEMATCHPOLICY      EXACTNODE

And then when I do

> qsub -l nodes=6 test.qsub

The job remains in 'Q' state forever. This does not happen when I have 'NODEALLOCATIONPOLICY    MINRESOURCE', but I need the capability of CPULOAD, so how do I do this? Basically, I want jobs to really be distributed across X nodes when I say "-l nodes=X",
but otherwise it should schedule nodes according to CPULOAD.

Zahid


------------------------------

Message: 6
Date: Mon, 22 Jun 2015 09:59:07 -0600
From: David Beer <dbeer <at> adaptivecomputing.com>
Subject: Re: [torqueusers] Critical bug on restart with job
        dependencies with Torque 5.1 or 5.1.0.h2
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
        <CAFUQeZ3Z=2PNq+Egv_KJOqL=LaG1AP5k3wiwr_QAvd0_jerX0Q <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Michel,

I apologize for a very late response. Can you provide more details on how
to reproduce this issue? I failed to reproduce this with the following test:

dbeer <at> napali:~/dev/torque/5.1-dev$ echo sleep 200 | qsub
22718.napali
dbeer <at> napali:~/dev/torque/5.1-dev$ echo sleep 200 | qsub -W depend=afterok:22718
22719.napali
dbeer <at> napali:~/dev/torque/5.1-dev$ echo sleep 200 | qsub -W depend=afterok:22719
22720.napali
dbeer <at> napali:~/dev/torque/5.1-dev$ echo sleep 200 | qsub -W depend=afterok:22720
22721.napali
dbeer <at> napali:~/dev/torque/5.1-dev$ echo sleep 200 | qsub -W depend=afterok:22721
22722.napali
dbeer <at> napali:~/dev/torque/5.1-dev$ echo sleep 200 | qsub -W depend=afterok:22722
22723.napali
dbeer <at> napali:~/dev/torque/5.1-dev$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
22718.napali               STDIN            dbeer                  0 Q batch
22719.napali               STDIN            dbeer                  0 H batch
22720.napali               STDIN            dbeer                  0 H batch
22721.napali               STDIN            dbeer                  0 H batch
22722.napali               STDIN            dbeer                  0 H batch
22723.napali               STDIN            dbeer                  0 H batch
dbeer <at> napali:~/dev/torque/5.1-dev$ qterm
dbeer <at> napali:~/dev/torque/5.1-dev$ sudo pbs_server
dbeer <at> napali:~/dev/torque/5.1-dev$ qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
22718.napali               STDIN            dbeer                  0 Q batch
22719.napali               STDIN            dbeer                  0 H batch
22720.napali               STDIN            dbeer                  0 H batch
22721.napali               STDIN            dbeer                  0 H batch
22722.napali               STDIN            dbeer                  0 H batch
22723.napali               STDIN            dbeer                  0 H batch

On Tue, Jun 2, 2015 at 2:29 PM, Michel Béland <michel.beland <at> calculquebec.ca
> wrote:

> Hello,
>
> As I said in a previous message last month, we are plagued by regular
> crashes of pbs_server on a production Linux cluster. It crashes every
> two or three days. As I did not have time then to upgrade Torque to the
> hotfix
> (http://files.adaptivecomputing.com/hotfix/torque-5.1.0.h2.tar.gz)
> provided by David Beer on this list (we are now ready to do it at the
> next crash), we just wrote a small script that would just restart
> pbs_server whenever it died.
>
> However, by doing this we found another bug in Torque. Whenever we start
> pbs_server it kills jobs with dependencies. You can find below the
> content of server_logs for the date in question when it happens,
> reproduced on a test cluster (this is with 5.1.0.h2). It seems that
> Torque requeues jobs from the server database on disk, but does not load
> *all* the jobs before checking the dependencies. It means that it fails
> to find a job mentioned typically in a beforeok dependency because it
> did not requeue it yet. When this happens, Torque kills the first job
> and the dependent jobs are killed because the dependency cannot be met.
>
> Is there a workaround or a fix for this problem? Installing the hotfix
> will hopefully help on the production cluster (if the pbs_server crashes
> are eliminated), but it does not solve the bug.
>
> 06/02/2015 14:06:56;0002;PBS_Server.22660;Svr;Log;Log opened
> 06/02/2015 14:06:56;0006;PBS_Server.22660;Svr;PBS_Server;Server egeon2
> started, initialization type = 1
> 06/02/2015
> 14:06:56;0002;PBS_Server.22660;Svr;get_default_threads;Defaulting
> min_threads to 3 threads
> 06/02/2015 14:06:56;0002;PBS_Server.22660;Svr;Act;Account file
> /var/spool/pbs/server_priv/accounting/20150602 opened
> 06/02/2015 14:06:56;0040;PBS_Server.22660;Req;setup_nodes;setup_nodes()
> 06/02/2015 14:06:56;0086;PBS_Server.22660;Svr;PBS_Server;Recovered queue
> batch
> 06/02/2015 14:06:56;0002;PBS_Server.22660;Svr;PBS_Server;Expected 1,
> recovered 1 queues
> 06/02/2015 14:06:56;0080;PBS_Server.22660;Svr;PBS_Server;12 total files
> read from disk
> 06/02/2015 14:06:56;0100;PBS_Server.22660;Job;88.egeon2;enqueuing into
> batch, state 4 hop 1
> 06/02/2015 14:06:56;0080;PBS_Server.22660;Job;89.egeon2;Unknown Job Id
> Error
> 06/02/2015 14:06:56;0080;PBS_Server.22660;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from
> <at> egeon2
> 06/02/2015
> 14:06:56;0001;PBS_Server.22660;Svr;PBS_Server;LOG_ERROR::Unknown Job Id
> Error (15001) in send_depend_req, Unable to perform dependency with job
> 89.egeon2
> 06/02/2015 14:06:56;0100;PBS_Server.22660;Job;89.egeon2;enqueuing into
> batch, state 2 hop 1
> 06/02/2015 14:06:56;0086;PBS_Server.22660;Job;89.egeon2;Requeueing job,
> substate: 22 Requeued in queue: batch
> 06/02/2015 14:06:56;0100;PBS_Server.22660;Job;90.egeon2;enqueuing into
> batch, state 2 hop 1
> 06/02/2015 14:06:56;0080;PBS_Server.22660;Job;92.egeon2;Unknown Job Id
> Error
> 06/02/2015 14:06:56;0080;PBS_Server.22660;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from
> <at> egeon2
> 06/02/2015
> 14:06:56;0001;PBS_Server.22660;Svr;PBS_Server;LOG_ERROR::Unknown Job Id
> Error (15001) in send_depend_req, Unable to perform dependency with job
> 92.egeon2
> 06/02/2015 14:06:56;000f;PBS_Server.22660;Job;90.egeon2;Unable to open
> script file
> 06/02/2015 14:06:56;0100;PBS_Server.22660;Job;91.egeon2;enqueuing into
> batch, state 2 hop 1
> 06/02/2015 14:06:56;0086;PBS_Server.22660;Job;91.egeon2;Requeueing job,
> substate: 22 Requeued in queue: batch
> 06/02/2015 14:06:56;0100;PBS_Server.22660;Job;92.egeon2;enqueuing into
> batch, state 2 hop 1
> 06/02/2015 14:06:56;0086;PBS_Server.22660;Job;92.egeon2;Requeueing job,
> substate: 22 Requeued in queue: batch
> 06/02/2015
> 14:06:56;0002;PBS_Server.22660;Svr;PBS_Server;handle_job_recovery:3
> 06/02/2015 14:06:56;0006;PBS_Server.22660;Svr;PBS_Server;Using ports
> Server:15001  Scheduler:15004  MOM:15002 (server: 'egeon2')
> 06/02/2015 14:06:56;0002;PBS_Server.22660;Svr;PBS_Server;Server Ready,
> pid = 22660, loglevel=0
> 06/02/2015
> 14:06:58;0001;PBS_Server.22669;Svr;PBS_Server;LOG_ERROR::svr_find_job,
> jobid is null
> 06/02/2015
> 14:06:58;0001;PBS_Server.22669;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom,
> stray
> job 88.egeon2 found on n02
> 06/02/2015 14:06:58;0010;PBS_Server.22678;Job;88.egeon2;Exit_status=271
> resources_used.cput=00:00:00 resources_used.energy_used=0
> resources_used.mem=2876kb resources_used.vmem=26752kb
> resources_used.walltime=00:01:31
> 06/02/2015 14:06:58;000d;PBS_Server.22678;Job;88.egeon2;Not sending
> email: User does not want mail of this type.
> 06/02/2015 14:06:58;0008;PBS_Server.22670;Job;88.egeon2;on_job_exit
> valid pjob: 88.egeon2 (substate=50)
> 06/02/2015 14:06:58;0008;PBS_Server.22669;Job;89.egeon2;Job deleted as
> result of dependency on job 88.egeon2
> 06/02/2015 14:06:58;0008;PBS_Server.22669;Job;90.egeon2;Job deleted as
> result of dependency on job 88.egeon2
> 06/02/2015 14:06:58;0008;PBS_Server.22669;Job;91.egeon2;Job deleted as
> result of dependency on job 88.egeon2
> 06/02/2015 14:06:58;0008;PBS_Server.22679;Job;92.egeon2;Job deleted as
> result of dependency on job 90.egeon2
> 06/02/2015 14:07:07;0002;PBS_Server.22679;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:07:29;0100;PBS_Server.22682;Job;89.egeon2;dequeuing from
> batch, state COMPLETE
> 06/02/2015 14:07:29;0100;PBS_Server.22682;Job;92.egeon2;dequeuing from
> batch, state COMPLETE
> 06/02/2015 14:07:29;0100;PBS_Server.22682;Job;91.egeon2;dequeuing from
> batch, state COMPLETE
> 06/02/2015 14:07:29;0100;PBS_Server.22682;Job;90.egeon2;dequeuing from
> batch, state COMPLETE
> 06/02/2015 14:12:15;0002;PBS_Server.22670;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:17:23;0002;PBS_Server.22670;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:22:31;0002;PBS_Server.22680;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:27:39;0002;PBS_Server.22680;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:32:47;0002;PBS_Server.22681;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:37:55;0002;PBS_Server.22680;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:43:03;0002;PBS_Server.22680;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:48:11;0002;PBS_Server.22681;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
> 06/02/2015 14:50:29;0008;PBS_Server.22678;Job;88.egeon2;purging job
> 88.egeon2 without checking MOM
> 06/02/2015 14:50:29;0100;PBS_Server.22678;Job;88.egeon2;dequeuing from
> batch, state COMPLETE
> 06/02/2015 14:50:29;0080;PBS_Server.22682;Job;89.egeon2;Unknown Job Id
> Error
> 06/02/2015 14:50:29;0080;PBS_Server.22682;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from
> <at> egeon2
> 06/02/2015 14:50:29;0080;PBS_Server.22682;Job;90.egeon2;Unknown Job Id
> Error
> 06/02/2015 14:50:29;0080;PBS_Server.22682;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from
> <at> egeon2
> 06/02/2015 14:50:29;0080;PBS_Server.22682;Job;91.egeon2;Unknown Job Id
> Error
> 06/02/2015 14:50:29;0080;PBS_Server.22682;Req;req_reject;Reject reply
> code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from
> <at> egeon2
> 06/02/2015 14:53:19;0002;PBS_Server.22670;Svr;PBS_Server;Torque Server
> Version = 5.1.0.h2, loglevel = 0
>
> --
> Michel Béland, scientific computing analyst
> michel.beland <at> calculquebec.ca
> office S-250, Roger-Gaudry building (main), Université de Montréal
> phone: 514 343-6111 ext. 3892     fax: 514 343-2155
> Calcul Québec (www.calculquebec.ca)
> Calcul Canada (calculcanada.ca)
>
> _______________________________________________
> torqueusers mailing list
> torqueusers <at> supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



--
David Beer | Senior Software Engineer
Adaptive Computing

------------------------------

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


End of torqueusers Digest, Vol 131, Issue 6
*******************************************



--
MANUEL MORICHE GUERRERO

Universidad Carlos III de Madrid
Zahid Hossain | 17 Jun 00:54 2015

CPULOAD with EXACTNODE makes a job queued indefinitely

Hi, I have the following:

NODEALLOCATIONPOLICY    CPULOAD
JOBNODEMATCHPOLICY      EXACTNODE

And then when I do

> qsub -l nodes=6 test.qsub

the job remains in “Q” state forever. This does not happen when I have “NODEALLOCATIONPOLICY
MINRESOURCE”, but I need the capability of CPULOAD, so how do I do this? Basically, I want jobs to really be
distributed across X nodes when I say "-l nodes=X”,
but otherwise the scheduler should allocate nodes according to CPULOAD.
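
A hedged first step, assuming Maui is the scheduler here (the job id is a placeholder): Maui's own diagnostics usually state exactly why a job cannot be placed under the current allocation policy:

    checkjob -v <jobid>     # per-job reason the scheduler cannot start it
    diagnose -n             # node availability as Maui currently sees it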

Zahid
MANUEL MORICHE GUERRERO | 14 Jun 16:15 2015

all jobs sent to one node

Hi all,

We have a cluster installed with rocks 6.2, including roll torque.

We are having the following problem when submitting jobs: if a job is scheduled on more than one node, say -l nodes=3:ppn=12, all 36 processes go to one compute node and the other two nodes have no processes on them.

qstat says that the three nodes are being used, 12 procs each, but logging into the nodes and running ps shows the processes are all on one node.

Also, if a job is already running with only 12 procs and we send another one with 12 procs without specifying the node (-lnodes=12), the former node can be overloaded with 24 procs.

We are using Torque 4.2.7 and Maui 3.3.1.

Any help will be appreciated.

Thanks in advance

Manuel

--


Andrus, Brian Contractor | 15 Jun 19:13 2015

What is a 'task'

Ok, this has been bugging me a bit.

What, in the context of Torque/Moab/Maui, is a ‘task’?

The way it is referenced, it seems to sometimes mean a job that can run, a single processor core, or even a whole node.

The same confusion applies to ‘class’, but I think that is mostly a synonym for ‘queue’.

I’d suggest cleaning up the docs AND help files for consistency. My users keep getting confused.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

Zahid Hossain | 13 Jun 01:36 2015

Round robin scheduling in PBS/Torque

Hi,

I am basically trying to do round-robin scheduling. Say I have 5
nodes, each with 8 processors. When I do "qsub job.sh", I want the job
to go to node-1 first, then another qsub should send the next job to
node-2, and so forth. Somebody else posted about this before, but the
solution didn't work for me. All my jobs populate node-1 first (i.e.
the first 8 qsubs) before the 9th one goes to node-2. Following are
my server settings:

#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq resources_default.nodes = 1:ppn=1
set queue workq enabled = True
set queue workq started = True
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = mirpur
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server default_node = 1
set server node_pack = False
set server job_stat_rate = 300
set server poll_jobs = True
set server mom_job_sync = True
set server mail_domain = mirpur.stanford.edu
set server keep_completed = 300
set server next_job_number = 225
set server moab_array_compatible = True
set server nppcu = 1
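
A hedged note on the settings above: node_pack is only honoured by the bundled pbs_sched FIFO scheduler; if Maui or Moab is doing the scheduling it is ignored and placement is governed by NODEALLOCATIONPOLICY in maui.cfg instead. Two quick checks (standard Torque commands):

    qmgr -c 'print server' | grep node_pack      # confirm the setting is really active
    pbsnodes -a | grep -E '^[^ ]|jobs ='         # see which node each job landed on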

-- 
Z.
Ph.D candidate
Computer Science, Stanford University
http://cs.stanford.edu/~zhossain/
Tus | 5 Jun 03:06 2015

qsub pvmem filter/wrapper

Hello TorqueUsers,

I would like to create a filter that adds a pvmem parameter if a job is submitted without one, using the same value as mem/pmem when one of those is present. I only want to do this for one user and not system wide.

Also, can someone point me to a link to search the torqueusers archives?

Thank you.
Tus | 4 Jun 18:43 2015

qsub pvmem filter/wrapper

Hello TorqueUsers,

I would like to create a filter that adds a pvmem parameter if a job is
submitted without one, using the same value as mem/pmem when one of those
is present. I only want to do this for one user and not system wide.
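
One hedged way to do this for a single user, without touching a system-wide submit filter, is a small qsub wrapper placed earlier in that user's PATH. The real qsub path below is an assumption, and a real filter would also need to parse #PBS directives from the script itself; this sketch only looks at the command line:

    #!/bin/bash
    REAL_QSUB=/usr/bin/qsub                      # assumed location of the real qsub
    ARGS="$*"
    case "$ARGS" in
      *pvmem=*)                                  # pvmem already requested: pass through
        exec "$REAL_QSUB" "$@" ;;
      *pmem=*)                                   # mirror the pmem value into pvmem
        PMEM=$(printf '%s' "$ARGS" | sed -n 's/.*pmem=\([^ ,:]*\).*/\1/p')
        exec "$REAL_QSUB" -l pvmem="$PMEM" "$@" ;;
      *)                                         # neither given: leave the request alone
        exec "$REAL_QSUB" "$@" ;;
    esac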

Also, can someone point me to a link to search torqueusers archives?

Thank you.
