Viswanath Pasumarthi | 21 Dec 18:33 2014

Host key verification failed

Hi,

We have Torque installed on a Linux cluster in which each node has an
Intel(R) Xeon(R) E5-2630 v2 processor (15M cache, 2.60 GHz) with 6 physical
cores and 12 threads (hyper-threading). If 10 or fewer CPUs per node are
requested (#PBS -l nodes=1:ppn=10,walltime=999:00:00), the job runs, but it
fails to run when 11 or 12 CPUs per node are requested. However, in either
case Torque fails to copy the final output and error files to the working
directory, and prints the following error message in /var/spool/mail/viswanath:

Unable to copy file /var/spool/torque/spool/52.karplus.iitg.ac.in.OU to
viswanath <at> karplus.iitg.ac.in:/home/viswanath/TDTHP-DCA_SDD2/TDDCA_Job1.sh.o52
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in:
/var/spool/torque/undelivered/52.karplus.iitg.ac.in.OU

Unable to copy file /var/spool/torque/spool/52.karplus.iitg.ac.in.ER to
viswanath <at> karplus.iitg.ac.in:/home/viswanath/TDTHP-DCA_SDD2/TDDCA_Job1.sh.e52
*** error from copy
Host key verification failed.
lost connection
*** end error output
Output retained on that host in:
/var/spool/torque/undelivered/52.karplus.iitg.ac.in.ER

SSH login without password is possible. I have followed the procedure laid
(Continue reading)
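
A common cause of this error is that pbs_mom delivers output files with scp, run as the job owner on the compute node, and the destination host's key is not present in any known_hosts file there, so scp aborts with "Host key verification failed". Two frequently used remedies, sketched here under the assumption that root access to the nodes is available; the host and path names are taken from the error output above:

# Pre-seed a system-wide known_hosts on every compute node so scp no
# longer needs to verify the destination host interactively:
ssh-keyscan karplus.iitg.ac.in >> /etc/ssh/ssh_known_hosts

# Or, if /home is mounted on the compute nodes, tell pbs_mom to use
# plain cp instead of scp by adding a $usecp mapping to mom_priv/config
# and restarting pbs_mom:
$usecp karplus.iitg.ac.in:/home /home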

Brock Palen | 19 Dec 19:49 2014

trl= with specific node layouts

We are looking at trl:
http://docs.adaptivecomputing.com/mwm/6-1-9/Content/topics/resourceManagers/rmextensions.html#trl

It works fine as long as you assume you only want N cpus anywhere.  

#PBS -l trl=4@3600:8@2000  etc.

I want to do this with ppn or tpn:

16 cores on one node or 20 cores on one node, not laid out just anywhere.

We are using Moab as our scheduler.

Any thoughts on how to do this?

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
brockp <at> umich.edu
(734)936-1985
David Beer | 18 Dec 18:40 2014

Information about Lincoln Ascent

All,

For any who are interested, here is some information about what's new for the Ascent initiative in our Lincoln release, coming out in January.


--
David Beer | Senior Software Engineer
Adaptive Computing
Richard Young | 18 Dec 00:48 2014

FW: Invalid credential for PBS commands

To All

I am sorry for submitting this again as it seems my original email didn't make it to the list.

I have recently come across a problem with one of my submission nodes on our HPC and was wondering if somebody
could help.

Our HPC has two submission/login nodes and one head node. Both submission/login
nodes were set up the same way and were working until recently. One of the
submission nodes now gets the following error whenever a Torque command is run:

qstat: Invalid credential MSG=Hosts do not match
qmgr obj= svr=default: Invalid credential MSG=Hosts do not match

Torque was installed via RPM and is version 4.2.9. Both submission/login nodes
are set up correctly in the server configuration, as below; they were taken out
and put back in, with the Torque server and client restarted. The file
/var/lib/torque/server_priv/nodes has both servers in it, and this hasn't changed.

set server submit_hosts += server1
set server submit_hosts += server2

The problem exists for all users and for the admin account. Has anybody seen
this problem before and can provide some help in fixing it?
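
The "Hosts do not match" message usually means that the name the failing submit host resolves to no longer matches what pbs_server has recorded for it, so a few hostname/DNS checks on the failing node are worth running first. A sketch only; the file location assumes the RPM layout mentioned above:

# What does this host call itself, and how does it resolve?
hostname
hostname -f
getent hosts $(hostname)

# Which server name does the Torque client use?
cat /var/lib/torque/server_name

# Compare the output with the working submit node, and with the
# submit_hosts / acl_hosts values shown by the working node's
# 'qmgr -c "print server"'.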

Thanks
---------------------------------------------------------------------
Richard A. Young
ICT Services
HPC Support Officer
University of Southern Queensland
Toowoomba, Queensland 4350
Australia 
Email: Richard.Young <at> usq.edu.au   Phone: (07) 46315557   
Mob:   0437544370          Fax:   (07) 46312798 
---------------------------------------------------------------------

_____________________________________________________________
This email (including any attached files) is confidential and is for the intended recipient(s) only. If
you received this email by mistake, please, as a courtesy, tell the sender, then delete this email.

The views and opinions are the originator's and do not necessarily reflect those of the University of
Southern Queensland. Although all reasonable precautions were taken to ensure that this email
contained no viruses at the time it was sent we accept no liability for any losses arising from its receipt.

The University of Southern Queensland is a registered provider of education with the Australian Government.
(CRICOS Institution Code QLD 00244B / NSW 02225M, TEQSA PRV12081 )
Ricardo Román Brenes | 17 Dec 23:29 2014

setting up max resources with AND / OR

Hello everyone.

I have a queue system here and I want jobs to have the following restrictions; a job will run if:
- it requests 4 hosts for up to 24h, or
- it requests 3 hosts for up to 72h

How can I set this up in Torque/Maui?
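
One common way to express an either/or limit like this is to give each combination its own execution queue with resources_max limits and let Maui schedule them normally. This is a sketch only; the queue names are made up, and users (or a routing queue) would still have to pick the right one:

# Wide but short jobs: at most 4 nodes, at most 24 hours
qmgr -c "create queue wide queue_type=execution"
qmgr -c "set queue wide enabled = true"
qmgr -c "set queue wide started = true"
qmgr -c "set queue wide resources_max.nodect = 4"
qmgr -c "set queue wide resources_max.walltime = 24:00:00"

# Narrow but long jobs: at most 3 nodes, at most 72 hours
qmgr -c "create queue long queue_type=execution"
qmgr -c "set queue long enabled = true"
qmgr -c "set queue long started = true"
qmgr -c "set queue long resources_max.nodect = 3"
qmgr -c "set queue long resources_max.walltime = 72:00:00"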
Mike Dacre | 16 Dec 23:31 2014

Persistent qsub -I sessions

Hi,

I want to make persistent interactive sessions possible for my users. This worked before, but for some reason it no longer does.

What I want to be able to do is:
  1. run qsub -I
  2. Start a tmux session on the new node
  3. Close my terminal window
  4. ssh directly into the node
  5. Attach the tmux session
  6. Keep working
And potentially repeat steps 3-6 ad infinitum for a period of months.

How can I make that work?
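
For reference, the workflow described in steps 1-6 looks roughly like this; the walltime, node name, and tmux session name below are arbitrary:

# 1-2. Start an interactive job, then a tmux session on the allocated node
qsub -I -l nodes=1:ppn=1,walltime=720:00:00
tmux new -s work

# 3-5. After closing the local terminal, reconnect to the same node later
ssh node02                # whichever node the interactive job landed on
tmux attach -t work

# 6. Keep working; detach again with Ctrl-b d and repeat as needed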

Thanks,

Mike
Dimitrakakis Georgios | 16 Dec 12:01 2014

GPU ID

Hi all!

On a cluster where there are multiple GPUs on each node, how can I get the
gpu_id that has been assigned by Torque as a variable, so that I can use it?

What I mean is that my script has a line where it requests the GPUs

#PBS -l nodes=1:ppn=2:gpus=2

but then the executable needs to know the IDs of the GPU devices, e.g.
	"devices" : "0,1"

in order to run it there.

Can I somehow get the GPU IDs as a variable, so that they can be used
dynamically instead of hardwiring the 0,1 in the device field?
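
Torque records the GPUs assigned to a job in the file named by $PBS_GPUFILE, one line per GPU, typically of the form <hostname>-gpu<N>. A sketch of turning that into a comma-separated device list inside the job script; the exact line format should be checked against the local Torque version:

# Build a "0,1"-style list from the GPU file Torque provides
devices=$(sed 's/.*-gpu//' "$PBS_GPUFILE" | sort -n | paste -sd, -)
echo "Assigned GPU IDs: $devices"

# Hand it to the executable, or restrict CUDA to those devices
export CUDA_VISIBLE_DEVICES=$devices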

Regards,

George
Eva Hocks | 15 Dec 20:58 2014

maui 3.3.1 multi-thread


I am running Maui 3.3.1 and often run into problems where client
commands are unresponsive because Maui runs single-threaded and is
scheduling a burst of jobs.

While reading the documentation at
http://docs.adaptivecomputing.com/maui/13.1rmoverview.php

-------------
13.1.1.2 Resource Manager Flow

Early versions of Maui (i.e., Maui 3.0.x) interacted with resource
managers in a very basic manner stepping through a serial sequence of
steps each scheduling iteration. These steps are outlined below:
[snip]
Each step would complete before the next step started. As systems
continued to grow in size and complexity however, it became apparent
that the serial model described above would not work. Three primary
motivations drove the effort to replace the serial model with a
concurrent threaded approach.  ....
-------------

it seems that Maui 3.3.1 should be multi-threaded, but I see only one
thread running for Maui 3.3.1.

How do I enable Maui to run multi-threaded?

Thanks much for any help
Eva
Mike Dacre | 13 Dec 09:33 2014

Interactive Jobs Hanging

Hi All,

Sorry if this is a double post; I have been having trouble with the mailing list, and from the archives it doesn't look like my last post went through.

I am having trouble with Torque interactive jobs that I think has something to do with the way authentication is set up.

If I run `qsub -I` from my head node - the machine running the Torque server and Maui - it works perfectly, and I get the following output:

dacre <at> fruster:/root
>> qsub -I 
qsub: waiting for job 434359.fruster to start
qsub: job 434359.fruster ready

dacre <at> node02:/home/dacre

It also works if I qsub from a node that will run the job - i.e. in the example above, if I ssh into node02 and then run `qsub -I`, it works. However, the same command from any other node or submit host does not work. It hangs at the 'qsub: waiting for job 434359.fruster to start' line and never goes further. Interestingly, if I look at the queue with qstat, I see that the job has immediately gone to completion and exits with exit status -1. Note: when I submit regular batch jobs this is not an issue; they all run completely fine. This makes me think maybe it is an issue with PAM or SSH... except that all of my users can ssh directly into any node, and I see no errors in any of the secure Linux logs.

The qstat -f output for one of these failed jobs is here:
Job Id: 434361.fruster
    Job_Name = STDIN
    Job_Owner = dacre <at> node04
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = C
    queue = interactive
    server = fruster
    Checkpoint = u
    ctime = Sat Dec 13 00:20:55 2014
    Error_Path = /dev/pts/0
    exec_host = node02/0+node02/1+node02/2+node02/3
    exec_port = 15003+15003+15003+15003
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sat Dec 13 00:20:56 2014
    Output_Path = /dev/pts/0
    Priority = 0
    qtime = Sat Dec 13 00:20:55 2014
    Rerunable = False
    Resource_List.mem = 16gb
    Resource_List.ncpus = 4
    Resource_List.neednodes = 1:ppn=4
    Resource_List.nice = -5
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=4
    session_id = 0
    substate = 59
    Variable_List = PBS_O_QUEUE=default,PBS_O_HOME=/home/dacre,
        PBS_O_LOGNAME=dacre,
        PBS_O_PATH=/home/dacre/bin:/home/dacre/usr/bin:/home/dacre/tools/bin:
        /home/dacre/mike_tools/bin:/usr/lib/nx/bin:/usr/local/sbin:/usr/local/
        bin:/usr/bin:/usr/lib/jvm/default/bin:/opt/NCBI/sra_sdk/bin:/usr/bin/s
        ite_perl:/usr/bin/vendor_perl:/usr/bin/core_perl,
        PBS_O_MAIL=/var/spool/mail/dacre,PBS_O_SHELL=/usr/bin/zsh,
        PBS_O_LANG=en_US.UTF-8,PBS_O_WORKDIR=/home/dacre,
        PBS_O_HOST=node04.fruster.local,PBS_O_SERVER=fruster
    euser = dacre
    egroup = dacre
    hashname = 434361.fruster
    queue_rank = 10
    queue_type = E
    etime = Sat Dec 13 00:20:55 2014
    exit_status = -1
    submit_args = -I
    start_time = Sat Dec 13 00:20:56 2014
    start_count = 1
    fault_tolerant = False
    comp_time = Sat Dec 13 00:20:56 2014
    job_radix = 0
    total_runtime = 0.024616
    submit_host = node04.fruster.local


The same output for the job that worked is here:

Job Id: 434360.fruster
    Job_Name = STDIN
    Job_Owner = dacre <at> node02
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:09
    job_state = C
    queue = interactive
    server = fruster
    Checkpoint = u
    ctime = Sat Dec 13 00:20:36 2014
    Error_Path = /dev/pts/1
    exec_host = node02/0+node02/1+node02/2+node02/3
    exec_port = 15003+15003+15003+15003
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sat Dec 13 00:20:46 2014
    Output_Path = /dev/pts/1
    Priority = 0
    qtime = Sat Dec 13 00:20:36 2014
    Rerunable = False
    Resource_List.mem = 16gb
    Resource_List.ncpus = 4
    Resource_List.neednodes = 1:ppn=4
    Resource_List.nice = -5
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=4
    session_id = 1488
    substate = 59
    Variable_List = PBS_O_QUEUE=default,PBS_O_HOME=/home/dacre,

And here is the tracejob output:
/var/spool/torque/mom_logs/20141213: No matching job records located
/var/spool/torque/sched_logs/20141213: No such file or directory

Job: 434361.fruster

12/13/2014 00:20:55  S    enqueuing into default, state 1 hop 1
12/13/2014 00:20:55  S    dequeuing from default, state QUEUED
12/13/2014 00:20:55  S    enqueuing into interactive, state 1 hop 1
12/13/2014 00:20:55  A    queue=default
12/13/2014 00:20:55  A    queue=interactive
12/13/2014 00:20:56  S    Job Run at request of root <at> fruster
12/13/2014 00:20:56  S    Not sending email: User does not want mail of this type.
12/13/2014 00:20:56  S    Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00 Error_Path=/dev/pts/0 Output_Path=/dev/pts/0
12/13/2014 00:20:56  S    on_job_exit valid pjob: 434361.fruster (substate=50)
12/13/2014 00:20:56  A    user=dacre group=dacre jobname=STDIN queue=interactive ctime=1418458855 qtime=1418458855 etime=1418458855 start=1418458856 owner=dacre <at> node04
                          exec_host=node02/0+node02/1+node02/2+node02/3 Resource_List.mem=16gb Resource_List.ncpus=4 Resource_List.neednodes=1:ppn=4 Resource_List.nice=-5 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4
12/13/2014 00:20:56  A    user=dacre group=dacre jobname=STDIN queue=interactive ctime=1418458855 qtime=1418458855 etime=1418458855 start=1418458856 owner=dacre <at> node04
                          exec_host=node02/0+node02/1+node02/2+node02/3 Resource_List.mem=16gb Resource_List.ncpus=4 Resource_List.neednodes=1:ppn=4 Resource_List.nice=-5 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4 session=0 total_execution_slots=4 unique_node_count=1 end=1418458856 Exit_status=-1 resources_used.cput=00:00:00 resources_used.mem=0kb
                          resources_used.vmem=0kb resources_used.walltime=00:00:00 Error_Path=/dev/pts/0 Output_Path=/dev/pts/0

And the tracejob output for the one that did work:
/var/spool/torque/mom_logs/20141213: No matching job records located
/var/spool/torque/sched_logs/20141213: No such file or directory

Job: 434360.fruster

12/13/2014 00:20:36  S    enqueuing into default, state 1 hop 1
12/13/2014 00:20:36  S    dequeuing from default, state QUEUED
12/13/2014 00:20:36  S    enqueuing into interactive, state 1 hop 1
12/13/2014 00:20:36  A    queue=default
12/13/2014 00:20:36  A    queue=interactive
12/13/2014 00:20:37  S    Job Run at request of root <at> fruster
12/13/2014 00:20:37  S    Not sending email: User does not want mail of this type.
12/13/2014 00:20:37  A    user=dacre group=dacre jobname=STDIN queue=interactive ctime=1418458836 qtime=1418458836 etime=1418458836 start=1418458837 owner=dacre <at> node02
                          exec_host=node02/0+node02/1+node02/2+node02/3 Resource_List.mem=16gb Resource_List.ncpus=4 Resource_List.neednodes=1:ppn=4 Resource_List.nice=-5 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4
12/13/2014 00:20:46  S    Exit_status=265 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:09 Error_Path=/dev/pts/1 Output_Path=/dev/pts/1
12/13/2014 00:20:46  S    Not sending email: User does not want mail of this type.
12/13/2014 00:20:46  S    on_job_exit valid pjob: 434360.fruster (substate=50)
12/13/2014 00:20:46  A    user=dacre group=dacre jobname=STDIN queue=interactive ctime=1418458836 qtime=1418458836 etime=1418458836 start=1418458837 owner=dacre <at> node02
                          exec_host=node02/0+node02/1+node02/2+node02/3 Resource_List.mem=16gb Resource_List.ncpus=4 Resource_List.neednodes=1:ppn=4 Resource_List.nice=-5 Resource_List.nodect=1
                          Resource_List.nodes=1:ppn=4 session=1488 total_execution_slots=4 unique_node_count=1 end=1418458846 Exit_status=265 resources_used.cput=00:00:00 resources_used.mem=0kb
                          resources_used.vmem=0kb resources_used.walltime=00:00:09 Error_Path=/dev/pts/1 Output_Path=/dev/pts/1

They seem very similar to me.

I am using torque 4.2.9 with maui 3.3.1 on CentOS 7 with linux kernel 3.10.0-123.13.1.el7.x86_64.

My submit hosts are arch linux with 3.17.6 kernel.

I have attached the output from qmgr.

If anyone has any ideas, I would really appreciate it.

Thanks,

Mike
Attachment (mike_qmgr_output.txt.gz): application/x-gzip, 1580 bytes
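
One difference between the two cases worth checking: for an interactive job, pbs_mom on the execution host has to connect back to the waiting qsub on the submitting machine, and if that connection is blocked the job dies immediately (hence Exit_status=-1) while batch jobs are unaffected. A rough connectivity test between the hosts from the logs above; the port number is arbitrary and nc option syntax varies between netcat versions:

# On the submit host (node04), listen on a scratch port:
nc -l 15555

# From the execution node (node02), try to reach it:
nc -vz node04.fruster.local 15555

# If this fails, look at firewall rules (iptables/firewalld) on the
# submit hosts, since they are the ones refusing the callback.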
Eva Hocks | 12 Dec 00:13 2014

torque 4.2.6: qdel kill_delay not working


I am testing the delay between the sending of the SIGTERM and SIGKILL signals
in Torque versions 4.2.6.h1 and 4.2.9. It seems not to be honored by the MOM.

1) the kill_delay option for the mom does not exist in torque 4.2:

The mom config file includes

$kill_delay true

and pbs_mom logs:

pbs_mom;LOG_ERROR::read_config, special command name kill_delay not found
(ignoring line)

2) I created a C program which ignores SIGTERM via SIG_IGN.
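
(For reference, the same behaviour can be reproduced without C by a job script that ignores SIGTERM with a shell trap; the resource request below is arbitrary.)

#!/bin/bash
#PBS -N sigterm_test
#PBS -l nodes=1:ppn=1,walltime=01:00:00
# Ignore SIGTERM in the shell; children inherit the ignored disposition,
# so only SIGKILL will end the job.
trap '' TERM
sleep 3600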

Now when I kill the job, the MOM sends the SIGKILL exactly 5 seconds later
instead of after 30 seconds:

$ qdel -W 30 2218474
12/11/2014 13:58:00;0008;   pbs_mom.60217;Job;2218474.tscc-mgr.local;kill_task: killing pid 29600
task 1 with sig 15
12/11/2014 13:58:05;0008;   pbs_mom.60217;Job;2218474.tscc-mgr.local;kill_task: killing pid 29775
task 1 with sig 9

3) I then added the server option kill_delay, and that seems to work, but it
applies to all queues. I'd rather be flexible and use the qdel option.

Any idea how to get the qdel -W kill delay working in Torque 4.2, or is
the server config all there is?
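
One possibility, if per-job control via qdel -W turns out not to work in 4.2: kill_delay is also documented as a queue attribute, so different queues can at least get different delays. A sketch; the queue name is made up:

# server-wide default
qmgr -c "set server kill_delay = 30"

# longer grace period for one particular queue
qmgr -c "set queue long kill_delay = 120"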

Thanks
Eva
