PRAVEEN | 18 Sep 20:50 2014

pbs_mom log error (cpuset enabled)

Hi Team,

 

                                Please help me understand what the error log below means and how we can sort it out.

 

09/18/2014 11:45:17;0001;   pbs_mom.30000;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in totmem, get_proc_mem

09/18/2014 11:45:17;0001;   pbs_mom.30000;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in availmem, get_proc_mem

09/18/2014 11:46:02;0001;   pbs_mom.30004;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in totmem, get_proc_mem

09/18/2014 11:46:02;0001;   pbs_mom.30004;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in availmem, get_proc_mem

09/18/2014 11:46:47;0001;   pbs_mom.30008;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in totmem, get_proc_mem

09/18/2014 11:46:47;0001;   pbs_mom.30008;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in availmem, get_proc_mem

09/18/2014 11:47:32;0001;   pbs_mom.30015;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in totmem, get_proc_mem

09/18/2014 11:47:32;0001;   pbs_mom.30015;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in availmem, get_proc_mem

09/18/2014 11:48:17;0002;   pbs_mom.26076;Svr;pbs_mom;Torque Mom Version = 4.2.8, loglevel = 0

09/18/2014 11:48:17;0001;   pbs_mom.30019;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in totmem, get_proc_mem

09/18/2014 11:48:17;0001;   pbs_mom.30019;Svr;pbs_mom;LOG_ERROR::No such file or directory (2) in availmem, get_proc_mem
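
One thing worth checking first (a guess, not a confirmed diagnosis): a pbs_mom built with cpuset support expects the cpuset pseudo-filesystem to be mounted (conventionally at /dev/cpuset) with a torque hierarchy underneath it, and the ENOENT from get_proc_mem may simply mean the files it wants to read are absent. A minimal diagnostic sketch, run on the compute node:

    # hedged sketch -- assumes this mom was built with --enable-cpuset
    mount | grep cpuset                      # is the cpuset filesystem mounted at all?
    ls -ld /dev/cpuset /dev/cpuset/torque    # does the torque cpuset hierarchy exist?
    # if nothing is mounted, the conventional (unverified here) mount would be:
    #   mkdir -p /dev/cpuset && mount -t cpuset none /dev/cpuset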

 

 

 

Praveen

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
PRAVEEN | 18 Sep 20:44 2014

torque build with --with-environ

Hi,

 

Can someone please help me understand why --with-environ= is needed when configuring torque from source?

 

                Please bear with me, I'm a newbie.
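
For what it's worth, a hedged example of how the option is usually passed (the prefix and environment-file path here are illustrative, not a verified recipe): as I understand it, --with-environ points configure at the file of environment variables that the TORQUE daemons read at startup; if it is omitted, the default is the pbs_environment file under the server home.

    # sketch only -- paths are assumptions
    ./configure --prefix=/usr/local \
                --with-environ=/var/spool/torque/pbs_environment
    make && make install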

 

Praveen

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Mahmood Naderan | 18 Sep 17:36 2014

A question about privacy in supercomputing centers (non-technical)

Hello,
This question is not directly related to Torque, so excuse me...
I want to know how obtaining and installing software is handled. Who is responsible for that, the user or the center?
The same question applies to the security of results from jobs. Does the center offer the service "as is" and take no responsibility for data loss, attacks, or theft?
 
Regards,
Mahmood
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Juan Carlos Niebles | 17 Sep 23:24 2014

'ghost' job

There seems to be a 'ghost' job hanging on one of our compute nodes. See below how pbsnodes reports a job on core 1, but no job id is reported (jobs = 1/).

The problem is that only 31 cores appear to be free from the scheduler's point of view.

Any idea on how to find out what is going on here?



# pbsnodes compute-0-3

compute-0-3

     state = free

     np = 32

     properties = vision

     ntype = cluster

     jobs = 1/

     status = rectime=1410556797,varattr=,jobs=,state=free,netload=917207,gres=,loadave=0.08,ncpus=32,physmem=99163696kb,availmem=99403032kb,totmem=100187688kb,idletime=1055,nusers=0,nsessions=0,uname=Linux compute-0-3.local 2.6.32-431.20.3.el6.x86_64 #1 SMP Thu Jun 19 21:14:45 UTC 2014 x86_64,opsys=linux

     mom_service_port = 15002

     mom_manager_port = 15003

     gpus = 0
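
A couple of hedged checks that may help narrow down the empty slot (the hostname is taken from the output above; exact output formats vary by version):

    momctl -d 3 -h compute-0-3     # ask the MOM itself which jobs it thinks it is running
    qstat -rn1 | grep compute-0-3  # cross-check against jobs the server places on that node
    # if the slot is stale and carries no job id, restarting pbs_mom on that node
    # is a common (though heavy-handed) way to clear it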
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
PRAVEEN | 17 Sep 09:45 2014

RE: torqueusers Digest, Vol 122, Issue 7

Hi Ken,

	Sorry for the delayed reply.
	Can you please help me write a wrapper for job submission? I found the
documentation on this inadequate. My requirement is as follows:

	* A 40-node cluster, of which 10 nodes have GPU cards.
	* Two queues are configured, CPUQ and GPUQ.
	* When a user submits a job to CPUQ requesting GPUs in the submission
script, the submission should fail (see the sketch below).

Thanks in advance.
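
A minimal submit-filter sketch for the requirement above (illustrative only; the queue name "cpuq", the resource string "gpus=", and the matching logic are assumptions for this site). TORQUE invokes the filter named by SUBMITFILTER in torque.cfg, feeding the job script on stdin and the qsub options as arguments; whatever the filter prints to stdout is what gets submitted, and a non-zero exit rejects the job:

    #!/bin/bash
    # hedged sketch of a qsub submit filter -- naive string matching, adjust per site
    script=$(cat)                       # the submitted job script arrives on stdin
    all="$* $script"                    # qsub options plus the script body
    case "$all" in
      *"-q cpuq"*|*"#PBS -q cpuq"*)
        case "$all" in
          *gpus=*)
            echo "ERROR: GPU resources cannot be requested in the cpuq queue" >&2
            exit 1 ;;                   # non-zero exit makes qsub refuse the job
        esac ;;
    esac
    printf '%s\n' "$script"             # pass the script through unchanged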

-----Original Message-----
From: torqueusers-bounces <at> supercluster.org
[mailto:torqueusers-bounces <at> supercluster.org] On Behalf Of
torqueusers-request <at> supercluster.org
Sent: 11 September 2014 20:27
To: torqueusers <at> supercluster.org
Subject: torqueusers Digest, Vol 122, Issue 7

Send torqueusers mailing list submissions to
	torqueusers <at> supercluster.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://www.supercluster.org/mailman/listinfo/torqueusers
or, via email, send a message with subject or body 'help' to
	torqueusers-request <at> supercluster.org

You can reach the person managing the list at
	torqueusers-owner <at> supercluster.org

When replying, please edit your Subject line so it is more specific than
"Re: Contents of torqueusers digest..."

Today's Topics:

   1. Queue policy to restrict gpu resource request (PRAVEEN)
   2. Re: pbs_mom not running (Andrus, Brian Contractor)
   3. Re: pbs_mom not running (Gus Correa)
   4. Execution Host and Working Directory (David Beer)
   5. Re: Execution Host and Working Directory (Stephen Cousins)
   6. Re: Queue policy to restrict gpu resource request (Ken Nielson)

----------------------------------------------------------------------

Message: 1
Date: Wed, 10 Sep 2014 00:04:27 +0530
From: PRAVEEN <praveenkumar.s <at> locuz.com>
Subject: [torqueusers] Queue policy to restrict gpu resource request
To: <torqueusers <at> supercluster.org>
Message-ID: <000001cfcc5c$b33cfd90$19b6f8b0$ <at> locuz.com>
Content-Type: text/plain; charset="us-ascii"

Hi,
		I just want to know how we can restrict requesting GPU
resources from a node while allowing only the CPU resources.

	Say I have two queues, gpu_queue and cpu_queue, where a user should not
be able to select GPUs while submitting a job to cpu_queue.

Thanks in advance,
Praveen

------------------------------

Message: 2
Date: Wed, 10 Sep 2014 15:46:19 +0000
From: "Andrus, Brian Contractor" <bdandrus <at> nps.edu>
Subject: Re: [torqueusers] pbs_mom not running
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
	<ADC981242279AD408816CB7141A2789D898770E4 <at> GROWLER.ern.nps.edu>
Content-Type: text/plain; charset="us-ascii"

It is happening now on a couple of nodes.
That file does not exist.

Here are some commands showing the issue:
----------------------------------------------------------------
[root <at> compute-8-17 ~]# ps -ef |grep mom
root     22647 22610  0 08:36 pts/0    00:00:00 grep mom
[root <at> compute-8-17 ~]# service pbs_mom status
pbs_mom (pid  3402) is running...
[root <at> compute-8-17 ~]# ls -l /var/lock/subsys/pbs*
ls: cannot access /var/lock/subsys/pbs*: No such file or directory
[root <at> compute-8-17 ~]# service pbs_mom stop
Shutting down TORQUE Mom: cannot connect to MOM on node 'localhost', errno=111 (Connection refused)
cannot connect to MOM on node 'localhost', errno=111 (Connection refused)
cannot connect to MOM on node 'localhost', errno=111 (Connection refused)
cannot connect to MOM on node 'localhost', errno=111 (Connection refused)
^C
[root <at> compute-8-17 ~]# service pbs_mom start
Starting TORQUE Mom: pbs_mom already running               [  OK  ]
[root <at> compute-8-17 ~]# ps -ef |grep mom
root     22686 22610  0 08:37 pts/0    00:00:00 grep mom
[root <at> compute-8-17 ~]# service pbs_mom purge
Starting TORQUE Mom with purge: Error getting SCIF driver version
                                                           [  OK  ]
[root <at> compute-8-17 ~]# ps -ef |grep mom
root     22697     1  2 08:37 ?        00:00:00 /usr/sbin/pbs_mom -r
root     22703 22610  0 08:37 pts/0    00:00:00 grep mom
[root <at> compute-8-17 ~]# service pbs_mom restart
Shutting down TORQUE Mom: shutdown request successful on localhost
                                                           [  OK  ]
Starting TORQUE Mom: Error getting SCIF driver version
                                                           [  OK  ]
[root <at> compute-8-17 ~]# ps -ef |grep mom
root     22728     1  1 08:37 ?        00:00:00 /usr/sbin/pbs_mom -p -d /var/spool/torque
root     22734 22610  0 08:37 pts/0    00:00:00 grep mom
------------------------------------------------------------

The SCIF warning is fine. We have MIC cards in one node, so torque was
compiled to be aware of them if they are present.

It seems to me that pbs_mom is not really doing a proper check to see if it is
already running. In the init script, all it does is check for a lock file:
                status -p $PBS_HOME/mom_priv/mom.lock pbs_mom 2>&1 > /dev/null
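
A hedged sketch of a stricter liveness test than the lock-file check: verify that the PID recorded in mom.lock actually belongs to a running pbs_mom (the PBS_HOME default below is an assumption):

    PBS_HOME=${PBS_HOME:-/var/spool/torque}
    lock=$PBS_HOME/mom_priv/mom.lock
    if [ -s "$lock" ] && pid=$(cat "$lock") && \
       [ "$(ps -p "$pid" -o comm= 2>/dev/null)" = "pbs_mom" ]; then
        echo "pbs_mom (pid $pid) is running"
    else
        echo "pbs_mom is not running (mom.lock missing or stale)"
    fi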

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238

> -----Original Message-----
> From: torqueusers-bounces <at> supercluster.org [mailto:torqueusers- 
> bounces <at> supercluster.org] On Behalf Of Gus Correa
> Sent: Tuesday, September 09, 2014 9:56 AM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] pbs_mom not running
> 
> Hi Brian
> 
> Have you checked if /var/lock/subsys/pbs_mom still exists?
> Maybe a leftover from an ungraceful death?
> Just a guess.
> 
> Gus Correa
> 
> 
> On 09/09/2014 12:27 PM, Andrus, Brian Contractor wrote:
> > All,
> >
> > Lately I have seen instances where I see a node showing down when I 
> > do
> 'pbsnodes -l'
> > I log on to the node and check mom with 'service pbs_mom status' and 
> > it
> says it is running.
> However, if I do 'ps -ef |grep mom' there is no process running.
> > If I try to do 'service pbs_mom start' or 'service pbs_mom restart'
> it fails claiming it is already running.
> > The way I have been able to remedy this is to do 'service pbs_mom purge'
> and then I can start it.
> >
> > This seems a little odd. Has anyone else seen any similar symptoms?
> > We are running torque 4.2.8 here.
> >
> >
> > Brian Andrus
> > ITACS/Research Computing
> > Naval Postgraduate School
> > Monterey, California
> > voice: 831-656-6238
> >
> >
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers <at> supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers <at> supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

------------------------------

Message: 3
Date: Wed, 10 Sep 2014 12:25:34 -0400
From: Gus Correa <gus <at> ldeo.columbia.edu>
Subject: Re: [torqueusers] pbs_mom not running
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID: <54107B7E.7000303 <at> ldeo.columbia.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 09/10/2014 11:46 AM, Andrus, Brian Contractor wrote:
> It is happening now on a couple nodes..
> That file does not exist.
>
> Here are some commands showing the issue:
> ----------------------------------------------------------------
> [root <at> compute-8-17 ~]# ps -ef |grep mom
> root     22647 22610  0 08:36 pts/0    00:00:00 grep mom
> [root <at> compute-8-17 ~]# service pbs_mom status
> pbs_mom (pid  3402) is running...
> [root <at> compute-8-17 ~]# ls -l /var/lock/subsys/pbs*
> ls: cannot access /var/lock/subsys/pbs*: No such file or directory
> [root <at> compute-8-17 ~]# service pbs_mom stop
> Shutting down TORQUE Mom: cannot connect to MOM on node 'localhost', errno=111 (Connection refused)
> cannot connect to MOM on node 'localhost', errno=111 (Connection refused)
> cannot connect to MOM on node 'localhost', errno=111 (Connection refused)
> cannot connect to MOM on node 'localhost', errno=111 (Connection refused)
> ^C
> [root <at> compute-8-17 ~]# service pbs_mom start
> Starting TORQUE Mom: pbs_mom already running               [  OK  ]
> [root <at> compute-8-17 ~]# ps -ef |grep mom
> root     22686 22610  0 08:37 pts/0    00:00:00 grep mom
> [root <at> compute-8-17 ~]# service pbs_mom purge
> Starting TORQUE Mom with purge: Error getting SCIF driver version
>                                                             [  OK  ]
> [root <at> compute-8-17 ~]# ps -ef |grep mom
> root     22697     1  2 08:37 ?        00:00:00 /usr/sbin/pbs_mom -r
> root     22703 22610  0 08:37 pts/0    00:00:00 grep mom
> [root <at> compute-8-17 ~]# service pbs_mom restart
> Shutting down TORQUE Mom: shutdown request successful on localhost
>                                                             [  OK  ]
> Starting TORQUE Mom: Error getting SCIF driver version
>                                                             [  OK  ]
> [root <at> compute-8-17 ~]# ps -ef |grep mom
> root     22728     1  1 08:37 ?        00:00:00 /usr/sbin/pbs_mom -p -d /var/spool/torque
> root     22734 22610  0 08:37 pts/0    00:00:00 grep mom
> ------------------------------------------------------------
>
>
> The SCIF warning is fine. We have MIC cards in one node, so torque was
compiled to be aware if they are there.
>
> Seems to me, pbs_mom is not really doing a proper check to see if it is
already running.
> In the script all it does is check if there is a lock file.
>                  status -p $PBS_HOME/mom_priv/mom.lock pbs_mom 2>&1 > 
> /dev/null
>
>
Hi Brian

That is why I suggested checking the mom.lock file existence.

However, it is really weird that the pbs_mom is not listed by 'ps -ef', and
still refuses to start/restart as a service, saying that it is already
running.

Would there be other ancillary processes besides pbs_mom launched along with
it?
When a job starts pbs_mom seems to spawn additional pbs_moms, but these are
short lived, and have the same name, so your ps|grep would catch them, I
guess.
Maybe the sockets to talk to the server are still hanging around?
(lsof -i :15002-15003)
Could it be that the MIC cards run a separate/additional mom-like daemon,
perhaps named differently?
[I don't know. I only have standard CPU-only nodes, no MIC, no GPUs.]

Anyway, just guesses.

Gus Correa
>
>> -----Original Message-----
>> From: torqueusers-bounces <at> supercluster.org [mailto:torqueusers- 
>> bounces <at> supercluster.org] On Behalf Of Gus Correa
>> Sent: Tuesday, September 09, 2014 9:56 AM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] pbs_mom not running
>>
>> Hi Brian
>>
>> Have you checked if /var/lock/subsys/pbs_mom still exists?
>> Maybe a leftover from an ungraceful death?
>> Just a guess.
>>
>> Gus Correa
>>
>>
>> On 09/09/2014 12:27 PM, Andrus, Brian Contractor wrote:
>>> All,
>>>
>>> Lately I have seen instances where I see a node showing down when I 
>>> do
>> 'pbsnodes -l'
>>> I log on to the node and check mom with 'service pbs_mom status' and 
>>> it
>> says it is running.
>> However, if I do 'ps -ef |grep mom' there is no process running.
>>> If I try to do 'service pbs_mom start' or 'service pbs_mom restart'
>> it fails claiming it is already running.
>>> The way I have been able to remedy this is to do 'service pbs_mom purge'
>> and then I can start it.
>>>
>>> This seems a little odd. Has anyone else seen any similar symptoms?
>>> We are running torque 4.2.8 here.
>>>
>>>
>>> Brian Andrus
>>> ITACS/Research Computing
>>> Naval Postgraduate School
>>> Monterey, California
>>> voice: 831-656-6238

------------------------------

Message: 4
Date: Wed, 10 Sep 2014 15:30:03 -0600
From: David Beer <dbeer <at> adaptivecomputing.com>
Subject: [torqueusers] Execution Host and Working Directory
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
	<CAFUQeZ2sdhnfF2cT6y8jKZ5STWXcFqUX1JB+HChcQ5s9k4w_sQ <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

All,

Today I have a customer asking me if there's a way to tell TORQUE to make
the working directory a path that includes the execution host of the job
(the mother superior). To be specific, if the job is submitted from compute01
and runs on compute03, is there a way to make the working directory
/some/path/compute03/? I know that $PBS_O_HOST would get you compute01;
I'm just wondering if there's a corresponding way to do it for the mother
superior host.

--
David Beer | Senior Software Engineer
Adaptive Computing

------------------------------

Message: 5
Date: Wed, 10 Sep 2014 18:01:51 -0400
From: Stephen Cousins <steve.cousins <at> maine.edu>
Subject: Re: [torqueusers] Execution Host and Working Directory
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
	<CAMFqqRomtMATr-Hr_3CNBrSQ9k2iiJbrVjkvzB-4j+9QZ+5uow <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Linux couldn't you use "hostname -s" in the script to get that host's
name? So,

MY_WORK_DIR=${PBS_O_WORKDIR}/`hostname -s`

or something like that.
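
Expanded into a small hedged job-script fragment (the base path is illustrative; substitute whatever shared directory is wanted):

    # sketch only -- /some/path stands for the desired base directory
    EXEC_HOST=$(hostname -s)                # short hostname of the mother superior
    MY_WORK_DIR=/some/path/${EXEC_HOST}
    mkdir -p "$MY_WORK_DIR"
    cd "$MY_WORK_DIR"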

Steve

On Wed, Sep 10, 2014 at 5:30 PM, David Beer <dbeer <at> adaptivecomputing.com>
wrote:

> All,
>
> Today I have a customer asking me if there's a way to tell TORQUE to 
> make the working directory a path that includes the execution host of 
> the job (mother superior). To be specific, if the job is submitting 
> from compute01 and runs on compute03, is there a way to make the 
> working directory /some/path/compute03/? I know that with $PBS_O_HOST 
> would get you compute01, I'm just wondering if there's a corresponding 
> way to do it for the mother superior host.
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
> _______________________________________________
> torqueusers mailing list
> torqueusers <at> supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>

--
________________________________________________________________
 Steve Cousins             Supercomputer Engineer/Administrator
 Advanced Computing Group            University of Maine System
 244 Neville Hall (UMS Data Center)              (207) 561-3574
 Orono ME 04469                      steve.cousins at maine.edu

------------------------------

Message: 6
Date: Thu, 11 Sep 2014 08:51:43 -0600
From: Ken Nielson <knielson <at> adaptivecomputing.com>
Subject: Re: [torqueusers] Queue policy to restrict gpu resource
	request
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
	<CADvLK3cULpxtT1sgKa9xUHRWxOdS+hvypMUPJ_pfpDPdYDBOVA <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

You will probably need to implement that as part of a qsub submit filter. TORQUE will
not restrict jobs from a queue based on a GPU request.

Ken

On Tue, Sep 9, 2014 at 12:34 PM, PRAVEEN <praveenkumar.s <at> locuz.com> wrote:

> Hi ,
>                 I just want to know how can we restrict requesting gpu 
> resources from a node, but allowing only the cpu resource.
>
>         Say, I have two queue gpu_queue and cpu_queue, where user 
> should not be able to select gpus while submitting job to cpu_queue.
>
> Thanks in advance,
> Praveen
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers <at> supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

-- 
Ken Nielson Sr. Software Engineer
 +1 801.717.3700 office    +1 801.717.3738 fax
 1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
 www.adaptivecomputing.com

------------------------------

_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers

End of torqueusers Digest, Vol 122, Issue 7
*******************************************
Peter Yen | 16 Sep 14:58 2014

How to distribute jobs so that nodes with fewer cores get fewer and nodes with more cores get more?

Dear Users,
In my PBS script I set

#PBS -l nodes=crosby:ppn=4+molkin:ppn=4+node05:ppn=12


crosby and molkin are nodes with 8 cores each and node05 has 24 cores.
I therefore want my jobs to be distributed 20% each to molkin and crosby and 60% to node05. However, it turns out that all three nodes always receive the same share of the work. The result is that node05 finishes earlier and sits idle waiting for the other two nodes to finish. How can I correct this so that all the resources are used simultaneously and the run finishes earlier? Thank you very much!
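
One hedged way to see how the slots were actually allocated is to count the entries in $PBS_NODEFILE from inside the job; an MPI launcher that honours the TORQUE node file should start one process per listed slot:

    # sketch only, run inside the job script
    sort $PBS_NODEFILE | uniq -c
    # expected for the request above (an assumption, not verified):
    #     4 crosby
    #     4 molkin
    #    12 node05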



With best regards,
Peter

--
Research Assistant, Physics Department, NCU, Taiwan
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Nick Lindberg | 11 Sep 18:11 2014

torque automatically setting naccesspolicy to SINGLEJOB for job array?

Hello,

I’m submitting a job array of 50 jobs, using qsub as I always do, for one of our users.  The script is below… it requests 1 node with 8 ppn for each.  

What is happening, however, is that somehow the "naccesspolicy" is getting set to SINGLEJOB (which I can tell in Moab because each 8-core job is occupying an entire 16-core node, and 'checkjob' indicates that the policy is being set to SINGLEJOB, which then forces the node it is running on into SINGLEJOB mode).

Has anybody ever run into this? I have no idea where this could be coming from; I've never seen this behavior before. I've got jobs from other users that are running just fine, without this happening.

Could it be an mpiexec thing? I would find that hard to believe, but at this point it's the only thing I can think of; there is nothing in my torque config that sets this as far as I can tell.

#PBS -l nodes=1:ppn=8 
#PBS -j oe 
#PBS -m abe 
#PBS -M nlindberg <at> mkei.org 
#PBS -t 1-50 

module add openmpi/gcc/64/1.6.5-mlnx-ofed 

cd $PBS_O_WORKDIR 
cd run_${PBS_ARRAYID} 

CONVERGE=/gpfs/apps/converge/2.2.0/l_x86_64/bin/converge-udf-2.2.0-openmpi-linux-64-082114 

date 
mpiexec -n 8 $CONVERGE 
date 
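
One thing worth ruling out first (a guess, not a confirmed cause): a server- or queue-level resource default, or a site submit filter, quietly injecting the policy. A hedged way to check (the queue name "batch" is an assumption; substitute the real one):

    qmgr -c 'print server' | grep -i naccesspolicy
    qmgr -c 'print queue batch' | grep -i naccesspolicy
    grep -i SUBMITFILTER /var/spool/torque/torque.cfg 2>/dev/null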

Any help would be greatly appreciated— I’m stumped.

Thanks--

Nick Lindberg
Director of Engineering
Milwaukee Institute 
414-727-6413 (office)
608-215-3508 (mobile)






_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
David Beer | 10 Sep 23:30 2014

Execution Host and Working Directory

All,

Today I have a customer asking me if there's a way to tell TORQUE to make the working directory a path that includes the execution host of the job (the mother superior). To be specific, if the job is submitted from compute01 and runs on compute03, is there a way to make the working directory /some/path/compute03/? I know that $PBS_O_HOST would get you compute01; I'm just wondering if there's a corresponding way to do it for the mother superior host.

--
David Beer | Senior Software Engineer
Adaptive Computing
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
PRAVEEN | 9 Sep 20:34 2014

Queue policy to restrict gpu resource request

Hi,
		I just want to know how we can restrict requesting GPU
resources from a node while allowing only the CPU resources.

	Say I have two queues, gpu_queue and cpu_queue, where a user should not
be able to select GPUs while submitting a job to cpu_queue.

Thanks in advance,
Praveen
Andrus, Brian Contractor | 9 Sep 18:27 2014

pbs_mom not running

All,

Lately I have seen instances where I see a node showing down when I do 'pbsnodes -l'
I log on to the node and check mom with 'service pbs_mom status' and it says it is running.
However, if I do 'ps -ef |grep mom' there is no process running.
If I try to do 'service pbs_mom start' or 'service pbs_mom restart' it fails claiming it is already running.
The way I have been able to remedy this is to do 'service pbs_mom purge' and then I can start it.

This seems a little odd. Has anyone else seen any similar symptoms?
We are running torque 4.2.8 here.

Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238
Grigory Shamov | 8 Sep 20:36 2014

Where Torque standard output/error files should go?

Hi All,

I am a bit confused about how Torque (at least 2.5) handles jobs' STDERR and STDOUT files.
I thought they should be in the nodes' /var/spool/torque/spool as $JOBID.ER and $JOBID.OU
files; then they are copied to $PBS_O_WORKDIR or to the #PBS -o/-e locations directly
(using scp, or a simple cp to a shared filesystem if $usecp is set). If the copy fails,
the files stay in ./undelivered on the master node. Is that correct?

Now we had an event on our cluster (running 2.5.13) where the server's /var/spool/torque/spool
was overflowing with STDERR and STDOUT files from several of the jobs. There were several
files in this directory, but not one for every job that was running. There is no MOM running
on the Torque server machine. So my question is: under what conditions would a job have its
STDERR and STDOUT forwarded to the server? Is there any config option controlling this?
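
For reference, a hedged sketch of the MOM-side $usecp directive that switches delivery from scp to a plain cp onto a shared filesystem (the host pattern and paths below are illustrative only):

    # $TORQUE_HOME/mom_priv/config on each compute node -- sketch only
    $usecp *:/home     /home
    $usecp *:/scratch  /scratch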

Thank you very much in advance,

--
Grigory Shamov

HPC Analyst, Tech. Site Lead,
Westgrid/Compute Canada
E2-588 EITC Building,
University of Manitoba
(204) 474-9625
