Steven Lo | 30 Sep 19:24 2014

Re: platform to run maui and torque


That's what I thought too.

Thanks.

Steven.


On 09/29/2014 10:53 AM, Ken Nielson wrote:
4.2.8 is pretty much a patch release of 4.2.7 ("pretty much" meaning it was intended as a patch release).



On Mon, Sep 29, 2014 at 10:50 AM, Steven Lo <slo <at> cacr.caltech.edu> wrote:

We tried 4.2.8 and Maui 3.3.1 on RHEL 6.5, and Maui crashed when starting
up.  I'm pretty sure
4.2.8 is a patch release of 4.2.7.

I'm wondering if the problem is coming from the way we compile Maui,
but the parameters we are using are pretty straightforward:

./configure --with-pbs=/opt/torque --with-spooldir=/var/spool/maui
--with-key=HOPPER \
  --prefix=/usr/local/maui-3.3.1

I will try 4.2.7 and see what happens then.

Thanks for your response.

Steven.


On 09/29/2014 09:14 AM, Philippe Weill wrote:
>
> On 26/09/2014 19:07, Steven Lo wrote:
>> Hi,
>>
>> Our Torque upgrade to 4.2.8 (from 4.1.5.1) with Maui 3.3.1 was not very
>> successful (Maui crashed during startup).  I was wondering how many people are using 4.2.8
>> or 4.1.7 with Maui 3.3.1
>> and what platform (hardware, OS, OS version, 32 or 64-bit) they run on.
>>
>> We will try 4.1.7 to see if we have better luck.
>
> We're in production with 4.2.7 and Maui 3.3.1
>
> SL6 64-bit (Torque server and Maui on VMware)
>




--


Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606



David Beer | 29 Sep 23:43 2014

CMake vs. Autotools

All,

We are considering changing our build system from autotools to CMake. CMake is simply a replacement, and this wouldn't affect any functionality of the product. For those who download tarballs, the only difference would be running cmake . <options> instead of running ./configure <options>.

Does anyone have strong feelings about this change? The main motivation is that we have more expertise among our developers for CMake than we do for autotools, and other products that we maintain are going to switch to CMake. We would like to use the same build system for these different products so that we can leverage that expertise and best maintain TORQUE builds.
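
For anyone who hasn't built with CMake before, a rough sketch of what would change for a tarball build (the prefix and <other options> below are just placeholders, not the final TORQUE-specific options):

# today, with autotools
./configure --prefix=/usr/local/torque <other options>
make && make install

# with CMake (CMAKE_INSTALL_PREFIX is the standard CMake counterpart of --prefix)
cmake . -DCMAKE_INSTALL_PREFIX=/usr/local/torque <other options>
make && make install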

--
David Beer | Senior Software Engineer
Adaptive Computing
akaoni | 29 Sep 03:18 2014

not running the job

Dear Team,

I submitted a job with the qsub command and checked it with the qstat -a command.
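
For reference, the submission was essentially the following (the sleep is just a stand-in; the actual workload doesn't matter here):

echo "sleep 300" | qsub -l nodes=1,walltime=01:00:00
qstat -a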

But the job still does not run; its status stays at Q and never changes to R.

The status output is below.

>>>>>
qstat -a

hpclinux5:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
0.hpclinux5             testhpc     batch    STDIN               --      1      1    --   01:00:00 Q       --
1.hpclinux5             testhpc     batch    STDIN               --      1      1    --   01:00:00 Q       --
>>>>>

qstat -Q
Queue              Max    Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T   Cpt
----------------   ---   ----    --    --   ---   ---   ---   ---   ---   --- -   ---
batch                0      2   yes   yes     2     0     0     0     0     0 E     0

>>>>>
qstat -q

server: hpclinux5

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch              --      --       --      --    0   2 --   E R
                                               ----- -----
                                                   0     2
>>>>>

 momctl -d3

Host: hpclinux5/hpclinux5   Version: 5.0.0   PID: 15794
Server[0]: hpclinux5 (192.168.7.204:15001)
  Last Msg From Server:   255159 seconds (CLUSTER_ADDRS)
  WARNING:  no messages sent to server
HomeDirectory:          /var/spool/torque/mom_priv
stdout/stderr spool directory: '/var/spool/torque/spool/' (4693088 blocks available)
NOTE:  syslog enabled
MOM active:             255172 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE  (mlock)
TCP Timeout:            60 seconds
Prolog:                 /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:  127.0.0.1:0,192.168.7.202:15003,192.168.7.203:15003,192.168.7.204:0,192.168.7.204:15003:  0
Copy Command:           /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete

>>>>>

Could you tell me how to get the job to run?

Thank you,
Alex

Jennifer Schmitt | 26 Sep 21:46 2014

Numa Nodes not accepting jobs

I installed the newest version of Torque with NUMA nodes enabled.
It is a 1-node, 32-processor system running CentOS 7.
Here is the pbsnodes -a output:

eva.rd.unr.edu-0

     state = free

     power_state = Running

     np = 8

     ntype = cluster

     status = rectime=1411760176,macaddr=d8:50:e6:05:2f:50,cpuclock=OnDemand:1400MHz,varattr=,jobs=,state=free,netload=,gres=,loadave=0.00,ncpus=8,physmem=16776308kb,availmem=15966148kb,totmem=16776308kb,idletime=916,nusers=0,nsessions=0,uname=Linux eva.rd.unr.edu 3.10.0-123.6.3.el7.x86_64 #1 SMP Wed Aug 6 21:12:36 UTC 2014 x86_64,opsys=linux

     mom_service_port = 15002

     mom_manager_port = 15003


eva.rd.unr.edu-1

     state = free

     power_state = Running

     np = 8

     ntype = cluster

     status = rectime=1411760176,macaddr=d8:50:e6:05:2f:50,cpuclock=OnDemand:1400MHz,varattr=,jobs=,state=free,netload=,gres=,loadave=0.00,ncpus=8,physmem=16777216kb,availmem=16333076kb,totmem=16777216kb,idletime=916,nusers=0,nsessions=0,uname=Linux eva.rd.unr.edu 3.10.0-123.6.3.el7.x86_64 #1 SMP Wed Aug 6 21:12:36 UTC 2014 x86_64,opsys=linux

     mom_service_port = 15002

     mom_manager_port = 15003


eva.rd.unr.edu-2

     state = free

     power_state = Running

     np = 8

     ntype = cluster

     status = rectime=1411760176,macaddr=d8:50:e6:05:2f:50,cpuclock=OnDemand:1400MHz,varattr=,jobs=,state=free,netload=,gres=,loadave=0.00,ncpus=8,physmem=16777216kb,availmem=16317968kb,totmem=16777216kb,idletime=916,nusers=0,nsessions=0,uname=Linux eva.rd.unr.edu 3.10.0-123.6.3.el7.x86_64 #1 SMP Wed Aug 6 21:12:36 UTC 2014 x86_64,opsys=linux

     mom_service_port = 15002

     mom_manager_port = 15003


eva.rd.unr.edu-3

     state = free

     power_state = Running

     np = 8

     ntype = cluster

     status = rectime=1411760176,macaddr=d8:50:e6:05:2f:50,cpuclock=OnDemand:1400MHz,varattr=,jobs=,state=free,netload=,gres=,loadave=0.00,ncpus=8,physmem=16760832kb,availmem=16293456kb,totmem=16760832kb,idletime=916,nusers=0,nsessions=0,uname=Linux eva.rd.unr.edu 3.10.0-123.6.3.el7.x86_64 #1 SMP Wed Aug 6 21:12:36 UTC 2014 x86_64,opsys=linux

     mom_service_port = 15002

     mom_manager_port = 15003



When I submit jobs, I can only submit to eva.rd.unr.edu-0 and none of the others.
It tells me: not enough of the right type nodes available
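
(For context, a hypothetical submission that targets one of the other boards directly, which is the kind of request that gets rejected, would look like this:)

# hypothetical example -- the actual script name and resource request may differ
qsub -l nodes=eva.rd.unr.edu-1:ppn=8 myjob.sh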

--
Jennifer Schmitt
Ph.D. Candidate, Chemistry
Math Center TA
LSYCC- Event Coordinator
Steven Lo | 26 Sep 19:07 2014

platform to run maui and torque

Hi,

Our Torque upgrade to 4.2.8 (from 4.1.5.1) with Maui 3.3.1 was not very
successful (Maui crashed during startup).  I was wondering how many people are using 4.2.8
or 4.1.7 with Maui 3.3.1
and what platform (hardware, OS, OS version, 32 or 64-bit) they run on.

We will try 4.1.7 to see if we have better luck.

Thanks in advance for your input.

Steven.
Pablo Restrepo Henao | 25 Sep 19:30 2014

Force Machine to use only 7 cores

Hello,

I have a cluster with 536 cores running with Torque 4.2.6.

A user is trying to launch an array job, which has:

#PBS -l nodes=1:ppn=1
#PBS -t 1-400

All our machines have 8 cores, so on every machine he gets, Torque
launches 8 jobs.

The problem is that although each job uses 1 core, when 8 jobs are
launched the machine load goes up (over 10), so the jobs take longer
to complete because the machine is overloaded.

Is there a way to tell Torque to launch a maximum of 7 jobs per machine for
this particular array, without affecting other users?

Thanks in advance for your help,
Best Regards,

Pablo Restrepo
Noam Bernstein | 24 Sep 15:34 2014

Maui backfill question

Hi torque users - I sent this to the maui users mailing list originally, but that mailing list seems to be
dead, and the archives stop in July 2014.  So I’m hoping that since torque and maui are used together
often, someone here might be able to help.

We’ve been using torque+maui (3.3.1) for a couple of years very happily, but we just had a job fail to run
when expected, and I’m trying to understand why.  

We have the following two relevant lines in our maui.cfg:

  BACKFILLPOLICY        FIRSTFIT
  RESERVATIONPOLICY     CURRENTHIGHEST

and this is the relevant portion of the queuing system output (I apologize for the slightly non-standard format):

Job ID        name  owner  state  host         np  wallt_used  wallt_req
43571.tin     AAA   U1     R      compute-2-3  32  01:14:09    02:00:00
43575.tin     BBB   U2     R      compute-2-4  16  00:09:39    64:00:00
43576.tin     CCC   U2     Q      n2013f       32  -           20:00:00
43578.tin     DDD   U1     Q      n2013f       16  -           02:00:00

and here is the showres output (from a minute or two later):

 ReservationID       Type S       Start         End    Duration    N/P    StartTime
 43576                Job I    00:44:37    20:44:37    20:00:00    2/32   Thu Sep 11 11:48:33

There are 4 nodes with 16 cores each (n2013f is a property of nodes compute-2-*).  43571 is using 32 cores, and
will finish within 46 minutes. 43575 is using 16, and will finish within 64 hours.  3 nodes are full, 1 node is free.

The question is why 43578 isn’t being allowed to run.

Job 43576 is causing a reservation, expecting to start when 43571 finishes.  I’m assuming that this 2-node
reservation is why 43578 isn’t allowed to run on the remaining free node.  However, whether or not 43578
runs now, two nodes (what 43576 requires) will become free once 43571 finishes in 45 minutes, so it
shouldn’t delay anything.

Does anyone have an idea about why 43578 doesn’t start?  I would have thought that the backfill algorithm
would be clever enough to figure this out.

1. is this expected?
2. is there something I could change in the maui.cfg to make it possible for such a job to run under these circumstances?

											thanks,
											Noam
Attachment (smime.p7s): application/pkcs7-signature, 8126 bytes
Erica Riello | 23 Sep 23:30 2014

jobs submitted but not executed

Hi all,

I have some jobs that are submitted to Torque, but they finish as soon
as they arrive, even when the job is supposed to sleep for 5 minutes, for
instance.
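
A minimal example of the kind of job I mean (the script name and directives are just illustrative):

#!/bin/bash
#PBS -N sleeptest
#PBS -l nodes=1:ppn=1,walltime=00:10:00
sleep 300

Submitted with "qsub sleeptest.sh", it returns a job ID and then completes immediately instead of sleeping for 5 minutes.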

Does anyone know what kind of misconfiguration can lead to such a
situation?

Thanks in advance.

Regards,

--
Erica Riello
nedim11 | 21 Sep 11:49 2014

Fwd: Problem with Maui

Hi everyone

I have one problem:

- Everything is working on the cluster, but when I execute a job there are no results, and the nodes are unavailable instead of busy (in the idle state it says they are blocked by a reservation). In the attachments you can find all the files and a detailed view of the problem.

If you need anything else please ask ASAP.

Thanks in advance.

Attachment (job state listing): application/octet-stream, 4052 bytes
Attachment (maui.cfg): application/octet-stream, 2686 bytes
Attachment (mom config files on compute nodes): application/octet-stream, 303 bytes
Attachment (nodes): application/octet-stream, 6662 bytes
Attachment (torque configuration): application/octet-stream, 1280 bytes
Steven Lo | 22 Sep 22:03 2014

maui won't start after upgrading Torque to 4.2.9 from 4.1.5.1


Hi,

We just upgraded Torque from 4.1.5.1 to 4.2.9.  The Torque server and
trqauthd start without any problem.
The Torque server is able to communicate with all cluster nodes (which are still
running 4.1.5.1) without any
problem.

However, when I start Maui (3.3.1) with "/usr/local/maui-3.3.1/sbin/maui 
-C /var/spool/maui/maui.cfg"
it gives the following error:

[root <at> master maui]# /usr/local/maui-3.3.1/sbin/maui -C 
/var/spool/maui/maui.cfg
*** glibc detected *** /usr/local/maui-3.3.1/sbin/maui: double free or 
corruption (out): 0x00d27030 ***
======= Backtrace: =========
/lib/libc.so.6[0x179a15]
/lib/libc.so.6(cfree+0x59)[0x17da89]
/usr/local/maui-3.3.1/sbin/maui[0x80acd1c]
/usr/local/maui-3.3.1/sbin/maui[0x80acd5e]
/usr/local/maui-3.3.1/sbin/maui[0x804ceba]
/lib/libc.so.6(__libc_start_main+0xdc)[0x125edc]
/usr/local/maui-3.3.1/sbin/maui[0x8049e61]
======= Memory map: ========
00110000-00266000 r-xp 00000000 08:02 258326     /lib/libc-2.5.so
00266000-00268000 r--p 00156000 08:02 258326     /lib/libc-2.5.so
00268000-00269000 rw-p 00158000 08:02 258326     /lib/libc-2.5.so
00269000-0026c000 rw-p 00269000 00:00 0
0026c000-00399000 r-xp 00000000 08:05 2719716 /usr/lib/libxml2.so.2.6.26
00399000-0039e000 rw-p 0012d000 08:05 2719716 /usr/lib/libxml2.so.2.6.26
0039e000-0039f000 rw-p 0039e000 00:00 0
0039f000-003a6000 r-xp 00000000 08:02 258360     /lib/librt-2.5.so
003a6000-003a7000 r--p 00007000 08:02 258360     /lib/librt-2.5.so
003a7000-003a8000 rw-p 00008000 08:02 258360     /lib/librt-2.5.so
003a8000-003b3000 r-xp 00000000 08:02 259522     /lib/libgcc_s-4.1.2-20080825.so.1
003b3000-003b4000 rw-p 0000a000 08:02 259522     /lib/libgcc_s-4.1.2-20080825.so.1
003b4000-003b7000 r-xp 00000000 08:02 258334     /lib/libdl-2.5.so
003b7000-003b8000 r--p 00002000 08:02 258334     /lib/libdl-2.5.so
003b8000-003b9000 rw-p 00003000 08:02 258334     /lib/libdl-2.5.so
003bf000-003da000 r-xp 00000000 08:02 258319     /lib/ld-2.5.so
003da000-003db000 r--p 0001a000 08:02 258319     /lib/ld-2.5.so
003db000-003dc000 rw-p 0001b000 08:02 258319     /lib/ld-2.5.so
003dc000-00506000 r-xp 00000000 08:02 259548 /lib/libcrypto.so.0.9.8e
00506000-0051a000 rw-p 00129000 08:02 259548 /lib/libcrypto.so.0.9.8e
0051a000-0051d000 rw-p 0051a000 00:00 0
0051d000-00561000 r-xp 00000000 08:02 259554 /lib/libssl.so.0.9.8e
00561000-00565000 rw-p 00043000 08:02 259554 /lib/libssl.so.0.9.8e
00565000-00576000 r-xp 00000000 08:02 258356 /lib/libresolv-2.5.so
00576000-00577000 r--p 00010000 08:02 258356 /lib/libresolv-2.5.so
00577000-00578000 rw-p 00011000 08:02 258356 /lib/libresolv-2.5.so
00578000-0057a000 rw-p 00578000 00:00 0
0057a000-00590000 r-xp 00000000 08:02 259552 /lib/libselinux.so.1
00590000-00592000 rw-p 00015000 08:02 259552 /lib/libselinux.so.1
00592000-0059c000 r-xp 00000000 08:02 258352 /lib/libnss_files-2.5.so
0059c000-0059d000 r--p 00009000 08:02 258352 /lib/libnss_files-2.5.so
0059d000-0059e000 rw-p 0000a000 08:02 258352 /lib/libnss_files-2.5.so
005a4000-005b6000 r-xp 00000000 08:02 258379     /lib/libz.so.1.2.3
005b6000-005b7000 rw-p 00011000 08:02 258379     /lib/libz.so.1.2.3
005b7000-0064b000 r-xp 00000000 08:05 2721535 /usr/lib/libkrb5.so.3.3
0064b000-0064e000 rw-p 00093000 08:05 2721535 /usr/lib/libkrb5.so.3.3
00673000-006b2000 r-xp 00000000 08:02 872913     /opt/torque/lib/libtorque.so.2.0.0
006b2000-006b4000 rw-p 0003e000 08:02 872913     /opt/torque/lib/libtorque.so.2.0.0
006b4000-00abf000 rw-p 006b4000 00:00 0
00abf000-00afa000 r-xp 00000000 08:02 259551     /lib/libsepol.so.1
00afa000-00afb000 rw-p 0003b000 08:02 259551     /lib/libsepol.so.1
00afb000-00b05000 rw-p 00afb000 00:00 0
00b4d000-00b74000 r-xp 00000000 08:02 258336     /lib/libm-2.5.so
00b74000-00b75000 r--p 00026000 08:02 258336     /lib/libm-2.5.so
00b75000-00b76000 rw-p 00027000 08:02 258336     /lib/libm-2.5.so
00b76000-00c56000 r-xp 00000000 08:05 2719853 /usr/lib/libstdc++.so.6.0.8
00c56000-00c5a000 r--p 000df000 08:05 2719853 /usr/lib/libstdc++.so.6.0.8
00c5a000-00c5b000 rw-p 000e3000 08:05 2719853 /usr/lib/libstdc++.so.6.0.8
00c5b000-00c61000 rw-p 00c5b000 00:00 0
00d26000-00d28000 r-xp 00000000 08:02 259549 /lib/libkeyutils-1.2.so
00d28000-00d29000 rw-p 00001000 08:02 259549 /lib/libkeyutils-1.2.so
00d5a000-00d62000 r-xp 00000000 08:05 2719643 /usr/lib/libkrb5support.so.0.1
00d62000-00d63000 rw-p 00007000 08:05 2719643 /usr/lib/libkrb5support.so.0.1
00d65000-00d8b000 r-xp 00000000 08:05 2721532 /usr/lib/libk5crypto.so.3.1
00d8b000-00d8c000 rw-p 00025000 08:05 2721532 /usr/lib/libk5crypto.so.3.1
00d8e000-00d90000 r-xp 00000000 08:02 259553 /lib/libcom_err.so.2.1
00d90000-00d91000 rw-p 00001000 08:02 259553 /lib/libcom_err.so.2.1
00d93000-00dc0000 r-xp 00000000 08:05 2721536 /usr/lib/libgssapi_krb5.so.2.2
00dc0000-00dc1000 rw-p 0002d000 08:05 2721536 /usr/lib/libgssapi_krb5.so.2.2
00df4000-00df5000 r-xp 00df4000 00:00 0          [vdso]
00f64000-00f7a000 r-xp 00000000 08:02 258350 /lib/libpthread-2.5.so
00f7a000-00f7b000 r--p 00015000 08:02 258350 /lib/libpthread-2.5.so
00f7b000-00f7c000 rw-p 00016000 08:02 258350 /lib/libpthread-2.5.so
00f7c000-00f7e000 rw-p 00f7c000 00:00 0
08048000-0813b000 r-xp 00000000 08:0b 7136459    /usr/local/maui-3.3.1/sbin/maui
0813b000-0813e000 rw-p 000f3000 08:0b 7136459    /usr/local/maui-3.3.1/sbin/maui
0813e000-09aca000 rw-p 0813e000 00:00 0
0a2c5000-0a311000 rw-p 0a2c5000 00:00 0          [heap]
b7f3f000-b7f45000 rw-p b7f3f000 00:00 0
b7f73000-b7f74000 rw-p b7f73000 00:00 0
bf8bc000-bf908000 rw-p bffb2000 00:00 0          [stack]
Aborted

When I do "/usr/local/maui-3.3.1/sbin/maui" without the configuration 
file specification, it starts OK
without crashing.

I have recompiled Maui in case there is any Torque library linking
issue, but the problem persists.
Could the problem be coming from the new Torque .so library?  When I move
the Torque directory
back to the old version, Maui starts without any problem.

Does Torque 4.2.9 support Maui 3.3.1?  If so, has anyone tried the Maui 3.3.1
/ Torque 4.2.9 combination?
Should I stick with 4.1?

Thanks.

Steven.
PRAVEEN | 21 Sep 01:34 2014

Numa issue with cpu metrics

Hi All,

 

I am trying to build torque-4.2.8 with NUMA and am facing some issues with the ncpus value.

The machine has two sockets with 8 cores each and 2 GPUs.  Our intention is to create 2 NUMA nodes, with np=4 for NUMA node 1 and np=12 for NUMA node 2.

 

The ..server_priv/nodes file looks like the snippet below.

 

gpu2.torque num_node_boards=2 numa_board_str=4,12 gpu
gpu3.torque num_node_boards=2 numa_board_str=4,12 gpu
gpu4.torque num_node_boards=2 numa_board_str=4,12 gpu
gpu5.torque num_node_boards=2 numa_board_str=4,12 gpu
gpu6.torque num_node_boards=2 numa_board_str=4,12 gpu
gpu7.torque num_node_boards=2 numa_board_str=4,12 gpu
gpu8.torque num_node_boards=2 numa_board_str=4,12 gpu
phi1.torque np=16 num_node_boards=2 numa_board_str=4,12 phi
phi2.torque np=16 num_node_boards=2 numa_board_str=4,12 phi
phi3.torque np=16 num_node_boards=2 numa_board_str=4,12 phi

 

But the qmgr result for the node looks like the following:

 

[root <at> nandadevi1 server_priv]# qmgr -c "p n gpu8.torque-0"
#
# Create nodes and set their properties.
#
#
# Create and define node gpu8.torque-0
#
create node gpu8.torque-0
set node gpu8.torque-0 state = free
set node gpu8.torque-0 np = 4
set node gpu8.torque-0 properties = gpu
set node gpu8.torque-0 ntype = cluster
set node gpu8.torque-0 status = rectime=1411255386
set node gpu8.torque-0 status += varattr=
set node gpu8.torque-0 status += jobs=
set node gpu8.torque-0 status += state=free
set node gpu8.torque-0 status += netload=? 0
set node gpu8.torque-0 status += gres=
set node gpu8.torque-0 status += loadave=0.00
set node gpu8.torque-0 status += ncpus=8
set node gpu8.torque-0 status += physmem=33507884kb
set node gpu8.torque-0 status += availmem=32431560kb
set node gpu8.torque-0 status += totmem=33507884kb
set node gpu8.torque-0 status += idletime=7970
set node gpu8.torque-0 status += nusers=0
set node gpu8.torque-0 status += nsessions=0
set node gpu8.torque-0 status += uname=Linux gpu8.torque 2.6.32-431.el
set node gpu8.torque-0 status += opsys=linux
set node gpu8.torque-0 gpus = 2
set node gpu8.torque-0 gpu_status = gpu[1]=gpu_id=0000:42:00.0;gpu_produ

 

Snipped…

 

 

The unexpected behaviour is that Torque reports ncpus=8 for both of the NUMA nodes.

 

Our main intention is to dedicate 4 processors and the 2 GPU cards to one node and add the remaining compute processors as another node. I am not sure if this is a good idea; kindly suggest.
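
For completeness, my understanding is that a NUMA-enabled build also reads a mom_priv/mom.layout file on each MOM host describing the boards; a layout matching the intended 4/12 split would look roughly like the sketch below (syntax from memory and therefore hypothetical -- please check it against the TORQUE NUMA documentation):

mom_priv/mom.layout (hypothetical):
cpus=0-3   mem=0
cpus=4-15  mem=1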

 

 

