CAO THANG | 24 Jul 09:00 2014

pbsnodes shows "state = free", even though there is a job running on it

Hello everyone,

I set up a two-node cluster, and TORQUE runs well.

When I submit a job (with qsub) and run it (with qrun), I check the status of
the nodes with the pbsnodes command. pbsnodes shows "state = free" for both
nodes, even though there is a job running on them.
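For reference, the check is roughly the following (the job script name and job id are placeholders):

  qsub myjob.sh     # returns a job id, e.g. 1.headnode
  qrun 1.headnode   # start the job by hand
  pbsnodes -a       # both nodes still report: state = free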

Could you explain this? How can I see the status of the nodes as "busy"?

Thank you very much

--

-- 
CAO THANG
Mikael Tillströmning | 22 Jul 07:06 2014

clöster riddaren


Sent from my iPad
Mikael Tillströmning | 22 Jul 07:06 2014

(no subject)


Sent from my iPad
Mikael Tillströmning | 22 Jul 07:31 2014

general TrideT G TridenT G superclöster


Sent from my iPad
Mikael Tillströmning | 22 Jul 07:56 2014

Re: torqueusers Digest, Vol 120, Issue 16




Sent from my iPad

On 21 Jul 2014, at 19:05, torqueusers-request <at> supercluster.org wrote:

Send torqueusers mailing list submissions to
   torqueusers <at> supercluster.org

To subscribe or unsubscribe via the World Wide Web, visit
   http://www.supercluster.org/mailman/listinfo/torqueusers
or, via email, send a message with subject or body 'help' to
   torqueusers-request <at> supercluster.org

You can reach the person managing the list at
   torqueusers-owner <at> supercluster.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of torqueusers digest..."


Today's Topics:

  1. Re: error in pbsacct 1.4.9 (Mark Moore)
  2. Re: Statistics and Cluster utilization reporting with Torque
     (Mark Moore)
  3. Re: Statistics and Cluster utilization reporting with Torque
     (Mark Moore)
  4. Re: Torque Scheduling & Conditions (David Beer)


----------------------------------------------------------------------

Message: 1
Date: Mon, 21 Jul 2014 10:47:32 -0600
From: Mark Moore <mmoore <at> ucar.edu>
Subject: Re: [torqueusers] error in pbsacct 1.4.9
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
   <CA+i9Z022bHy-Ow0XtkiyDAkni=Ouc1a5TnFzuce_bkAp8RT1nw <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Fri, Jul 18, 2014 at 9:06 AM, Amjad Syed <amjadcsu <at> gmail.com> wrote:

Hello,
I am not sure if this is the right forum, but I just want to check if anyone
has the same issue.

I am using pbsacct 1.4.9  with torque 2.5.4 and maui.

When I execute the following command from the reporting directory, I get the
following error.



pbsacct 20140617

Portable Batch System accounting statistics
-------------------------------------------

Processing a total of 1 accounting files... /opt/maui/bin/pbsjobs: line
72: -F;: command not found
done.


I recall having to add the line

AWK=/usr/bin/gawk

to the top of the script.
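
A minimal sketch of what the top of /opt/maui/bin/pbsjobs looks like after that change (the surrounding lines are illustrative, not the actual script):

  #!/bin/sh
  # Point the script at a real awk binary. If AWK is left unset, $AWK expands
  # to nothing and the shell tries to execute the "-F;" argument as a command,
  # which matches the "line 72: -F;: command not found" error above.
  AWK=/usr/bin/gawk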

Mark
--0-



/opt/maui/bin/pbsacct ERROR: There are no accounting records in the input
files:
20140617


This is sample of what  i have in 20140617

08:47:51;E;1056.krplpadul001;user=xxxxxx group=xxxxx
jobname=WekaScriptArabicTextMiningClassificatin-LongRun-V6-con6-HPC-v12.sh
queue=vlong ctime=1401266642 qtime=1401266642 etime=1401266642
start=1401266644 owner=xxxxx <at> krplpadul001 exec_host=krplpslcn001/5
Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1
Resource_List.walltime=480:00:00 session=11415 end=1402984071 Exit_status=0
resources_used.cput=481:44:42 resources_used.mem=1723856kb
resources_used.vmem=8873540kb resources_used.walltime=477:03:46



I must be missing something here.

Thanks
Amjad





--
   Mark
   --0-
----------------------------------------------------------------------
Mark Moore
UCAR/NCAR/CGD                    mmoore <at> ucar.edu
1850 Table Mesa Drive            (W) 303 497-1338
Boulder, CO 80305                (F) 303 497-1324

------------------------------

Message: 2
Date: Mon, 21 Jul 2014 10:48:18 -0600
From: Mark Moore <mmoore <at> ucar.edu>
Subject: Re: [torqueusers] Statistics and Cluster utilization
   reporting with Torque
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
   <CA+i9Z01+srsmZq05u60gd-49dY1CzduR834dHDX+5iTeP4fXzQ <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Thu, Jul 17, 2014 at 12:45 PM, Amjad Syed <amjadcsu <at> gmail.com> wrote:

Hello,

I have just a question: in torque/maui, can we get a monthly/weekly
report of the following?

1) Number of jobs submitted
2) Number of cores/cpu utilized
3) Memory utilization


If not, what tools are recommended that can be used with torque to
get the above information?

Thanks,
Amjad





--
   Mark
   --0-
----------------------------------------------------------------------
Mark Moore
UCAR/NCAR/CGD                    mmoore <at> ucar.edu
1850 Table Mesa Drive            (W) 303 497-1338
Boulder, CO 80305                (F) 303 497-1324

------------------------------

Message: 3
Date: Mon, 21 Jul 2014 10:51:49 -0600
From: Mark Moore <mmoore <at> ucar.edu>
Subject: Re: [torqueusers] Statistics and Cluster utilization
   reporting with Torque
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
   <CA+i9Z02-eoqH8EdtQ7nvGZ6p9LEbt2_RnfSqBsP9kLwvEHcLEA <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Thu, Jul 17, 2014 at 12:45 PM, Amjad Syed <amjadcsu <at> gmail.com> wrote:

Hello,

I have just a question: in torque/maui, can we get a monthly/weekly
report of the following?

1) Number of jobs submitted
2) Number of cores/cpu utilized
3) Memory utilization


If not, what tools are recommended that can be used with torque to
get the above information?

Thanks,
Amjad




Adding to the list:

. pbstop, originally written by Garrick Staples, now maintained by others.
. I just completed some alpha code to graph job throughput for our small
  cluster. Code and screenshots are at the link below. I'll be making
  improvements at some point, and sooner if there's interest.

http://tools.cgd.ucar.edu/torqueacctgr/index.html


--
   Mark
   --0-
----------------------------------------------------------------------
Mark Moore
UCAR/NCAR/CGD                    mmoore <at> ucar.edu
1850 Table Mesa Drive            (W) 303 497-1338
Boulder, CO 80305                (F) 303 497-1324

------------------------------

Message: 4
Date: Mon, 21 Jul 2014 10:55:14 -0600
From: David Beer <dbeer <at> adaptivecomputing.com>
Subject: Re: [torqueusers] Torque Scheduling & Conditions
To: Torque Users Mailing List <torqueusers <at> supercluster.org>
Message-ID:
   <CAFUQeZ0DjjqxrNkQ89DWhhsEpZ8GE6Mu7P25DYGUodJeAxEmWw <at> mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Mon, Jul 14, 2014 at 3:20 PM, Paul El Khoury <
Paul.ElKhoury <at> halliburton.com> wrote:

Hello,



I am facing the problem below in my project and would like to solicit your
feedback on its feasibility:



I need to simulate 50,000 shot points, which require 1 hour per shot, or
approximately 2 months using 32 nodes (20 cores each) continuously;
however, management would like my simulation to be done during slack time
so that it does not prevent major jobs from executing.

The plan is to try and do the following:


Is each shot divided into its own job? It would seem like a good idea to
get them divided into as many jobs as possible to facilitate the way you
want to do things.


-          Task: Set program to run during off working hours only
(overnight 5pm to 7am) and/or run the jobs only during the cluster's slack time
(when many nodes are available)

o   Problem: Is it possible to do that?


I would recommend simply submitting very low priority jobs that are short
in length. If you want them literally to run only from 5 pm to 7 am, there
are ways to do that depending on the scheduler you're using, but I can't
really think of a way to accomplish this with just TORQUE. For example,
with Moab you could create a standing reservation and submit the jobs
directly to that standing reservation.
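
A rough sketch of what that could look like in moab.cfg (reservation name, class name, and task count are illustrative, and a window that spans midnight may need to be expressed differently depending on the Moab version):

  # hypothetical standing reservation: only jobs in the lowprio class may use
  # these tasks, and only inside the 17:00-07:00 window each day
  SRCFG[overnight]  PERIOD=DAY STARTTIME=17:00:00 ENDTIME=7:00:00
  SRCFG[overnight]  CLASSLIST=lowprio TASKCOUNT=640

The shot jobs would then be submitted to the matching low-priority queue, e.g. qsub -q lowprio shot.sh.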

-          Task: Use low priority to queue the jobs and for every shot:
find a node, run the simulation and then release node so that if a higher
priority job is queued it can use it.

o   Problem: During my testing, once my job started simulation on a node,
it did not stop execution until all the shots were done.

You have to set up preemption for this to work; that is pretty much done
via the scheduler.
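
With Maui, for example, a minimal sketch of that setup in maui.cfg (queue and QOS names are only placeholders):

  # hypothetical fragment: jobs in the 'high' QOS may preempt jobs in the 'low' QOS
  PREEMPTPOLICY      REQUEUE
  QOSCFG[high]       QFLAGS=PREEMPTOR PRIORITY=1000
  QOSCFG[low]        QFLAGS=PREEMPTEE
  CLASSCFG[batch]    QDEF=high
  CLASSCFG[lowprio]  QDEF=low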


-          Task: If a high priority job is queued, check for the % of
completion of the shot, if less than 75% kill it, if more continue and then
release node.

o   Problem: Is it possible to set such a condition or something similar?

I don't know of any specific way to do this.



I would appreciate your help.



Regards

Paul
------------------------------
This e-mail, including any attached files, may contain confidential and
privileged information for the sole use of the intended recipient. Any
review, use, distribution, or disclosure by others is strictly prohibited.
If you are not the intended recipient (or authorized to receive information
for the intended recipient), please contact the sender by reply e-mail and
delete all copies of this message.





--
David Beer | Senior Software Engineer
Adaptive Computing

    Fwd: Preemption Torque/Maui

    Hi all (I tried sending this message to the maui users list without success, so here I am...),

       I've been trying to configure Maui to do preemption based on queues, but it just doesn't work as expected. I set up all the priorities and classified every class as preemptor or preemptee, and nothing happens.

      I made a simple test: when I submit a job to a queue that is a preemptee and then submit a job to a preemptor queue, if there are no resources available the preemptor job just goes to the idle job list; it does not suspend the first job to free the resources it needs.
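
      For reference, the test looks roughly like this (script names are placeholders; the queue names match the CLASSCFG entries below):

        qsub -q cluster   long_job.sh      # preemptee queue, takes the resources
        qsub -q high_prio urgent_job.sh    # preemptor queue, should suspend the first job
        showq                              # instead, the second job just sits in the Idle list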


    Here is the setup:


    ================= maui.cfg ====================

    SERVERHOST              myserver.mydomain.com.br
    ADMIN1                  root
    ADMIN3                  test
    ADMINHOST               myserver.mydomain.com.br
    RMCFG[base]             TYPE=PBS
    #
    SERVERPORT              40559
    SERVERMODE              NORMAL
    #
    RMPOLLINTERVAL        00:00:30
    #
    LOGFILE               /var/log/maui.log
    LOGFILEMAXSIZE        10000000
    LOGLEVEL              3
    NODEACCESSPOLICY        shared
    #
    DEFERTIME       00:01:00
    #
    PREEMPTPOLICY           SUSPENDED
    BACKFILLPOLICY        FIRSTFIT
    RESERVATIONPOLICY     CURRENTHIGHEST
    NODEALLOCATIONPOLICY  MINRESOURCE
    #
    FSPOLICY        DEDICATEDPS
    FSINTERVAL      24:00:00
    FSQOSWEIGHT     2
    #
    QOSWEIGHT               1

    CLASSCFG[high_prio]     QDEF=high
    CLASSCFG[cluster]       QDEF=slow
    QOSCFG[high]            QFLAGS=PREEMPTOR        PRIORITY=1000
    QOSCFG[slow]            QFLAGS=PREEMPTEE        PRIORITY=100
    #
    CONSUMEDWEIGHT          3
    CREDWEIGHT              1
    GROUPWEIGHT             1
    USERWEIGHT              1
    SRCFGWEIGHT             2
    #
    QUEUETIMEWAIT        1     
    #
    ENABLENEGJOBPRIORITY    true
    REJECTNEGPRIOJOB        false
    #
    NODEALLOCATIONPOLICY    PRIORITY
    NODECFG[DEFAULT]        SLOT=2
    #
    NODESETPOLICY    ONEOF
    NODESETATTRIBUTE FEATURE
    #
    DEFERTIME       0
    ==============================================================================


    =============== showconfig ===================================================

    # Maui version 3.3.1 (PID: 3656)
    # global policies

    REJECTNEGPRIOJOBS[0]              FALSE
    ENABLENEGJOBPRIORITY[0]           TRUE
    ENABLEMULTINODEJOBS[0]            TRUE
    ENABLEMULTIREQJOBS[0]             FALSE
    BFPRIORITYPOLICY[0]               [NONE]
    JOBPRIOACCRUALPOLICY            QUEUEPOLICY
    NODELOADPOLICY                  ADJUSTSTATE
    USEMACHINESPEEDFORFS            FALSE
    USEMACHINESPEED                 FALSE
    USESYSTEMQUEUETIME              TRUE
    USELOCALMACHINEPRIORITY         FALSE
    NODEUNTRACKEDLOADFACTOR         1.2
    JOBNODEMATCHPOLICY[0]            
    JOBMAXSTARTTIME[0]                  INFINITY
    METAMAXTASKS[0]                   0
    NODESETPOLICY[0]                  ONEOF
    NODESETATTRIBUTE[0]               FEATURE
    NODESETLIST[0]                   
    NODESETDELAY[0]                   00:00:00
    NODESETPRIORITYTYPE[0]            MINLOSS
    NODESETTOLERANCE[0]                 0.00
    BACKFILLPOLICY[0]                 FIRSTFIT
    BACKFILLDEPTH[0]                  0
    BACKFILLPROCFACTOR[0]             0
    BACKFILLMAXSCHEDULES[0]           10000
    BACKFILLMETRIC[0]                 PROCS
    BFCHUNKDURATION[0]                00:00:00
    BFCHUNKSIZE[0]                    0
    PREEMPTPOLICY[0]                  [NONE]
    MINADMINSTIME[0]                  00:00:00
    RESOURCELIMITPOLICY[0]           
    NODEAVAILABILITYPOLICY[0]         COMBINED:[DEFAULT]
    NODEALLOCATIONPOLICY[0]           PRIORITY
    TASKDISTRIBUTIONPOLICY[0]         DEFAULT
    RESERVATIONPOLICY[0]              CURRENTHIGHEST
    RESERVATIONRETRYTIME[0]           00:00:00
    RESERVATIONTHRESHOLDTYPE[0]       NONE
    RESERVATIONTHRESHOLDVALUE[0]      0
    FSPOLICY                        DEDICATEDPS
    FSPOLICY                        DEDICATEDPS
    FSINTERVAL                      1:00:00:00
    FSDEPTH                         8
    FSDECAY                         1.00 
    SERVICEWEIGHT[0]                  1
    TARGETWEIGHT[0]                   1
    CREDWEIGHT[0]                     1
    ATTRWEIGHT[0]                     1
    FSWEIGHT[0]                       1
    RESWEIGHT[0]                      1
    USAGEWEIGHT[0]                    1
    QUEUETIMEWEIGHT[0]                1
    XFACTORWEIGHT[0]                  0
    SPVIOLATIONWEIGHT[0]              0
    BYPASSWEIGHT[0]                   0
    TARGETQUEUETIMEWEIGHT[0]          0
    TARGETXFACTORWEIGHT[0]            0
    USERWEIGHT[0]                     1
    GROUPWEIGHT[0]                    1
    ACCOUNTWEIGHT[0]                  0
    QOSWEIGHT[0]                      1
    CLASSWEIGHT[0]                    0
    FSUSERWEIGHT[0]                   0
    FSGROUPWEIGHT[0]                  0
    FSACCOUNTWEIGHT[0]                0
    FSQOSWEIGHT[0]                    2
    FSCLASSWEIGHT[0]                  0
    ATTRATTRWEIGHT[0]                 0
    ATTRSTATEWEIGHT[0]                0
    NODEWEIGHT[0]                     0
    PROCWEIGHT[0]                     0
    MEMWEIGHT[0]                      0
    SWAPWEIGHT[0]                     0
    DISKWEIGHT[0]                     0
    PSWEIGHT[0]                       0
    PEWEIGHT[0]                       0
    WALLTIMEWEIGHT[0]                 0
    UPROCWEIGHT[0]                    0
    UJOBWEIGHT[0]                     0
    CONSUMEDWEIGHT[0]                 3
    USAGEEXECUTIONTIMEWEIGHT[0]       0
    REMAININGWEIGHT[0]                0
    PERCENTWEIGHT[0]                  0
    XFMINWCLIMIT[0]                   00:02:00
    REJECTNEGPRIOJOBS[1]              FALSE
    ENABLENEGJOBPRIORITY[1]           FALSE
    ENABLEMULTINODEJOBS[1]            TRUE
    ENABLEMULTIREQJOBS[1]             FALSE
    BFPRIORITYPOLICY[1]               [NONE]
    JOBPRIOACCRUALPOLICY            QUEUEPOLICY
    NODELOADPOLICY                  ADJUSTSTATE
    JOBNODEMATCHPOLICY[1]            
    JOBMAXSTARTTIME[1]                  INFINITY
    METAMAXTASKS[1]                   0
    NODESETPOLICY[1]                  [NONE]
    NODESETATTRIBUTE[1]               [NONE]
    NODESETLIST[1]                   
    NODESETDELAY[1]                   00:00:00
    NODESETPRIORITYTYPE[1]            MINLOSS
    NODESETTOLERANCE[1]                 0.00
    XFMINWCLIMIT[1]                   00:00:00
    RMAUTHTYPE[0]                     CHECKSUM
    CLASSCFG[[NONE]]  DEFAULT.FEATURES=[NONE]
    CLASSCFG[[ALL]]  DEFAULT.FEATURES=[NONE]
    CLASSCFG[high_prio]  DEFAULT.FEATURES=[NONE]
    CLASSCFG[cluster]  DEFAULT.FEATURES=[NONE]
    QOSPRIORITY[0]                    0
    QOSQTWEIGHT[0]                    0
    QOSXFWEIGHT[0]                    0
    QOSTARGETXF[0]                      0.00
    QOSTARGETQT[0]                    00:00:00
    QOSFLAGS[0]                      
    QOSPRIORITY[1]                    0
    QOSQTWEIGHT[1]                    0
    QOSXFWEIGHT[1]                    0
    QOSTARGETXF[1]                      0.00
    QOSTARGETQT[1]                    00:00:00
    QOSFLAGS[1]                      
    QOSPRIORITY[2]                    1000
    QOSQTWEIGHT[2]                    0
    QOSXFWEIGHT[2]                    0
    QOSTARGETXF[2]                      0.00
    QOSTARGETQT[2]                    00:00:00
    QOSFLAGS[2]                       PREEMPTOR
    QOSPRIORITY[3]                    100
    QOSQTWEIGHT[3]                    0
    QOSXFWEIGHT[3]                    0
    QOSTARGETXF[3]                      0.00
    QOSTARGETQT[3]                    00:00:00
    QOSFLAGS[3]                       PREEMPTEE
    SERVERMODE                      NORMAL
    SERVERNAME                     
    SERVERHOST                      myserver.mydomain.com.br
    SERVERPORT                      40559
    LOGFILE                         /var/log/maui.log
    LOGFILEMAXSIZE                  10000000
    LOGFILEROLLDEPTH                1
    LOGLEVEL                        3
    LOGFACILITY                     fALL
    SERVERHOMEDIR                   /usr/local/maui/
    TOOLSDIR                        /usr/local/maui/tools/
    LOGDIR                          /usr/local/maui/log/
    STATDIR                         /usr/local/maui/stats/
    LOCKFILE                        /usr/local/maui/maui.pid
    SERVERCONFIGFILE                /usr/local/maui/maui.cfg
    CHECKPOINTFILE                  /usr/local/maui/maui.ck
    CHECKPOINTINTERVAL              00:05:00
    CHECKPOINTEXPIRATIONTIME        3:11:20:00
    TRAPJOB                        
    TRAPNODE                       
    TRAPFUNCTION                   
    RESDEPTH                        24

    RMPOLLINTERVAL                  00:00:30
    NODEACCESSPOLICY                SHARED
    ALLOCLOCALITYPOLICY             [NONE]
    SIMTIMEPOLICY                   [NONE]
    ADMIN1                          root
    ADMIN3                          test
    ADMINHOSTS                      ALL
    NODEPOLLFREQUENCY               0
    DISPLAYFLAGS                   
    DEFAULTDOMAIN                   .mydomain.com.br
    DEFAULTCLASSLIST                [DEFAULT:1]
    FEATURENODETYPEHEADER          
    FEATUREPROCSPEEDHEADER         
    FEATUREPARTITIONHEADER         
    DEFERTIME                       00:00:00
    DEFERCOUNT                      24
    DEFERSTARTCOUNT                 1
    JOBPURGETIME                    0
    NODEPURGETIME                   2140000000
    APIFAILURETHRESHHOLD            6
    NODESYNCTIME                    600
    JOBSYNCTIME                     600
    JOBMAXOVERRUN                   00:10:00
    NODEMAXLOAD                     0.0
    PLOTMINTIME                     120
    PLOTMAXTIME                     245760
    PLOTTIMESCALE                   11
    PLOTMINPROC                     1
    PLOTMAXPROC                     512
    PLOTPROCSCALE                   9
    SCHEDCFG[]                        MODE=NORMAL SERVER=myserver.mydomain.com.br:40559
    # RM MODULES: PBS SSS WIKI NATIVE
    RMCFG[base] AUTHTYPE=CHECKSUM EPORT=15004 TIMEOUT=00:00:09 TYPE=PBS
    SIMWORKLOADTRACEFILE            workload
    SIMRESOURCETRACEFILE            resource
    SIMAUTOSHUTDOWN                 OFF
    SIMSTARTTIME                    0
    SIMSCALEJOBRUNTIME              FALSE
    SIMFLAGS                       
    SIMJOBSUBMISSIONPOLICY          CONSTANTJOBDEPTH
    SIMINITIALQUEUEDEPTH            16
    SIMWCACCURACY                   0.00
    SIMWCACCURACYCHANGE             0.00
    SIMNODECOUNT                    0
    SIMNODECONFIGURATION            NORMAL
    SIMWCSCALINGPERCENT             100
    SIMCOMRATE                      0.10
    SIMCOMTYPE                      ROUNDROBIN
    COMINTRAFRAMECOST               0.30
    COMINTERFRAMECOST               0.30
    SIMSTOPITERATION                -1
    SIMEXITITERATION                -1
    ==============================================================================

    test <at> myserver:~$ showq -v
    ACTIVE JOBS--------------------
    JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

    0                     test    Running     1  4:03:59:29  Fri Jul 11 16:44:36

         1 Active Job        1 of   12 Processors Active (8.33%)
                             1 of    3 Nodes Active      (33.33%)

    IDLE JOBS----------------------
    JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

    1                     test       Idle    12  4:04:00:00  Fri Jul 11 16:44:52

    1 Idle Job

    BLOCKED JOBS----------------
    JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


    Total Jobs: 2   Active Jobs: 1   Idle Jobs: 1   Blocked Jobs: 0
    ===============================================================================

    test <at> myserver:~$ qstat -q

    server: myserver.mydomain.com.br

    Queue            Memory CPU Time Walltime Node  Run Que Lm  State
    ---------------- ------ -------- -------- ----  --- --- --  -----
    cluster            --      --       --      --    1   0 --   E R
    high_prio          --      --       --      --    0   1 --   E R
                                                   ----- -----
                                                       1     1

    ===============================================================================

    test <at> myserver:~$ checkjob 1


    checking job 1

    State: Idle
    Creds:  user:test  group:test  class:high_prio  qos:DEFAULT
    WallTime: 00:00:00 of 4:04:00:00
    SubmitTime: Fri Jul 11 16:44:52
      (Time Queued  Total: 00:08:31  Eligible: 00:05:56)

    Total Tasks: 12

    Req[0]  TaskCount: 12  Partition: ALL
    Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
    Opsys: [NONE]  Arch: [NONE]  Features: [NONE]


    IWD: [NONE]  Executable:  [NONE]
    Bypass: 0  StartCount: 0
    PartitionMask: [ALL]
    Flags:       RESTARTABLE

    Reservation '1' (4:03:51:13 -> 8:07:51:13  Duration: 4:04:00:00)
    PE:  12.00  StartPriority:  5
    job cannot run in partition DEFAULT (idle procs do not meet requirements : 8 of 12 procs found)
    idle procs:  12  feasible procs:   8

    Rejection Reasons: [CPU          :    1]


    ======================================================================================


    Best regards.

    --
    ~ Pedro Cesar
    IT Specialist


    Paul El Khoury | 14 Jul 23:20 2014

    Torque Scheduling & Conditions

    Hello,

     

    I am facing the problem below in my project and would like to solicit your feedback on its feasibility:

     

    I need to simulate 50,000 shot points, which require 1 hour per shot, or approximately 2 months using 32 nodes (20 cores each) continuously; however, management would like my simulation to be done during slack time so that it does not prevent major jobs from executing.

    The plan is to try and do the following:

    -          Task: Set program to run during off working hours only (overnight 5pm to 7am) and/or run the jobs only during cluster’s slack time (when many nodes are available)

    o   Problem: Is it possible to do that?

    -          Task: Use low priority to queue the jobs and for every shot: find a node, run the simulation and then release node so that if a higher priority job is queued it can use it.  

    o   Problem: During my testing, once my job started simulation on a node, it did not stop execution until all the shots were done.

    -          Task: If a high priority job is queued, check for the % of completion of the shot, if less than 75% kill it, if more continue and then release node.

    o   Problem: Is it possible to set such a condition or something similar?

     

    I would appreciate your help.

     

    Regards

    Paul

    This e-mail, including any attached files, may contain confidential and privileged information for the sole use of the intended recipient. Any review, use, distribution, or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive information for the intended recipient), please contact the sender by reply e-mail and delete all copies of this message.
    Amjad Syed | 18 Jul 17:06 2014

    error in pbsacct 1.4.9

    Hello, 
    I am not sure if this is the right forum, but I just want to check if anyone has the same issue.

    I am using pbsacct 1.4.9  with torque 2.5.4 and maui.

    When I execute the following command from the reporting directory, I get the following error.



    pbsacct 20140617

    Portable Batch System accounting statistics
    -------------------------------------------

    Processing a total of 1 accounting files... /opt/maui/bin/pbsjobs: line 72: -F;: command not found
    done.
    /opt/maui/bin/pbsacct ERROR: There are no accounting records in the input files:
    20140617


    This is sample of what  i have in 20140617 

     08:47:51;E;1056.krplpadul001;user=xxxxxx group=xxxxx jobname=WekaScriptArabicTextMiningClassificatin-LongRun-V6-con6-HPC-v12.sh queue=vlong ctime=1401266642 qtime=1401266642 etime=1401266642 start=1401266644 owner=xxxxx <at> krplpadul001 exec_host=krplpslcn001/5 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=480:00:00 session=11415 end=1402984071 Exit_status=0 resources_used.cput=481:44:42 resources_used.mem=1723856kb resources_used.vmem=8873540kb resources_used.walltime=477:03:46



    I must be missing something here.

    Thanks
    Amjad
    Amjad Syed | 17 Jul 20:45 2014

    Statistics and Cluster utilization reporting with Torque

    Hello,

    I have just a question: in torque/maui, can we get a monthly/weekly report of the following?

    1) Number of jobs submitted
    2) Number of cores/cpu utilized
    3) Memory utilization


    If not, what tools are recommended that can be used with torque to get the above information?

    Thanks,
    Amjad
    Andrej Prsa | 15 Jul 14:06 2014

    Intermittent ORTE_ERROR_LOG problems

    Hi everyone,
    
    qsubbing a job to torque sometimes results in the following:
    
    [node3:16773] [[35788,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
    [node3:16773] [[35788,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
    [node3:16773] [[35788,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at
    line 149
    [node3:16773] [[35788,0],0] ORTE_ERROR_LOG: File open failure in file
    base/plm_base_launch_support.c at line 99
    [node3:16773] [[35788,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194
    --------------------------------------------------------------------------
    A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
    launch so we are aborting.
    
    There may be more information reported by the environment (see above).
    
    This may be because the daemon was unable to find all the needed shared
    libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
    location of the shared libraries on the remote nodes and this will
    automatically be forwarded to the remote nodes.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun noticed that the job aborted, but has no info as to the process
    that caused that situation.
    --------------------------------------------------------------------------
    
    Another consecutive qsub and the problem goes away. Here's the script:
    
    #PBS -N V346Cen
    #PBS -l nodes=120:big
    #PBS -l walltime=14:00:00
    #PBS -V
    #PBS -M aprsa <at> villanova.edu
    #PBS -m abe
    
    cd $PBS_O_WORKDIR
    mpirun ./mcmc.py
    
    Any ideas what may be causing this? The occurrence rate is 30-40%.
    I am using torque 4.2.7 and maui 3.3.1 on Ubuntu 12.04.
    
    Thanks,
    Andrej
    
    glen.beane | 11 Jul 00:48 2014

    Re: Configuring debug queue to use just half a node

    Hi Steve, you can make a Moab reservation on the node for 8 cores for the debug class (queue).  The remaining cores would be available for the other queue. 
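
    A minimal sketch of that in moab.cfg, assuming the node is called n0 (the reservation name is a placeholder; check the Moab SRCFG documentation for the attributes your version supports):

        # hypothetical standing reservation: hold 8 tasks on n0 for the debug class only
        SRCFG[debug]  HOSTLIST=n0 TASKCOUNT=8
        SRCFG[debug]  CLASSLIST=debug PERIOD=INFINITY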

    Sent from my iPhone

    On Jul 10, 2014, at 6:41 PM, Stephen Cousins <steve.cousins <at> maine.edu> wrote:

    Hi David,

    But I don't think it would limit the batch queue from using all of the cores on that node, leaving 8 cores free for debugging.  

    Right now our cluster is completely full but there are some people who are frustrated because they want to debug code to get it ready. We have a fairly small cluster (only 32 nodes) so I don't want to tie up a whole node (16 cores) exclusively for debugging. I'm trying to tie up just 8 cores instead.  

    Thanks,

    Steve




    On Thu, Jul 10, 2014 at 6:29 PM, David Beer <dbeer <at> adaptivecomputing.com> wrote:
    I didn't explain that correctly. I think you want to do something like:

     n0 np=16 debug ib lowmem ...

      set queue debug resources_default.neednodes = debug
      set queue batch resources_default.neednodes = ib

    and on top of that have the CLASSCFG line in your moab.cfg file. These combined should restrict the debug queue to only 8 procs on n0.
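
    Pulled together, a sketch of the combined setup (node and feature names are the ones used in this thread; the CLASSCFG line is the one quoted further down):

        # server_priv/nodes: n0 carries the debug feature plus the general ones
        n0 np=16 debug ib lowmem

        # qmgr: route each queue to nodes with the matching feature
        set queue debug resources_default.neednodes = debug
        set queue batch resources_default.neednodes = ib

        # moab.cfg: cap the debug class at 8 processors
        CLASSCFG[debug] MAXPROC=8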


    On Thu, Jul 10, 2014 at 4:19 PM, Stephen Cousins <steve.cousins <at> maine.edu> wrote:
    That gets me part way there. I'd like to limit the batch queue to a maximum of 8 cores on that node, leaving 8 for the debug queue. Maybe Moab has a way of doing this. I'll look there further.

    Thanks a lot. 

    Steve


    On Thu, Jul 10, 2014 at 6:09 PM, David Beer <dbeer <at> adaptivecomputing.com> wrote:
    There is definitely no way that you can specify a node twice in the nodes file for TORQUE.

    The only way I know of to handle this is to use Moab. Something like:

    CLASSCFG[debug] MAXPROC=8


    On Thu, Jul 10, 2014 at 3:22 PM, Stephen Cousins <steve.cousins <at> maine.edu> wrote:
    All of our nodes have 16 cores. I'd like to set up a debug queue with only 8 cores on one node and let the other cores be available for longer term jobs. There is probably a simple way to do this but I'm not seeing it. 

    Is it possible to have multiple lines for a node in server_priv/nodes like:

      n0 np=8 debug
      n0 np=8 ib-only ib lowmem

    And then qmgr commands would be:

      set queue debug resources_default.neednodes = debug
      set queue batch resources_default.neednodes = ib

    The "batch" queue would then have access to all nodes but only 8 cores on n0. 

    I don't think this type of configuration is allowed. Is it? If not, what would you suggest?

    We have Moab too. Maybe I need to configure it there?

    Thanks,

    Steve

    --
    ________________________________________________________________
     Steve Cousins             Supercomputer Engineer/Administrator
     Advanced Computing Group            University of Maine System
     244 Neville Hall (UMS Data Center)              (207) 561-3574
     Orono ME 04469                      steve.cousins at maine.edu






    --
    David Beer | Senior Software Engineer
    Adaptive Computing





    --
    ________________________________________________________________
     Steve Cousins             Supercomputer Engineer/Administrator
     Advanced Computing Group            University of Maine System
     244 Neville Hall (UMS Data Center)              (207) 561-3574
     Orono ME 04469                      steve.cousins at maine.edu






    --
    David Beer | Senior Software Engineer
    Adaptive Computing





    --
    ________________________________________________________________
     Steve Cousins             Supercomputer Engineer/Administrator
     Advanced Computing Group            University of Maine System
     244 Neville Hall (UMS Data Center)              (207) 561-3574
     Orono ME 04469                      steve.cousins at maine.edu

