Kevin Buckley | 20 May 2013 07:21

Re: building Son of Grid Engine

On 20 May 2013 14:51, Kevin Buckley
<kevin.buckley.ecs.vuw.ac.nz@...> wrote:

Here's a patch to

inst_common.sh

that moves qmon from the list of

BINFILES

into a list of

WARNBINFILES

There's then a test for existence loop over WARNBINFILES that
spits out a message but doesn't set

missing=false
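
The behaviour described above could look roughly like the following sketch. It is not the actual patch: the function name check_binaries, the directory argument and the contents of the two lists are assumptions for illustration, and the sketch models the failure flag as missing=true, which is a guess at how the surrounding script tracks missing files.

```sh
#!/bin/sh
# Hypothetical sketch, not the real inst_common.sh code.
# Files in BINFILES are required; files in WARNBINFILES only trigger a warning.
BINFILES="qsub qstat qconf"      # illustrative subset only
WARNBINFILES="qmon"

check_binaries() {
    bindir=$1
    missing=false
    for f in $BINFILES; do
        if [ ! -f "$bindir/$f" ]; then
            echo "missing program >$f< in directory >$bindir<"
            missing=true
        fi
    done
    for f in $WARNBINFILES; do
        if [ ! -f "$bindir/$f" ]; then
            # Report the absence but leave the "missing" flag untouched.
            echo "warning: optional program >$f< not found in >$bindir<"
        fi
    done
    # Succeed only if no required file was missing.
    [ "$missing" = false ]
}
```

With this shape, an installation without qmon would print a warning and still pass the check, so -nobincheck would no longer be needed for that case.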

Oh yeah, this time I have noticed that the reply seems to go to the sender, not the
list, so I have copied this to the list as well as re-sending my previous reply to
Tina to the list.

Dave, I think my patch for the spec file went only to you as well.

I'll get the hang of this mailing list thing one day!


Joe Borġ | 17 May 2013 16:33

Spelling mistake in OGE 2011.11

Not, by any means, urgent but there is a spelling mistake that occurs a few times in OGE 2011.11.  Resource has been spelt "ressource" which is the French way to spell the word.

This is in the splash screen and main control window.  It's spelt correctly when hovering over  "Ressource Quota" and when in the Resource Window itself.

Regards,
Joseph David Borġ
http://www.jdborg.com
_______________________________________________
users mailing list
users@...
https://gridengine.org/mailman/listinfo/users
Tim Landscheidt | 16 May 2013 19:18

Where do the factors for np_load_short come from?

Hi,

we're using OGS/GE 2011.11 at toolserver.org, and unfortunately
our admins are AWOL.  I'm trying to investigate why
the grid is heavily underloaded (while queues are filling
up).  A simple job gets queued and has scheduling_info:

| scheduling info:            queue instance "longrun-sol@..." dropped because it is temporarily not available
|                             queue instance "short-sol@..." dropped because it is temporarily not available
|                             queue instance "medium-lx@..." dropped because it is temporarily not available
|                             queue instance "longrun3-sol@..." dropped because it is temporarily not available
|                             queue instance "longrun2-sol@..." dropped because it is disabled
|                             queue instance "longrun2-sol@..." dropped because it is disabled
|                             queue instance "medium-sol@..." dropped because it is overloaded: np_load_short=0.845508 (= 0.645508 + 0.8 * 1.000000 with nproc=4) >= 0.75
|                             queue instance "medium-sol@..." dropped because it is overloaded: np_load_short=0.831445 (= 0.231445 + 0.8 * 6.000000 with nproc=8) >= 0.75
|                             queue instance "short-sol@..." dropped because it is overloaded: np_load_short=1.245508 (= 0.645508 + 0.8 * 3.000000 with nproc=4) >= 1.2
|                             queue instance "short-sol@..." dropped because it is overloaded: np_load_short=1.231445 (= 0.231445 + 0.8 * 10.000000 with nproc=8) >= 1.2
|                             queue instance "medium-lx@..." dropped because it is overloaded: np_load_short=1.202500 (= 0.002500 + 0.8 * 6.000000 with nproc=4) >= 1.2
|                             queue instance "medium-lx@..." dropped because it is full
|                             queue instance "longrun-lx@..." dropped because it is overloaded: mem_free=-173461503.737856 (= 13834.574219M - 500M * 28.000000) <= 500
|                             queue instance "longrun-lx@..." dropped because it is overloaded: np_load_short=3.202500 (= 0.002500 + 0.8 * 16.000000 with nproc=4) >= 3.1

For example, in queue instance medium-lx@..., where do the
factors 0.8 and 6.000000 come from?  Neither "qconf -sconf global" nor
"qconf -sconf yarrow" show anything obvious, and "qconf -sq
medium-lx" only has load_thresholds with:

| [...]
| load_thresholds       np_load_short=1.2,np_load_long=1.5,cpu=98, \
|                       mem_free=1000M, \
|                       [mayapple.toolserver.org=np_load_short=2.1,mem_free=300M]
| [...]

to define the threshold, but not the calculation.  I believe
the factors are applied at
source/libs/sched/sge_select_queue.c:2057, but I don't want
to read the whole source :-).  Are these factors some default,
or where should I look?

TIA,
Tim
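
The figures in the scheduling_info lines above are at least internally consistent if the second term is divided by nproc, i.e. reported load + factor * count / nproc. The short check below reproduces three of the quoted values with awk; the helper name adjusted_load is made up for illustration. What the factor and the count actually represent is not stated in the post. One plausible source, offered here purely as an assumption, is the scheduler configuration shown by "qconf -ssconf" (job_load_adjustments and load_adjustment_decay_time), with the count being jobs recently started on that host.

```sh
#!/bin/sh
# adjusted_load LOAD FACTOR COUNT NPROC
# Prints LOAD + FACTOR * COUNT / NPROC with six decimals, the form
# that matches the numbers quoted in the scheduling_info output.
adjusted_load() {
    awk -v load="$1" -v factor="$2" -v count="$3" -v nproc="$4" \
        'BEGIN { printf "%.6f\n", load + factor * count / nproc }'
}

adjusted_load 0.645508 0.8 1 4    # quoted as 0.845508 (threshold 0.75)
adjusted_load 0.231445 0.8 6 8    # quoted as 0.831445 (threshold 0.75)
adjusted_load 0.002500 0.8 6 4    # quoted as 1.202500 (threshold 1.2)
```

In the medium-lx case almost the entire value comes from the adjustment term, since the measured load is only 0.0025.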
Jacques Foucry | 15 May 2013 17:49

submit, need help

Hello,

I am almost happy: I can do something with my little grid :-)

I have a qmaster and two exec hosts. All of them share the same
SGE_CELL, NFS-exported from the qmaster (/usr/local/sge/default).

As my user, on sge1 (the second exec host), I created a small shell script:

#!/bin/sh

HOST=`hostname`
echo "This is a test from ${HOST}"

I tried to submit with qsub:

$SGE_ROOT/bin/linux-x64/qsub test.sh
Your job 17 ("test.sh") has been submitted

On sge0 (the first exec host), in my home directory (same user but a
different filesystem), I can find the result files:

test.sh.o17
test.sh.e17

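With this output (the first two lines are in French; in English they read "Warning: no access to tty (Bad file descriptor). Thus no job control in this shell."):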

Attention: pas d'accès au tty (Mauvais descripteur de fichier).
Ainsi pas de contrôle de job dans ce shell.
This is a test from sge0.ns42.fr

It seems my script was submitted on sge1 and executed on sge0.

How can I ask qsub to exec my script on sge1?

I tried the -q @server option, but it did not work:

% $SGE_ROOT/bin/linux-x64/qsub -q @sge1 test.sh
Unable to run job: Job was rejected because job requests unknown queue "@sge1.ns42.fr".

What do I need to check?

Thanks in advance,
Regards,
Jacques Foucry
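
Two common ways to pin a job to one execution host are a resource request on hostname and a wildcard queue instance of the form *@host. The sketch below only prints the corresponding command lines rather than submitting anything. The helper names are made up for illustration, and whether either form fits this particular cluster's queue setup is an assumption.

```sh
#!/bin/sh
# Hypothetical helpers: print, rather than run, a qsub command line
# that restricts the job to a single execution host.

# Variant 1: request the host as a resource.
qsub_on_host() {
    host=$1; shift
    printf '%s\n' "qsub -l hostname=$host $*"
}

# Variant 2: any queue, but only its instance on the given host.
# The pattern is printed in single quotes because it must be
# protected from the shell when typed at a prompt.
qsub_on_queue_instance() {
    host=$1; shift
    printf '%s\n' "qsub -q '*@$host' $*"
}

qsub_on_host sge1.ns42.fr test.sh
qsub_on_queue_instance sge1.ns42.fr test.sh
```

A bare "@sge1" names neither a queue nor a queue instance, which would be consistent with the "unknown queue" rejection quoted above.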
-- 
Jacques Foucry
*NOVΛSPARKS *
IT Manager
Tel : +33 (0)1 42 68 12 61
jacques.foucry <at> novasparks.com
William Hay | 15 May 2013 14:58

Re: Allow users to choose how many concurrent jobs they have




On 15 May 2013 13:32, Txema Heredia Genestar <txema.llistes <at> gmail.com> wrote:
Hi all,

I was wondering if there is any way to allow a user to choose how many
jobs they want to have running concurrently in the cluster. I am aware
that I, as an administrator, can specify limits in the slot usage for
each user with resource quota sets.
What I am asking is a method to allow a user to submit, for instance,
2000 jobs, but having only 50 running simultaneously, and, two days
later, be able to run 400 jobs at once.
Currently I am using a consumable attribute set to the total number of
cores of our cluster (400), so users can request some number (400 / 50 =
8) in order to have their desired simultaneous job, but this leads to
some confusion and applies to all users at once (the consumable
attribute pool is shared among all users).

Is there a fast-and-easy way a user can set his own limit?

Per-user resource quota on the consumable? Use some wrapper scripts or a JSV to do the maths.
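
The "maths" for the existing consumable scheme could be done by a small wrapper. The sketch below assumes a pool of 400 units, as in the question, and an invented complex name concurrency_token. It derives the per-job request from the number of jobs the user wants running at once, rounding up so that the limit is never exceeded, and prints the resulting qsub option.

```sh
#!/bin/sh
# Hypothetical wrapper logic; "concurrency_token" is an invented complex name.
POOL=400   # size of the consumable pool, taken from the question

# per_job_request WANTED: units each job must request so that at most
# WANTED jobs fit into the pool (integer division, rounded up).
per_job_request() {
    wanted=$1
    if [ "$wanted" -lt 1 ] || [ "$wanted" -gt "$POOL" ]; then
        echo "wanted concurrency must be between 1 and $POOL" >&2
        return 1
    fi
    echo $(( (POOL + wanted - 1) / wanted ))
}

# Print the resource request a wrapper could pass on to qsub.
req=$(per_job_request 50) && printf '%s\n' "-l concurrency_token=$req"
```

To give every user a separate pool rather than a shared one, the consumable would additionally need a per-user rule in a resource quota set, along the lines of "limit users {*} to concurrency_token=400".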


 
Txema Heredia Genestar | 15 May 2013 14:32

Allow users to choose how many concurrent jobs they have

Hi all,

I was wondering if there is any way to allow a user to choose how many 
jobs they want to have running concurrently in the cluster. I am aware 
that I, as an administrator, can specify limits in the slot usage for 
each user with resource quota sets.
What I am asking is a method to allow a user to submit, for instance, 
2000 jobs, but having only 50 running simultaneously, and, two days 
later, be able to run 400 jobs at once.
Currently I am using a consumable attribute set to the total number of 
cores of our cluster (400), so users can request some number (400 / 50 = 
8) in order to have their desired simultaneous job, but this leads to 
some confusion and applies to all users at once (the consumable 
attribute pool is shared among all users).

Is there a fast-and-easy way a user can set his own limit?

Thanks in advance,

Txema
Tina Friedrich | 15 May 2013 13:09

building Son of Grid Engine

Hello list,

have finally decided to look into upgrading our SGE6.2 installation - 
mainly to see if it helps with my job scheduling problem.

I'm trying to build Son of Grid Engine - succeeded actually. Currently 
trying to make it run / import my old configuration. Which mostly 
worked. Couple of niggles.

Our setup is SGE_ROOT on shared NFS file system, SGE running as a 
non-root user. I'd quite like to keep it that way (it worked well for 
us). Managed to build & install, got the qmaster running, managed to 
start execds. However, at least inst_sge.sh -upd-execd simply refuses to 
work if you're not root, if I remember correctly (not helping!).

Script(s) sometimes say 'You are not installing as user >root< - Can't 
set the file owner/group and permissions'. It would help if they'd tell 
me (without digging through them) what files they're trying to 
chown/chmod and what they're trying to chown/chmod it to - so I can fix 
that, if there is a problem. Goes for a lot of these sort of errors (to 
do with running as non-root) - if it fails to do something, it would 
really help to know what it failed to do.

The other thing is that I keep having to run it with -nobincheck, as far 
as I can tell simply because I didn't build qmon. Annoying - should it 
not just check for actually required binaries?

Importing my old installation / upgrading from my old installation 
didn't quite work. Mostly did, it seems, which is something. No error 
that I'd seen during the import/upgrade, but none of my queues are 
there. Host groups are; exec hosts are; complexes look okay; global 
config looks right. PEs aren't there; trying to create the PEs from the 
config files I originally created them from I get 'error: required 
attribute "qsort_args" is missing'. Assume that's the root problem (i.e. 
did not manage to import PEs, thus can't import queues). Anyone else had 
issues with that? Should the save_config script have caught that?
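
If the saved PE files simply predate the qsort_args attribute, one workaround might be to add it before re-importing. The sketch below appends "qsort_args NONE" to a PE definition file that lacks the attribute. That NONE is an acceptable value for this build is an assumption, as is the helper name; the edited file would still have to be loaded afterwards, for example with "qconf -Ap <file>".

```sh
#!/bin/sh
# Hypothetical helper: make sure a saved PE definition contains qsort_args.
# Assumes "NONE" is a valid value for the attribute.
add_qsort_args() {
    pe_file=$1
    if grep -q '^qsort_args[[:space:]]' "$pe_file"; then
        return 0                      # already present, leave the file alone
    fi
    printf '%s\n' 'qsort_args         NONE' >> "$pe_file"
}
```

The function is safe to run repeatedly over a directory of saved PE files, since a file that already has the attribute is left unchanged.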

And now for the important question :). My execds currently are a mix of 
RHEL5 and RHEL6; SoGE got compiled on RHEL6, doesn't work on RHEL5 
execds. Also, all nodes and the master/shadow hosts get software 
upgrades quite regularly - I would like to avoid having to recompile 
SoGE whenever I run yum update (the old installation is nicely agnostic 
to all of this, it Just Works(TM) - well, at least it worked with RHEL5 
and RHEL6.) Plus I've installed hwloc in a non-standard location (and 
currently have to tell the execd process where it is). Is there an 
option for aimk to build statically linked binaries? (I'm sort of 
guessing that that's what the difference is here.).

Apologies for the very long post.

Tina

-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442

Taras Shapovalov | 15 May 2013 09:07

2011.11 update 1

Dear developers,

The date of 2011.11 update 1 was scheduled for the 8th of May (according to
http://gridscheduler.sourceforge.net). Could you please comment on the
delay (or maybe I am looking at the wrong web page) and say what the new
expected release date is?

Best regards,
Taras
Reuti | 14 May 2013 18:52

Re: prolog-like script for qrsh

Am 14.05.2013 um 18:29 schrieb Riccardo Murri:

> On 14 May 2013 18:17, Riccardo Murri <riccardo.murri@...> wrote:
>> 
>> On 14 May 2013 09:05, Reuti <reuti@...> wrote:
>>> 
>>> In the prolog of the PE you could program a loop across all granted nodes by using a `qrsh -inherit -V ...` to all nodes therein to make some preparations.
>> 
>> I'm now doing this:
>> 
>>        HOSTS=$( cut -d' ' -f1 $PE_HOSTFILE | fgrep -v $(hostname -s) )
>>        for host in $HOSTS; do
>>            qrsh -inherit -V $host $RUNSCRIPT | grep -q "failed"
>>            # ... react on failure
>>        done
>> 
>> but apparently `execd` does not allow me to qrsh from the prolog:
>> 
>>    error: executing task of job 2648607 failed: execution daemon on
>> host "r01c04b04n02" didn't accept task
>> 
>> anything wrong with my use of `qrsh`?
> 
> apparently, since the prolog script runs as root, the execd does not
> allow `qrsh -inherit`:
> 
>    |W|denied request of user "root" to start a pe task in job of user "murri"
> 
> I probably misread your suggestion: I was talking about the job
> prolog, which at our site runs as root, whereas you advised me to use
> the PE start-up procedure (`start_proc_args` in `qconf -sp`)?

As it's used only in a parallel startup, the PE start_proc_args would be the more suitable place IMO, but if
you run the prolog as root user it must go to the prolog of course (there are security impacts running a shell
script as root and it needs to remove many exported variables from the user to avoid being abused).

-- Reuti
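
For the PE start-up route, the host loop from the quoted script could be moved into the script named in start_proc_args. The sketch below is one possible shape, not a tested recipe. The helper name remote_hosts is invented; RUNSCRIPT is the variable from the quoted script. Two things differ from the quoted version: hosts are compared by exact short name rather than with fgrep -v (which would also drop node10 when filtering node1), and failure is detected from qrsh's exit status instead of grepping its output, on the assumption that qrsh passes the remote exit status through.

```sh
#!/bin/sh
# Hypothetical sketch for a PE start-up script.
# remote_hosts HOSTFILE LOCAL: print every host listed in the first column
# of HOSTFILE whose short name differs from LOCAL.
remote_hosts() {
    awk -v me="$2" '{ split($1, part, "."); if (part[1] != me) print $1 }' "$1"
}

# Only attempt the loop inside a parallel job, where $PE_HOSTFILE is set.
if [ -n "${PE_HOSTFILE:-}" ] && [ -n "${RUNSCRIPT:-}" ]; then
    for host in $(remote_hosts "$PE_HOSTFILE" "$(hostname -s)"); do
        # Test qrsh's exit status instead of grepping its output.
        if ! qrsh -inherit -V "$host" "$RUNSCRIPT"; then
            echo "preparation failed on $host" >&2
        fi
    done
fi
```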
Kevin Buckley | 14 May 2013 07:10

SGE 8.1.3 Qmaster RPMS have dependencies on Flac, Ogg and PulseAudio ?

Hi there,

I have a prototype of an x86_64 CentOS 6.4 machine and have come to see what
would be needed to go with the SGE 8.1.X series RPMS, rather than the old
6u2.5 series from back in the Sun days.

I grab the RPMS from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.3

I check out the client's installation footprint

-----8<-------------8<-------------8<-------------8<-------------8<-------------8<--------
# yum localinstall gridengine-execd-8.1.3-1.el6.x86_64.rpm gridengine-8.1.3-1.el6.x86_64.rpm
...
Installing:
 gridengine       x86_64 8.1.3-1.el6 /gridengine-8.1.3-1.el6.x86_64        49 M
 gridengine-execd x86_64 8.1.3-1.el6 /gridengine-execd-8.1.3-1.el6.x86_64 3.9 M
Installing for dependencies:
 hwloc            x86_64 1.5-1.el6   base                                 1.4 M
...
-----8<-------------8<-------------8<-------------8<-------------8<-------------8<--------

All seems good.

Now I come to try the qmaster

-----8<-------------8<-------------8<-------------8<-------------8<-------------8<--------
# yum localinstall gridengine-qmaster-8.1.3-1.el6.x86_64.rpm gridengine-8.1.3-1.el6.x86_64.rpm
...
Installing:
 gridengine         x86_64 8.1.3-1.el6 /gridengine-8.1.3-1.el6.x86_64          49 M
 gridengine-qmaster x86_64 8.1.3-1.el6 /gridengine-qmaster-8.1.3-1.el6.x86_64 4.1 M
Installing for dependencies:
 flac               x86_64 1.2.1-6.1.el6                base              243 k
 hwloc              x86_64 1.5-1.el6                    base              1.4 M
 java-1.6.0-openjdk x86_64 1:1.6.0.0-1.61.1.11.11.el6_4 updates            25 M
 jline              noarch 0.9.94-0.8.el6               base               86 k
 libasyncns         x86_64 0.8-1.1.el6                  base               24 k
 libogg             x86_64 2:1.1.4-2.1.el6              base               21 k
 libsndfile         x86_64 1.0.20-5.el6                 base              233 k
 libvorbis          x86_64 1:1.2.3-4.el6_2.1            base              168 k
 perl-XML-Simple    noarch 2.18-6.el6                   base               72 k
 pulseaudio-libs    x86_64 0.9.21-14.el6_3              base              462 k
 rhino              noarch 1.7-0.7.r2.2.el6             base              778 k
 tzdata-java        noarch 2013b-1.el6                  updates           156 k
...
-----8<-------------8<-------------8<-------------8<-------------8<-------------8<--------

FLAC ? Ogg ? PulseAudio ?

So then I read the way the dependencies have developed and it seems that it's
down to the requirement for Java, as satisfied in the CentOS 6.4 distro.

--> Processing Dependency: java >= 1.6.0 for package: gridengine-qmaster-8.1.3-1.el6.x86_64
...
--> Processing Dependency: libpulse.so.0(PULSE_0)(64bit) for package: 1:java-1.6.0-openjdk-1.6.0.0-1.61.1.11.11.el6_4.x86_64

and so on?

However, does an SGE 8 qmaster actually NEED Java, or is it some
functionality that could be split out from the core?

Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
Riccardo Murri | 14 May 2013 01:26

prolog-like script for qrsh

Hello,

is there a prolog-like script that gets executed before a task spawned
by qrsh (as part of a parallel job) is run?  The usual prolog is only
run on the master node at the start of the job, but I'm trying to
intercept instances of `qrsh -V` spawned by OpenMPI's `mpiexec`.

Thanks for any hint!

Riccardo

--
Riccardo Murri
http://www.gc3.uzh.ch/people/rm

Grid Computing Competence Centre
University of Zurich
Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
Tel: +41 44 635 4222
Fax: +41 44 635 6888
