David Beer | 27 Aug 18:27 2014

CPU Utilization in TORQUE

All,

The other day someone internally mentioned the possibility of adding a CPU utilization number to TORQUE (essentially the same CPU utilization number Windows provides). I know that a large portion (if not a supermajority) of the community comes from a Linux background, but I wanted to ask whether providing this information would be interesting to the community in general.

If we decide to add CPU utilization, it would be another attr=value pair added to the MOM status information.

--
David Beer | Senior Software Engineer
Adaptive Computing
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
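
For readers on Linux, a comparable utilization figure can be derived from two samples of the aggregate "cpu" line in /proc/stat. The following is only a sketch of that arithmetic, not a description of how TORQUE would compute or report the value; the function name cpu_util_percent is made up for illustration.

```sh
# Compute CPU utilization (integer percent) from two samples of the
# aggregate "cpu" line of /proc/stat (hypothetical helper, not part of TORQUE).
# Fields after "cpu": user nice system idle iowait irq softirq steal ...
cpu_util_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal only; guest time is already counted in user.
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            idle = $5 + $6            # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print 0; exit }
            printf "%d\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Typical use on a Linux host (not executed here):
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   cpu_util_percent "$a" "$b"
```

The result is the share of non-idle time between the two samples, which is roughly what a single "CPU utilization" attr=value pair would convey.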
Erica Riello | 27 Aug 14:01 2014

how to use pbs-drmaa library

Hi all,

I've successfully installed the Poznan DRMAA library for Torque/PBS. Now I'm
having trouble compiling examples that use this library.

I'd like to know how I should compile an example C program that uses DRMAA.

Thanks in advance.

The example is very simple:

[torquepbs:~/Documents/testes_pbs_drmaa] more teste0.c
#include <stdio.h>
#include <stdlib.h>
#include "drmaa.h"

int main (void)
{
	char s[DRMAA_CONTACT_BUFFER];
	char e[DRMAA_ERROR_STRING_BUFFER];
	if (drmaa_get_DRM_system(s, DRMAA_CONTACT_BUFFER, e,
			DRMAA_ERROR_STRING_BUFFER) != DRMAA_ERRNO_SUCCESS)
	{
		fprintf (stderr, "Could not get the DRM system: %s\n", e);
		return 1;
	}
	printf("Available DRM_systems: %s\n", s);
	return 0;
}

But compiling it with -ldrmaa gives me:

[torquepbs:~/Documents/testes_pbs_drmaa] gcc teste0.c -o teste0 -ldrmaa
//usr/local/lib/libdrmaa.a(libdrmaa_utils_la-exception.o): In function
`fsd_exc_init':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/exception.c:80:
undefined reference to `pthread_key_create'
//usr/local/lib/libdrmaa.a(libdrmaa_utils_la-exception.o): In function
`fsd_exc_get_stack':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/exception.c:101:
undefined reference to `pthread_once'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/exception.c:109:
undefined reference to `pthread_getspecific'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/exception.c:116:
undefined reference to `pthread_setspecific'
//usr/local/lib/libdrmaa.a(libdrmaa_utils_la-thread.o): In function
`fsd_thread_create':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/thread.c:58:
undefined reference to `pthread_create'
//usr/local/lib/libdrmaa.a(libdrmaa_utils_la-thread.o): In function
`fsd_thread_join':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/thread.c:67:
undefined reference to `pthread_join'
//usr/local/lib/libdrmaa.a(libdrmaa_utils_la-thread.o): In function
`fsd_thread_detach':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/thread.c:76:
undefined reference to `pthread_detach'
//usr/local/lib/libdrmaa.a(libdrmaa_utils_la-thread.o): In function
`fsd_mutex_init':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/thread.c:90:
undefined reference to `pthread_mutexattr_init'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/thread.c:92:
undefined reference to `pthread_mutexattr_settype'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/thread.c:96:
undefined reference to `pthread_mutexattr_destroy'
//usr/local/lib/libdrmaa.a(libdrmaa_utils_la-thread.o): In function
`fsd_mutex_trylock':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/drmaa_utils/drmaa_utils/thread.c:135:
undefined reference to `pthread_mutex_trylock'
//usr/local/lib/libdrmaa.a(util.o): In function `pbsdrmaa_exc_raise_pbs':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/util.c:123:
undefined reference to `pbs_errno'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/util.c:126:
undefined reference to `pbse_to_txt'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/util.c:133:
undefined reference to `pbs_geterrmsg'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function `check_reconnect':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:581:
undefined reference to `pbs_disconnect'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:589:
undefined reference to `pbs_connect'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function `pbsdrmaa_pbs_holdjob':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:536:
undefined reference to `pbs_errno'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:529:
undefined reference to `pbs_holdjob'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:535:
undefined reference to `pbs_errno'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function `pbsdrmaa_pbs_rlsjob':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:485:
undefined reference to `pbs_errno'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:478:
undefined reference to `pbs_rlsjob'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:484:
undefined reference to `pbs_errno'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function `pbsdrmaa_pbs_deljob':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:433:
undefined reference to `pbs_errno'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:427:
undefined reference to `pbs_deljob'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function `pbsdrmaa_pbs_sigjob':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:382:
undefined reference to `pbs_errno'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:375:
undefined reference to `pbs_sigjob'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:381:
undefined reference to `pbs_errno'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function `pbsdrmaa_pbs_statjob':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:311:
undefined reference to `pbs_errno'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:307:
undefined reference to `pbs_statjob'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:333:
undefined reference to `pbs_statfree'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function `pbsdrmaa_pbs_submit':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:260:
undefined reference to `pbs_errno'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:252:
undefined reference to `pbs_submit'
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:259:
undefined reference to `pbs_errno'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function `pbsdrmaa_pbs_conn_new':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:116:
undefined reference to `pbs_disconnect'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function
`pbsdrmaa_pbs_conn_destroy':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:141:
undefined reference to `pbs_disconnect'
//usr/local/lib/libdrmaa.a(pbs_conn.o): In function
`pbsdrmaa_pbs_statjob_free':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/pbs_conn.c:355:
undefined reference to `pbs_statfree'
collect2: error: ld returned 1 exit status

That's strange, but the missing functions are easy to recognize: they come
from the PBS and pthread libraries. So I tried again with -ldrmaa -lpbs and
-pthread:

[torquepbs:~/Documents/testes_pbs_drmaa] gcc teste0.c -o teste0 -ldrmaa -lpbs -pthread
//usr/local/lib/libdrmaa.a(util.o): In function `pbsdrmaa_exc_raise_pbs':
/home/msv/ericaflr/Downloads/pbs-drmaa-1.0.17_/pbs_drmaa/util.c:126:
undefined reference to `pbse_to_txt'
collect2: error: ld returned 1 exit status

'pbse_to_txt' is declared in pbs_error.h, which is part of the PBS library,
so it shouldn't be producing this error now.

Regards,

Erica.
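
A header only declares a function; the linker still needs a library that defines it. One way to narrow this down is to ask nm which installed library actually defines pbse_to_txt. The sketch below assumes GNU nm is available; the helper name defines_symbol and the library paths in the example call are guesses, not something confirmed in the post.

```sh
# List which of the given library files define a symbol.
# Usage: defines_symbol SYMBOL LIBFILE...
# (hypothetical helper; relies on GNU nm's --defined-only option)
defines_symbol() {
    sym=$1
    shift
    for lib in "$@"; do
        [ -r "$lib" ] || continue   # skip libraries not present on this host
        # Plain nm covers static archives; nm -D covers shared objects.
        if { nm --defined-only "$lib"; nm -D --defined-only "$lib"; } 2>/dev/null |
            awk -v s="$sym" '$NF == s { found = 1 } END { exit !found }'; then
            printf '%s\n' "$lib"
        fi
    done
}

# Example call; the paths are assumptions for a /usr/local install:
#   defines_symbol pbse_to_txt /usr/local/lib/libtorque.so /usr/local/lib/libpbs.a
```

Whichever library is printed is the one to add after -ldrmaa on the gcc command line; if nothing is printed, none of the listed libraries provides the symbol.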
Taras Shapovalov | 26 Aug 17:14 2014

Torque on SLES with Xeon Phi

Hi guys,

I have Torque 4.2.6.1 with Xeon Phi support running on SLES SP3. For some reason the MOM fails to get the MIC info:

08/26/2014 17:09:36;0001;   pbs_mom.90006;Svr;pbs_mom;LOG_ERROR::add_mic_status, Can't get handle for mic index 0

Does anybody have an idea what it could be? On a RHEL 6.5 cluster, the same Torque built with the same configure flags works just fine.

Best regards,

Taras
jackie | 20 Aug 09:21 2014

ICETC2014 - IEEE Extended Submission until Aug. 28, 2014

		   Apologies for cross-posting.
          Kindly forward to those who may be interested.
=======================================================================
  International Conference on Education Technologies and Computers
			   (ICETC2014)
	  Lodz University of Technology, Lodz, Poland
	 	     September 22-24, 2014

	  http://sdiwc.net/conferences/2014/icetc2014

The conference is technically co-sponsored by IEEE Poland Section. All
registered papers will be submitted to IEEE for potential inclusion
to IEEE Xplore as well as other Abstracting and Indexing (A&I) 
databases.

Paper submission has been extended until August 28, 2014. For more
details and updates, please visit the conference website or email us at
icetc <at> sdiwc.net
=======================================================================
** T H A N K  Y O U  A N D  H O P E  T O  S E E  Y O U  T H E R E **
Gareth.Williams | 26 Aug 01:35 2014

pbsdsh repeated invocation bug

I see a problem with repeated invocation of pbsdsh. It first came up when I was trying to run one test
process on each allocated node of a job in succession, using a loop to serialise access.

In a two node interactive batch job (each node has 16 cores), this works:
for node in `uniq $PBS_NODEFILE`; do pbsdsh -c 1 -h $node uname -a; done

But this (which runs one process per core) hangs after the first few executions on the remote node:
for node in `cat $PBS_NODEFILE`; do pbsdsh -c 1 -h $node uname -a; done
It hangs, and ^C gives:
pbsdsh(): error from tm_poll() 17002

Immediate further invocations of pbsdsh are likely to fail, but it can recover. The problem seems to be
related to the number of pbsdsh invocations: it stops working after some number of them over a short
period (not really short, since I was typing the commands).

This is not critical but it would be good to have it understood and fixed.

Can others confirm whether they see the same behaviour?

We are running v. 4.2.7

Gareth

~> qsub -I -l nodes=2:ppn=16
qsub: waiting for job 
-snip-
n045:~> for node in `uniq $PBS_NODEFILE`; do pbsdsh -c 1 -h $node hostname; done
n045
n048
n045:~> for node in `cat $PBS_NODEFILE`; do pbsdsh -c 1 -h $node hostname; done
n045
n045
n045
n045
n045
n045
n045
n045
n045
n045
n045
n045
n045
n045
n045
n045
n048
n048
n048
n048
n048
n048
n048
n048
-hangs here-
^Cpbsdsh(): error from tm_poll() 17002
^Cpbsdsh(): error from tm_poll() 17002
^C
n045:~>
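
The difference between the two loops is only the number of pbsdsh invocations: $PBS_NODEFILE holds one line per allocated core, so `uniq` yields one invocation per node while `cat` yields one per core. A dry-run sketch that prints the commands instead of running them makes the counts easy to check; plan_pbsdsh and its mode names are invented for this illustration.

```sh
# Print the pbsdsh commands a loop would run, without running them.
# Usage: plan_pbsdsh NODEFILE once-per-node|once-per-core
# (hypothetical dry-run helper; it never calls pbsdsh itself)
plan_pbsdsh() {
    nodefile=$1
    mode=$2
    case $mode in
        once-per-node) nodes=$(uniq "$nodefile") ;;  # one entry per host
        once-per-core) nodes=$(cat "$nodefile") ;;   # one entry per core
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    for node in $nodes; do
        echo "pbsdsh -c 1 -h $node hostname"
    done
}

# Inside a job (not executed here):
#   plan_pbsdsh "$PBS_NODEFILE" once-per-core | wc -l
```

For the nodes=2:ppn=16 job above, the first mode plans 2 invocations and the second 32, while the transcript shows the hang after 24 of them had completed.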
Gene Wolfe | 25 Aug 22:42 2014

jobs node packing

Hello Folks,

I am running Torque 4.2.6.1 and I have node_pack set to False. I have also set it as false. I was getting jobs to start in a round-robin fashion around the cluster, just the way I like, but then IBM came in and did something to the network/cluster, and now, no matter what I set, all jobs are getting node packed. I even tried Maui, with full debugging turned on to verify the settings, and that just node packs as well.

# Set server attributes.
#
set server scheduling = True
set server acl_hosts = icpcpshnd001
set server default_queue = parallel
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 30
set server node_check_rate = 150
set server tcp_timeout = 900
set server node_pack = False
set server job_stat_rate = 120
set server poll_jobs = True
set server mom_job_sync = True
set server next_job_number = 2926
set server moab_array_compatible = True
set server nppcu = 1

I cannot work out why this is happening. I can debug the code, but I am not sure where the scheduler supplies the list of nodes to the server. Any ideas on this?

-- Gene.
_________________________________________________
Gene (The Machine) Wolfe

email: eugene.wolfe <at> gmail.com email: Gene.Wolfe <at> halliburton.com
_________________________________________________
Jack Wilkinson | 25 Aug 22:18 2014

/var/lib/torque/undelivered

Good afternoon folks,

I recently got an email from a task that failed and it mentioned:

Unable to copy file /var/lib/torque/spool/52464.srvXXXXXX01.sssseeee.com.OU to /BBBB/CCCC/merge/YY/ZZ/2014/08/20/FS8K140253/FS8K140254_stdout_002.txt
*** error from copy
/bin/cp: cannot create regular file
`/BBBB/CCCC/merge/YY/ZZ/2014/08/20/FS8K140253/FS8K140254_stdout_002.txt': No such file or directory
*** end error output
Output retained on that host in: /var/lib/torque/undelivered/52464.srvXXXXXX01.sssseeee.com.OU

(Sorry about the obvious masking of names.  Legal is freaking paranoid around here.)

I decided to take a look at /var/lib/torque/undelivered. It had 28,558 files in it, going back to 2012. I took
a look at that directory on all of our boxes and found pretty much the same.

Does anyone have any ideas to toss out? I'm getting ready to run a job across all the boxes and empty those directories.

Thoughts appreciated and have a good afternoon!
jack

Jack Wilkinson, Programmer
Services | VPay®
P: 972.367-6622
jwilkinson <at> stoneeagle.com
www.stoneeagle.com
www.vpayusa.com

111 W. Spring Valley Rd., #100
Richardson, TX 75081
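
Before emptying the directories, it may be worth listing what would go and keeping recent files in case someone still needs their output. A minimal sketch, assuming a standard find is available; the helper name list_old_undelivered and the 90-day cutoff are arbitrary choices for illustration.

```sh
# List regular files under DIR last modified more than DAYS days ago.
# Usage: list_old_undelivered DIR DAYS
# (hypothetical helper; it only lists, it does not delete)
list_old_undelivered() {
    find "$1" -type f -mtime +"$2" -print
}

# Review the list first (not executed here):
#   list_old_undelivered /var/lib/torque/undelivered 90 | wc -l
# Then, if the list looks right, remove the same set:
#   find /var/lib/torque/undelivered -type f -mtime +90 -exec rm -f {} +
```

Running the listing step on one box first gives a sense of how much is stale before pushing the delete out to every host.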

Ken Nielson | 25 Aug 19:49 2014

pbsdsh survey

Hi all,

I am wondering how many users out there use pbsdsh regularly in production.

Ken

--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
Mahmood N | 25 Aug 11:52 2014

Need help for a scheduler policy

Hello,
I need to implement the following policy. Please let me know how this can be done with Torque v3.

Assume we have 20 idle cores. User1 submits 20 jobs and they start to run. Now User2 tries to submit 2 jobs, but
there are no free cores. The scheduler should delete the last two jobs of User1 (by timestamp) to free two
cores for User2.

How can this be done?

Regards,
Mahmood
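
The selection step of the policy described above (pick User1's most recent jobs) can be sketched on its own. The input format, one "jobid start_epoch" pair per line, and the name newest_jobs are assumptions made for this illustration; how the job list is gathered and whether a scheduler can be configured to act on it are not covered here.

```sh
# Read "jobid start_epoch" lines on stdin and print the COUNT most
# recently started job ids, newest first.
# Usage: newest_jobs COUNT < joblist
# (hypothetical helper; the input format is an assumption)
newest_jobs() {
    sort -k2,2nr | head -n "$1" | awk '{ print $1 }'
}

# Example with made-up job ids and timestamps:
#   printf '101 1000\n102 3000\n103 2000\n' | newest_jobs 2
```

The ids printed are the candidates that would be handed to qdel to free cores for the second user.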
Eva Hocks | 21 Aug 22:30 2014

ALERT Resource_List.neednodes


torque version  4.2.6.h1, maui 3.3.1

The following qsub command or batch script without #PBS -l nodes....

$ qsub -I -q hotel

runs the job with the configuration:

Resource_List.neednodes=hotel-node

the torque queue "hotel" is set up with

Queue: hotel
    resources_default.neednodes = hotel-node

which causes an ALERT message in Maui:

ALERT:    cannot locate host 'hotel-node' for job hostlist

It looks like Torque adds the queue's default settings to the user's
batch script if no resources are specified. It should be adding the
default nodes=1:ppn=1.

Is this problem fixed in another release?

Thanks
Eva
Erica Riello | 19 Aug 18:03 2014

drmaa c library

Hi all,

Does anyone use a C DRMAA library with Torque? I've been trying to use
the one available through apt-get, but it does not work.

Thanks in advance.

Regards,

Erica.
