Ken Nielson | 23 Jul 01:07 2016

Ken Nielson leaving Adaptive Computing

Hi all,

I wanted to let everyone know on the list that I am leaving Adaptive Computing as of today, Friday, July 22nd. I have an opportunity with a new company and will be joining them on Monday.

I will still monitor the mailing lists and chime in when I feel I know the answers.

Torque has a great community and I wish you all well.

Thanks

Ken

--

Ken Nielson Sr. Software Engineer
+1 801.717.3700 office    +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300     Provo, UT 84606
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Mike Chen | 22 Jul 19:46 2016

usecp not working after update from Torque 3.0.3 to 6.0.1

Hi,
I've just updated the cluster from Torque 3.0.3 to 6.0.1.
After the update, users started complaining that their output and error files could not be found.
Having no other clue, I enabled SCP from the nodes to the master as a stopgap.
But I would really rather not keep it this way - SSH was disabled deliberately, to keep users from abusing the nodes without submitting jobs.

A test setup was made with VM, with simple master + 1 node.
(CentOS 6.8, configured with:
./configure --prefix=/usr --libdir=/usr/lib64 --includedir=/usr/include/torque \
--with-default-server=master.c6 --with-server-home=/var/spool/torque )
The problem is reproducible on this clean setup.

The job script:
#PBS -l nodes=1:ppn=1
#PBS -N test
#PBS -e test.err
#PBS -o test.out
cd $PBS_O_WORKDIR
hostname
date
pwd
whoami

Looks like the MOM did read the $usecp setting:
(/var/spool/torque/mom_priv/20160723)
07/23/2016 01:16:51.784;02;   pbs_mom.13660;Svr;read_config;processing config line '$usecp  *:/home  /home
07/23/2016 01:16:51.784;02;   pbs_mom.13660;Svr;usecp;*:/home  /home
07/23/2016 01:16:51.785;02;   pbs_mom.13661;n/a;initialize;independent

but the output / error files are kept in the undelivered folder:
(node01:/var/spool/torque/undelivered/2.master.c6.OU)
node01.c6
Sat Jul 23 01:17:50 CST 2016
/home/mike
mike
So the working directory matches the $usecp mapping, yet the files are never copied.
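For completeness, this is the relevant part of the mom config on the node, as I understand the log above (the $pbsserver line is assumed from my setup, not shown in the log):

```
# node01:/var/spool/torque/mom_priv/config (sketch)
$pbsserver master.c6
$usecp *:/home  /home
```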

And we have this familiar message on the server side:
(master:/var/spool/torque/server_logs/20160723)
07/23/2016 01:17:50.626;08;PBS_Server.13458;Job;2.master.c6;on_job_exit valid pjob: 2.master.c6 (substate=50)
07/23/2016 01:17:58.807;13;PBS_Server.13458;Job;handle_stageout;Post job file processing error; job 2.master.c6 on host node01.c6

Did I miss something, or has the usecp behavior changed in the new version?
Any suggestions are appreciated ;)

Mike Chen
Research Assistant
Dept. of Atmospheric Science
National Taiwan University
Ezell, Matthew A. | 21 Jul 15:41 2016

Building RPMs from Adaptive Tarballs

Hello-

Currently the Adaptive tarballs cannot be built into RPMs *directly* using
'rpmbuild -tb'.  The tarball is named
torque-${version}-${date}_${githash}.tar.gz with a matching directory
inside.

The spec file, however, does:
%define tarversion ${version}
..
%setup -n %{name}-%{tarversion}

The build fails because there is no directory called %{name}-%{tarversion}
at the root of the tarball.

Clearly, there are workarounds and other methods for generating RPMs.  But
it would be nice if this worked directly.

We could change the spec file to include the date and githash in the
%setup macro, but I'm unsure whether that would break other build methods.  Is
there any reason the dist tarballs can't just use torque-${version}?
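For illustration, a sketch of what the %setup change might look like (the version and snapshot-suffix values here are hypothetical examples, not the real release macros):

```
%define tarversion 6.0.1
%define snapsuffix 20160721_0123abc
...
%setup -n %{name}-%{tarversion}-%{snapsuffix}
```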

Thanks,
~Matt

---
Matt Ezell
HPC Systems Administrator
Oak Ridge National Laboratory
Mahmoud A. A. Ibrahim | 20 Jul 16:59 2016

qsub under another username

Dear Torque Users
For a specific reason, we would like to use the root account to submit a job under a specific username, without logging in to that user's account, i.e.:
qsub SUBFILE -u username
This command does not work. Any suggestions/tricks would be highly appreciated.
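Two things we have been considering from the root account, both sketches rather than confirmed solutions (the -P proxy-user flag appears in recent TORQUE qsub documentation and needs manager privileges on the server):

```shell
# 1) run the submission itself as the target user
su - username -c "qsub SUBFILE"

# 2) TORQUE's proxy-user submission (requires manager/operator rights)
qsub -P username SUBFILE
```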
Sincerely
M. Ibrahim

--
Mahmoud A. A. Ibrahim
Head of CompChem Lab, Chemistry Department,
Faculty of Science, Minia University, Minia 61519, Egypt.
Email: m.ibrahim <at> compchem.net
            m.ibrahim <at> mu.edu.eg
Website: www.compchem.net
Henrik Skibbe | 14 Jul 12:46 2016

memory limit

Hi,

we have nodes with 1 TB of memory each. The problem is that I can simultaneously start several shells
on the same node with
 qsub -I -V -q bigmem -l vmem=800gb,pmem=800gb,mem=800gb,pvmem=800gb

This makes little sense, since the node only provides enough memory for one such job (in this example).

Any idea how I can ensure that the amount of requested memory does not exceed the remaining
"unreserved" physical memory of the node?
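If Maui is the scheduler here, one setting sometimes suggested for this is NODEAVAILABILITYPOLICY, which makes Maui count memory as dedicated by request rather than by current utilization. A sketch only; please verify the parameter against your Maui version:

```
# maui.cfg (sketch)
NODEAVAILABILITYPOLICY DEDICATED:MEM
```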


Thank you,

Henrik


Henrik Skibbe | 10 Jul 15:00 2016

Torque 6.0.1 + cgroups

Hi

I have successfully set up a torque server
and pbs_mom clients on several nodes.
Everything works perfectly when using cpuset. However,
when I try to use cgroups instead of cpuset, every job is immediately canceled.

Any help/hint is highly appreciated

Thank you,
Henrik


This is what happens on the node (syslog, here from node knv01):

knv01 pbs_mom: LOG_ERROR::Permission denied (13) in trq_cg_set_swap_memory_limit, Could not open /sys/fs/cgroup/memory/torque/2282.mrMACHINE.kuins.net/memory.memsw.limit_in_bytes Kernel may not have been built with CONFIG_MEMCG_SWAP and CONFIG_MEMCG_SWAP_ENABLED
Jul 10 19:24:12 knv01 pbs_mom: LOG_ERROR::set_job_cgroup_memory_limits, Could not set swap memory limits for 2282.mrMACHINE.kuins.net.
Jul 10 19:24:12 knv01 pbs_mom: LOG_ERROR::Connection refused (111) in start_interactive_session, cannot open interactive qsub socket to host mrMACHINE.kuins.net:36108 - 'Failed to connect to mrMACHINE.kuins.net at address 10.232.11.92:36108' - check routing tables/multi-homed host issues


Cgroups are mounted and torque seems to be aware of the directory:


lssubsys -am
cpuset /sys/fs/cgroup/cpuset
cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct
blkio /sys/fs/cgroup/blkio
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb
pids /sys/fs/cgroup/pids



ls /sys/fs/cgroup/memory/torque/
cgroup.clone_children memory.kmem.failcnt memory.kmem.tcp.limit_in_bytes memory.max_usage_in_bytes memory.soft_limit_in_bytes notify_on_release
cgroup.event_control memory.kmem.limit_in_bytes memory.kmem.tcp.max_usage_in_bytes memory.move_charge_at_immigrate memory.stat tasks
cgroup.procs memory.kmem.max_usage_in_bytes memory.kmem.tcp.usage_in_bytes memory.numa_stat memory.swappiness
memory.failcnt memory.kmem.slabinfo memory.kmem.usage_in_bytes memory.oom_control memory.usage_in_bytes
memory.force_empty memory.kmem.tcp.failcnt memory.limit_in_bytes memory.pressure_level memory.use_hierarchy
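Note that the listing above shows no memory.memsw.* files, which matches the first LOG_ERROR: the kernel is not accounting swap in the memory cgroup. A diagnostic sketch (booting with swapaccount=1 is the usual way to enable it when CONFIG_MEMCG_SWAP is built in but disabled by default):

```shell
# check whether swap accounting is active in the memory cgroup;
# if the file is absent, pbs_mom cannot set memsw limits
ls /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes 2>/dev/null \
  || echo "memsw files missing - try booting with swapaccount=1"
```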
Mahmood Naderan | 6 Jul 17:05 2016

showq wrongly counts active processors

Hi,
The problem with the 'showq' command is that it doesn't correctly count multithreaded jobs in the list. For example, a job has two threads, so the 'top' command shows


23286 mahmood  20   0 17.3g  13g 6776 R 203.2 21.5  17920:10 l1002.exe

So, 203.2% means two cores are utilized. However, 'showq' says


283                mahmood    Running     1 99:04:00:22  Tue Jul  5 23:29:02
     4 Active Jobs       1 of  128 Processors Active (3.12%)
                         1 of    4 Nodes Active      (75.00%)

But that is not correct: it should show 2 of 128 processors active.

Is there any switch in the configuration for this purpose?
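As far as I can tell, showq reports the processors a job requested, not what it actually uses, so the count only matches reality if the job asks for as many cores as it runs threads. A sketch of that workaround (script name and ppn value are examples):

```shell
# request two cores for a two-thread job so the accounting matches usage
qsub -l nodes=1:ppn=2 job.sh
```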
 
Regards,
Mahmood
Shepherd, Melissa D | 29 Jun 19:01 2016

LBNL Node Health Check in physical cluster - question

Greetings,

 

I'm new to the management, architecture, and configuration of TORQUE. I don't see anything in the list archives that clarifies this issue for me, and would appreciate guidance.

 

The cluster I am supporting has physical machines (1 blade/host=1 compute node) all with RHEL OS (6.x).

The TORQUE client (pbs_mom) runs on each compute node and communicates with one head node running the TORQUE server (pbs_server).

 

I've installed NHC from an .rpm on some compute nodes, and tested that the scripts correctly detect issues on the node. However, the helper scripts run from the nodes--how can the node-mark-offline & node-mark-online scripts invoke 'pbsnodes' when that lives on the head node?
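From the docs, it looks like the pbsnodes client can also be installed on the compute nodes themselves and pointed at the head node with -s, so the mark-offline/online helpers could run locally. Is something like this the intended pattern? (hostnames below are examples):

```shell
# run from a compute node, talking to the head node's pbs_server
pbsnodes -s headnode -o node01 -N "NHC: marked offline"
pbsnodes -s headnode -c node01 -N "NHC: healthy again"
```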

 

Thanks,

Mel 

 




Christopher Wirth | 29 Jun 17:10 2016

Are multiple array dependencies possible?

Hi,

Another question about array dependencies: If I have array jobs A, B, and C, is it possible to make C wait
until both A and B have completed ok?

In the documentation I can see how to do multiple non-array dependencies, using a colon-separated list of
job IDs:

afterok:jobid[:jobid…]

However, this doesn’t appear to be an option for afterokarray:

afterokarray:arrayid[count]

A quick test, just in case it was implemented but not documented, suggests the docs are accurate:
array job C started when array job A finished, while array job B was still running.

Is there a way to make this dual-array-dependency happen?
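In case it's useful context, the workaround I'm considering is to bridge each array through a dummy non-array job, since afterok does accept a colon-separated list even though afterokarray apparently does not. A sketch, untested (script names and array ranges are examples):

```shell
A=$(qsub -t 1-10 a.sh)
B=$(qsub -t 1-10 b.sh)
# dummy jobs that complete only when each whole array completes ok
DA=$(echo /bin/true | qsub -W depend=afterokarray:$A)
DB=$(echo /bin/true | qsub -W depend=afterokarray:$B)
# C waits on both dummies via the documented colon-separated afterok
qsub -W depend=afterok:$DA:$DB c.sh
```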

Many thanks,

Chris
Skip Montanaro | 27 Jun 21:38 2016

Can't send email from a job?

I'm trying to get myself reacquainted with Torque after a several year
hiatus. I've got pbs_server and maui running on server1, and pbs_mom
running on server1 and server2. This simpleton command works just
fine:

echo "echo hello world > /home/skipm/torque.out" | qsub

creating the desired file in my home directory. This command completes
with an error, however:

echo "echo hello world | mailx -s test user <at> host.domain" | qsub

06/27/2016 13:59:26.074;13;PBS_Server.109057;Job;6.server1;Not sending
email: User does not want mail of this type.

I did not use --with-sendmail=... when configuring, so it set the
SENDMAIL_CMD macro/variable to /usr/lib/sendmail. I'm not sure what
effect that might have had on this failed command. Mailx is in
/usr/bin/mailx, which should be in PATH.
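If I understand the message, it refers to TORQUE's own job-status mail (controlled per job by qsub's -m and -M flags), which is separate from whether mailx runs inside the job script. A sketch of requesting job mail explicitly (the address is an example):

```shell
# ask pbs_server to mail on job abort and end, to an explicit address
echo "hostname" | qsub -m ae -M skipm@example.com
```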

Searching for the error message yielded no useful hits. Your guidance
would be appreciated.

Thanks,

Skip Montanaro
