Haifa Ftirich | 30 Jul 11:38 2015

pbsnodes marking nodes as down when they are actually not running jobs

Hey,
I am halfway through setting up a Galaxy server that works with a cluster for computation. I've installed Torque on the Galaxy server and configured it to communicate with the nodes (I followed the instructions in the Torque admin guide).
I am using version 5.0.2; pbs_mom is running on both nodes and pbs_server is up on the server. The machines ping each other without problems, and I've set up passwordless SSH in both directions.
Each node has one processor; can that be the problem?
Please suggest what to do, because I really can't find a way to make it detect the nodes as free.
Thank you.
Sincerely,

Haïfa Ftirich
about.me/haifaftirich
_______________________________________________
torqueusers mailing list
torqueusers <at> supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
Michael Durket | 30 Jul 16:57 2015

Email flood on job cancel attempt when MOM cannot be reached

I'm running very old versions of Torque (2.1.6) and Maui (3.2.6p11). They work just fine 24x7 with only
occasional issues and we've never needed to upgrade. One of those occasional issues occurs when a MOM dies
and (some time later) Maui notices that it needs to cancel a job that was running on that MOM because a limit
(wall clock usually) has been exceeded. Maui notifies Torque to cancel the job, Torque tries to contact
the MOM but fails. Torque reports to Maui that the job cannot be cancelled and an email message is sent out.
On the next Maui cycle this process repeats and another email is generated, and so on, resulting in an email flood.

Has this issue ever been fixed in any subsequent release of either Torque or Maui (either to suppress
multiple emails about the same issue, or to recognize that when a MOM goes down, you don't actually have any
control over the jobs that are/were running on the affected node and so shouldn't try until the MOM comes
back up)?
W. Graham McCullough | 29 Jul 22:48 2015

A crazy question regarding hyperthreading with torque and moab

Hello all,

Recently, a colleague of mine figured out he could toggle hyperthreading (per core) on the fly with a script. We've been thinking about experimenting with allowing users to request a job that would have hyperthreading turned on. This is all pretty easy to do in config, but we're not sure it's advisable, for a couple of reasons: cpusets come to mind, and so does Moab licensing. Would this knock the Moab server offline if a job suddenly doubled the number of cores available on a node?

Honestly, we're not really sure if there is much use for such a feature if we were to roll it out. Has anyone else dealt with a similar feature?

Best,

Graham McCullough
HPC System Administrator
Purdue University
Mahmoud A. A. Ibrahim | 24 Jul 22:05 2015

Multicore binding strategy by torque

Dear Torque users,
We are having trouble improving the performance of our cluster.
Based on our limited experience, we would expect performance to improve if we employed a multi-core binding strategy.
Is there any way to ask Torque to manage this?
We use the Torque package based on Rocks 6.1.1.
Any kind of support will be highly appreciated.
Thanks in advance.
Sincerely,
M. Ibrahim
Conference Alert | 23 Jul 11:46 2015

CfP: ICGCTI2015 - Malaysia

The Third International Conference on Green Computing, Technology and 
Innovation (ICGCTI2015)

Universiti Putra Malaysia, Selangor, Malaysia
8-10 December 2015
http://sdiwc.net/conferences/icgcti2015/

The proposed conference on the above theme will be held over three days, 
with presentations delivered by researchers from the international 
community, including presentations from keynote speakers and 
state-of-the-art lectures. ICGCTI2015 aims to enable researchers to build 
connections between different digital applications.

The conference welcomes papers on the following (but not limited to) 
research topics:

- Benefits of, and barriers to, adopting greener IT practices
- Carbon metering and user feedback
- Climate and ecosystem monitoring
- Energy harvesting, storage, and recycling
- Energy-aware high performance computing and applications
- Energy-aware software
- Energy-efficient network services and operations
- Green IT metrics, maturity models, standards, and regulations
- Green computing models, methodologies and paradigms
- Green networking and communication
- Life-cycle analysis of IT equipment
- Management and profiling tools for energy efficient systems
- Modeling-representations, simulation and validation for energy 
consumption optimization problems
- Online dynamic optimization for energy efficient systems
- Power-aware algorithms and protocols
- Power-efficient delivery and cooling
- Renewable energy models and prediction
- Smart buildings and urban development
- Smart homes, buildings, offices, streets
- Stability of smart energy systems
- Using IT to reduce carbon emissions
- Carbon management policies and ecology-related issues with ICT
- Characterization, metrics, and modeling
- Creating green awareness using IT
- Energy-aware computing
- Energy-aware large scale distributed systems, such as Grids, Clouds 
and service computing
- Energy-efficient mass data storage and processing
- Governments’ roles in fostering and enforcing green initiatives
- Green business process reengineering and management
- Green design, manufacture, use, disposal, and recycling of computers 
and communication systems
- Green software engineering
- Low-power electronics and systems
- Matching energy supply and demand
- Network design optimization
- Optimization of energy-efficient protocols
- Power-aware software and hardware
- Reliability, thermal behavior and control
- Robustness and performance guarantees
- Smart grid and microgrids
- Smart transportation and manufacturing
- Sustainable computing

Researchers are encouraged to submit their work electronically. All 
papers will be fully refereed by a minimum of two specialized referees. 
Before final acceptance, all referees' comments must be considered.

Important Dates
==============
Submission Deadline	: November 8, 2015
Notification of Acceptance : 2-4 weeks from the submission date
Final Notification	: November 20, 2015
Camera Ready Deadline	: November 28, 2015
Registration Deadline	: November 28, 2015
Conference Dates	: December 8-10, 2015

Drop us an email at icgcti15 <at> sdiwc.net

Angelos Ching | 23 Jul 10:34 2015

Proper way to configure PAM?

Dear fellows,

We're trying to configure the PAM module for Torque. We have the following setup:
  • User accounts are authenticated using NIS
  • Admin users are authenticated using an NIS group, e.g. hpcadmin
What I'm trying to achieve:
  • Users in the hpcadmin group can always ssh into any node
  • Other users can ssh into a node only when they have a task assigned to that node in Torque
The Torque 5.0.0 admin guide gives the following example:
/etc/pam.d/sshd
account required pam_pbssimpleauth.so
account required pam_access.so
/etc/security/access.conf
-:ALL EXCEPT root george allen michael torque:ALL
I'm new to this and find it a little confusing. The example states that george will not be able to log in when he does not have a job on the node. But as far as I can understand, this access.conf will also block every user who is not listed in it? Say, if I have a user alice, she won't be able to log in no matter what?

After finally wrapping my head around this access.conf and pam.d stuff, I came up with the following configuration. I've tested it and it seems to work, but I would like some advice on whether this is a valid configuration and whether it raises any security concerns:
/etc/pam.d/sshd (Appended at the end)
account sufficient pam_access.so
account required pam_pbssimpleauth.so
/etc/security/access.conf
+:root wheel hpcadmin:ALL
Cheers,
Angelos
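
The control flow of the proposed account stack above can be sketched as follows. This is a simplified model, assuming only the standard meaning of the `sufficient` and `required` control flags; it does not run PAM, and the names `evaluate_stack` and `PROPOSED_STACK` are hypothetical. In particular, whether pam_access.so returns failure for a user who matches no rule in access.conf is an input to this model, not something it establishes, and should be checked against the pam_access documentation.

```python
SUCCESS, FAILURE = "success", "failure"

# The stack from the proposed /etc/pam.d/sshd, in order.
PROPOSED_STACK = [
    ("sufficient", "pam_access.so"),
    ("required", "pam_pbssimpleauth.so"),
]


def evaluate_stack(stack, module_results):
    """Combine per-module results using 'sufficient' and 'required' only.

    stack: list of (control_flag, module_name) in file order.
    module_results: dict mapping module_name to SUCCESS or FAILURE.
    """
    required_failed = False
    any_success = False
    for control, module in stack:
        result = module_results[module]
        if control == "sufficient":
            if result == SUCCESS and not required_failed:
                return SUCCESS  # later modules are not consulted
            # A failing 'sufficient' module is ignored.
        elif control == "required":
            if result == SUCCESS:
                any_success = True
            else:
                required_failed = True  # stack continues but will fail
        else:
            raise ValueError(f"control flag not modeled: {control}")
    return SUCCESS if any_success and not required_failed else FAILURE
```

Under this model, a user for whom pam_access.so succeeds is admitted regardless of jobs, and any other user is admitted only if pam_pbssimpleauth.so succeeds. The configuration therefore behaves as intended only if pam_access.so fails for users outside root, wheel and hpcadmin.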
Hossain, Zahid | 21 Jul 01:59 2015

PBS/Torque notification

Hi, I am just wondering if there is a way to set up notifications to be sent over a socket or web service instead
of email. Basically, I am trying to use a cluster to run jobs that are submitted automatically by a bunch of other
computers, but each of these computers needs to know the status of the jobs that it
submitted. How do people usually set that up?
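
One possible approach, not necessarily what others do, is for each submitting machine to poll the server and read the job state itself. The sketch below assumes the plain-text layout of `qstat -f` output (a `Job Id:` line followed by indented `name = value` attribute lines, with tab-indented continuation lines). The function names and the sample text are hypothetical.

```python
import subprocess

# Hypothetical sample in the assumed `qstat -f` layout.
SAMPLE = """Job Id: 101.headnode
    Job_Name = align
    job_state = R
    queue = batch

Job Id: 102.headnode
    Job_Name = assemble
    job_state = Q
    queue = batch
"""


def parse_qstat_full(text):
    """Parse `qstat -f`-style text into {job_id: {attribute: value}}."""
    jobs = {}
    current = None
    last_key = None
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            job_id = line.split(":", 1)[1].strip()
            current = jobs.setdefault(job_id, {})
            last_key = None
        elif current is None:
            continue  # skip anything before the first job header
        elif line.startswith("\t"):
            # Assumed continuation of the previous attribute value.
            if last_key is not None:
                current[last_key] += line.strip()
        elif " = " in line:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def job_states(text):
    """Map each job id to its job_state attribute (None if absent)."""
    return {job_id: attrs.get("job_state")
            for job_id, attrs in parse_qstat_full(text).items()}


def fetch_job_states():
    """Run `qstat -f` on a machine with the Torque client tools installed."""
    result = subprocess.run(["qstat", "-f"], capture_output=True,
                            text=True, check=True)
    return job_states(result.stdout)
```

A submitting machine could call `fetch_job_states()` on a timer and forward any state change over whatever socket or web service it already uses.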
Zhang,Jun | 16 Jul 23:46 2015

pbs_mom process owner is not root, but the current job runner on the node

Right after I restart pbs_mom on the compute node, the process is owned by root; a few seconds later, the ownership changes to the user who has a job running on the node. As a result, there are a lot of “Permission denied” or “cannot bind” messages in the log file. What could be causing this, and how can I correct it?

Maui version 3.3.1
pbs-config --version
4.2.6.1

Jun

Li, Yonghui | 15 Jul 19:19 2015

Update: About CPU frequency not correctly generated in mom_priv/BaseCpuFrequency.xml

Dear Torque users,

I figured out that the problem is CPU dependent. On the machines with an i7-3930K CPU, there is no problem. But on the old machines with a Core 2 Quad Q6600, I got the problem I mentioned before.

So far I can submit jobs to Torque, and they run fine as long as pbs_mom is running on the worker nodes.

To restart, I manually delete BaseCpuFrequency.xml and then restart the pbs_mom service.

It seems the CPU frequency handling was added to Torque in version 5? I am not familiar with this new feature. What may happen if Torque fails to figure out the CPU information?

Any discussion is welcome and any suggestion will be appreciated.

Thanks,

Yonghui

Li, Yonghui | 14 Jul 04:38 2015

About CPU frequency not correctly generated in mom_priv/BaseCpuFrequency.xml

Dear Torque users,

I have recently been upgrading Torque on our small cluster. Everything on the head node was fine. I then copied the packages generated on the head node to the worker node and installed them. After that I got pbs_mom running on the worker node. But each time I restarted pbs_mom, I got an error: “pbs_mom: LOG_ERROR::main, unexpected boost exception caught”.

I then used gdb to find the problem and was taken to the load_base_frequencies function. I realized that mom_priv/BaseCpuFrequency.xml was broken. I deleted it, ran pbs_mom again, and checked the regenerated mom_priv/BaseCpuFrequency.xml file, which is provided below. I think there is something wrong with my Torque build and it didn’t generate the correct information.

The cluster has 1 head node and several worker nodes. The head node runs Ubuntu 12.04 and the worker nodes run Ubuntu 14.04.

Here are my questions:

1.      I think a quick fix is to write the BaseCpuFrequency.xml file manually so that Torque will load it. But can anyone tell me what type, upperKHz and lowerKHz mean in the file?

2.      What is the cause of this strange problem? Does the Ubuntu version difference matter?

3.      My configure options are simple, with only the GUI enabled. Is it possible to resolve this problem permanently by adding any configure options? (I tried --enable-libcpuset and --enable-cpuset but no luck.)

Thanks,

Yonghui

 

<?xml version="1.0"?>
<CpuFrequencies>
  <CpuNode NodeNumber="0">
    <type>Unknown</type>
    <KHz>47</KHz>
    <upperKHz>140736676614000</upperKHz>
    <lowerKHz>0</lowerKHz>
  </CpuNode>
  <CpuNode NodeNumber="0">
    <type>Unknown</type>
    <KHz>47</KHz>
    <upperKHz>140736676614000</upperKHz>
    <lowerKHz>0</lowerKHz>
  </CpuNode>
  <CpuNode NodeNumber="0">
    <type>Unknown</type>
    <KHz>47</KHz>
    <upperKHz>140736676614000</upperKHz>
    <lowerKHz>0</lowerKHz>
  </CpuNode>
  <CpuNode NodeNumber="0">
    <type>Unknown</type>
    <KHz>47</KHz>
    <upperKHz>140736676614000</upperKHz>
    <lowerKHz>0</lowerKHz>
  </CpuNode>
</CpuFrequencies>
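
A file like the one above can be screened mechanically before pbs_mom loads it. This is a minimal sketch, assuming only the element names that appear in the file (CpuNode, KHz, upperKHz, lowerKHz). The plausibility window of 100 MHz to 10 GHz, the function name, and the second sample node are illustrative assumptions, not values Torque defines.

```python
import xml.etree.ElementTree as ET

# Assumed plausibility window for a CPU clock, in kHz (100 MHz to 10 GHz).
MIN_KHZ = 100_000
MAX_KHZ = 10_000_000

# First node copies the values from the post; second node is made up.
SAMPLE = """<?xml version="1.0"?>
<CpuFrequencies>
  <CpuNode NodeNumber="0">
    <type>Unknown</type>
    <KHz>47</KHz>
    <upperKHz>140736676614000</upperKHz>
    <lowerKHz>0</lowerKHz>
  </CpuNode>
  <CpuNode NodeNumber="1">
    <type>Example</type>
    <KHz>2400000</KHz>
    <upperKHz>2400000</upperKHz>
    <lowerKHz>1600000</lowerKHz>
  </CpuNode>
</CpuFrequencies>
"""


def implausible_fields(xml_text):
    """Return [(node_index, [tag, ...])] for values outside the window."""
    root = ET.fromstring(xml_text)
    findings = []
    for index, node in enumerate(root.findall("CpuNode")):
        bad = []
        for tag in ("KHz", "upperKHz", "lowerKHz"):
            text = node.findtext(tag)
            if text is None or not text.strip().isdigit():
                bad.append(tag)  # missing or non-numeric value
            elif not MIN_KHZ <= int(text) <= MAX_KHZ:
                bad.append(tag)  # numeric but outside the assumed window
        if bad:
            findings.append((index, bad))
    return findings
```

With the assumed window, all three frequency fields of the node taken from the post are flagged, which matches the impression that the file was generated from bad data rather than from the real CPU information.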

 

David Beer | 10 Jul 18:25 2015

down_on_error default

All,

Currently, if a node health check script reports a problem, pbs_server only sets the node down if down_on_error is set to true. We have had requests to change the default to true, and only ignore the error if down_on_error is set to false. I personally agree with the requests and think that honoring the result of a node health check is the more sensible default.

Are there any objections to changing this default? We could either change it in the next major release or in the next release of 5.1, which would be 5.1.2. This would only affect people who have a node health check script configured, so even though it is changing a default it would not affect default behavior generally.

--
David Beer | Senior Software Engineer
Adaptive Computing