prologalarm time not respected on torque-4.2.10
eatdirt <dirteat <at> gmail.com>
2015-05-01 11:27:00 GMT
I have an epilogue.user script and a system epilogue script running on our
local cluster, allowing some users to use local scratch and get their data
back afterwards. The doc says that the prologalarm time is 5 min by default,
and indeed all these scripts are killed after 5 min (the nodes are not
marked as down, though). We wanted to increase that time and set
prologalarm to 3600 seconds (1 h).
Our problem is that this setting is not respected: the scripts are still
killed after 5 min instead of 1 h. momctl reports the new prologalarm as
3600, but it has no effect.
This looks like a Torque bug; can anyone help me with this issue? :-/
Here is an example.
On the running node, these are the pbs_mom logs:
05/01/2015 12:25:27;0080; pbs_mom.28100;Svr;preobit_preparation;top
05/01/2015 12:26:07;0008; pbs_mom.28100;Job;scan_for_terminated;pid
35895 not tracked, statloc=0, exitval=0
05/01/2015 12:26:52;0008; pbs_mom.28100;Job;scan_for_terminated;pid
35938 not tracked, statloc=0, exitval=0
05/01/2015 12:27:37;0008; pbs_mom.28100;Job;scan_for_terminated;pid
35991 not tracked, statloc=0, exitval=0
05/01/2015 12:28:22;0008; pbs_mom.28100;Job;scan_for_terminated;pid
36047 not tracked, statloc=0, exitval=0
05/01/2015 12:28:37;0002; pbs_mom.28100;Svr;pbs_mom;Torque Mom
Version = 4.2.10, loglevel = 1
05/01/2015 12:29:07;0008; pbs_mom.28100;Job;scan_for_terminated;pid
36090 not tracked, statloc=0, exitval=0
05/01/2015 12:29:52;0008; pbs_mom.28100;Job;scan_for_terminated;pid
36138 not tracked, statloc=0, exitval=0
05/01/2015 12:30:24;0080; pbs_mom.28100;Job;2355.cosmo;obit sent to
05/01/2015 12:30:24;0008; pbs_mom.28100;Job;2355.cosmo;forking to
user, uid: 500 gid: 100 homedir: '/home/chris'
05/01/2015 12:30:26;0080; pbs_mom.28100;Job;2355.cosmo;removed job
The job terminated at 12:25, when preobit started, and pbs_mom sent the
obit for the job at 12:30, 5 minutes later.
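The gap can be checked directly from the log timestamps. This is only a minimal sketch in Python: the two lines are copied from the excerpt above (the second one is truncated exactly as it appears there), and the helper names parse_timestamp and elapsed_seconds are made up for illustration, they are not part of Torque.

```python
from datetime import datetime

# Two lines copied from the pbs_mom excerpt in this post.
LOG_LINES = [
    "05/01/2015 12:25:27;0080; pbs_mom.28100;Svr;preobit_preparation;top",
    "05/01/2015 12:30:24;0080; pbs_mom.28100;Job;2355.cosmo;obit sent to",
]


def parse_timestamp(line):
    """Return the datetime held in the first ';'-separated field of a log line."""
    return datetime.strptime(line.split(";", 1)[0], "%m/%d/%Y %H:%M:%S")


def elapsed_seconds(lines, start_marker, end_marker):
    """Seconds between the first line containing start_marker and the first containing end_marker."""
    start = next(parse_timestamp(l) for l in lines if start_marker in l)
    end = next(parse_timestamp(l) for l in lines if end_marker in l)
    return (end - start).total_seconds()


delay = elapsed_seconds(LOG_LINES, "preobit_preparation", "obit sent")
print(f"time between preobit and obit: {delay:.0f} s")
```

With these two lines the difference is 297 s, i.e. just under the 300 s default and nowhere near 3600 s.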
momctl -d3 for that very same node reports a prolog alarm time of 3600 s,
which is thus not respected:
Server: cosmo (192.168.0.1:15001)
Last Msg From Server: 374 seconds (DeleteJob)
WARNING: no messages sent to server
stdout/stderr spool directory: '/var/lib/torque/spool/'
NOTE: syslog enabled
MOM active: 65834 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 1 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: TCP
MemLocked: TRUE (mlock)
TCP Timeout: 60 seconds
Prolog: /var/lib/torque/mom_priv/prologue (enabled)
Prolog Alarm Time: 3600 seconds
Alarm Time: 0 of 10 seconds
Trusted Client List:
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected
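To compare the reported value with the observed kill time on several nodes, the momctl text can be parsed. A rough sketch in Python, assuming the output format shown above; the sample text is a few lines pasted from this post, prolog_alarm_seconds is a hypothetical helper, and the 300 s figure is simply the 5 min observed here.

```python
import re

# A few lines pasted from the momctl -d3 output in this post.
MOMCTL_OUTPUT = """\
Prolog: /var/lib/torque/mom_priv/prologue (enabled)
Prolog Alarm Time: 3600 seconds
Alarm Time: 0 of 10 seconds
"""


def prolog_alarm_seconds(text):
    """Return the 'Prolog Alarm Time' value in seconds, or None if the line is absent."""
    match = re.search(r"^Prolog Alarm Time:\s*(\d+)\s*seconds", text, re.MULTILINE)
    return int(match.group(1)) if match else None


configured = prolog_alarm_seconds(MOMCTL_OUTPUT)
observed_kill_after = 300  # seconds; the ~5 min seen in the pbs_mom logs

if configured is not None and configured > observed_kill_after:
    print(f"mom reports {configured} s but scripts die after ~{observed_kill_after} s")
```

In practice the text would come from running momctl -d3 against each node rather than from a pasted string.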