I would greatly appreciate any help at this point! I am trying to make a small computing cluster for a class in the fall. I have installed torque-5.0.1, and I am trying to get a two-computer cluster working before I add more machines.
On the head node (pchem-m-236), I have pbs_server and pbs_sched running, and the compute node (pchem-s1-236) is running pbs_mom.
The compute node state is down when everything is started. When I set the node state to free using qmgr, jobs will actually run on the compute node, but after a minute or so the node goes down again.
Here is some info. momctl warns that no messages have been sent to the server:
pchem-m-236 teacher # momctl -d 3 -h DHCP-129-59-119-100.n1.vanderbilt.edu
Host: pchem-s1-236/pchem-s1-236 Version: 5.0.1 PID: 3828
Server: DHCP-129-59-119-18.n1.vanderbilt.edu (184.108.40.206:15001)
Last Msg From Server: 739 seconds (CLUSTER_ADDRS)
WARNING: no messages sent to server
stdout/stderr spool directory: '/var/spool/torque/spool/' (33917122blocks available)
NOTE: syslog enabled
MOM active: 3932 seconds
Check Poll Time: 45 seconds
Server Update Interval: 30 seconds
LogLevel: 5 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: TCP
MemLocked: TRUE (mlock)
TCP Timeout: 120 seconds
Prolog: /var/spool/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 127.0.0.1:0,127.0.1.1:0,220.127.116.11:0,18.104.22.168:15003: 0
Copy Command: /usr/bin/scp -rpB
NOTE: no local jobs detected
But the MOM log on the compute node says that status updates were sent successfully:
11/21/2014 09:36:43;0002; pbs_mom.3828;Svr;pbs_mom;Torque Mom Version = 5.0.1, loglevel = 5
11/21/2014 09:36:43;0002; pbs_mom.4956;n/a;mom_server_update_stat;status update successfully sent to DHCP-129-59-119-18.n1.vanderbilt.edu
11/21/2014 09:36:43;0008; pbs_mom.3828;Job;scan_for_terminated;pid 4956 not tracked, statloc=0, exitval=0
11/21/2014 09:37:13;0002; pbs_mom.6195;n/a;mom_server_update_stat;status update successfully sent to DHCP-129-59-119-18.n1.vanderbilt.edu
11/21/2014 09:37:14;0008; pbs_mom.3828;Job;scan_for_terminated;pid 6195 not tracked, statloc=0, exitval=0
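As a sanity check, the gap between the status updates in the log can be compared with the Server Update Interval of 30 seconds that momctl reports. Here is a minimal Python sketch, assuming the semicolon-separated log format shown above (the function name `update_intervals` is only for illustration and is not part of Torque):

```python
from datetime import datetime

def update_intervals(log_lines):
    """Return the gaps, in seconds, between successive status-update entries."""
    stamps = []
    for line in log_lines:
        # Only the mom_server_update_stat entries record a status update.
        if "mom_server_update_stat" not in line:
            continue
        # The timestamp is the first semicolon-separated field.
        stamp = line.split(";", 1)[0].strip()
        stamps.append(datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# The log lines pasted above, copied verbatim.
log = [
    "11/21/2014 09:36:43;0002; pbs_mom.4956;n/a;mom_server_update_stat;status update successfully sent to DHCP-129-59-119-18.n1.vanderbilt.edu",
    "11/21/2014 09:36:43;0008; pbs_mom.3828;Job;scan_for_terminated;pid 4956 not tracked, statloc=0, exitval=0",
    "11/21/2014 09:37:13;0002; pbs_mom.6195;n/a;mom_server_update_stat;status update successfully sent to DHCP-129-59-119-18.n1.vanderbilt.edu",
    "11/21/2014 09:37:14;0008; pbs_mom.3828;Job;scan_for_terminated;pid 6195 not tracked, statloc=0, exitval=0",
]

print(update_intervals(log))
```

The two updates are 30 seconds apart, which matches the configured interval, so the MOM appears to be sending on schedule even though momctl reports that no messages were sent.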
Does anyone have any ideas?
Thanks in advance