Re: vmware vmotion or snapshot + rabbitmq + celery = Too many heartbeats missed
Each of the instances is running ntpd, and we have a stratum 1 timeserver synced to GPS in the same facility.
Clock drift is more or less negligible, and well under a tenth of a second worst case.
I'll try extending the heartbeat. I'm pretty sure that's going to fix the issue; I just wasn't sure where to
change the config. Thanks!
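
For anyone else looking for where this goes, here is a minimal sketch of the change, assuming the setting lives in the usual Celery config module (celeryconfig.py is an assumed file name here). The value 30 is just the example suggested in the reply quoted below, not a tuned number.

```python
# celeryconfig.py (assumed module name; use whatever config module your app loads)

# Heartbeat interval in seconds negotiated with the broker.
# 30 is the example value from the quoted reply, not a measured recommendation.
BROKER_HEARTBEAT = 30
```

The workers need a restart to pick up the new value.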
On Oct 12, 2012, at 6:24 AM, Ask Solem <ask@...> wrote:
> On 11 Oct 2012, at 08:01, Tom Pepper <peppernicus@...> wrote:
>> Hi all:
>> I've noticed that during snapshot events and nightly backups (which take a snapshot), as well as
>> lengthier vmotion events of the rabbitmq server or nodes, celeryd (kombu, really) will emit the following:
>> ERROR/MainProcess] Error in timer: ConnectionError('Too many heartbeats missed', None, None, None, '')
>> Traceback (most recent call last):
>>   File "/root/toro/local/lib/python2.7/site-packages/celery/utils/timer2.py", line 93, in apply_entry
>>     entry()
>>   File "/root/toro/local/lib/python2.7/site-packages/celery/utils/timer2.py", line 49, in __call__
>>     return self.fun(*self.args, **self.kwargs)
>>   File "/root/toro/local/lib/python2.7/site-packages/celery/utils/timer2.py", line 150, in _reschedules
>>     return fun(*args, **kwargs)
>>   File "/root/toro/local/lib/python2.7/site-packages/kombu/connection.py", line 186, in heartbeat_check
>>     return self.transport.heartbeat_check(self.connection, rate=rate)
>>   File "/root/toro/local/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 130, in heartbeat_check
>>     return connection.heartbeat_tick(rate=rate)
>>   File "/root/toro/local/lib/python2.7/site-packages/amqp/connection.py", line 836, in heartbeat_tick
>>     raise ConnectionError('Too many heartbeats missed')
>> ConnectionError: Too many heartbeats missed
>> followed shortly by:
>> CRITICAL/MainProcess] Couldn't ack 4, reason:error(32, 'Broken pipe')
>> Traceback (most recent call last):
>>   File "/root/toro/local/lib/python2.7/site-packages/kombu/transport/base.py", line 104, in ack_log_error
>>     self.ack()
>>   File "/root/toro/local/lib/python2.7/site-packages/kombu/transport/base.py", line 99, in ack
>>   "/root/toro/local/lib/python2.7/site-packages/amqp/channel.py", line 1556, in basic_ack
>>     self._send_method((60, 80), args)
>>   File "/root/toro/local/lib/python2.7/site-packages/amqp/abstract_channel.py", line 58, in _send_method
>>     self.channel_id, method_sig, args, content)
>>   File "/root/toro/local/lib/python2.7/site-packages/amqp/method_framing.py", line 216, in write_method
>>     write_frame(1, channel, payload)
>>   File "/root/toro/local/lib/python2.7/site-packages/amqp/transport.py", line 149, in write_frame
>>     frame_type, channel, size, payload, 0xce))
>>   File "/usr/lib/python2.7/socket.py", line 224, in meth
>>     return getattr(self._sock,name)(*args)
>> error: [Errno 32] Broken pipe
>> Once this happens, the celeryd instances show in top as consuming 100% CPU per node started and no longer
>> process any tasks until they are restarted.
> Either the broker did actually miss the heartbeat or the system time is unreliable.
> (Time in virtualized environments is often unreliable, but I'm not sure if that is
> at play here.)
> You could try increasing the heartbeat interval (e.g. BROKER_HEARTBEAT=30).
> There's also a constant in the code called AMQHEARTBEAT_RATE. There's no setting for this
> yet, but you could change it in the source code to modify how often the heartbeats
> are checked. The default is to check at twice the rate of the heartbeat value, and I would guess
> decreasing this could adjust for clock instability.
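
To make the arithmetic concrete, here is a rough sketch of how the heartbeat value and the check rate interact. This is a simplified model for illustration, not the amqp or kombu code: the function names are made up, the 25-second pause is a hypothetical vmotion/snapshot stall, and the "dead after two silent heartbeat periods" tolerance is an assumption on my part.

```python
def check_interval(heartbeat, rate=2.0):
    """Seconds between heartbeat checks when checking `rate` times per heartbeat period."""
    return heartbeat / rate


def survives_pause(heartbeat, pause, tolerance=2.0):
    """Simplified model: the connection is declared dead once nothing has been
    heard for more than `tolerance` heartbeat periods (assumed value, see above)."""
    return pause <= tolerance * heartbeat


if __name__ == "__main__":
    pause = 25  # hypothetical stall in seconds during a snapshot or vmotion
    for heartbeat in (10, 30):
        print("heartbeat=%ss check every %ss survives %ss pause: %s" % (
            heartbeat,
            check_interval(heartbeat),
            pause,
            survives_pause(heartbeat, pause),
        ))
```

Under those assumptions a 10-second heartbeat would not ride out a 25-second stall, while a 30-second heartbeat would, which is why extending the heartbeat looks like the right first thing to try.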
> As for why the process uses 100% CPU I have no idea, but it sounds like a bug.
> Ask Solem
> You received this message because you are subscribed to the Google Groups "celery-users" group.
> To post to this group, send email to celery-users@...
> To unsubscribe from this group, send email to celery-users+unsubscribe <at> googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/celery-users?hl=en.