In preparation for upgrading Slony from 2.1 to 2.2 this weekend, I upgraded my server's OS from FreeBSD 9.1 to 9.2 (a fairly minor update, as OS updates go).
Since the upgrade, the slon connected to the replica DB on that upgraded server stops after about 58 to 59 minutes. Restarting the slon daemon lets replication continue and catch up fairly quickly.
By "stop" I mean there is nothing visibly going on -- replication stalls and nothing is logged.
Here is the tail end of my log file from a few minutes ago. The slon process was started at 8:48am.
2014-03-14 09:46:43.744319500 DEBUG1 calc sync size - last time: 1 last length: 2002 ideal: 29 proposed size: 3
2014-03-14 09:46:43.745342500 DEBUG1 about to monitor_subscriber_query - pulling big actionid list for 4
2014-03-14 09:46:43.749657500 INFO remoteWorkerThread_4: syncing set 1 with 262 table(s) from provider 4
2014-03-14 09:46:43.762199500 DEBUG1 remoteHelperThread_4_4: 0.012 seconds delay for first row
2014-03-14 09:46:43.766863500 DEBUG1 remoteHelperThread_4_4: 0.016 seconds until close cursor
2014-03-14 09:46:43.766867500 DEBUG1 remoteHelperThread_4_4: inserts=266 updates=350 deletes=176 truncates=0
2014-03-14 09:46:43.766869500 DEBUG1 remoteWorkerThread_4: sync_helper timing: pqexec (s/count)- provider 0.014/5 - subscriber 0.000/5
2014-03-14 09:46:43.766872500 DEBUG1 remoteWorkerThread_4: sync_helper timing: large tuples 0.000/0
2014-03-14 09:46:44.006795500 INFO remoteWorkerThread_4: SYNC 5015580475 done in 0.262 seconds
2014-03-14 09:46:44.006853500 DEBUG1 remoteWorkerThread_4: SYNC 5015580475 sync_event timing: pqexec (s/count)- provider 0.001/2 - subscriber 0.005/2 - IUD 0.242/164
At this point nothing more gets logged.
Looking at the activity in the DB, I see the 5 connections from this slon. All but one have a query start time of 09:46:44. The exception is this connection, whose query started more than 10 minutes earlier:
datid | 16392
datname | vkmlm
pid | 7159
usesysid | 16389
usename | slony
application_name | slon.local_cleanup
client_addr | 127.0.0.1
client_port | 55142
backend_start | 2014-03-14 08:48:16.198806-04
query_start | 2014-03-14 09:34:32.735557-04
state_change | 2014-03-14 09:34:32.745553-04
waiting | f
state | idle
query | begin;lock table "_mailermailer".sl_config_lock;select "_mailermailer".cleanupEvent('10 minutes'::interval);commit;
pg_cancel_backend() did not kill that query. pg_terminate_backend() got rid of that process, but the remaining connections still appear to be stuck and slon is logging nothing.
Any ideas? This is so confusing because it is such an odd time interval before it locks up. What's magical about 58 minutes?
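For reference, the interval can be worked out from the timestamps above. This is only a minimal Python sketch. It assumes that the backend_start of the slon connection (08:48:16) is close to when slon itself started, and that the log timestamps are in the same local time zone as the pg_stat_activity output. parse_ts is a helper name made up for this example.

```python
from datetime import datetime


def parse_ts(text):
    """Parse 'YYYY-MM-DD HH:MM:SS.fraction', keeping at most microseconds.

    The slon log carries 9 fractional digits, but strptime's %f accepts
    at most 6, so the fraction is padded or truncated to 6 digits.
    """
    stamp, _, frac = text.partition(".")
    frac = (frac + "000000")[:6]
    return datetime.strptime(f"{stamp}.{frac}", "%Y-%m-%d %H:%M:%S.%f")


# Values copied from the output above (time zone suffix dropped, on the
# assumption that everything is local time on the same host).
slon_started = parse_ts("2014-03-14 08:48:16.198806")     # backend_start
cleanup_query = parse_ts("2014-03-14 09:34:32.735557")    # query_start
last_logged = parse_ts("2014-03-14 09:46:44.006853500")   # last log line

stall_after = last_logged - slon_started
cleanup_gap = last_logged - cleanup_query

minutes, seconds = divmod(stall_after.total_seconds(), 60)
print(f"slon went quiet after {int(minutes)} min {seconds:.1f} s")
# prints: slon went quiet after 58 min 27.8 s

minutes, seconds = divmod(cleanup_gap.total_seconds(), 60)
print(f"cleanup query started {int(minutes)} min {seconds:.1f} s before that")
# prints: cleanup query started 12 min 11.3 s before that
```

So this run lasted a bit under 58.5 minutes, and the last cleanupEvent call started about 12 minutes before the final log line.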
OS: FreeBSD 9.2/amd64