Issue when zookeeper session expires during shard leader election.
Michael Roberts <mroberts <at> tableau.com>
2015-07-28 06:10:59 GMT
I am encountering an issue which looks a lot like https://issues.apache.org/jira/browse/SOLR-6763.
However, it seems like the fix for that does not address the entire problem. That fix will only work if we hit
the zkClient.getChildren() call before the reconnect logic has finished reconnecting us to ZooKeeper
(I can reproduce scenarios where it doesn’t in 4.10.4). If the reconnect has already happened, we
won’t get the session timeout exception.
The specific problem I am seeing is slightly different from SOLR-6763, but the root cause appears to be the same.
The issue that I am seeing is this: during startup the collections are registered and there is one
thread-* per collection. The elections are started on this thread, the
/collections/<name>/leader_elect ZNodes are created, and then the thread blocks waiting for the peers
to become available. During the block the ZooKeeper session times out.
Once we finish blocking, the reconnect logic calls register() for each collection, which restarts the
election process (although serially this time). At a later point, we can have two threads that are trying
to register the same collection.
This is incorrect, because the coreZkRegister-1-thread-* threads are assuming they are leader with no
verification from ZooKeeper. The ephemeral leader_elect nodes they created were removed when the
session timed out. If another host started in the interim (or at any point after that), it would see
no leader, and would attempt to become leader of the shard itself. This leads to some interesting race
conditions, where you can end up with two leaders for a shard.
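
To make the missing verification concrete, here is a toy in-memory model of ephemeral nodes. It is only a sketch: EphemeralStoreSketch, its methods, the session ids and the example path are all made up for illustration and are not the ZooKeeper or Solr API. The one behaviour it relies on is that ZooKeeper removes a session's ephemeral nodes when that session expires.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of ephemeral znodes; not the ZooKeeper client API.
final class EphemeralStoreSketch {
    // Maps a znode path to the id of the session that created it.
    private final Map<String, Long> ownerByPath = new HashMap<>();

    void createEphemeral(String path, long sessionId) {
        ownerByPath.put(path, sessionId);
    }

    // Models session expiry: every ephemeral node of that session disappears.
    void expireSession(long sessionId) {
        ownerByPath.values().removeIf(owner -> owner == sessionId);
    }

    // Leadership should only be trusted if the election node still exists
    // and belongs to the session the caller holds right now.
    boolean stillOwns(String path, long currentSessionId) {
        Long owner = ownerByPath.get(path);
        return owner != null && owner == currentSessionId;
    }
}
```

With this model, a thread that created its election node under session 1, blocked, and woke up under session 2 would fail the stillOwns() check, instead of carrying on as leader while another host claims the same shard.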
It seems like a more complete fix would be to actually close the ElectionContext upon reconnect. This would
break us out of the wait for peers loop, and stop the threads from processing the rest of the leadership
logic. The reconnection logic would then continue to call register() again for each Collection, and if
the ZK state indicates it should be leader it can re-run the leadership logic.
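
A minimal sketch of what I mean, under stated assumptions: ClosableElectionSketch is a hypothetical stand-in, not Solr's actual ElectionContext, and the polling loop, its parameters and the give-up-on-timeout behaviour are simplifications chosen for the example. The point is only that the wait loop checks a closed flag, so the reconnect logic can stop the old thread before calling register() again.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for an election context; not Solr's real class.
final class ClosableElectionSketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    // Would be called by the reconnect logic before it re-registers.
    void close() {
        closed.set(true);
    }

    boolean isClosed() {
        return closed.get();
    }

    // Polls until peers are available, the context is closed, or maxPolls
    // is reached. Returns true only if peers showed up while the context
    // was still open; only then may the caller run the leadership logic.
    // Giving up after maxPolls is a choice made for this sketch.
    boolean waitForPeers(BooleanSupplier peersAvailable, int maxPolls, long pollMillis) {
        for (int i = 0; i < maxPolls; i++) {
            if (closed.get()) {
                return false; // closed on reconnect: abandon this election
            }
            if (peersAvailable.getAsBoolean()) {
                return !closed.get(); // re-check in case of a concurrent close
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

The thread that started the election before the session expired would see waitForPeers() return false and skip the rest of the leadership logic, leaving the decision to the fresh register() call that runs against the new session.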