Why can't my ZooKeeper server rejoin the quorum?

Problem description

I have three servers in my quorum. They are running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr. One of them was restarted a couple days ago due to a deploy, and since then has not been able to join the quorum. Some lines in the logs that stick out are:

2014-03-03 18:44:40,995 [myid:1] - INFO  [main:QuorumPeer@429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation

and:

2014-03-03 18:44:41,233 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO  [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@774] - Notification time out: 400

Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch. It describes a bug fix but doesn't describe a way to resolve the issue without upgrading zookeeper.

Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails. This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.

It looks like the second issue is definitely in play because I see timeouts in the other ZooKeeper servers' logs (the ones still in the quorum) when they try to connect to the first server. What I'm not sure of is whether the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: is there a way to resolve both issues without doing a rolling restart?

Thanks for reading and for your help!

Recommended answer

This is a ZooKeeper bug: "Server is unable to join quorum after connection broken to other peers". Restarting the leader resolves the issue.
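
To restart the leader you first need to know which server currently holds that role. The following is a minimal sketch, not a definitive tool, that sends the mntr four-letter command (the same one mentioned in the question) to each server and looks at the zk_server_state field. The hostnames are hypothetical placeholders, and client port 2181 is assumed from the log lines in the question.

```python
import socket


def parse_mntr(text):
    """Parse tab-separated `mntr` output into a dict of key -> value strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # skip lines that are not "key<TAB>value"
            stats[key.strip()] = value.strip()
    return stats


def query_mntr(host, port=2181, timeout=5.0):
    """Send the `mntr` four-letter command and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"mntr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mntr(b"".join(chunks).decode("utf-8", errors="replace"))


def find_leader(hosts, port=2181):
    """Return the first host reporting zk_server_state == 'leader', or None."""
    for host in hosts:
        try:
            stats = query_mntr(host, port)
        except OSError:
            # Unreachable or timed out; move on to the next server.
            continue
        if stats.get("zk_server_state") == "leader":
            return host
    return None


if __name__ == "__main__":
    # Hypothetical hostnames; replace with the servers from your zoo.cfg.
    hosts = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
    print("leader:", find_leader(hosts))
```

A server that has not joined the quorum is not expected to report a zk_server_state of leader or follower, so it is simply skipped here.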
