MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由日本DeNA公司youshimaton(现就职于 Facebook公司)开发,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件。在MySQL故障切换过程中,MHA能做到在 0~30秒之内自动完成数据库的故障切换操作,并且在进行故障切换的过程中,MHA能在最大程度上保证数据的一致性,以达到真正意义上的高可用。

该软件由两部分组成:MHA Manager(管理节点)和MHA Node(数据节点)。MHA Manager可以单独部署在一台独立的机器上管理多个master-slave集群,也可以部署在一台slave节点上。MHA Node运行在每台MySQL服务器上,MHA Manager会定时探测集群中的master节点,当master出现故障时,它可以自动将最新数据的slave提升为新的master,然后将所有其 他的slave重新指向新的master。整个故障转移过程对应用程序完全透明。

在MHA自动故障切换过程中,MHA试图从宕机的主服务器上保存二进制日志,最大程度的保证数据的不丢失,但这并不总是可行的。例如,如果主服务器 硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失了最新的数据。使用MySQL 5.5的半同步复制,可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来。如果只有一个slave已经收到了最新的二进制日志,MHA可以将最 新的二进制日志应用于其他所有的slave服务器上,因此可以保证所有节点的数据一致性。

MHA工作原理如图:

MySQL高可用架构之MHA-LMLPHP

环境:

server1 : 172.25.100.1   manager

server5: 172.25.100.5 node1

server6: 172.25.100.6 node2

server7: 172.25.100.7 node3

使用安装包:

perl-Mail-Sender-0.8.16-3.el6.noarch.rpm                      perl-Mail-Sendmail-0.79-12.el6.noarch.rpm
mha4mysql-manager-0.56-0.el6.noarch.rpm                  perl-MIME-Lite-3.027-2.el6.noarch.rpm
mha4mysql-node-0.56-0.el6.noarch.rpm                         perl-MIME-Types-1.28-2.el6.noarch.rpm
perl-Config-Tiny-2.12-7.1.el6.noarch.rpm                        perl-Parallel-ForkManager-0.7.9-1.el6.noarch.rpm
perl-Email-Date-Format-1.002-5.el6.noarch.rpm
perl-Log-Dispatch-2.27-1.el6.noarch.rpm

1.部署MHA

所有节点安装node

yum install -y mha4mysql-node-0.56-0.el6.noarch.rpm

建立好master与slave间的主从复制环境

参考之前博客---mysql的复制

server1安装manager管理工具

将上述所有安装包安装

yum install -y *

创建MHA的工作目录,并且创建相关配置文件

[root@server1 ~]# mkdir /etc/manager
[root@server1 MHA]# cd /etc/manager/
[root@server1 manager]# ls
[root@server1 manager]# vim manager.cnf
[server default]
manager_workdir=/etc/manager    //设置manager的工作目录
manager_log=/etc/manager/manager.log    //设置manager的日志
master_binlog_dir=/var/lib/mysql    //设置master 保存binlog的位置,以便MHA可以找到master的日志,我这里的也就是mysql的数据目录
#master_ip_failover_script= /usr/local/bin/master_ip_failover    //设置自动failover时候的切换脚本
#master_ip_online_change_script= /usr/local/bin/master_ip_online_change    //设置手动切换时候的切换脚本
password=REDHAT.org123    //设置mysql中root用户的密码,这个密码是前文中创建监控用户的那个密码
user=root    //设置监控用户root
ping_interval=1    //设置监控主库,发送ping包的时间间隔,默认是3秒,尝试三次没有回应的时候自动进行railover
remote_workdir=/tmp    //设置远端mysql在发生切换时binlog的保存位置
repl_password=REDHAT.org123    //设置复制用户的密码
repl_user=repl    //设置复制环境中的复制用户名
#report_script=/usr/local/send_report    //设置发生切换后发送的报警的脚本
#secondary_check_script= /usr/local/bin/masterha_secondary_check -s server03 -s server02        #shutdown_script=""
ssh_user=root    //设置ssh的登录用户名

[server5]
hostname=172.25.100.5
port=3306

[server6]
hostname=172.25.100.6
port=3306
candidate_master=1    //设置为候选master,如果设置该参数以后,发生主从切换以后将会将此从库提升为主库,即使这个主库不是集群中事件最新的slave
check_repl_delay=0    //默认情况下如果一个slave落后master 100M的relay logs的话,MHA将不会选择该slave作为一个新的master,因为对于这个slave的恢复需要花费很长时间,通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时,这个参数对于设置了candidate_master=1的主机非常有用,因为这个候选主在切换的过程中一定是新的master

[server7]
hostname=172.25.100.7
port=3306

配置SSH登录无密码验证

[root@server1 ~]# ssh-keygen
[root@server1 ~]# ssh-copy-id 172.25.100.1
[root@server1 .ssh]# ls
authorized_keys  id_rsa  id_rsa.pub  known_hosts
[root@server1 ~]# scp -r .ssh/ server5:/root/
[root@server1 ~]# scp -r .ssh/ server6:/root/
[root@server1 ~]# scp -r .ssh/ server7:/root/
测试:
[root@server1 ~]# ssh 172.25.100.5
Last login: Sun Oct  7 10:02:30 2018 from 172.25.100.250
[root@server5 ~]# ^C
[root@server5 ~]# logout
Connection to 172.25.100.5 closed.
[root@server1 ~]# ssh 172.25.100.6
Last login: Sun Oct  7 10:02:37 2018 from 172.25.100.250
[root@server6 ~]# logout
Connection to 172.25.100.6 closed.
[root@server1 ~]# ssh 172.25.100.7
Last login: Sun Oct  7 10:02:48 2018 from 172.25.100.250
[root@server7 ~]# logout
Connection to 172.25.100.7 closed.

检查MHA Manger到所有MHA Node的SSH连接状态:

[root@server1 manager]# masterha_check_ssh --conf=/etc/manager/manager.cnf
Sun Oct  7 10:37:39 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Oct  7 10:37:39 2018 - [info] Reading application default configuration from /etc/manager/manager.cnf..
Sun Oct  7 10:37:39 2018 - [info] Reading server configuration from /etc/manager/manager.cnf..
Sun Oct  7 10:37:39 2018 - [info] Starting SSH connection tests..
Sun Oct  7 10:37:40 2018 - [debug]
Sun Oct  7 10:37:39 2018 - [debug]  Connecting via SSH from root@172.25.100.5(172.25.100.5:22) to root@172.25.100.6(172.25.100.6:22)..
Warning: Permanently added '172.25.100.6' (RSA) to the list of known hosts.
Sun Oct  7 10:37:39 2018 - [debug]   ok.
Sun Oct  7 10:37:39 2018 - [debug]  Connecting via SSH from root@172.25.100.5(172.25.100.5:22) to root@172.25.100.1(172.25.100.1:22)..
Sun Oct  7 10:37:40 2018 - [debug]   ok.
Sun Oct  7 10:37:40 2018 - [debug]
Sun Oct  7 10:37:40 2018 - [debug]  Connecting via SSH from root@172.25.100.6(172.25.100.6:22) to root@172.25.100.5(172.25.100.5:22)..
Sun Oct  7 10:37:40 2018 - [debug]   ok.
Sun Oct  7 10:37:40 2018 - [debug]  Connecting via SSH from root@172.25.100.6(172.25.100.6:22) to root@172.25.100.1(172.25.100.1:22)..
Sun Oct  7 10:37:40 2018 - [debug]   ok.
Sun Oct  7 10:37:41 2018 - [debug]
Sun Oct  7 10:37:40 2018 - [debug]  Connecting via SSH from root@172.25.100.1(172.25.100.1:22) to root@172.25.100.5(172.25.100.5:22)..
Sun Oct  7 10:37:40 2018 - [debug]   ok.
Sun Oct  7 10:37:40 2018 - [debug]  Connecting via SSH from root@172.25.100.1(172.25.100.1:22) to root@172.25.100.6(172.25.100.6:22)..
Sun Oct  7 10:37:41 2018 - [debug]   ok.
Sun Oct  7 10:37:41 2018 - [info] All SSH connection tests passed successfully.

检查整个复制环境状况

[root@server1 manager]# masterha_check_repl --conf=/etc/manager/manager.cnf
Sun Oct  7 11:43:11 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sun Oct  7 11:43:11 2018 - [info] Reading application default configuration from /etc/manager/manager.cnf..
Sun Oct  7 11:43:11 2018 - [info] Reading server configuration from /etc/manager/manager.cnf..
Sun Oct  7 11:43:11 2018 - [info] MHA::MasterMonitor version 0.56.
Sun Oct  7 11:43:11 2018 - [info] GTID failover mode = 1
Sun Oct  7 11:43:11 2018 - [info] Dead Servers:
Sun Oct  7 11:43:11 2018 - [info] Alive Servers:
Sun Oct  7 11:43:11 2018 - [info]   172.25.100.5(172.25.100.5:3306)
Sun Oct  7 11:43:11 2018 - [info]   172.25.100.6(172.25.100.6:3306)
Sun Oct  7 11:43:11 2018 - [info]   172.25.100.7(172.25.100.7:3306)
Sun Oct  7 11:43:11 2018 - [info] Alive Slaves:
Sun Oct  7 11:43:11 2018 - [info]   172.25.100.6(172.25.100.6:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Oct  7 11:43:11 2018 - [info]     GTID ON
Sun Oct  7 11:43:11 2018 - [info]     Replicating from 172.25.100.5(172.25.100.5:3306)
Sun Oct  7 11:43:11 2018 - [info]     Primary candidate for the new Master (candidate_master is set)
Sun Oct  7 11:43:11 2018 - [info]   172.25.100.7(172.25.100.7:3306)  Version=5.7.17-log (oldest major version between slaves) log-bin:enabled
Sun Oct  7 11:43:11 2018 - [info]     GTID ON
Sun Oct  7 11:43:11 2018 - [info]     Replicating from 172.25.100.5(172.25.100.5:3306)
Sun Oct  7 11:43:11 2018 - [info] Current Alive Master: 172.25.100.5(172.25.100.5:3306)
Sun Oct  7 11:43:11 2018 - [info] Checking slave configurations..
Sun Oct  7 11:43:11 2018 - [info]  read_only=1 is not set on slave 172.25.100.6(172.25.100.6:3306).
Sun Oct  7 11:43:11 2018 - [info]  read_only=1 is not set on slave 172.25.100.7(172.25.100.7:3306).
Sun Oct  7 11:43:11 2018 - [info] Checking replication filtering settings..
Sun Oct  7 11:43:11 2018 - [info]  binlog_do_db= , binlog_ignore_db=
Sun Oct  7 11:43:11 2018 - [info]  Replication filtering check ok.
Sun Oct  7 11:43:11 2018 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sun Oct  7 11:43:11 2018 - [info] Checking SSH publickey authentication settings on the current master..
Sun Oct  7 11:43:11 2018 - [info] HealthCheck: SSH to 172.25.100.5 is reachable.
Sun Oct  7 11:43:11 2018 - [info]
172.25.100.5(172.25.100.5:3306) (current master)
 +--172.25.100.6(172.25.100.6:3306)
 +--172.25.100.7(172.25.100.7:3306)

Sun Oct  7 11:43:11 2018 - [info] Checking replication health on 172.25.100.6..
Sun Oct  7 11:43:11 2018 - [info]  ok.
Sun Oct  7 11:43:11 2018 - [info] Checking replication health on 172.25.100.7..
Sun Oct  7 11:43:11 2018 - [info]  ok.
Sun Oct  7 11:43:11 2018 - [warning] master_ip_failover_script is not defined.
Sun Oct  7 11:43:11 2018 - [warning] shutdown_script is not defined.
Sun Oct  7 11:43:11 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

检查整个复制环境遇到的问题:

[error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln301] Got MySQL error when connecting 172.25.100.6(172.25.100.6:3306) :1130:Host 'server1' is not allowed to connect to this MySQL server, but this is not a MySQL crash. Check MySQL server settings.
 at /usr/share/perl5/vendor_perl/MHA/ServerManager.pm line 297

解决办法:
在报告节点中给与复制权限
 grant all on *.* to root@'%' identified by 'REDHAT.org123';
Query OK, 0 rows affected, 1 warning (0.06 sec)
[error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln393] 172.25.100.6(172.25.100.6:3306): User repl does not exist or does not have REPLICATION SLAVE privilege! Other slaves can not start replication from this host.

解决办法:
在所报告的节点中执行命令
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'172.25.100.%' IDENTIFIED by 'REDHAT.org123';

检查MHA Manager的状态:

[root@server1 manager]# masterha_check_status --conf=/etc/manager/manager.cnf
manager is stopped(2:NOT_RUNNING).

开启 MHA Manager:

[root@server1 manager]# nohup masterha_manager --conf=/etc/manager/manager.cnf &
[1] 1445
[root@server1 manager]# nohup: ignoring input and appending output to `nohup.out'

[root@server1 manager]# masterha_check_status --conf=/etc/manager/manager.cnf
manager (pid:1445) is running(0:PING_OK), master:172.25.100.5

 

10-07 20:46