MHA原理
(1)一主三从架构,主库挂了,但主库能被从库ssh上去的情况下,MHA从三个从库中选择同步最接近的作为新主,然后新主和s2,s3都ssh到原主上通过binlog补上还没有同步的数据,io_thread读取到binlog位置,传到save_binary_logs,然后回放,达到s1,s2,s3和原主一致。
(2)主库无法ssh上去的情况下,即主库的系统无法连接,假设有binlog server ,利用binlog补数据,和前面情况一样;
如果没有binlog server,假设s1靠前,s2和s3会通过relay-log和s1同步(所以从库的relay不能自动清除参数relay_log_purge = 0) ,提升s1为主,把s2和s3作为s1的从;
(3)5.6之后,如果有一主三从的架构,使用GTID复制,keepalived+一主两从,并且其中一个从库使用增强半同步复制,这样完全可以取代MHA,这就是为什么MHA代码为什么不更新,out的原因。
实验环境
OS:CentOS release 6.6 (Final)
数据库:mysql 5.7.16
master :192.168.91.23
slave1 :192.168.91.22
slave2:192.168.91.21
VIP:192.168.91.200
master && masterha_manager | Darren1 | 192.168.91.23 |
slave1 | Darren2 | 192.168.91.22 |
slave2 | Darren3 | 192.168.91.21 |
实验架构图
提前做好一主两从传统的复制(这里不再演示)
提前安装依赖包:(一般yum源可能没有这些依赖包,可以安装epel)
[root@Darren1 tools]# rpm -ivh epel-release-latest-6.noarch.rpm
[root@Darren2 tools]# rpm -ivh epel-release-latest-6.noarch.rpm
[root@Darren3 tools]# rpm -ivh epel-release-latest-6.noarch.rpm
下载软件链接:
http://code.google.com/p/mysql-master-ha/wiki
master上安装manager和node:
[root@Darren1 tools]# ls
mha4mysql-manager-0.56-0.el6.noarch.rpm
mha4mysql-node-0.56-0.el6.noarch.rpm
[root@Darren1 tools]# yum localinstall mha4mysql*
[root@Darren1 mysql]# rpm -qa |grep mha
mha4mysql-node-0.56-0.el6.noarch
mha4mysql-manager-0.56-0.el6.noarch
如果yum也无法安装,那就先安装依赖包,再安装rpm包:
yum install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager -y
rpm -ivh mha4mysql-node-0.56-0.el6.noarch.rpm
rpm -ivh mha4mysql-manager-0.56-0.el6.noarch.rpm
slave1和slave2在从库上安装node软件:
[root@Darren2 app1]# yum localinstall mha4mysql-node-0.56-0.el6.noarch.rpm
[root@Darren3 app1]# yum localinstall mha4mysql-node-0.56-0.el6.noarch.rpm
物理机器之间信任机制建立
三台机器同样配置:
[root@Darren1 tools]# vim /etc/hosts
192.168.91.23 Darren1
192.168.91.22 Darren2
192.168.91.21 Darren3
在master上面操作:
[root@Darren1 ]# cd /root/.ssh
[root@Darren1 .ssh]# ssh-keygen #一路回车
[root@Darren1 .ssh]# cat /etc/ssh/sshd_config |grep authorized
#AuthorizedKeysFile .ssh/authorized_keys
把生成的秘钥文件重定向到authorized_keys文件中:
[root@Darren1 .ssh]# cat id_rsa.pub >>authorized_keys
[root@Darren1 .ssh]# chmod 600 *
将创建的文件拷贝到slave1和slave2上面:
[root@Darren1 .ssh]# scp * 192.168.91.22:/root/.ssh/
[root@Darren1 .ssh]# scp * 192.168.91.21:/root/.ssh/
在slave1和slave2上面操作:
[root@Darren2 .ssh]#chmod 600 /root/.ssh/*
[root@Darren3 .ssh]#chmod 600 /root/.ssh/*
测试互信是否成功:
[root@Darren1 ~]# ssh Darren2 hostname
[root@Darren1 ~]# ssh Darren3 hostname
[root@Darren2 ~]# ssh Darren1 hostname
[root@Darren2 ~]# ssh Darren3 hostname
[root@Darren3 ~]# ssh Darren1 hostname
[root@Darren3 ~]# ssh Darren2 hostname
创建masterha日志文件目录:
[root@Darren1 masterha]#mkdir -p /var/log/masterha/app1
在master上创建目录及配置文件:
[root@Darren1 ~]# mkdir -p /etc/masterha
[root@Darren1 ~]# cd /etc/masterha/
上传五个文件,文件及脚本内容在文章末尾处贴出来:
[root@Darren1 masterha]# ls
app1.conf drop_vip.sh init_vip.sh masterha_default.conf master_ip_failover master_ip_online_change
[root@Darren1 masterha]# chmod +x drop_vip.sh init_vip.sh master_ip_failover master_ip_online_change
[root@Darren1 masterha]# cat app1.conf [server default] #mha manager工作目录 manager_workdir = /var/log/masterha/app1 manager_log = /var/log/masterha/app1/app1.log remote_workdir = /var/log/masterha/app1 [server1] hostname=192.168.91.23 master_binlog_dir = /data/mysql/mysql3306/logs candidate_master = 1 check_repl_delay = 0 #用防止master故障时,切换时slave有延迟,卡在那里切不过来。 [server2] hostname=192.168.91.22 master_binlog_dir=/data/mysql/mysql3306/logs candidate_master=1 check_repl_delay=0 [server3] hostname=192.168.91.21 master_binlog_dir=/data/mysql/mysql3306/logs candidate_master=1 check_repl_delay=0
[root@Darren1 masterha]# cat masterha_default.conf [server default] #MySQL的用户和密码 user=admin password = 147258 #系统ssh用户 ssh_user=root #复制用户 repl_user=repl repl_password= 147258 #监控 ping_interval=1 #shutdown_script="" #切换调用的脚本 master_ip_failover_script= /etc/masterha/master_ip_failover master_ip_online_change_script= /etc/masterha/master_ip_online_change
分别传到slave1和slave2上相同位置
[root@Darren1 masterha]# scp -r /etc/masterha 192.168.91.22:/etc/
[root@Darren1 masterha]# scp -r /etc/masterha 192.168.91.21:/etc/
创建一个admin用户,用于启动masterha_manager
root@localhost [(none)]>create user admin@'192.168.91.%' identified by '147258';
root@localhost [(none)]>grant all on *.* to 'admin'@'192.168.91.%';
启动VIP
[root@Darren1 masterha]# sh init_vip.sh
节点之间的主从关系检测
[root@Darren1 masterha]# masterha_check_repl --global_conf=/etc/masterha/masterha_default.conf --conf=/etc/masterha/app1.conf
最后输出:MySQL Replication Health is OK.表示检测通过
节点直接的互信检测
[root@Darren1 masterha]# masterha_check_ssh --global_conf=/etc/masterha/masterha_default.conf --conf=/etc/masterha/app1.conf
Thu May 11 15:24:55 2017 - [info] All SSH connection tests passed successfully.
启动masterh
[root@Darren1 app1]# nohup masterha_manager --global_conf=/etc/masterha/masterha_default.conf --conf=/etc/masterha/app1.conf > /tmp/masterha_manager.log 2>&1 &
查看masterha状态
[root@Darren1 app1]# masterha_check_status --global_conf=/etc/masterha/masterha_default.conf --conf=/etc/masterha/app1.conf
app1 (pid:86957) is running(0:PING_OK), master:192.168.91.23
[root@Darren1 app1]# ps -ef |grep master
root 1371 1 0 May04 ? 00:00:02 /usr/libexec/postfix/master
root 78618 67225 0 08:02 pts/0 00:00:00 perl /usr/bin/masterha_manager --global_conf=/etc/masterha/masterha_default.conf --conf=/etc/masterha/app1.conf
root 79204 67246 0 08:07 pts/2 00:00:00 grep master
通过日志文件查看启动过程:
----------------------------------------------- [root@Darren1 app1]# tail -f app1.log Thu May 11 15:47:15 2017 - [info] MHA::MasterMonitor version 0.56. Thu May 11 15:47:16 2017 - [info] GTID failover mode = 0 Thu May 11 15:47:16 2017 - [info] Dead Servers: Thu May 11 15:47:16 2017 - [info] Alive Servers: --罗列mha中的所有节点 Thu May 11 15:47:16 2017 - [info] 192.168.91.23(192.168.91.23:3306) Thu May 11 15:47:16 2017 - [info] 192.168.91.22(192.168.91.22:3306) Thu May 11 15:47:16 2017 - [info] Alive Slaves: --活跃的从库 Thu May 11 15:47:16 2017 - [info] 192.168.91.22(192.168.91.22:3306) Version=5.7.16-log (oldest major version between slaves) log-bin:enabled Thu May 11 15:47:16 2017 - [info] Replicating from 192.168.91.23(192.168.91.23:3306) Thu May 11 15:47:16 2017 - [info] Primary candidate for the new Master (candidate_master is set) Thu May 11 15:47:16 2017 - [info] Current Alive Master: 192.168.91.23(192.168.91.23:3306) --现在的主库 Thu May 11 15:47:16 2017 - [info] Checking slave configurations.. Thu May 11 15:47:16 2017 - [info] read_only=1 is not set on slave 192.168.91.22(192.168.91.22:3306). Thu May 11 15:47:16 2017 - [info] Checking replication filtering settings.. Thu May 11 15:47:16 2017 - [info] binlog_do_db= , binlog_ignore_db= Thu May 11 15:47:16 2017 - [info] Replication filtering check ok. Thu May 11 15:47:16 2017 - [info] GTID (with auto-pos) is not supported Thu May 11 15:47:16 2017 - [info] Starting SSH connection tests.. Thu May 11 15:47:17 2017 - [info] All SSH connection tests passed successfully. Thu May 11 15:47:17 2017 - [info] Checking MHA Node version.. Thu May 11 15:47:17 2017 - [info] Version check ok. Thu May 11 15:47:17 2017 - [info] Checking SSH publickey authentication settings on the current master.. Thu May 11 15:47:17 2017 - [info] HealthCheck: SSH to 192.168.91.23 is reachable. Thu May 11 15:47:17 2017 - [info] Master MHA Node version is 0.56. Thu May 11 15:47:17 2017 - [info] Checking recovery script configurations on 192.168.91.23(192.168.91.23:3306).. save_binary_logs #检查binlog部分,如果主库挂了,从库需要用到binlog补数据,这个命令是主库故障转移时候用到,这里是command=test是测试,没有实际操作。 Thu May 11 15:47:17 2017 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql/mysql3306/logs --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --start_file=mysql-bin.000029 Thu May 11 15:47:17 2017 - [info] Connecting to [email protected](192.168.91.23:22).. Creating /var/log/masterha/app1 if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /data/mysql/mysql3306/logs, up to mysql-bin.000029 Thu May 11 15:47:17 2017 - [info] Binlog setting check done. #查看从库的relay-log,使用到命令apply_diff_relay_logs,如果主库挂了,并且不能ssh上去也没有binlog server备份,那么选举一个从库作为新主,其他从库需要用到这里的relay-log补数据; Thu May 11 15:47:17 2017 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers.. Thu May 11 15:47:17 2017 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='admin' --slave_host=192.168.91.22 --slave_ip=192.168.91.22 --slave_port=3306 --workdir=/var/log/masterha/app1 --target_version=5.7.16-log --manager_version=0.56 --relay_dir=/data/mysql/mysql3306/data --current_relay_log=relay-bin.000001 --slave_pass=xxx Thu May 11 15:47:17 2017 - [info] Connecting to [email protected](192.168.91.22:22).. Checking slave recovery environment settings.. Relay log found at /data/mysql/mysql3306/data, up to relay-bin.000002 Temporary relay log file is /data/mysql/mysql3306/data/relay-bin.000002 Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Thu May 11 15:47:18 2017 - [info] Slaves settings check done. Thu May 11 15:47:18 2017 - [info] #现在的主库和从库 192.168.91.23(192.168.91.23:3306) (current master) +--192.168.91.22(192.168.91.22:3306) Thu May 11 15:47:18 2017 - [info] Checking master_ip_failover_script status: Thu May 11 15:47:18 2017 - [info] /etc/masterha/master_ip_failover --command=status --ssh_user=root --orig_master_host=192.168.91.23 --orig_master_ip=192.168.91.23 --orig_master_port=3306 Thu May 11 15:47:18 2017 - [info] OK. Thu May 11 15:47:18 2017 - [warning] shutdown_script is not defined. #每隔1秒ping一次 Thu May 11 15:47:18 2017 - [info] Set master ping interval 1 seconds. Thu May 11 15:47:18 2017 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes. Thu May 11 15:47:18 2017 - [info] Starting ping health check on 192.168.91.23(192.168.91.23:3306).. Thu May 11 15:47:18 2017 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond.. ------------------------------------
停止mha
[root@Darren1 app1]# masterha_stop --global_conf=/etc/masterha/masterha_default.conf --conf=/etc/masterha/app1.conf
Stopped app1 successfully.
当主库上的mysql挂了,这个时候会从slave1和slave2中选择一个作为新主,如果选择slave1作为新主,这个时候slave2会把slave1作为新主,以前挂掉的主等修复好之后可以作为slave1的从。
模拟切换
[root@Darren1 app1]# /etc/init.d/mysqld stop
Shutting down MySQL............ SUCCESS!
MHA日志目录下会生成两个文件,其中app1.failover.complete 这个文件下次启动mha之前必须删除或者移走,否则无法启动:
[root@Darren1 app1]# ll
total 24
-rw-r--r-- 1 root root 0 May 10 08:12 app1.failover.complete
-rw-r--r-- 1 root root 17416 May 10 08:12 app1.log
-rw-r--r-- 1 root root 234 May 10 08:12 saved_master_binlog_from_192.168.91.23_3306_20170510081201.binlog
可以看到ip已经切换到从库上面
[root@Darren2 masterha]# ip addr|grep 192
inet 192.168.91.22/24 brd 192.168.91.255 scope global eth0
inet 192.168.91.200/32 scope global eth0
inet 192.168.91.100/32 scope global eth0
slave1上的从库信息也被清除了
root@localhost [(none)]>show slave status\G
Empty set (0.00 sec)
slave2也把slave1作为了新主
root@localhost [(none)]>show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.91.22
Master_User: repl
挂掉的主库怎么办?
可以作为新主的从库:
[root@Darren1 app1]# /etc/init.d/mysqld start
在MHA日志中可以找到master_log_file和master_log_pos:
[root@Darren1 app1]# vim app1.log|grep CHANGE
root@localhost [(none)]>change master to
master_host='192.168.91.22',
master_port=3306,
master_user='repl',
master_password='147258',
Master_Log_File='mysql-bin.000014',
Master_Log_Pos=234;
重新启动mha
[root@Darren1 app1]# nohup masterha_manager --global_conf=/etc/masterha/masterha_default.conf --conf=/etc/masterha/app1.conf > /tmp/masterha_manager.log 2>&1 &
[root@Darren1 app1]# cat app1.master_status.health
94112 0:PING_OK master:192.168.91.22
[root@Darren1 app1]# tail app1.log
Thu May 11 17:06:42 2017 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Thu May 11 17:06:42 2017 - [info] Starting ping health check on 192.168.91.22(192.168.91.22:3306).. --应为这个时候22已经是主库了,所以这里一直ping22是否存在
Thu May 11 17:06:42 2017 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
可能遇到的报错
(1) [root@Darren1 .ssh]# masterha_check_ssh --global_conf=/etc/masterha/masterha_default.conf --conf=/etc/masterha/app1.conf
......
Thu May 11 19:42:40 2017 - [info] Connecting to [email protected](192.168.91.21:22)..
Can't exec "mysqlbinlog": No such file or directory at /usr/share/perl5/vendor_perl/MHA/BinlogManager.pm line 106.
或者执行同样的命令遇到这样的报错:
Testing mysql connection and privileges..sh: mysql: command not found
mysql command failed with rc 127:0!
解决方法:
需要做一个软连接
ln -s /usr/local/mysql/bin/mysqlbinlog /usr/bin/mysqlbinlog
ln -s /usr/local/mysql/bin/mysql /usr/bin/mysql
(2) 在启动masterha_manager的时候可能报错:
[root@Darren1 app1]# tail app1.log
Thu May 11 21:40:18 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln309] Last failover was done at 2017/05/11 19:04:14. Current time is too early to do failover again. If you want to do failover, manually remove /var/log/masterha/app1/app1.failover.complete and run this script again.
Thu May 11 21:40:18 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_manager line 65
原因是没有删除app1.failover.complete,rm就好。