MHA的简要介绍
- MHA全称Master High Availability,也就是主节点的高可用,是目前比较成熟的MySQL高可用解决方案。它的主要功能主要是通过一个manager节点来监控主节点和从节点的状况,并会在主节点发生故障的时候,自动将一个数据最贴近Master的从节点转化成主节点。实现自动的故障转移。
- MHA的变成语言是Perl,需要安装一些软件包来进行编译操作,但是总体的编译过程十分地简单。
实验拓扑
要点以及基础知识
- MHA的组件中主要有两个,一个是Manager节点组件。类似于一个监督者。
Node节点组件则是安装于数据库节点,其中一个作为Master。 - MHA在主节点发生故障时需要进行主节点自动切换,所以必不可少地需要管理员权限。所以多个节点之间需要基于ssh秘钥认证。
- MHA的主要配置在于manager。
主机名 | 主机地址 | 角色 |
---|---|---|
node1 | 192.168.2.201 | Master节点,安装node组件 |
node2 | 192.168.2.202 | Slave节点,安装node组件 |
node3 | 192.168.2.203 | Slave节点,安装node组件 |
node4 | 192.168.2.204 | 安装manager组件 |
本文使用CentOS7.1,数据库:MariaDB-5.5.50
关于半同步复制的详细配置,可以参考我的上一篇文章。由于篇幅问题,这里主要讲如何安装配置和使用MHA组件。
因为数据库版本是MariaDB-5.5.50,所以选择编译在codegoole上面的mha4mysql-0.56
注意:本文关闭了selinux,以及iptables。
Perl编译安装
最新版MHA下载地址:
mha4mysql-manager
mha4mysql-node
题外话
本来代码是在codegoogle上面进行托管的,甚至连一些介绍的主页也是在codegoogle上面的。
但是由于github的出现,很多软件都转移到github上边了。codegoole上面的rpm包很多都已经失效。
因为来历不明的rpm不敢安装在实际环境中,所以选择使用perl编译安装。
(1)在每一个节点上面进行编译环境的安装
yum -y install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Config-IniFiles ncftp perl-Params-Validate perl-CPAN perl-Test-Mock-LWP.noarch perl-LWP-Authen-Negotiate.noarch perl-devel perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker
(2)在node4中安装manager组件
a.使用make Makefile.PL检查编译环境,功能类似于./configure
其实node1~node3这三个配置了半同步复制的数据库节点安装的是node组件,但是也是执行这两步。
一般都不会出错。而且node节点不用额外配置,所以不做重复演示了。
[root@node4 mha4mysql-manager-0.56]# perl Makefile.PL
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- DBI ...loaded. (1.627)
- DBD::mysql ...loaded. (4.023)
- Time::HiRes ...loaded. (1.9725)
- Config::Tiny ...loaded. (2.14)
- Log::Dispatch ...loaded. (2.41)
- Parallel::ForkManager ...loaded. (1.05)
- MHA::NodeConst ...loaded. (0.56)
*** Module::AutoInstall configuration finished.
Writing Makefile for mha4mysql::manager
b.使用make&&make install安装
[root@node4 mha4mysql-manager-0.56]# make&&make install
Skip blib/lib/MHA/ManagerUtil.pm (unchanged)
Skip blib/lib/MHA/Config.pm (unchanged)
Skip blib/lib/MHA/HealthCheck.pm (unchanged)
Skip blib/lib/MHA/ManagerConst.pm (unchanged)
Skip blib/lib/MHA/ServerManager.pm (unchanged)
Skip blib/lib/MHA/ManagerAdmin.pm (unchanged)
Skip blib/lib/MHA/FileStatus.pm (unchanged)
Skip blib/lib/MHA/ManagerAdminWrapper.pm (unchanged)
Skip blib/lib/MHA/MasterFailover.pm (unchanged)
Skip blib/lib/MHA/MasterRotate.pm (unchanged)
Skip blib/lib/MHA/MasterMonitor.pm (unchanged)
Skip blib/lib/MHA/SSHCheck.pm (unchanged)
Skip blib/lib/MHA/Server.pm (unchanged)
Skip blib/lib/MHA/DBHelper.pm (unchanged)
cp bin/masterha_stop blib/script/masterha_stop
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_stop
cp bin/masterha_conf_host blib/script/masterha_conf_host
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_conf_host
cp bin/masterha_check_repl blib/script/masterha_check_repl
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_check_repl
cp bin/masterha_check_status blib/script/masterha_check_status
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_check_status
cp bin/masterha_master_monitor blib/script/masterha_master_monitor
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_master_monitor
cp bin/masterha_check_ssh blib/script/masterha_check_ssh
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_check_ssh
cp bin/masterha_master_switch blib/script/masterha_master_switch
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_master_switch
cp bin/masterha_secondary_check blib/script/masterha_secondary_check
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_secondary_check
cp bin/masterha_manager blib/script/masterha_manager
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/masterha_manager
Manifying blib/man1/masterha_stop.1
Manifying blib/man1/masterha_conf_host.1
Manifying blib/man1/masterha_check_repl.1
Manifying blib/man1/masterha_check_status.1
Manifying blib/man1/masterha_master_monitor.1
Manifying blib/man1/masterha_check_ssh.1
Manifying blib/man1/masterha_master_switch.1
Manifying blib/man1/masterha_secondary_check.1
Manifying blib/man1/masterha_manager.1
Appending installation info to /usr/lib64/perl5/perllocal.pod
数据库节点的配置
半同步复制Master节点Node1
的MariaDB配置文件
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd
innodb_file_per_table = 1
skip_name_resolve = 1
log_bin = Master-log
log_bin_index = 1
server_id = 1
relay_log=relay-log
relay_log_purge=0
#skip-grant-tables
#skip-networking
[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid
#
# include all files from the config directory
#
!includedir /etc/my.cnf.d
这里需要注意的是,
半同步复制主节点和从节点都要启动了二进制日志log_bin = Master-log,中继日志relay_log=relay-log
而且这里关闭了中继日志的修剪功能relay_log_purge=0。因为这由MHA完成。
半同步复制Slave节点Node2
和node3
的MariaDB配置文件
[mysqld]
datadir=/var/lib/mysql/
socket=/var/lib/mysql/mysql.sock
log_bin=Master-bin
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
# Settings user and group are ignored when systemd is used.
# If you need to run mysqld under a different user or group,
# customize your systemd unit file for mariadb according to the
# instructions in http://fedoraproject.org/wiki/Systemd
skip_name_resolve=true
innodb_file_per_table=ture
server_id = 2
log_bin=bin_log
relay_log=relay-log
read_only = 1
relay_log_purge=0
[mysqld_safe]
log-error=/var/log/mariadb/mariadb.log
pid-file=/var/run/mariadb/mariadb.pid
#
# include all files from the config directory
#
!includedir /etc/my.cnf.d
这里比Master节点多一个read_only=1
假如Slave节点被提升为Master节点的话,MHA会自动将这个read_only=1去掉
并且会将修改其他Slave节点指向新的主节点,可以用show slave status\G查看。
Manager节点配置
(1)复制默认文件作为模板,并清空默认配置
cp /etc/masterha/masterha_default.cnf /etc/masterha/app1.cnf
> /etc/masterha/masterha_default.cnf
(2)配置/etc/masterha/app1.cnf,用于启动manager进程的时候指定。
MHA的一个manager节点可以通过启动多个进程来监控多个MHA集群,所以使用app1,app2的方式。
[server default]
#manager_workdir=/var/log/masterha/app1
#manager_log=/var/log/masterha/app1/manager.log
user=root
password=123456789
manager_workdir=/data/masterha/app1
manager_log=/data/masterha/app1/manager.log
remote_workdir=/data/masterha/app1
ssh_user=root
repl_user=repuser
repl_password=repuser
ping_interval=1
[server1]
hostname=node1
candidate_master=1
[server2]
hostname=node2
candidate_master=1
[server3]
hostname=node3
这里的user和password指的是数据库管理员的账号密码
repl_user和repl_password是具有复制权限的用户和密码
ssh_user=root是ssh的账户,由于是秘钥认证,并不需要密码
配置文件中,hostname=node1是因为主机可以使用node1访问到该主机,这里也可以用ip地址。
(3)创建配置文件中manager_workdir的工作路径
mkdir /data/masterha/app1/
利用MHA的工具测试环境是否正常
(1)测试ssh是否连接正常
[root@node4 mha4mysql-manager-0.56]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
Thu Nov 10 22:59:03 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Nov 10 22:59:03 2016 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Thu Nov 10 22:59:03 2016 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Thu Nov 10 22:59:03 2016 - [info] Starting SSH connection tests..
Thu Nov 10 22:59:04 2016 - [debug]
Thu Nov 10 22:59:03 2016 - [debug] Connecting via SSH from root@node1(192.168.2.201:22) to root@node2(192.168.2.202:22)..
Thu Nov 10 22:59:03 2016 - [debug] ok.
Thu Nov 10 22:59:03 2016 - [debug] Connecting via SSH from root@node1(192.168.2.201:22) to root@node3(192.168.2.203:22)..
Thu Nov 10 22:59:03 2016 - [debug] ok.
Thu Nov 10 22:59:04 2016 - [debug]
Thu Nov 10 22:59:03 2016 - [debug] Connecting via SSH from root@node2(192.168.2.202:22) to root@node1(192.168.2.201:22)..
Thu Nov 10 22:59:04 2016 - [debug] ok.
Thu Nov 10 22:59:04 2016 - [debug] Connecting via SSH from root@node2(192.168.2.202:22) to root@node3(192.168.2.203:22)..
Thu Nov 10 22:59:04 2016 - [debug] ok.
Thu Nov 10 22:59:05 2016 - [debug]
Thu Nov 10 22:59:04 2016 - [debug] Connecting via SSH from root@node3(192.168.2.203:22) to root@node1(192.168.2.201:22)..
Thu Nov 10 22:59:04 2016 - [debug] ok.
Thu Nov 10 22:59:04 2016 - [debug] Connecting via SSH from root@node3(192.168.2.203:22) to root@node2(192.168.2.202:22)..
Thu Nov 10 22:59:05 2016 - [debug] ok.
Thu Nov 10 22:59:05 2016 - [info] All SSH connection tests passed successfully.
这么多输出信息,其实只看最后一句就知道ssh是否正常了
这里需要注意的是这里指定了刚才配置的app1.
(2)测试复制功能是否正常
[root@node4 mha4mysql-manager-0.56]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Thu Nov 10 23:07:35 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Nov 10 23:07:35 2016 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Thu Nov 10 23:07:35 2016 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Thu Nov 10 23:07:35 2016 - [info] MHA::MasterMonitor version 0.56.
Thu Nov 10 23:07:35 2016 - [info] GTID failover mode = 0
Thu Nov 10 23:07:35 2016 - [info] Dead Servers:
Thu Nov 10 23:07:35 2016 - [info] Alive Servers:
Thu Nov 10 23:07:35 2016 - [info] node1(192.168.2.201:3306)
Thu Nov 10 23:07:35 2016 - [info] node2(192.168.2.202:3306)
Thu Nov 10 23:07:35 2016 - [info] node3(192.168.2.203:3306)
Thu Nov 10 23:07:35 2016 - [info] Alive Slaves:
Thu Nov 10 23:07:35 2016 - [info] node2(192.168.2.202:3306) Version=5.5.50-MariaDB (oldest major version between slaves) log-bin:enabled
Thu Nov 10 23:07:35 2016 - [info] Replicating from 192.168.2.201(192.168.2.201:3306)
Thu Nov 10 23:07:35 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 10 23:07:35 2016 - [info] node3(192.168.2.203:3306) Version=5.5.50-MariaDB (oldest major version between slaves) log-bin:enabled
Thu Nov 10 23:07:35 2016 - [info] Replicating from 192.168.2.201(192.168.2.201:3306)
Thu Nov 10 23:07:35 2016 - [info] Current Alive Master: node1(192.168.2.201:3306)
Thu Nov 10 23:07:35 2016 - [info] Checking slave configurations..
Thu Nov 10 23:07:35 2016 - [warning] relay_log_purge=0 is not set on slave node3(192.168.2.203:3306).
Thu Nov 10 23:07:35 2016 - [info] Checking replication filtering settings..
Thu Nov 10 23:07:35 2016 - [info] binlog_do_db= , binlog_ignore_db=
Thu Nov 10 23:07:35 2016 - [info] Replication filtering check ok.
Thu Nov 10 23:07:35 2016 - [info] GTID (with auto-pos) is not supported
Thu Nov 10 23:07:35 2016 - [info] Starting SSH connection tests..
Thu Nov 10 23:07:37 2016 - [info] All SSH connection tests passed successfully.
Thu Nov 10 23:07:37 2016 - [info] Checking MHA Node version..
Thu Nov 10 23:07:37 2016 - [info] Version check ok.
Thu Nov 10 23:07:37 2016 - [info] Checking SSH publickey authentication settings on the current master..
Thu Nov 10 23:07:37 2016 - [info] HealthCheck: SSH to node1 is reachable.
Thu Nov 10 23:07:37 2016 - [info] Master MHA Node version is 0.56.
Thu Nov 10 23:07:37 2016 - [info] Checking recovery script configurations on node1(192.168.2.201:3306)..
Thu Nov 10 23:07:37 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/lib/mysql,/var/log/mysql --output_file=/data/masterha/app1/save_binary_logs_test --manager_version=0.56 --start_file=Master-log.000006
Thu Nov 10 23:07:37 2016 - [info] Connecting to [email protected](node1:22)..
Creating /data/masterha/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /var/lib/mysql, up to Master-log.000006
Thu Nov 10 23:07:38 2016 - [info] Binlog setting check done.
Thu Nov 10 23:07:38 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Thu Nov 10 23:07:38 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=node2 --slave_ip=192.168.2.202 --slave_port=3306 --workdir=/data/masterha/app1 --target_version=5.5.50-MariaDB --manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Thu Nov 10 23:07:38 2016 - [info] Connecting to [email protected](node2:22)..
Checking slave recovery environment settings..
Opening /var/lib/mysql/relay-log.info ... ok.
Relay log found at /var/lib/mysql, up to relay-log.000004
Temporary relay log file is /var/lib/mysql/relay-log.000004
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Nov 10 23:07:38 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=node3 --slave_ip=192.168.2.203 --slave_port=3306 --workdir=/data/masterha/app1 --target_version=5.5.50-MariaDB --manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Thu Nov 10 23:07:38 2016 - [info] Connecting to [email protected](node3:22)..
Checking slave recovery environment settings..
Opening /var/lib/mysql/relay-log.info ... ok.
Relay log found at /var/lib/mysql, up to relay-log.000002
Temporary relay log file is /var/lib/mysql/relay-log.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Nov 10 23:07:38 2016 - [info] Slaves settings check done.
Thu Nov 10 23:07:38 2016 - [info]
node1(192.168.2.201:3306) (current master)
+--node2(192.168.2.202:3306)
+--node3(192.168.2.203:3306)
Thu Nov 10 23:07:38 2016 - [info] Checking replication health on node2..
Thu Nov 10 23:07:38 2016 - [info] ok.
Thu Nov 10 23:07:38 2016 - [info] Checking replication health on node3..
Thu Nov 10 23:07:38 2016 - [info] ok.
Thu Nov 10 23:07:38 2016 - [warning] master_ip_failover_script is not defined.
Thu Nov 10 23:07:38 2016 - [warning] shutdown_script is not defined.
Thu Nov 10 23:07:38 2016 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
(3)最激动人心的时刻到了,启动服务!
[root@node4 mha4mysql-manager-0.56]# nohup masterha_manager --conf=/etc/masterha/app1.cnf > /data/masterha/app1/manager.log 2>&1 &
[1] 8463
(4)查看masterha是否正在正常运行,还有主节点信息。
[root@node4 mha4mysql-manager-0.56]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:8463) is running(0:PING_OK), master:node1
模拟MHA故障
(1)Master节点·node1·关闭MariaDB
systemctl stop mariadb.service
(2)查看manager节点的状况
[root@node4 mha4mysql-manager-0.56]# masterha_check_status --conf=/etc/masterha/app1.cnfapp1 is stopped(2:NOT_RUNNING).
[1]+ Done nohup masterha_manager --conf=/etc/masterha/app1.cnf > /data/masterha/app1/manager.log 2>&1
可以看出MHA程序masterha_manager已经退出了
同时还要注意一点,在工作路径/data/masterha/app1/下会生成一个app1.failover.complete的文件。
如果需要启动的时候,最好删除这个文件,否则会启动失败。
(3)去node3查看slave信息,node3指向新的Master节点。
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.2.202
Master_User: repuser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bin_log.000002
Read_Master_Log_Pos: 245
Relay_Log_File: relay-log.000002
Relay_Log_Pos: 527
Relay_Master_Log_File: bin_log.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 245
Relay_Log_Space: 815
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2
(4)node2原本作为从节点所设置的只读属性也自动取消了。
MariaDB [(none)]> show variables like '%read_only%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only | OFF |
+---------------+-------+
1 row in set (0.00 sec)
(5)灾后重建的步骤
我们知道,当时原有master故障的时候,masterha_manager会通过二进制日志和中继日志的状况,选举出新的master节点,并由只读状态改为可读写的状态会退出。
所以接下来要怎么做呢?
a.删除工作路径下的failover.complete文件。
如/data/masterha/app1/app1.failover.complete
b.原有的master,也就是node1节点。
需要清空数据库,再将node2全备一次,恢复到node1上面来
并配置node1为Slave节点,并指向新的节点node2
c.重新通过masterha_check等工具检测环境是否正常,并重新启动MHA的主程序masterha_manager。