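Background: MySQL high availability built on dual-master MySQL plus keepalived.
Environment:
master: 172.16.3.5 (TiDB-node1)
slave: 172.16.3.7 (TiDB-node3)
VIP: 172.16.3.100
Problem: After keepalived starts on the master, it first enters BACKUP state. Once the check script succeeds it enters MASTER state and takes the VIP. When keepalived is then started on the slave, it also enters BACKUP state first. Normally the slave should learn from the master's advertisements that the VRRP group already has a MASTER and remain in BACKUP state. Instead, once its own check script succeeded, the slave also entered MASTER state and also took the VIP.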
keepalived configuration on the MASTER:
    vrrp_script vs_mysql_82 {    # define the check script
        script "/usr/local/python/bin/python /etc/keepalived/checkMySQL.py -h 172.16.3.5 -P 3306"
        interval 60              # seconds between script runs
    }
    vrrp_instance VI_82 {
        state BACKUP             # both nodes start in BACKUP state
        # no preemption: after the old master drops to FAULT and recovers,
        # it stays BACKUP and is not promoted back to MASTER
        nopreempt
        interface eth0           # interface to bind to
        virtual_router_id 172    # router ID; instances with the same ID form one group
        priority 100             # priority
        advert_int 5             # keepalived advertisement interval, in seconds
        authentication {
            auth_type PASS
            auth_pass 1111
        }
        track_script {
            vs_mysql_82          # the check script
        }
        virtual_ipaddress {
            172.16.3.100
        }
    }
keepalived configuration on the SLAVE:
    vrrp_script vs_mysql_82 {
        script "/usr/local/python/bin/python /etc/keepalived/checkMySQL.py -h 172.16.3.7 -P 3306"
        interval 60
    }
    vrrp_instance VI_82 {
        state BACKUP
        nopreempt
        interface eth0
        virtual_router_id 172
        priority 90
        advert_int 5
        authentication {
            auth_type PASS
            auth_pass 1111
        }
        track_script {
            vs_mysql_82
        }
        virtual_ipaddress {
            172.16.3.100
        }
    }
My first thought was split brain, but the two nodes could ping each other. Running ip addr show locally on the master and on the slave gave the following:
master:
    [root@TiDB-node1 ~]# ip addr show
    1: lo: mtu 65536 qdisc noqueue state UNKNOWN
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: eth0: mtu 1500 qdisc mq state UP qlen 1000
        link/ether 00:0c:29:20:ce:b4 brd ff:ff:ff:ff:ff:ff
        inet 172.16.3.5/22 brd 172.16.3.255 scope global eth0
        inet 172.16.3.100/32 scope global eth0
        inet6 fe80::20c:29ff:fe20:ceb4/64 scope link
           valid_lft forever preferred_lft forever
slave:
    [root@TiDB-node1 keepalived]# ip addr show
    1: lo: mtu 65536 qdisc noqueue state UNKNOWN
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    2: eth0: mtu 1500 qdisc mq state UP qlen 1000
        link/ether 00:0c:29:20:ce:b4 brd ff:ff:ff:ff:ff:ff
        inet 172.16.3.5/22 brd 172.16.3.255 scope global eth0
        inet 172.16.3.100/32 scope global eth0
        inet6 fe80::20c:29ff:fe20:ceb4/64 scope link
           valid_lft forever preferred_lft forever
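Checking for the VIP by eye on each node is easy to get wrong, so a small helper can make the check explicit. The following is only a sketch: the function name `holds_vip` is hypothetical, and it assumes the one-line-per-address output of `ip -o -4 addr show`.

```shell
# Succeeds (exit 0) if the IPv4 address given as $1 appears in
# `ip -o -4 addr show` output supplied on stdin.
holds_vip() {
    # -F: fixed string, -q: quiet; the trailing "/" stops 172.16.3.10
    # from matching 172.16.3.100
    grep -qF "inet $1/"
}

# Usage on each node (eth0 and the VIP are taken from the config above):
#   ip -o -4 addr show dev eth0 | holds_vip 172.16.3.100 \
#       && echo "VIP is bound on $(hostname)"
```

If this reports the VIP on both nodes at once, both keepalived instances believe they are MASTER.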
Then connect to MySQL through the VIP and run select @@hostname:
    [root@private-STG4 ~]# mysql -urpl -h172.16.3.100 -p -P3306
    Enter password:
    Welcome to the MySQL monitor. Commands end with ; or \g.
    Your MySQL connection id is 10
    Server version: 5.7.17-log MySQL Community Server (GPL)
    Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective owners.
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    rpl@mysqldb 16:12: [(none)]> select @@hostname;
    +------------+
    | @@hostname |
    +------------+
    | TiDB-node3 |
    +------------+
    1 row in set (0.00 sec)
Further verification:
1. Run the following on both the master and the slave:
    tcpdump -i eth0 host 172.16.3.100 -vvvv
2. Stop keepalived on the slave:
    /etc/init.d/keepalived stop
3. On any machine other than the master and the slave, run:
    ping 172.16.3.100
4. At this point the master shows:
    [root@TiDB-node1 keepalived]# tcpdump -i eth0 host 172.16.3.100 -vvvv
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    15:28:11.968811 IP (tos 0x0, ttl 64, id 57379, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.3.15 > 172.16.3.100: ICMP echo request, id 2057, seq 54, length 64
    15:28:12.968815 IP (tos 0x0, ttl 64, id 57380, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.3.15 > 172.16.3.100: ICMP echo request, id 2057, seq 55, length 64
    15:28:13.968840 IP (tos 0x0, ttl 64, id 57381, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.3.15 > 172.16.3.100: ICMP echo request, id 2057, seq 56, length 64
    15:28:14.968870 IP (tos 0x0, ttl 64, id 57382, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.3.15 > 172.16.3.100: ICMP echo request, id 2057, seq 57, length 64
    15:28:15.968872 IP (tos 0x0, ttl 64, id 57383, offset 0, flags [DF], proto ICMP (1), length 84)
5. Start keepalived again on the slave:
    /etc/init.d/keepalived start
6. The master shows:
    15:28:42.097462 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:42.097693 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:42.097706 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:42.097711 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:42.097715 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:47.098555 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:47.098773 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:47.098783 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:47.098786 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
    15:28:47.098789 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 46
7. The slave shows:
    [root@TiDB-node3 keepalived]# tcpdump -i eth0 host 172.16.3.100 -vvvvv
    tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    15:28:42.102540 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 28
    15:28:42.102614 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 28
    15:28:42.102620 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 28
    15:28:42.102625 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 28
    15:28:42.102636 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 172.16.3.100 (Broadcast) tell 172.16.3.100, length 28
    15:28:42.974138 IP (tos 0x0, ttl 64, id 57410, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.3.15 > 172.16.3.100: ICMP echo request, id 2057, seq 85, length 64
    15:28:42.974268 IP (tos 0x0, ttl 64, id 41230, offset 0, flags [none], proto ICMP (1), length 84)
        172.16.3.100 > 172.16.3.15: ICMP echo reply, id 2057, seq 85, length 64
    15:28:43.974149 IP (tos 0x0, ttl 64, id 57411, offset 0, flags [DF], proto ICMP (1), length 84)
        172.16.3.15 > 172.16.3.100: ICMP echo request, id 2057, seq 86, length 64
The output above shows clearly that the slave has taken over the VIP. The VIP is still listed on the master in ip addr show, but that copy no longer receives traffic and cannot serve clients.
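The conclusion is that the keepalived instances on the master and the slave cannot communicate: the slave hears nothing from the master, so it takes the VIP. The remaining question is why the two cannot communicate:
- check your firewall to ensure packets aren't being caught
- check your networking to ensure em1 is the same network on both machines
After some Googling, I found two likely causes: a firewall blocking the traffic between the two nodes, or the wrong interface name in the configuration. The second could be ruled out, which left only the first. On inspection the firewall was indeed enabled, with only ports 22, 80 and 3306 open. After the firewall was turned off on both nodes, keepalived worked normally.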
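The captures earlier filter on host 172.16.3.100, so they cannot show VRRP advertisements, which each node sends from its own address to the multicast group 224.0.0.18. A more direct check is to capture IP protocol 112 and list who is advertising. The sketch below assumes tcpdump's default one-line output (with -nn and without -v); the function name `vrrp_senders` is hypothetical.

```shell
# Reads `tcpdump -nn` text on stdin and prints each distinct source
# address that sent a packet to the VRRP multicast group 224.0.0.18.
vrrp_senders() {
    # Expected line shape: <time> IP <src> > 224.0.0.18: VRRPv2, ...
    awk '$2 == "IP" && $4 == ">" && $5 ~ /^224\.0\.0\.18:/ { print $3 }' | sort -u
}

# Usage (as root, on either node; -c 10 stops after ten packets):
#   tcpdump -i eth0 -nn -c 10 'ip proto 112' 2>/dev/null | vrrp_senders
```

In a healthy pair only the current MASTER advertises, so one address is expected. Seeing both 172.16.3.5 and 172.16.3.7 means both nodes consider themselves MASTER. Because tcpdump captures packets before the local firewall filters them, advertisements can show up here even though keepalived never receives them.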
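keepalived communicates over VRRP, which is its own IP protocol (number 112, sent to the multicast address 224.0.0.18 by default) rather than TCP or UDP. Opening ports 22, 80 and 3306 therefore does not let it through.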
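Rather than turning the firewall off, the VRRP traffic can be allowed explicitly. The following is a sketch only: it assumes the firewall is iptables (the write-up does not say which one was in use), that eth0 is the VRRP interface as in the configs above, and that the default multicast address is used because no unicast peers are configured. The function name `vrrp_accept_rules` is hypothetical. It prints the rules for review instead of applying them.

```shell
# Prints (does not apply) iptables rules that accept VRRP, IP protocol 112,
# to the multicast group 224.0.0.18 on the interface given as $1.
vrrp_accept_rules() {
    iface="$1"
    # Incoming advertisements from the peer node
    echo "iptables -I INPUT -i $iface -p 112 -d 224.0.0.18/32 -j ACCEPT"
    # Outgoing advertisements; only matters if the OUTPUT chain is restrictive
    echo "iptables -I OUTPUT -o $iface -p 112 -d 224.0.0.18/32 -j ACCEPT"
}

# Usage: review the output, then apply as root on BOTH nodes:
#   vrrp_accept_rules eth0
#   vrrp_accept_rules eth0 | sh
```

Rules added this way last only until the next firewall reload or reboot, so they would also need to be saved by whatever mechanism the distribution uses.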
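Addendum, 2018-05-14:
The messages log contained a large number of entries like the following, and new ones were being written continuously. This behavior was also caused by the firewall.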
    May 11 11:17:30 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:17:35 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:17:35 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:17:40 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:17:40 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:17:45 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:17:45 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:17:50 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:17:50 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:17:55 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:17:55 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:00 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:00 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:05 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:05 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:10 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:10 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:15 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:15 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:20 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:20 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:25 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:25 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:30 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:30 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:35 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:35 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:40 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
    May 11 11:18:40 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Sending gratuitous ARPs on eth0 for 172.16.3.100
    May 11 11:18:45 TiDB-node1 Keepalived_vrrp[6923]: VRRP_Instance(VI_82) Received lower prio advert, forcing new election
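To tell whether this election loop is still going on, the matching log lines can be counted over time. This is a minimal sketch; the function name `count_forced_elections` is hypothetical, and the log path in the usage line assumes syslog writes to /var/log/messages.

```shell
# Prints how many "forcing new election" entries the log file $1 contains.
# Note: grep -c exits non-zero when the count is 0, but still prints 0.
count_forced_elections() {
    grep -c 'Received lower prio advert, forcing new election' "$1"
}

# Usage: run twice, some seconds apart; a growing number means
# the two nodes are still contending for MASTER.
#   count_forced_elections /var/log/messages
```

With advert_int 5, as configured here, the count grows by roughly one every five seconds while the problem persists.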