《童虎学习笔记》 Handling the Failure of More Than Half the MGR Nodes with ProxySQL, in 14 Minutes

Companion video for this article: https://www.ixigua.com/7086085500540289572?id=7088719800846778910
All articles in this column: https://blog.csdn.net/tonghu_note/category_11755726.html
Master table of contents: https://blog.csdn.net/tonghu_note/article/details/124333034

You can also find the companion video on my Douyin account: aa10246666


I. Lab Environment

Primary node (node1): MySQL 8.0.28, 10.211.55.9
Secondary1 node (node2): MySQL 8.0.28, 10.211.55.4
Secondary2 node (node3): MySQL 8.0.28, 10.211.55.6
ProxySQL node (node4): ProxySQL 2.2.0, 10.211.55.7
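
The steps below assume ProxySQL has already been set up for MGR with four hostgroups: 1 as the writer group, 2 as the backup-writer group, 3 as the reader group, and 4 as the offline group (these numbers match the outputs shown later). A minimal sketch of that assumed configuration on the ProxySQL admin port 6032 might look like this; the values are illustrative, not taken from the original setup:

Admin> insert into mysql_group_replication_hostgroups
       (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup,
        active, max_writers, writer_is_also_reader, max_transactions_behind)
       values (1, 2, 3, 4, 1, 1, 0, 100);
Admin> load mysql servers to runtime;
Admin> save mysql servers to disk;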

II. Simulating the Failure of More Than Half the Nodes

We simulate both Secondary nodes of the MGR cluster going down because of failures.

1. Take down the first node: manually kill the mysqld service on node2

Find the mysqld process IDs, then kill them:

root@node2:~# ps aux|grep mysqld
root     1155011  0.0  0.0   2064  1412 pts/0    S    10:52   0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --user=mysql
mysql    1155290  1.3 28.6 1919508 581212 pts/0  Sl   10:52   0:01 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=node2.err --pid-file=node2.pid
root     1155950  0.0  0.0   5908   648 pts/0    S+   10:54   0:00 grep mysqld
root@node2:~# kill -9 1155011 1155290

2. Check the MGR cluster status

You can see that node2's state changes from ONLINE to UNREACHABLE, and then node2 disappears from the cluster.

mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1       | ONLINE       | PRIMARY     |
| node2       | UNREACHABLE  | SECONDARY   |
| node3       | ONLINE       | SECONDARY   |
+-------------+--------------+-------------+

... wait a moment

mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1       | ONLINE       | PRIMARY     |
| node3       | ONLINE       | SECONDARY   |
+-------------+--------------+-------------+
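
Why does node2 first show as UNREACHABLE and only drop out of the list a little later? The surviving majority waits for the member expel timeout before evicting an unreachable member. If you want to see that setting, you can check it on node1 (a standard MySQL 8.0 variable; this lab simply uses its default):

mysql> show variables like 'group_replication_member_expel_timeout';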

3. Verify that the cluster still works at this point

Connect to the ProxySQL node's application port 6033 and check whether the cluster works normally:

root@node4:/var/lib/proxysql# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '

Admin> use d1
Admin> show tables;
+--------------+
| Tables_in_d1 |
+--------------+
| t2           |
+--------------+
1 row in set (0.01 sec)

Admin> insert into t2 select 1;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

Admin> select * from t2;
+----+
| id |
+----+
|  1 |
+----+
1 row in set (0.00 sec)

Admin> 

As you can see, MGR + ProxySQL still works normally, which shows that losing a single node does not affect the cluster.
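
If you want to double-check on node1 that the group still has a quorum after losing node2, one quick way is to count the members the group still sees as ONLINE; with 2 of the original 3 members ONLINE, a majority remains and the group keeps accepting writes:

mysql> select count(*) as online_members
       from performance_schema.replication_group_members
       where member_state = 'ONLINE';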

4. Take down another node: manually kill the mysqld service on node3

root@node3:~# ps aux|grep mysqld
root       69807  0.0  0.0   2064  1520 ?        S    09:49   0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --user=mysql
mysql      70098  0.9 29.9 1949132 607712 ?      Sl   09:49   0:26 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=node3.err --pid-file=node3.pid
root       70524  0.0  0.0   5908   644 pts/1    R+   10:35   0:00 grep mysqld
root@node3:~# kill -9 69807 70098

5. Check the MGR cluster status again

You can see that node3's state changes from ONLINE to UNREACHABLE and then stays there with no further change. This is because, with two of the three members gone, the surviving node no longer has a majority, so it can neither expel node3 nor process transactions.

mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1       | ONLINE       | PRIMARY     |
| node3       | UNREACHABLE  | SECONDARY   |
+-------------+--------------+-------------+
2 rows in set (0.00 sec)

。。。等一会儿

mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1       | ONLINE       | PRIMARY     |
| node3       | UNREACHABLE  | SECONDARY   |
+-------------+--------------+-------------+
2 rows in set (0.00 sec)

6. Check whether the cluster still works at this point

Connect to the ProxySQL node's application port 6033 and check whether the cluster works:

root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '

Admin> use d1
No connection. Trying to reconnect...
Connection id:    28
Current database: *** NONE ***

Database changed
Admin> select database();
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id:    29
Current database: d1

ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id:    30
Current database: d1

ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
Admin> 

You can see that the cluster is no longer usable.

Connect to the ProxySQL node's admin port 6032 and check the state of the cluster members:

root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '


Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname    | status  |
+--------------+-------------+---------+
| 3            | 10.211.55.6 | SHUNNED |
| 4            | 10.211.55.9 | ONLINE  |
| 4            | 10.211.55.4 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.00 sec)

Admin> 

You can see that two nodes are in the SHUNNED (offline) state, and the only ONLINE node has been moved to hostgroup 4 (the offline hostgroup), so the whole cluster is now unusable.
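
How does ProxySQL decide this? Its MGR support typically polls a helper view in the sys schema (created by the widely used addition_to_sys.sql script; this walkthrough assumes that setup). Once node1 loses quorum, the view no longer reports it as a viable candidate, so ProxySQL shunts it into offline hostgroup 4. You can query the view directly on node1:

mysql> select * from sys.gr_member_routing_candidate_status;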


III. Failure Handling

1. On the only ONLINE node, node1, run the following commands so that the group treats it as the only healthy member

set global group_replication_force_members='10.211.55.9:33061';

set global group_replication_force_members='';
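
Two details matter here: the address must be the member's group-communication endpoint (the host:port from group_replication_local_address, 33061 in this lab, not the 3306 client port), and the variable must be cleared again once the group has re-formed. A sketch of the full sequence on node1, with a verification query in between:

mysql> show variables like 'group_replication_local_address';
mysql> set global group_replication_force_members = '10.211.55.9:33061';
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
mysql> set global group_replication_force_members = '';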

2. Check the cluster status again; you can see that node3 has been removed from the group

mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1       | ONLINE       | PRIMARY     |
+-------------+--------------+-------------+
1 row in set (0.00 sec)

mysql> 

3. Check whether the cluster can work normally at this point

Connect to the ProxySQL node's admin port 6032: write hostgroup 1 is now ONLINE, but read hostgroup 3 is still SHUNNED:

root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '

Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname    | status  |
+--------------+-------------+---------+
| 1            | 10.211.55.9 | ONLINE  |
| 4            | 10.211.55.4 | SHUNNED |
| 3            | 10.211.55.6 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.01 sec)

Admin> 

From the output above you can see that node1 has been moved from hostgroup 4 to hostgroup 1. This move was made automatically by ProxySQL, not by any manual command from us.

Then connect to the ProxySQL node's application port 6033 and check whether the cluster works:

root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '

Admin> use d1
No connection. Trying to reconnect...
Connection id:    44
Current database: *** NONE ***

Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
Admin> insert into t2 select 2;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

Admin> select * from t2;
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id:    45
Current database: d1

ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id:    46
Current database: d1

ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
Admin> 

As you can see, the INSERT statement succeeds, but every other statement fails.

This is because the INSERT is routed to write hostgroup 1, where node1 is ONLINE and can serve requests, while the other statements are routed to hostgroup 3, which has no healthy node. In other words, the cluster has so far only recovered its write node and is not yet in a normal state.
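
This read/write split comes from the ProxySQL query rules assumed for this setup: writes fall through to writer hostgroup 1 (typically via the user's default hostgroup), while SELECTs are matched by a rule and routed to reader hostgroup 3. A minimal sketch of such rules on the admin port (rule IDs are illustrative, not taken from the original configuration):

Admin> insert into mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
       values (100, 1, '^SELECT.*FOR UPDATE', 1, 1);
Admin> insert into mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
       values (200, 1, '^SELECT', 3, 1);
Admin> load mysql query rules to runtime;
Admin> save mysql query rules to disk;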

4. Bring node2 back up, rejoin it to the cluster, and check the cluster status

root@node2:~# mysqld_safe --user=mysql &

root@node2:~# mysql -uroot -proot

mysql> start group_replication;
Query OK, 0 rows affected (19.59 sec)

mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1       | ONLINE       | PRIMARY     |
| node2       | ONLINE       | SECONDARY   |
+-------------+--------------+-------------+
2 rows in set (0.01 sec)

mysql> 

5. Check whether the cluster works normally now

Connect to the ProxySQL node's admin port 6032: write hostgroup 1 is ONLINE, and in read hostgroup 3 one node is ONLINE while the other is still SHUNNED:

root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '

Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname    | status  |
+--------------+-------------+---------+
| 1            | 10.211.55.9 | ONLINE  |
| 3            | 10.211.55.4 | ONLINE  |
| 3            | 10.211.55.6 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.01 sec)

Admin> 

From the output above you can see that node2 has been moved into hostgroup 3. Again, this move was made automatically by ProxySQL, not by any manual command from us.

Then connect to the ProxySQL node's application port 6033 and check whether the cluster works:

root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '

Admin> use d1

Admin> show tables;
+--------------+
| Tables_in_d1 |
+--------------+
| t2           |
+--------------+
1 row in set (0.01 sec)

Admin> select * from t2;
+----+
| id |
+----+
|  1 |
|  2 |
+----+
2 rows in set (0.00 sec)

Admin> insert into t2 select 3;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

Admin> 

As shown above, both the insert and the queries work again, which means the cluster has recovered. Once node3 is rejoined to MGR as well, the cluster will be fully back to its previous normal state.
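
Rejoining node3 would follow exactly the same steps used for node2, sketched below (assuming node3's mysqld is started the same way as before):

root@node3:~# mysqld_safe --user=mysql &
root@node3:~# mysql -uroot -proot

mysql> start group_replication;
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;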


IV. Summary

When a similar failure occurs, the following commands are crucial; be sure to remember them:

set global group_replication_force_members='10.211.55.9:33061';

set global group_replication_force_members='';
