Companion video for this article | https://www.ixigua.com/7086085500540289572?id=7088719800846778910 |
All articles in this column | https://blog.csdn.net/tonghu_note/category_11755726.html |
Master table of contents | https://blog.csdn.net/tonghu_note/article/details/124333034 |
Primary node (node1) | MySQL 8.0.28 | 10.211.55.9 |
Secondary1 node (node2) | MySQL 8.0.28 | 10.211.55.4 |
Secondary2 node (node3) | MySQL 8.0.28 | 10.211.55.6 |
ProxySQL node (node4) | ProxySQL 2.2.0 | 10.211.55.7 |
In this article we simulate both Secondary nodes of the MGR cluster going down because of failures.
First we take node2 down. Find the mysqld process IDs and kill them:
root@node2:~# ps aux|grep mysqld
root 1155011 0.0 0.0 2064 1412 pts/0 S 10:52 0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --user=mysql
mysql 1155290 1.3 28.6 1919508 581212 pts/0 Sl 10:52 0:01 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=node2.err --pid-file=node2.pid
root 1155950 0.0 0.0 5908 648 pts/0 S+ 10:54 0:00 grep mysqld
root@node2:~# kill -9 1155011 1155290
You can see that node2's state changes from ONLINE to UNREACHABLE, after which node2 is expelled and disappears from the cluster:
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1       | ONLINE       | PRIMARY     |
| node2       | UNREACHABLE  | SECONDARY   |
| node3       | ONLINE       | SECONDARY   |
+-------------+--------------+-------------+

... after a short wait ...

mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1       | ONLINE       | PRIMARY     |
| node3       | ONLINE       | SECONDARY   |
+-------------+--------------+-------------+
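How long an UNREACHABLE member stays in the group before being expelled is controlled by the group_replication_member_expel_timeout system variable (in seconds; the default is 5 in recent 8.0 releases). If you want to check or tune it, you can run this on any member:

mysql> select @@group_replication_member_expel_timeout;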
Connect to the application port 6033 on the ProxySQL node to check whether the cluster still works:
root@node4:/var/lib/proxysql# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '
Admin> use d1
Admin> show tables;
+--------------+
| Tables_in_d1 |
+--------------+
| t2 |
+--------------+
1 row in set (0.01 sec)
Admin> insert into t2 select 1;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0
Admin> select * from t2;
+----+
| id |
+----+
| 1 |
+----+
1 row in set (0.00 sec)
Admin>
As you can see, MGR + ProxySQL still works normally: losing a single node does not affect the cluster.
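At this point you could also peek at the ProxySQL admin port 6032 to see how ProxySQL views the failed node. This check is not captured in the original walkthrough, but with the group-replication hostgroup configuration used in this series you would expect node2 (10.211.55.4) to show up as SHUNNED or to be moved to the offline hostgroup:

Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;

Next, we take node3 down as well: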
root@node3:~# ps aux|grep mysqld
root 69807 0.0 0.0 2064 1520 ? S 09:49 0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --user=mysql
mysql 70098 0.9 29.9 1949132 607712 ? Sl 09:49 0:26 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=node3.err --pid-file=node3.pid
root 70524 0.0 0.0 5908 644 pts/1 R+ 10:35 0:00 grep mysqld
root@node3:~# kill -9 69807 70098
This time node3's state changes from ONLINE to UNREACHABLE and then simply stays that way: with two of the three members gone, node1 no longer holds a majority, so the group cannot reach the consensus needed to expel node3 and the cluster blocks.
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1 | ONLINE | PRIMARY |
| node3 | UNREACHABLE | SECONDARY |
+-------------+--------------+-------------+
2 rows in set (0.00 sec)

... after a short wait ...
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1 | ONLINE | PRIMARY |
| node3 | UNREACHABLE | SECONDARY |
+-------------+--------------+-------------+
2 rows in set (0.00 sec)
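Node3 stays UNREACHABLE indefinitely because, by default, a member that loses the majority waits forever rather than giving up. This behaviour is governed by group_replication_unreachable_majority_timeout (default 0, i.e. wait forever); you can check it with:

mysql> select @@group_replication_unreachable_majority_timeout;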
Connect again to the application port 6033 on the ProxySQL node to check whether the cluster still works:
root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '
Admin> use d1
No connection. Trying to reconnect...
Connection id: 28
Current database: *** NONE ***
Database changed
Admin> select database();
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id: 29
Current database: d1
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id: 30
Current database: d1
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
Admin>
As you can see, the cluster is no longer usable.
Connect to the ProxySQL admin port 6032 to check the status of the cluster's member nodes:
root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '
Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname | status |
+--------------+-------------+---------+
| 3 | 10.211.55.6 | SHUNNED |
| 4 | 10.211.55.9 | ONLINE |
| 4 | 10.211.55.4 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.00 sec)
Admin>
As you can see, two of the nodes are in the SHUNNED (offline) state, and the only ONLINE node has been moved to hostgroup 4 (the offline hostgroup), so the whole cluster is currently unusable.
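If you want to see why ProxySQL shunned these nodes, its Group Replication monitor keeps a log table in the monitor schema. Treat the query below as a sketch; the exact columns can differ slightly between ProxySQL versions:

Admin> select hostname, port, viable_candidate, read_only, error from monitor.mysql_server_group_replication_log order by time_start_us desc limit 10;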
To recover, connect to node1 (the only surviving member), force the group membership to contain just node1, and then clear the variable again. Afterwards, check the members on node1:
set global group_replication_force_members='10.211.55.9:33061';
set global group_replication_force_members='';
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1 | ONLINE | PRIMARY |
+-------------+--------------+-------------+
1 row in set (0.00 sec)
mysql>
Connect to the ProxySQL admin port 6032: writer hostgroup 1 is now ONLINE again, but reader hostgroup 3 is still SHUNNED:
root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '
Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname | status |
+--------------+-------------+---------+
| 1 | 10.211.55.9 | ONLINE |
| 4 | 10.211.55.4 | SHUNNED |
| 3 | 10.211.55.6 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.01 sec)
Admin>
As you can see above, node1 has been moved from hostgroup 4 back to hostgroup 1. This move was done automatically by ProxySQL; we did not issue any manual command.
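This automatic movement is driven by ProxySQL's mysql_group_replication_hostgroups configuration. In a setup like this one, hostgroup 1 acts as the writer hostgroup, 3 as the reader hostgroup and 4 as the offline hostgroup, but the exact mapping depends on how the table was populated earlier in this series, so verify it yourself:

Admin> select writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup, active from mysql_group_replication_hostgroups;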
Connect again to the application port 6033 on the ProxySQL node to check whether the cluster works:
root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '
Admin> use d1
No connection. Trying to reconnect...
Connection id: 44
Current database: *** NONE ***
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
Admin> insert into t2 select 2;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0
Admin> select * from t2;
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id: 45
Current database: d1
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id: 46
Current database: d1
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
Admin>
As you can see, the INSERT succeeds, but the other statements fail.
That is because the INSERT is routed to writer hostgroup 1, where node1 is ONLINE and can serve it, while the other statements (the SELECTs) are routed to reader hostgroup 3, which has no healthy node left. In other words, the cluster has only recovered its single write node; it is not yet back to a normal state.
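Which hostgroup a statement goes to is decided by the query rules configured on the ProxySQL admin port. Assuming the usual read/write split rules from earlier in this series (SELECTs to reader hostgroup 3, everything else to writer hostgroup 1), you can confirm them with:

Admin> select rule_id, active, match_digest, destination_hostgroup, apply from mysql_query_rules;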
Now restart MySQL on node2 and rejoin it to the group:
root@node2:~# mysqld_safe --user=mysql &
root@node2:~# mysql -uroot -proot
mysql> start group_replication;
Query OK, 0 rows affected (19.59 sec)
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1 | ONLINE | PRIMARY |
| node2 | ONLINE | SECONDARY |
+-------------+--------------+-------------+
2 rows in set (0.01 sec)
mysql>
Connect to the ProxySQL admin port 6032: writer hostgroup 1 is ONLINE, and in reader hostgroup 3 one node is now ONLINE while the other is still SHUNNED:
root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '
Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname | status |
+--------------+-------------+---------+
| 1 | 10.211.55.9 | ONLINE |
| 3 | 10.211.55.4 | ONLINE |
| 3 | 10.211.55.6 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.01 sec)
Admin>
As you can see above, node2 has been placed into hostgroup 3. Again, this move was done automatically by ProxySQL, not by any manual command.
Connect once more to the application port 6033 on the ProxySQL node to check whether the cluster works:
root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '
Admin> use d1
Admin> show tables;
+--------------+
| Tables_in_d1 |
+--------------+
| t2 |
+--------------+
1 row in set (0.01 sec)
Admin> select * from t2;
+----+
| id |
+----+
| 1 |
| 2 |
+----+
2 rows in set (0.00 sec)
Admin> insert into t2 select 3;
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0
Admin>
As the output above shows, both inserts and queries work again, so the cluster has recovered. Once node3 is added back into the MGR group as well, the cluster will be fully back to its original state.
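Bringing node3 back follows exactly the same steps as node2 (assuming node3's group replication settings are still in place in its my.cnf):

root@node3:~# mysqld_safe --user=mysql &
root@node3:~# mysql -uroot -proot
mysql> start group_replication;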
When a similar failure happens, the following commands are essential; make sure you remember them:
set global group_replication_force_members='10.211.55.9:33061';
set global group_replication_force_members='';
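The value passed to group_replication_force_members is the group communication address (host:port) of each member you want to keep, not the normal client port. You can read it from the surviving node itself; on node1 in this setup it returns 10.211.55.9:33061, which is exactly what we used above:

mysql> select @@group_replication_local_address;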