问题背景:
单主MGR集群,主节点在服务器意外重启之后,无法重新加入现有MGR集群,报错误[MY-011526] [Repl] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions:
问题分析:
原本MGR集群为3节点的单主集群,主节点为192.168.1.100
select * from replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 355ae888-2eba-11ee-baeb-000c2986fd67 | 192.168.1.201 | 3306 | ONLINE | SECONDARY | 8.0.22 |
| group_replication_applier | 66d57e9a-2eb9-11ee-b56a-000c290bcc56 | 192.168.1.100 | 3306 | ONLINE | PRIMARY | 8.0.22 |
| group_replication_applier | 6c695395-2eb9-11ee-8cbc-000c29186152 | 192.168.1.200 | 3306 | ONLINE | SECONDARY | 8.0.22 |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
MGR集群在服务器意外重启之后,主节点切换为192.168.1.201
select * from replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 355ae888-2eba-11ee-baeb-000c2986fd67 | 192.168.1.201 | 3306 | ONLINE | PRIMARY | 8.0.22 |
| group_replication_applier | 6c695395-2eb9-11ee-8cbc-000c29186152 | 192.168.1.200 | 3306 | ONLINE | SECONDARY | 8.0.22 |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
查看报错的信息,可以看到节点无法加入的原因为本地的GTID大于MGR集群的GTID,本地实例的server_uuid为:66d57e9a-2eb9-11ee-b56a-000c290bcc56,MGR集群的group_replication_group_name为:1db95f8f-5ba5-11ea-ace5-000c29a9919b
所以这里存在问题的GTID为节点自己server_uuid的GTID:66d57e9a-2eb9-11ee-b56a-000c290bcc56:1-12比MGR集群GTID:66d57e9a-2eb9-11ee-b56a-000c290bcc56:1-11大
[MY-011526] [Repl] Plugin group_replication reported: 'This member has more executed transactions than those present in the group.
Local transactions: 1db95f8f-5ba5-11ea-ace5-000c29a9919b:1-28,
38c1dc4f-2ebb-11ee-8e75-000c290bcc56:1-5,
66d57e9a-2eb9-11ee-b56a-000c290bcc56:1-12
>
Group transactions: 1db95f8f-5ba5-11ea-ace5-000c29a9919b:1-29,
38c1dc4f-2ebb-11ee-8e75-000c290bcc56:1-5,
66d57e9a-2eb9-11ee-b56a-000c290bcc56:1-11'
正常MGR集群的从节点会被设置为super_read_only超级自读,主节点的GTID只更新group_replication_group_name的GTID:1db95f8f-5ba5-11ea-ace5-000c29a9919b,节点自己的server_uuid不会更新,除非当前节点已经被剔除或者不是主节点并且人为关闭只读去执行了操作,才会导致节点自己的server_uuid发生更新
通过查看当前的进程列表发现存在应用的连接会话,与应用进一步沟通确认,由于应用的程序无法识别MGR集群的主节点切换也没有通过中间件代理类似MySQL Router去访问MGR集群,导致MGR集群发生了主节点切换,应用依然还是访问旧的主节点,旧节点被写入数据,导致旧节点的本地GTID发生了更新
[email protected] 18:36: [(none)]>show processlist;
+----+-----------------+---------------------+------+---------+------+------------------------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+---------------------+------+---------+------+------------------------+------------------+
| 6 | event_scheduler | localhost | NULL | Daemon | 2428 | Waiting on empty queue | NULL |
| 46 | root | localhost | NULL | Query | 0 | init | show processlist |
| 83 | app_user | 192.168.2.10:36068 | NULL | Sleep | 3 | | NULL |
+----+-----------------+---------------------+------+---------+------+------------------------+------------------+
问题修复:
1 跟应用一起确认以哪个节点为主节点,然后以该节点为主节点重建MGR集群,由于切换后应用写入的还是旧主节点192.168.1.100,所以以192.168.1.100为主节点
2 获取MGR集群与旧主节点的差异数据,如果有差异数据,需要将数据交给应用去分析是否回插入旧主节点192.168.1.100
--根据旧主节点的gtid进行过滤,获取差异数据
mysqlbinlog -vv --base64-output=decode-rows --executed-gtids='1db95f8f-5ba5-11ea-ace5-000c29a9919b:1-28,
38c1dc4f-2ebb-11ee-8e75-000c290bcc56:1-5,
66d57e9a-2eb9-11ee-b56a-000c290bcc56:1-12' /opt/mysql/log/binlog.000001 > /tmp/binlog_1_gtid.sql;
3 对每个节点数据进行备份,然后以节点192.168.1.100为主节点开始重建MGR集群
mysqldump -uroot -p -S /tmp/mysql.sock --master-data=2 --single-transaction -A --events --triggers --routines --default-character-set=utf8 > mysqlall.sql
4 全部节点停止MGR组复制
stop group_replication;
5 在主节点192.168.1.100引导初始化MGR组,作为主节点
set global group_replication_bootstrap_group=on;
start group_replication;
[email protected] 21:07: [test2]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 66d57e9a-2eb9-11ee-b56a-000c290bcc56 | 192.168.1.100 | 3306 | ONLINE | PRIMARY | 8.0.22 |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)
[email protected] 21:07: [test2]>
重建方式一:通过修改从节点gtid的方式直接重建MGR集群,主从节点可能存在数据差异
修改从节点192.168.1.200/201的GTID,重新加入集群
--手动设置从节点的gtid,确保从节点的gtid再新MGR集群的范围以内
---group_replication_group_name的GTID设置为主节点的初始gtid值
---主节点server_uuid保持原值
reset master;
set global gtid_purged='1db95f8f-5ba5-11ea-ace5-000c29a9919b:1-28,38c1dc4f-2ebb-11ee-8e75-000c290bcc56:1-5,66d57e9a-2eb9-11ee-b56a-000c290bcc56:1-11';
--加入MGR集群
start group_replication;
MGR集群重建完成
[email protected] 21:20: [test2]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 355ae888-2eba-11ee-baeb-000c2986fd67 | 192.168.1.201 | 3306 | ONLINE | SECONDARY | 8.0.22 |
| group_replication_applier | 66d57e9a-2eb9-11ee-b56a-000c290bcc56 | 192.168.1.100 | 3306 | ONLINE | PRIMARY | 8.0.22 |
| group_replication_applier | 6c695395-2eb9-11ee-8cbc-000c29186152 | 192.168.1.200 | 3306 | ONLINE | SECONDARY | 8.0.22 |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
重建方式二:通过克隆方式重建两个从节点,主从节点数据完成一致,需要数据库版本8.0.17+
主从节点检查克隆插件
--确认安装了克隆插件
[email protected] 21:21: [test2]>SELECT PLUGIN_NAME, PLUGIN_STATUS FROM INFORMATION_SCHEMA.PLUGINS WHERE PLUGIN_NAME = 'clone';
+-------------+---------------+
| PLUGIN_NAME | PLUGIN_STATUS |
+-------------+---------------+
| clone | ACTIVE |
+-------------+---------------+
--安装克隆插件
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
从节点依次执行克隆操作
--配置克隆源端,为主节点
SET GLOBAL clone_valid_donor_list = '192.168.1.100:3306';
--执行克隆
CLONE INSTANCE FROM 'rpl_user'@'192.168.1.100':3306 IDENTIFIED BY '123456';
--查看克隆进度,会先删除数据,再进行拷贝
[email protected] 21:38: [(none)]>SELECT STAGE, STATE, END_TIME FROM performance_schema.clone_progress;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 29
Current database: *** NONE ***
+-----------+-----------+----------------------------+
| STAGE | STATE | END_TIME |
+-----------+-----------+----------------------------+
| DROP DATA | Completed | 2023-08-05 21:36:20.706373 |
| FILE COPY | Completed | 2023-08-05 21:37:56.385411 |
| PAGE COPY | Completed | 2023-08-05 21:37:56.798611 |
| REDO COPY | Completed | 2023-08-05 21:37:57.106462 |
| FILE SYNC | Completed | 2023-08-05 21:38:16.364659 |
| RESTART | Completed | 2023-08-05 21:38:20.678312 |
| RECOVERY | Completed | 2023-08-05 21:38:22.450064 |
+-----------+-----------+----------------------------+
7 rows in set (0.01 sec)
[email protected] 21:39: [(none)]>
--克隆完成之后,从节点会自己加入集群
[email protected] 21:40: [(none)]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 66d57e9a-2eb9-11ee-b56a-000c290bcc56 | 192.168.1.100 | 3306 | ONLINE | PRIMARY | 8.0.22 |
| group_replication_applier | 6c695395-2eb9-11ee-8cbc-000c29186152 | 192.168.1.200 | 3306 | ONLINE | SECONDARY | 8.0.22 |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)
[email protected] 21:40: [(none)]>
MGR集群重建完成
[email protected] 21:43: [(none)]>select * from performance_schema.replication_group_members;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 29
Current database: *** NONE ***
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 355ae888-2eba-11ee-baeb-000c2986fd67 | 192.168.1.201 | 3306 | ONLINE | SECONDARY | 8.0.22 |
| group_replication_applier | 66d57e9a-2eb9-11ee-b56a-000c290bcc56 | 192.168.1.100 | 3306 | ONLINE | PRIMARY | 8.0.22 |
| group_replication_applier | 6c695395-2eb9-11ee-8cbc-000c29186152 | 192.168.1.200 | 3306 | ONLINE | SECONDARY | 8.0.22 |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+
3 rows in set (0.01 sec)
[email protected] 21:48: [(none)]>
重建方式三:通过手动备份的方式mysqldump,xtrabackup重建
备份主节点的数据
mysqldump -uroot -p -S /tmp/mysql.sock --master-data=2 --single-transaction -A --events --triggers --routines --default-character-set=utf8 > mysqlall.sql
恢复两个从节点的数据
--清空恢复从节点的gtid信息以及binlog
reset master
--使用主节点的备份恢复数据
source mysqlall.sql
将从节点加入MGR集群,完成重建MGR
start group_replication;
重建MGR完成之后,关闭主节点的引导参数
set global group_replication_bootstrap_group=off;