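Contents
I. Deploying single-primary group replication
1. Install the MGR plugin
2. Prepare the configuration file
3. Restart the primary instance
4. Start group replication
5. Add instances to the group
II. Monitoring group replication
III. Fault-tolerance examples
1. Normal shutdown of a SECONDARY instance
2. Abnormal shutdown of a SECONDARY instance
3. Normal shutdown of the PRIMARY instance
4. Abnormal shutdown of the PRIMARY instance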
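MGR is provided as a MySQL server plugin, and every server in the group must have the plugin configured and installed. This article walks through the detailed steps for setting up group replication across three servers, using three standalone MySQL instances that are already installed. The topology is shown in Figure 1.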
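The MySQL version is 8.0.16. Every server must first be given a distinct hostname, and /etc/hosts must be edited so that the hostnames resolve. The IP address and hostname of each server are:
172.16.1.125 hdp2
172.16.1.126 hdp3
172.16.1.127 hdp4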
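I. Deploying single-primary group replication
In single-primary mode, hdp2 is planned as the PRIMARY, and hdp3 and hdp4 as SECONDARY members.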
1. Install the MGR plugin
Install the MGR plugin in the MySQL instance on hdp2:
mysql> install plugin group_replication soname 'group_replication.so';
Query OK, 0 rows affected (0.01 sec)
Check that the MGR plugin is installed:
mysql> show plugins;
+---------------------------------+----------+--------------------+----------------------+---------+
| Name | Status | Type | Library | License |
+---------------------------------+----------+--------------------+----------------------+---------+
| binlog | ACTIVE | STORAGE ENGINE | NULL | GPL |
(...)
| group_replication | ACTIVE | GROUP REPLICATION | group_replication.so | GPL |
+---------------------------------+----------+--------------------+----------------------+---------+
46 rows in set (0.00 sec)
2. Prepare the configuration file
The configuration file /etc/my.cnf on hdp2 contains:
[mysqld]
server_id=1125
gtid_mode=ON
enforce-gtid-consistency=true
binlog_checksum=NONE
innodb_buffer_pool_size=4G
disabled_storage_engines="MyISAM,BLACKHOLE,FEDERATED,ARCHIVE,MEMORY"
log_bin=binlog
log_slave_updates=ON
binlog_format=ROW
master_info_repository=TABLE
relay_log_info_repository=TABLE
transaction_write_set_extraction=XXHASH64
group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
group_replication_start_on_boot=off
group_replication_local_address= "172.16.1.125:33061"
group_replication_group_seeds= "172.16.1.125:33061,172.16.1.126:33061,172.16.1.127:33061"
group_replication_bootstrap_group=off
Key parameters:
The configuration files on hdp3 and hdp4 differ from hdp2's only in the values of server_id and group_replication_local_address; all other parameters are the same as on hdp2:
# hdp3:
server_id=1126
group_replication_local_address= "172.16.1.126:33061"
# hdp4:
server_id=1127
group_replication_local_address= "172.16.1.127:33061"
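Because the three files differ only in server_id and group_replication_local_address, they can be generated from a single template. The Python sketch below shows one way to do that. The function name render_cnf is hypothetical, and the sketch deliberately emits only the replication-related settings copied from the hdp2 file above, leaving out innodb_buffer_pool_size and disabled_storage_engines.

```python
# Hypothetical helper: render the replication-related part of my.cnf per node.
# Shared settings are copied from the hdp2 file above; only server_id and
# group_replication_local_address vary from node to node.
COMMON_SETTINGS = [
    "gtid_mode=ON",
    "enforce-gtid-consistency=true",
    "binlog_checksum=NONE",
    "log_bin=binlog",
    "log_slave_updates=ON",
    "binlog_format=ROW",
    "master_info_repository=TABLE",
    "relay_log_info_repository=TABLE",
    "transaction_write_set_extraction=XXHASH64",
    'group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"',
    "group_replication_start_on_boot=off",
    "group_replication_bootstrap_group=off",
]


def render_cnf(nodes, gr_port=33061):
    """nodes: list of (server_id, ip) pairs. Returns {ip: my.cnf text}."""
    # Every member lists all members (itself included) as seeds.
    seeds = ",".join(f"{ip}:{gr_port}" for _, ip in nodes)
    configs = {}
    for server_id, ip in nodes:
        lines = [
            "[mysqld]",
            f"server_id={server_id}",
            *COMMON_SETTINGS,
            f'group_replication_local_address="{ip}:{gr_port}"',
            f'group_replication_group_seeds="{seeds}"',
        ]
        configs[ip] = "\n".join(lines) + "\n"
    return configs


if __name__ == "__main__":
    hosts = [(1125, "172.16.1.125"), (1126, "172.16.1.126"), (1127, "172.16.1.127")]
    print(render_cnf(hosts)["172.16.1.126"])
```

Generating the files from one list keeps the shared parameters from drifting apart between members.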
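3. Restart the primary instance
Run the following commands to restart the MySQL instance on hdp2 so that the new configuration takes effect: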
mysqladmin -uroot -p123456 shutdown
mysqld_safe --defaults-file=/etc/my.cnf &
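4. Start group replication
Perform the following steps on hdp2 to start group replication. Group replication uses asynchronous replication to implement distributed recovery, which synchronizes a member's data before it joins the group. Distributed recovery relies on a replication channel named group_replication_recovery, which transfers transactions from a donor to the member that is joining. A replication user with the correct privileges must therefore be set up, so that group replication can establish direct member-to-member recovery channels.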
(1) Create the replication user
create user 'repl'@'%' identified with 'mysql_native_password' by '123456';
grant replication slave on *.* to 'repl'@'%';
(2) Configure the replication channel used for asynchronous replication between a new member and its donor
change master to master_user='repl', master_password='123456' for channel 'group_replication_recovery';
(3) Start group replication
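To start the group, instruct hdp2 to bootstrap it and then start group replication. Bootstrapping must be done by a single server only, the one that starts the group, and only once.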
set global group_replication_bootstrap_group=on;
start group_replication;
set global group_replication_bootstrap_group=off;
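The rule that exactly one server bootstraps the group, and only once, can be captured in a small helper that builds the statement list for a member. This Python sketch is illustrative only: startup_statements is a hypothetical name, it assumes the repl user from step (1) already exists, and it leaves out the reset slave all step that the joining members run later.

```python
def startup_statements(bootstrap, repl_user="repl", repl_password="123456"):
    """Build the SQL a member runs to start group replication.

    bootstrap=True only on the one server that creates the group (hdp2 here),
    and only once; all other members join with bootstrap=False.
    """
    statements = [
        f"change master to master_user='{repl_user}', "
        f"master_password='{repl_password}' "
        "for channel 'group_replication_recovery';",
    ]
    if bootstrap:
        statements += [
            "set global group_replication_bootstrap_group=on;",
            "start group_replication;",
            # Switch the flag off right away so that a later
            # START GROUP_REPLICATION on this server does not
            # bootstrap a second, separate group.
            "set global group_replication_bootstrap_group=off;",
        ]
    else:
        statements.append("start group_replication;")
    return statements
```

Keeping the bootstrap decision in one parameter makes it harder to accidentally run the bootstrap sequence on a joining member.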
(4) Confirm that group replication started successfully
Once the START GROUP_REPLICATION statement returns, the group has started. You can check that the group has now been created and contains one member:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | PRIMARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)
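The information in this table confirms that the group has a member with the unique identifier 8eed0f5b-6f9b-11e9-94a9-005056a57a4e, that the member is ONLINE, and that it runs on hdp2 and listens for client connections on port 3306.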
To show that the server really is in a group and can handle load, create a table and add some content to it.
create database test;
use test;
create table t1(a bigint auto_increment primary key);
Group replication requires every table to have a primary key; otherwise DML on the table fails with:
ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.
Create a stored procedure that runs for a long time:
delimiter //
create procedure p1(a int)
begin
declare i int default 1;
while i<=a do
insert into t1 select null;
set i=i+1;
end while;
end;
//
delimiter ;
-- simulate an online transaction workload
call p1(100000);
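5. Add instances to the group
While the stored procedure from the previous step is still running, perform the following steps to add instances to the group online.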
(1) Copy data online from hdp2 to hdp3 and hdp4
Run the following commands on hdp2; they can be run in parallel in two terminals. For details on using xtrabackup, see https://wxy0327.blog.csdn.net/article/details/90081518#3.%20%E8%81%94%E6%9C%BA.
# copy to hdp3
xtrabackup -uroot -p123456 --socket=/tmp/mysql.sock --no-lock --backup --compress --stream=xbstream --parallel=4 --target-dir=./ | ssh mysql@172.16.1.126 "xbstream -x -C /usr/local/mysql/data/ --decompress"
# copy to hdp4
xtrabackup -uroot -p123456 --socket=/tmp/mysql.sock --no-lock --backup --compress --stream=xbstream --parallel=4 --target-dir=./ | ssh mysql@172.16.1.127 "xbstream -x -C /usr/local/mysql/data/ --decompress"
(2) Prepare the backups (apply the logs) on hdp3 and hdp4
Run the following command on both hdp3 and hdp4:
xtrabackup --prepare --target-dir=/usr/local/mysql/data/
(3) Start the MySQL instances on hdp3 and hdp4
Run the following command on both hdp3 and hdp4:
mysqld_safe --defaults-file=/etc/my.cnf &
(4) Add hdp3 and hdp4 to the group
Run the following SQL statements on both hdp3 and hdp4:
-- reset relay log info (clears existing replication channel metadata)
reset slave all;
-- configure the recovery channel
change master to master_user='repl', master_password='123456' for channel 'group_replication_recovery';
-- join the group
start group_replication;
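After this statement returns, query performance_schema.replication_group_members. MEMBER_STATE is initially RECOVERING, meaning the new server is catching up with the primary. Once it has caught up, MEMBER_STATE changes to ONLINE, and eventually the table shows three ONLINE servers in the group. Note that group members do not execute transactions synchronously; they only converge eventually. More precisely, transactions are delivered to all group members in the same order, but their execution is not synchronized: once a transaction has been accepted for commit, each member commits it at its own pace.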
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5c93a708-a393-11e9-8343-005056a5497f | hdp4 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | PRIMARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
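Because each member commits at its own pace, the TRANSACTIONS_COMMITTED_ALL_MEMBERS column of replication_group_member_stats (shown in the examples later) reports the transactions committed on every member, which is conceptually the intersection of the members' executed GTID sets. The Python sketch below illustrates that idea on GTID set strings in the format MySQL prints. All function names are hypothetical, and the server refreshes the real column at a fixed interval instead of computing it on each query.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: [(start, end), ...]}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid, [])
        for r in ranges:
            start, _, end = r.partition("-")
            intervals.append((int(start), int(end or start)))
    return result


def intersect_intervals(a, b):
    """Intersect two lists of non-overlapping, inclusive (start, end) intervals."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            lo, hi = max(s1, s2), min(e1, e2)
            if lo <= hi:
                out.append((lo, hi))
    return sorted(out)


def committed_on_all(*gtid_sets):
    """GTIDs present in every given set (conceptual TRANSACTIONS_COMMITTED_ALL_MEMBERS)."""
    parsed = [parse_gtid_set(s) for s in gtid_sets]
    common = parsed[0]
    for other in parsed[1:]:
        narrowed = {}
        for uuid, intervals in common.items():
            if uuid in other:
                overlap = intersect_intervals(intervals, other[uuid])
                if overlap:
                    narrowed[uuid] = overlap
        common = narrowed
    return common


def format_gtid_set(gtids):
    """Render {uuid: intervals} back to MySQL's text form (one UUID per line)."""
    parts = []
    for uuid in sorted(gtids):
        ranges = ":".join(f"{s}-{e}" if s != e else str(s) for s, e in gtids[uuid])
        parts.append(f"{uuid}:{ranges}")
    return ",\n".join(parts)
```

For example, intersecting a set ending at aaaaaaaa-...:1-134250 with one ending at aaaaaaaa-...:1-119329 keeps only 1-119329 for that UUID: the slowest member bounds what has been committed everywhere.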
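MEMBER_STATE can take the following values:
+-------------+----------------------------------------------------------------------+-------------------------------+
| Value | Meaning | Synchronized within the group |
+-------------+----------------------------------------------------------------------+-------------------------------+
| ONLINE | The member can serve requests normally | YES |
| RECOVERING | The member is recovering data from other members | YES |
| OFFLINE | The plugin is loaded, but the member does not belong to any group | NO |
| ERROR | An error occurred during recovery or while syncing with other members | NO |
| UNREACHABLE | The member is unreachable; no network communication with it is possible | NO |
+-------------+----------------------------------------------------------------------+-------------------------------+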
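The table shows that only the ONLINE and RECOVERING states are synchronized across the group; here "synchronized" means that querying any member returns the same state. OFFLINE, ERROR and UNREACHABLE are not synchronized, so what one member reports can differ from what the others report, as the cases below show.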
Figure 2 shows the member state transitions.
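When a member joins a group, its state first becomes RECOVERING, meaning it is in the recovery phase. In this phase the member chooses another member of the group as its donor and recovers data from it using conventional asynchronous replication. Once it has fully caught up, its state changes to ONLINE. Throughout this process the other members can also see the node's state, whether RECOVERING or, at the end, ONLINE.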
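If the member runs into a problem while RECOVERING, for example if replication from the chosen donor fails or it fails while catching up with the donor's data, its state becomes ERROR. Note that when queried from the other members at this point, the RECOVERING node has already been removed from the group.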
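In addition, if an ONLINE member loses communication with the other members (because it crashed or the network failed, for example), the other members report its state as UNREACHABLE. If the UNREACHABLE member does not recover within a set timeout, it is removed from the group. How long that timeout is depends on whether the group can remain available without the member. If the group is still available without it, the timeout is very short and the UNREACHABLE state is barely visible. But if losing the member makes the group unavailable immediately, the member stays UNREACHABLE indefinitely.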
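The transitions described in the last three paragraphs (and drawn in Figure 2) can be summarized as a small table-driven model. This Python sketch is a simplification written for illustration, not MySQL's internal state machine; the event names and the REMOVED pseudo-state, meaning "no longer listed by the other members", are assumptions.

```python
# Whether each MEMBER_STATE value is synchronized across the group,
# per the MEMBER_STATE table above.
SYNCED_ACROSS_GROUP = {
    "ONLINE": True,
    "RECOVERING": True,
    "OFFLINE": False,
    "ERROR": False,
    "UNREACHABLE": False,
}

# (current state, event) -> next state. "REMOVED" is not a real MEMBER_STATE.
# There is deliberately no entry for UNREACHABLE when the group has lost its
# majority: in that case the member simply stays UNREACHABLE.
TRANSITIONS = {
    ("OFFLINE", "start group_replication"): "RECOVERING",
    ("RECOVERING", "caught up with donor"): "ONLINE",
    ("RECOVERING", "recovery failed"): "ERROR",
    ("ONLINE", "lost contact"): "UNREACHABLE",
    ("UNREACHABLE", "timeout, majority kept"): "REMOVED",
}


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event!r}") from None


def replay(state, events):
    """Apply a sequence of events and return every state visited."""
    path = [state]
    for event in events:
        state = next_state(state, event)
        path.append(state)
    return path
```

For example, replay("OFFLINE", ["start group_replication", "caught up with donor"]) walks OFFLINE, RECOVERING, ONLINE, which is the normal join path described above.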
Let's verify this with an example. Kill the MySQL instance on hdp4 (note: kill the process rather than shutting the instance down cleanly):
ps -ef | grep mysqld | grep -v grep | awk '{print $2}' | xargs kill -9
Querying from the remaining available members shows that the killed instance has been removed from the group:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | PRIMARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)
Next, kill the MySQL instance on hdp3 as well, and query replication_group_members again:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | UNREACHABLE | SECONDARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | PRIMARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)
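From this point on, the UNREACHABLE state persists. The group no longer has a majority (a group of 2N + 1 members tolerates at most N failures; here only one of the two remaining members is reachable), so it is unavailable: even though a primary member exists, it cannot accept writes. To restore group replication: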
(1) Bootstrap a new replication group on hdp2
stop group_replication;
set global group_replication_bootstrap_group=on;
start group_replication;
set global group_replication_bootstrap_group=off;
(2) Start the MySQL instances on hdp3 and hdp4
mysqld_safe --defaults-file=/etc/my.cnf &
(3) On hdp3 and hdp4, join the new replication group
change master to master_user='repl', master_password='123456' for channel 'group_replication_recovery';
start group_replication;
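The view change events in the binary log now show that two replication groups have been created (two distinct view_id prefixes). The sequence number after the colon in view_id increases by 1 with each view change, here once for each member added.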
[mysql@hdp2/usr/local/mysql/data]$mysqlbinlog binlog.000003 | grep view_id
#190711 12:19:17 server id 1125 end_log_pos 3406788 View_change_log_event: view_id=15628187590651916:2
#190711 12:19:17 server id 1125 end_log_pos 6037781 View_change_log_event: view_id=15628187590651916:3
#190711 14:59:01 server id 1125 end_log_pos 21750924 View_change_log_event: view_id=15628283431266410:1
#190711 14:59:01 server id 1125 end_log_pos 21751315 View_change_log_event: view_id=15628283431266410:2
#190711 14:59:01 server id 1125 end_log_pos 21751706 View_change_log_event: view_id=15628283431266410:3
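Reading the number of groups off this output can be done mechanically: each view_id has the form prefix:sequence, the prefix differs between the two bootstrapped groups, and the sequence grows with each view change. A hypothetical Python helper that groups the lines by prefix:

```python
import re
from collections import defaultdict

# Matches e.g. "View_change_log_event: view_id=15628283431266410:3"
VIEW_ID_RE = re.compile(r"view_id=(\d+):(\d+)")


def views_by_group(lines):
    """Group view-change sequence numbers by the view_id prefix before the colon."""
    views = defaultdict(list)
    for line in lines:
        match = VIEW_ID_RE.search(line)
        if match:
            views[match.group(1)].append(int(match.group(2)))
    return dict(views)
```

Fed the five mysqlbinlog lines above, it returns two prefixes: 15628187590651916 with [2, 3] and 15628283431266410 with [1, 2, 3].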
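II. Monitoring group replication
Unlike traditional master-slave replication, which is monitored with SHOW SLAVE STATUS, group replication is monitored mainly through the following performance_schema tables: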
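The performance_schema.replication_group_members table monitors the status of the server instances that are members of the group. Its contents are updated whenever the view changes, for example when the group configuration changes dynamically because a new member joins. At that point the servers exchange some metadata to synchronize themselves and continue to cooperate. The information is shared among all server instances that are members of the group, so information about all members can be queried from any member. The table gives a high-level view of the state of the replication group.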
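The performance_schema.replication_group_member_stats table provides group-level information related to certification, as well as statistics on the transactions received and originated by each member of the group. This information is likewise shared among all members, so statistics for every member can be queried from any member. Statistics for remote members are refreshed at the message period set by the group_replication_flow_control_period parameter (1 second by default), so they may differ slightly from the locally collected statistics of the member being queried.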
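The columns of this table are important for monitoring the performance of the members connected in the group. Suppose, for example, that one member of the group consistently reports far more transactions in its queue than the other members. This means the member is lagging and cannot keep in sync with the rest of the group. Based on this information you might decide to remove that member from the group, or to delay the processing of transactions on the other members to reduce the number of queued transactions. This information can also help you decide how to tune flow control in the group replication plugin.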
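As a rough illustration of spotting such a lagging member, the following Python sketch compares each member's queue length with the group median. The function name and the thresholds (factor=10, min_queue=100) are arbitrary assumptions, not MySQL defaults; the queue values could come from a column such as COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE or COUNT_TRANSACTIONS_IN_QUEUE in replication_group_member_stats.

```python
from statistics import median


def lagging_members(queue_sizes, factor=10, min_queue=100):
    """Flag members whose queue is far larger than the group's median.

    queue_sizes: {member_host: queued transaction count}
    factor, min_queue: hypothetical thresholds; tune them for your workload.
    """
    typical = median(queue_sizes.values())
    # max(typical, 1) avoids a zero threshold when most queues are empty.
    return sorted(
        host
        for host, queued in queue_sizes.items()
        if queued >= min_queue and queued > factor * max(typical, 1)
    )
```

With illustrative values {"hdp2": 0, "hdp3": 2, "hdp4": 16608} only hdp4 is flagged; comparing against the median rather than a fixed number keeps the check usable when the whole group is busy.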
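performance_schema.replication_connection_status shows information about group replication, such as the transactions that have been received from the group and queued in the applier queue (the relay log). performance_schema.replication_applier_status shows the state of the channels and threads related to group replication. If many different worker threads are applying transactions, the worker table performance_schema.replication_applier_status_by_worker can also be used to monitor what each worker thread is doing.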
The columns of performance_schema.replication_connection_status are as follows:
The columns of performance_schema.replication_applier_status are as follows:
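III. Fault-tolerance examples
Using the three-member group, the following verifies how each of four scenarios affects the whole group: normal shutdown of a SECONDARY instance, abnormal shutdown of a SECONDARY instance, normal shutdown of the PRIMARY instance, and abnormal shutdown of the PRIMARY instance (an abnormal shutdown means killing the mysqld processes with kill -9).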
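With three members, a majority survives in all four scenarios. When a majority cannot be maintained, as in the earlier example where two of the three members crashed, the group can no longer serve reads and writes normally, and an administrator must step in to resolve the problem. That situation clearly falls outside the scope of fault tolerance.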
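The majority rule behind these scenarios can be written down directly. In this Python sketch (the function names are illustrative), group_size is the number of members in the current view. That is why, in the earlier example, killing hdp4 first shrank the view to two members, and losing hdp3 afterwards left one reachable member out of two, which is not a majority.

```python
def majority(group_size):
    """Smallest number of members that is more than half of the view."""
    return group_size // 2 + 1


def tolerated_failures(group_size):
    """Members that can fail at the same time while a majority survives.

    For a group of 2N + 1 members this is N; an even-sized group tolerates
    no more failures than the odd-sized group one member smaller.
    """
    return (group_size - 1) // 2


def has_quorum(group_size, reachable):
    """True if the reachable members still form a majority of the view."""
    return reachable >= majority(group_size)
```

For the three-member group, has_quorum(3, 2) is True, which covers each of the four scenarios below, while has_quorum(2, 1) is False, matching the earlier two-crash example.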
1. Normal shutdown of a SECONDARY instance
(1) Run a long-running transaction workload on the primary
-- run on hdp2
use test;
truncate table t1;
call p1(100000);
(2) Shut down a secondary while the previous step is running
# shut down the MySQL instance on hdp4 (run on hdp4)
mysqladmin -uroot -p123456 shutdown
(3) Check the status of the remaining group members
Check the group members on hdp2:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | PRIMARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)
Check replicated transactions and data on hdp3:
mysql> select * from performance_schema.replication_group_member_stats where member_id='5f045152-a393-11e9-8020-005056a50f77'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:4
MEMBER_ID: 5f045152-a393-11e9-8020-005056a50f77
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 29540
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 10221
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-119329
LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:129549
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 1
COUNT_TRANSACTIONS_REMOTE_APPLIED: 29540
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
1 row in set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 29538 | 29538 |
+--------+--------+----------+
1 row in set (0.00 sec)
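As shown, a normal shutdown of one SECONDARY instance only costs the application one read-only instance. The remaining members are still ONLINE, the primary reads and writes normally, the secondary keeps replicating normally, and no transactions are waiting for conflict detection (COUNT_TRANSACTIONS_IN_QUEUE is 0).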
(4) Recover the instance that was shut down
Start the MySQL instance on hdp4:
mysqld_safe --defaults-file=/etc/my.cnf &
Check the group members on hdp4:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | | | NULL | OFFLINE | | |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)
hdp4 is now OFFLINE and has been removed from the group. Check replicated transactions and data on hdp4:
mysql> select * from performance_schema.replication_group_member_stats where member_id='5c93a708-a393-11e9-8343-005056a5497f'\G
Empty set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 2803 | 2803 |
+--------+--------+----------+
1 row in set (0.02 sec)
As shown, replication_group_member_stats no longer holds any information about this member, and hdp4 had 2803 rows when the instance was stopped.
Add hdp4 back to the group:
change master to master_user='repl', master_password='123456' for channel 'group_replication_recovery';
start group_replication;
Check member status, replicated transactions and data again:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5c93a708-a393-11e9-8343-005056a5497f | hdp4 | 3306 | RECOVERING | SECONDARY | 8.0.16 |
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | PRIMARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
mysql> select * from performance_schema.replication_group_member_stats where member_id='5c93a708-a393-11e9-8343-005056a5497f'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:5
MEMBER_ID: 5c93a708-a393-11e9-8343-005056a5497f
COUNT_TRANSACTIONS_IN_QUEUE: 1
COUNT_TRANSACTIONS_CHECKED: 0
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-134250
LAST_CONFLICT_FREE_TRANSACTION:
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 0
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
1 row in set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 64545 | 64545 |
+--------+--------+----------+
1 row in set (0.01 sec)
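hdp4 starts in the RECOVERING state, meaning it is catching up with the group; once it has caught up, its state changes to ONLINE. So a normal shutdown of one SECONDARY instance has essentially no impact on the group (it just loses a read member). When the instance rejoins the group, the transactions it missed are recovered automatically, and its state switches to ONLINE by itself once it has caught up.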
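The manual re-checking of replication_group_members can be automated by polling until the rejoining member reports ONLINE. In this Python sketch, fetch_state is a placeholder for your own client code that runs a query such as select member_state from performance_schema.replication_group_members where member_host='hdp4'; the timeout and polling interval are arbitrary assumptions. Treating ERROR as a hard failure follows the description of the RECOVERING phase earlier in this article.

```python
import time


def wait_until_online(fetch_state, timeout=600.0, interval=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll MEMBER_STATE until it is ONLINE.

    fetch_state: zero-argument callable returning the member's current
    MEMBER_STATE string (hypothetical; supply your own query wrapper).
    Returns True once ONLINE, False if `timeout` seconds pass first, and
    raises RuntimeError if the member reports ERROR.
    """
    deadline = clock() + timeout
    while True:
        state = fetch_state()
        if state == "ONLINE":
            return True
        if state == "ERROR":
            raise RuntimeError("member entered ERROR during distributed recovery")
        if clock() >= deadline:
            return False
        # Still RECOVERING (or not listed yet): wait and check again.
        sleep(interval)
```

The sleep and clock parameters are injectable so that the loop can be exercised without a running server.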
2. Abnormal shutdown of a SECONDARY instance
(1) Run a long-running transaction workload on the primary
-- run on hdp2
use test;
truncate table t1;
call p1(100000);
(2) Kill a secondary while the previous step is running
# kill the MySQL processes (mysqld_safe and mysqld) on hdp4 (run on hdp4)
ps -ef | grep mysqld | grep -v grep | awk '{print $2}' | xargs kill -9
(3) Check the status of the remaining group members
Check the group members on hdp2:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | PRIMARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)
Check replicated transactions and data on hdp3:
mysql> select * from performance_schema.replication_group_member_stats where member_id='5f045152-a393-11e9-8020-005056a50f77'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:6
MEMBER_ID: 5f045152-a393-11e9-8020-005056a50f77
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 106387
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 6385
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-200011
LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:206397
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 106389
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
1 row in set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 6385 | 6385 |
+--------+--------+----------+
1 row in set (0.00 sec)
As shown, an abnormal shutdown of one SECONDARY instance affects the group no differently from a normal shutdown.
(4) Recover the instance that was shut down
Start the MySQL instance on hdp4:
mysqld_safe --defaults-file=/etc/my.cnf &
Check the group members on hdp4:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | | | NULL | OFFLINE | | |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.01 sec)
hdp4 is now OFFLINE and has been removed from the group. Check replicated transactions and data on hdp4:
mysql> select * from performance_schema.replication_group_member_stats where member_id='5c93a708-a393-11e9-8343-005056a5497f'\G
Empty set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 1759 | 1759 |
+--------+--------+----------+
1 row in set (0.02 sec)
As shown, replication_group_member_stats no longer holds any information about this member, and hdp4 had 1759 rows when the instance was stopped.
Add hdp4 back to the group:
change master to master_user='repl', master_password='123456' for channel 'group_replication_recovery';
start group_replication;
Check member status, replicated transactions and data again:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5c93a708-a393-11e9-8343-005056a5497f | hdp4 | 3306 | RECOVERING | SECONDARY | 8.0.16 |
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | PRIMARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
mysql> select * from performance_schema.replication_group_member_stats where member_id='5c93a708-a393-11e9-8343-005056a5497f'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:7
MEMBER_ID: 5c93a708-a393-11e9-8343-005056a5497f
COUNT_TRANSACTIONS_IN_QUEUE: 16608
COUNT_TRANSACTIONS_CHECKED: 0
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-230667
LAST_CONFLICT_FREE_TRANSACTION:
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 0
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
1 row in set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 38350 | 38350 |
+--------+--------+----------+
1 row in set (0.00 sec)
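hdp4 is in the RECOVERING state, meaning it is catching up with the group; once it has caught up, its state changes to ONLINE. So for a SECONDARY instance, a normal shutdown and an abnormal shutdown have exactly the same impact on the group and follow the same recovery process. The only addition is crash recovery of the MySQL instance itself, which runs automatically when hdp4 restarts and is completely transparent to users.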
3. Normal shutdown of the PRIMARY instance
(1) Run a long-running transaction workload on the primary
-- run on hdp2
use test;
truncate table t1;
call p1(100000);
(2) Shut down the primary while the previous step is running
# shut down the MySQL instance on hdp2 (run on hdp2)
mysqladmin -uroot -p123456 shutdown
(3) Check the status of the remaining group members
Check the group members on hdp3 and hdp4:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5c93a708-a393-11e9-8343-005056a5497f | hdp4 | 3306 | ONLINE | PRIMARY | 8.0.16 |
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | SECONDARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.01 sec)
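As shown, when the PRIMARY instance shuts down normally, the remaining members stay ONLINE and one SECONDARY (hdp4 in this example) is automatically promoted to PRIMARY. Check replicated transactions and data on hdp3 and hdp4: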
mysql> select * from performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:8
MEMBER_ID: 5c93a708-a393-11e9-8343-005056a5497f
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 987
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 1
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-301000
LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:301000
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 987
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 2. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:8
MEMBER_ID: 5f045152-a393-11e9-8020-005056a50f77
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 200989
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 1
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-301000
LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:301000
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 200992
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
2 rows in set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 986 | 986 |
+--------+--------+----------+
1 row in set (0.00 sec)
(4) Recover the instance that was shut down
Start the MySQL instance on hdp2:
mysqld_safe --defaults-file=/etc/my.cnf &
Check the group members on hdp2:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | | | NULL | OFFLINE | | |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.01 sec)
hdp2 is now OFFLINE and has been removed from the group. Check replicated transactions and data on hdp2:
mysql> select * from performance_schema.replication_group_member_stats where member_id='8eed0f5b-6f9b-11e9-94a9-005056a57a4e'\G
Empty set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 986 | 986 |
+--------+--------+----------+
1 row in set (0.01 sec)
As shown, replication_group_member_stats no longer holds any information about this member, and hdp2 had 986 rows when the instance was stopped.
Add hdp2 back to the group:
change master to master_user='repl', master_password='123456' for channel 'group_replication_recovery';
start group_replication;
Check member status, replicated transactions and data again:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5c93a708-a393-11e9-8343-005056a5497f | hdp4 | 3306 | ONLINE | PRIMARY | 8.0.16 |
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | SECONDARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
mysql> select * from performance_schema.replication_group_member_stats where member_id='8eed0f5b-6f9b-11e9-94a9-005056a57a4e'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:9
MEMBER_ID: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 0
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-301001
LAST_CONFLICT_FREE_TRANSACTION:
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 0
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
1 row in set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 986 | 986 |
+--------+--------+----------+
1 row in set (0.00 sec)
hdp2 is ONLINE, but its role has changed to SECONDARY.
hdp2 is now a read-only member:
mysql> show variables like '%read_only%';
+-----------------------+-------+
| Variable_name | Value |
+-----------------------+-------+
| innodb_read_only | OFF |
| read_only | ON |
| super_read_only | ON |
| transaction_read_only | OFF |
+-----------------------+-------+
4 rows in set (0.00 sec)
while hdp4 has become the read-write member (the primary):
mysql> show variables like '%read_only%';
+-----------------------+-------+
| Variable_name | Value |
+-----------------------+-------+
| innodb_read_only | OFF |
| read_only | OFF |
| super_read_only | OFF |
| transaction_read_only | OFF |
+-----------------------+-------+
4 rows in set (0.01 sec)
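So when the PRIMARY instance shuts down normally, group replication promotes a SECONDARY to be the new PRIMARY, and the old PRIMARY goes ONLINE immediately after rejoining the group, in the SECONDARY role. All of this happens automatically; the application only needs to reconnect to the new PRIMARY to continue running read-write transactions.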
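Reconnecting to the new PRIMARY starts with finding it. A minimal Python sketch, assuming the rows have already been fetched from replication_group_members on any reachable member; the tuple layout and the function name find_primary are assumptions.

```python
def find_primary(members):
    """Return (host, port) of the ONLINE PRIMARY, or None if there is none.

    members: rows of (MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE)
    read from performance_schema.replication_group_members.
    """
    for host, port, state, role in members:
        if state == "ONLINE" and role == "PRIMARY":
            return host, port
    return None
```

Given the three rows shown above after hdp2 rejoined, it returns ("hdp4", 3306). As the earlier two-crash example showed, a member can still be listed as ONLINE and PRIMARY after the group has lost its majority and yet refuse writes, so a client should still be prepared to handle write errors after reconnecting.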
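4. Abnormal shutdown of the PRIMARY instance
(1) Run a long-running transaction workload on the primary
-- run on hdp4 (the current primary)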
use test;
truncate table t1;
call p1(100000);
(2) Kill the primary while the previous step is running
# kill the MySQL processes (mysqld_safe and mysqld) on hdp4 (run on hdp4)
ps -ef | grep mysqld | grep -v grep | awk '{print $2}' | xargs kill -9
(3) Check the status of the remaining group members
Check the group members on hdp2 and hdp3:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | PRIMARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | SECONDARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.01 sec)
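As shown, the outcome is the same as a normal shutdown of the PRIMARY: the remaining members stay ONLINE, and one SECONDARY (hdp3 in this example) is automatically promoted to PRIMARY. Check replicated transactions and data on hdp2 and hdp3: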
mysql> select * from performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:10
MEMBER_ID: 5f045152-a393-11e9-8020-005056a50f77
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 1227
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 1
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-303809
LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:303809
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 1227
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 2. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:10
MEMBER_ID: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 2806
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 1
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-303809
LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:303809
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 2808
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
2 rows in set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 1226 | 1226 |
+--------+--------+----------+
1 row in set (0.00 sec)
(4) Recover the instance that was shut down
Start the MySQL instance on hdp4:
mysqld_safe --defaults-file=/etc/my.cnf &
Check the group members on hdp4:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | | | NULL | OFFLINE | | |
+---------------------------+-----------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)
hdp4 is now OFFLINE and has been removed from the group. Check replicated transactions and data on hdp4:
mysql> select * from performance_schema.replication_group_member_stats where member_id='5c93a708-a393-11e9-8343-005056a5497f'\G
Empty set (0.01 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 1225 | 1225 |
+--------+--------+----------+
1 row in set (0.01 sec)
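As shown, replication_group_member_stats no longer holds any information about this member, and hdp4 had 1225 rows when the instance was stopped. The former PRIMARY even has one row fewer than the new PRIMARY, which shows that each instance's commit progress is driven by the instance itself, not by the group.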
Add hdp4 back to the group:
change master to master_user='repl', master_password='123456' for channel 'group_replication_recovery';
start group_replication;
Check member status, replicated transactions and data again:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 5c93a708-a393-11e9-8343-005056a5497f | hdp4 | 3306 | ONLINE | SECONDARY | 8.0.16 |
| group_replication_applier | 5f045152-a393-11e9-8020-005056a50f77 | hdp3 | 3306 | ONLINE | PRIMARY | 8.0.16 |
| group_replication_applier | 8eed0f5b-6f9b-11e9-94a9-005056a57a4e | hdp2 | 3306 | ONLINE | SECONDARY | 8.0.16 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
mysql> select * from performance_schema.replication_group_member_stats where member_id='5c93a708-a393-11e9-8343-005056a5497f'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15628283431266410:11
MEMBER_ID: 5c93a708-a393-11e9-8343-005056a5497f
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 0
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 1
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8eed0f5b-6f9b-11e9-94a9-005056a57a4e:1-2,
aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-303809
LAST_CONFLICT_FREE_TRANSACTION:
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
COUNT_TRANSACTIONS_REMOTE_APPLIED: 0
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
1 row in set (0.00 sec)
mysql> select min(a),max(a),count(*) from test.t1;
+--------+--------+----------+
| min(a) | max(a) | count(*) |
+--------+--------+----------+
| 1 | 1226 | 1226 |
+--------+--------+----------+
1 row in set (0.00 sec)
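hdp4 is ONLINE, its role has changed to SECONDARY, and its data now matches the other members. For group replication and for the application, a normal shutdown and an abnormal shutdown of the PRIMARY instance make no difference. The conclusion: as long as a majority of the members remain available, fault-tolerance behaviors such as member data recovery and the switching of primary and secondary roles are fully automatic in group replication, and the application only needs to redirect its connection to the primary when necessary.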