The configured cluster members map hostnames to IP addresses when they communicate, so it is best to define those mappings in /etc/hosts. If they are missing, you will run into errors such as the following:
2018-05-02T13:04:32.437256Z 10 [Note] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='myserver01', master_port= 24801, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='myserver01', master_port= 24801, master_log_file='', master_log_pos= 4, master_bind=''.
2018-05-02T13:04:32.449673Z 10 [Note] Plugin group_replication reported: 'Establishing connection to a group replication recovery donor 595937c0-4d9c-11e8-a819-00163e06ea60 at myserver01 port: 24801.'
2018-05-02T13:04:32.450108Z 14 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2018-05-02T13:04:32.450878Z 14 [ERROR] Slave I/O for channel 'group_replication_recovery': error connecting to master 'rpl_user@myserver01:24801' - retry-time: 60 retries: 1, Error_code: 2005
2018-05-02T13:04:32.450887Z 14 [Note] Slave I/O thread for channel 'group_replication_recovery' killed while connecting to master
2018-05-02T13:04:32.450890Z 14 [Note] Slave I/O thread exiting for channel 'group_replication_recovery', read up to log 'FIRST', position 4
2018-05-02T13:04:32.451012Z 10 [ERROR] Plugin group_replication reported: 'There was an error when connecting to the donor server. Please check that group_replication_recovery channel credentials and all MEMBER_HOST column values of performance_schema.replication_group_members table are correct and DNS resolvable.'
2018-05-02T13:04:32.451020Z 10 [ERROR] Plugin group_replication reported: 'For details please check performance_schema.replication_connection_status table and error log messages of Slave I/O for channel group_replication_recovery.'
2018-05-02T13:04:32.451185Z 10 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 3/10'
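For reference, a minimal /etc/hosts sketch; the hostname myserver01 is taken from the log above, while the IP address is purely illustrative:
# /etc/hosts
127.0.0.1      localhost
192.168.1.11   myserver01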
In the member view, the state stays stuck in RECOVERING:
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 29d2ae7b-4de8-11e8-a27e-00163e06ea60 | myserver01 | 24802 | RECOVERING |
| group_replication_applier | 595937c0-4d9c-11e8-a819-00163e06ea60 | myserver01 | 24801 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)
To check the current member states of the group and which instance is the primary node:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 29d2ae7b-4de8-11e8-a27e-00163e06ea60 | myserver01 | 24802 | ONLINE |
| group_replication_applier | 2fdfc55d-4de8-11e8-a3af-00163e06ea60 | myserver01 | 24803 | ONLINE |
| group_replication_applier | 595937c0-4d9c-11e8-a819-00163e06ea60 | myserver01 | 24801 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)
mysql> show global status like '%group_replication_primary%';
+----------------------------------+--------------------------------------+
| Variable_name | Value |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 595937c0-4d9c-11e8-a819-00163e06ea60 |
+----------------------------------+--------------------------------------+
1 row in set (0.01 sec)
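If you want the primary's host and port directly instead of its UUID, a sketch of a query joining the status value against the member table (assuming the performance_schema.global_status table is available, as it is with the default show_compatibility_56=OFF in 5.7):
mysql> select MEMBER_HOST, MEMBER_PORT from performance_schema.replication_group_members where MEMBER_ID = (select VARIABLE_VALUE from performance_schema.global_status where VARIABLE_NAME = 'group_replication_primary_member');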
To check whether the current instance has replication lag or other performance problems (judged by queue depth):
mysql> select * from performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15252658123489942:3
MEMBER_ID: 2fdfc55d-4de8-11e8-a3af-00163e06ea60
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 0
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
TRANSACTIONS_COMMITTED_ALL_MEMBERS: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-6
LAST_CONFLICT_FREE_TRANSACTION:
1 row in set (0.00 sec)
The COUNT_TRANSACTIONS_IN_QUEUE value shows how many transactions are waiting to be processed.
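If you only care about the queue depth, a narrower query (column names as in the output above):
mysql> select MEMBER_ID, COUNT_TRANSACTIONS_IN_QUEUE from performance_schema.replication_group_member_stats;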
In single-primary mode (the default), only the primary is read-write; all other members are set to read-only, with both read_only and super_read_only set to ON, i.e., super-read-only mode:
mysql> show variables like '%read_only%';
+-----------------------+-------+
| Variable_name | Value |
+-----------------------+-------+
| innodb_read_only | OFF |
| read_only | ON |
| super_read_only | ON |
| transaction_read_only | OFF |
| tx_read_only | OFF |
+-----------------------+-------+
5 rows in set (0.00 sec)
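For illustration, a write attempted on a non-primary member is rejected because of super_read_only; the table test.t1 here is hypothetical and the exact error text may vary by version:
mysql> insert into test.t1 values (1);
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement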
When a node needs to go through recovery, you can adjust the number of retries and the interval between them:
mysql> show global variables like '%group_replication_recovery_retry%';
+----------------------------------------+-------+
| Variable_name | Value |
+----------------------------------------+-------+
| group_replication_recovery_retry_count | 10 |
+----------------------------------------+-------+
1 row in set (0.00 sec)
mysql> show global variables like '%group_replication_recovery_reconnect%';
+-----------------------------------------------+-------+
| Variable_name | Value |
+-----------------------------------------------+-------+
| group_replication_recovery_reconnect_interval | 60 |
+-----------------------------------------------+-------+
1 row in set (0.01 sec)
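Both variables are dynamic, so they can be changed at runtime; a sketch with illustrative values:
mysql> set global group_replication_recovery_retry_count = 20;
mysql> set global group_replication_recovery_reconnect_interval = 30;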
Distinguish between a failure and a node leaving the group voluntarily. When a node leaves voluntarily, the group's membership size is adjusted, so nodes leaving this way do not push the remaining member count below half of the original planned size, which would cause quorum loss and force the nodes into read-only. Let's test the voluntary-exit scenario. Step one: take node 3 out of the group (STOP GROUP_REPLICATION, or a normal database shutdown):
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 29d2ae7b-4de8-11e8-a27e-00163e06ea60 | myserver01 | 24802 | ONLINE |
| group_replication_applier | 2fdfc55d-4de8-11e8-a3af-00163e06ea60 | myserver01 | 24803 | ONLINE |
| group_replication_applier | 595937c0-4d9c-11e8-a819-00163e06ea60 | myserver01 | 24801 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)
mysql> stop group_replication;
Query OK, 0 rows affected (9.51 sec)
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 2fdfc55d-4de8-11e8-a3af-00163e06ea60 | myserver01 | 24803 | OFFLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)
You can see that in a three-node group, after node 3 leaves it only sees itself, in the OFFLINE state. Observing from the other nodes:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 29d2ae7b-4de8-11e8-a27e-00163e06ea60 | myserver01 | 24802 | ONLINE |
| group_replication_applier | 595937c0-4d9c-11e8-a819-00163e06ea60 | myserver01 | 24801 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)
You can see that node 3 has been removed from the online member list, so quorum is not affected. Once maintenance on node 3 is finished and group replication is started again, it rejoins and the group membership is reconfigured.
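To bring node 3 back after maintenance, group replication is simply started again on it (assuming the group_replication_recovery channel credentials configured earlier are still in place):
mysql> start group_replication;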
If instead there is a network or machine failure, the membership is not adjusted automatically and the member's state becomes UNREACHABLE. The ERROR state appears when a member has joined the group but its recovery has failed.
MySQL 5.7.20 added the parameter group_replication_member_weight: in single-primary mode, when a primary election takes place, the member with the higher weight is preferred as the new primary.
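group_replication_member_weight ranges from 0 to 100 with a default of 50 and is a dynamic variable; a sketch with an illustrative value:
mysql> set global group_replication_member_weight = 70;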
MySQL 5.7.19 added the parameter group_replication_transaction_size_limit, which caps the allowed transaction size to keep an oversized transaction from breaking group synchronization. Its maximum value is 2147483647 (2GB), which exceeds the 1GB limit of max_allowed_packet, so it is best to set it below 1GB and rule out from the start the case where a transaction larger than 1GB breaks replication on the other members.
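Following the advice above, a sketch that caps transactions at 150MB (the value is illustrative; the variable is dynamic):
mysql> set global group_replication_transaction_size_limit = 157286400;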
Basic requirements for using group replication (a configuration sketch covering these settings follows the list):
1. Only the InnoDB engine may be used, because transactions must be rolled back on conflict; other engines may lead to data inconsistency.
2. Every table must have an explicitly defined primary key, which is used for conflict detection.
3. GTIDs must be enabled; GTIDs are used to track each transaction across all members.
4. binlog must use ROW format to keep data consistent across members.
5. Use the READ COMMITTED isolation level to avoid gap locks.
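A minimal my.cnf sketch reflecting the requirements above for MySQL 5.7; server_id and the log file name are illustrative, requirement 2 is enforced per table rather than here, and the group-specific options (group name, local address, seeds) set up earlier are omitted:
[mysqld]
server_id = 1                            # illustrative
default_storage_engine = InnoDB          # requirement 1
gtid_mode = ON                           # requirement 3
enforce_gtid_consistency = ON            # requirement 3
log_bin = binlog                         # requirement 4
binlog_format = ROW                      # requirement 4
log_slave_updates = ON                   # members must binlog applied changes
binlog_checksum = NONE                   # required by group replication in 5.7
master_info_repository = TABLE           # required by group replication
relay_log_info_repository = TABLE        # required by group replication
transaction-isolation = READ-COMMITTED   # requirement 5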