- Make sure all cluster machines are in the same subnet and can reach each other over the network; otherwise setup will fail.
- Use hostnames consistently in the configuration; update the hosts file on every machine.
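A minimal /etc/hosts sketch; the names and addresses below are illustrative and must match your machines:
# append to /etc/hosts on every cluster machine
192.168.1.101   host-192-168-1-101
192.168.1.102   host-192-168-1-102
192.168.1.103   host-192-168-1-103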
- Error:
ERROR:
Group Replication join failed.
ERROR: Error joining instance to cluster: 'host-192-168-1-101:3306' - Query failed. MySQL Error (3092): The server is not configured properly to be an active member of the group. Please see more details on error log.. Query: START group_replication
Handle it as follows:
mysql> install plugin group_replication soname 'group_replication.so'; ## install the plugin
mysql> set global group_replication_allow_local_disjoint_gtids_join=ON;
mysql> START GROUP_REPLICATION;
mysql> select * from performance_schema.replication_group_members;
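Note: group_replication_allow_local_disjoint_gtids_join was deprecated in MySQL 5.7.21 and removed in MySQL 8.0.4, so on recent versions skip that SET statement.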
- Error:
Dba.getCluster: This function is not available through a session to a standalone instance (RuntimeError)
This means the cluster's primary node is no longer on this machine; find the current primary, connect there, and retry.
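A query along these lines locates the current primary (the MEMBER_ROLE column exists in MySQL 8.0; on 5.7, check the group_replication_primary_member status variable instead):
mysql> SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_ROLE FROM performance_schema.replication_group_members;
Then point mysql-shell at the reported primary and call dba.getCluster() again.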
- Error:
Dba.getCluster: Unable to get cluster. The instance 'host-192-168-1-101:3306'
may belong to a different ReplicaSet as the one registered in the Metadata since the value of 'group_replication_group_name'
does not match the one registered in the ReplicaSet's Metadata: possible split-brain scenario. Please connect to another member of the ReplicaSet to get the Cluster. (RuntimeError)
This is the most damaging error, caused by data divergence between master and slave; there is no way around it but to start over:
mysql-js> dba.dropMetadataSchema();
- Make sure every table in the cluster has a primary key; otherwise the cluster will break (Group Replication rejects writes to tables without a primary key).
- Install cluster monitoring so that crashed machines are brought back promptly; if all nodes are ever down at the same time, you have a disaster on your hands.
- How to reset the InnoDB Cluster environment
On the primary node:
mysql-js> dba.dropMetadataSchema(); ## log in with mysql-shell and clear the cluster metadata
mysql> stop group_replication;
mysql> reset master; ## clear the binlogs, making sure nothing conflicts with the replicas' tables
mysql> reset slave;
On the other nodes (mainly to clear the replication state against the primary, again so that nothing conflicts between primary and replicas):
mysql> stop group_replication;
mysql> reset master;
mysql> reset slave;
- Hostname does not match the name in /etc/hosts
Error:
[Repl] Slave I/O for channel 'group_replication_recovery': error connecting to master 'mysql_innodb_cluster_r0430970923@mysql3:3306' - retry-time: 60 retries: 1, Error_code: MY-002005
[ERROR] [MY-011582] [Repl] Plugin group_replication reported: 'There was an error when connecting to the donor server. Please check that group_replication_recovery channel credentials and all MEMBER_HOST column values of performance_schema.replication_group_members table are correct and DNS resolvable.'
[ERROR] [MY-011583] [Repl] Plugin group_replication reported: 'For details please check performance_schema.replication_connection_status table and error log messages of Slave I/O for channel group_replication_recovery.'
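The fix is to make the hostname and /etc/hosts agree on every node (the name mysql3 below is illustrative, taken from the log above):
# hostname                    ## should print mysql3
# grep mysql3 /etc/hosts      ## must map mysql3 to the node's real IP, not 127.0.0.1
If renaming the host is not an option, setting report_host in my.cnf to a resolvable name also works:
[mysqld]
report_host = mysql3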
- The primary's log is stuck at some position and cannot be applied on the replica
Error:
[ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000007' position 151
[ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Error executing row event: 'Unknown database 'mysql_innodb_cluster_metadata'', Error_code: MY-001049
Rebuild the master:
mysql> stop group_replication;
mysql> reset master;
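After this reset the cluster typically has to be recreated from mysql-shell on that node and the members re-added (cluster name and account below are illustrative):
mysql-js> var cluster = dba.createCluster('myCluster');
mysql-js> cluster.addInstance('root@host-192-168-1-102:3306');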
- Error:
[ERROR] Slave SQL for channel 'group_replication_recovery': Could not execute Write_rows event on table mysql_innodb_cluster_metadata.instances;
Cannot add or update a child row: a foreign key constraint fails (mysql_innodb_cluster_metadata.instances, CONSTRAINT instances_ibfk_1 FOREIGN KEY (host_id) REFERENCES hosts (host_id)),
Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log binlog.000001, end_log_pos 3059, Error_code: 1452
Fix: empty the mysql_innodb_cluster_metadata.hosts table, then rebuild the cluster.
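A sketch of emptying the metadata tables; foreign_key_checks is disabled temporarily because instances references hosts, and both are cleared since the cluster is rebuilt anyway:
mysql> SET SESSION foreign_key_checks = 0;
mysql> TRUNCATE TABLE mysql_innodb_cluster_metadata.instances;
mysql> TRUNCATE TABLE mysql_innodb_cluster_metadata.hosts;
mysql> SET SESSION foreign_key_checks = 1;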
- Error:
This member has more executed transactions than those present in the group
Fix:
mysql> stop group_replication;
mysql> reset master;
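Before discarding the extra transactions with reset master, it is worth confirming the divergence by comparing the executed GTID set on this member against an online member:
mysql> SELECT @@GLOBAL.gtid_executed;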
- OS resource limits on the mysql user
[Warning] Buffered warning: Changed limits: max_open_files: 1024 (requested 5000)
[Warning] Buffered warning: Changed limits: table_open_cache: 431 (requested 2000)
Fix (note that the soft nofile limit must be at least what mysqld requests, so raise it too):
# vim /etc/security/limits.conf  ## add the lines below
mysql soft nproc 2047
mysql hard nproc 16384
mysql soft nofile 65535
mysql hard nofile 65535
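Note that limits.conf only applies to login sessions; if mysqld is managed by systemd, set the limit in the service unit instead (the unit name varies by distribution):
# systemctl edit mysqld
[Service]
LimitNOFILE = 65535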
- Error:
dba.rebootClusterFromCompleteOutage: The active session instance isn't the most updated in comparison with the ONLINE instances of the
Cluster's metadata.
While the cluster was down, tables on some machines were modified, leaving the data inconsistent.
Fix:
Clear the binlogs on every MySQL machine with reset master:
mysql> reset master;
mysql> show master logs;
Then run dba.rebootClusterFromCompleteOutage() again to restart the cluster.
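If the divergent data must be preserved rather than wiped, an alternative is to compare GTID sets, reconnect to the instance holding the superset, and reboot from there (cluster name illustrative):
mysql> SELECT @@GLOBAL.gtid_executed;  ## run on every node, pick the most complete set
mysql-js> dba.rebootClusterFromCompleteOutage('myCluster');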
- service mysql restart cannot restart MySQL; mysqld is stuck and keeps logging '[Note] Plugin group_replication reported: '[GCS] cli_err 2''
Fix: the only command that stops MySQL in this state is:
# pkill -9 mysqld
- How to switch from Multi-Primary to Single-Primary?
a) Dissolve the existing cluster: mysql-js> cluster.dissolve({force: true})
b) On every host, adjust the MySQL settings as follows:
mysql> set global group_replication_enforce_update_everywhere_checks=OFF;
mysql> set global group_replication_single_primary_mode=ON;
c) Recreate the cluster:
mysql-js> var cluster = dba.createCluster('mysqlCluster');
mysql-js> cluster.addInstance('chianyu@svr2:3306');
mysql-js> cluster.addInstance('chianyu@svr3:3306');
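Note: MySQL Shell 8.0.13 and later can switch the topology directly, without dissolving the cluster:
mysql-js> cluster.switchToSinglePrimaryMode()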
- Limitations of Group Replication
    - Missing lock information in conflict detection: locks are not shared across the group, so Group Replication recommends the READ COMMITTED transaction isolation level (see the configuration sketch right after this list).
    - SERIALIZABLE isolation level: not supported in multi-primary mode.
    - Concurrent DDL and DML: in multi-primary mode, running DDL on a table on one instance while updating the same table on another is not supported; the conflict with the DDL risks going undetected on the other instances.
    - Cascading foreign key constraints: in multi-primary mode, multi-level foreign key dependencies trigger multi-level operations and can therefore cause undetected conflicts; enabling group_replication_enforce_update_everywhere_checks=ON is recommended.
    - Large transactions: a transaction left uncommitted for more than 5 seconds can cause group communication to fail.
    - In multi-primary mode, SELECT ... FOR UPDATE can deadlock, because the lock is not shared across the whole group.
    - Replication filters are not supported: filtering transactions under Group Replication leaves the group's transaction set inconsistent.
    - MySQL 8.0.11: group_replication_enforce_update_everywhere_checks=ON is not supported in multi-primary mode.
    - If a node executes statements while replication is stopped and is then restarted, its local transactions prevent it from rejoining the cluster; a global reset master is needed to restart group replication from scratch.
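Where these recommendations map onto configuration, a minimal my.cnf sketch (values illustrative):
[mysqld]
transaction_isolation = READ-COMMITTED                    ## recommended isolation level for Group Replication
group_replication_enforce_update_everywhere_checks = ON   ## cascading-FK safety in multi-primary mode; note the 8.0.11 caveat above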
- Do not use port 3306 in a multi-instance environment
In a multi-instance environment, an instance running on the default port 3306 invites routine operations being fired at the wrong instance.
Deploy at most 10 instances per host.
For example:
cluster node A runs three instances on ports 3310, 3320, 3330
cluster node B runs three instances on ports 3310, 3320, 3330
cluster node C runs three instances on ports 3310, 3320, 3330
with instance data directories /data/mysql3310, /data/mysql3320, /data/mysql3330
management nodes D and E each run three MySQL Router instances on ports 3310, 3320, 3330
with instance data directories /data/router3310, /data/router3320, /data/router3330
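A minimal per-instance my.cnf sketch for such a layout (ports and paths mirror the example above; adjust to taste):
[mysqld]
port = 3310
datadir = /data/mysql3310
socket = /data/mysql3310/mysql.sock
Repeat with 3320/3330 for the other instances on the host.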
- Split-brain scenario
When some nodes in the cluster become UNREACHABLE, the cluster can no longer reach quorum and ends up in the state below: only one active node remains, it can only serve queries, and any write will hang.
mysql-js> cluster.status();
{
"clusterName": "myCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node-3:3305",
"status": "NO_QUORUM",
"statusText": "Cluster has no quorum as visible from 'node-1:3305' and cannot process write transactions. 2 members are not active",
"topology": {
"node-1:3305": {
"address": "node-1:3305",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
"node-2:3305": {
"address": "node-2:3305",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
},
"node-3:3305": {
"address": "node-3:3305",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "UNREACHABLE"
}
}
}
}
To repair this state, run forceQuorumUsingPartitionOf with the currently active node (if several are active, pick the primary node); the active node can then serve both reads and writes, after which the remaining nodes are added back into the cluster.
mysql-js> cluster.forceQuorumUsingPartitionOf('root@node-1:3305')
Restoring replicaset 'default' from loss of quorum, by using the partition composed of [node-1:3305]
Please provide the password for 'root@node-1:3305':
Restoring the InnoDB cluster ...
The InnoDB cluster was successfully restored using the partition from the instance 'root@node-1:3305'.
WARNING: To avoid a split-brain scenario, ensure that all other members of the replicaset are removed or joined back to the group that was restored.
mysql-js> cluster.status();
{
"clusterName": "myCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "nt-metra-1:3305",
"status": "OK_NO_TOLERANCE",
"statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
"topology": {
"nt-metra-1:3305": {
"address": "nt-metra-1:3305",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
"nt-metra-2:3305": {
"address": "nt-metra-2:3305",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
},
"nt-metra-3:3305": {
"address": "nt-metra-3:3305",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
}
}
}
}
# Depending on the situation, rejoin the instance:
mysql-js> cluster.rejoinInstance('root@node-2:3305');
# or remove the faulty node and add it back:
mysql-js> cluster.removeInstance('root@node-2:3305');
mysql-js> cluster.addInstance('root@node-2:3305');