- Make sure all cluster machines are in the same subnet and can reach each other over the network; otherwise setup will fail.
- Use hostnames consistently in the configuration; update the hosts file on every machine.
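A minimal /etc/hosts sketch; the names and addresses below are illustrative and must match your machines:
# append to /etc/hosts on every cluster machine
192.168.1.101   host-192-168-1-101
192.168.1.102   host-192-168-1-102
192.168.1.103   host-192-168-1-103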
- Error:
ERROR:
Group Replication join failed.
ERROR: Error joining instance to cluster: 'host-192-168-1-101:3306' - Query failed. MySQL Error (3092): The server is not configured properly to be an active member of the group. Please see more details on error log.. Query: START group_replication
Handle it as follows:
mysql> install plugin group_replication soname 'group_replication.so'; ## install the plugin
mysql> set global group_replication_allow_local_disjoint_gtids_join=ON;
mysql> START GROUP_REPLICATION;
mysql> select * from performance_schema.replication_group_members;
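Note: group_replication_allow_local_disjoint_gtids_join was deprecated in MySQL 5.7.21 and removed in MySQL 8.0.4, so on recent versions skip that SET statement.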
- Error:
Dba.getCluster: This function is not available through a session to a standalone instance (RuntimeError)
This means the cluster's primary node is no longer on this machine; find the current primary, connect there, and retry.
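A query along these lines locates the current primary (the MEMBER_ROLE column exists in MySQL 8.0; on 5.7, check the group_replication_primary_member status variable instead):
mysql> SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_ROLE FROM performance_schema.replication_group_members;
Then point mysql-shell at the reported primary and call dba.getCluster() again.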
- Error:
Dba.getCluster: Unable to get cluster. The instance 'host-192-168-1-101:3306'
may belong to a different ReplicaSet as the one registered in the Metadata since the value of 'group_replication_group_name'
does not match the one registered in the ReplicaSet's Metadata: possible split-brain scenario. Please connect to another member of the ReplicaSet to get the Cluster. (RuntimeError)
This is the most damaging error, caused by data divergence between master and slave; there is no way around it but to start over:
mysql-js> dba.dropMetadataSchema();
- Make sure every table in the cluster has a primary key; otherwise the cluster will break (Group Replication rejects writes to tables without a primary key).
- Install cluster monitoring so that crashed machines are brought back promptly; if all nodes are ever down at the same time, you have a disaster on your hands.
- How to reset the InnoDB Cluster environment
On the primary node:
mysql-js> dba.dropMetadataSchema(); ## log in with mysql-shell and clear the cluster metadata
mysql> stop group_replication;
mysql> reset master; ## clear the binlogs, making sure nothing conflicts with the replicas' tables
mysql> reset slave;
On the other nodes (mainly to clear the replication state against the primary, again so that nothing conflicts between primary and replicas):
mysql> stop group_replication;
mysql> reset master;
mysql> reset slave;
- Hostname does not match the name in /etc/hosts
Error:
[Repl] Slave I/O for channel 'group_replication_recovery': error connecting to master 'mysql_innodb_cluster_r0430970923@mysql3:3306' - retry-time: 60 retries: 1, Error_code: MY-002005
[ERROR] [MY-011582] [Repl] Plugin group_replication reported: 'There was an error when connecting to the donor server. Please check that group_replication_recovery channel credentials and all MEMBER_HOST column values of performance_schema.replication_group_members table are correct and DNS resolvable.'
[ERROR] [MY-011583] [Repl] Plugin group_replication reported: 'For details please check performance_schema.replication_connection_status table and error log messages of Slave I/O for channel group_replication_recovery.'
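The fix is to make the hostname and /etc/hosts agree on every node (the name mysql3 below is illustrative, taken from the log above):
# hostname                    ## should print mysql3
# grep mysql3 /etc/hosts      ## must map mysql3 to the node's real IP, not 127.0.0.1
If renaming the host is not an option, setting report_host in my.cnf to a resolvable name also works:
[mysqld]
report_host = mysql3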
- The primary's log is stuck at some position and cannot be applied on the replica
Error:
[ERROR] [MY-010586] [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000007' position 151
[ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Error executing row event: 'Unknown database 'mysql_innodb_cluster_metadata'', Error_code: MY-001049
Rebuild the master:
mysql> stop group_replication;
mysql> reset master;
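After this reset the cluster typically has to be recreated from mysql-shell on that node and the members re-added (cluster name and account below are illustrative):
mysql-js> var cluster = dba.createCluster('myCluster');
mysql-js> cluster.addInstance('root@host-192-168-1-102:3306');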
- Error:
[ERROR] Slave SQL for channel 'group_replication_recovery': Could not execute Write_rows event on table mysql_innodb_cluster_metadata.instances;
Cannot add or update a child row: a foreign key constraint fails (mysql_innodb_cluster_metadata.instances, CONSTRAINT instances_ibfk_1 FOREIGN KEY (host_id) REFERENCES hosts (host_id)),
Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log binlog.000001, end_log_pos 3059, Error_code: 1452
Fix: empty the mysql_innodb_cluster_metadata.hosts table, then rebuild the cluster.
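A sketch of emptying the metadata tables; foreign_key_checks is disabled temporarily because instances references hosts, and both are cleared since the cluster is rebuilt anyway:
mysql> SET SESSION foreign_key_checks = 0;
mysql> TRUNCATE TABLE mysql_innodb_cluster_metadata.instances;
mysql> TRUNCATE TABLE mysql_innodb_cluster_metadata.hosts;
mysql> SET SESSION foreign_key_checks = 1;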
- Error:
This member has more executed transactions than those present in the group
Fix:
mysql> stop group_replication;
mysql> reset master;
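Before discarding the extra transactions with reset master, it is worth confirming the divergence by comparing the executed GTID set on this member against an online member:
mysql> SELECT @@GLOBAL.gtid_executed;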
- OS resource limits on the mysql user
[Warning] Buffered warning: Changed limits: max_open_files: 1024 (requested 5000)
[Warning] Buffered warning: Changed limits: table_open_cache: 431 (requested 2000)
Fix (note that the soft nofile limit must be at least what mysqld requests, so raise it too):
# vim /etc/security/limits.conf  ## add the lines below
mysql soft nproc 2047
mysql hard nproc 16384
mysql soft nofile 65535
mysql hard nofile 65535
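Note that limits.conf only applies to login sessions; if mysqld is managed by systemd, set the limit in the service unit instead (the unit name varies by distribution):
# systemctl edit mysqld
[Service]
LimitNOFILE = 65535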
- Error:
dba.rebootClusterFromCompleteOutage: The active session instance isn't the most updated in comparison with the ONLINE instances of the
Cluster's metadata.
While the cluster was down, tables on some machines were modified, leaving the data inconsistent.
Fix:
Clear the binlogs on every MySQL machine with reset master:
mysql> reset master;
mysql> show master logs;
Then run dba.rebootClusterFromCompleteOutage() again to restart the cluster.
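If the divergent data must be preserved rather than wiped, an alternative is to compare GTID sets, reconnect to the instance holding the superset, and reboot from there (cluster name illustrative):
mysql> SELECT @@GLOBAL.gtid_executed;  ## run on every node, pick the most complete set
mysql-js> dba.rebootClusterFromCompleteOutage('myCluster');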
- service mysql restart cannot restart MySQL; mysqld is stuck and keeps logging '[Note] Plugin group_replication reported: '[GCS] cli_err 2''
Fix: the only command that stops MySQL in this state is:
# pkill -9 mysqld
- How to switch from Multi-Primary to Single-Primary?
a) Dissolve the existing cluster: mysql-js> cluster.dissolve({force: true})
b) On every host, adjust the MySQL settings as follows:
mysql> set global group_replication_enforce_update_everywhere_checks=OFF;
mysql> set global group_replication_single_primary_mode=ON;
c) Recreate the cluster:
mysql-js> var cluster = dba.createCluster('mysqlCluster');
mysql-js> cluster.addInstance('chianyu@svr2:3306');
mysql-js> cluster.addInstance('chianyu@svr3:3306');
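Note: MySQL Shell 8.0.13 and later can switch the topology directly, without dissolving the cluster:
mysql-js> cluster.switchToSinglePrimaryMode()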
- Limitations of Group Replication
    - Missing lock information in conflict detection: locks are not shared across the group, so Group Replication recommends the READ COMMITTED transaction isolation level (see the configuration sketch right after this list).
    - SERIALIZABLE isolation level: not supported in multi-primary mode.
    - Concurrent DDL and DML: in multi-primary mode, running DDL on a table on one instance while updating the same table on another is not supported; the conflict with the DDL risks going undetected on the other instances.
    - Cascading foreign key constraints: in multi-primary mode, multi-level foreign key dependencies trigger multi-level operations and can therefore cause undetected conflicts; enabling group_replication_enforce_update_everywhere_checks=ON is recommended.
    - Large transactions: a transaction left uncommitted for more than 5 seconds can cause group communication to fail.
    - In multi-primary mode, SELECT ... FOR UPDATE can deadlock, because the lock is not shared across the whole group.
    - Replication filters are not supported: filtering transactions under Group Replication leaves the group's transaction set inconsistent.
    - MySQL 8.0.11: group_replication_enforce_update_everywhere_checks=ON is not supported in multi-primary mode.
    - If a node executes statements while replication is stopped and is then restarted, its local transactions prevent it from rejoining the cluster; a global reset master is needed to restart group replication from scratch.
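Where these recommendations map onto configuration, a minimal my.cnf sketch (values illustrative):
[mysqld]
transaction_isolation = READ-COMMITTED                    ## recommended isolation level for Group Replication
group_replication_enforce_update_everywhere_checks = ON   ## cascading-FK safety in multi-primary mode; note the 8.0.11 caveat above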
- Do not use port 3306 in a multi-instance environment
In a multi-instance environment, an instance running on the default port 3306 invites routine operations being fired at the wrong instance.
Deploy at most 10 instances per host.
For example:
cluster node A runs three instances on ports 3310, 3320, 3330
cluster node B runs three instances on ports 3310, 3320, 3330
cluster node C runs three instances on ports 3310, 3320, 3330
with instance data directories /data/mysql3310, /data/mysql3320, /data/mysql3330
management nodes D and E each run three MySQL Router instances on ports 3310, 3320, 3330
with instance data directories /data/router3310, /data/router3320, /data/router3330
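A minimal per-instance my.cnf sketch for such a layout (ports and paths mirror the example above; adjust to taste):
[mysqld]
port = 3310
datadir = /data/mysql3310
socket = /data/mysql3310/mysql.sock
Repeat with 3320/3330 for the other instances on the host.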
- Split-brain scenario
When some nodes in the cluster become UNREACHABLE, the cluster can no longer reach quorum and ends up in the state below: only one active node remains, it can only serve queries, and any write will hang.
mysql-js> cluster.status();
{
"clusterName": "myCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node-3:3305",
"status": "NO_QUORUM",
"statusText": "Cluster has no quorum as visible from 'node-1:3305' and cannot process write transactions. 2 members are not active",
"topology": {
"node-1:3305": {
"address": "node-1:3305",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
"node-2:3305": {
"address": "node-2:3305",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
},
"node-3:3305": {
"address": "node-3:3305",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "UNREACHABLE"
}
}
}
}
To repair this state, run forceQuorumUsingPartitionOf with the currently active node (if several are active, pick the primary node); the active node can then serve both reads and writes, after which the remaining nodes are added back into the cluster.
mysql-js> cluster.forceQuorumUsingPartitionOf('root@node-1:3305')
Restoring replicaset 'default' from loss of quorum, by using the partition composed of [node-1:3305]
Please provide the password for 'root@node-1:3305':
Restoring the InnoDB cluster ...
The InnoDB cluster was successfully restored using the partition from the instance 'root@node-1:3305'.
WARNING: To avoid a split-brain scenario, ensure that all other members of the replicaset are removed or joined back to the group that was restored.
mysql-js> cluster.status();
{
"clusterName": "myCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "nt-metra-1:3305",
"status": "OK_NO_TOLERANCE",
"statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
"topology": {
"nt-metra-1:3305": {
"address": "nt-metra-1:3305",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
"nt-metra-2:3305": {
"address": "nt-metra-2:3305",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
},
"nt-metra-3:3305": {
"address": "nt-metra-3:3305",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
}
}
}
}
# Depending on the situation, rejoin the instance:
mysql-js> cluster.rejoinInstance('root@node-2:3305');
# or remove the faulty node and add it back:
mysql-js> cluster.removeInstance('root@node-2:3305');
mysql-js> cluster.addInstance('root@node-2:3305');