MySQL InnoDB Cluster is MySQL's recently GA'd high-availability solution. Built on MySQL Group Replication together with MySQL Shell and MySQL Router, it makes assembling a robust HA setup straightforward.
MySQL Shell is the new MySQL client. It supports both the X Protocol and the classic MySQL protocol, is scriptable in JavaScript and Python, and serves as the administration tool for building an InnoDB Cluster.
MySQL Router is a routing middleware that gives applications transparent failover when accessing the cluster.
![InnoDB Cluster overview](https://jeremyxu2010.github.io/images/20190526/innodb_cluster_overview.png)
The diagram above shows the architecture clearly: MySQL Shell is used to configure a MySQL InnoDB Cluster with automatic failover, and later to monitor and administer it. MySQL Router hides the cluster's internals from the application layer, so applications access the cluster over the ordinary MySQL protocol.
MySQL Group Replication is the recently GA'd synchronous replication mechanism on which the cluster is built.
Three virtual machines are prepared: mysql0, mysql1 and mysql2, with IPs 192.168.7.30, 192.168.7.31 and 192.168.7.32 respectively.
The first step is to install the mysql-community-server, mysql-shell and mysql-router packages on all three virtual machines.
# Configure the MySQL yum repository
$ yum install -y https://dev.mysql.com/get/mysql57-community-release-el7-8.noarch.rpm
# Install the packages
$ yum install -y mysql-community-server mysql-shell mysql-router
To make sure the three machines can resolve each other's names, either configure DNS or append the hostname mappings to the hosts file.
$ cat << EOF >> /etc/hosts
192.168.7.30 mysql0
192.168.7.31 mysql1
192.168.7.32 mysql2
EOF
For convenience in the following steps, change the MySQL root password on all three virtual machines.
# First obtain the initial root password
$ systemctl start mysqld
$ ORIGINAL_ROOT_PASSWORD=$(awk '/temporary password/{print $NF}' /var/log/mysqld.log)
# Change the root password to R00T@mysql, which satisfies the complexity policy
$ MYSQL_PWD="$ORIGINAL_ROOT_PASSWORD" mysql --connect-expired-password -e "ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'R00T@mysql';"
# Also allow root to log in from other hosts
$ MYSQL_PWD="R00T@mysql" mysql --connect-expired-password -e "CREATE USER 'root'@'%' IDENTIFIED WITH mysql_native_password BY 'R00T@mysql';"
$ MYSQL_PWD="R00T@mysql" mysql --connect-expired-password -e "GRANT ALL ON *.* TO 'root'@'%' WITH GRANT OPTION; FLUSH PRIVILEGES;"
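The awk one-liner above can be exercised on its own against a sample log line (the timestamp and password below are made up for illustration):

```shell
# A made-up sample of the line mysqld writes to /var/log/mysqld.log
line='2019-05-26T02:13:57.554435Z 1 [Note] A temporary password is generated for root@localhost: p%s3kQ!GqXja'
# awk prints the last whitespace-separated field ($NF) of any matching line
echo "$line" | awk '/temporary password/{print $NF}'
# prints: p%s3kQ!GqXja
```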
MySQL InnoDB Cluster is built on Group Replication, and configuring Group Replication starts with dba.configureLocalInstance, which sets the required parameters on each machine's local instance and persists them.
# mysqlsh makes configuring the local instance easy
$ cat << EOF > config_local_instance.js
dba.configureLocalInstance('root@localhost:3306', {'password': 'R00T@mysql', 'interactive': false, 'mycnfPath': '/etc/my.cnf'})
EOF
$ mysqlsh --no-password --js --file=config_local_instance.js
# The changes take effect only after a restart
$ systemctl restart mysqld
# Then verify the local instance configuration
$ cat << EOF > check_instance_configuration.js
dba.checkInstanceConfiguration('root@localhost:3306', {'password': 'R00T@mysql', 'interactive': false})
EOF
$ mysqlsh --no-password --js --file=check_instance_configuration.js
First, open ports 3306 and 33061 (the latter is used for Group Replication communication when forming the cluster) in the firewall on all nodes.
firewall-cmd --zone=public --add-port=3306/tcp --permanent
firewall-cmd --zone=public --add-port=33061/tcp --permanent
firewall-cmd --reload
# Verify the ports were added
firewall-cmd --zone=public --list-ports
Then simply run the following on the mysql0 virtual machine:
$ cat << EOF > init_cluster.js
shell.connect('root@localhost:3306', 'R00T@mysql')
dba.createCluster('mycluster', {'localAddress': 'mysql0'})
var cluster=dba.getCluster('mycluster')
cluster.addInstance('root@mysql1:3306', {'localAddress': 'mysql1', 'password': 'R00T@mysql'})
cluster.addInstance('root@mysql2:3306', {'localAddress': 'mysql2', 'password': 'R00T@mysql'})
EOF
$ mysqlsh --no-password --js --file=init_cluster.js
Checking the cluster status in mysqlsh produces output similar to:
$ mysqlsh
> shell.connect('root@localhost:3306', 'R00T@mysql')
> var cluster=dba.getCluster('mycluster')
> cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "mysql0:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"mysql0:3306": {
"address": "mysql0:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
"mysql1:3306": {
"address": "mysql1:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
"mysql2:3306": {
"address": "mysql2:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "mysql0:3306"
}
As shown, mysql0 is the read-write primary (R/W) and the remaining nodes are read-only (R/O).
To simulate a failure of the mysql0 node, stop the mysqld service on mysql0: systemctl stop mysqld.
Then check the cluster status on the mysql1 node:
...
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
"topology": {
"mysql0:3306": {
"address": "mysql0:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003 (HY000): Can't connect to MySQL server on 'mysql0' (111)",
"status": "(MISSING)"
},
"mysql1:3306": {
"address": "mysql1:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
...
mysql1 has been promoted to the writable primary, while mysql2 remains read-only.
Now start the mysqld service on mysql0 again: systemctl start mysqld.
A node recovered from failure must be rejoined with rejoinInstance:
$ mysqlsh
> shell.connect('root@localhost:3306', 'R00T@mysql')
> var cluster=dba.getCluster('mycluster')
> cluster.status()
...
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active",
"topology": {
"mysql0:3306": {
"address": "mysql0:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)"
},
...
> cluster.rejoinInstance('root@mysql0:3306', {'password': 'R00T@mysql'})
> cluster.status()
...
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"mysql0:3306": {
"address": "mysql0:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
"mysql1:3306": {
"address": "mysql1:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
},
...
mysql0 has rejoined the cluster, now as a read-only secondary.
If every node in the cluster goes offline, fetching the cluster object fails outright; the cluster then has to be restored as follows.
> var cluster=dba.getCluster('mycluster')
Dba.getCluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active) (RuntimeError)
Running rebootClusterFromCompleteOutage restores the cluster (below it is executed on mysql2):
> dba.rebootClusterFromCompleteOutage('mycluster')
Restoring the cluster 'mycluster' from complete outage...
The instance 'mysql0:3306' was part of the cluster configuration.
Would you like to rejoin it to the cluster? [y/N]: y
The instance 'mysql1:3306' was part of the cluster configuration.
Would you like to rejoin it to the cluster? [y/N]: y
NOTE: Group Replication will communicate with other members using 'mysql2:33061'. Use the localAddress option to override.
...
When some members are in the UNREACHABLE state, the cluster may be unable to reach a decision, producing the situation below: only one active node remains, it can serve reads only, and any write attempt hangs.
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "mysql2:3306",
"ssl": "DISABLED",
"status": "NO_QUORUM",
"statusText": "Cluster has no quorum as visible from 'mysql2:3306' and cannot process write transactions. 2 members are not active",
"topology": {
"mysql0:3306": {
"address": "mysql0:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003 (HY000): Can't connect to MySQL server on 'mysql0' (111)",
"status": "(MISSING)"
},
"mysql1:3306": {
"address": "mysql1:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003 (HY000): Can't connect to MySQL server on 'mysql1' (110)",
"status": "UNREACHABLE"
},
"mysql2:3306": {
"address": "mysql2:3306",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
}
},
...
To recover from this state, run forceQuorumUsingPartitionOf against the currently active node (if several are active, pick the primary). The active node then accepts both reads and writes again, after which the other nodes can be brought back into the cluster.
> cluster.forceQuorumUsingPartitionOf('root@mysql2:3306')
Restoring cluster 'mycluster' from loss of quorum, by using the partition composed of [mysql2:3306]
Restoring the InnoDB cluster ...
Please provide the password for 'root@mysql2:3306': **********
Save password for 'root@mysql2:3306'? [Y]es/[N]o/Ne[v]er (default No): y
The InnoDB cluster was successfully restored using the partition from the instance 'root@mysql2:3306'.
WARNING: To avoid a split-brain scenario, ensure that all other members of the cluster are removed or joined back to the group that was restored.
> cluster.status()
...
"status": "OK_NO_TOLERANCE",
"statusText": "Cluster is NOT tolerant to any failures. 2 members are not active",
"topology": {
"mysql0:3306": {
"address": "mysql0:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003 (HY000): Can't connect to MySQL server on 'mysql0' (111)",
"status": "(MISSING)"
},
"mysql1:3306": {
"address": "mysql1:3306",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003 (HY000): Can't connect to MySQL server on 'mysql1' (113)",
"status": "(MISSING)"
},
"mysql2:3306": {
"address": "mysql2:3306",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE"
}
...
The failed nodes' state has changed from UNREACHABLE to (MISSING), and the cluster state from NO_QUORUM to OK_NO_TOLERANCE. Writes are now possible again.
Once the failures are repaired, rejoin the nodes:
> cluster.rejoinInstance('root@mysql0:3306', {'password': 'R00T@mysql'})
> cluster.rejoinInstance('root@mysql1:3306', {'password': 'R00T@mysql'})
# A node that was UNREACHABLE may fail to rejoin here, which looks like a bug; removing it first and then adding it back works around the problem
> cluster.removeInstance('root@mysql1:3306', {'password': 'R00T@mysql', 'force': true})
> cluster.addInstance('root@mysql1:3306', {'localAddress': 'mysql1', 'password': 'R00T@mysql'})
First, open ports 6446 and 6447 (MySQL Router's default read-write and read-only ports) in the firewall on all nodes.
firewall-cmd --zone=public --add-port=6446/tcp --permanent
firewall-cmd --zone=public --add-port=6447/tcp --permanent
firewall-cmd --reload
# Verify the ports were added
firewall-cmd --zone=public --list-ports
To hide the cluster's internals from the application layer, mysql-router can additionally be deployed on all three virtual machines.
# Bootstrap the mysql-router configuration from the current cluster; note the password R00T@mysql is percent-encoded as R00T%40mysql
$ mysqlrouter --bootstrap root:R00T%40mysql@mysql2:3306 --user=mysqlrouter
# Enable and restart the mysqlrouter service
$ systemctl enable mysqlrouter
$ systemctl restart mysqlrouter
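Bootstrapping writes the routing sections into /etc/mysqlrouter/mysqlrouter.conf. A sketch of what the generated routing entries typically look like (section names, ports and strategies vary across Router versions, so treat the identifiers below as assumptions):

```ini
; Sketch of bootstrap-generated routing sections (names are illustrative)
[routing:mycluster_rw]
bind_address=0.0.0.0
bind_port=6446
destinations=metadata-cache://mycluster/default?role=PRIMARY
routing_strategy=first-available

[routing:mycluster_ro]
bind_address=0.0.0.0
bind_port=6447
destinations=metadata-cache://mycluster/default?role=SECONDARY
routing_strategy=round-robin-with-fallback
```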
During mysqlrouter --bootstrap, Router fetches the cluster metadata from the instance given in the URI; if that instance is read-only, Router automatically locates the read-write node.
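The percent-encoding of the password can be produced with standard tools; this sketch only replaces '@' (a general-purpose URI encoder would cover all reserved characters):

```shell
# '@' is the user/host separator in the URI, so inside the password it must appear as %40
printf '%s\n' 'R00T@mysql' | sed 's/@/%40/g'
# prints: R00T%40mysql
```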
keepalived is used to make the Router entry point highly available. Install keepalived:
yum install keepalived
# Back up the configuration file
cp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.bak
Edit the configuration file /etc/keepalived/keepalived.conf:
# MASTER node
global_defs {
router_id MYSQL_ROUTER # same ID on every node
vrrp_skip_check_adv_addr
vrrp_strict
vrrp_garp_interval 0
vrrp_gna_interval 0
}
vrrp_script check_mysqlrouter {
script "/usr/bin/killall -0 /usr/bin/mysqlrouter" # check that mysqlrouter is running
interval 2
weight 2
fall 2
}
vrrp_instance VI_1 {
state MASTER # primary node
interface ens192 # NIC to bind the VIP to
virtual_router_id 33 # virtual router ID, identical on all nodes
priority 102 # higher value means higher priority
advert_int 1 # advertisement interval, 1s
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.7.33 # VIP
}
track_script {
check_mysqlrouter # health-check script
}
}
# BACKUP node (only the settings that differ)
state BACKUP # backup node
priority 101 # lower value than the MASTER
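The health check above works because signal 0 delivers nothing: killall -0 (like kill -0) merely tests whether a matching process exists. The mechanism can be demonstrated with the current shell's own PID:

```shell
# Signal 0 performs only the existence/permission check; no signal is delivered
kill -0 $$ && echo "process is running"
# prints: process is running
```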
Configure the firewall to allow VRRP traffic:
firewall-cmd --add-rich-rule='rule protocol value="vrrp" accept' --permanent
firewall-cmd --reload
Restart keepalived (systemctl restart keepalived), then check the VIP binding with ip a.