The business team asked: why is this record missing in data center 10 when every other data center has it?
mysql> select * from tbl_groupinfo where gid=xxxxxxx limit 10;
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| sid        | tm_timestamp | tm_lasttime | gid                 | group_name | default_flag | group_attr | group_owner | group_extension | is_del | app_id | mic_seat | invite_perm | invite_media_perm | pub_id_search | apply_verify | public_id | introduc | topic_id | __version           | __deleted |
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
| xxxxxxxxxx |   1495773704 |  1495773704 |         xxxxxxxxxxx | 处对象     |            0 |          5 |  3611732366 | vx:wtc2033      |      0 |     18 |        8 |           0 |                 0 |             1 |            0 |         0 |          |        0 | 6126694332813803019 |         0 |
+------------+--------------+-------------+---------------------+------------+--------------+------------+-------------+-----------------+--------+--------+----------+-------------+-------------------+---------------+--------------+-----------+----------+----------+---------------------+-----------+
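The likely conclusion is that data sync into data center 10 is broken. The first step is to find out which data center this record was inserted from, and then check whether sync between data center 10 and that data center is healthy. Logging in via 8827, I fetched the row's version number, __version, and ran it through the conversion function, which shows the row was inserted from data center 14:
Date: 2017-05-26 05:03:03  Data center: 14  Port: 11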
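This is comparable to MySQL's binlog, which records the server-id each SQL statement came from in order to prevent circular replication. myshard records the server-id in its binlog too, and in addition every row carries a version number that encodes which data center and which port wrote it, and when.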
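The loop-prevention idea described above can be illustrated with a minimal sketch. It is not myshard's or MySQL's actual code; the `Change` type and the function names are hypothetical, and the only rule shown is "never re-apply a change that originated on this node".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    origin_id: int  # node that first accepted the write (plays the role of a server-id)
    payload: str    # hypothetical stand-in for the replicated statement or row


def should_apply(change: Change, local_id: int) -> bool:
    """A node skips changes it originated itself, which breaks replication loops."""
    return change.origin_id != local_id


def replicate(incoming, local_id):
    """Return the subset of incoming changes that a node with local_id would apply."""
    return [c for c in incoming if should_apply(c, local_id)]
```

For example, a node with id 10 that receives changes originating from 14 and from 10 would apply only the one from 14.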
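At this point we know that data written in data center 14 is not reaching data center 10, so the next step is to run the sync status command on data center 14: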
[root@centos local]# echo stat | /scripts/nc_myshard 0 14505 |egrep "speed|behind|offset"
shard_local Read_offset 48494420885
shard_local Read_speed 33373
shard_local Read_bytes_behind 0
sync_r12m0 Read_offset 48494420885
sync_r12m0 Read_speed 33373
sync_r12m0 Read_bytes_behind 0
sync_r13m0 Read_offset 48494420885
sync_r13m0 Read_speed 33373
sync_r13m0 Read_bytes_behind 0
sync_r1m0 Read_offset 48494420885
sync_r1m0 Read_speed 33373
sync_r1m0 Read_bytes_behind 0
sync_r3m0 Read_offset 48494420885
sync_r3m0 Read_speed 33373
sync_r3m0 Read_bytes_behind 0
shard_remote Read_offset 52080697507
shard_remote Read_speed 27290
shard_remote Read_bytes_behind 0
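There is no r10m0 entry, meaning data center 10 is not pulling any data, which confirms that sync is broken. The sync log on data center 10 shows it reconnecting to the data center 14 endpoint over and over: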
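Spotting a missing peer by eye is easy to get wrong, so the check can be scripted. The sketch below assumes the `stat` output has one `<channel> <metric> <value>` triple per line, as in the listing above; the list of peers expected to appear is an assumption supplied by the operator, not something myshard reports.

```python
from collections import defaultdict

# Sample taken from the stat output shown above (one triple per line is assumed).
SAMPLE = """\
shard_local Read_offset 48494420885
shard_local Read_bytes_behind 0
sync_r12m0 Read_offset 48494420885
sync_r12m0 Read_bytes_behind 0
sync_r13m0 Read_offset 48494420885
sync_r13m0 Read_bytes_behind 0
sync_r1m0 Read_offset 48494420885
sync_r1m0 Read_bytes_behind 0
sync_r3m0 Read_offset 48494420885
sync_r3m0 Read_bytes_behind 0
shard_remote Read_offset 52080697507
shard_remote Read_bytes_behind 0
"""


def parse_stat(text):
    """Parse '<channel> <metric> <value>' lines into {channel: {metric: int}}."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore anything that is not a plain triple
        channel, metric, value = parts
        try:
            stats[channel][metric] = int(value)
        except ValueError:
            continue  # ignore non-numeric values
    return dict(stats)


def missing_peers(stats, expected):
    """Return the expected sync channels that do not appear in the stat output."""
    return sorted(set(expected) - set(stats))


# Hypothetical list of peers that should be pulling from this node.
EXPECTED = ["sync_r1m0", "sync_r3m0", "sync_r10m0", "sync_r12m0", "sync_r13m0"]
print(missing_peers(parse_stat(SAMPLE), EXPECTED))
```

A peer that is absent from the output is a stronger signal than a large `Read_bytes_behind`, because a disconnected peer reports no lag at all.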
[root@localhost db_sync_HelloSrv_r10m0_d]# zcat db_sync_xxxxxxxx_r10m0_d.log.13.gz|grep xxx.xxx.xxx.144|more
May 13 15:05:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:05:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:06:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:06:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:07:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:07:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:07:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:08:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:08:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:09:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:09:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:09:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:10:11 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:10:31 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:10:51 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
May 13 15:11:21 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:0
May 13 15:11:41 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:1
May 13 15:12:01 info db_sync_xxxxxxxx_r10m0_d[]: [tid:77428] [OpLogSynchronizer::run] connecting to sync_r14m0 via xxx.xxx.xxx.144:12505, retry:2
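The log is full of repeated attempts to connect to data center 14. The earliest reconnects appear in the file
db_sync_xxxxxxxx_r10m0_d.log.13.gz
which was written on May 14 (its entries are dated May 13):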
-rw-r--r--. 1 root adm 174K May 13 00:10 db_sync_xxxxxxxx_r10m0_d.log.14.gz
-rw-r--r--. 1 root adm 300K May 14 00:10 db_sync_xxxxxxxx_r10m0_d.log.13.gz
-rw-r--r--. 1 root adm 230K May 15 00:10 db_sync_xxxxxxxx_r10m0_d.log.12.gz
-rw-r--r--. 1 root adm 234K May 16 00:10 db_sync_xxxxxxxx_r10m0_d.log.11.gz
-rw-r--r--. 1 root adm 260K May 17 00:10 db_sync_xxxxxxxx_r10m0_d.log.10.gz
-rw-r--r--. 1 root adm 261K May 18 00:10 db_sync_xxxxxxxx_r10m0_d.log.9.gz
-rw-r--r--. 1 root adm 260K May 19 00:10 db_sync_xxxxxxxx_r10m0_d.log.8.gz
-rw-r--r--. 1 root adm 258K May 20 00:10 db_sync_xxxxxxxx_r10m0_d.log.7.gz
-rw-r--r--. 1 root adm 260K May 21 00:10 db_sync_xxxxxxxx_r10m0_d.log.6.gz
-rw-r--r--. 1 root adm 268K May 22 00:10 db_sync_xxxxxxxx_r10m0_d.log.5.gz
-rw-r--r--. 1 root adm 254K May 23 00:10 db_sync_xxxxxxxx_r10m0_d.log.4.gz
-rw-r--r--. 1 root adm 259K May 24 00:10 db_sync_xxxxxxxx_r10m0_d.log.3.gz
-rw-r--r--. 1 root adm 262K May 25 00:10 db_sync_xxxxxxxx_r10m0_d.log.2.gz
-rw-r--r--. 1 root adm 262K May 26 00:10 db_sync_xxxxxxxx_r10m0_d.log.1.gz
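Finding the first rotated file that contains a reconnect message can also be automated instead of running zcat on each file in turn. The following is a rough sketch under two assumptions: rotated files are named `*.log.N.gz` with a larger N meaning an older file, and the retry lines look like the ones shown earlier. The function names are illustrative only.

```python
import gzip
import re
from pathlib import Path

# Matches lines such as:
# "May 13 15:05:31 info ...: [...] connecting to sync_r14m0 via <ip>:<port>, retry:0"
RETRY = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*connecting to (\S+) via \S+ retry:\d+"
)


def rotation_index(path):
    """Extract N from a '*.log.N.gz' file name; -1 if the name does not match."""
    m = re.search(r"\.log\.(\d+)\.gz$", path.name)
    return int(m.group(1)) if m else -1


def first_retry(log_dir, peer):
    """Scan rotated logs from oldest to newest and return (file name, timestamp)
    of the first retry line for the given peer, or None if there is none."""
    files = sorted(Path(log_dir).glob("*.log.*.gz"), key=rotation_index, reverse=True)
    for path in files:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            for line in fh:
                m = RETRY.match(line)
                if m and m.group(2) == peer:
                    return path.name, m.group(1)
    return None
```

Called as `first_retry("/path/to/logs", "sync_r14m0")`, it reports the file and timestamp where the reconnect loop began, which is the date needed to reason about how long the peer has been disconnected.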
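Repeated reconnects usually have only two possible causes. One is that data center 14 has not whitelisted data center 10 and refuses its connections, but since the initial setup succeeded, the whitelist must have been opened. That leaves the firewall as the likely culprit, so on data center 14 I ran: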
iptables -n -L|grep <IP of data center 10>
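The China Telecom IP had a rule, but the China Unicom IP had no firewall rule at all. This is a dual-line data center, and I had deployed the environment on May 12. So two days after deployment, because of network quality, the Telecom path became unreachable and the connection switched to the Unicom path. The Unicom IP was not authorized, so data center 10 could no longer connect to data center 14. At the time the business was not using this database. Yesterday, May 25, the business team started deploying processes in data center 14, noticed the data was not syncing, and only then contacted the DBA. I immediately added the firewall rule and restarted the sync process to pull the data again, but data center 10 kept reporting errors and reconnecting.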
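The firewall check above covers one IP at a time; a small script can check every IP of a multi-line data center in one pass. This is only a sketch: it assumes the usual column layout of `iptables -n -L` (target, prot, opt, source, destination), looks at ACCEPT rules only, and ignores chain order, ports and protocols. The addresses in the example come from documentation ranges and are not the real ones.

```python
import ipaddress


def accepted_sources(listing):
    """Collect the source networks of ACCEPT rules from `iptables -n -L` output."""
    nets = []
    for line in listing.splitlines():
        fields = line.split()
        # Expected layout: target prot opt source destination [options...]
        if len(fields) >= 5 and fields[0] == "ACCEPT":
            try:
                nets.append(ipaddress.ip_network(fields[3], strict=False))
            except ValueError:
                continue  # e.g. negated sources like "!10.0.0.1" are skipped
    return nets


def uncovered_ips(listing, required_ips):
    """Return the required IPs that no ACCEPT rule's source network contains."""
    nets = accepted_sources(listing)
    return [
        ip for ip in required_ips
        if not any(ipaddress.ip_address(ip) in net for net in nets)
    ]


# Hypothetical listing: the first line's IP is allowed, the other is not.
LISTING = """\
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     tcp  --  203.0.113.10         0.0.0.0/0            tcp dpt:12505
"""
print(uncovered_ips(LISTING, ["203.0.113.10", "198.51.100.20"]))
```

Any address the function returns would need a rule before the peer can fail over to that line.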
On data center 14, however, a different error was showing up:
May 26 15:41:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3159] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3161] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3163] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:41:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:3234] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4411] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4416] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4560] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4656] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4657] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:42:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:4730] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:05 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5476] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:15 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5478] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:25 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5508] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:35 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5511] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:45 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5554] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
May 26 15:43:55 err db_sync_xxxxxxxx_r14m0_d[]: [tid:5557] [SyncServer::dumpRecord2] client:sync_r10m0, request log[local] from invalid offset:2587613132
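It keeps reporting that offset 2587613132 does not exist and cannot be pulled. The cause is that data center 10 had been disconnected for too long, while data center 14 keeps its binlog for only 7 days: May 14 + 7 days = May 21, after which the requested offset was gone. The fix was to find an offset on data center 14 that still exists and has not been purged, and have data center 10 pull from there. Then I asked the business team which tables had been written to in data center 14, exported those tables from data center 14, and imported them into data center 10.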
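The retention arithmetic above is simple, but writing it down makes the deadline explicit. The sketch below assumes retention is counted in whole days from the last day the peer was connected, and that the last day of the window still counts as resumable; the function names are illustrative only.

```python
from datetime import date, timedelta


def resume_deadline(last_synced: date, retention_days: int) -> date:
    """Last day on which the peer's saved offset is still expected to exist."""
    return last_synced + timedelta(days=retention_days)


def can_resume(last_synced: date, retention_days: int, today: date) -> bool:
    """True if the peer can still resume from its saved offset (boundary day included)."""
    return today <= resume_deadline(last_synced, retention_days)


# Dates from this incident: disconnected around May 14, noticed on May 26.
print(resume_deadline(date(2017, 5, 14), 7))
print(can_resume(date(2017, 5, 14), 7, date(2017, 5, 26)))
```

With a 7-day retention, an outage has to be noticed within a week for a plain reconnect to be enough; past that point a new offset plus a data backfill is needed, as happened here.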
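One advantage of myshard is that missing data can be patched by exporting and importing data, whereas with MySQL you would have to rely on Percona's repair tools. This was also a lesson for me: when network conditions between data centers are poor, access must be opened for every IP, not just some of them. The business side also had to backfill data. In the post-mortem meeting we agreed on the following rules: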
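Monitoring of myshard must be thorough enough that replication lag is detected promptly.
When requesting database access, business staff must provide all IPs of a multi-line data center (China Telecom IP, China Unicom IP, internal IP, management network IP).
myshard should use consistent hashing: for any given user, data should be modified in the same data center where it was written.
When sync is lagging, do not switch business traffic between nodes.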
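The consistent-hashing rule above can be sketched as a hash ring that maps each user to one data center. This is a generic illustration, not how myshard routes requests; the class, the use of MD5, and the number of virtual nodes are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random positions.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring position after the key's hash."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._keys, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Data center names taken from the sync channel names in this post.
ring = HashRing(["r1m0", "r10m0", "r14m0"])
print(ring.node_for("user:3611732366"))
```

Because the same user id always lands on the same data center, writes and later modifications for that user go to one place, and removing a data center only remaps the users that were assigned to it.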