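I. Scenario: At 7 a.m. I was woken by a phone call: a large number of business orders had failed overnight, and the web server had reported many JDBCException and SocketTimeoutException errors. That clearly pointed to the database. Since this was the first day the newly deployed xtrabackup backup script had run, I suspected the script. The production environment is Red Hat 5.8 with MySQL 5.6 and xtrabackup 2.2.10.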
innobackupex: Backup created in directory '/backup/inc_1'
150825 06:55:57 innobackupex: Connection to database server closed
150825 06:55:57 innobackupex: completed OK!
150825 01:04:09 innobackupex: Continuing after ibbackup has suspended
150825 01:04:09 innobackupex: Executing FLUSH TABLES WITH READ LOCK...
>> log scanned up to (547712171257)
>> log scanned up to (547712171257)
>> log scanned up to (547712171257)
(... many more "log scanned up to" lines ...)
150825 06:55:44 innobackupex: All tables locked and flushed to disk
150825 06:55:44 innobackupex: Starting to backup non-InnoDB tables and files
150825 06:55:57 innobackupex: All tables unlocked
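The timestamps in the log above can be turned into durations with a few lines of Python. This is only a sketch: the helper names parse_events and lock_timings are made up for illustration, and it assumes every relevant line has the "yymmdd HH:MM:SS innobackupex: message" shape seen in this log.

```python
from datetime import datetime

# Three lines copied from the innobackupex log above.
LOG = """\
150825 01:04:09 innobackupex: Executing FLUSH TABLES WITH READ LOCK...
150825 06:55:44 innobackupex: All tables locked and flushed to disk
150825 06:55:57 innobackupex: All tables unlocked
"""

def parse_events(log_text):
    """Map each innobackupex message to its timestamp (yymmdd HH:MM:SS)."""
    events = {}
    for line in log_text.splitlines():
        stamp, sep, message = line.partition(" innobackupex: ")
        if not sep:
            continue  # skip lines such as ">> log scanned up to (...)"
        events[message.strip()] = datetime.strptime(stamp, "%y%m%d %H:%M:%S")
    return events

def lock_timings(log_text):
    """Return (seconds spent waiting for the lock, seconds the lock was held)."""
    ev = parse_events(log_text)
    requested = ev["Executing FLUSH TABLES WITH READ LOCK..."]
    acquired = ev["All tables locked and flushed to disk"]
    released = ev["All tables unlocked"]
    return ((acquired - requested).total_seconds(),
            (released - acquired).total_seconds())

waited, held = lock_timings(LOG)
print(f"waited {waited:.0f}s for the lock, held it for {held:.0f}s")
# prints: waited 21095s for the lock, held it for 13s
```

21095 seconds is a little under six hours of waiting, against only 13 seconds of actually holding the lock.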
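Clearly, xtrabackup was blocked while executing "FLUSH TABLES WITH READ LOCK": the statement was issued at 01:04:09 and did not succeed until 06:55:44. I had not fully appreciated how powerful "FLUSH TABLES WITH READ LOCK" is. When it runs, one of two things happens:
1) It succeeds quickly, and the rest of the backup finishes fairly soon. The catch is that between "FLUSH TABLES WITH READ LOCK" and "All tables unlocked" (UNLOCK TABLES) the tables are locked, whether they are InnoDB or MyISAM. The log above shows the lock was held for 13 seconds (06:55:44 to 06:55:57), and for those 13 seconds every data-modifying statement (INSERT, UPDATE, DELETE) was blocked.
2) It cannot succeed quickly, and then disaster strikes. This was the root cause of the incident. What can block "FLUSH TABLES WITH READ LOCK"? A simple experiment confirms it.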
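The experiment shows that a single long-running query is enough to block "FLUSH TABLES WITH READ LOCK".
The same happens with explicit locks: run LOCK TABLE t1 READ on any table, say t1, then run "FLUSH TABLES WITH READ LOCK" in another session, and it blocks. Taking a write lock with LOCK TABLE t1 WRITE blocks "FLUSH TABLES WITH READ LOCK" as well.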
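Since a long-running query is enough to stall the lock, one precaution is to look for such sessions before the backup starts. The sketch below is hypothetical: likely_blockers and the SAMPLE rows are invented for illustration, and the rows only mimic the Id, Command, Time and Info columns of SHOW PROCESSLIST. It finds long-running statements only; an idle session holding an explicit LOCK TABLE would not be caught this way.

```python
def likely_blockers(processlist, min_seconds=60):
    """Return sessions that have been executing a statement for min_seconds or more.

    processlist is a list of dicts shaped like SHOW PROCESSLIST rows.
    The backup's own FLUSH TABLES WITH READ LOCK statement is ignored.
    """
    blockers = []
    for row in processlist:
        info = (row.get("Info") or "").strip()
        if row.get("Command") != "Query" or not info:
            continue  # idle or non-query sessions are out of scope here
        if info.upper().startswith("FLUSH TABLES WITH READ LOCK"):
            continue  # this is the waiter, not a blocker
        if row.get("Time", 0) >= min_seconds:
            blockers.append(row)
    # Longest-running statement first.
    return sorted(blockers, key=lambda r: r["Time"], reverse=True)

# Made-up rows for illustration only; they are not taken from the incident.
SAMPLE = [
    {"Id": 1, "Command": "Query", "Time": 4100,
     "Info": "SELECT COUNT(*) FROM t1 AS a JOIN t1 AS b"},
    {"Id": 2, "Command": "Query", "Time": 300,
     "Info": "FLUSH TABLES WITH READ LOCK"},
    {"Id": 3, "Command": "Sleep", "Time": 900, "Info": None},
    {"Id": 4, "Command": "Query", "Time": 5, "Info": "SELECT 1"},
]

for row in likely_blockers(SAMPLE):
    print(row["Id"], row["Time"], row["Info"])
```

In a real script the rows would come from the server, and the threshold would have to be tuned to the workload; 60 seconds is an arbitrary default.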
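Whatever the cause, the consequences of a blocked "FLUSH TABLES WITH READ LOCK" are grim. Not only are DML statements in every database blocked, but in the database that has the long query or the explicit lock, queries run into trouble too, as the following slow query log entries show: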
# Time: 150825 6:55:57
# User@Host: root[root] @ [] Id: 9922648
# Query_time: 1534.181557 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0
SET timestamp=1440456957;
INSERT INTO `globallink_g`.`g_order` (`id`, `userid`, `create_date`, `modify_date`, `amount_paid`, `order_status`) VALUES (234977, 'test', '2015-08-25 06:36:39', '2015-08-25 06:36:42', 232, 1);
# Time: 150825 6:55:58
# User@Host: root[root] @ [] Id: 9921401
# Query_time: 4173.803182 Lock_time: 0.000081 Rows_sent: 1 Rows_examined: 9742
SET timestamp=1440456958;
call g_getActiveImsi('201508250551','10008');
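To pull the stalled statements out of a larger slow query log, the entries can be parsed mechanically. The following is a minimal sketch that assumes the log has exactly the layout of the excerpt above; parse_slow_log and the two regular expressions are illustrative names, not part of any MySQL tool.

```python
import re

# The two entries from the slow query log above, copied verbatim.
SLOW_LOG = """\
# Time: 150825 6:55:57
# User@Host: root[root] @ [] Id: 9922648
# Query_time: 1534.181557 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0
SET timestamp=1440456957;
INSERT INTO `globallink_g`.`g_order` (`id`, `userid`, `create_date`, `modify_date`, `amount_paid`, `order_status`) VALUES (234977, 'test', '2015-08-25 06:36:39', '2015-08-25 06:36:42', 232, 1);
# Time: 150825 6:55:58
# User@Host: root[root] @ [] Id: 9921401
# Query_time: 4173.803182 Lock_time: 0.000081 Rows_sent: 1 Rows_examined: 9742
SET timestamp=1440456958;
call g_getActiveImsi('201508250551','10008');
"""

HEADER = re.compile(r"^# User@Host: .* Id:\s*(\d+)")
TIMING = re.compile(r"^# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)")

def parse_slow_log(text):
    """Return one dict per entry: thread_id, query_time, lock_time, statement."""
    entries = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # A "# User@Host" line starts a new entry.
            current = {"thread_id": int(header.group(1)), "statement": ""}
            entries.append(current)
            continue
        if current is None:
            continue  # e.g. the "# Time:" line before the first entry
        timing = TIMING.match(line)
        if timing:
            current["query_time"] = float(timing.group(1))
            current["lock_time"] = float(timing.group(2))
        elif not line.startswith("#") and not line.startswith("SET timestamp="):
            current["statement"] += line
    return entries

for entry in parse_slow_log(SLOW_LOG):
    print(entry["thread_id"], entry["query_time"], entry["lock_time"],
          entry["statement"][:40])
```

Note that in both entries Lock_time is close to zero while Query_time runs to many minutes, so filtering on Lock_time alone would have missed these statements.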
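1. The production database uses a traditional HA architecture: the data directory sits on a shared disk array and cluster software monitors the mysql process. It is not the popular MySQL master-slave architecture. With master-slave, the backup could run directly on the slave and would not affect the master at all.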
Use this option to disable table lock with FLUSH TABLES WITH READ LOCK. Use it only if ALL your tables are InnoDB and you DO NOT CARE about the binary log position of the backup. This option shouldn't be used if there are any DDL statements being executed or if any updates are happening on non-InnoDB tables (this includes the system MyISAM tables in the mysql database), otherwise it could lead to an inconsistent backup. If you are considering to use --no-lock because your backups are failing to acquire the lock, this could be because of incoming replication events preventing the lock from succeeding. Please try using --safe-slave-backup to momentarily stop the replication slave thread, this may help the backup to succeed and you then don't need to resort to using this option. xtrabackup_binlog_info is not created when --no-lock option is used (because SHOW MASTER STATUS may be inconsistent), but under certain conditions xtrabackup_binlog_pos_innodb can be used instead to get consistent binlog coordinates as described in Working with Binary Logs.
This article comes from the "记忆碎片" blog. Please do not repost.