全网唯一解决Mysql数据库宕机生产事故的通用方法高级DBA真实案例解答

解决Mysql生产事故的通用方法高级DBA真实案例解答全网唯一

国内90%的生产数据库用的最多的就是MySql数据库。企业软件生产环境通常情况下,都有一主一从,或者一主多从的HA高可用架构。结合作者本人实际经验结合一个实际的生产数据库宕机的例子,来讲解应该用怎么样的思路去解决问题。在生产事故过程中,有那些注意的关键点
作者本人简介:现任国内某大型软件公司大数据研发工程师、MySQL数据库DBA,软件架构师。直接参与设计国家级亿级别大数据项目。并维护真实企业级生产数据库300余个。紧急处理数据库生产事故上百起,挽回数据丢失所操作的灾难损失不计其数。

本文拿一个真正生产案例MYSQL宕机的实际案例做讲解,主要是叙述解决问题的思路跟方法,毕竟真正解决生产数据库事故的研发人员也是极少的。

一、遇到生产级别的数据库事故第一步需要怎么做?

第一步先去生产服务器去检查数据库备份是否完整,通过查看生产数据库的备份文件的大小判断。先理性的判断,如果数据丢失了,会造成多大的损失,数据可以全量恢复到什么时间段。比如1天前,或者几小时前。对事故做一个初步的预估,最坏的情况跟最好的情况。判断事故的严重性,这一步会影响后面的解决问题的操作的

二、通过数据库日志、应用日志判断寻找问题的线索

查看my.cnf或者my.ini判断运行日志跟错误日志的位置,如果没有配置,则可以配置上,再重启数据库。

[mysqld]
log-error=/home/gs/mysql-8.0.18/log/error.log
log=/home/gs/mysql-8.0.18/log/mysql.log

通常错误信息都会写在错误日志里面。
如果需要更加细致的信息就需要打开运行日志

三、过滤出数据库错误日志的错误的信息

真实生产MYSQL数据库报错如下:

2023-05-15T04:04:07.728576Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 21 for {space: 26572, page_no:8950} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.728745Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 22 for {space: 26572, page_no:8951} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.728895Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 23 for {space: 26572, page_no:8952} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.729068Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 24 for {space: 26572, page_no:8953} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.729235Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 25 for {space: 26572, page_no:8954} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.729391Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 26 for {space: 26572, page_no:8955} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.729546Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 27 for {space: 26572, page_no:8956} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.729705Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 28 for {space: 26572, page_no:8957} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.729865Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 29 for {space: 26572, page_no:8958} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.730027Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 30 for {space: 26572, page_no:8959} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:07.730221Z 1 [ERROR] [MY-012179] [InnoDB] Could not find any file associated with the tablespace ID: 26572
2023-05-15T04:04:07.730400Z 1 [ERROR] [MY-012964] [InnoDB] Use --innodb-directories to find the tablespace files. If that fails then use --innodb-force-recovery=1 to ignore this and to permanently lose all changes to the missing tablespace(s)
2023-05-15T04:04:07.830762Z 1 [ERROR] [MY-012930] [InnoDB] Plugin initialization aborted with error Generic error.
2023-05-15T04:04:08.228539Z 1 [ERROR] [MY-010334] [Server] Failed to initialize DD Storage Engine
2023-05-15T04:04:08.229036Z 0 [ERROR] [MY-010020] [Server] Data Dictionary initialization failed.
2023-05-15T04:04:08.229530Z 0 [ERROR] [MY-010119] [Server] Aborting
2023-05-15T04:04:08.230195Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.26)  MySQL Community Server - GPL.
2023-05-15T04:04:12.242005Z 0 [Warning] [MY-010140] [Server] Could not increase number of max_open_files to more than 10000 (request: 100000)
2023-05-15T04:04:12.242023Z 0 [Warning] [MY-010141] [Server] Changed limits: max_connections: 9190 (requested 10000)
2023-05-15T04:04:12.242031Z 0 [Warning] [MY-010142] [Server] Changed limits: table_open_cache: 400 (requested 2000)
2023-05-15T04:04:12.479065Z 0 [Warning] [MY-011068] [Server] The syntax 'slave_parallel_type' is deprecated and will be removed in a future release. Please use replica_parallel_type instead.
2023-05-15T04:04:12.479081Z 0 [Warning] [MY-011068] [Server] The syntax 'slave_parallel_workers' is deprecated and will be removed in a future release. Please use replica_parallel_workers instead.
2023-05-15T04:04:12.479104Z 0 [Warning] [MY-011068] [Server] The syntax 'slave_skip_errors' is deprecated and will be removed in a future release. Please use replica_skip_errors instead.
2023-05-15T04:04:12.479129Z 0 [Warning] [MY-011068] [Server] The syntax 'expire-logs-days' is deprecated and will be removed in a future release. Please use binlog_expire_logs_seconds instead.
2023-05-15T04:04:12.479199Z 0 [Warning] [MY-010915] [Server] 'NO_ZERO_DATE', 'NO_ZERO_IN_DATE' and 'ERROR_FOR_DIVISION_BY_ZERO' sql modes should be used with strict mode. They will be merged with strict mode in a future release.
2023-05-15T04:04:12.480718Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.26) starting as process 303345
2023-05-15T04:04:12.482007Z 0 [Warning] [MY-013242] [Server] --character-set-server: 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous.
2023-05-15T04:04:12.482016Z 0 [Warning] [MY-013244] [Server] --collation-server: 'utf8_general_ci' is a collation of the deprecated character set UTF8MB3. Please consider using UTF8MB4 with an appropriate collation instead.
2023-05-15T04:04:12.483293Z 0 [Warning] [MY-010161] [Server] You need to use --log-bin to make --binlog-format work.
2023-05-15T04:04:12.490129Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2023-05-15T04:04:17.252454Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 21 for {space: 26572, page_no:8950} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.252659Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 22 for {space: 26572, page_no:8951} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.252813Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 23 for {space: 26572, page_no:8952} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.252992Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 24 for {space: 26572, page_no:8953} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.253150Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 25 for {space: 26572, page_no:8954} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.253310Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 26 for {space: 26572, page_no:8955} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.253472Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 27 for {space: 26572, page_no:8956} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.253635Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 28 for {space: 26572, page_no:8957} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.253802Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 29 for {space: 26572, page_no:8958} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.253970Z 1 [Warning] [MY-013541] [InnoDB] Doublewrite page 30 for {space: 26572, page_no:8959} could not be restored. File name unknown for tablespace ID 26572
2023-05-15T04:04:17.254166Z 1 [ERROR] [MY-012179] [InnoDB] Could not find any file associated with the tablespace ID: 26572
2023-05-15T04:04:17.254350Z 1 [ERROR] [MY-012964] [InnoDB] Use --innodb-directories to find the tablespace files. If that fails then use --innodb-force-recovery=1 to ignore this and to permanently lose all changes to the missing tablespace(s)
2023-05-15T04:04:17.354703Z 1 [ERROR] [MY-012930] [InnoDB] Plugin initialization aborted with error Generic error.
2023-05-15T04:04:17.752409Z 1 [ERROR] [MY-010334] [Server] Failed to initialize DD Storage Engine
2023-05-15T04:04:17.752885Z 0 [ERROR] [MY-010020] [Server] Data Dictionary initialization failed.
2023-05-15T04:04:17.753405Z 0 [ERROR] [MY-010119] [Server] Aborting
2023-05-15T04:04:17.753960Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.26)  MySQL Community Server - GPL.

全网唯一解决Mysql数据库宕机生产事故的通用方法高级DBA真实案例解答_第1张图片
先看[ERROR]标识的日志,错误是因为数据库宕机的原因,但是往往错误信息与上方的警告信息互相关联。其实本质可能是同一个问题。

[MY-012179] [InnoDB] Could not find any file associated with the tablespace ID: 26572

本次案例中的报错主要是讲表空间文件破坏或者丢失,线索就是表空间ID。我们要跟福尔摩斯一样,通过这个表空间ID去寻找究竟是那个数据库的那个表关联到这个表空间。把这个损害的表定位到,然后尝试恢复数据。

四、启动数据库安全模式,先将数据库忽略错误启动。

[mysqld]
innodb_force_recovery = 1  #我们从1到6依次的启动,按级别的高级顺序。

全网唯一解决Mysql数据库宕机生产事故的通用方法高级DBA真实案例解答_第2张图片
MySQL在转储文件或者导入数据的过程中,出现中断、失败或者异常,造成数据无法回滚,可以通过innodb_force_recovery强力迫使InnoDB存储引擎运行,同时阻止后台操作运行,以便转储表数据。

mysql官方提供了6个等级
Innodb事务型存储引擎,通过redo,undo,double write这些特性保证数据的完整,针对硬件故障,内核bug,突然断电的事件,需要手动对Innodb进行恢复;
可以将Innodb page 损坏分为几类,data page 损坏,secondary_index page 损坏, root index 损坏,data dictionary 损坏,恢复的难度依次增加;与朋友一起恢复innodb的时候,重新认识了下innodb_force_recovery;
最初对innodb_force_recovery 的认识只是错误的停留在 它只针对无法启动的时候使用,1-6的参数,对损坏数据只在启动的时候不去检查;后来才明白启用该参数后,MySQL redo only 就是为了保证对应参数里面的值,不启用后台thread的任何检查直至设置innodb_force_recovery=0才可以,同时跟大家分享下, check table 的结果对于innodb 是不可信的(明明error log报page错误,但检测结果仍是ok) (以下内容多参考官网)
innodb_force_recovery 在使用的时候,能尽量从1-6依次递增,=3的时候,已经包括 =1 和=2的处理情况,一般 = 1-3的时候,数据的完整性相对来说还是可以保证的(除了已经损坏的部分),>=4 的时候可能造成 page处于一种相对“过时”(obsolete state),(如果不进行重建损坏的表),可能造成B-trees and other database structures 的损坏,>0 的时候,INSERT,UPDATE,DELETE这些操作都是禁止的,下面介绍下各个参数的具体含义:
1 (SRV_FORCE_IGNORE_CORRUPT):
强制忽略corrupt page并自动跳过,期间可以dump table;
2 (SRV_FORCE_NO_BACKGROUND):
在前置忽略corrupt page 的基础上(包含=1的作用),阻塞 master thread 和 任何的 purge thread 运行(有效防止在purge的时候发生MySQL crash)
3 (SRV_FORCE_NO_TRX_UNDO):
在忽略 corrupt page,阻塞 purge thread的基础上,不进行 transaction rollback;
4 (SRV_FORCE_NO_IBUF_MERGE):
在忽略 corrupt page,阻塞 purge thread,禁止 transaction rollback 基础上,禁止 merge insert buffer,对 table statistics 不进行更新;(这样会损坏 data file,等恢复后最好重建所有的secondary index);
5 (SRV_FORCE_NO_UNDO_LOG_SCAN):
在忽略 corrupt page ,阻塞purge thread,禁止 transaction rollback,禁止merge insert buffer,停止 table statistic 的基础上,在启动 MySQL的时候,不在扫描 undo logs,对待incomplete transactions as committed;
6 (SRV_FORCE_NO_LOG_REDO):
在以上所有的基础上,redo log 不进行前滚(roll-forward)
这里再次提醒下,对Innodb_force_recovery的赋值最好是依次递增(除非自己做过严格测试)

五、MYSQL服务容错启动之后定位破坏的表文件

MYSQL容错启动之后,先通过表空间ID的线索找到是那个表。
我们需要进入information_schma数据库

在这里插入图片描述

select * from INNODB_TABLES where SPACE=26572
select * from INNODB_TABLESPACES where SPACE=26572

通过2个查询既可以获取到具体是那个表关联文件破坏了。

六、通过定位到具体的表,再通过历史备份恢复生产数据

将生产数据库的该表直接drop掉,然后重建,再把数据从最近历史备份分离出来,恢复回去。这样破坏的文件会重建。

也可以将该数据库先DUMP下来,然后删除整个数据库重新恢复。所有的破坏的表文件,都会重建!
innodb_force_recovery >0是不能更新操作的。所以先要在>0的状态尽量DUMP将生产的数据备份出来,如果DUMP报错,则用历史的备份恢复。然后删除掉关联坏的表或者数据库。DROP的时候尽量按表、数据库的单位删除,因为会关联很多莫名的未知文件。整体思路就是找到坏的文件关联的表或者数据库,让MySql去删除操作,然后按常规的顺序重建,重新创建文件。
然后在innodb_force_recovery =0的状态下,再重新新增表操作并插入数据。

七、关闭安全模式,重新启动生产数据库。

[mysqld]
innodb_force_recovery = 0  #配置成0

重启数据库服务。在innodb_force_recovery =0的状态下,再重新新增表操作并插入数据

如果配置有主从的集群,则需要在master主操作,将坏的表,数据库重建,然后将操作遗传至各个slave,因为上述是drop操作,理论上主从同步过程中没有什么冲突。如果存在冲突,可以配置从库slalve忽略一些冲突(master表文件破坏关联的问题)。或者笨方法,主从数据重置同步,主从关系重新配置(傻瓜方法)。

作者后言

本案例只是举一个简单的经常遇到的宕机案例,而实际的情况会更加复杂,比如掺杂这数据库主从集群数据恢复问题,或者中间的操作过程并没有作者叙述的那么顺利,中间的每个环节都可能被卡住。实际的情况,也根据实际的局面跟以往的经验去尝试解决,本文案例只是举一个毕竟常见的宕机情况。基本上数据库运维人员都会碰到。最好提前部署好HA高可用架构,让数据丢失的可能性降低到最低。但是往往出现生产事故中,是需要DBA迅速马上解决问题的,真实的情况遇到的压力会非常大,时间也非常紧迫,造成的人力物力财力的损失也不可估量,更需要我们的“胆量”,“信心”,“魄力”,‘技术能力’,毕竟真正能处理数据库生产事故的开发是国内开发团队中也是极少数。

你可能感兴趣的:(DBA运维,数据库技术,MySQL,数据库,mysql,dba)