Kubernetes & Database: Recording and Reflecting on a MySQL Startup Failure in a Container Cloud Environment

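A little after ten last night, a colleague from operations messaged me asking for help with a MySQL instance that would not start in one of our container cloud projects.
This MySQL instance is a component of the company's container cloud product. It mainly stores version information for the platform components: every hotfix updates this database, and other components read it when creating a new cluster or performing an upgrade.
The cause of the problem turned out to be simple, but because I was not on site and communication was difficult, it took more than half an hour to track down.
Note: the platform uses glusterfs for storage.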

The MySQL log I saw at first was as follows:


InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
13:42:00 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

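From this log, my initial conclusion was that the MySQL container had gone down for some other reason and that, on restart, inconsistent data had triggered the bug message.
So I asked the operations colleague to back up the mounted data files first, so that later operations would not damage the data.
Following the link given in the log, http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html, we tried setting the innodb_force_recovery parameter to ignore the errors and force the database up, planning to back it up and rebuild it afterwards.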

But the log we saw afterwards contained:

~ # kubectl logs -f am-mysql-6985689999-qh9v2                                                                             root@kube-master-1
2020-08-23T13:42:15.650213Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2020-08-23T13:42:15.656529Z 0 [Warning] Can't create test file /var/lib/mysql/am-mysql-6985689999-qh9v2.lower-test
2020-08-23T13:42:15.656661Z 0 [Note] mysqld (mysqld 5.7.26) starting as process 1 ...
2020-08-23T13:42:15.667091Z 0 [Warning] Can't create test file /var/lib/mysql/am-mysql-6985689999-qh9v2.lower-test
2020-08-23T13:42:15.672911Z 0 [Warning] Can't create test file /var/lib/mysql/am-mysql-6985689999-qh9v2.lower-test
2020-08-23T13:42:15.675303Z 0 [Note] InnoDB: PUNCH HOLE support available
2020-08-23T13:42:15.675338Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2020-08-23T13:42:15.675342Z 0 [Note] InnoDB: Uses event mutexes
2020-08-23T13:42:15.675346Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2020-08-23T13:42:15.675349Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2020-08-23T13:42:15.675354Z 0 [Note] InnoDB: Using Linux native AIO
2020-08-23T13:42:15.675745Z 0 [Note] InnoDB: Number of pools: 1
2020-08-23T13:42:15.675925Z 0 [Note] InnoDB: Using CPU crc32 instructions
2020-08-23T13:42:15.678565Z 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2020-08-23T13:42:15.690587Z 0 [Note] InnoDB: Completed initialization of buffer pool
2020-08-23T13:42:15.694116Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2020-08-23T13:42:15.720597Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2020-08-23T13:42:15.726557Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 12862247
2020-08-23T13:42:15.726580Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 12864344
2020-08-23T13:42:15.726893Z 0 [Note] InnoDB: Database was not shutdown normally!
2020-08-23T13:42:15.726904Z 0 [Note] InnoDB: Starting crash recovery.
2020-08-23T13:42:15.742203Z 0 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
2020-08-23T13:42:16.245030Z 0 [Note] InnoDB: Apply batch completed
2020-08-23T13:42:16.390120Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2020-08-23T13:42:16.390155Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2020-08-23T13:42:16.396469Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2020-08-23 13:42:18 0x7f295c22a740  InnoDB: Assertion failure in thread 139815616161600 in file os0file.cc line 3109
InnoDB: We intentionally generate a memory trap.

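The error is actually obvious, but at first I focused on the bug-related messages...
During startup the log shows InnoDB: Setting file './ibtmp1' size to 12 MB.
In a normal startup the next line should be [Note] InnoDB: File './ibtmp1' size is now 12 MB.
In this incident, however, the thread hit an assertion failure in os0file.cc while creating the file.
Combined with the earlier Can't create test file warnings, this pointed to a storage problem. A look at the storage capacity allocated to the pv made it clear: the incident was caused by insufficient storage space.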
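The reasoning above can be sketched as a small log triage helper. This is only an illustration, not the tooling used during the incident: the patterns are copied from the log lines quoted in this post, while the function names, the labels, and the heuristic itself are assumptions made for the example.

```python
import re

# Patterns copied from the log lines quoted in this post.
# The dictionary keys are illustrative labels, not MySQL terminology.
STORAGE_HINTS = {
    "cannot_create_test_file": re.compile(r"\[Warning\] Can't create test file"),
    "os_file_assertion": re.compile(r"Assertion failure in thread \d+ in file os0file\.cc"),
    "ibtmp1_sizing_started": re.compile(r"Setting file '\./ibtmp1' size to \d+ MB"),
    "ibtmp1_sizing_finished": re.compile(r"File '\./ibtmp1' size is now \d+ MB"),
}


def scan_log(text):
    """Count how many log lines match each storage-related pattern."""
    counts = {name: 0 for name in STORAGE_HINTS}
    for line in text.splitlines():
        for name, pattern in STORAGE_HINTS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def looks_like_storage_problem(counts):
    """Heuristic only: the data directory was not writable, an assertion
    fired in os0file.cc, or the ibtmp1 sizing step never finished."""
    sizing_stuck = counts["ibtmp1_sizing_started"] > counts["ibtmp1_sizing_finished"]
    return (
        sizing_stuck
        or counts["cannot_create_test_file"] > 0
        or counts["os_file_assertion"] > 0
    )


# Three lines taken from the failing startup log above.
SAMPLE = """\
2020-08-23T13:42:15.656529Z 0 [Warning] Can't create test file /var/lib/mysql/am-mysql-6985689999-qh9v2.lower-test
2020-08-23T13:42:16.396469Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2020-08-23 13:42:18 0x7f295c22a740  InnoDB: Assertion failure in thread 139815616161600 in file os0file.cc line 3109
"""

if __name__ == "__main__":
    result = scan_log(SAMPLE)
    print(result)
    print("storage suspected:", looks_like_storage_problem(result))
```

A positive result only says that storage deserves a look before anything else; it does not prove that the disk is full.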
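As for why capacity ran out: the pv mounted by MySQL shares a partition with other applications, the total allocated to them exceeded the quota, and the other applications had consumed most of the space, leaving MySQL unable to use storage. The root cause was undisciplined use of storage resources.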
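The over-allocation described here comes down to simple arithmetic, sketched below. All figures and volume names other than am-mysql are hypothetical, since the post does not give the real quota or usage numbers; the function name is also an assumption made for the example.

```python
def overcommit_report(quota_gib, allocations_gib, used_gib):
    """Compare what was promised to each volume with what the shared
    partition can actually hold. All sizes are whole GiB."""
    allocated = sum(allocations_gib.values())
    used = sum(used_gib.values())
    return {
        "allocated_gib": allocated,
        # How far the promised capacity exceeds the partition quota.
        "overcommitted_gib": max(0, allocated - quota_gib),
        # Space that any volume on the partition can still write to.
        "free_gib": max(0, quota_gib - used),
    }


if __name__ == "__main__":
    # Hypothetical numbers: a 100 GiB quota shared by two volumes.
    report = overcommit_report(
        quota_gib=100,
        allocations_gib={"am-mysql": 20, "other-app": 100},
        used_gib={"am-mysql": 1, "other-app": 99},
    )
    print(report)
```

In this made-up case the MySQL volume has used only 1 GiB of its 20 GiB allocation, yet it cannot write anything because the partition itself has no free space left.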

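This small incident is a reminder that we must understand the surrounding environment and not troubleshoot based on assumptions. The problem could have been solved in two or three minutes, yet it took more than half an hour.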
