惊魂未定,回忆mysql挂掉运维历程。

5分钟前问题已经修复了,但是这个问题是很难重现的。趁着自己还有印象赶紧写下,希望能对后人有所帮助。(环境centos)

问题重现

2018年2月1号晚上11点左右,思考好了新架构之后开始动手设计数据库了。然后突然发现mysql连接不上。嗯,mysql挂掉了。

问题分析

首先还是先看看能不能手动启动服务。

sudo service mysqld restart

结果是(注意"Active: failed (Result: start-limit)“)

MySQL Server Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled) 
Active: failed (Result: start-limit) since Sun 2015-12-06 03:14:54 GMT;
4min 7s ago Process: 6992 ExecStart=/usr/sbin/mysqld
--daemonize $MYSQLD_OPTS (code=exited, status=1/FAILURE) Process: 6971 ExecStartPre=/usr/bin/mysqld_pre_systemd (code=exited,
status=0/SUCCESS) 
Dec 06 03:14:54 localhost.localdomain systemd[1]: mysqld.service: control process exited, code=exited status=1 
Dec 06 03:14:54 localhost.localdomain systemd[1]: Failed to start MySQL Server. 
Dec 06 03:14:54 localhost.localdomain systemd[1]: Unit  mysqld.service entered failed state. 
Dec 06 03:14:54 localhost.localdomain systemd[1]: mysqld.service holdoff time over, scheduling restart. 
Dec 06 03:14:54 localhost.localdomain systemd[1]: Stopping MySQL Server... 
Dec 06 03:14:54 localhost.localdomain systemd[1]: Starting MySQL Server...   
Dec 06 03:14:54 localhost.localdomain systemd[1]: mysqld.service start request repeated too quickly, refusing to start. 
Dec 06 03:14:54 localhost.localdomain systemd[1]: Failed to start MySQL Server. 
Dec 06 03:14:54 localhost.localdomain systemd[1]: Unit mysqld.service entered failed state. The journal command reads Failed to start MySQL Server

之前没有太多的mysql运维经验,所以看到这个错误我是很奇怪的。首先我不知道mysqld有自动重启的这个配置(也就是mysql服务如果挂掉,会自动重启,但是如果超过一定的次数就会有这个错误-- start-limit)

系统提示的错误是十分有限的。我打算check一下错误日志,先看看日志放哪里。

vim /etc/my.cnf
# 结果如下。。。


# result
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock

# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

lower_case_table_names=1
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

max_allowed_packet=100M
interactive_timeout=28800000
wait_timeout=28800000

嗯。。看到了。"/var/log/mysqld.log",打开mysqld.log,检查一下日志。

vim /var/log/mysqld.log

惊魂的一幕出现了。

2018-02-01T10:20:32.366585Z 189 [Note] Access denied for user 'root'@'140.205.225.192' (using password: NO)
2018-02-01T10:20:32.501455Z 191 [Note] Access denied for user 'root'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:32.630537Z 192 [Note] Access denied for user 'root'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:32.766954Z 193 [Note] Access denied for user 'root'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:32.898277Z 194 [Note] Access denied for user 'root'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:33.032151Z 195 [Note] Access denied for user 'root'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:33.164272Z 196 [Note] Access denied for user 'root'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:33.286177Z 197 [Note] Access denied for user 'admin'@'140.205.225.192' (using password: NO)
2018-02-01T10:20:33.412762Z 198 [Note] Access denied for user 'admin'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:33.546427Z 199 [Note] Access denied for user 'admin'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:33.674729Z 200 [Note] Access denied for user 'admin'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:33.802997Z 201 [Note] Access denied for user 'admin'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:33.932708Z 202 [Note] Access denied for user 'admin'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:34.064128Z 203 [Note] Access denied for user 'admin'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:34.191768Z 204 [Note] Access denied for user 'test'@'140.205.225.192' (using password: NO)
2018-02-01T10:20:34.317222Z 205 [Note] Access denied for user 'test'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:34.450647Z 206 [Note] Access denied for user 'test'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:34.584794Z 207 [Note] Access denied for user 'test'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:34.712521Z 208 [Note] Access denied for user 'test'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:34.843531Z 209 [Note] Access denied for user 'test'@'140.205.225.192' (using password: YES)
2018-02-01T10:20:34.974842Z 210 [Note] Access denied for user 'test'@'140.205.225.192' (using password: YES)
2018-02-01T12:31:02.081525Z 211 [Warning] IP address '58.62.52.180' could not be resolved: Name or service not known
2018-02-01T13:07:58.557754Z 0 [Note] Giving 6 client threads a chance to die gracefully
2018-02-01T13:07:58.644307Z 0 [Note] Shutting down slave threads
2018-02-01T13:08:00.744418Z 0 [Note] Forcefully disconnecting 5 remaining clients
2018-02-01T13:08:00.744494Z 0 [Warning] /usr/sbin/mysqld: Forcing close of thread 104  user: 'root'

2018-02-01T13:08:00.744555Z 0 [Warning] /usr/sbin/mysqld: Forcing close of thread 211  user: 'root'

震惊~

有爆破经验的我意识到我的测试服务器的mysql被爆破了,但是显然没有成功。我很后悔没有对mysql端口进行ip控制(已修复)。但是日志,没有半点异常退出的记录。此时我陷入了沉思,一边思考,一边Google。

问题分析

被攻击的事情先放一放。还是得要先恢复服务。因为日志不显示出来,我怀疑是mysql的日志服务还没启动,就已经挂掉了。没有日志提示,我十分难定位。这个时候只能靠多年自己摸索得出来的经验了。

“一个大型的程序,往往日志服务是最先优先级的。但是如果连日志服务都挂掉的话,可能就是分配内存,空间。或者端口有问题了。端口问题已经被排除。“

当然了,上述只是一个猜想。通过sudo service mysqld status,我看到了两个mysqlserver的进程服务。

  Process: 12686 ExecStart=/usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid $MYSQLD_OPTS (code=exited, status=0/SUCCESS)
  Process: 12669 ExecStartPre=/usr/bin/mysqld_pre_systemd (code=exited, status=0/SUCCESS)

后来自己手动分别执行服务,惊奇的发现,有日志了。

2018-02-02T03:07:56.226901Z 0 [ERROR] InnoDB: Write to file ./ibtmp1failed at offset 9437184, 1048576 bytes should have been written, only 221184 were written. Operating system error number 28. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.
2018-02-02T03:07:56.226938Z 0 [ERROR] InnoDB: Error number 28 means 'No space left on device'
2018-02-02T03:07:56.226952Z 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2018-02-02T03:07:56.226967Z 0 [ERROR] InnoDB: Could not set the file size of './ibtmp1'. Probably out of disk space
2018-02-02T03:07:56.226980Z 0 [ERROR] InnoDB: Unable to create the shared innodb_temporary
2018-02-02T03:07:56.226994Z 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2018-02-02T03:07:56.829843Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-02-02T03:07:56.829883Z 0 [ERROR] Plugin 'InnoDB' init function returned error.
2018-02-02T03:07:56.829897Z 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2018-02-02T03:07:56.829912Z 0 [ERROR] Failed to initialize plugins.
2018-02-02T03:07:56.829922Z 0 [ERROR] Aborting

原来是空间不足,mysql启动后又关闭了。然后又自动重启,陷入了死循环。根本原因是空间不足。

不过奇怪的是,40GB的硬盘,怎么可能硬盘不足呢?Check了一下,

[root@iZ2ze0pmkgrozc32e9vq3iZ ~]# df
Filesystem         1K-blocks    Used     Available Use% Mounted on
/dev/xvda1          41152832 4140124      34899224  100% /
devtmpfs              497684       0        497684   0% /dev
tmpfs                 507292       0        507292   0% /dev/shm
tmpfs                 507292     420        506872   1% /run
tmpfs                 507292       0        507292   0% /sys/fs/cgroup
cloudfs        4398046511104       0 4398046511104   0% /root/CouldDisk
tmpfs                 101460       0        101460   0% /run/user/0

发现根目录的使用是100%,已经没有空间给mysql的Innodb分配空间了

问题解决

问题的原因是我编写的下载器的一个bug,上传后没有把文件删除导致的。然后我把那些大型文件删除了就解决了。意外的震惊是,发现mysql的爆破攻击,而且攻击者用了国内的一个ip代理群。。。

你可能感兴趣的:(惊魂未定,回忆mysql挂掉运维历程。)