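System: SMP Wed Dec 19 10:46:58 EST 2018 x86_64 x86_64 x86_64 GNU/Linux
Hosts: KVM virtual machines, 3 nodes
The system raised an alert: on node-2, usage of the / filesystem had reached 98% and the system was starting to misbehave. I switched traffic over to the standby and SSHed into node-2 to take a look: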
[root@rmeda1-2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 9.8G 9.6G 240M 98% /
devtmpfs 15G 0 15G 0% /dev
tmpfs 15G 0 15G 0% /dev/shm
tmpfs 15G 129M 15G 1% /run
tmpfs 15G 0 15G 0% /sys/fs/cgroup
/dev/vdb2 57G 2.3G 55G 5% /var/cassandra/data
/dev/vdb1 9.4G 44M 9.3G 1% /var/cassandra/commitlog
/dev/vda1 997M 252M 746M 26% /boot
/dev/mapper/volgroup00-lv_home 15G 1.8G 13G 12% /home
/dev/mapper/volgroup00-lv_varlog 9.8G 3.0G 6.8G 31% /var/log
/dev/mapper/volgroup00-lv_tmp 2.0G 35M 2.0G 2% /tmp
tmpfs 3.0G 0 3.0G 0% /run/user/1001
tmpfs 3.0G 0 3.0G 0% /run/user/0
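As the output shows, this was not a false alarm: the root filesystem really was at 98%.
2. Change to the root directory and check the size of each directory under it
[root@rmeda1-2 ~]# cd /
[root@rmeda1-2 /]# du -ah --max-depth=1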
219M ./boot
0 ./dev
1.7G ./home
du: cannot access ‘./proc/20709/task/20709/fd/4’: No such file or directory
du: cannot access ‘./proc/20709/task/20709/fdinfo/4’: No such file or directory
du: cannot access ‘./proc/20709/fd/3’: No such file or directory
du: cannot access ‘./proc/20709/fdinfo/3’: No such file or directory
0 ./proc
129M ./run
0 ./sys
1.9M ./tmp
5.4G ./var
39M ./etc
44M ./root
3.7G ./usr
0 ./bin
0 ./sbin
0 ./lib
0 ./lib64
0 ./media
0 ./mnt
878M ./opt
0 ./srv
0 ./.autorelabel
13G .
/dev/vdb2 57G 2.3G 55G 5% /var/cassandra/data
/dev/vdb1 9.4G 44M 9.3G 1% /var/cassandra/commitlog
/dev/mapper/volgroup00-lv_varlog 9.8G 3.0G 6.8G 31% /var/log
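The three filesystems above are mounted separately under /var, so their usage does not count against /.
So where did the mysterious 4.7G go?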
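The comparison above was done by eye. A minimal sketch of the same check follows, assuming GNU (or busybox) df and du on Linux; the function names are made up for illustration. The -x option keeps du on a single filesystem, so separately mounted directories and /proc are not mixed into the total.

```shell
#!/bin/sh
# Sketch: compare what df reports as used with what du can actually see
# on one filesystem. Function names are illustrative, not from the post.

# Used space in KB according to the filesystem itself.
df_used_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# Space in KB reachable by walking the tree. -x stops du at filesystem
# boundaries, so other mounts (and /proc, /sys) are not counted.
du_visible_kb() {
    du -xsk "$1" 2>/dev/null | awk '{ print $1 }'
}

# The part of the used space that du cannot account for.
unexplained_kb() {
    echo $(( $(df_used_kb "$1") - $(du_visible_kb "$1") ))
}
```

Running `unexplained_kb /` as root gives the gap in KB. A large positive value suggests space held by something du cannot reach, such as deleted files that are still open or files hidden underneath a mount point.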
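After searching Baidu and Google, I came up with the following guesses:
1. Deleted files may still be held open by a process:
[root@/]# lsof |grep delete
2. The inodes may be exhausted:
[root@rmeda1-2 ~]# df -i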
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/vda2 5120000 113077 5006923 3% /
devtmpfs 3832094 454 3831640 1% /dev
tmpfs 3839395 1 3839394 1% /dev/shm
tmpfs 3839395 764 3838631 1% /run
tmpfs 3839395 16 3839379 1% /sys/fs/cgroup
/dev/vdb2 29719552 6406 29713146 1% /var/cassandra/data
/dev/vdb1 4882432 34 4882398 1% /var/cassandra/commitlog
/dev/vda1 512000 332 511668 1% /boot
/dev/mapper/volgroup00-lv_varlog 5120000 798 5119202 1% /var/log
/dev/mapper/volgroup00-lv_tmp 1024000 122 1023878 1% /tmp
/dev/mapper/volgroup00-lv_home 7680000 239 7679761 1% /home
tmpfs 3839395 1 3839394 1% /run/user/1001
tmpfs 3839395 1 3839394 1% /run/user/0
3. The mounts may be misconfigured (I checked the mount points and would stake my head on them being fine)
[root@rmeda1-2 data]# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Sun Mar 3 18:26:42 2019
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=ad426056-7b1e-406e-9bb6-cdffcce1c2db / xfs defaults 0 0
UUID=469e1b49-396c-4154-966d-d444bd805046 /boot xfs defaults 0 0
/dev/mapper/volgroup00-lv_home /home xfs nodev 0 0
/dev/mapper/volgroup00-lv_tmp /tmp xfs nodev,nosuid 0 0
/dev/mapper/volgroup00-lv_varlog /var/log xfs defaults 0 0
UUID=6a9980e7-aaf5-4c1f-bb2e-0f90f8e53e24 swap swap defaults 0 0
/tmp /var/tmp none bind 0 0
/dev/vdb1 /var/cassandra/commitlog xfs defaults,nofail,comment=cloudconfig 0 0
/dev/vdb2 /var/cassandra/data xfs defaults,nofail,comment=cloudconfig 0 0
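My situation was exactly case b: the lsof command was not available, so rebooting was the only option. But after several reboots, whether I looked at it or not, the 9.6G was still sitting there...
Clearly I still had plenty of inodes, so all of the guesses above were ruled out.
If there are guesses 4, 5, 6… I would welcome pointers from anyone more experienced.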
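For the case where lsof is not installed, the same information can usually be read straight from /proc on Linux: every open file descriptor is a symlink under /proc/PID/fd, and the link target of a file that has been deleted ends in " (deleted)". A minimal sketch, with a function name invented for illustration (run it as root to see other users' processes):

```shell
#!/bin/sh
# Sketch: list deleted-but-still-open files without lsof.
# Relies on the Linux /proc layout; the function name is illustrative.
list_deleted_open_files() {
    for fd in /proc/[0-9]*/fd/*; do
        # Processes and descriptors can vanish while we iterate; skip those.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}
```

Each output line names the process (the PID is in the path) that still holds the file, which is the process to restart if the space needs to be released. Space held this way does not survive a reboot, which fits the observation that rebooting changed nothing here.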
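I asked Leo, our resident expert, to take a look.
He was just as puzzled at first, but he has seen a lot.
He decided to take
/dev/vdb2 57G 2.3G 55G 5% /var/cassandra/data
/dev/vdb1 9.4G 44M 9.3G 1% /var/cassandra/commitlog
/dev/mapper/volgroup00-lv_varlog 9.8G 3.0G 6.8G 31% /var/log
and umount these directories. Why these? Because a comparison with the other nodes showed that their usage differed the most. In particular,
57G 2.3G 55G 5% /var/cassandra/data was using only 2.3G, while on the other nodes it was already above 30G, so clearly something had gone wrong with Cassandra's data consistency.
So Leo unmounted /dev/vdb2 (57G 2.3G 55G 5% /var/cassandra/data):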
[root@rmeda1-2 cassandra]# df -h ./commitlog/
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 9.8G 9.6G 237M 98% /
[root@rmeda1-2 cassandra]# df -h ./data/
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 9.8G 9.6G 237M 98% /
[root@rmeda1-2 cassandra]# df -h ./saved_caches/
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 9.8G 9.6G 237M 98% /
[root@rmeda1-2 /]# umount /var/cassandra/data/
[root@rmeda1-2 /]# umount /var/cassandra/commitlog/
[root@rmeda1-2 cassandra]# du -sh ./commitlog/
2.2M ./commitlog/
[root@rmeda1-2 cassandra]# du -sh ./data/
4.7G ./data/
And here is the truth:
[root@rmeda1-2 cassandra]# df -h ./data/
Filesystem Size Used Avail Use% Mounted on
/dev/vda2 9.8G 9.6G 237M 98% /
[root@rmeda1-2 cassandra]# du -sh ./data/
4.7G ./data/
Yep, this is the real culprit.
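Unmounting worked here, but it means stopping whatever uses the mount. A commonly used alternative is to bind-mount / onto a scratch directory: a plain (non-recursive) bind mount does not carry the submounts along, so every mount point shows the directory contents that live underneath it on the root filesystem. The sketch below assumes root privileges and util-linux mount; /mnt/rootonly and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: look underneath mount points without unmounting anything.
# /mnt/rootonly is a hypothetical scratch directory; run as root.

# Print mount points of device-backed filesystems other than "/".
# Input is in /proc/mounts format: device mountpoint fstype options ...
list_mountpoints() {
    awk '$2 != "/" && $1 ~ /^\/dev\// { print $2 }' "${1:-/proc/mounts}"
}

# Bind-mount / (without its submounts) and report how much data sits
# hidden in each directory that is currently covered by a mount.
show_hidden_usage() {
    scratch=${1:-/mnt/rootonly}
    mkdir -p "$scratch" && mount --bind / "$scratch" || return 1
    list_mountpoints | while read -r mp; do
        du -sh "$scratch$mp" 2>/dev/null
    done
    umount "$scratch"
}
```

On a node like this one, `show_hidden_usage` would be expected to report the 4.7G under /mnt/rootonly/var/cassandra/data while the real data disk stays mounted.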
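To summarize: before a new disk was mounted on it, /var/cassandra/data/ lived on the same partition as the root directory, so the data written there shared the root filesystem's capacity. When the new disk was later mounted on top of it, the data written earlier (the 4.7G culprit) was never deleted, and from then on all new data went to the new disk. df could see the old data's shadow (the used space) but not where it was; du could see where the new files were being written but not the shadow. So 4.7G appeared to be missing and yet was still there. After deleting it and rebooting to remount, the case was closed... I went through many articles on Baidu without finding a similar problem, so I hope this gives others a lead when they run into the same thing.
------Thanks to Leo for the help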
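The post does not say how the old data ended up on the root partition. One possibility, and it is only an assumption, is that the service wrote to the directory at a time when the data disk was not mounted; the nofail option in /etc/fstab lets boot continue in that situation. A small guard like the following sketch, using mountpoint from util-linux, would catch that case. The function name and the wrapper command in the comment are hypothetical.

```shell
#!/bin/sh
# Sketch: refuse to proceed unless a directory is really a mount point.
# Uses mountpoint(1) from util-linux; the function name is illustrative.
require_mounted() {
    if mountpoint -q "$1"; then
        return 0
    fi
    echo "refusing to continue: $1 is not a mount point" >&2
    return 1
}

# Hypothetical use before starting the service:
# require_mounted /var/cassandra/data && systemctl start cassandra
```

With a check like this in place, data cannot silently accumulate in the directory underneath the mount point.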