Problem on site:
- Operating system: CentOS Linux release 7.3.1611 (Core)
- System memory: 16 GB
[root@clxcld-gateway-prod ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             15           0           0           0          13          14
Swap:             3           0           3
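The system runs two Java processes in total, one with Xmx 3G and the other with Xmx 4G. Even so, used memory is very low while almost all memory is taken up by cache, and restarting the Java processes makes no difference.
Output of lsof -i: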
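To cross-check the buff/cache figure reported by free, the underlying counters can be read from /proc/meminfo. The sketch below is only an approximation: it assumes buff/cache is roughly Buffers + Cached + SReclaimable, and the exact fields free uses differ between procps versions. The function name buff_cache_mib is made up for this example.

```shell
# Sum the /proc/meminfo fields that roughly make up free's "buff/cache" column.
# Assumption: Buffers + Cached + SReclaimable (varies with the procps version).
# Reads meminfo-formatted text on stdin and prints the total in MiB.
buff_cache_mib() {
    awk '
        $1 == "Buffers:" || $1 == "Cached:" || $1 == "SReclaimable:" { kib += $2 }
        END { printf "%d\n", kib / 1024 }
    '
}

# Usage on a live system:
#   buff_cache_mib < /proc/meminfo
```

Reading from stdin keeps the function easy to try against a saved copy of /proc/meminfo from the affected host.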
[root@clxcld-gateway-prod log]# lsof -i
COMMAND     PID USER      FD TYPE    DEVICE SIZE/OFF NODE NAME
systemd       1 root     43u IPv6     13662      0t0 TCP  *:sunrpc (LISTEN)
systemd       1 root     44u IPv4     13663      0t0 TCP  *:sunrpc (LISTEN)
chronyd     667 chrony    1u IPv4     14114      0t0 UDP  localhost:323
chronyd     667 chrony    2u IPv6     14115      0t0 UDP  localhost:323
avahi-dae   712 avahi    12u IPv4     15942      0t0 UDP  *:mdns
avahi-dae   712 avahi    13u IPv4     15943      0t0 UDP  *:56794
xinetd     1066 root      5u IPv6     19825      0t0 TCP  *:nrpe (LISTEN)
xinetd     1066 root      6u IPv6     19826      0t0 TCP  *:nsca (LISTEN)
sshd       1084 root      3u IPv4     18988      0t0 TCP  *:mxxrlogin (LISTEN)
sshd       1084 root      4u IPv6     18997      0t0 TCP  *:mxxrlogin (LISTEN)
rpc.statd  1133 rpcuser   5u IPv4     20812      0t0 UDP  localhost:885
rpc.statd  1133 rpcuser   8u IPv4     21033      0t0 UDP  *:41165
rpc.statd  1133 rpcuser   9u IPv4     21037      0t0 TCP  *:59161 (LISTEN)
rpc.statd  1133 rpcuser  10u IPv6     21041      0t0 UDP  *:32879
rpc.statd  1133 rpcuser  11u IPv6     21045      0t0 TCP  *:39979 (LISTEN)
rpcbind    1138 rpc       4u IPv6     13662      0t0 TCP  *:sunrpc (LISTEN)
rpcbind    1138 rpc       5u IPv4     13663      0t0 TCP  *:sunrpc (LISTEN)
rpcbind    1138 rpc       8u IPv4     20895      0t0 UDP  *:sunrpc
rpcbind    1138 rpc       9u IPv4     20896      0t0 UDP  *:iclcnet_svinfo
rpcbind    1138 rpc      10u IPv6     20897      0t0 UDP  *:sunrpc
rpcbind    1138 rpc      11u IPv6     20898      0t0 UDP  *:iclcnet_svinfo
master     1250 root     13u IPv4     20394      0t0 TCP  localhost:smtp (LISTEN)
master     1250 root     14u IPv6     20395      0t0 TCP  localhost:smtp (LISTEN)
sshd      23166 root      3u IPv4 175197624      0t0 TCP  clxcld-gateway-prod:mxxrlogin->172.23.46.21:45974 (ESTABLISHED)
java      24608 root    138u IPv6 175242571      0t0 TCP  *:40673 (LISTEN)
java      24608 root    140u IPv6 175242124      0t0 TCP  clxcld-gateway-prod:47240->10.13.248.15:mysql (ESTABLISHED)
java      24608 root    141u IPv6 175242127      0t0 TCP  clxcld-gateway-prod:47246->10.13.248.15:mysql (ESTABLISHED)
java      24608 root    144u IPv6 175242130      0t0 TCP  *:pcsync-https (LISTEN)
java      24608 root    149u IPv6 175252852      0t0 TCP  clxcld-gateway-prod:51920->10.13.248.15:mysql (ESTABLISHED)
java      24610 root    108u IPv6 175242117      0t0 TCP  *:41423 (LISTEN)
java      24610 root    112u IPv6 175242684      0t0 TCP  *:warehouse (LISTEN)
java      24610 root    113u IPv6 175242691      0t0 TCP  clxcld-gateway-prod:47248->10.13.248.15:mysql (ESTABLISHED)
java      24610 root    114u IPv6 175243437      0t0 TCP  clxcld-gateway-prod:47258->10.13.248.15:mysql (ESTABLISHED)
java      24610 root    118u IPv6 175242785      0t0 TCP  clxcld-gateway-prod:47260->10.13.248.15:mysql (ESTABLISHED)
ssh       24748 root      3u IPv4 175243834      0t0 TCP  clxcld-gateway-prod:33706->clxcld-gateway-prod:mxxrlogin (ESTABLISHED)
sshd      24750 root      3u IPv4 175242791      0t0 TCP  clxcld-gateway-prod:mxxrlogin->clxcld-gateway-prod:33706 (ESTABLISHED)
python    24761 root      4u IPv4 175242899      0t0 TCP  clxcld-gateway-prod:46418->analytics-prod.cpkzgarrnsp3.us-west-2.redshift.amazonaws.com:5439 (ESTABLISHED)
zabbix_ag 25594 zabbix    4u IPv4 175270400      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25594 zabbix    5u IPv6 175270401      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25595 zabbix    4u IPv4 175270400      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25595 zabbix    5u IPv6 175270401      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25596 zabbix    4u IPv4 175270400      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25596 zabbix    5u IPv6 175270401      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25597 zabbix    4u IPv4 175270400      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25597 zabbix    5u IPv6 175270401      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25598 zabbix    4u IPv4 175270400      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25598 zabbix    5u IPv6 175270401      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25599 zabbix    4u IPv4 175270400      0t0 TCP  *:zabbix-agent (LISTEN)
zabbix_ag 25599 zabbix    5u IPv6 175270401      0t0 TCP  *:zabbix-agent (LISTEN)
Troubleshooting:
Following https://www.cnblogs.com/zh94/p/11922714.html , download the hcache tool:
GitHub: https://github.com/silenceshell/hcache
Direct download:
wget https://silenceshell-1255345740.cos.ap-shanghai.myqcloud.com/hcache
chmod 755 hcache
mv hcache /usr/local/bin
Use hcache --top 10 to list the largest cache consumers:
hcache --top 10
+-------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+-----------+---------+
| Name | Size (bytes) | Pages | Cached | Percent |
|-------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+-----------+---------|
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006ab097-000597ad5568adc2.journal | 58720256 | 14336 | 12246 | 085.421 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-0000000000734380-000599df09bbc4a3.journal | 58720256 | 14336 | 12245 | 085.414 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006d23a4-0005984dfc97fc32.journal | 58720256 | 14336 | 12242 | 085.393 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-0000000000747d1e-00059a2f7faf1d70.journal | 58720256 | 14336 | 12242 | 085.393 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-000000000075b6bb-00059a801bbaf6ca.journal | 58720256 | 14336 | 12242 | 085.393 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-0000000000697714-0005975cf155fc13.journal | 58720256 | 14336 | 12241 | 085.386 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-000000000070d02d-0005993e39fdff2a.journal | 58720256 | 14336 | 12239 | 085.372 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006bea28-000597fda900f06e.journal | 58720256 | 14336 | 12239 | 085.372 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006e5d54-0005989e300b3aa6.journal | 58720256 | 14336 | 12239 | 085.372 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000007209f2-0005998ebb1d505b.journal | 58720256 | 14336 | 12239 | 085.372 |
+-------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+-----------+---------+
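The columns in this table are related by simple arithmetic, which is handy when sanity-checking the output: Pages is the file size divided by the page size (rounded up), and Percent is Cached divided by Pages. The sketch below reproduces the first row, assuming 4 KiB pages (the usual value on x86_64). The function names are made up for this example and are not part of hcache.

```shell
PAGE_SIZE=4096  # assumption: 4 KiB pages, as is usual on x86_64

# Number of pages needed to hold a file of the given size in bytes (rounded up).
pages_for_size() {
    echo $(( ($1 + PAGE_SIZE - 1) / PAGE_SIZE ))
}

# Percentage of a file's pages that are resident in the page cache.
# $1 = total pages, $2 = cached pages
cached_percent() {
    awk -v pages="$1" -v cached="$2" 'BEGIN { printf "%.3f\n", 100 * cached / pages }'
}

pages_for_size 58720256      # prints 14336
cached_percent 14336 12246   # prints 85.421
```

hcache pads the percentage with a leading zero (085.421); the value itself is the same.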
This shows that the systemd journal files are taking up a large amount of cache:
[root@clxcld-gateway-prod d14e699e8bbc43228324a169b0f855fe]# ls -lath *
-rw-r-----+ 1 root systemd-journal 8.0M Jan  2 06:43 system.journal
-rw-r-----+ 1 root systemd-journal  56M Jan  2 03:26 system@6cfacedb39904c2499acffe16d0fd88a-000000000076f05c-00059ad09a381e0a.journal
-rw-r-----+ 1 root systemd-journal  56M Dec 29 05:00 system@6cfacedb39904c2499acffe16d0fd88a-000000000075b6bb-00059a801bbaf6ca.journal
-rw-r-----+ 1 root systemd-journal  56M Dec 25 04:56 system@6cfacedb39904c2499acffe16d0fd88a-0000000000747d1e-00059a2f7faf1d70.journal
-rw-r-----+ 1 root systemd-journal  56M Dec 21 04:47 system@6cfacedb39904c2499acffe16d0fd88a-0000000000734380-000599df09bbc4a3.journal
-rw-r-----+ 1 root systemd-journal  56M Dec 17 04:48 system@6cfacedb39904c2499acffe16d0fd88a-00000000007209f2-0005998ebb1d505b.journal
-rw-r-----+ 1 root systemd-journal  56M Dec 13 04:59 system@6cfacedb39904c2499acffe16d0fd88a-000000000070d02d-0005993e39fdff2a.journal
-rw-r-----+ 1 root systemd-journal  56M Dec  9 04:57 system@6cfacedb39904c2499acffe16d0fd88a-00000000006f9690-000598edd5ef0719.journal
-rw-r-----+ 1 root systemd-journal  56M Dec  5 05:02 system@6cfacedb39904c2499acffe16d0fd88a-00000000006e5d54-0005989e300b3aa6.journal
-rw-r-----+ 1 root systemd-journal  56M Dec  1 06:01 system@6cfacedb39904c2499acffe16d0fd88a-00000000006d23a4-0005984dfc97fc32.journal
-rw-r-----+ 1 root systemd-journal  56M Nov 27 06:20 system@6cfacedb39904c2499acffe16d0fd88a-00000000006bea28-000597fda900f06e.journal
-rw-r-----+ 1 root systemd-journal  56M Nov 23 06:30 system@6cfacedb39904c2499acffe16d0fd88a-00000000006ab097-000597ad5568adc2.journal
-rw-r-----+ 1 root systemd-journal  56M Nov 19 06:40 system@6cfacedb39904c2499acffe16d0fd88a-0000000000697714-0005975cf155fc13.journal
-rw-r-----+ 1 root systemd-journal  56M Nov 15 06:45 system@6cfacedb39904c2499acffe16d0fd88a-0000000000683d77-0005970c8a737b7b.journal
-rw-r-----+ 1 root systemd-journal  56M Nov 11 06:50 system@6cfacedb39904c2499acffe16d0fd88a-00000000006703c5-000596bbef4870ae.journal
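To put a number on the listing, the journal file sizes can be added up; here the 14 archived files of 56M each plus the 8.0M active file come to roughly 792M. A minimal sketch, assuming GNU find (for -printf); the function name journal_bytes is made up for this example. journalctl --disk-usage reports a similar figure directly.

```shell
# Print the combined size in bytes of all *.journal files under a directory.
# Assumes GNU find, which provides -printf.
journal_bytes() {
    find "$1" -type f -name '*.journal' -printf '%s\n' |
        awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Usage on the affected host:
#   journal_bytes /run/log/journal
```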
Following https://blog.steamedfish.org/post/systemd-journald/ , clean up the memory held by the journal:
journalctl --vacuum-time=10d
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006703c5-000596bbef4870ae.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-0000000000683d77-0005970c8a737b7b.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-0000000000697714-0005975cf155fc13.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006ab097-000597ad5568adc2.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006bea28-000597fda900f06e.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006d23a4-0005984dfc97fc32.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006e5d54-0005989e300b3aa6.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000006f9690-000598edd5ef0719.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-000000000070d02d-0005993e39fdff2a.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-00000000007209f2-0005998ebb1d505b.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-0000000000734380-000599df09bbc4a3.journal (56.0M).
Deleted archived journal /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-0000000000747d1e-00059a2f7faf1d70.journal (56.0M).
Vacuuming done, freed 672.0M of archived journals on disk.
[root@clxcld-gateway-prod d14e699e8bbc43228324a169b0f855fe]# ls
system@6cfacedb39904c2499acffe16d0fd88a-000000000075b6bb-00059a801bbaf6ca.journal  system.journal
system@6cfacedb39904c2499acffe16d0fd88a-000000000076f05c-00059ad09a381e0a.journal
Running hcache --top again shows that far fewer journal files remain in the cache:
[root@clxcld-gateway-prod d14e699e8bbc43228324a169b0f855fe]# hcache --top 10
+-------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+-----------+---------+
| Name | Size (bytes) | Pages | Cached | Percent |
|-------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+-----------+---------|
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-000000000075b6bb-00059a801bbaf6ca.journal | 58720256 | 14336 | 12242 | 085.393 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system@6cfacedb39904c2499acffe16d0fd88a-000000000076f05c-00059ad09a381e0a.journal | 58720256 | 14336 | 12226 | 085.282 |
| /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre/lib/rt.jar | 72964441 | 17814 | 10463 | 058.735 |
| /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre/lib/amd64/server/libjvm.so | 13942784 | 3404 | 3216 | 094.477 |
| /run/log/journal/d14e699e8bbc43228324a169b0f855fe/system.journal | 8388608 | 2048 | 1311 | 064.014 |
| /usr/lib64/dri/swrast_dri.so | 9597216 | 2344 | 1143 | 048.763 |
| /usr/lib64/libmozjs-24.so | 5987032 | 1462 | 1076 | 073.598 |
| /usr/lib64/libgtk-3.so.0.1400.13 | 7116800 | 1738 | 1024 | 058.918 |
| /usr/lib/locale/locale-archive | 106070960 | 25897 | 1024 | 003.954 |
| /usr/lib64/gnome-shell/libgnome-shell.so | 2671456 | 653 | 653 | 100.000 |
+-------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+-----------+---------+
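The journal's default storage mode (Storage= in journald.conf) is auto: if the directory /var/log/journal exists, logs are persisted to disk; otherwise they are kept in memory under /run/log/journal. So create the /var/log/journal directory and restart the journald service:
mkdir -p /var/log/journal
systemctl restart systemd-journald.service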
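The directory check behind Storage=auto can be sketched in a few lines. This is an illustration of the rule described above, not journald's actual code. The function name journal_dir_for_auto and its root-prefix parameter are made up for this example, so that the logic can be tried against a scratch directory.

```shell
# Print the directory journald would write to under Storage=auto.
# $1 is an optional root prefix (hypothetical parameter; leave empty for the
# real filesystem).
journal_dir_for_auto() {
    jd_root="${1:-}"
    if [ -d "$jd_root/var/log/journal" ]; then
        echo "$jd_root/var/log/journal"   # persistent: stored on disk
    else
        echo "$jd_root/run/log/journal"   # volatile: held in memory
    fi
}

# Usage on a live system:
#   journal_dir_for_auto
```

On the host in this post, the result switches from /run/log/journal to /var/log/journal once the directory has been created, which matches the path shown for system.journal in the next hcache output.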
[root@clxcld-gateway-prod journal]# hcache --top 10
+---------------------------------------------------------------------------------------------+----------------+------------+-----------+---------+
| Name | Size (bytes) | Pages | Cached | Percent |
|---------------------------------------------------------------------------------------------+----------------+------------+-----------+---------|
| /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre/lib/rt.jar | 72964441 | 17814 | 10463 | 058.735 |
| /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre/lib/amd64/server/libjvm.so | 13942784 | 3404 | 3216 | 094.477 |
| /var/log/journal/d14e699e8bbc43228324a169b0f855fe/system.journal | 8388608 | 2048 | 2048 | 100.000 |
| /usr/lib64/dri/swrast_dri.so | 9597216 | 2344 | 1143 | 048.763 |
| /usr/lib64/libmozjs-24.so | 5987032 | 1462 | 1076 | 073.598 |
| /usr/lib64/libgtk-3.so.0.1400.13 | 7116800 | 1738 | 1024 | 058.918 |
| /usr/lib/locale/locale-archive | 106070960 | 25897 | 1024 | 003.954 |
| /usr/lib64/gnome-shell/libgnome-shell.so | 2671456 | 653 | 653 | 100.000 |
| /root/azkaban-3.33.0/azkaban-exec-server-3.33.0/lib/hadoop-common-2.6.1.jar | 3318727 | 811 | 652 | 080.395 |
| /root/azkaban-3.33.0/azkaban-web-server-3.33.0/lib/hadoop-common-2.6.1.jar | 3318727 | 811 | 652 | 080.395 |
+---------------------------------------------------------------------------------------------+----------------+------------+-----------+---------+
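Checking again showed that the system cache was not reclaimed immediately (see https://blog.csdn.net/liuxiao723846/article/details/72628847 ):
[root@clxcld-gateway-prod ~]# echo 1 > /proc/sys/vm/drop_caches
[root@clxcld-gateway-prod ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             15           1          14           0           0          14
Swap:             3           0           3
This forces the cache to be reclaimed: buff/cache drops from 13 GB to 0 and free memory rises to 14 GB.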
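For repeat use, the step can be wrapped in a small helper. Writing 1 to drop_caches releases the page cache, 2 releases dentries and inodes, and 3 releases both. Only clean pages are dropped, so running sync first lets more of the cache be freed. The function name and its optional target-path argument are made up for this sketch; the argument exists only so the helper can be tried against a scratch file.

```shell
# Flush dirty pages to disk, then ask the kernel to drop the clean page cache.
# $1 is an optional target file (hypothetical parameter for dry runs);
# it defaults to the real kernel interface, which requires root.
drop_page_cache() {
    dpc_target="${1:-/proc/sys/vm/drop_caches}"
    sync                      # write dirty pages back so they become droppable
    echo 1 > "$dpc_target"    # 1 = page cache only
}

# Usage as root:
#   drop_page_cache && free -g
```

The kernel reclaims cache on its own when applications need memory, so this is mainly useful for checking how much of buff/cache is actually reclaimable.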
References
- https://blog.csdn.net/liuxiao723846/article/details/72628847 Investigating Linux memory usage: cached
- https://www.jianshu.com/p/8b3fba13fcad systemd guide
- https://www.jianshu.com/p/3320bc84f227 Systemd
- https://www.ibm.com/developerworks/cn/linux/1407_liuming_init3/ Systemd
- http://www.jinbuguo.com/systemd/systemd-journald.service.html systemd-journald.service manual (Chinese)
- https://wiki.archlinux.org/index.php/Systemd/Journal_(%E7%AE%80%E4%BD%93%E4%B8%AD%E6%96%87) systemd/Journal (Simplified Chinese)
- https://github.com/silenceshell/hcache
- https://www.cnblogs.com/zh94/p/11922714.html Finding which processes use the most system buffer/cache on Linux (hcache and lsof commands)