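This morning the load on one machine suddenly started climbing and kept going, passing 300; ps and top both hung without producing any output.
syslog showed no errors. Assuming a runaway process was responsible, I killed every running business process, but the load kept rising.
I then went through the logs one by one and found this error in kernel.log: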
Jun 30 10:16:37 localhost kernel: [20789571.526981] ------------[ cut here ]------------
Jun 30 10:16:37 localhost kernel: [20789571.646113] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
Jun 30 10:16:37 localhost kernel: [20789571.766023] invalid opcode: 0000 [#4] SMP
Jun 30 10:16:37 localhost kernel: [20789571.886232] Modules linked in: dccp_diag dccp udp_diag unix_diag tcp_diag inet_diag xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables dcdbas x86_pkg_temp_thermal coretemp kvm_intel kvm joydev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel mxm_wmi aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd shpchp lpc_ich mei_me mei ipmi_si binfmt_misc lp parport wmi acpi_power_meter mac_hid hid_generic tg3 usbhid ahci ptp hid libahci megaraid_sas pps_core
Jun 30 10:16:37 localhost kernel: [20789572.405783] CPU: 0 PID: 156980 Comm: java Tainted: G D 3.13.0-24-generic #46-Ubuntu
Jun 30 10:16:37 localhost kernel: [20789572.539454] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.1.5 04/11/2016
Jun 30 10:16:37 localhost kernel: [20789572.673237] task: ffff8804363c17f0 ti: ffff880157214000 task.ti: ffff880157214000
Jun 30 10:16:37 localhost kernel: [20789572.807615] RIP: 0010:[
Jun 30 10:16:37 localhost kernel: [20789572.943814] RSP: 0018:ffff880157215d98 EFLAGS: 00010246
Jun 30 10:16:37 localhost kernel: [20789573.080011] RAX: 0000000000000100 RBX: 0000000706005648 RCX: ffff880157215b18
Jun 30 10:16:37 localhost kernel: [20789573.217593] RDX: ffff8804363c17f0 RSI: 0000000000000000 RDI: 80000008a46009e6
Jun 30 10:16:37 localhost kernel: [20789573.355562] RBP: ffff880157215e20 R08: 0000000000000000 R09: 00000000000000a9
Jun 30 10:16:37 localhost kernel: [20789573.493361] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880c8bf3d180
Jun 30 10:16:37 localhost kernel: [20789573.631188] R13: ffff88084f6d5e00 R14: ffff88010126d400 R15: 0000000000000080
Jun 30 10:16:37 localhost kernel: [20789573.769177] FS: 00007f2421699700(0000) GS:ffff88085f200000(0000) knlGS:0000000000000000
Jun 30 10:16:37 localhost kernel: [20789573.908225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 30 10:16:37 localhost kernel: [20789574.045182] CR2: 00007f91da300018 CR3: 0000000100a64000 CR4: 00000000003407f0
Jun 30 10:16:37 localhost kernel: [20789574.180729] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 30 10:16:37 localhost kernel: [20789574.312936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 30 10:16:37 localhost kernel: [20789574.441451] Stack:
Jun 30 10:16:37 localhost kernel: [20789574.566031] ffff880157215e20 ffff880677705780 0000000000000039 00000007ca1800d0
Jun 30 10:16:37 localhost kernel: [20789574.690149] 00000007aaa80000 00007f2464081050 0000000000000005 0000000000000000
Jun 30 10:16:37 localhost kernel: [20789574.813562] 0000000000000000 0000000000000004 ffff8800000000a9 ffffffffffffff03
Jun 30 10:16:37 localhost kernel: [20789574.936459] Call Trace:
Jun 30 10:16:37 localhost kernel: [20789575.057978] [
Jun 30 10:16:37 localhost kernel: [20789575.180307] [
Jun 30 10:16:37 localhost kernel: [20789575.300245] [
Jun 30 10:16:37 localhost kernel: [20789575.416978] [
Jun 30 10:16:37 localhost kernel: [20789575.530483] [
Jun 30 10:16:37 localhost kernel: [20789575.640950] [
Jun 30 10:16:37 localhost kernel: [20789575.749945] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff <0f> 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7
Jun 30 10:16:37 localhost kernel: [20789575.975569] RIP [
Jun 30 10:16:37 localhost kernel: [20789576.086848] RSP
Jun 30 10:16:37 localhost kernel: [20789576.474980] ---[ end trace cb921fcdfc336f01 ]---
kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
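So the Ubuntu 14.04 kernel (3.13.0-24-generic) hit a bug. The business had to keep running, so I rebooted the machine and it returned to normal.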
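Reading logs one at a time works, but searching for the BUG marker directly is quicker. The following is a minimal sketch; find_kernel_bugs is a hypothetical helper name, and the log paths you pass to it depend on how syslog is configured on your machine.

```shell
#!/bin/sh
# Hypothetical helper: print every "kernel BUG at" line found in the
# given log files, prefixed with the file name and line number.
find_kernel_bugs() {
    grep -Hn 'kernel BUG at' "$@"
}
```

Call it with one or more log files as arguments. It returns a non-zero status when nothing matches.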
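As for the root cause, it is probably related to transparent huge pages (THP). Many write-ups of this error recommend disabling THP, after which the problem did not recur: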
# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
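The value in square brackets is the active setting: [always] means transparent huge pages are enabled, and [never] means they are disabled.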
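For scripts or monitoring checks, the bracketed value can be extracted instead of read by eye. This is a minimal sketch; thp_mode is a hypothetical helper name, and the optional path argument exists only so the function can be tried against a copy of the file.

```shell
#!/bin/sh
# Hypothetical helper: print the active THP mode, i.e. the token
# wrapped in square brackets. The path argument is optional and
# defaults to the standard sysfs file.
thp_mode() {
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$file"
}
```

Running thp_mode with no argument on the machine above would print the currently selected mode.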
Now disable transparent huge pages:
# echo "never" | tee /sys/kernel/mm/transparent_hugepage/enabled
never
# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
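Note that a value written to sysfs this way does not survive a reboot, so the write has to be repeated at every boot. A minimal sketch of a function that could be called from a boot script follows; disable_thp is a hypothetical name, and which boot script to call it from (for example /etc/rc.local) is an assumption about your setup.

```shell
#!/bin/sh
# Hypothetical helper: write "never" to the THP "enabled" control file.
# The directory argument is optional and defaults to the standard sysfs
# location; writing there requires root.
disable_thp() {
    base="${1:-/sys/kernel/mm/transparent_hugepage}"
    if [ -w "$base/enabled" ]; then
        echo never > "$base/enabled"
    else
        echo "cannot write $base/enabled" >&2
        return 1
    fi
}
```

The function returns non-zero when the control file is missing or not writable, so a boot script can log the failure instead of silently leaving THP enabled.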