INFO: task java:27465 blocked for more than 120 seconds: not necessarily caused by an oversized cache

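Over the past few days, several environments have become unreachable for a short while right after the midday or afternoon market close, and a little later the process simply disappeared. /var/log/messages showed the following: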

Dec 12 11:35:38 iZ23nn1p4mjZ kernel: INFO: task java:27465 blocked for more than 120 seconds.
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: Not tainted 2.6.32-431.23.3.el6.x86_64 #1
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: java D 0000000000000002 0 27465 27457 0x00000000
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: ffff8801ab8378d8 0000000000000082 ffff8801ab8378a0 ffff8801ab83789c
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: ffff8801ab837a54 ffff88023fc23480 ffff880028396840 0000000000000400
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: ffff88017480f058 ffff8801ab837fd8 000000000000fbc8 ffff88017480f058
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: Call Trace:
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] do_get_write_access+0x29d/0x520 [jbd2]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? wake_bit_function+0x0/0x50
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __ext4_journal_get_write_access+0x38/0x80 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ext4_reserve_inode_write+0x73/0xa0 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? jbd2_journal_start+0xb5/0x100 [jbd2]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ext4_dirty_inode+0x40/0x60 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __mark_inode_dirty+0x3b/0x160
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] file_update_time+0xf2/0x170
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? __sb_start_write+0x80/0x120
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? ext4_da_get_block_prep+0x0/0x3c0 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __block_page_mkwrite+0x3b/0x140
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ext4_page_mkwrite+0x121/0x360 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __do_fault+0xd0/0x530
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] handle_pte_fault+0xf7/0xb00
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? futex_wake+0x10e/0x120
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] handle_mm_fault+0x22a/0x300
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __do_page_fault+0x138/0x480
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? pvclock_clocksource_read+0x58/0xd0
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? kvm_clock_read+0x1c/0x20
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] do_page_fault+0x3e/0xa0
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] page_fault+0x25/0x30
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: INFO: task java:27585 blocked for more than 120 seconds.
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: Not tainted 2.6.32-431.23.3.el6.x86_64 #1
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: java D 0000000000000003 0 27585 1 0x00000000
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: ffff88023808d8d8 0000000000000086 0000000000000000 ffffffff812830b9
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: ffff88023808da54 0000000000000000 ffff88023808d9c8 ffffffff810598e4
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: ffff88023aa1c5f8 ffff88023808dfd8 000000000000fbc8 ffff88023aa1c5f8
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: Call Trace:
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? cpumask_next_and+0x29/0x50
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? find_busiest_group+0x244/0x9e0
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] do_get_write_access+0x29d/0x520 [jbd2]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? wake_bit_function+0x0/0x50
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __ext4_journal_get_write_access+0x38/0x80 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ext4_reserve_inode_write+0x73/0xa0 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? jbd2_journal_start+0xb5/0x100 [jbd2]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ext4_dirty_inode+0x40/0x60 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __mark_inode_dirty+0x3b/0x160
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] file_update_time+0xf2/0x170
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? __sb_start_write+0x80/0x120
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? ext4_da_get_block_prep+0x0/0x3c0 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __block_page_mkwrite+0x3b/0x140
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ext4_page_mkwrite+0x121/0x360 [ext4]
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __do_fault+0xd0/0x530
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] handle_pte_fault+0xf7/0xb00
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? native_smp_send_reschedule+0x49/0x60
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? resched_task+0x68/0x80
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? check_preempt_curr+0x6d/0x90
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? try_to_wake_up+0x24e/0x3e0
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] handle_mm_fault+0x22a/0x300
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? wake_futex+0x40/0x60
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] __do_page_fault+0x138/0x480
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? pvclock_clocksource_read+0x58/0xd0
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? pvclock_clocksource_read+0x58/0xd0
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] ? kvm_clock_read+0x1c/0x20
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] do_page_fault+0x3e/0xa0
Dec 12 11:35:38 iZ23nn1p4mjZ kernel: [] page_fault+0x25/0x30
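
When /var/log/messages contains many of these reports, it can help to reduce each one to the task, its PID, and the first reliable frame of its call trace. The following is a minimal sketch, assuming Python 3 and a syslog line layout like the excerpt above; the function name summarize_hung_tasks is made up for illustration.

```python
import re

# Header line written by the kernel's hung-task detector, as it appears in syslog.
HUNG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# Call-trace frame such as "[] do_get_write_access+0x29d/0x520 [jbd2]".
# Frames marked with "?" are unreliable; they do not match this pattern and are skipped.
FRAME_RE = re.compile(r"kernel:\s+\[[^\]]*\]\s+(?P<sym>[A-Za-z_]\w*)\+0x")


def summarize_hung_tasks(lines):
    """Return one dict per hung-task report: time, comm, pid, first reliable frame."""
    reports = []
    for line in lines:
        header = HUNG_RE.match(line)
        if header:
            reports.append({
                "time": header.group("ts"),
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "blocked_in": None,
            })
            continue
        frame = FRAME_RE.search(line)
        # Only the first reliable frame after a header is recorded.
        if frame and reports and reports[-1]["blocked_in"] is None:
            reports[-1]["blocked_in"] = frame.group("sym")
    return reports
```

Passing it the lines of the log (for example `summarize_hung_tasks(open("/var/log/messages"))`, which normally requires root) gives one entry per blocked task. For the excerpt above, both java tasks report do_get_write_access in jbd2 as the function they are blocked in.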

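In itself, this warning does not damage any data. But, much like an OOM condition, it is serious: the affected processes hang and may even end up killed. The root cause therefore has to be found; otherwise the risk never goes away.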

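This problem (for an explanation, see http://www.ttlsa.com/linux/kernel-blocked-for-more-than-120-seconds/) occurred a long time ago and then stayed away for a long while. Recently these two servers have been running under fairly high load, and it came back. The plain explanation usually given for the "INFO: task java:27465 blocked for more than 120 seconds" warning is that the cache is being flushed to disk too slowly. That is why the problem probably shows up mostly on systems with 64 GB of memory or more and disks of 10K RPM or slower; on a system with 8 GB of memory it should generally be rare. The environment that failed here is a low-spec one, so an oversized cache should not be the cause. In fact, sar -r also shows that cached is not the problem, as follows: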

[Figure 1: sar -r output]
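
To complement the sar -r history with a live check, one could look at how much of memory is page cache and how much is dirty data still waiting to be written back. The sketch below assumes Python 3 and the standard /proc/meminfo fields MemTotal, Cached, Dirty and Writeback; the helper names parse_meminfo and dirty_summary are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: number} (kB for the fields used here)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def dirty_summary(meminfo):
    """Express cached and not-yet-flushed memory as a share of MemTotal."""
    total = meminfo["MemTotal"]
    # Dirty = modified pages not yet queued; Writeback = pages being written right now.
    pending = meminfo.get("Dirty", 0) + meminfo.get("Writeback", 0)
    return {
        "cached_pct": 100.0 * meminfo.get("Cached", 0) / total,
        "dirty_writeback_kb": pending,
        "dirty_writeback_pct": 100.0 * pending / total,
    }
```

On the affected host this would be called as `dirty_summary(parse_meminfo(open("/proc/meminfo").read()))`. A small dirty_writeback_pct during the incident window would support the conclusion that the cache size itself is not the problem.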

Now look at the CPU history:

[Figure 2: CPU usage history]

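Compared with the load during the same period last week, the most likely cause is that I/O in this time window was so heavy that the system could not service any additional I/O, which triggered the warning. This is consistent with the call traces, where both java tasks are blocked in do_get_write_access in jbd2, that is, waiting on the ext4 journal. What remains is to find out which process was doing heavy I/O in this window (pidstat -d) and which operation caused it.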
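
If pidstat is not installed, a rough equivalent of pidstat -d can be built from the per-process counters in /proc/<pid>/io. This is only a sketch under assumptions: Python 3, a kernel built with task I/O accounting (which pidstat -d itself relies on), and root privileges to read other users' processes. The function names are made up for illustration.

```python
import os


def parse_proc_io(text):
    """Parse the content of /proc/<pid>/io into {counter: int}."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def snapshot_io():
    """Return {pid: (name, read_bytes, write_bytes)} for every readable process."""
    snap = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/io" % entry) as fh:
                io = parse_proc_io(fh.read())
            with open("/proc/%s/status" % entry) as fh:
                # First line of status looks like "Name:\tjava".
                name = fh.readline().split(":", 1)[1].strip()
        except (IOError, OSError, IndexError):
            continue  # process exited or permission denied
        snap[int(entry)] = (name, io.get("read_bytes", 0), io.get("write_bytes", 0))
    return snap


def top_writers(before, after, limit=5):
    """Rank processes by bytes written to storage between two snapshots."""
    deltas = []
    for pid, (name, _read, written) in after.items():
        if pid in before:
            deltas.append((written - before[pid][2], pid, name))
    deltas.sort(reverse=True)
    return deltas[:limit]
```

Taking one snapshot with `snapshot_io()`, waiting a few seconds, taking a second one, and passing both to `top_writers` lists the heaviest writers in that interval. Run from cron around the market close, it would narrow down which process to investigate.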
