全文共10789字,阅读约需3min。
如果你运维过一个大数据集群,你就能够明白内存对于集群主节点来说是一种稀缺资源,特别是集群越大,机器内存问题给集群带来的问题就越明显且越频繁。
在我管理的这么多大数据集群中,该问题现象就层出不穷。下面对其中一次现象进行了总结。
问题排查思路是:
其实正确的处理问题路径是:重启主机->启动nn->查看日志定位原因。这里不对问题处理思路作深究,仅重点阐述本案中主机死机的原因。
主机死机,一般是要查看/var/log/message
日志,发现提示:echo 0 > /proc/sys/kernel/hung_task_timeout_secs.
完整的日志如下:
Mar 28 03:12:47 namenode01 collectd[1731]: write_graphite plugin: send to epc-graphite-server:2003 (tcp) failed with status 110 (Connection timed out)
Mar 28 03:12:47 namenode01 collectd[1731]: Filter subsystem: Built-in target `write': Dispatching value to all write plugins failed with status -1.
Mar 28 03:12:47 namenode01 collectd[1731]: Filter subsystem: Built-in target `write': Some write plugin is back to normal operation. `write' succeeded.
Mar 28 03:14:31 namenode01 kernel: INFO: task kworker/15:2:22246 blocked for more than 120 seconds.
Mar 28 03:14:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:14:31 namenode01 kernel: kworker/15:2 D ffff88208618bdd8 0 22246 2 0x00000080
Mar 28 03:14:31 namenode01 kernel: ffff88208618bbf0 0000000000000046 ffff8828e3130000 ffff88208618bfd8
Mar 28 03:14:31 namenode01 kernel: ffff88208618bfd8 ffff88208618bfd8 ffff8828e3130000 ffff88208618bd60
Mar 28 03:14:31 namenode01 kernel: 7fffffffffffffff ffff88208618bd58 ffff8828e3130000 ffff88208618bdd8
Mar 28 03:14:31 namenode01 kernel: Call Trace:
Mar 28 03:14:31 namenode01 kernel: [] schedule+0x29/0x70
Mar 28 03:14:31 namenode01 kernel: [] schedule_timeout+0x239/0x2c0
Mar 28 03:14:31 namenode01 kernel: [] ? physflat_send_IPI_mask+0xe/0x10
Mar 28 03:14:31 namenode01 kernel: [] ? native_smp_send_reschedule+0x4c/0x70
Mar 28 03:14:31 namenode01 kernel: [] ? resched_curr+0xa8/0xc0
Mar 28 03:14:31 namenode01 kernel: [] ? check_preempt_curr+0x78/0xa0
Mar 28 03:14:31 namenode01 kernel: [] ? ttwu_do_wakeup+0x19/0xd0
Mar 28 03:14:31 namenode01 kernel: [] wait_for_completion+0xfd/0x140
Mar 28 03:14:31 namenode01 kernel: [] ? wake_up_state+0x20/0x20
Mar 28 03:14:31 namenode01 kernel: [] kthread_create_on_node+0xaa/0x140
Mar 28 03:14:31 namenode01 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:14:31 namenode01 kernel: [] create_worker+0xeb/0x200
Mar 28 03:14:31 namenode01 kernel: [] manage_workers.isra.24+0xf6/0x2a0
Mar 28 03:14:31 namenode01 kernel: [] worker_thread+0x383/0x3c0
Mar 28 03:14:31 namenode01 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:14:31 namenode01 kernel: [] kthread+0xcf/0xe0
Mar 28 03:14:31 namenode01 kernel: [] ? insert_kthread_work+0x40/0x40
Mar 28 03:14:31 namenode01 kernel: [] ret_from_fork+0x58/0x90
Mar 28 03:14:31 namenode01 kernel: [] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: INFO: task kworker/2:2:11678 blocked for more than 120 seconds.
Mar 28 03:16:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:16:31 namenode01 kernel: kworker/2:2 D ffff88003922bdd8 0 11678 2 0x00000080
Mar 28 03:16:31 namenode01 kernel: ffff88003922bbf0 0000000000000046 ffff881fffd86eb0 ffff88003922bfd8
Mar 28 03:16:31 namenode01 kernel: ffff88003922bfd8 ffff88003922bfd8 ffff881fffd86eb0 ffff88003922bd60
Mar 28 03:16:31 namenode01 kernel: 7fffffffffffffff ffff88003922bd58 ffff881fffd86eb0 ffff88003922bdd8
Mar 28 03:16:31 namenode01 kernel: Call Trace:
Mar 28 03:16:31 namenode01 kernel: [] schedule+0x29/0x70
Mar 28 03:16:31 namenode01 kernel: [] schedule_timeout+0x239/0x2c0
Mar 28 03:16:31 namenode01 kernel: [] ? __enqueue_entity+0x78/0x80
Mar 28 03:16:31 namenode01 kernel: [] ? enqueue_entity+0x26c/0xb60
Mar 28 03:16:31 namenode01 kernel: [] wait_for_completion+0xfd/0x140
Mar 28 03:16:31 namenode01 kernel: [] ? wake_up_state+0x20/0x20
Mar 28 03:16:31 namenode01 kernel: [] kthread_create_on_node+0xaa/0x140
Mar 28 03:16:31 namenode01 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:16:31 namenode01 kernel: [] create_worker+0xeb/0x200
Mar 28 03:16:31 namenode01 kernel: [] manage_workers.isra.24+0xf6/0x2a0
Mar 28 03:16:31 namenode01 kernel: [] worker_thread+0x383/0x3c0
Mar 28 03:16:31 namenode01 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:16:31 namenode01 kernel: [] kthread+0xcf/0xe0
Mar 28 03:16:31 namenode01 kernel: [] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: [] ret_from_fork+0x58/0x90
Mar 28 03:16:31 namenode01 kernel: [] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: INFO: task kworker/15:2:22246 blocked for more than 120 seconds.
Mar 28 03:16:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:16:31 namenode01 kernel: kworker/15:2 D ffff88208618bdd8 0 22246 2 0x00000080
Mar 28 03:16:31 namenode01 kernel: ffff88208618bbf0 0000000000000046 ffff8828e3130000 ffff88208618bfd8
Mar 28 03:16:31 namenode01 kernel: ffff88208618bfd8 ffff88208618bfd8 ffff8828e3130000 ffff88208618bd60
Mar 28 03:16:31 namenode01 kernel: 7fffffffffffffff ffff88208618bd58 ffff8828e3130000 ffff88208618bdd8
原因在于,默认情况下, Linux会最多使用40%的可用内存作为文件系统缓存。当超过这个阈值后,文件系统会把将缓存中的内存全部写入磁盘, 导致后续的IO请求都是同步的。将缓存写入磁盘时,有一个默认120秒的超时时间。
出现上面的问题的原因是IO子系统的处理速度不够快,不能在120秒将缓存中的数据全部写入磁盘。IO系统响应缓慢,导致越来越多的请求堆积,最终系统内存全部被占用,导致系统失去响应。
这个Linux延迟写机制带来的问题,并且在主机内存越大时,出现该问题的可能性更大。研发发现This is a know bug
。
This is a know bug. By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached the file system flushes all outstanding data to disk causing all following IOs going synchronous. For flushing out this data to disk this there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data withing 120 seconds. This especially happens on systems with a lof of memory.
The problem is solved in later kernels and there is not “fix” from Oracle. I fixed this by lowering the mark for flushing the cache from 40% to 10% by setting “vm.dirty_ratio=10” in /etc/sysctl.conf. This setting does not influence overall database performance since you hopefully use Direct IO and bypass the file system cache completely.
链接:nfo-task-blocked-for-more-than-120-seconds
关于脏数据,有几个配置:
vm.dirty_background_ratio
是内存可以填充“脏数据”的百分比。这些“脏数据”在稍后是会写入磁盘的,pdflush/flush/kdmflush这些后台进程会稍后清理脏数据。举一个例子,我有32G内存,那么有3.2G的内存可以待着内存里,超过3.2G的话就会有后来进程来清理它。
vm.dirty_ratio
是绝对的脏数据限制,内存里的脏数据百分比不能超过这个值,如果超过,将强制刷写到磁盘。如果脏数据超过这个数量,新的IO请求将会被阻挡,直到脏数据被写进磁盘。这是造成IO卡顿的重要原因,但这也是保证内存中不会存在过量脏数据的保护机制。
vm.dirty_expire_centisecs
指定脏数据能存活的时间。在这里它的值是30秒。当 pdflush/flush/kdmflush 进行起来时,它会检查是否有数据超过这个时限,如果有则会把它异步地写到磁盘中。毕竟数据在内存里待太久也会有丢失风险。
vm.dirty_writeback_centisecs
指定多长时间
pdflush/flush/kdmflush 这些进程会起来一次。
我们可以从以下思路进行调优:
修改相应参数
临时修改
# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_background_ratio=5
# sysctl -p
永久修改
#vi /etc/sysctl.conf
写入
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
sysctl -p
关于脏数据,这里简单说一下:
脏数据 由于Linux内核实现的一种主要磁盘缓存的存在,也就是页高速缓存(cache)。页高速缓存的缓存作用,写操作实际上会被延迟。当页高速缓存中的数据比后台存储的数据更新时,那么该数据就被称做脏数据。