A NameNode Failure Caused by Dirty Memory Pages on a Linux Host (Kernel Hint: echo 0 > /proc/sys/kernel/hung_task_timeout_secs)


If you have ever operated a big-data cluster, you know that memory is a scarce resource on the master nodes; the larger the cluster, the more visible and frequent the problems that host memory causes.
Across the many clusters I manage, this kind of issue comes up again and again. Below is a write-up of one such incident.

Symptoms

  1. A host froze for no apparent reason;
  2. The NameNode service on it became unresponsive and an active/standby failover occurred.

The troubleshooting sequence was:

  1. Received a host network-unreachable alert;
  2. Received a STOP alert for the NameNode service on that host;
  3. Checked whether the standby node had successfully taken over as active;
  4. Checked the ZKFC log to determine the cause of the failover;
  5. Rebooted the faulty host (if needed), backed up the logs, started the NameNode service, and examined the host logs to confirm why the host crashed.

The correct handling path is actually: reboot the host -> start the NameNode -> check the logs to locate the cause. This article does not dwell on the handling procedure; it focuses on why the host in this case hung.

When a host hangs, the first place to look is /var/log/messages, which here contained the hint: echo 0 > /proc/sys/kernel/hung_task_timeout_secs.

The full log:

Mar 28 03:12:47 namenode01 collectd[1731]: write_graphite plugin: send to epc-graphite-server:2003 (tcp) failed with status 110 (Connection timed out)
Mar 28 03:12:47 namenode01 collectd[1731]: Filter subsystem: Built-in target `write': Dispatching value to all write plugins failed with status -1.
Mar 28 03:12:47 namenode01 collectd[1731]: Filter subsystem: Built-in target `write': Some write plugin is back to normal operation. `write' succeeded.
Mar 28 03:14:31 namenode01 kernel: INFO: task kworker/15:2:22246 blocked for more than 120 seconds.
Mar 28 03:14:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:14:31 namenode01 kernel: kworker/15:2    D ffff88208618bdd8     0 22246      2 0x00000080
Mar 28 03:14:31 namenode01 kernel: ffff88208618bbf0 0000000000000046 ffff8828e3130000 ffff88208618bfd8
Mar 28 03:14:31 namenode01 kernel: ffff88208618bfd8 ffff88208618bfd8 ffff8828e3130000 ffff88208618bd60
Mar 28 03:14:31 namenode01 kernel: 7fffffffffffffff ffff88208618bd58 ffff8828e3130000 ffff88208618bdd8
Mar 28 03:14:31 namenode01 kernel: Call Trace:
Mar 28 03:14:31 namenode01 kernel: [] schedule+0x29/0x70
Mar 28 03:14:31 namenode01 kernel: [] schedule_timeout+0x239/0x2c0
Mar 28 03:14:31 namenode01 kernel: [] ? physflat_send_IPI_mask+0xe/0x10
Mar 28 03:14:31 namenode01 kernel: [] ? native_smp_send_reschedule+0x4c/0x70
Mar 28 03:14:31 namenode01 kernel: [] ? resched_curr+0xa8/0xc0
Mar 28 03:14:31 namenode01 kernel: [] ? check_preempt_curr+0x78/0xa0
Mar 28 03:14:31 namenode01 kernel: [] ? ttwu_do_wakeup+0x19/0xd0
Mar 28 03:14:31 namenode01 kernel: [] wait_for_completion+0xfd/0x140
Mar 28 03:14:31 namenode01 kernel: [] ? wake_up_state+0x20/0x20
Mar 28 03:14:31 namenode01 kernel: [] kthread_create_on_node+0xaa/0x140
Mar 28 03:14:31 namenode01 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:14:31 namenode01 kernel: [] create_worker+0xeb/0x200
Mar 28 03:14:31 namenode01 kernel: [] manage_workers.isra.24+0xf6/0x2a0
Mar 28 03:14:31 namenode01 kernel: [] worker_thread+0x383/0x3c0
Mar 28 03:14:31 namenode01 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:14:31 namenode01 kernel: [] kthread+0xcf/0xe0
Mar 28 03:14:31 namenode01 kernel: [] ? insert_kthread_work+0x40/0x40
Mar 28 03:14:31 namenode01 kernel: [] ret_from_fork+0x58/0x90
Mar 28 03:14:31 namenode01 kernel: [] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: INFO: task kworker/2:2:11678 blocked for more than 120 seconds.
Mar 28 03:16:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:16:31 namenode01 kernel: kworker/2:2     D ffff88003922bdd8     0 11678      2 0x00000080
Mar 28 03:16:31 namenode01 kernel: ffff88003922bbf0 0000000000000046 ffff881fffd86eb0 ffff88003922bfd8
Mar 28 03:16:31 namenode01 kernel: ffff88003922bfd8 ffff88003922bfd8 ffff881fffd86eb0 ffff88003922bd60
Mar 28 03:16:31 namenode01 kernel: 7fffffffffffffff ffff88003922bd58 ffff881fffd86eb0 ffff88003922bdd8
Mar 28 03:16:31 namenode01 kernel: Call Trace:
Mar 28 03:16:31 namenode01 kernel: [] schedule+0x29/0x70
Mar 28 03:16:31 namenode01 kernel: [] schedule_timeout+0x239/0x2c0
Mar 28 03:16:31 namenode01 kernel: [] ? __enqueue_entity+0x78/0x80
Mar 28 03:16:31 namenode01 kernel: [] ? enqueue_entity+0x26c/0xb60
Mar 28 03:16:31 namenode01 kernel: [] wait_for_completion+0xfd/0x140
Mar 28 03:16:31 namenode01 kernel: [] ? wake_up_state+0x20/0x20
Mar 28 03:16:31 namenode01 kernel: [] kthread_create_on_node+0xaa/0x140
Mar 28 03:16:31 namenode01 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:16:31 namenode01 kernel: [] create_worker+0xeb/0x200
Mar 28 03:16:31 namenode01 kernel: [] manage_workers.isra.24+0xf6/0x2a0
Mar 28 03:16:31 namenode01 kernel: [] worker_thread+0x383/0x3c0
Mar 28 03:16:31 namenode01 kernel: [] ? manage_workers.isra.24+0x2a0/0x2a0
Mar 28 03:16:31 namenode01 kernel: [] kthread+0xcf/0xe0
Mar 28 03:16:31 namenode01 kernel: [] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: [] ret_from_fork+0x58/0x90
Mar 28 03:16:31 namenode01 kernel: [] ? insert_kthread_work+0x40/0x40
Mar 28 03:16:31 namenode01 kernel: INFO: task kworker/15:2:22246 blocked for more than 120 seconds.
Mar 28 03:16:31 namenode01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 28 03:16:31 namenode01 kernel: kworker/15:2    D ffff88208618bdd8     0 22246      2 0x00000080
Mar 28 03:16:31 namenode01 kernel: ffff88208618bbf0 0000000000000046 ffff8828e3130000 ffff88208618bfd8
Mar 28 03:16:31 namenode01 kernel: ffff88208618bfd8 ffff88208618bfd8 ffff8828e3130000 ffff88208618bd60
Mar 28 03:16:31 namenode01 kernel: 7fffffffffffffff ffff88208618bd58 ffff8828e3130000 ffff88208618bdd8

The cause: by default, Linux allows up to 40% of available memory to be used as file-system cache (vm.dirty_ratio = 40 on this kernel). Once that threshold is exceeded, the kernel flushes all the cached data to disk, and all subsequent I/O requests become synchronous. A task blocked waiting on such a flush trips the kernel's hung-task warning after a default timeout of 120 seconds.

The messages above mean the I/O subsystem was not fast enough to flush the cached data to disk within 120 seconds. As the I/O system responds ever more slowly, more and more requests pile up, until eventually all system memory is consumed and the host stops responding.
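To put a number on that threshold, here is a back-of-the-envelope sketch; the 128 GiB host size is an assumption for illustration, combined with the old 40% default discussed above:

```shell
# How much data vm.dirty_ratio lets pile up before writes go synchronous.
mem_kb=$((128 * 1024 * 1024))   # assumed MemTotal: 128 GiB, in kB
dirty_ratio=40                  # the old default discussed above
dirty_limit_kb=$((mem_kb * dirty_ratio / 100))
echo "up to $((dirty_limit_kb / 1024 / 1024)) GiB of dirty data before blocking"
```

Pushing roughly 51 GiB through an already struggling disk subsystem within the 120-second hung-task window is hopeless, which is why hosts with more memory hit this problem more often.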

This is a problem caused by Linux's delayed-write (writeback) mechanism, and the larger a host's memory, the more likely it is to occur. The development team found that this is a known bug:

This is a known bug. By default Linux uses up to 40% of the available memory for file system caching. After this mark has been reached, the file system flushes all outstanding data to disk, causing all following IOs to go synchronous. For flushing this data to disk there is a time limit of 120 seconds by default. In the case here the IO subsystem is not fast enough to flush the data within 120 seconds. This especially happens on systems with a lot of memory.

The problem is solved in later kernels and there is no "fix" from Oracle. I fixed this by lowering the mark for flushing the cache from 40% to 10% by setting "vm.dirty_ratio=10" in /etc/sysctl.conf. This setting does not influence overall database performance, since you hopefully use Direct IO and bypass the file system cache completely.

Link: nfo-task-blocked-for-more-than-120-seconds

Several kernel settings govern dirty data:

vm.dirty_background_ratio is the percentage of memory that may fill with dirty data before the background flusher threads (pdflush/flush/kdmflush) start writing it out asynchronously. For example, on a host with 32 GB of memory and vm.dirty_background_ratio = 10, up to 3.2 GB of dirty data can sit in memory; beyond that, the background threads start cleaning it up.

vm.dirty_ratio is the hard limit: the percentage of dirty data in memory may not exceed this value. Once it is exceeded, dirty data is forcibly flushed to disk, and new I/O requests are blocked until it has been written out. This is a major cause of I/O stalls, but it is also the safety mechanism that keeps dirty data in memory from growing without bound.

vm.dirty_expire_centisecs specifies how long dirty data may live; here it is set to 3000, i.e. 30 seconds. When the flusher threads (pdflush/flush/kdmflush) wake up, they check for data older than this limit and write it out to disk asynchronously. After all, data that sits in memory too long risks being lost.

vm.dirty_writeback_centisecs specifies how often the flusher threads (pdflush/flush/kdmflush) wake up.
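All four knobs can be read without root from /proc; a quick sketch to inspect the current values on a host (the paths exist on any modern Linux, but the values vary by kernel and distribution):

```shell
# Print the current writeback tunables (read-only, no root required).
bg_ratio=$(cat /proc/sys/vm/dirty_background_ratio)
ratio=$(cat /proc/sys/vm/dirty_ratio)
expire_cs=$(cat /proc/sys/vm/dirty_expire_centisecs)
writeback_cs=$(cat /proc/sys/vm/dirty_writeback_centisecs)
echo "vm.dirty_background_ratio    = $bg_ratio"
echo "vm.dirty_ratio               = $ratio"
echo "vm.dirty_expire_centisecs    = $expire_cs"    # hundredths of a second
echo "vm.dirty_writeback_centisecs = $writeback_cs" # hundredths of a second
```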

Tuning

We can tune along the following lines:

  1. Lower the dirty-data ratio, so a flush cannot overrun the timeout;
  2. Shorten how long dirty data stays in memory, so it cannot accumulate.

Adjust the parameters accordingly.

Temporary change:

# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_background_ratio=5

(`sysctl -w` takes effect immediately; `sysctl -p` only reloads /etc/sysctl.conf and is not needed for a temporary change.)

Permanent change:

# vi /etc/sysctl.conf
Add:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
Then apply:
# sysctl -p
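To see whether the tuning is working, you can watch how much dirty data is actually queued at any moment; /proc/meminfo exposes this directly (a read-only sketch, values in kB and naturally fluctuating):

```shell
# How much data is waiting for writeback right now.
dirty_kb=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)
writeback_kb=$(awk '/^Writeback:/ {print $2}' /proc/meminfo)
echo "Dirty: ${dirty_kb} kB, Writeback: ${writeback_kb} kB"
```

After lowering vm.dirty_ratio and vm.dirty_background_ratio, the Dirty figure should stay much smaller under the same write load.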

A brief note on what dirty data is:

Dirty data exists because the Linux kernel maintains a primary disk cache, the page cache. Because writes go through the page cache, they are in effect deferred; when data in the page cache is newer than the copy on backing storage, that data is called dirty.
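A crude demonstration of this, assuming a disk-backed filesystem in the current directory (the scratch-file name is hypothetical): data written without fsync shows up in the Dirty counter until the flusher threads write it back.

```shell
# Write 8 MiB without fsync and compare the Dirty counter before/after.
f=$(mktemp ./dirty_demo.XXXXXX)
before_kb=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)
dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null
after_kb=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)
echo "Dirty before: ${before_kb} kB, after: ${after_kb} kB"
rm -f "$f"
```

The "after" figure is usually larger, but not guaranteed: on tmpfs the pages are accounted as Shmem rather than Dirty, and the flushers may run in between the two reads.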

References

  • On the Linux kernel's delayed-write mechanism: http://ilinuxkernel.com/?p=1578
  • On the page cache, dirty data, and other I/O terms: https://blog.51cto.com/qixue/1906775
  • On the difference between dirty_ratio and dirty_background_ratio (also covers tuning strategies for other scenarios): https://feichashao.com/dirty_ratio_and_dirty_background_ratio/


