Hadoop DataNodes keep dying after a while with: java.lang.OutOfMemoryError: GC overhead limit exceeded

Background:

       Earlier, as the data volume grew, the cluster started reporting "10000 millis timeout while waiting for channel to be ready for write". That turned out to be an HDFS write timeout; after changing the configuration, the cluster went back to normal.

https://blog.csdn.net/dehu_zhou/article/details/81533802
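
For reference, the usual fix for that write-timeout error is to raise the HDFS socket timeouts in hdfs-site.xml. A minimal sketch — the two property names are standard HDFS settings, but the values below are only illustrative, not necessarily the ones from the linked post:

<property>
  <name>dfs.client.socket-timeout</name>
  <value>600000</value>       <!-- client socket read timeout, in milliseconds -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value>       <!-- DataNode socket write timeout, in milliseconds -->
</property>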

 

After the cluster had been running for a few days, a new problem showed up: DataNodes kept dying off in batches, and not the same nodes each time. Back to the logs.

The DataNode logs were printing this non-stop:

2018-08-09 09:49:42,830 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1102ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=1450ms
2018-08-09 09:49:44,371 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1040ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=1396ms
2018-08-09 09:49:47,343 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1169ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=1542ms
2018-08-09 09:49:50,286 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1042ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=1444ms
2018-08-09 09:50:15,703 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2197ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=2679ms
2018-08-09 09:50:18,903 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1595ms
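
These JvmPauseMonitor messages mean the JVM was stalled, almost certainly by full GCs. To confirm that on a live DataNode, GC activity can be watched with standard JDK tools; a minimal sketch (DN_PID is just a name used here for the DataNode's pid):

# Find the DataNode's pid (jps lists Java processes with their main class)
DN_PID=$(jps | awk '$2 == "DataNode" {print $1}')

# Heap-region usage plus young/full GC counts and times, sampled every 5 seconds;
# climbing FGC/FGCT with the old gen (O) stuck near 100% matches the pauses above
jstat -gcutil $DN_PID 5000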

I don't really understand the JVM, so all I could do was suspect the GC overruns were caused by insufficient memory. Fine: dumb person, dumb method — give the DataNode more memory.

In hadoop-env.sh on the DataNodes, set (adjust the numbers to your own cluster):

export HADOOP_DATANODE_OPTS="-Xmx3g -Xms3g -Xmn2g -Xloggc:$HADOOP_LOG_DIR/datanode_gc.log"

With that in place, copy the file to every node. The cluster happened to have no jobs running, so I restarted it and waited for it to come back up.
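
(If the cluster cannot be taken down entirely, the same change can also be rolled out with a per-node DataNode restart. A rough sketch, assuming passwordless SSH, the same $HADOOP_HOME on every node, a Hadoop 2.x directory layout, and a hypothetical datanodes.txt listing the DataNode hostnames:)

for host in $(cat datanodes.txt); do
  # push the updated hadoop-env.sh, then bounce only the DataNode daemon on that host
  scp "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" "$host:$HADOOP_HOME/etc/hadoop/"
  ssh "$host" "$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode; $HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"
done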

......

Twenty minutes later the cluster still wasn't up; it was still loading the edits files, and it's normally nowhere near this slow... I hurried to check the NameNode log and found it was throwing errors too:

java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:549)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:422)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:403)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:359)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:545)
        ... 10 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.io.UnixFileSystem.resolve(UnixFileSystem.java:108)
        at java.io.File.<init>(File.java:262)
        at java.io.File.listFiles(File.java:1212)
        at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1162)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:595)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:585)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:570)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
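
The trace shows the OOM hitting inside DirectoryScanner while it lists block files, i.e. the heap is being filled with block/file metadata rather than something exotic. Before (or as well as) raising the heap, a heap histogram of the affected process shows what is actually occupying memory; a minimal sketch with standard JDK tools, where $PID is the pid of the daemon that threw the OOM (found with jps, as above):

# Top object types by instance count and size; ":live" forces a full GC first
jmap -histo:live $PID | head -n 30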

What, another JVM problem? Picking on me because I don't understand the JVM... fine, back to the dumb method.

Edit hadoop-env.sh on the NameNode:

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=10240
export HADOOP_NAMENODE_INIT_HEAPSIZE="100000"

Restarted the NameNode again, and after roughly three minutes the cluster came up successfully. As of now it has been running for three days and completed over 100 jobs, so for the moment the cluster is healthy.
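
To double-check that the new heap sizes were actually picked up after the restart, the running NameNode's command line can be inspected (NN_PID is just a local name; the grep pattern simply pulls out the -Xms/-Xmx flags):

# Show which -Xms/-Xmx the NameNode JVM was really started with
NN_PID=$(jps | awk '$2 == "NameNode" {print $1}')
ps -o args= -p $NN_PID | tr ' ' '\n' | grep -E '^-Xm[sx]'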

 

All of the above is just my own humble take, and every fix was the dumb method; if anyone knows a better way, I'd love to hear about it. Thanks!

 

Appendix: some parameters that can be configured in hadoop-env.sh:

Thanks to the original author for sharing; source: https://blog.csdn.net/zxln007/article/details/79356654

HADOOP_OPTS
    -Djava.net.preferIPv4Stack=true         Prefer IPv4 and disable IPv6

HADOOP_NAMENODE_OPTS
    -Xms140G                                Initial heap size
    -Xmx140G                                Maximum heap size
    -XX:NewRatio=8                          Ratio of the young generation to the old generation; e.g. a value of 3 means young:old = 1:3, so the young generation takes 1/4 of (young + old)
    -XX:SurvivorRatio=4                     Ratio of Eden to the two Survivor spaces in the young generation (note there are two Survivor spaces); e.g. a value of 3 means Eden:Survivor = 3:2, so one Survivor space takes 1/5 of the young generation
    -XX:MaxPermSize=200M                    Permanent generation size
    -XX:+UseParNewGC                        ParNew is the multi-threaded version of the Serial collector: parallel copying in the young generation, serial mark-compact in the old generation
    -XX:+UseConcMarkSweepGC                 Use the CMS (concurrent) collector
    -XX:+CMSParallelRemarkEnabled           Run the remark phase in parallel to shorten the second pause
    -XX:MaxDirectMemorySize=512M            When the off-heap memory allocated through Direct ByteBuffers reaches this size, a Full GC is triggered
    -XX:CMSInitiatingOccupancyFraction=75   Old-generation occupancy at which CMS starts collecting (default 68%); only takes effect with CMS
    -XX:ConcGCThreads=8
    -verbose:gc
    -XX:+PrintGCDetails                     Print GC details
    -XX:+PrintGCDateStamps                  Print GC timestamps
    -Xloggc:/home/bigdata/hadoop/logs/namenode.gc.log   GC log path and file name
    -Dcom.sun.management.jmxremote

HADOOP_DATANODE_OPTS
    -Xms3G                                  Initial heap size
    -Xmx3G                                  Maximum heap size
    -Xmn512M                                Young generation size
    -XX:MaxDirectMemorySize=512M            When the off-heap memory allocated through Direct ByteBuffers reaches this size, a Full GC is triggered
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=75

Appendix: a configuration from an expert that I found online; thanks to the author for sharing.

Source: http://blog.csdn.net/maijiyouzou/article/details/23740225

JVM_OPTS="-server -verbose:gc
  -XX:+PrintGCDateStamps
  -XX:+PrintGCDetails
  -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=9
  -XX:GCLogFileSize=256m
  -XX:+DisableExplicitGC
  -XX:+UseCompressedOops
  -XX:SoftRefLRUPolicyMSPerMB=0
  -XX:+UseFastAccessorMethods
  -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:+CMSParallelRemarkEnabled
  -XX:CMSInitiatingOccupancyFraction=70
  -XX:+UseCMSCompactAtFullCollection
  -XX:CMSFullGCsBeforeCompaction=0
  -XX:+CMSClassUnloadingEnabled
  -XX:CMSMaxAbortablePrecleanTime=301
  -XX:+CMSScavengeBeforeRemark
  -XX:PermSize=160m
  -XX:GCTimeRatio=19
  -XX:SurvivorRatio=2
  -XX:MaxTenuringThreshold=100"
# MaxTenuringThreshold matters: it controls how many minor GCs an object must survive before being promoted to the old generation; the default is 15 (note that newer HotSpot JVMs reject values above 15).


# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="$JVM_OPTS -Xmx80g -Xms80g -Xmn21g -Xloggc:$HADOOP_LOG_DIR/namenode_gc.log"
export HADOOP_SECONDARYNAMENODE_OPTS="$JVM_OPTS -Xmx80g -Xms80g -Xmn21g"
export HADOOP_DATANODE_OPTS="$JVM_OPTS -Xmx3g -Xms3g -Xmn2g -Xloggc:$HADOOP_LOG_DIR/datanode_gc.log"
export HADOOP_BALANCER_OPTS="$JVM_OPTS -Xmx1g -Xms1g -Xmn512m -Xloggc:$HADOOP_LOG_DIR/balancer_gc.log"


export HADOOP_JOBTRACKER_OPTS="$JVM_OPTS -Xmx1g -Xms1g -Xmn512m -Xloggc:$HADOOP_LOG_DIR/jobtracker_gc.log"
export HADOOP_TASKTRACKER_OPTS="$JVM_OPTS -Xmx1g -Xms1g -Xmn512m -Xloggc:$HADOOP_LOG_DIR/tasktracker_gc.log"
export HADOOP_CLIENT_OPTS="$JVM_OPTS -Xmx512m -Xms512m -Xmn256m"
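
Before rolling a flag set like this out to every daemon, it can be dry-run against the local JVM (assuming JVM_OPTS is defined as above in the current shell): an unrecognized or out-of-range option makes the JVM refuse to start here rather than on a production NameNode. This is only a syntax check; it says nothing about whether the sizes suit your hardware.

# Dry-run the option string; a small heap is forced so the check does not need 80 GB of RAM
java $JVM_OPTS -Xmx512m -Xms512m -Xmn256m -version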

 
