Background:
As data volume grew, the cluster started throwing "10000 millis timeout while waiting for channel to be ready for write". That turned out to be an HDFS write timeout; after adjusting the configuration, the cluster recovered.
https://blog.csdn.net/dehu_zhou/article/details/81533802
A few days later a new problem appeared: DataNodes kept dying in batches, and it was a different set of nodes each time. Back to the logs.
The DataNode log was printing this continuously:
2018-08-09 09:49:42,830 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1102ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=1450ms
2018-08-09 09:49:44,371 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1040ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=1396ms
2018-08-09 09:49:47,343 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1169ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=1542ms
2018-08-09 09:49:50,286 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1042ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=1444ms
2018-08-09 09:50:15,703 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 2197ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=2679ms
2018-08-09 09:50:18,903 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1595ms
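Lines like these can be summarized quickly to see whether the pauses are sporadic or sustained. A minimal sketch that totals the pause times reported by JvmPauseMonitor (the heredoc stands in for something like `grep JvmPauseMonitor /path/to/datanode.log`; the log path is yours to fill in):

```shell
# Sum the approximate pause times reported by JvmPauseMonitor.
# Each matching line ends with a token like "1102ms", so we take the
# last field, strip the "ms" suffix, and accumulate.
awk '/JvmPauseMonitor: Detected pause/ { v = $NF; sub(/ms$/, "", v); n++; total += v }
     END { printf "pauses=%d total_ms=%d\n", n, total }' <<'EOF'
2018-08-09 09:49:42,830 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1102ms
2018-08-09 09:49:44,371 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1040ms
EOF
```

If the total keeps climbing over a short window, the process is spending most of its time in GC rather than serving blocks.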
I am no JVM expert, so my first suspicion was simply that the heap was too small and GC was timing out. Fine, brute force it: give the DataNodes more memory.
In hadoop-env.sh on the DataNodes, configure (tune the values to your own cluster):
export HADOOP_DATANODE_OPTS="-Xmx3g -Xms3g -Xmn2g -Xloggc:$HADOOP_LOG_DIR/datanode_gc.log"
With the change in place, copy the file to every node. The cluster happened to have no jobs running, so restart it and wait for it to come back up.
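While changing the heap, it also helps to make the GC log itself more informative. A sketch of an extended HADOOP_DATANODE_OPTS (the 3g/2g sizes are just the values used above, not a recommendation; the extra flags are standard HotSpot GC-logging options):

```shell
# hadoop-env.sh on the DataNodes -- sketch only; sizes must fit your hosts.
# -XX:+PrintGCDetails and -XX:+PrintGCDateStamps make the -Xloggc file
# readable; log rotation keeps it from growing without bound.
export HADOOP_DATANODE_OPTS="-Xmx3g -Xms3g -Xmn2g \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=128m \
  -Xloggc:$HADOOP_LOG_DIR/datanode_gc.log"
```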
......
Twenty minutes later the cluster still wasn't up; it was still replaying edit files, which is normally much faster. A quick look at the logs showed errors there too (note that the stack trace below comes from the DirectoryScanner, the thread that scans a DataNode's block directories):
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:549)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:422)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:403)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:359)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:545)
... 10 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.io.UnixFileSystem.resolve(UnixFileSystem.java:108)
at java.io.File.<init>(File.java:262)
at java.io.File.listFiles(File.java:1212)
at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1162)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:595)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:585)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:570)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
What, another JVM problem? Picking on me because I don't know the JVM, eh... fine, the brute-force method again.
Edit hadoop-env.sh on the NameNode host:
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=10240
export HADOOP_NAMENODE_INIT_HEAPSIZE="100000"
Restart the NameNode again. About three minutes later the cluster came up successfully. As of this writing it has run for three days and completed over 100 jobs, so for now the cluster looks healthy.
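For sizing the NameNode heap less blindly, a commonly cited rule of thumb is roughly 1 GB of heap per million namespace objects (files + blocks). A small sketch (the counts below are made up; on a real cluster you would take them from `hdfs fsck /` or the NameNode web UI):

```shell
# Rough NameNode heap estimate: ~1 GB per million namespace objects.
# The counts are illustrative, not from this cluster.
files=8000000
blocks=9500000
objects=$(( files + blocks ))
heap_gb=$(( (objects + 999999) / 1000000 ))   # round up to whole GB
echo "objects=${objects} suggested_heap_gb>=${heap_gb}"
```

This is only a floor; monitoring the actual old-generation occupancy after a Full GC is the reliable way to size it.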
The above is just my own clumsy workaround; if you know a better way, please share. Thanks!
Appendix: some parameters that can be configured in hadoop-env.sh.
Thanks to the original author for sharing: https://blog.csdn.net/zxln007/article/details/79356654
| Variable | Setting | Description | Notes |
| --- | --- | --- | --- |
| HADOOP_OPTS | -Djava.net.preferIPv4Stack=true | Prefer IPv4; disable IPv6 | |
| HADOOP_NAMENODE_OPTS | -Xms140G | Initial heap size | |
| | -Xmx140G | Maximum heap size | |
| | -XX:NewRatio=8 | Ratio of old generation to young generation. E.g. 3 means young:old = 1:3, so the young generation is 1/4 of the combined young+old space | |
| | -XX:SurvivorRatio=4 | Ratio of Eden to each Survivor space in the young generation (note there are two Survivor spaces). E.g. 3 means Eden : the two Survivors combined = 3:2, so one Survivor is 1/5 of the young generation | |
| | -XX:MaxPermSize=200M | Size of the permanent generation (pre-Java 8 JVMs only) | |
| | -XX:+UseParNewGC | ParNew is the multi-threaded version of the Serial collector: parallel copying in the young generation, serial mark-compact in the old generation | |
| | -XX:+UseConcMarkSweepGC | Use the CMS (concurrent) collector for the old generation | |
| | -XX:+CMSParallelRemarkEnabled | Run the remark phase in parallel to shorten the second pause | |
| | -XX:MaxDirectMemorySize=512M | Cap on off-heap memory for direct ByteBuffers; reaching it triggers a Full GC | |
| | -XX:CMSInitiatingOccupancyFraction=75 | Old-generation occupancy (%) at which CMS starts a collection; the default is 68. Only effective when CMS is in use | |
| | -XX:ConcGCThreads=8 | Number of concurrent GC threads | |
| | -verbose:gc | Basic GC logging | |
| | -XX:+PrintGCDetails | Print GC details | |
| | -XX:+PrintGCDateStamps | Print GC timestamps | |
| | -Xloggc:/home/bigdata/hadoop/logs/namenode.gc.log | GC log path and file name | |
| | -Dcom.sun.management.jmxremote | Enable remote JMX monitoring | |
| HADOOP_DATANODE_OPTS | -Xms3G | Initial heap size | |
| | -Xmx3G | Maximum heap size | |
| | -Xmn512M | Young generation size | |
| | -XX:MaxDirectMemorySize=512M | Cap on off-heap memory for direct ByteBuffers; reaching it triggers a Full GC | |
| | -XX:+UseParNewGC | | |
| | -XX:+UseConcMarkSweepGC | | |
| | -XX:CMSInitiatingOccupancyFraction=75 | | |
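Assembled into a single line (joining the table rows above; the sizes are the table's examples, not recommendations), the NameNode entry would look roughly like:

```shell
# hadoop-env.sh -- sketch assembled from the table above.
# -XX:MaxPermSize only applies to pre-Java-8 JVMs.
export HADOOP_NAMENODE_OPTS="-Xms140G -Xmx140G -XX:NewRatio=8 -XX:SurvivorRatio=4 \
  -XX:MaxPermSize=200M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:+CMSParallelRemarkEnabled -XX:MaxDirectMemorySize=512M \
  -XX:CMSInitiatingOccupancyFraction=75 -XX:ConcGCThreads=8 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/home/bigdata/hadoop/logs/namenode.gc.log \
  -Dcom.sun.management.jmxremote"
```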
Appendix: another configuration found online; thanks to the original author for sharing.
Original link: http://blog.csdn.net/maijiyouzou/article/details/23740225
JVM_OPTS="-server -verbose:gc
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=9
-XX:GCLogFileSize=256m
-XX:+DisableExplicitGC
-XX:+UseCompressedOops
-XX:SoftRefLRUPolicyMSPerMB=0
-XX:+UseFastAccessorMethods
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSClassUnloadingEnabled
-XX:CMSMaxAbortablePrecleanTime=301
-XX:+CMSScavengeBeforeRemark
-XX:PermSize=160m
-XX:GCTimeRatio=19
-XX:SurvivorRatio=2
-XX:MaxTenuringThreshold=100"
# MaxTenuringThreshold matters: it controls how many minor GCs an object must
# survive before being promoted to the old generation (default 15).
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="$JVM_OPTS -Xmx80g -Xms80g -Xmn21g -Xloggc:$HADOOP_LOG_DIR/namenode_gc.log"
export HADOOP_SECONDARYNAMENODE_OPTS="$JVM_OPTS -Xmx80g -Xms80g -Xmn21g"
export HADOOP_DATANODE_OPTS="$JVM_OPTS -Xmx3g -Xms3g -Xmn2g -Xloggc:$HADOOP_LOG_DIR/datanode_gc.log"
export HADOOP_BALANCER_OPTS="$JVM_OPTS -Xmx1g -Xms1g -Xmn512m -Xloggc:$HADOOP_LOG_DIR/balancer_gc.log"
export HADOOP_JOBTRACKER_OPTS="$JVM_OPTS -Xmx1g -Xms1g -Xmn512m -Xloggc:$HADOOP_LOG_DIR/jobtracker_gc.log"
export HADOOP_TASKTRACKER_OPTS="$JVM_OPTS -Xmx1g -Xms1g -Xmn512m -Xloggc:$HADOOP_LOG_DIR/tasktracker_gc.log"
export HADOOP_CLIENT_OPTS="$JVM_OPTS -Xmx512m -Xms512m -Xmn256m"