1 Symptom: a DataNode was marked dead. Its log shows the following:
GC pool 'PS MarkSweep' had collection(s): count=13 time=24256ms
2018-03-28 17:52:51,624 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-229667812-10.100.208.74-1520931759525:blk_1078749177_5008401 received exception java.io.IOException: Premature EOF from inputStream
2018-03-28 17:52:59,065 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6944ms
GC pool 'PS MarkSweep' had collection(s): count=15 time=27935ms
2018-03-28 17:54:59,834 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 14932ms
GC pool 'PS MarkSweep' had collection(s): count=20 time=37701ms
2018-03-28 17:56:10,452 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: tf73:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.100.208.68:46164 dst: /10.100.208.73:50010
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:202)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:503)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:903)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
at java.lang.Thread.run(Thread.java:748)
2018-03-28 17:58:03,025 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 13437ms (threshold=300ms), isSync:false, flushTotalNanos=13436879837ns
2018-03-28 17:56:42,980 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 25844ms
GC pool 'PS MarkSweep' had collection(s): count=63 time=118826ms
2018-03-28 17:59:02,118 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3395ms
GC pool 'PS MarkSweep' had collection(s): count=108 time=206400ms
2018-03-28 17:59:07,882 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3312ms
GC pool 'PS MarkSweep' had collection(s): count=4 time=7664ms
2018-03-28 17:59:13,599 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-229667812-10.100.208.74-1520931759525:blk_1078749174_5008400 src: /10.100.208.74:51732 dest: /10.100.208.73:50010
2018-03-28 17:59:21,115 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6989ms
GC pool 'PS MarkSweep' had collection(s): count=7 time=13229ms
2018-03-28 17:59:40,030 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 5667ms (threshold=300ms), isSync:false, flushTotalNanos=5666402795ns
2018-03-28 17:59:41,916 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-229667812-10.100.208.74-1520931759525:blk_1078749180_5008405 src: /10.100.208.74:51746 dest: /10.100.208.73:50010
2018-03-28 17:59:49,420 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3333ms
GC pool 'PS MarkSweep' had collection(s): count=10 time=18909ms
2018-03-28 18:00:01,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-229667812-10.100.208.74-1520931759525:blk_1078749179_5008404 src: /10.100.208.68:46194 dest: /10.100.208.73:50010
2018-03-28 18:00:55,777 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-229667812-10.100.208.74-1520931759525:blk_1078749174_5008400 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-229667812-10.100.208.74-1520931759525:blk_1078749174_5008400 already exists in state TEMPORARY and thus cannot be created.
2018-03-28 18:01:37,902 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:13334ms (threshold=300ms)
2018-03-28 18:02:26,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: tf73:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.100.208.74:51732 dst: /10.100.208.73:50010; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-229667812-10.100.208.74-1520931759525:blk_1078749174_5008400 already exists in state TEMPORARY and thus cannot be created.
2018-03-28 18:02:26,014 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7111ms
GC pool 'PS MarkSweep' had collection(s): count=80 time=152336ms
2018-03-28 18:03:38,774 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1366ms
GC pool 'PS MarkSweep' had collection(s): count=11 time=21098ms
2018-03-28 18:03:53,946 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:13320ms (threshold=300ms)
2018-03-28 18:04:32,159 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 13024ms
GC pool 'PS MarkSweep' had collection(s): count=61 time=116740ms
2018-03-28 18:06:50,214 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-229667812-10.100.208.74-1520931759525:blk_1078749180_5008405
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:202)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:503)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:903)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:805)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
at java.lang.Thread.run(Thread.java:748)
2018-03-28 18:07:43,834 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:11756ms (threshold=300ms)
2018-03-28 18:07:43,834 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-229667812-10.100.208.74-1520931759525:blk_1078749182_5008407 src: /10.100.208.74:51760 dest: /10.100.208.73:50010
2018-03-28 18:07:17,288 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5282ms
2 Cause: the DataNode JVM is spending most of its time in stop-the-world full GC. 'PS MarkSweep' is the old-generation collector of the default Parallel GC, and the log records bursts such as count=108 time=206400ms and JvmPauseMonitor pauses of up to ~25s. While paused, the DataNode can neither heartbeat to the NameNode nor read from its DataXceiver sockets, so upstream writers time out and close their connections (the "Premature EOF from inputStream" errors), and retried writes then collide with half-finished replicas (the ReplicaAlreadyExistsException for a block still in state TEMPORARY). Once heartbeats are missed long enough (with default settings roughly 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval ≈ 10.5 minutes), the NameNode declares the node dead. The "Slow flushOrSync" and "Slow BlockReceiver write data to disk" warnings are consistent with these same pauses rather than with a disk fault.
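To confirm GC pressure on a live node before the JVM dies, the standard JDK tools can be used (a sketch; `<pid>` is a placeholder for the DataNode process id, not taken from the log above):

```shell
# Find the DataNode JVM pid (jps ships with the JDK)
jps | grep DataNode

# Sample GC counters every 2000 ms; a rapidly growing FGC (full-GC count)
# and FGCT (full-GC time) with OU (old-gen utilization) pinned near 100%
# confirms the old generation is exhausted, matching the log's symptoms.
jstat -gcutil <pid> 2000
```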
3 Solution: give the DataNode enough heap for the number of block replicas it manages, and reduce full-GC pause times. Typical steps: raise -Xms/-Xmx in HADOOP_DATANODE_OPTS, switch from the default Parallel collector to a low-pause collector (CMS on the JDK 7/8 builds common with Hadoop 2.x, or G1), and enable GC logging so the change can be verified. If heap demand keeps growing, check whether the node holds an unusually large number of blocks (e.g. many small files) and address that at the data level.
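A minimal sketch of the JVM-side remediation in hadoop-env.sh, assuming a Hadoop 2.x deployment on JDK 8 as the 2018-era log suggests (the 4g heap size and CMS threshold are illustrative values to be sized against this node's actual block count, not recommendations):

```shell
# hadoop-env.sh -- example only. Size -Xmx to the node's replica count;
# 4g here is a placeholder. CMS replaces the default Parallel old-gen
# collector ('PS MarkSweep' in the log) to shorten stop-the-world pauses.
export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g \
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:${HADOOP_LOG_DIR}/datanode-gc.log \
  ${HADOOP_DATANODE_OPTS}"
```

Setting -Xms equal to -Xmx avoids heap-resizing pauses, and the GC log gives direct evidence of whether pause times actually dropped after a restart.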