Error 1: java.io.IOException: Incompatible clusterIDs (usually appears after the namenode has been reformatted)
2014-04-29 14:32:53,877 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to Hadoop-master/192.168.1.181:9000
java.io.IOException: Incompatible clusterIDs in /data/dfs/data: namenode clusterID = CID-d1448b9e-da0f-499e-b1d4-78cb18ecdebb; datanode clusterID = CID-ff0faa40-2940-4838-b321-98272eb0dee3
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
    at java.lang.Thread.run(Thread.java:722)
2014-04-29 14:32:53,885 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421) service to hadoop-master/192.168.1.181:9000
2014-04-29 14:32:53,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1480406410-192.168.1.181-1398701121586 (storage id DS-167510828-192.168.1.191-50010-1398750515421)
2014-04-29 14:32:55,897 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
Cause: every namenode format generates a new cluster ID, while the datanode's data directory still holds the ID from the previous format. Formatting clears the namenode's metadata but does not clear the datanodes' data, so the datanodes fail to start. Before each format, clear all directories under the data dir.
Solution: stop the cluster, delete everything under the data directory of the problem node (the dfs.data.dir directory configured in hdfs-site.xml), then reformat the namenode.
A less destructive alternative: stop the cluster, then edit /dfs/data/current/VERSION on the datanode and change its clusterID to match the namenode's.
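
A minimal sketch of both fixes, assuming the datanode storage directory is /data/dfs/data (as in the log above) and the namenode metadata lives under /data/dfs/name; adjust the paths and the start/stop scripts to your own setup:

# Fix 1: wipe the datanode data directory and reformat (all HDFS data is lost)
stop-dfs.sh
rm -rf /data/dfs/data/*            # on every affected datanode
hadoop namenode -format            # on the namenode
start-dfs.sh

# Fix 2: keep the data, only re-align the clusterID
grep clusterID /data/dfs/name/current/VERSION    # on the namenode, note the value
vi /data/dfs/data/current/VERSION                # on the datanode, set clusterID to that value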

Error 2: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start Container
14/04/29 02:45:07 INFO mapreduce.Job: Job job_1398704073313_0021 failed with state FAILED due to: Application application_1398704073313_0021 failed 2 times due to Error launching appattempt_1398704073313_0021_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1398762692768 found 1398711306590
    at sun.reflect.GeneratedConstructorAccessor30.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
    at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:249)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
. Failing the application.
14/04/29 02:45:07 INFO mapreduce.Job: Counters: 0
Cause: the clocks on the namenode and datanodes are not synchronized; the container launch token has already expired according to the node's local time, which is what the "This token is expired" message reflects.
Solution: synchronize time between the datanodes and the namenode. Run ntpdate time.nist.gov on every server and confirm that the sync succeeded.
It is best to also add a line like the following to /etc/crontab on every server:
0 2 * * * root ntpdate time.nist.gov && hwclock -w

Error 3: java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write
2014-05-06 14:28:09,386 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing READ_BLOCK operation src: /192.168.1.191:48854 dest: /192.168.1.191:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.191:50010 remote=/192.168.1.191:48854]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:546)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:710)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:340)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:101)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:65)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:722)
Cause: I/O timeout.

Solution:
Modify hdfs-site.xml and add (or raise) the two properties dfs.datanode.socket.write.timeout and dfs.socket.timeout.
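
A sketch of the hdfs-site.xml entries; 600000 (ten minutes) is only an illustrative value, tune it to your workload:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value>
</property>
<property>
  <name>dfs.socket.timeout</name>
  <value>600000</value>
</property>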
Note: the timeout values are in milliseconds; 0 means no limit.

Error 4: DataXceiver error processing WRITE_BLOCK operation
2014-05-06 15:21:30,378 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hadoop-datanode1:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.168.1.193:34147 dest: /192.168.1.191:50010
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:435)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:693)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:569)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
    at java.lang.Thread.run(Thread.java:722)
Cause: the file operation exceeded its lease; in effect the file was deleted while the data stream operation was still in progress.
Solution:
Modify hdfs-site.xml (this applies to 2.x; on 1.x the property name is dfs.datanode.max.xcievers):
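
A sketch of the entry, using the 2.x property name dfs.datanode.max.transfer.threads (the successor of dfs.datanode.max.xcievers); 8192 is an illustrative value, the 2.x default is 4096:

<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>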
Copy the file to every datanode and restart the datanodes.

Error 5: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
2014-05-07 12:21:41,820 WARN [Thread-115] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Graceful stop failed
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:514)
    at org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.serviceStop(JobHistoryEventHandler.java:332)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:159)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:132)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.shutDownJob(MRAppMaster.java:548)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler$1.run(MRAppMaster.java:599)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[192.168.1.191:50010, 192.168.1.192:50010], original=[192.168.1.191:50010, 192.168.1.192:50010]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:860)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:925)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
Cause: the write cannot proceed. My environment has 3 datanodes and the replication factor is also 3, so each write goes through a pipeline of all 3 machines. The default replace-datanode-on-failure policy is DEFAULT: when the cluster has 3 or more datanodes, the client tries to find another datanode to take over the copy. With only 3 machines in total there is no spare node, so as soon as one datanode has a problem the write can never succeed.
Solution: modify hdfs-site.xml and add or change the following two properties:
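
A sketch of the two entries; the values below keep the feature enabled but set the policy to NEVER, which is the "turn it off" advice explained next (treat them as suggestions, not an excerpt of the original configuration):

<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>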
dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default of true is fine.
dfs.client.block.write.replace-datanode-on-failure.policy: with DEFAULT, when there are 3 or more replicas the client tries to swap in a new datanode and keep writing; with 2 replicas it does not replace the datanode and simply continues writing. On a 3-datanode cluster a single unresponsive node therefore breaks every write, so it is reasonable to turn the replacement off.

Error 6: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for ...
14/05/08 18:24:59 INFO mapreduce.Job: Task Id : attempt_1399539856880_0016_m_000029_2, Status : FAILED
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1399539856880_0016_m_000029_2_spill_0.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

Container killed by the ApplicationMaster.
Cause: two possibilities; either hadoop.tmp.dir or the data directory has run out of space.
Solution: I checked my DFS status and data usage was under 40%, so I concluded that hadoop.tmp.dir had run out of space and the job's temporary files could not be created. core-site.xml had no hadoop.tmp.dir configured, so the default /tmp directory was being used; anything there is lost whenever the server reboots, so it needs to be changed. Add:
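
A sketch of the core-site.xml entry; the path below is only an example, point it at a large disk that survives reboots:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>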
Then reformat the namenode (hadoop namenode -format) and restart the cluster.

Error 7: java.io.IOException: Spill failed (running locally)

2014-06-19 10:00:32,181 INFO [org.apache.hadoop.mapred.MapTask] - Ignoring exception during close for org.apache.hadoop.mapred.MapTask$NewOutputCollector@17bda0f2
java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1447)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
    at org.apache.hadoop.mapred.MapTask.closeQuietly(MapTask.java:1997)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.mapred.MROutputFiles.getSpillFileForWrite(MROutputFiles.java:146)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
Cause: the local disk (not HDFS) ran out of space. I was debugging the program in MyEclipse and the local tmp directory had filled up.
Solution: clean up the directory or add disk space.
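
A quick way to confirm which local directory filled up (assuming the default hadoop.tmp.dir layout under /tmp/hadoop-<username>; adjust the path if you have changed it):

df -h                      # overall usage per filesystem
du -sh /tmp/hadoop-*/*     # size of the local Hadoop temp directories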

Error 8: java.io.IOException: Spill failed (DiskErrorException on the cluster)

2014-06-23 10:21:01,479 INFO [IPC Server handler 3 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1403488126955_0002_m_000000_0 is : 0.30801716
2014-06-23 10:21:01,512 FATAL [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1403488126955_0002_m_000000_0 - exited : java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1540)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1063)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:691)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:180)
    at com.mediadc.hadoop.MediaIndex$SecondMapper.map(MediaIndex.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_1403488126955_0002_m_000000_0_spill_53.out
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.mapred.YarnOutputFiles.getSpillFileForWrite(YarnOutputFiles.java:159)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1573)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:852)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1510)
2014-06-23 10:21:01,513 INFO [IPC Server handler 2 on 45207] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same Spill failed stack trace repeated]
2014-06-23 10:21:01,514 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1403488126955_0002_m_000000_0: Error: java.io.IOException: Spill failed [same Spill failed stack trace repeated]
2014-06-23 10:21:01,516 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1403488126955_0002_m_000000_0 TaskAttempt Transitioned from RUNNING to FAIL_CONTAINER_CLEANUP
The error clearly says the disk is out of space, but frustratingly, when I logged into each node, disk usage was under 40% and there was plenty of free space.
It took a long time to figure out: one map task produced a lot of output while it ran, and before the failure disk usage climbed steadily until it hit 100%. The task then failed, its space was released, and the work was handed to another node. Because the space had already been freed, the disk showed plenty of free room when I checked, even though the error complained about insufficient space.
The lesson here: monitoring the cluster while jobs are running really matters.