Reposted from http://f.dataguru.cn/thread-56030-1-1.html
I am using Hadoop 1.0.2.
These last couple of days I have been writing a small MR job that walks an HBase table, plus a small merge job. When the MR job runs, a rather odd error shows up on the backend; sometimes it is reported right as the MR job starts, sometimes partway through execution:
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:251)
13/01/24 10:35:09 WARN mapred.JobClient: Error reading task outputhttp://master:50060/tasklog?plaintext=true&attemptid=attempt_attempt_201301241037_0001_r_000002_0&filter=stdout
13/01/24 10:35:09 WARN mapred.JobClient: Error reading task outputhttp://master:50060/tasklog?plaintext=true&attemptid=attempt_attempt_201301241037_0001_r_000002_0&filter=stderr
13/01/24 10:35:13 INFO mapred.JobClient: Task Id : attempt_attempt_201301241037_0001_r_000002_0, Status : FAILED
I dug through plenty of material online, and the experts all say different things. Most claim the cause is too many files under ${HADOOP_HOME}/logs/userlogs, and some people really did fix it that way. In practice, though, the real cause is whatever the grey part of the error output points at. Opening that tasklog URL directly shows essentially nothing; you have to go look under ${HADOOP_HOME}/logs/userlogs/{jobid}/{attemptid}, which is usually accurate.
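In Hadoop 1.x each attempt directory under userlogs holds stdout, stderr and syslog files, and stderr is usually where the real cause lands. Here is a small helper of my own (not part of Hadoop; the class and method names are invented) that collects every file with a given name under a userlogs tree so the attempts can be inspected one by one:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not Hadoop code): walk a userlogs directory and
// collect every file with the given name, e.g. every attempt's "stderr".
public class UserlogFinder {
    static List<Path> findLogs(Path userlogs, String name) throws IOException {
        try (Stream<Path> files = Files.walk(userlogs)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().equals(name))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

For the failure above, one would point this at the job's directory (e.g. userlogs/job_201301241037_0001) and read each attempt's stderr.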
The code path behind my error is the following, in TaskRunner.run:
launchJvmAndWait(setupCmds, vargs, stdout, stderr, logSize, workDir);
tracker.getTaskTrackerInstrumentation().reportTaskEnd(t.getTaskID());
if (exitCodeSet) {
  if (!killed && exitCode != 0) {
    if (exitCode == 65) {
      tracker.getTaskTrackerInstrumentation().taskFailedPing(t.getTaskID());
    }
    throw new IOException("Task process exit with nonzero status of " +
                          exitCode + ".");
  }
}
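The mechanism TaskRunner relies on here is ordinary process supervision: the parent waits for the child and treats any nonzero status as failure. A minimal sketch of that check in plain Java (my own illustration, not Hadoop source; it assumes a Unix shell is available):

```java
import java.io.IOException;

// Minimal sketch of the parent-side check TaskRunner performs:
// wait for the child process, then treat a nonzero status as failure.
public class ChildExitDemo {
    // Runs a shell command and returns its exit status (assumes "bash" exists).
    static int runChild(String command) throws IOException, InterruptedException {
        Process child = new ProcessBuilder("bash", "-c", command).start();
        return child.waitFor();
    }

    public static void main(String[] args) throws Exception {
        int exitCode = runChild("exit 1"); // stand-in for a child JVM dying early
        if (exitCode != 0) {
            // Same shape of error TaskRunner raises:
            throw new IOException(
                "Task process exit with nonzero status of " + exitCode + ".");
        }
    }
}
```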
public void launchJvm(TaskRunner t, JvmEnv env)
    throws IOException, InterruptedException {
  if (t.getTask().isMapTask()) {
    mapJvmManager.reapJvm(t, env);
  } else {
    reduceJvmManager.reapJvm(t, env);
  }
}
private synchronized void reapJvm(TaskRunner t, JvmEnv env)
    throws IOException, InterruptedException {
  if (t.getTaskInProgress().wasKilled()) {
    //the task was killed in-flight
    //no need to do the rest of the operations
    return;
  }
  boolean spawnNewJvm = false;
  JobID jobId = t.getTask().getJobID();
  //Check whether there is a free slot to start a new JVM.
  //,or, Kill a (idle) JVM and launch a new one
  //When this method is called, we *must*
  // (1) spawn a new JVM (if we are below the max)
  // (2) find an idle JVM (that belongs to the same job), or,
  // (3) kill an idle JVM (from a different job)
  // (the order of return is in the order above)
  int numJvmsSpawned = jvmIdToRunner.size();
  JvmRunner runnerToKill = null;
  if (numJvmsSpawned >= maxJvms) {
    //go through the list of JVMs for all jobs.
    Iterator<Map.Entry<JVMId, JvmRunner>> jvmIter =
      jvmIdToRunner.entrySet().iterator();
    while (jvmIter.hasNext()) {
      JvmRunner jvmRunner = jvmIter.next().getValue();
      JobID jId = jvmRunner.jvmId.getJobId();
      //look for a free JVM for this job; if one exists then just break
      if (jId.equals(jobId) && !jvmRunner.isBusy() && !jvmRunner.ranAll()) {
        setRunningTaskForJvm(jvmRunner.jvmId, t); //reserve the JVM
        LOG.info("No new JVM spawned for jobId/taskid: " +
                 jobId + "/" + t.getTask().getTaskID() +
                 ". Attempting to reuse: " + jvmRunner.jvmId);
        return;
      }
      //Cases when a JVM is killed:
      // (1) the JVM under consideration belongs to the same job
      //     (passed in the argument). In this case, kill only when
      //     the JVM ran all the tasks it was scheduled to run (in terms
      //     of count).
      // (2) the JVM under consideration belongs to a different job and is
      //     currently not busy
      //But in both the above cases, we see if we can assign the current
      //task to an idle JVM (hence we continue the loop even on a match)
      if ((jId.equals(jobId) && jvmRunner.ranAll()) ||
          (!jId.equals(jobId) && !jvmRunner.isBusy())) {
        runnerToKill = jvmRunner;
        spawnNewJvm = true;
      }
    }
  } else {
    spawnNewJvm = true;
  }
  if (spawnNewJvm) {
    if (runnerToKill != null) {
      LOG.info("Killing JVM: " + runnerToKill.jvmId);
      killJvmRunner(runnerToKill);
    }
    spawnNewJvm(jobId, env, t);
    return;
  }
  //*MUST* never reach this
  LOG.fatal("Inconsistent state!!! " +
            "JVM Manager reached an unstable state " +
            "while reaping a JVM for task: " + t.getTask().getTaskID() +
            " " + getDetails() + ". Aborting. ");
  System.exit(-1);
}
Execution ultimately lands in spawnNewJvm. That method starts a thread, a JvmRunner thread, which launches the child JVM; inside the thread's run method there is a runChild method, which executes:
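Stripped of the bookkeeping, reapJvm makes a three-way choice: with a free slot it spawns directly; at the slot limit it first tries to reuse an idle JVM of the same job, and otherwise kills a finished or idle one and spawns fresh. The sketch below is my own simplified model of that decision; the class, method and enum names are invented, not Hadoop's:

```java
// Simplified model (not Hadoop source) of reapJvm's decision logic.
public class JvmReuseSketch {
    enum Decision { SPAWN, REUSE, KILL_AND_SPAWN }

    static Decision decide(int jvmsSpawned, int maxJvms,
                           boolean idleSameJobJvm, boolean killableJvm) {
        if (jvmsSpawned < maxJvms) {
            return Decision.SPAWN;          // below the max: just spawn (case 1)
        }
        if (idleSameJobJvm) {
            return Decision.REUSE;          // idle JVM of the same job (case 2)
        }
        if (killableJvm) {
            return Decision.KILL_AND_SPAWN; // reap an idle/finished JVM (case 3)
        }
        // corresponds to the LOG.fatal "Inconsistent state" branch
        throw new IllegalStateException("Inconsistent state");
    }
}
```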
exitCode = tracker.getTaskController().launchTask(user,
    jvmId.jobId.toString(), taskAttemptIdStr, env.setup,
    env.vargs, env.workDir, env.stdout.toString(),
    env.stderr.toString());
Inside DefaultTaskController.launchTask, the launch commands are written to a script file and executed through bash; when the script fails, the nonzero exit code is logged here:
  String commandFile = writeCommand(cmdLine, rawFs, p);
  rawFs.setPermission(p, TaskController.TASK_LAUNCH_SCRIPT_PERMISSION);
  shExec = new ShellCommandExecutor(
      new String[]{"bash", "-c", commandFile},
      currentWorkDirectory);
  shExec.execute();
} catch (Exception e) {
  if (shExec == null) {
    return -1;
  }
  int exitCode = shExec.getExitCode();
  LOG.warn("Exit code from task is : " + exitCode);
  LOG.info("Output from DefaultTaskController's launchTask follows:");
  logOutput(shExec.getOutput());
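What launchTask does with the command can be approximated in a few lines: write the commands to a script, make it executable, run it under bash, and keep both the exit code and the merged output. Everything below (class name, method, temp paths) is my own illustration under those assumptions, not Hadoop's DefaultTaskController:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;

// Rough stand-in for DefaultTaskController.launchTask: write the launch
// commands into a script file, execute it via bash, and capture the exit
// code plus output (the part Hadoop prints through logOutput).
public class LaunchScriptDemo {
    static int launch(String commands, StringBuilder output)
            throws IOException, InterruptedException {
        Path script = Files.createTempFile("taskjvm", ".sh");
        Files.write(script, ("#!/bin/bash\n" + commands).getBytes());
        script.toFile().setExecutable(true); // stand-in for setPermission(...)
        Process p = new ProcessBuilder("bash", "-c", script.toString())
                .redirectErrorStream(true)   // merge stderr into stdout
                .start();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        return p.waitFor();
    }
}
```

My reading of the trace: if the environment set up by the script is broken (bad java path, missing work directory, unsatisfiable heap settings), the script can die before the child JVM prints anything, which would leave empty task logs and an exit code of 1, as seen above.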
On the TaskTracker side, this produced:
2013-01-24 10:41:36,101 WARN org.apache.hadoop.mapred.DefaultTaskController: Exit code from task is : 1
2013-01-24 10:41:36,101 INFO org.apache.hadoop.mapred.DefaultTaskController: Output from DefaultTaskController's launchTask follows:
2013-01-24 10:41:36,101 INFO org.apache.hadoop.mapred.TaskController:
2013-01-24 10:41:36,101 INFO org.apache.hadoop.mapred.JvmManager: JVM Not killed jvm_201301241037_0001_r_-1615671564 but just removed
2013-01-24 10:41:36,101 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301241037_0001_r_-1615671564 exited with exit code 1. Number of tasks it ran: 0
2013-01-24 10:41:36,101 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201301241037_0001_r_000002_0 : Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)