Wrestling with Nutch, Part 1

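Problem description:
While running generate with Nutch 1.0 against a crawldb containing 500 million URLs, the input was split by default into 64 MB blocks, which produced 777 map tasks. Late in the run, the following exception appeared:
Could not find taskTracker/jobcache/job_200903231519_0017/attempt_200903231519_0017_r_000051_0/output/file.out in any of the configured local directories
Solution:
Reduce the number of tasks by switching to a strategy that creates one split per file in the crawldb: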
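  // Needs java.io.IOException, org.apache.hadoop.fs.FileStatus,
  // org.apache.hadoop.io.Writable, org.apache.hadoop.io.WritableComparable, and
  // FileSplit, InputSplit, JobConf, SequenceFileInputFormat from org.apache.hadoop.mapred.
  public static class InputFormat extends SequenceFileInputFormat<WritableComparable, Writable> {
    /** Don't split inputs, to keep things polite. */
    @Override
    public InputSplit[] getSplits(JobConf job, int nSplits)
      throws IOException {
      // listStatus returns the input files for this job.
      FileStatus[] files = listStatus(job);
      InputSplit[] splits = new InputSplit[files.length];
      for (int i = 0; i < files.length; i++) {
        FileStatus cur = files[i];
        // A single split covering the whole file; no host hints are given.
        splits[i] = new FileSplit(cur.getPath(), 0,
            cur.getLen(), (String[])null);
      }
      return splits;
    }
  }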
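
To see why one split per file cuts the task count, here is a minimal sketch in plain Java with no Hadoop dependency. The class SplitCountSketch, its method names, and the file sizes are all hypothetical, and the block-based count is simplified to ceil(length / blockSize) per file, so it only approximates what Hadoop itself would compute.

```java
import java.util.Arrays;

/** Hypothetical sketch: compares split counts for two splitting strategies. */
class SplitCountSketch {

    /** Simplified block-based splitting: each file yields ceil(len / blockSize) splits. */
    static long blockBasedSplits(long[] fileLengths, long blockSize) {
        long total = 0;
        for (long len : fileLengths) {
            total += (len + blockSize - 1) / blockSize; // ceiling division
        }
        return total;
    }

    /** One split per file, mirroring the per-file strategy. */
    static long perFileSplits(long[] fileLengths) {
        return fileLengths.length;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;
        // Assumed layout: 10 part files of 1 GB each (illustrative only).
        long[] files = new long[10];
        Arrays.fill(files, 1024 * mb);
        System.out.println("block-based: " + blockBasedSplits(files, blockSize));
        System.out.println("per-file:    " + perFileSplits(files));
    }
}
```

The trade-off is that each map task now reads a whole file, so individual tasks run much longer than before.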


This raised a new problem: several tasks went ten minutes without responding, which caused the whole job to fail.
Solution:
Edit hadoop-site.xml:
<property>
  <name>mapred.task.timeout</name>
  <value>3600000</value>
  <description>The number of milliseconds before a task will be
  terminated if it neither reads an input, writes an output, nor
  updates its status string.
  </description>
</property>
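
The value is in milliseconds, so it is easy to be off by a factor of ten. As a quick sanity check, the sketch below converts minutes to the millisecond figure; TaskTimeoutSketch and timeoutMillis are hypothetical names used only for illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: converts a timeout in minutes to milliseconds. */
class TaskTimeoutSketch {

    static long timeoutMillis(long minutes) {
        return TimeUnit.MINUTES.toMillis(minutes);
    }

    public static void main(String[] args) {
        // The ten-minute limit that killed the tasks.
        System.out.println("10 minutes = " + timeoutMillis(10) + " ms");
        // The one-hour limit configured above.
        System.out.println("60 minutes = " + timeoutMillis(60) + " ms");
    }
}
```

So 3600000 raises the limit from ten minutes to one hour.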


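Summary:
Big versus small, many versus few, long versus short: the right choice keeps changing with the situation. With large data volumes in particular, you have to adapt to the actual circumstances. As the saying goes, the art of application lies in one's own judgment.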
