While deploying the Nutch 1.2 source through MyEclipse on Windows 7, the following exception is thrown:
2011-10-28 00:09:37,784 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(256)) - job_local_0001
java.io.IOException: Expecting a line not the end of stream
at org.apache.hadoop.fs.DF.parseExecResult(DF.java:109)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1129)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2011-10-28 00:09:38,174 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1288)) - map 0% reduce 0%
2011-10-28 00:09:38,174 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1343)) - Job complete: job_local_0001
2011-10-28 00:09:38,174 INFO mapred.JobClient (Counters.java:log(514)) - Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
Searching online turned up plenty of suggestions (change the Cygwin locale settings, fix file permissions), but none of them worked when I tried them, so I had to trace the problem myself. The first step was to find the method that throws the exception, parseExecResult in the DF class:
protected void parseExecResult(BufferedReader lines) throws IOException {
  lines.readLine();                         // skip headings

  String line = lines.readLine();
  if (line == null) {
    throw new IOException("Expecting a line not the end of stream");
  }
  StringTokenizer tokens =
    new StringTokenizer(line, " \t\n\r\f%");

  this.filesystem = tokens.nextToken();
  if (!tokens.hasMoreTokens()) {            // for long filesystem name
    line = lines.readLine();
    if (line == null) {
      throw new IOException("Expecting a line not the end of stream"); // this is line 105
    }
    tokens = new StringTokenizer(line, " \t\n\r\f%");
  }

  this.capacity = Long.parseLong(tokens.nextToken()) * 1024;
  this.used = Long.parseLong(tokens.nextToken()) * 1024;
  this.available = Long.parseLong(tokens.nextToken()) * 1024;
  this.percentUsed = Integer.parseInt(tokens.nextToken());
  this.mount = tokens.nextToken();
}
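To see concretely how this parser reads df output, here is a self-contained sketch of the same tokenizing logic (my own demo class, not Hadoop's code; the sample df lines are made up for illustration). A well-formed two-line output parses fine, while output whose lines get mis-split, as happens with mojibake, leaves the reader at end-of-stream and triggers exactly the exception from the log:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.StringTokenizer;

public class DfParseDemo {
    // Parse the "Available" column (in KB) from df -k style output,
    // following the same steps as DF.parseExecResult.
    static long parseAvailableKb(String dfOutput) throws IOException {
        BufferedReader lines = new BufferedReader(new StringReader(dfOutput));
        lines.readLine();                      // skip the header line
        String line = lines.readLine();
        if (line == null) {
            throw new IOException("Expecting a line not the end of stream");
        }
        StringTokenizer tokens = new StringTokenizer(line, " \t\n\r\f%");
        tokens.nextToken();                    // filesystem
        tokens.nextToken();                    // capacity
        tokens.nextToken();                    // used
        return Long.parseLong(tokens.nextToken()); // available (KB)
    }

    public static void main(String[] args) throws IOException {
        // Well-formed output: one header line, then one data line.
        String ok = "Filesystem 1K-blocks Used Available Use% Mounted on\n"
                  + "C:\\cygwin 104857600 52428800 52428800 50 /\n";
        System.out.println(parseAvailableKb(ok));

        // Mis-split output: the data ended up on the header line, so the
        // second readLine() returns null and the IOException is thrown,
        // matching the Nutch job log above.
        String garbled = "Filesystem 1K-blocks Used Available Use% Mounted on "
                       + "C:\\cygwin 104857600 52428800 52428800 50 /\n";
        try {
            parseAvailableKb(garbled);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```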
Printing the value at line 103 shows that a line is in fact read, but it is mojibake: the output is mis-split across lines, with the second line's data pushed up into the first line, which leads to the error at line 105. Setting MyEclipse's build encoding as described in the post at http://hi.baidu.com/amdkings/blog/item/b589a5f56c1ddae17609d78f.html still did not help, so I traced further back. The exception originates from this stack frame:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:179)
that is, the call inside the Shell class's runCommand method:
parseExecResult(inReader); // parse the output
In that method, the inReader variable is declared and initialized as follows:
BufferedReader inReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
Clearly, inReader is constructed without an explicit charset, so it falls back to the platform default. Specify the charset explicitly:
BufferedReader inReader = new BufferedReader(new InputStreamReader(process.getInputStream(), "utf-8"));
Run again, and from this point on the job proceeds correctly.
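For a self-contained illustration of why the explicit charset matters, here is a minimal sketch (my own demo, not Hadoop code). It decodes the same UTF-8 bytes twice: once with a matching explicit charset, and once with GBK, which is a typical platform default on Chinese Windows and therefore what the charset-less InputStreamReader would silently pick up there:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class CharsetDemo {
    public static void main(String[] args) throws IOException {
        // "文件系统" ("filesystem" in Chinese) encoded as UTF-8 bytes,
        // the way a UTF-8 Cygwin df would emit its localized header.
        byte[] utf8Bytes = "\u6587\u4ef6\u7cfb\u7edf".getBytes("UTF-8");

        // Decoding with the matching explicit charset recovers the text.
        String asUtf8 = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), "UTF-8")).readLine();

        // Decoding the same bytes with GBK produces mojibake instead.
        String asGbk = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), "GBK")).readLine();

        System.out.println(asUtf8.equals("\u6587\u4ef6\u7cfb\u7edf"));
        System.out.println(asGbk.equals("\u6587\u4ef6\u7cfb\u7edf"));
    }
}
```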
From this analysis:

Temporary workaround:
Set the encoding on the inReader variable in Shell.java, that is
BufferedReader inReader = new BufferedReader(new InputStreamReader(process.getInputStream(), "utf-8"));

Better approach:
Setting an English locale in Cygwin with export LANG="en.UTF-8" does make df print in English, but that setting does not carry over into MyEclipse, where the output is still garbled Chinese. A cleaner fix would be to install or otherwise configure Cygwin as an English environment that MyEclipse also picks up; then the Nutch 1.2 source could run without any modification.
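A related idea along the same lines, sketched here as an assumption rather than anything Hadoop or Nutch actually does: when you control the process launch yourself, you can force an English locale onto the child process via ProcessBuilder, so df prints ASCII headers regardless of the IDE's environment. The locale name en_US.UTF-8 is illustrative, and the df invocation is guarded because it may not exist on a plain Windows system:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class EnglishLocaleDf {
    public static void main(String[] args) {
        ProcessBuilder pb = new ProcessBuilder("df", "-k");
        // Override the locale for the child process only; the parent
        // JVM and the IDE are unaffected.
        pb.environment().put("LANG", "en_US.UTF-8");
        pb.environment().put("LC_ALL", "en_US.UTF-8");
        System.out.println(pb.environment().get("LANG"));

        try {
            Process p = pb.start();
            // Still decode with an explicit charset, matching the locale.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);
            }
        } catch (IOException e) {
            // df is not available on this system (e.g. plain Windows).
            System.out.println("df not available: " + e.getMessage());
        }
    }
}
```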