1.1.1 Hadoop error message during crawl
After configuring Nutch, running the nutch crawl command in Cygwin produces:
[Fatal Error] hadoop-site.xml:15:7: The content of elements must consist of well-formed character data or markup.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: The content of elements must consist of well-formed character data or markup.
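Solution:
hadoop-site.xml is not well-formed: one of its tags has an extra angle bracket in front of it. The "15:7" in the message is the line and column where the parser gave up, so start looking there.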
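
To locate this kind of mistake before starting a crawl, the file content can be passed through a plain XML parser. The following is a minimal sketch that assumes only the parser shipped with the JDK. The class name WellFormedCheck, the check helper, and the sample fragment are made up for illustration and are not part of Nutch or Hadoop.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class WellFormedCheck {

    /** Returns "OK" if the XML parses, otherwise "line:column: message". */
    static String check(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return "OK";
        } catch (SAXParseException e) {
            // The parser reports where it stopped, like the 15:7 in the log.
            return e.getLineNumber() + ":" + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            return "error: " + e;
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment with one extra '<' before the name tag.
        String broken =
                "<configuration>\n"
              + "  <property>\n"
              + "    <<name>fs.default.name</name>\n"
              + "  </property>\n"
              + "</configuration>\n";
        String fixed = broken.replace("<<name>", "<name>");

        System.out.println("broken: " + check(broken));
        System.out.println("fixed:  " + check(fixed));
    }
}
```

To check the real file, read its content into a string and pass it to check; the reported line and column should point at or near the stray bracket.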
1.1.2 Running crawl fails with "Job failed"
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
Solution:
This is usually caused by an incorrect edit of the MY.DOMAIN.NAME entry in crawl-urlfilter.txt.
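
A quick way to test an edited rule is to run the regular expression against a seed URL outside Nutch. The sketch below assumes the stock accept rule has the form +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ and that the filter accepts a URL when the expression is found in it. The class UrlFilterCheck, the accepts helper, and the apache.org URLs are illustrative only.

```java
import java.util.regex.Pattern;

class UrlFilterCheck {

    /**
     * Builds the accept expression from the assumed stock rule, with
     * MY.DOMAIN.NAME replaced by the given domain, and tests one URL.
     */
    static boolean accepts(String domain, String url) {
        Pattern rule = Pattern.compile("^http://([a-z0-9]*\\.)*" + domain + "/");
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String seed = "http://lucene.apache.org/nutch/";

        // Placeholder replaced with the domain being crawled: accepted.
        System.out.println(accepts("apache.org", seed));

        // Placeholder left unchanged: the seed URL is rejected.
        System.out.println(accepts("MY.DOMAIN.NAME", seed));

        // A URL outside the configured domain is rejected as well.
        System.out.println(accepts("apache.org", "http://www.example.com/"));
    }
}
```

If the rule rejects every seed URL, the crawl has nothing to fetch, which is one plausible way for the later steps to end in "Job failed".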
1.1.3 Another "Job failed"
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
Solution:
Usually caused by an incorrect edit of the MY.DOMAIN.NAME entry in crawl-urlfilter.txt.
1.1.4 Running Nutch in Eclipse: Job failed
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
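Solution:
This is caused by the Java version configured in Eclipse. To fix it:
If the project was set to Java 1.4, change it to 1.6.
Project -> Properties -> Java Compiler
On the right, under JDK Compliance,
set "Compiler compliance level" to 6.0.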