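Out of the box, Nutch 2.3 officially supports HBase 0.94.14. To move to a newer HBase release, Gora has to be upgraded to 0.6 or later. Because Nutch 2.3 is still fairly new, there are not many installation tutorials for it online; what follows consolidates the steps from the tutorials I found:
1. Edit ivy/ivy.xml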
<dependency org="org.apache.gora" name="gora-core" rev="0.6" conf="*->default"/>
<!-- Uncomment this dependency -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />
<dependency org="org.apache.gora" name="gora-compiler-cli" rev="0.6" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-compiler" rev="0.6" conf="*->default"/>
<!-- Remove the Hadoop 1.2 dependencies, then add -->
<dependency org="org.apache.hadoop" name="hadoop-client" rev="2.5.2" conf="*->default"/>
2. Edit ivy/ivysettings.xml
Some jars may fail to download during the build. If that happens, change the following setting:
<property name="repository.apache.org" value="http://maven.restlet.org/" override="false"/>
3. Edit nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>plugins</value>
  </property>
</configuration>
4. Edit gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
5. Set the JDK version
If you are building with JDK 1.7, edit default.properties:
javac.version= 1.7
6. Build
ant runtime
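Once the build succeeds, you can crawl step by step with the following commands:
1. The injector job injects the seed URLs into the crawl queue to initialize it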
cd runtime/local
mkdir urls
echo "http://nutch.apache.org/" > urls/seed.txt
bin/nutch inject urls -crawlId test
The steps above ran without problems: a new table named test_webpage was created in HBase, and the corresponding data was written to it.
2. generate
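One way to confirm that the inject step really wrote rows is to count them from the HBase shell. The sketch below assumes the `hbase` launcher is on your PATH. The helper `webpage_table` is a hypothetical name introduced only for illustration, and it encodes nothing more than the naming seen above (crawl id `test` gives table `test_webpage`).

```shell
# Hypothetical helper: table name derived from a crawl id, following the
# naming observed above (crawl id "test" -> table "test_webpage").
webpage_table() {
    printf '%s_webpage\n' "$1"
}

table="$(webpage_table test)"

# Assumption: the `hbase` launcher is on PATH. If it is not, report and move on.
if command -v hbase >/dev/null 2>&1; then
    # `count` is a standard HBase shell command that prints the row count.
    echo "count '${table}'" | hbase shell
else
    echo "hbase not found on PATH; cannot inspect ${table}"
fi
```

A non-zero row count after `inject` suggests that the seed URL from urls/seed.txt reached the store.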
bin/nutch generate -crawlId test
Running the command above fails with the following error:
2015-12-03 16:19:43,423 WARN mapred.LocalJobRunner - job_local246507986_0001
java.lang.Exception: java.io.EOFException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.EOFException
    at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
    at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:128)
    at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:423)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.hadoop.io.serializer.avro.AvroSerialization$AvroDeserializer.deserialize(AvroSerialization.java:127)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:146)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2015-12-03 16:19:43,493 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1449130778-1365697545, jobid=job_local246507986_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:213)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:241)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:308)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:316)
While tracking this down, I first suspected an Avro version incompatibility, because Hadoop 2.5.2 ships Avro 1.7.4 while Nutch 2.3 uses Avro 1.7.6. Aligning the two versions did not help: the same error appeared. Another article mentions this problem and attributes it to an Avro bug, and there is a matching issue recorded in the Apache JIRA. That issue was filed in 2011, though, and it seemed implausible that it would still be unresolved in 2015. On reflection, the two problems probably only share the same symptom, so I kept digging. The stack trace points at serialization, and I eventually found an article saying that the following configuration has to be added to nutch-site.xml:
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization</value>
  <description>A list of serialization classes that can be used for obtaining serializers and deserializers.</description>
</property>
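It can be worth confirming that the property is present both in the source configuration and in the copy the runtime uses. The sketch below assumes it is run from the Nutch source root and that the two paths shown are where your configuration lives. `has_property` is a hypothetical helper, and it only does a plain text match on the `<name>` element, so it will not notice a property that has been commented out.

```shell
# Hypothetical helper: succeeds if the given Hadoop-style XML config file
# contains a <name> element with exactly the given property name.
# This is a fixed-string text match, not a real XML parse.
has_property() {
    file="$1"
    name="$2"
    grep -qF "<name>${name}</name>" "$file"
}

# Assumed paths, relative to the Nutch source root.
for f in conf/nutch-site.xml runtime/local/conf/nutch-site.xml; do
    if [ -f "$f" ] && has_property "$f" io.serializations; then
        echo "$f: io.serializations is set"
    else
        echo "$f: io.serializations is missing"
    fi
done
```

If only the copy under runtime/local is reported as missing, the runtime has probably not been rebuilt since conf/nutch-site.xml was edited.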
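After rebuilding with ant runtime, the error was gone. Since this problem took me several days to solve, I thought it was worth writing down, and I hope it helps anyone who runs into the same issue.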