Compiling and Installing Nutch 2.3 with HBase 0.98.8 Integration

Out of the box, Nutch 2.3 only supports HBase 0.94.14. To use a newer HBase release, you need to bump the Gora dependency to version 0.6 or later. Since Nutch 2.3 is still fairly new, there are not many installation guides online; what follows is a consolidated description of installing Nutch 2.3 and integrating it with HBase, pieced together from those guides:

1. Modify ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-core" rev="0.6" conf="*->default"/>  
<!--取消该注释--> 
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" /> 
<dependency org="org.apache.gora" name="gora-compiler-cli" rev="0.6" conf="*->default"/> 
<dependency org="org.apache.gora" name="gora-compiler" rev="0.6" conf="*->default"/>      
<!--将hadoop1.2相关的去掉,然后添加-->
<dependency org="org.apache.hadoop" name="hadoop-client" rev="2.5.2" conf="*->default"/>


2. Modify ivy/ivysettings.xml

Some jar files may fail to download during the build; change the following property:

<property name="repository.apache.org" value="http://maven.restlet.org/" override="false"/>


3. Modify nutch-site.xml

<configuration>
	<property>  
		<name>storage.data.store.class</name>   
		<value>org.apache.gora.hbase.store.HBaseStore</value>   
		<description>Default class for storing data</description>   
	</property>   
	<property>   
		<name>http.agent.name</name>   
		<value>My Nutch Spider</value>   
	</property>   
	<property>
		<name>plugin.folders</name>
		<value>plugins</value>
	</property>
</configuration>


4. Modify gora.properties

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
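
One related point that is easy to miss (an assumption on my part, not part of the original steps): gora-hbase finds your HBase cluster through an hbase-site.xml on the classpath. With a default local HBase install this just works, but for a non-default setup a common approach is to copy HBase's config into Nutch's conf directory before building. The paths below are only illustrative.

# illustrative paths; adjust to where HBase and Nutch actually live
cp $HBASE_HOME/conf/hbase-site.xml $NUTCH_HOME/conf/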


5. Adjust the JDK version

If you are building with JDK 1.7, set the compiler level in default.properties:

javac.version=1.7


6. Build

ant runtime
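
A successful ant runtime build produces the runnable distribution under runtime/: runtime/local for local-mode crawling (used in the steps below) and runtime/deploy for running on a Hadoop cluster. A quick check after the build:

ls runtime/
# deploy  local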


Once the build succeeds, you can run a crawl step by step with the following commands:

1. Inject: the InjectorJob seeds the crawl queue with the initial URLs

cd runtime/local

mkdir urls

echo "http://nutch.apache.org/" > urls/seed.txt

bin/nutch inject urls -crawlId test

These steps all work fine: a new table named test_webpage is created in HBase and the corresponding data is written into it.
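
To double-check, you can inspect the table from the HBase shell (a quick sanity check; this assumes HBase's bin/ directory is on your PATH):

hbase shell
# inside the shell:
list                                # test_webpage should be listed
scan 'test_webpage', {LIMIT => 1}   # peek at one of the injected rows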

2. Generate

bin/nutch generate -crawlId test

Running this command fails with the following error:

2015-12-03 16:19:43,423 WARN  mapred.LocalJobRunner - job_local246507986_0001
java.lang.Exception: java.io.EOFException
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.io.EOFException
    at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
    at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:128)
    at org.apache.avro.io.BinaryDecoder.readIndex(BinaryDecoder.java:423)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
    at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.hadoop.io.serializer.avro.AvroSerialization$AvroDeserializer.deserialize(AvroSerialization.java:127)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKeyValue(ReduceContextImpl.java:146)
    at org.apache.hadoop.mapreduce.task.ReduceContextImpl.nextKey(ReduceContextImpl.java:121)
    at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.nextKey(WrappedReducer.java:302)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:170)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2015-12-03 16:19:43,493 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=generate: 1449130778-1365697545, jobid=job_local246507986_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:213)
    at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:241)
    at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:308)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:316)
While tracking this down, I first suspected an Avro version conflict, since Hadoop 2.5.2 bundles Avro 1.7.4 while Nutch 2.3 uses Avro 1.7.6; however, even after aligning the two versions, the same error occurred. Another article mentions the same symptom and attributes it to an Avro bug, and there is indeed a matching issue in Apache's JIRA, but that ticket was opened back in 2011 and was still unresolved in 2015, which made a genuine unfixed core bug seem unlikely. On reflection, the Avro exception was probably just a surface symptom, so I kept digging. The stack trace points at the serialization layer, and I eventually found an article suggesting that the following property be added to nutch-site.xml:

<property>   
	<name>io.serializations</name>   
	<value>org.apache.hadoop.io.serializer.WritableSerialization</value>
	<description>A list of serialization classes that can be used for obtaining serializers and deserializers.</description> 
</property>

After re-running ant runtime, the error was gone. Since tracking this problem down took several days, I felt it was worth writing up, in the hope that it helps anyone who hits the same issue.
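
For completeness, with generate working again, the remaining steps of a step-by-step Nutch 2.x crawl are roughly the following (a sketch only; options such as -topN or -threads are omitted, and the crawlId matches the one used above):

bin/nutch fetch -all -crawlId test      # fetch the generated batch
bin/nutch parse -all -crawlId test      # parse the fetched pages
bin/nutch updatedb -all -crawlId test   # update the web table with the parse results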





