Prerequisites:
hadoop 2.7.1
hbase 0.98.13
solr 5.2.1 / Apache Solr 4.8.1
http://archive.apache.org/dist/lucene/solr/4.8.1/
gora 0.6.1
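Building Gora, then building and deploying Nutch
1. Download Gora
The latest Gora release is 0.6.1. Download it, or fetch the source directly with git:
git clone https://github.com/apache/gora.git
2. Edit the Gora pom.xml
The settings below are probably what finally lets Nutch 2.3 run; note that there is no 1.0.1.1-hadoop2 version :)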
<hadoop-1.version>1.2.1</hadoop-1.version>
<hadoop-2.version>2.7.1</hadoop-2.version>
<hadoop-1.test.version>1.2.1</hadoop-1.test.version>
<hadoop-2.test.version>2.7.1</hadoop-2.test.version>
<hbase.version>0.98.13-hadoop2</hbase.version>
<hbase.test.version>0.98.13-hadoop2</hbase.test.version>
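Before building, it can help to confirm that the edited values are the ones actually present in the file. A minimal sketch, assuming each property appears in plain `<name>value</name>` form; `pom_property` is a hypothetical helper written for this note, not part of Gora or Maven:

```shell
#!/bin/sh
# pom_property FILE NAME
# Print the text of the first <NAME>...</NAME> element found in FILE.
# This is plain text matching with sed, not a real XML parser.
pom_property() {
    sed -n "s:.*<$2>\([^<]*\)</$2>.*:\1:p" "$1" | head -n 1
}

# Example (run from the Gora source directory):
#   pom_property pom.xml hbase.version
#   pom_property pom.xml hadoop-2.version
```

If the printed values differ from the ones listed in this step, the edit probably landed in the wrong place in pom.xml.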
3. Build Gora
mvn clean install -DskipTests
mvn install -DskipTests
4. Edit $NUTCH_HOME/conf/nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
5. Edit $NUTCH_HOME/ivy/ivy.xml
Change the rev of every "org.apache.gora" dependency to 0.6, for example:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
=>
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />
Remove the "org.apache.hadoop" dependencies and add:
<dependency org="org.apache.hadoop" name="hadoop-client" rev="2.7.1" conf="*->default"/>
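The rev change can be scripted rather than done by hand. A rough sketch, assuming each `<dependency>` element sits on its own line in ivy.xml; `set_gora_rev` is a hypothetical helper, and it only touches lines that mention org="org.apache.gora":

```shell
#!/bin/sh
# set_gora_rev FILE REV
# Rewrite rev="..." on every line of FILE that mentions org="org.apache.gora".
# The original file is kept as FILE.bak.
set_gora_rev() {
    sed -i.bak -e "/org=\"org\.apache\.gora\"/ s/rev=\"[^\"]*\"/rev=\"$2\"/" "$1"
}

# Example:
#   set_gora_rev "$NUTCH_HOME/ivy/ivy.xml" 0.6
```

Commented-out Gora dependencies on matching lines are rewritten too, which is harmless. Removing the old org.apache.hadoop entries is still a manual edit.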
6. Edit $NUTCH_HOME/ivy/ivysettings.xml
<ivysettings>
  <settings defaultResolver="default"/>
  <property name="m2-pattern" value="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]" override="false" />
  <resolvers>
    <chain name="default">
      <filesystem name="local-maven2" m2compatible="true" >
        <artifact pattern="${m2-pattern}"/>
        <ivy pattern="${m2-pattern}"/>
      </filesystem>
      <ibiblio name="central" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>
7. Add to $NUTCH_HOME/conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
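If the setup is re-run, appending this line blindly would add it more than once. A small sketch of an idempotent append; `add_property_if_missing` is a hypothetical helper, and it deliberately ignores commented-out lines that start with `#`:

```shell
#!/bin/sh
# add_property_if_missing FILE KEY VALUE
# Append KEY=VALUE to FILE unless an uncommented line starting with "KEY=" exists.
# KEY is used as a grep pattern, so its dots match any character; that is
# good enough for this check. An existing different value is left untouched.
add_property_if_missing() {
    if ! grep -q "^$2=" "$1" 2>/dev/null; then
        printf '%s=%s\n' "$2" "$3" >> "$1"
    fi
}

# Example:
#   add_property_if_missing "$NUTCH_HOME/conf/gora.properties" \
#       gora.datastore.default org.apache.gora.hbase.store.HBaseStore
```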
8. Edit $NUTCH_HOME/conf/regex-urlfilter.txt and $NUTCH_HOME/conf/nutch-default.xml as needed
These can be left unchanged.
9. Build; this takes a long time
ant runtime
10. Copy the hadoop*.jar files from the Gora tree into runtime/local/lib/
cp /disk/gora/gora-core/lib/hadoop* /disk2/nutch/nutch-2.3/runtime/local/lib/
11. Create the seed URL list
mkdir urls
echo http://nutch.apache.org/ >> urls/seek.txt
12. Test run
cd runtime/local/
bin/nutch inject urls/seek.txt
Deploying and running Solr 5.2.1
1. Download and extract
2. example/example-DIH contains a complete Solr home configuration; copy it to server/solr
cp -rf /disk2/solr/solr-5.2.1/example/example-DIH/solr/* /disk2/solr/solr-5.2.1/server/solr/
3. Fix an error Nutch may hit at runtime: Error 404: Prob accessing /solr/solr/update. Reason: Not Found
cd /disk2/solr/solr-5.2.1/server/solr
cp /disk2/solr/solr-5.2.1/example/exampledocs/monitor.xml .
curl http://127.0.0.1:8983/solr/solr/update --data-binary @monitor.xml -H 'Content-type:application/xml'
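The doubled `solr` in the path is easy to get wrong. Judging by the paths used in this write-up, the first `solr` is the web application context and the second is the name of a core copied over from example-DIH. A sketch that builds the update URL and prints only the HTTP status; both function names are hypothetical, and the status check needs curl and a running Solr:

```shell
#!/bin/sh
# solr_update_url BASE CORE
# Build the update endpoint for a core, e.g. http://127.0.0.1:8983/solr + solr.
# A trailing slash on BASE is dropped.
solr_update_url() {
    printf '%s/%s/update\n' "${1%/}" "$2"
}

# solr_http_status URL
# Print only the HTTP status code returned for URL.
solr_http_status() {
    curl -s -o /dev/null -w '%{http_code}\n' "$1"
}

# Example:
#   solr_http_status "$(solr_update_url http://127.0.0.1:8983/solr solr)"
```

A 404 from this check most likely means the core name in the URL does not match a core directory under server/solr.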
4. To run a Nutch crawl, also edit /disk2/solr/solr-5.2.1/server/solr/solr/conf/schema.xml and add:
<field name="host" type="string" stored="false" indexed="true"/>
<field name="site" type="string" stored="false" indexed="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="stamp" type="date" stored="true" indexed="false"/>
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
5. bin/solr start
6. Open http://192.168.1.106:8983/solr
7. bin/crawl urls/seek.txt TestCrawl http://192.168.1.106:8983/solr/solr 2
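Before starting the crawl, it may be worth checking that none of the Nutch fields was left out of schema.xml. A minimal sketch, assuming each field is declared as `<field name="..."` with `name` as the first attribute; `missing_schema_fields` is a hypothetical helper and does plain text matching only:

```shell
#!/bin/sh
# missing_schema_fields SCHEMA
# Print each field name from the schema.xml step that has no
# <field name="..."> declaration in SCHEMA. Prints nothing if all are present.
missing_schema_fields() {
    for f in host site cache digest segment boost tstamp stamp anchor; do
        grep -q "<field name=\"$f\"" "$1" || echo "$f"
    done
}

# Example:
#   missing_schema_fields /disk2/solr/solr-5.2.1/server/solr/solr/conf/schema.xml
```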
FAQ
Below are the infuriating problems I ran into along the way...
1. Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob:
ant runtime has not been run.
2. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
Nutch 2.3 needs the hbase-common*.jar, hbase-client*.jar, and hbase-protocol*.jar files from HBase 0.98.13. Do not use the ones from HBase 1.0.1.1.
cd /disk2/hbase/hbase-0.98.13-hadoop2/lib
cp hbase-common* /disk2/nutch/nutch-2.3/runtime/local/lib/
cp hbase-client-0.98.13-hadoop2.jar /disk2/nutch/nutch-2.3/runtime/local/lib/
cp hbase-protocol* /disk2/nutch/nutch-2.3/runtime/local/lib/
3. Exception in thread "main" java.lang.NoSuchFieldError: HBASE_CLIENT_PREFETCH_LIMIT
Same cause as above: the HBase and Nutch versions do not match.
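Both HBase errors above come down to the wrong jars sitting in runtime/local/lib/. A small sketch for spotting them, assuming the version string appears in each jar's file name; `mismatched_hbase_jars` is a hypothetical helper:

```shell
#!/bin/sh
# mismatched_hbase_jars LIBDIR VERSION
# Print every hbase-*.jar in LIBDIR whose file name does not contain VERSION.
mismatched_hbase_jars() {
    for jar in "$1"/hbase-*.jar; do
        [ -e "$jar" ] || continue      # the glob matched nothing
        case "$(basename "$jar")" in
            *"$2"*) ;;                  # version matches, nothing to report
            *) echo "$jar" ;;
        esac
    done
}

# Example:
#   mismatched_hbase_jars /disk2/nutch/nutch-2.3/runtime/local/lib 0.98.13
```

Any jar it prints is a candidate for removal before copying in the 0.98.13 ones.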
4. 2015-07-21 13:53:53,238 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Just give it the native libraries:
mkdir -p /disk2/nutch/nutch-2.3/runtime/local/lib/native/Linux-amd64-64/
cd /disk2/hadoop/hadoop-2.7.1/lib/native/
cp * /disk2/nutch/nutch-2.3/runtime/local/lib/native/Linux-amd64-64/
cp * /disk2/nutch/nutch-2.3/runtime/local/lib/native/
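To confirm the copy did something, one can check that a libhadoop shared object is now present in the target directory. A minimal sketch; `has_native_hadoop` is a hypothetical helper, and it only checks that the file exists, not that it loads on this platform:

```shell
#!/bin/sh
# has_native_hadoop DIR
# Succeed (exit status 0) if DIR contains at least one libhadoop.so* file.
has_native_hadoop() {
    for f in "$1"/libhadoop.so*; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Example:
#   has_native_hadoop /disk2/nutch/nutch-2.3/runtime/local/lib/native && echo found
```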