Operating system: CentOS 6.4
HBase
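HBase can run on the local file system, but that does not guarantee data durability. A production environment should use HDFS as the backing store.
Set the hostname and the hosts file
Set the machine's hostname and add a matching entry to the hosts file.
Edit /etc/sysconfig/network: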
NETWORKING=yes
HOSTNAME=nutch.bis.com.cn
The change only takes effect after a reboot. To apply it without rebooting, run the following command and then log in again.
hostname nutch.bis.com.cn
Edit /etc/hosts:
[root@nutch apache-nutch-2.2.1]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.35.0.118 nutch.bis.com.cn
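A quick way to confirm that the new entry is in place is to look the name up in the file. The sketch below is not part of the original steps: hosts_has_entry is a hypothetical helper name, and it only inspects the file contents, it does not test real name resolution.

```shell
# Succeeds if a hosts-format file maps the given hostname to some address.
# Usage: hosts_has_entry <hosts_file> <hostname>
hosts_has_entry() {
  awk -v name="$2" '
    { sub(/#.*/, "") }   # drop comments
    { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Hostname taken from the example above; replace it with your own.
if hosts_has_entry /etc/hosts nutch.bis.com.cn; then
  echo "hosts entry found"
else
  echo "hosts entry missing"
fi
```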
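Download and install HBase
Nutch 2.x supports HBase 0.90.4; in theory any release on the 0.90.x branch should work. Other versions fail because of incompatible jars.
Official site: hbase-0.90.4.tar.gz
After downloading, extract the archive: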
tar xvf hbase-0.90.4.tar.gz
cd hbase-0.90.4
Edit conf/hbase-site.xml and set hbase.rootdir and hbase.zookeeper.property.dataDir. If they are not set, HBase stores its data under /tmp, where it is lost after a reboot.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///opt/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/opt/zookeeper</value>
</property>
</configuration>
Replace each value with the path where the data should be stored.
Start HBase
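To double-check which paths the file actually sets, the values can be read back from the XML. This is a minimal sketch, assuming each <name> and <value> element sits on its own line as in the file above; get_property is a hypothetical helper, not an HBase tool.

```shell
# Print the <value> that follows <name>NAME</name> in a Hadoop-style site file.
# Usage: get_property <site_file> <property_name>
get_property() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { want = 1 }
    want && match($0, /<value>[^<]*<\/value>/) {
      # Skip the 7-character <value> tag and drop the 8-character </value> tag.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Run from the HBase directory; nothing is printed if the file is absent.
if [ -f conf/hbase-site.xml ]; then
  get_property conf/hbase-site.xml hbase.rootdir
  get_property conf/hbase-site.xml hbase.zookeeper.property.dataDir
fi
```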
./bin/start-hbase.sh
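One way to check that the start script worked is to look for the HMaster process. The sketch below assumes the JDK's jps tool is on the PATH; has_hmaster is a hypothetical helper that reads a jps-style process list from standard input.

```shell
# Succeeds when the jps-style listing on stdin contains an HMaster entry.
has_hmaster() {
  awk '$2 == "HMaster" { found = 1 } END { exit (found ? 0 : 1) }'
}

if command -v jps >/dev/null 2>&1; then
  if jps | has_hmaster; then
    echo "HBase is running"
  else
    echo "HMaster not found; check the HBase logs"
  fi
else
  echo "jps not found; is a JDK installed?"
fi
```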
Nutch
Download the latest Nutch release, 2.2.1
Official site: apache-nutch-2.2.1-src.tar.gz
Extract Nutch:
tar xvf apache-nutch-2.2.1-src.tar.gz
cd apache-nutch-2.2.1
Edit conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available:
org.apache.gora.sql.store.SqlStore
Default store. A DataStore implementation for RDBMS with a SQL interface.
SqlStore uses JDBC drivers to communicate with the DB. As explained in
ivy.xml, currently >= gora-core 0.3 is not backwards compatible with
SqlStore.
org.apache.gora.cassandra.store.CassandraStore
Gora class for storing data in Apache Cassandra.
org.apache.gora.hbase.store.HBaseStore
Gora class for storing data in Apache HBase.
org.apache.gora.accumulo.store.AccumuloStore
Gora class for storing data in Apache Accumulo.
org.apache.gora.avro.store.AvroStore
Gora class for storing data in Apache Avro.
org.apache.gora.avro.store.DataFileAvroStore
Gora class for storing data in Apache Avro. DataFileAvroStore is
a file based store which uses Avro's DataFile{Writer,Reader}'s as a backend.
This datastore supports mapreduce.
org.apache.gora.memory.store.MemStore
Gora class for storing data in a Memory based implementation for tests.
</description>
</property>
<property>
<name>http.content.limit</name>
<value>6553600</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>http.agent.name</name>
<value>nbot</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>nbot,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.accept.language</name>
<value>zh-cn,ja-jp,en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
</configuration>
Edit ivy/ivy.xml and uncomment the gora-hbase dependency shown below. Several lines share the same org, so make sure the one you uncomment has name "gora-hbase".
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
Add the following line to conf/gora.properties to make HBase the default data store:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
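If this step is scripted, it helps to add the line only when it is not already there, so that repeated runs do not duplicate it. A minimal sketch follows; append_once is a hypothetical helper, and it only matches the exact line, so it does not replace a different existing gora.datastore.default setting.

```shell
# Append a line to a file unless that exact line is already present.
# Usage: append_once <file> <line>
append_once() {
  file=$1
  line=$2
  # Nothing to do if the exact line already exists.
  if grep -qxF -- "$line" "$file" 2>/dev/null; then
    return 0
  fi
  # Make sure existing content ends with a newline before appending.
  if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
    echo >> "$file"
  fi
  printf '%s\n' "$line" >> "$file"
}

# Run from the Nutch source directory; skipped if the file is absent.
if [ -f conf/gora.properties ]; then
  append_once conf/gora.properties \
    'gora.datastore.default=org.apache.gora.hbase.store.HBaseStore'
fi
```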
Build Nutch
Install ant:
yum install -y ant
Run the build:
ant runtime
The build downloads many jars, so it takes a while.
When the build succeeds, it creates the runtime directory, and Nutch is ready to use.
cd runtime/local/bin
./nutch inject /path/to/urls_folder
./nutch readdb -stats
The default log file is runtime/local/logs/hadoop.log
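The inject command above takes a directory of plain-text files that list one seed URL per line. The sketch below shows one way to prepare such a directory; write_seeds is a hypothetical helper, and the directory, the seed.txt file name and the URL are placeholders chosen for illustration.

```shell
# Write the given URLs, one per line, to <dir>/seed.txt.
# Usage: write_seeds <dir> <url>...
write_seeds() {
  dir=$1
  shift
  mkdir -p "$dir" && printf '%s\n' "$@" > "$dir/seed.txt"
}

# Example: a single seed URL in a throwaway directory.
seed_dir="${TMPDIR:-/tmp}/nutch-urls"
write_seeds "$seed_dir" "http://nutch.apache.org/"
echo "seed list written to $seed_dir/seed.txt"
# Then, from runtime/local/bin:  ./nutch inject "$seed_dir"
```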