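The official Nutch 2.x releases are currently distributed as source only. No precompiled version is provided, so you have to build it yourself.
2.1 Download and extract the source: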
http://www.apache.org/dyn/closer.lua/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
tar -zxvf apache-nutch-2.2.1-src.tar.gz
2.2 Download the Sonar jar and put it in the apache-nutch-2.2.1 directory
Edit build.xml so that it picks up the jar added above:
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
  <classpath path="${ant.library.dir}" />
  <classpath path="${mysql.library.dir}" />
  <classpath><fileset dir="." includes="sonar*.jar" /></classpath>
</taskdef>
2.3 The Nutch build downloads jars from the Maven repository, so change the repository address
vi apache-nutch-2.2.1/ivy/ivysettings.xml
<property name="repo.maven.org" value="http://repo1.maven.org/maven2/" override="false"/>
Change value="http://repo1.maven.org/maven2/" to value="http://repo2.maven.org/maven2/"
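The same edit can be scripted instead of done by hand in vi. This is a minimal sketch, assuming the settings file contains the repo1 URL literally. The name switch_maven_repo is a hypothetical helper, not something shipped with Nutch.

```shell
# Hypothetical helper: swap the repo1 Maven URL for repo2 in an Ivy settings file.
switch_maven_repo() {
  local settings_file="$1"
  # Keep a backup copy next to the original before rewriting it.
  cp "$settings_file" "$settings_file.bak"
  # Replace every literal occurrence of the repo1 URL with the repo2 URL.
  sed 's|http://repo1\.maven\.org/maven2/|http://repo2.maven.org/maven2/|g' \
    "$settings_file.bak" > "$settings_file"
}

# Usage:
#   switch_maven_repo apache-nutch-2.2.1/ivy/ivysettings.xml
```

The backup file (ivysettings.xml.bak) makes it easy to restore the original setting if the mirror is unreachable.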
Edit apache-nutch-2.2.1/ivy/ivy.xml and uncomment the following two dependencies:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
Because gora-sql 0.1.1-incubating is not compatible with gora-core 0.3, also change
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
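The version change can be applied with sed and then checked by listing the relevant dependency lines. This is a minimal sketch, assuming the gora-core dependency is written on one line exactly as shown above. The name downgrade_gora_core is a hypothetical helper. It does not uncomment anything; it only prints the matching lines so that a dependency still inside a comment can be spotted by eye.

```shell
# Hypothetical helper: pin gora-core to 0.2.1 in Nutch's ivy/ivy.xml.
downgrade_gora_core() {
  local ivy_file="$1"
  # Keep a backup copy before rewriting the file.
  cp "$ivy_file" "$ivy_file.bak"
  # Rewrite only the gora-core revision, leaving all other dependencies alone.
  sed 's|name="gora-core" rev="0\.3"|name="gora-core" rev="0.2.1"|' \
    "$ivy_file.bak" > "$ivy_file"
  # Print the gora and MySQL dependency lines with line numbers for review.
  grep -n -E 'gora-(core|sql)|mysql-connector-java' "$ivy_file"
}

# Usage:
#   downgrade_gora_core apache-nutch-2.2.1/ivy/ivy.xml
```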
Edit apache-nutch-2.2.1/conf/gora.properties: comment out the default database connection settings and add the following:
###############################
# Default SqlStore properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://10.202.13.175:3306/nutch-test
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=sf123456
Create the webpage table in the MySQL database named in the JDBC URL above (nutch-test):
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
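One way to run the statement above is through the mysql command-line client. This is a minimal sketch, assuming the CREATE TABLE statement has been saved to a file called webpage.sql (a hypothetical name) and that the host, user and database match the gora.properties values. The name create_db_sql is a hypothetical helper. The database name is wrapped in backticks because it contains a hyphen.

```shell
# Hypothetical helper: print a CREATE DATABASE statement for the given name.
# Backticks quote the name, which is required when it contains a hyphen.
create_db_sql() {
  printf 'CREATE DATABASE IF NOT EXISTS `%s` DEFAULT CHARACTER SET utf8;\n' "$1"
}

# Usage (each mysql call prompts for the password because of -p):
#   create_db_sql nutch-test | mysql -h 10.202.13.175 -u root -p
#   mysql -h 10.202.13.175 -u root -p nutch-test < webpage.sql
```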
Add the following inside the <configuration> element (these properties normally go in conf/nutch-site.xml):
<property>
  <name>http.agent.name</name>
  <value>YourNutchSpider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the Accept-Language request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
  Currently the following stores are available: ....
  </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available
  </description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
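Change to the apache-nutch-2.2.1 top-level directory and run the ant command to build.
Ivy will fail to download a few of the jars. These can be fetched directly, for example http://repo2.maven.org/maven2/org/bouncycastle/bcprov-jdk15on/1.52/bcprov-jdk15on-1.52.jar
Then copy the jar into the Ivy repository and build again.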
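The copy step can be sketched as follows. It assumes Ivy's default cache location (~/.ivy2/cache) and its default layout of organisation/module/jars/artifact-revision.jar. If the build uses a different cache directory or pattern, the destination has to be adjusted. The name install_jar_to_ivy_cache is a hypothetical helper, and Ivy may still want the module's metadata in addition to the jar.

```shell
# Hypothetical helper: copy a manually downloaded jar into an Ivy cache.
# Assumes the default layout <cache>/<organisation>/<module>/jars/<jar file>.
install_jar_to_ivy_cache() {
  local jar="$1"
  local org="$2"
  local module="$3"
  # The cache root defaults to Ivy's usual location and can be overridden.
  local cache_root="${4:-$HOME/.ivy2/cache}"
  local dest="$cache_root/$org/$module/jars"
  mkdir -p "$dest"
  cp "$jar" "$dest/"
  # Print where the jar ended up so the path can be checked.
  echo "$dest/$(basename "$jar")"
}

# Usage:
#   install_jar_to_ivy_cache bcprov-jdk15on-1.52.jar org.bouncycastle bcprov-jdk15on
```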
9.1 Set the site to crawl
cd apache-nutch-2.2.1/runtime/local
mkdir -p urls
echo 'http://www.sina.com' > urls/seed.txt
9.2 Run the crawl (from the same apache-nutch-2.2.1/runtime/local directory, where urls/ was created)
bin/nutch crawl urls -depth 3 -topN 5
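The crawl can also be run as individual Nutch 2.x jobs, which makes it easier to see which stage fails. This is a rough sketch, assuming depth means the number of generate/fetch/parse/updatedb rounds. It is an approximation, not the exact implementation of the crawl command. The name crawl_rounds is a hypothetical helper.

```shell
# Hypothetical helper: inject the seeds, then run a fixed number of crawl rounds.
crawl_rounds() {
  local nutch="$1"      # path to the nutch launcher, e.g. bin/nutch
  local seed_dir="$2"   # directory containing seed.txt
  local depth="$3"      # number of rounds to run
  local top_n="$4"      # maximum number of URLs generated per round
  local round=1
  # Load the seed URLs into the web table once.
  "$nutch" inject "$seed_dir" || return 1
  while [ "$round" -le "$depth" ]; do
    # Each round: select URLs, fetch them, parse them, update the web table.
    "$nutch" generate -topN "$top_n" || return 1
    "$nutch" fetch -all || return 1
    "$nutch" parse -all || return 1
    "$nutch" updatedb || return 1
    round=$((round + 1))
  done
}

# Usage, from apache-nutch-2.2.1/runtime/local:
#   crawl_rounds bin/nutch urls 3 5
```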