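Operating system: Ubuntu 16.04 LTS
Nutch version: 2.2.1
Before configuring Nutch you need a working Ant installation; if you don't know how to set one up, see my other article, "Configuring Ant on Ubuntu".
Then download Nutch from the official Nutch site. Note that version 2.3.1 failed to build for me, and switching the maven2 repository did not help either. The build always hung at the following point: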
root@ubuntu:/opt/apache-nutch-2.3.1# ant runtime
Buildfile: /opt/apache-nutch-2.3.1/build.xml
ivy-probe-antlib:
ivy-download:
ivy-download-unchecked:
ivy-init-antlib:
ivy-init:
init:
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build/classes
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build/release
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build/test
[mkdir] Created dir: /opt/apache-nutch-2.3.1/build/test/classes
clean-lib:
resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /opt/apache-nutch-2.3.1/ivy/ivysettings.xml
So I gave up on 2.3.1 and installed Nutch 2.2.1 instead. Download it from: http://archive.apache.org/dist/nutch/2.2.1/
Firefox on Ubuntu saves downloads to ~/Downloads by default.
1. Switch to that directory with cd ~/Downloads, then unpack the archive with tar -xzvf apache-nutch-2.2.1-src.tar.gz
Then move the extracted directory to /opt with sudo mv apache-nutch-2.2.1 /opt/
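Step 1 can also be done in one go by extracting straight into the target directory. The sketch below assumes the archive unpacks to a top-level apache-nutch-2.2.1 directory; extract_nutch is a hypothetical helper name, not something Nutch ships.

```shell
# Extract a Nutch source archive directly into a destination directory.
# Usage: extract_nutch ARCHIVE DEST
extract_nutch() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    # -x extract, -z gunzip, -f archive file, -C change to DEST before extracting
    tar -xzf "$archive" -C "$dest"
}
```

With write access to /opt (for example from a root shell), extract_nutch ~/Downloads/apache-nutch-2.2.1-src.tar.gz /opt should leave the sources in /opt/apache-nutch-2.2.1.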
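2. Enable MySQL support in Nutch by editing ${NUTCH_HOME}/ivy/ivy.xml
First uncomment the following line:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
Then change the following line from the default
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
Finally, uncomment the following line:
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />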
3. Configure the database connection: edit ${NUTCH_HOME}/conf/gora.properties, comment out the default datastore settings, and add the following:
###############################
# Default MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
# Replace xxxx with your MySQL user name and password
gora.sqlstore.jdbc.user=xxxx
gora.sqlstore.jdbc.password=xxxx
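Since the default settings are commented out in the same file, it is easy to leave one of the new keys commented by accident. A minimal sketch of a check for the four keys listed above; check_gora_props is a hypothetical helper name.

```shell
# Report any of the four MySQL settings that is not active in the file.
# Returns 0 when all are present, 1 otherwise.
check_gora_props() {
    file="$1"
    missing=0
    for key in gora.sqlstore.jdbc.driver gora.sqlstore.jdbc.url \
               gora.sqlstore.jdbc.user gora.sqlstore.jdbc.password; do
        # An active setting starts the line with the key followed by '=';
        # commented lines start with '#' and therefore do not match.
        if ! grep -q "^${key}=" "$file"; then
            echo "missing: ${key}"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_gora_props ${NUTCH_HOME}/conf/gora.properties prints nothing when all four settings are in place.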
4. Configure the table mapping
Edit ${NUTCH_HOME}/conf/gora-sql-mapping.xml
and change the length of the primarykey from 512 to 767, i.e.
<primarykey column="id" length="767"/>
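The same change can be scripted. The sketch below assumes the mapping file writes the key as a primarykey element carrying a length="512" attribute; set_primarykey_length is a hypothetical helper name. Other elements that also use length="512" are left alone.

```shell
# Rewrite length="512" to length="767" on <primarykey> elements only.
# A backup of the original is kept next to the file as FILE.bak.
set_primarykey_length() {
    file="$1"
    cp "$file" "$file.bak" || return 1
    # \( ... \) captures everything up to 'length=' so it can be re-emitted as \1
    sed 's/\(<primarykey[^>]*length=\)"512"/\1"767"/' "$file.bak" > "$file"
}
```

A call such as set_primarykey_length ${NUTCH_HOME}/conf/gora-sql-mapping.xml would apply it; compare the file against the .bak copy afterwards to confirm only the primarykey lines changed.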
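5. Edit the nutch-site.xml configuration file
You can simply save a copy of nutch-default.xml as nutch-site.xml: in ${NUTCH_HOME}/conf, run sudo cp nutch-default.xml nutch-site.xml
Then open it with sudo gedit nutch-site.xml and add the following just before the closing </configuration> tag at the end of the file: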
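<property>
  <name>http.agent.name</name>
  <value>YourNutchSpider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the Accept-Language request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
  Currently the following stores are available:.
  </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available
  </description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>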
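Because nutch-site.xml started out as a copy of nutch-default.xml, a property appended at the end may now be defined twice in the same file. A small sketch for counting how often a property name occurs; count_property is a hypothetical helper name, and it assumes each name tag sits on a single line.

```shell
# Print how many lines of FILE contain <name>PROPERTY</name>.
# Usage: count_property FILE PROPERTY
count_property() {
    # -c prints the number of matching lines, -F treats the pattern literally.
    # grep exits non-zero when the count is 0, so "|| true" keeps the helper quiet.
    grep -cF "<name>$2</name>" "$1" || true
}
```

For example, count_property nutch-site.xml http.agent.name shows whether the earlier definition copied from the defaults is still present alongside the new one.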
6. Build with Ant
Switch to the Nutch directory and start the build:
cd ${NUTCH_HOME}
ant runtime
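Problems you may run into:
1) Insufficient permissions: creating directories such as build fails. Switch to root with sudo -i and run the Ant build again.
2) The following messages appear: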
Trying to override old definition of task javac [taskdef]
Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
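First download sonar-ant-task-2.2.jar and copy it into the ${NUTCH_HOME}/lib directory.
Then open the build file with sudo gedit ${NUTCH_HOME}/build.xml
Press Ctrl+F to open the search box and enter antlib:org.sonar.ant to jump to the Sonar task definition, then add the code marked in red there.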
3) BUILD FAILED, with output such as:
[ivy:resolve] :: com.google.code.findbugs#jsr305;1.3.9!jsr305.jar
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
BUILD FAILED
/opt/apache-nutch-2.2.1/build.xml:444: impossible to resolve dependencies:
resolve failed - see output for details
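The same applies when BUILD FAILED is caused by some other dependency problem. Either way, it can be fixed by changing the Maven central repository address.
Open the settings with sudo gedit ${NUTCH_HOME}/ivy/ivysettings.xml and find the line that defines the Maven central repository,
then replace the Maven central repository address http://repo1.maven.org/maven2/
with the mirror provided by OSC in China: http://maven.oschina.net/content/groups/public/
4) The build hangs at the following point: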
resolve-default:
[ivy:resolve] :: Apache Ivy 2.3.0 - 20130110142753 :: http://ant.apache.org/ivy/ ::
[ivy:resolve] :: loading settings :: file = /opt/apache-nutch-2.3.1/ivy/ivysettings.xml
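Fix: be patient, because loading takes time. If nothing happens for more than 10 minutes, give up and switch to another Maven repository (see problem 3).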
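The repository swap from problem 3 can be sketched as a small helper. use_mirror is a hypothetical name, and the sketch assumes the mirror URL contains no '|' or '&' characters, which sed would treat specially.

```shell
# Replace the Maven central URL in an Ivy settings file with a mirror URL.
# Usage: use_mirror FILE MIRROR_URL
use_mirror() {
    file="$1"
    mirror="$2"
    cp "$file" "$file.bak" || return 1      # keep the original for rollback
    # '|' is used as the sed delimiter because the URLs contain '/'
    sed "s|http://repo1\.maven\.org/maven2/|${mirror}|g" "$file.bak" > "$file"
}
```

For example: use_mirror ${NUTCH_HOME}/ivy/ivysettings.xml http://maven.oschina.net/content/groups/public/. If the mirror turns out to be unreachable, copying the .bak file back restores the original settings.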
The build usually takes about half an hour. [Screenshot: my successful build]
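With a build this long, it helps to capture the output, for example with ant runtime 2>&1 | tee build.log, and check the result afterwards. The sketch below relies on the BUILD SUCCESSFUL and BUILD FAILED lines that Ant prints at the end of a run; build.log and build_status are hypothetical names.

```shell
# Print "ok", "failed", or "unknown" depending on what an Ant log contains.
# Usage: build_status LOGFILE
build_status() {
    log="$1"
    if grep -q "BUILD SUCCESSFUL" "$log"; then
        echo "ok"
    elif grep -q "BUILD FAILED" "$log"; then
        echo "failed"
    else
        echo "unknown"      # e.g. the build is still running or has hung
    fi
}
```

An "unknown" result after a long wait matches the hang described in problem 4.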
7. Crawl test
7.1 Set the sites to crawl
cd ${NUTCH_HOME}/runtime/local
sudo mkdir -p urls
cd urls
sudo vi seed.txt
Enter the URLs to crawl, one per line, then type :wq to save and quit.
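The seed file can also be written without an editor. A minimal sketch, where make_seed is a hypothetical helper name and http://example.com/ in the usage note is only a placeholder URL:

```shell
# Create DIR/seed.txt containing the given URLs, one per line.
# Usage: make_seed DIR URL [URL...]
make_seed() {
    dir="$1"
    shift
    mkdir -p "$dir" || return 1
    # printf repeats the format for every remaining argument
    printf '%s\n' "$@" > "$dir/seed.txt"
}
```

Run from ${NUTCH_HOME}/runtime/local with write access to that directory, make_seed urls http://example.com/ produces the same urls/seed.txt as the manual steps.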
7.2 Run the crawl
Change back to ${NUTCH_HOME}/runtime/local and run:
bin/nutch crawl urls -depth 3 -topN 5