There is an introduction at http://lucene.apache.org/nutch/tutorial8.html (quoted below).
The required preparation is as follows:
1. Download Nutch; this walkthrough uses the latest 0.9 release, unpacked to D:\nutch\nutch-0.9;
2. Set the environment variable NUTCH_JAVA_HOME to the JDK installation path (a shell alternative is sketched after this list);
3. Install the Tomcat server (not covered here);
4. Since we are on Windows, download and install cygwin to run the shell commands.
That completes the preparation.
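For step 2, an alternative to the Windows environment-variables dialog is to export the variable in the cygwin shell before running any nutch command (a minimal sketch; the JDK path is an assumption, substitute your own installation):

export NUTCH_JAVA_HOME="C:/Program Files/Java/jdk1.6.0"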
First, you need to get a copy of the Nutch code. You can download a release from http://lucene.apache.org/nutch/release/. Unpack the release and connect to its top-level directory. Or, check out the latest source code from subversion and build it with Ant.
Try the following command:
bin/nutch
This will display the documentation for the Nutch command script.
This part involves the following steps:
1. Run cygwin
After installing cygwin, start it and execute:
cd /cygdrive/d/nutch
cd nutch-0.9
The current directory shown in cygwin is now:
/cygdrive/d/nutch/nutch-0.9
In this directory, run: bin/nutch. If everything is set up correctly, you will see a "Usage: nutch COMMAND" message.
To configure things for intranet crawling you must:
1. Create a directory with a flat file of root urls. For example, this walkthrough starts from:
http://lucene.apache.org/nutch/
2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl, for example:
+^http://([a-z0-9]*\.)*apache.org/
This will include any url in the domain apache.org.
3. Edit the file conf/nutch-site.xml and insert, at minimum, proper values for the http.agent properties.
For the first item, create a folder named urls under d:\nutch\nutch-0.9, and inside it create a text file nutch.txt with the following content: http://lucene.apache.org/nutch/
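Both can also be created from the cygwin shell, run from /cygdrive/d/nutch/nutch-0.9:

mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/nutch.txt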
For the second item, open conf/crawl-urlfilter.txt, find MY.DOMAIN.NAME, and change that line to:
+^http://([a-z0-9]*\.)*apache.org/
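In the stock crawl-urlfilter.txt the relevant lines look roughly like this (quoted from memory of the 0.9 file, so treat the surrounding comment as an assumption):

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

and after the edit:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/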
For the third item, this experiment uses nutch-default.xml instead; modify the following properties:
http.agent.name
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
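For example, the http.agent.name entry would look something like this (a sketch; the value is a placeholder, substitute your own crawler name and fill in the other properties the same way):

<property>
  <name>http.agent.name</name>
  <value>my-test-spider</value>
  <description>Our crawler's HTTP 'User-Agent' request header.</description>
</property>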
Save the file after the modifications are complete.
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:
-dir dir names the directory to put the crawl in.
-threads threads determines the number of threads that will fetch in parallel.
-depth depth indicates the link depth from the root page that should be crawled.
-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
For example, a typical call might be:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
Once crawling has completed, one can skip to the Searching section below.
Here you only need to run the following command:
bin/nutch crawl urls -dir crawled -depth 3 -topN 50 >& crawl.log
When it finishes, a crawled folder and a crawl.log log file will have been generated.
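A quick way to sanity-check what was fetched is to scan the log (assuming the 0.9 fetcher logs lines of the form "fetching <url>"):

grep fetching crawl.log | less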
In the log file you will notice errors thrown for pdf files. That is because pdf files are not indexed by default. To index pdf files correctly as well, find the plugin.includes property in nutch-default.xml and add pdf to the parse plugin list, i.e. parse-(text|html|js|pdf).
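With the change applied, the property would read something like this (the rest of the plugin list is quoted from memory of the 0.9 defaults and may differ in your copy; only the parse-(...) part is the point here):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>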
The crawled folder contains the segments, linkdb, indexes, index, and crawldb directories.
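For a quick summary of the crawl database, the readdb command can print statistics (a sketch; the exact output format of the 0.9 tool may vary):

bin/nutch readdb crawled/crawldb -stats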
At this point the index data is ready.
Next is how to run it in tomcat.
Copy nutch-0.9.war into tomcat's webapps directory and rename it to nutch.war;
go into the conf\Catalina\localhost directory and create a file nutch.xml with the following content:
<Context path="/nutch" debug="0" privileged="true" />
start tomcat;
go into the unpacked webapps\nutch\WEB-INF\classes directory and set searcher.dir in nutch-default.xml to D:\nutch\nutch-0.9\crawled;
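The edited entry would then look something like this (a sketch of the standard property format, using the path from above):

<property>
  <name>searcher.dir</name>
  <value>D:\nutch\nutch-0.9\crawled</value>
</property>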
open a browser and go to http://localhost:8080/nutch;
now you can search: enter apache and you will get the relevant results.