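This article is original work by solomon@javaeye. If you repost it, please credit the source (author solomon and the link
http://zolomon.iteye.com).
This series uses ikanalyzer for Chinese word segmentation; thanks to its author for a great contribution to Chinese-language support in Java.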
My profile:
http://www.google.com/profiles/solomon.royarr
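I finally have a free day to write something,
but the office I have been away from for a while (only a few days, really) no longer has the material I need.
The connection here is slow enough that even downloading a copy of nutch would disrupt my colleagues' work,
so for now I will walk through a configuration file from an older nutch version that I found online,
and revise this article for nutch 1.0 later.
My apologies to readers.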
The version used here comes from
http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/conf/nutch-default.xml
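Note:
This article
1) is mainly a translation (chiefly of nutch-default.xml),
2) plus my own experience using nutch,
3) plus material from the FAQ on the official nutch wiki,
http://wiki.apache.org/nutch/FAQ,
4) combined with earlier explanations of nutch configuration files by other users.
These 4 parts make up the article.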
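In this document, comments of the form <!--begin this is a comment begin--> and <!--end this is a comment end--> are extra explanations added by me and are not part of the translation. The former introduces the property translation that follows it; the latter explains the property translation that precedes it. The two kinds do not necessarily appear in pairs.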
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
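<!--begin First, some notes on how this file is meant to be used begin-->
<!-- Do not modify this file directly. Instead, copy the entries you need into nutch-site.xml and change their values there. (An "entry" here is what this translation calls a property: the content between <property> and </property>, or more precisely everything apart from the variable parts such as <value></value>.) If nutch-site.xml does not exist, create it yourself. -->
<!--end nutch-site.xml can be rendered in several styles; pointing it at a different xsl gives a different style, so do not be surprised if nutch configuration files found online look different. What each xsl's style looks like is not described here; please consult the xsl files shipped in the nutch archive. end-->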
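<!--end As a rough illustration of the note above, a minimal nutch-site.xml might look like the sketch below. It assumes the same layout as this file (a nutch-conf root element holding property elements); the agent name MyTestCrawler is a made-up placeholder, not a recommended value.
<?xml version="1.0"?>
<nutch-conf>
<property>
<name>http.agent.name</name>
<value>MyTestCrawler</value>
</property>
</nutch-conf>
Only the properties you want to change need to appear there; everything else keeps the value given in nutch-default.xml. end-->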
<!--begin Root element of the nutch configuration file begin-->
<nutch-conf>
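<!--begin Properties in the nutch configuration file are grouped into blocks, each covering one area, so the structure is easy to follow; to change something, go to the relevant block and look for the property. The HTTP properties block below, for example, holds the http-related settings; later blocks cover ftp, the searcher, and so on. begin-->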
<!-- HTTP properties -->
<property>
<name>http.agent.name</name>
<value>NutchCVS</value>
<description>Our HTTP 'User-Agent' request header.</description>
</property>
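<!--end This value is the name our crawler announces in the User-Agent request header, so that the sites we fetch can identify it. It is one of the 3 required properties in the nutch 1.0 configuration file. end-->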
<property>
<name>http.robots.agents</name>
<value>NutchCVS,Nutch,*</value>
<description>The agent strings we will look for in robots.txt files; there may be several,
comma-separated, in decreasing order of precedence.</description>
</property>
<!--end Reading robots.txt is part of the convention that search engine crawlers follow; our crawler honors the rules given in robots.txt. For more on robots.txt, see
http://www.robotstxt.org/ end-->
<property>
<name>http.robots.403.allow</name>
<value>true</value>
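<description>Some servers return HTTP status 403 (Forbidden) when /robots.txt does not exist. This should probably mean that we are still allowed to crawl the site. If this property is set to false, we treat such a site as forbidden and do not crawl it.</description>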
</property>
<property>
<name>http.agent.description</name>
<value>Nutch</value>
<description>Also used in the User-Agent header: a further description of our bot. This text (the string in this value) appears in parentheses after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://lucene.apache.org/nutch/bot.html</value>
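<description>Also used in the User-Agent header. This URL (the string in this value) appears in the parentheses after the agent name; it is simply an address to advertise, for example a page describing the crawler.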
</description>
</property>
<property>
<name>http.agent.email</name>
<value>[email protected]</value>
<description>An email address to advertise in the HTTP 'From' request header and in the User-Agent header.</description>
</property>
<property>
<name>http.agent.version</name>
<value>0.7.2</value>
<description>A version string to advertise in the User-Agent header.</description>
</property>
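<!--end The http.agent.* properties above together describe the crawler's identity. A sketch of overriding some of them in nutch-site.xml follows; every value in it (MyTestCrawler, the description text, the example.com URL) is a hypothetical placeholder.
<property>
<name>http.agent.name</name>
<value>MyTestCrawler</value>
</property>
<property>
<name>http.agent.description</name>
<value>test crawl for an internal search prototype</value>
</property>
<property>
<name>http.agent.url</name>
<value>http://www.example.com/crawler.html</value>
</property>
Assuming the header is assembled as name/version followed by the description, url and email in parentheses (my reading of the descriptions above, not verified against the nutch source), this would be sent as something like: MyTestCrawler/0.7.2 (test crawl for an internal search prototype; http://www.example.com/crawler.html; followed by the http.agent.email value). end-->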
<property>
<name>http.timeout</name>
<value>10000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>http.max.delays</name>
<value>3</value>
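<description>The number of times the fetch of a page may be delayed. Each time nutch finds that a host is busy, it waits for fetcher.server.delay. After http.max.delays such delays, the fetch gives up on the page for now.</description>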
</property>
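<!--end A worked example, assuming each delay lasts exactly fetcher.server.delay: with the values in this file (fetcher.server.delay 5.0 seconds, http.max.delays 3) a thread waits roughly 3 x 5.0 = 15 seconds on a busy host before giving up on the page. If pages on a slow host are being dropped too eagerly, a sketch of a nutch-site.xml override (the value 10 is only an illustration, giving roughly 10 x 5.0 = 50 seconds) would be:
<property>
<name>http.max.delays</name>
<value>10</value>
</property>
end-->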
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it is truncated; otherwise nothing is truncated.
</description>
</property>
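<!--end "Download" here does not mean manually downloading a piece of software. Some beginners take it to mean a page that offers a download item (such as an attachment). What we mean is that whenever a web page is visited, it is downloaded from the network before it can be viewed in a browser; every time a page is opened or visited, that page is downloaded once. end-->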
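<!--end Following the description above, a negative value switches truncation off. A sketch of a nutch-site.xml override that keeps whole pages, on the assumption that you have the disk space and memory for large documents:
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
With the default of 65536, anything beyond the first 64 KB (65536 = 64 x 1024 bytes) of a page is cut off. end-->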
<property>
<name>http.proxy.host</name>
<value></value>
<description>The proxy host name. If empty, no proxy is used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value></value>
<description>The proxy port.</description>
</property>
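<!--end A sketch of crawling through a proxy, to be placed in nutch-site.xml. The host proxy.example.com and the port 8080 are hypothetical placeholders; substitute the values for your own network.
<property>
<name>http.proxy.host</name>
<value>proxy.example.com</value>
</property>
<property>
<name>http.proxy.port</name>
<value>8080</value>
</property>
Leaving http.proxy.host empty, as in the default above, means no proxy is used. end-->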
<property>
<name>http.verbose</name>
<value>false</value>
<description>If true, HTTP will log more verbosely.</description>
</property>
<!--end The exact effect is unclear and needs further testing. As described, if this value is true, HTTP activity is logged in much more detail. end-->
<property>
<name>http.redirect.max</name>
<value>3</value>
<description>The maximum number of redirects followed during a fetch. If a page redirects more times than this, the fetcher gives up on it and moves on to the next page.</description>
</property>
<!-- FILE properties -->
<property>
<name>file.content.limit</name>
<value>65536</value>
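<description>The length limit for downloaded content, in bytes.
If this value is larger than zero, content longer than it is truncated; otherwise (zero or negative), nothing is truncated.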
</description>
</property>
<property>
<name>file.content.ignored</name>
<value>true</value>
<description>If true, no file content is stored during fetch.
This is usually what we want, since a file:// URL generally means the file is local and we can crawl and index it directly. Otherwise (if not true), file content will be stored.
!! NOT IMPLEMENTED YET !!
</description>
</property>
<!-- FTP properties -->
<property>
<name>ftp.username</name>
<value>anonymous</value>
<description>The ftp login username.</description>
</property>
<property>
<name>ftp.password</name>
<value>[email protected]</value>
<description>The ftp login password.</description>
</property>
<property>
<name>ftp.content.limit</name>
<value>65536</value>
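<description>The upper limit on the length of file content, in bytes.
If this value is larger than zero, content longer than it is truncated; otherwise (zero or negative), nothing is truncated. Caution: the classical
ftp RFCs never defined partial transfer and, in fact, some ftp servers cannot handle a forced close on the client side.
We try our best to handle such situations smoothly.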
</description>
</property>
<property>
<name>ftp.timeout</name>
<value>60000</value>
<description>The default timeout for the ftp client socket, in milliseconds. See also the ftp.keep.connection property below.</description>
</property>
<property>
<name>ftp.server.timeout</name>
<value>100000</value>
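<description>An estimate of the ftp server's idle time, in milliseconds. For most ftp servers 120000 milliseconds is typical.
It is better to be conservative here. Together with ftp.timeout, it is used to decide whether we need to delete (kill) the current ftp.client instance and force another ftp.client instance to start. This is necessary because a fetcher thread may not be able to issue its next request in time (it may be sitting idle) before the remote ftp server times out and disconnects the client.
Used only when ftp.keep.connection (see below) is true.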
</description>
</property>
<property>
<name>ftp.keep.connection</name>
<value>false</value>
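<description>Whether to keep the ftp connection open. Useful when crawling the same host again and again. When set to true, it avoids the connection, login and directory-list parser setup for subsequent urls. If it is set to true, however, you must make sure (roughly) that:
(1) ftp.timeout is less than ftp.server.timeout
(2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
Otherwise there will be a large number of "delete client because idled too long" messages in the thread logs.</description>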
</property>
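<!--end A worked check of the two conditions against the defaults in this file, assuming the product in (2) is converted from seconds to milliseconds before comparing (fetcher.server.delay is given in seconds, the ftp timeouts in milliseconds):
(1) ftp.timeout = 60000 ms, ftp.server.timeout = 100000 ms, and 60000 is less than 100000, so it holds.
(2) fetcher.threads.fetch * fetcher.server.delay = 10 * 5.0 s = 50 s = 50000 ms, and 60000 is larger than 50000, so it holds.
With the defaults unchanged, then, a sketch of switching the feature on in nutch-site.xml is simply:
<property>
<name>ftp.keep.connection</name>
<value>true</value>
</property>
If you raise fetcher.threads.fetch to 20, the product becomes 100000 ms and condition (2) no longer holds, so ftp.timeout and ftp.server.timeout would need to be raised as well. end-->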
<property>
<name>ftp.follow.talk</name>
<value>false</value>
<description>Whether to log the dialogue between our client and the remote server. Useful for debugging.</description>
</property>
<!-- web db properties -->
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>The default number of days between re-fetches of a page.
</description>
</property>
<property>
<name>db.ignore.internal.links</name>
<value>true</value>
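<description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest-quality links.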
</description>
</property>
<!--end This property is very useful for influencing the result pages the search engine presents end-->
<property>
<name>db.score.injected</name>
<value>1.0</value>
<description>The score given to new pages added by the injector.
</description>
</property>
<property>
<name>db.score.link.external</name>
<value>1.0</value>
<description>The score factor for new pages added due to a link from
another host, relative to the referencing page's score.
</description>
</property>
<property>
<name>db.score.link.internal</name>
<value>1.0</value>
<description>The score factor for pages added due to a link from the
same host, relative to the referencing page's score.
</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we will process for a page.</description>
</property>
<property>
<name>db.max.anchor.length</name>
<value>100</value>
<description>The maximum length of a link's anchor text.</description>
</property>
<property>
<name>db.fetch.retry.max</name>
<value>3</value>
<description>The maximum number of retries when fetching.</description>
</property>
<!-- fetchlist tool properties -->
<property>
<name>fetchlist.score.by.link.count</name>
<value>true</value>
<description>If true, set page scores on fetchlist entries based on
log(number of anchors), instead of using original page scores. This
results in prioritization of pages with many incoming links.
</description>
</property>
<!-- fetcher properties -->
<property>
<name>fetcher.server.delay</name>
<value>5.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of fetcher threads to use at once.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>1</value>
<description>The maximum number of threads allowed to fetch from a single host at once.</description>
</property>
<property>
<name>fetcher.verbose</name>
<value>false</value>
<description>If true, the fetcher will log more verbosely.</description>
</property>
<!-- parser properties -->
<property>
<name>parser.threads.parse</name>
<value>10</value>
<description>The number of parser threads ParseSegment should use at once.</description>
</property>
<!-- i/o properties -->
<property>
<name>io.sort.factor</name>
<value>100</value>
<description>The number of streams to merge at once while sorting
files. This determines the number of open file handles.</description>
</property>
<property>
<name>io.sort.mb</name>
<value>100</value>
<description>The total amount of buffer memory to use while sorting
files, in megabytes. By default, gives each merge stream 1MB, which
should minimize seeks.</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<description>The size of buffer for use in sequence files.
The size of this buffer should probably be a multiple of hardware
page size (4096 on Intel x86), and it determines how much data is
buffered during read and write operations.</description>
</property>
<!-- file system properties -->
<property>
<name>fs.default.name</name>
<value>local</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
</property>
<property>
<name>ndfs.name.dir</name>
<value>/tmp/nutch/ndfs/name</value>
<description>Determines where on the local filesystem the NDFS name node
should store the name table.</description>
</property>
<property>
<name>ndfs.data.dir</name>
<value>/tmp/nutch/ndfs/data</value>
<description>Determines where on the local filesystem an NDFS data node
should store its blocks.</description>
</property>
<!-- map/reduce properties -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:8010</value>
<description>The host and port that the MapReduce job tracker runs at.
</description>
</property>
<property>
<name>mapred.local.dir</name>
<value>/tmp/nutch/mapred/local</value>
<description>The local directory where MapReduce stores temporary files
related to tasks and jobs.
</description>
</property>
<!-- indexer properties -->
<property>
<name>indexer.score.power</name>
<value>0.5</value>
<description>Determines the power of link analysis scores. Each
page's boost is set to <i>score<sup>scorePower</sup></i> where
<i>score</i> is its link analysis score and <i>scorePower</i> is the
value of this parameter. This is compiled into indexes, so, when
this is changed, pages must be re-indexed for it to take
effect.</description>
</property>
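<!--end A worked example of the formula in the description, boost = score raised to the power scorePower. With the default scorePower of 0.5 the boost is the square root of the score: a page with link analysis score 4.0 gets boost 2.0, and a page with score 100.0 gets boost 10.0, so large scores are damped. With scorePower 1.0 the boost would equal the score itself. A sketch of a nutch-site.xml override (the value 0.75 is only an illustration) would be:
<property>
<name>indexer.score.power</name>
<value>0.75</value>
</property>
As the description says, pages must be re-indexed after changing this value. end-->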
<property>
<name>indexer.boost.by.link.count</name>
<value>true</value>
<description>When true, scores for a page are multiplied by the log of
the number of incoming links to the page.</description>
</property>
<property>
<name>indexer.max.title.length</name>
<value>100</value>
<description>The maximum number of characters of a title that are indexed.
</description>
</property>
<property>
<name>indexer.max.tokens</name>
<value>10000</value>
<description>
The maximum number of tokens that will be indexed for a single field
in a document. This limits the amount of memory required for
indexing, so that collections with very large files will not crash
the indexing process by running out of memory.
Note that this effectively truncates large documents, excluding
from the index tokens that occur further in the document. If you
know your source documents are large, be sure to set this value
high enough to accommodate the expected size. If you set it to
Integer.MAX_VALUE, then the only limit is your memory, but you
should anticipate an OutOfMemoryError.
</description>
</property>
<property>
<name>indexer.mergeFactor</name>
<value>50</value>
<description>The factor that determines the frequency of Lucene segment
merges. This must not be less than 2, higher values increase indexing
speed but lead to increased RAM usage, and increase the number of
open file handles (which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments, they
are a low-level data unit used by Lucene.
</description>
</property>
<property>
<name>indexer.minMergeDocs</name>
<value>50</value>
<description>This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
</description>
</property>
<property>
<name>indexer.maxMergeDocs</name>
<value>2147483647</value>
<description>This number determines the maximum number of Lucene
Documents to be merged into a new Lucene segment. Larger values
increase indexing speed and reduce the number of Lucene segments,
which reduces the number of open file handles; however, this also
increases RAM usage during indexing.
</description>
</property>
<property>
<name>indexer.termIndexInterval</name>
<value>128</value>
<description>Determines the fraction of terms which Lucene keeps in
RAM when searching, to facilitate random-access. Smaller values use
more memory but make searches somewhat faster. Larger values use
less memory but make searches somewhat slower.
</description>
</property>
<!-- analysis properties -->
<property>
<name>analysis.common.terms.file</name>
<value>common-terms.utf8</value>
<description>The name of a file containing a list of common terms
that should be indexed in n-grams.</description>
</property>
<!-- searcher properties -->
<property>
<name>searcher.dir</name>
<value>.</value>
<description>
Path to root of index directories. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
<property>
<name>searcher.filter.cache.size</name>
<value>16</value>
<description>
Maximum number of filters to cache. Filters can accelerate certain
field-based queries, like language, document format, etc. Each
filter requires one bit of RAM per page. So, with a 10 million page
index, a cache size of 16 consumes two bytes per page, or 20MB.
</description>
</property>
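<!--end The arithmetic in the description generalizes as: memory = (cache size x number of pages) bits. For the default, 16 x 10,000,000 = 160,000,000 bits = 20,000,000 bytes, the 20MB quoted above. Doubling the cache to 32 on the same 10 million page index costs 4 bytes per page, or about 40MB. A sketch of that override in nutch-site.xml (the value 32 is only an illustration):
<property>
<name>searcher.filter.cache.size</name>
<value>32</value>
</property>
end-->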
<property>
<name>searcher.filter.cache.threshold</name>
<value>0.05</value>
<description>
Filters are cached when their term is matched by more than this
fraction of pages. For example, with a threshold of 0.05, and 10
million pages, the term must match more than 1/20, or 500,000 pages.
So, if out of 10 million pages, 50% of pages are in English, and 2%
are in Finnish, then, with a threshold of 0.05, searches for
"lang:en" will use a cached filter, while searches for "lang:fi"
will score all 200,000 Finnish documents.
</description>
</property>
<property>
<name>searcher.hostgrouping.rawhits.factor</name>
<value>2.0</value>
<description>
A factor that is used to determine the number of raw hits
initially fetched, before host grouping is done.
</description>
</property>
<property>
<name>searcher.summary.context</name>
<value>5</value>
<description>
The number of context terms to display preceding and following
matching terms in a hit summary.
</description>
</property>
<property>
<name>searcher.summary.length</name>
<value>20</value>
<description>
The total number of terms to display in a hit summary.
</description>
</property>
<!-- URL normalizer properties -->
<property>
<name>urlnormalizer.class</name>
<value>org.apache.nutch.net.BasicUrlNormalizer</value>
<description>Name of the class used to normalize URLs.</description>
</property>
<property>
<name>urlnormalizer.regex.file</name>
<value>regex-normalize.xml</value>
<description>Name of the config file used by the RegexUrlNormalizer class.</description></property>
<!-- mime properties -->
<property>
<name>mime.types.file</name>
<value>mime-types.xml</value>
<description>Name of file in CLASSPATH containing filename extension and
magic sequence to mime types mapping information</description>
</property>
<property>
<name>mime.type.magic</name>
<value>true</value>
<description>Defines if the mime content type detector uses magic resolution.
</description>
</property>
<!-- ipc properties -->
<property>
<name>ipc.client.timeout</name>
<value>10000</value>
<description>Defines the timeout for IPC calls in milliseconds. </description>
</property>
<!-- plugin properties -->
<property>
<name>plugin.folders</name>
<value>plugins</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to include at least the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
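<!--end A sketch of extending the default expression in nutch-site.xml. It assumes that the plugin directories are named urlfilter-prefix and language-identifier, the names used elsewhere in this file, and that both are present under the plugins folder of your installation; check the directory listing before relying on it.
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-(regex|prefix)|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
</property>
Because the value is a regular expression matched against directory names, urlfilter-(regex|prefix) is just a compact way of naming both urlfilter-regex and urlfilter-prefix. end-->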
<property>
<name>plugin.excludes</name>
<value></value>
<description>Regular expression naming plugin directory names to exclude.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>windows-1252</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>parser.html.impl</name>
<value>neko</value>
<description>HTML Parser implementation. Currently the following keywords
are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
</description>
</property>
<!-- urlfilter plugin properties -->
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
<property>
<name>urlfilter.prefix.file</name>
<value>prefix-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing url prefixes
used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>
<property>
<name>urlfilter.order</name>
<value></value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>
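<!--end A sketch of setting the order explicitly in nutch-site.xml, using the two class names from the description. Putting PrefixURLFilter first is only an illustration, on the assumption that a prefix check is cheaper than evaluating regular expressions; it also assumes the urlfilter-prefix plugin has been enabled through plugin.includes.
<property>
<name>urlfilter.order</name>
<value>org.apache.nutch.net.PrefixURLFilter org.apache.nutch.net.RegexURLFilter</value>
</property>
Since the filters are AND'ed, the set of accepted urls is the same in either order; only the cost of rejecting a url changes. end-->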
<!-- clustering extension properties -->
<property>
<name>extension.clustering.hits-to-cluster</name>
<value>100</value>
<description>Number of snippets retrieved for the clustering extension
if clustering extension is available and user requested results
to be clustered.</description>
</property>
<property>
<name>extension.clustering.extension-name</name>
<value></value>
<description>Use the specified online clustering extension. If empty,
the first available extension will be used. The "name" here refers to an 'id'
attribute of the 'implementation' element in the plugin descriptor XML
file.</description>
</property>
<!-- ontology extension properties -->
<property>
<name>extension.ontology.extension-name</name>
<value></value>
<description>Use the specified online ontology extension. If empty,
the first available extension will be used. The "name" here refers to an 'id'
attribute of the 'implementation' element in the plugin descriptor XML
file.</description>
</property>
<property>
<name>extension.ontology.urls</name>
<value>
</value>
<description>Urls of owl files, separated by spaces, such as
http://www.example.com/ontology/time.owl
http://www.example.com/ontology/space.owl
http://www.example.com/ontology/wine.owl
Or
file:/ontology/time.owl
file:/ontology/space.owl
file:/ontology/wine.owl
You have to make sure each url is valid.
By default, there is no owl file, so query refinement based on ontology
is silently ignored.
</description>
</property>
<!-- query-basic plugin properties -->
<property>
<name>query.url.boost</name>
<value>4.0</value>
<description> Used as a boost for url field in Lucene query.
</description>
</property>
<property>
<name>query.anchor.boost</name>
<value>2.0</value>
<description> Used as a boost for anchor field in Lucene query.
</description>
</property>
<property>
<name>query.title.boost</name>
<value>1.5</value>
<description> Used as a boost for title field in Lucene query.
</description>
</property>
<property>
<name>query.host.boost</name>
<value>2.0</value>
<description> Used as a boost for host field in Lucene query.
</description>
</property>
<property>
<name>query.phrase.boost</name>
<value>1.0</value>
<description> Used as a boost for phrase in Lucene query.
Multiplied by boost for field phrase is matched in.
</description>
</property>
<!-- language-identifier plugin properties -->
<property>
<name>lang.ngram.min.length</name>
<value>1</value>
<description> The minimum size of ngrams to use to identify
language (must be between 1 and lang.ngram.max.length).
The larger the range between lang.ngram.min.length and
lang.ngram.max.length, the better the identification, but
the slower it is.
</description>
</property>
<property>
<name>lang.ngram.max.length</name>
<value>4</value>
<description> The maximum size of ngrams to use to identify
language (must be between lang.ngram.min.length and 4).
The larger the range between lang.ngram.min.length and
lang.ngram.max.length, the better the identification, but
the slower it is.
</description>
</property>
<property>
<name>lang.analyze.max.length</name>
<value>2048</value>
<description> The maximum number of bytes of data to use to identify
the language (0 means full content analysis).
The larger this value, the better the analysis, but the
slower it is.
</description>
</property>
</nutch-conf>