This series uses ikanalyzer for Chinese word segmentation; many thanks to its author for the great contribution to Chinese-language work in Java.
My profile: http://www.google.com/profiles/solomon.royarr
I finally have a free day to write something, only to find that this office environment I've been away from (for just a few days, really) no longer has all the material I need. The network here is slow enough that even downloading Nutch would disrupt my colleagues' work, so for now I'll walk through a configuration file from an older Nutch version that I found online, and rework it for Nutch 1.0 later. My apologies to you, the readers.
The version used here comes from http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/conf/nutch-default.xml
Disclaimer:
This article consists mainly of four parts:
1) a translation (chiefly of nutch-default.xml),
2) my own experience using Nutch,
3) content from the FAQ on the official Nutch wiki, http://wiki.apache.org/nutch/FAQ,
4) and past walkthroughs of the Nutch configuration file by other users.
This document also contains two kinds of annotations that are the author's own extra commentary rather than translation: the former are notes placed before a property's translated description, the latter explanations placed after it. These two kinds of annotations do not necessarily appear in pairs.
Comma-separated, in decreasing order of precedence.
If the value is non-negative (>= 0), content longer than this limit will be truncated; otherwise nothing is truncated.
If the value is non-zero, content longer than this limit is truncated; otherwise (zero or negative) no content is truncated.
This is usually what we want, since a file:// URL normally means the content is local and we can fetch and index it directly. Otherwise (if not true), the file contents will be saved.
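For concreteness, here is what these settings look like as nutch-default.xml entries. This is a minimal sketch: the property names are the ones used by the 0.7-era file (file.content.limit, file.content.ignored), and the 65536-byte limit is an example value I picked, not a recommendation.

    <property>
      <name>file.content.limit</name>
      <value>65536</value>  <!-- bytes; content beyond this is truncated -->
    </property>
    <property>
      <name>file.content.ignored</name>
      <value>true</value>  <!-- do not store file:// content during fetch -->
    </property>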
!! NOT IMPLEMENTED YET !!
If this value is greater than zero, content longer than it will be truncated; otherwise (zero or negative) nothing is truncated. Note: the classic FTP RFCs never defined partial transfers, and in practice some FTP servers do not handle a forced client-side close-down well. We try hard to handle such situations so that things run smoothly.
This setting is best kept conservative. Together with the ftp.timeout property, it is used to decide whether we need to delete (kill) the current ftp.client instance and force a restart of another one. This is needed because a fetcher thread may not get to issue its next request in time, before the remote FTP server disconnects on its own timeout (the thread may be sitting idle). Only used when ftp.keep.connection (see below) is true. Make sure that:
(1) ftp.timeout is smaller than ftp.server.timeout, and
(2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay);
otherwise the thread logs will fill with "delete client because idled too long" messages.
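To make the two constraints concrete, here is a sketch using what I believe are the stock 0.7 defaults (the two FTP timeouts are in milliseconds, fetcher.server.delay in seconds): ftp.timeout at 60 seconds is smaller than ftp.server.timeout at 100 seconds, and larger than fetcher.threads.fetch * fetcher.server.delay = 10 * 1.0 = 10 seconds.

    <property>
      <name>ftp.timeout</name>
      <value>60000</value>  <!-- client socket timeout, in milliseconds -->
    </property>
    <property>
      <name>ftp.server.timeout</name>
      <value>100000</value>  <!-- estimated server idle time, in milliseconds -->
    </property>
    <property>
      <name>fetcher.threads.fetch</name>
      <value>10</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>  <!-- seconds between requests to the same server -->
    </property>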
The score factor for new pages added due to a link from another host, relative to the referencing page's score.
The score factor for pages added due to a link from the same host, relative to the referencing page's score.
If true, page scores on fetchlist entries are set based on log(number of anchors), instead of using original page scores. This results in prioritization of pages with many incoming links.
The number of seconds the fetcher will delay between successive requests to the same server.
The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).
The number of streams to merge at once while sorting files. This determines the number of open file handles.
The total amount of buffer memory to use while sorting files, in megabytes. By default this gives each merge stream 1MB, which should minimize seeks.
The size of this buffer should probably be a multiple of hardware
page size (4096 on Intel x86), and it determines how much data is
buffered during read and write operations.
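As a sketch, the three I/O properties above look like this in nutch-default.xml; the values shown are the stock defaults as best I recall, so treat them as placeholders rather than tuning advice:

    <property>
      <name>io.sort.factor</name>
      <value>100</value>  <!-- streams merged at once while sorting -->
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>100</value>  <!-- total sort buffer: 1MB per merge stream -->
    </property>
    <property>
      <name>io.file.buffer.size</name>
      <value>4096</value>  <!-- a multiple of the hardware page size -->
    </property>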
The name of the default file system: either the literal string "local" or a host:port for NDFS.
Determines where on the local filesystem the NDFS name node should store the name table.
Determines where on the local filesystem an NDFS data node should store its blocks.
The local directory where temporary files related to tasks and jobs are stored.
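A sketch of the NDFS-related entries; the host, port, and paths here are hypothetical examples of mine, not defaults:

    <property>
      <name>fs.default.name</name>
      <value>localhost:9000</value>  <!-- or the literal string "local" -->
    </property>
    <property>
      <name>ndfs.name.dir</name>
      <value>/data/ndfs/name</value>  <!-- where the name node keeps the name table -->
    </property>
    <property>
      <name>ndfs.data.dir</name>
      <value>/data/ndfs/data</value>  <!-- where a data node keeps its blocks -->
    </property>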
Determines the power of link analysis scores. Each page's boost is set to score^scorePower, where score is its link analysis score and scorePower is the value of this parameter. This is compiled into indexes, so, when this is changed, pages must be re-indexed for it to take effect.
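A quick worked example of this formula: with the value 0.5 (the stock default, if I remember correctly), a page whose link analysis score is 16 is indexed with a boost of 16^0.5 = 4.

    <property>
      <name>indexer.score.power</name>
      <value>0.5</value>  <!-- boost = score^0.5, e.g. 16^0.5 = 4 -->
    </property>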
When true, a page's score is multiplied by the log of the number of incoming links to the page.
The maximum number of tokens that will be indexed for a single field
in a document. This limits the amount of memory required for
indexing, so that collections with very large files will not crash
the indexing process by running out of memory.
Note that this effectively truncates large documents, excluding
from the index tokens that occur further in the document. If you
know your source documents are large, be sure to set this value
high enough to accommodate the expected size. If you set it to
Integer.MAX_VALUE, then the only limit is your memory, but you
should anticipate an OutOfMemoryError.
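For example, as a nutch-default.xml entry (10000 is the stock default as far as I recall; raise it if you know your source documents are large):

    <property>
      <name>indexer.max.tokens</name>
      <value>10000</value>  <!-- tokens beyond this, per field, are not indexed -->
    </property>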
The factor that determines the frequency of Lucene segment merges. This must not be less than 2; higher values increase indexing speed but lead to increased RAM usage and increase the number of open file handles (which may lead to "Too many open files" errors). NOTE: the "segments" here have nothing to do with Nutch segments; they are a low-level data unit used by Lucene.
The minimum number of Lucene Documents buffered in memory between Lucene segment merges. Larger values increase indexing speed and increase RAM usage.
The maximum number of Lucene Documents to be merged into a new Lucene segment. Larger values increase indexing speed and reduce the number of Lucene segments, which reduces the number of open file handles; however, this also increases RAM usage during indexing.
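A sketch of the three merge-related knobs together; the values are what I believe the 0.7 defaults to be, shown only to make the trade-offs above concrete:

    <property>
      <name>indexer.mergeFactor</name>
      <value>50</value>  <!-- must be >= 2; higher = faster, more RAM and handles -->
    </property>
    <property>
      <name>indexer.minMergeDocs</name>
      <value>50</value>  <!-- documents buffered in memory between merges -->
    </property>
    <property>
      <name>indexer.maxMergeDocs</name>
      <value>2147483647</value>  <!-- Integer.MAX_VALUE: effectively unbounded -->
    </property>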
Determines the fraction of terms that Lucene keeps in RAM when searching, to facilitate random access. Smaller values use more memory but make searches somewhat faster. Larger values use less memory but make searches somewhat slower.
The name of a file containing a list of common terms that should be indexed in n-grams.
Path to root of index directories. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
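For example, with a hypothetical path of mine (the directory would contain either search-servers.txt, with one "host port" pair per line, or an "index" or "segments" directory as described above):

    <property>
      <name>searcher.dir</name>
      <value>/data/nutch/crawl</value>  <!-- hypothetical root of index directories -->
    </property>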
Maximum number of filters to cache. Filters can accelerate certain
field-based queries, like language, document format, etc. Each
filter requires one bit of RAM per page. So, with a 10 million page
index, a cache size of 16 consumes two bytes per page, or 20MB.
Filters are cached when their term is matched by more than this fraction of pages. For example, with a threshold of 0.05 and 10 million pages, the term must match more than 1/20 of them, or 500,000 pages. So, if out of 10 million pages, 50% of pages are in English and 2% are in Finnish, then, with a threshold of 0.05, searches for "lang:en" will use a cached filter, while searches for "lang:fi" will score all 200,000 Finnish documents.
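These two cache settings correspond to the following entries; the values match the 16-filter, 0.05-threshold example used in the descriptions above:

    <property>
      <name>searcher.filter.cache.size</name>
      <value>16</value>  <!-- one bit of RAM per page, per cached filter -->
    </property>
    <property>
      <name>searcher.filter.cache.threshold</name>
      <value>0.05</value>  <!-- cache a filter once its term matches > 5% of pages -->
    </property>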
A factor that is used to determine the number of raw hits
initially fetched, before host grouping is done.
The number of context terms to display preceding and following
matching terms in a hit summary.
The total number of terms to display in a hit summary.
The name of the file containing the magic-sequence-to-mime-types mapping information.
Directories where Nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.
A regular expression naming the plugin directories to include. Any plugin not matching this expression is excluded. In any case you need to include at least the nutch-extensionpoints plugin. By default, Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.
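For reference, the value is a regular expression over plugin directory names, roughly like the sketch below; I am reciting the 0.7-era default from memory, so double-check it against your own copy of nutch-default.xml:

    <property>
      <name>plugin.includes</name>
      <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>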
The character encoding to fall back to when no other information is available.
The HTML parser implementation. Currently the following keywords are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
The name of the file containing regular expressions used by the urlfilter-regex (RegexURLFilter) plugin.
The name of the file containing URL prefixes used by the urlfilter-prefix (PrefixURLFilter) plugin.
The order in which URL filters are applied. If empty, all available URL filters (as dictated by the properties plugin-includes and plugin-excludes above) are loaded and applied in system-defined order. If not empty, only the named filters are loaded and applied in the given order. For example, if this property has the value:
org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
then RegexURLFilter is applied first and PrefixURLFilter second. Since all filters are AND'ed, filter ordering has no impact on the end result, but it may have performance implications, depending on the relative cost of the filters.
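As an entry, the example from the description looks like this:

    <property>
      <name>urlfilter.order</name>
      <value>org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter</value>
    </property>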
The number of hits to retrieve for clustering, used if the clustering extension is available and the user requested that results be clustered.
The name of the online clustering extension to use. If empty, the first available extension will be used. The "name" here refers to the 'id' attribute of the 'implementation' element in the plugin descriptor XML file.
The name of the ontology extension to use. If empty, the first available extension will be used. The "name" here refers to the 'id' attribute of the 'implementation' element in the plugin descriptor XML file.
URLs of OWL files, separated by spaces, such as:
http://www.example.com/ontology/time.owl
http://www.example.com/ontology/space.owl
http://www.example.com/ontology/wine.owl
Or
file:/ontology/time.owl
file:/ontology/space.owl
file:/ontology/wine.owl
You have to make sure each URL is valid. By default there is no OWL file, so query refinement based on ontology is silently ignored.
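Putting the example URLs from above into the property itself (the property name, extension.ontology.urls, is the one I recall from the 0.7 file):

    <property>
      <name>extension.ontology.urls</name>
      <value>
        http://www.example.com/ontology/time.owl
        http://www.example.com/ontology/space.owl
        http://www.example.com/ontology/wine.owl
      </value>
    </property>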
Multiplied by the boost of the field in which a phrase is matched.
The minimum size of n-grams used to identify the language (must be between 1 and lang.ngram.max.length). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.
The maximum size of n-grams used to identify the language (must be between lang.ngram.min.length and 4). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.
The maximum number of bytes of content to analyze when identifying the language (0 means analyze the full content). The larger this value, the better the analysis, but the slower it is.
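To tie the three language-identifier settings together, here is a sketch; the values are what I believe the stock defaults to be (1 through 4 spans the full n-gram range the identifier supports):

    <property>
      <name>lang.ngram.min.length</name>
      <value>1</value>
    </property>
    <property>
      <name>lang.ngram.max.length</name>
      <value>4</value>
    </property>
    <property>
      <name>lang.analyze.max.length</name>
      <value>2048</value>  <!-- bytes of content examined; 0 = full content -->
    </property>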