4. A study of the nutch 1.0 site and crawler property configuration file

This article is an original work by solomon@javaeye. If you repost it, please credit the source (author: solomon, link: http://zolomon.iteye.com).
This series uses ikanalyzer for Chinese word segmentation; many thanks to its author for the huge contribution to Chinese-language work in Java.
My profile: http://www.google.com/profiles/solomon.royarr

It is not easy to get a whole free day to write something, but I have discovered that, away from my usual office environment for a while (only a few days, really), I no longer have all the material I need. The network here is so slow that even downloading Nutch would disturb my colleagues' work, so for now I will walk through a configuration file from an older Nutch version found online, and revise it into the Nutch 1.0 version later. My apologies to the readers for that.
The version used here comes from http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/conf/nutch-default.xml

Declaration:
This article is
1) mainly a translation (chiefly of nutch-default.xml),
2) plus my own experience of using Nutch,
3) plus material from the FAQ on the official Nutch wiki, http://wiki.apache.org/nutch/FAQ,
4) combined with earlier explanations of the Nutch configuration file written by other users.
It consists mainly of these four parts.

Remarks of this kind in the document are the author's extra, non-translation commentary. The first kind appears before a block of translated properties as an introduction, the second kind after a block as further explanation; the two do not necessarily appear in pairs.

  http.agent.name
  NutchCVS
  Our HTTP 'User-Agent' request header.






  http.robots.agents
  NutchCVS,Nutch,*
  The agent strings we will look for in robots.txt files; there can be several,
  comma-separated, in decreasing order of precedence.





  http.robots.403.allow
  true
  Some servers return HTTP status 403 (Forbidden) when /robots.txt does not exist. This should probably still mean that we are allowed to crawl the site. If this property is set to false, we treat such a site as forbidding crawling and do not crawl it.



  http.agent.description
  Nutch
  Also used in the User-Agent header. A further description of the bot; it (the string in this value) appears in parentheses after agent.name.
 




  http.agent.url
  http://lucene.apache.org/nutch/bot.html
  Also used in the User-Agent header. It (the string in this value) appears in the string after agent.name; it is simply a URL used to advertise the crawler.
 




  http.agent.email
  [email protected]
  An e-mail address to advertise in the HTTP 'From' request header and in the User-Agent header.



  http.agent.version
  0.7.2
  A version string to advertise in the User-Agent header.
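
Nutch reads these defaults from nutch-default.xml, but you should not edit that file itself; overrides belong in conf/nutch-site.xml, which takes precedence. In Nutch 1.0 the fetcher refuses to run if http.agent.name is left empty, so this is usually the first thing to configure. A minimal sketch of the agent-related overrides (the agent name, URL and e-mail below are placeholders, substitute your own):

<property>
  <name>http.agent.name</name>
  <value>MyNutchSpider</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>an experimental crawler based on Nutch</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/bot.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>[email protected]</value>
</property>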



  http.timeout
  10000
  The default network timeout, in milliseconds.



  http.max.delays
  3
  The number of times a fetch of a page may be deferred. Each time Nutch finds that a host is busy, it defers for fetcher.server.delay. After http.max.delays deferrals have occurred, the fetch of that page is given up.



  http.content.limit
  65536
  The length limit for downloaded content, in bytes.
  If this value is non-zero, content longer than it will be truncated; otherwise, nothing is truncated.
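
65536 bytes (64 KB) is often too small for today's pages, and truncation can break parsing; in nutch-site.xml you might raise the limit, for example (262144 is only an illustrative value):

<property>
  <name>http.content.limit</name>
  <value>262144</value>
</property>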
 





  http.proxy.host
 
  The proxy host name. If empty, no proxy is used.



  http.proxy.port
 
  The proxy host port.
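
If the crawler has to reach the web through an HTTP proxy, set both properties together; the host and port below are placeholders:

<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>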



  http.verbose
  false
  If true, HTTP will log more verbosely.




  http.redirect.max
  3
  The maximum number of redirects to follow when fetching a page. If a page has more redirects than this, the fetcher gives up on it and moves on to the next page.





  file.content.limit
  65536
  The length limit for downloaded content, in bytes.
  If the value is non-zero, content longer than it will be truncated; otherwise (zero or negative), nothing is truncated.
 




  file.content.ignored
  true
  If true, no file content is stored during the fetch.
  This is usually what we want, because a file:// URL usually means the file is local and we can crawl and index it directly. Otherwise (if false), file contents will be stored.
  !! NOT IMPLEMENTED YET !!
 






  ftp.username
  anonymous
  The ftp login user name.



  ftp.password
  [email protected]
  The ftp login password.



  ftp.content.limit
  65536
  The length limit for downloaded content, in bytes.
  If this value is greater than zero, content longer than it will be truncated; otherwise (zero or negative), nothing is truncated. Caution: the classical
  ftp RFCs never define partial transfers and, in fact, some ftp servers cannot handle the client closing the connection forcibly.
  We try hard to handle such situations so that things keep running smoothly.
 




  ftp.timeout
  60000
  The default ftp client socket timeout, in milliseconds. Please also see the ftp.keep.connection property below.



  ftp.server.timeout
  100000
  An estimate of the ftp server's idle time, in milliseconds. For most ftp servers 120000 milliseconds is typical.
  It is better to be conservative here. Together with the ftp.timeout property, it is used to decide whether we need to delete (kill) the current ftp.client instance and force a restart of another one. This is necessary because a fetcher thread may not manage to issue the next request in time before the ftp client is disconnected by the remote server for idling too long
  (it may be stuck doing nothing). Used only when ftp.keep.connection (see below) is true.
 




  ftp.keep.connection
  false
  Whether to keep the ftp connection open. Useful when crawling the same host over and over again. If true, it skips the connection, login and directory-listing parser setup for subsequent URLs on that host. If you set it to true, you must make sure that:
  (1) ftp.timeout is less than ftp.server.timeout, and
  (2) ftp.timeout is greater than (fetcher.threads.fetch * fetcher.server.delay);
  otherwise a lot of "delete client because idled too long" messages will appear in the thread logs.
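
With the defaults in this file the two conditions already hold: ftp.timeout (60000 ms) is below ftp.server.timeout (100000 ms), and it is above fetcher.threads.fetch * fetcher.server.delay = 10 * 5.0 s = 50000 ms. A sketch of an override that switches connection reuse on while keeping both constraints satisfied:

<property>
  <name>ftp.keep.connection</name>
  <value>true</value>
</property>
<property>
  <name>ftp.timeout</name>
  <value>60000</value>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>100000</value>
</property>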




  ftp.follow.talk
  false
  Whether to log the dialogue between our client and the remote ftp server. Useful for debugging.





  db.default.fetch.interval
  30
  The default number of days between re-fetches of a page.
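
For sites whose content changes often you can shorten this interval; for example, to re-fetch roughly once a week (7 is just an example value):

<property>
  <name>db.default.fetch.interval</name>
  <value>7</value>
</property>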
 




  db.ignore.internal.links
  true
  If true, when adding new links for a page, links from the same host are ignored. This is a very effective way to limit the size of the link database, keeping only the highest-quality links.
 





  db.score.injected
  1.0
  The score assigned to new pages added by the injector.
 





  db.score.link.external
  1.0
  The score factor for new pages added due to a link from
  another host, relative to the referencing page's score.
 




  db.score.link.internal
  1.0
  The score factor for pages added due to a link from the
  same host, relative to the referencing page's score.
 




  db.max.outlinks.per.page
  100
  The maximum number of outlinks from a single page that we will process.



  db.max.anchor.length
  100
  The maximum length of a link anchor (the link text).



  db.fetch.retry.max
  3
  The maximum number of fetch retries.





  fetchlist.score.by.link.count
  true
  If true, set page scores on fetchlist entries based on
  log(number of anchors), instead of using original page scores. This
  results in prioritization of pages with many incoming links.
 






  fetcher.server.delay
  5.0
  The number of seconds the fetcher will delay between
   successive requests to the same server.




  fetcher.threads.fetch
  10
  The number of fetcher threads used at once.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).




  fetcher.threads.per.host
  1
  The maximum number of threads that may be fetching from the same host at once.
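
fetcher.server.delay, fetcher.threads.fetch and fetcher.threads.per.host together decide how aggressively Nutch hits each site. A cautious setup for crawling sites you do not own might look like this (the values are only examples; the delay is in seconds):

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>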



  fetcher.verbose
  false
  If true, the fetcher will log more verbosely.




  parser.threads.parse
  10
  The number of parsing threads ParseSegment should use at once.





  io.sort.factor
  100
  The number of streams to merge at once while sorting
  files.  This determines the number of open file handles.




  io.sort.mb
  100
  The total amount of buffer memory to use while sorting
  files, in megabytes.  By default, gives each merge stream 1MB, which
  should minimize seeks.




  io.file.buffer.size
  131072
  The size of buffer for use in sequence files.
  The size of this buffer should probably be a multiple of hardware
  page size (4096 on Intel x86), and it determines how much data is
  buffered during read and write operations.


 



  fs.default.name
  local
  The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.




  ndfs.name.dir
  /tmp/nutch/ndfs/name
  Determines where on the local filesystem the NDFS name node
      should store the name table.




  ndfs.data.dir
  /tmp/nutch/ndfs/data
  Determines where on the local filesystem an NDFS data node
      should store its blocks.






  mapred.job.tracker
  localhost:8010
  The host and port that the MapReduce job tracker runs at.
 




  mapred.local.dir
  /tmp/nutch/mapred/local
  The local directory where MapReduce stores temporary files
      related to tasks and jobs.
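
The defaults above run everything on the local machine. To run against a name node and a job tracker instead, you would point these properties at the corresponding host:port pairs; note that this area changed a lot between versions, and in Nutch 1.0 the equivalent settings live in Hadoop's configuration (conf/hadoop-site.xml) rather than here, so treat this only as a sketch with placeholder host names:

<property>
  <name>fs.default.name</name>
  <value>namenode.example.com:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker.example.com:8010</value>
</property>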
 






  indexer.score.power
  0.5
  Determines the power of link analysis scores.  Each
  page's boost is set to score^scorePower, where
  score is its link analysis score and scorePower is the
  value of this parameter.  This is compiled into indexes, so, when
  this is changed, pages must be re-indexed for it to take
  effect.




  indexer.boost.by.link.count
  true
  When true, scores for a page are multiplied by the log of
  the number of incoming links to the page.




  indexer.max.title.length
  100
  The maximum number of characters of a title that are indexed.
 




  indexer.max.tokens
  10000
 
  The maximum number of tokens that will be indexed for a single field
  in a document. This limits the amount of memory required for
  indexing, so that collections with very large files will not crash
  the indexing process by running out of memory.

  Note that this effectively truncates large documents, excluding
  from the index tokens that occur further in the document. If you
  know your source documents are large, be sure to set this value
  high enough to accommodate the expected size. If you set it to
  Integer.MAX_VALUE, then the only limit is your memory, but you
  should anticipate an OutOfMemoryError.
 




  indexer.mergeFactor
  50
  The factor that determines the frequency of Lucene segment
  merges. This must not be less than 2, higher values increase indexing
  speed but lead to increased RAM usage, and increase the number of
  open file handles (which may lead to "Too many open files" errors).
  NOTE: the "segments" here have nothing to do with Nutch segments, they
  are a low-level data unit used by Lucene.
 




  indexer.minMergeDocs
  50
  This number determines the minimum number of Lucene
  Documents buffered in memory between Lucene segment merges. Larger
  values increase indexing speed and increase RAM usage.
 




  indexer.maxMergeDocs
  2147483647
  This number determines the maximum number of Lucene
  Documents to be merged into a new Lucene segment. Larger values
  increase indexing speed and reduce the number of Lucene segments,
  which reduces the number of open file handles; however, this also
  increases RAM usage during indexing.
 




  indexer.termIndexInterval
  128
  Determines the fraction of terms which Lucene keeps in
  RAM when searching, to facilitate random-access.  Smaller values use
  more memory but make searches somewhat faster.  Larger values use
  less memory but make searches somewhat slower.
 







  analysis.common.terms.file
  common-terms.utf8
  The name of a file containing a list of common terms
  that should be indexed in n-grams.






  searcher.dir
  .
 
  Path to root of index directories.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
 




  searcher.filter.cache.size
  16
 
  Maximum number of filters to cache.  Filters can accelerate certain
  field-based queries, like language, document format, etc.  Each
  filter requires one bit of RAM per page.  So, with a 10 million page
  index, a cache size of 16 consumes two bytes per page, or 20MB.
 




  searcher.filter.cache.threshold
  0.05
 
  Filters are cached when their term is matched by more than this
  fraction of pages.  For example, with a threshold of 0.05, and 10
  million pages, the term must match more than 1/20, or 50,000 pages.
  So, if out of 10 million pages, 50% of pages are in English, and 2%
  are in Finnish, then, with a threshold of 0.05, searches for
  "lang:en" will use a cached filter, while searches for "lang:fi"
  will score all 200,000 Finnish documents.
 




  searcher.hostgrouping.rawhits.factor
  2.0
 
  A factor that is used to determine the number of raw hits
  initially fetched, before host grouping is done.
 




  searcher.summary.context
  5
 
  The number of context terms to display preceding and following
  matching terms in a hit summary.
 




  searcher.summary.length
  20
 
  The total number of terms to display in a hit summary.
 






  urlnormalizer.class
  org.apache.nutch.net.BasicUrlNormalizer
  Name of the class used to normalize URLs.



  urlnormalizer.regex.file
  regex-normalize.xml
  Name of the config file used by the RegexUrlNormalizer class.





  mime.types.file
  mime-types.xml
  Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information.




  mime.type.magic
  true
  Defines if the mime content type detector uses magic resolution.
 






  ipc.client.timeout
  10000
  Defines the timeout for IPC calls in milliseconds.





  plugin.folders
  plugins
  Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.




  plugin.includes
  nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)
  Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to include at least the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
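
This is the property you touch most often when adding new formats or protocols. For example, to fetch via the httpclient protocol plugin and also parse PDF files, you could extend the expression as below; this is only a sketch, and the plugin names must match directories that actually exist under your Nutch version's plugins folder:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>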
 




  plugin.excludes
 
  Regular expression naming plugin directory names to exclude. 
 




  parser.character.encoding.default
  windows-1252
  The character encoding to fall back to when no other information
  is available.
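
For a crawl that is mostly Chinese pages (the ikanalyzer scenario of this series), windows-1252 is a poor fallback and pages that declare no charset will come out garbled; you might prefer utf-8 (or gb2312, depending on your sources) as the fallback, for example:

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
</property>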




  parser.html.impl
  neko
  HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
 






  urlfilter.regex.file
  regex-urlfilter.txt
  Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.




  urlfilter.prefix.file
  prefix-urlfilter.txt
  Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.




  urlfilter.order
 
  The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on the end result, but it may have performance implications, depending
  on relative expensiveness of filters.
 






  extension.clustering.hits-to-cluster
  100
  Number of snippets retrieved for the clustering extension
  if clustering extension is available and user requested results
  to be clustered.




  extension.clustering.extension-name
 
  Use the specified online clustering extension. If empty,
  the first available extension will be used. The "name" here refers to an 'id'
  attribute of the 'implementation' element in the plugin descriptor XML
  file.






  extension.ontology.extension-name
 
  Use the specified online ontology extension. If empty,
  the first available extension will be used. The "name" here refers to an 'id'
  attribute of the 'implementation' element in the plugin descriptor XML
  file.




  extension.ontology.urls
 
 

  Urls of owl files, separated by spaces, such as
  http://www.example.com/ontology/time.owl
  http://www.example.com/ontology/space.owl
  http://www.example.com/ontology/wine.owl
  Or
  file:/ontology/time.owl
  file:/ontology/space.owl
  file:/ontology/wine.owl
  You have to make sure each url is valid.
  By default, there is no owl file, so query refinement based on ontology
  is silently ignored.
 






  query.url.boost
  4.0
  Used as a boost for url field in Lucene query.
 




  query.anchor.boost
  2.0
  Used as a boost for anchor field in Lucene query.
 





  query.title.boost
  1.5
  Used as a boost for title field in Lucene query.
 




  query.host.boost
  2.0
  Used as a boost for host field in Lucene query.
 




  query.phrase.boost
  1.0
  Used as a boost for phrase in Lucene query.
  Multiplied by boost for field phrase is matched in.
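
These boosts decide how much each field contributes to ranking. If page titles matter more for your collection, you could raise query.title.boost above its default of 1.5 in nutch-site.xml, for example (2.0 is just an illustration):

<property>
  <name>query.title.boost</name>
  <value>2.0</value>
</property>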
 






  lang.ngram.min.length
  1
  The minimum size of the n-grams used to identify the
  language (must be between 1 and lang.ngram.max.length).
  The larger the range between lang.ngram.min.length and
  lang.ngram.max.length, the better the identification, but
  the slower it is.
 




  lang.ngram.max.length
  4
  The maximum size of the n-grams used to identify the
  language (must be between lang.ngram.min.length and 4).
  The larger the range between lang.ngram.min.length and
  lang.ngram.max.length, the better the identification, but
  the slower it is.
 




  lang.analyze.max.length
  2048
  The maximum number of bytes of data used to identify
  the language (0 means full content analysis).
  The larger this value, the better the analysis, but the
  slower it is.
 



