Nutch: configuration file changes for running a crawl

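I spent two solid days studying Nutch's crawl efficiency. For my purposes only a site's HTML pages matter, and everything else gets filtered out. Quite a few configuration files are involved, so I am recording the changes here for future reference: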

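1 nutch-default.xml
   a. http.content.limit = -1: fetch the entire content of each HTML page, with no truncation.
   b. fetcher.threads.per.host = 5 and fetcher.threads.fetch = 100. If fetcher.threads.per.host is 1, the fetcher.threads.fetch thread count has no effect.
   c. plugin.includes: add urlfilter-(regex|prefix|suffix) so that URLs are filtered.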
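A minimal sketch of how the settings above might look as property entries. It assumes the overrides are placed in conf/nutch-site.xml, the usual place for local changes, rather than edited directly in nutch-default.xml. The values are the ones listed above. plugin.includes is only described in a comment, because the rest of its value depends on the Nutch version in use.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- -1 disables the size limit, so the whole page is downloaded -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

  <!-- maximum number of threads allowed to hit one host at the same time -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>

  <!-- total number of fetcher threads -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>

  <!-- plugin.includes: copy the existing value from nutch-default.xml and
       make sure it contains urlfilter-(regex|prefix|suffix) -->
</configuration>
```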
2 regex-urlfilter.txt
  # skip file: ftp: and mailto: urls
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]
  -[\(\)\{\}\<\>\\]

  # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
  -.*(/[^/]+)/[^/]+\1/[^/]+\1/

  # skip URLs containing a run of 20 or more consecutive alphanumeric characters
  -[A-Za-z0-9]{20,}

  # accept anything else
  +.
3 suffix-urlfilter.txt
  Add here the link suffixes you want to filter out, e.g. .zip .rar
4 prefix-urlfilter.txt
  http://
  https://
