Making the filters specified in crawl-urlfilter.txt take effect

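I spent several days searching online for how to make Nutch crawl only the pages allowed by my filter, but it never worked. For example, my urls/url.txt file contains http://www.360buy.com/

and my crawl-urlfilter.txt looks like this: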

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*360buy.com/
#+^http://([a-z0-9]*\.)*360buy.com/product/+$
+^http://www.360buy.com/product/([0-9]{6}).html

# skip everything else
-.

What I want to crawl are the pages matching http://www.360buy.com/product/([0-9]{6}).html, but the results always include every directory matched by +^http://([a-z0-9]*\.)*360buy.com/
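
The comments in the filter file say that the first matching pattern decides whether a URL is kept. The following is a minimal Python sketch of that first-match rule, not Nutch's own code: the helper name `first_match` is made up for illustration, and it assumes each pattern is searched for anywhere in the URL (`re.search`). It can be used to see which rule a given URL stops at.

```python
import re

# Rules copied from the crawl-urlfilter.txt above, in file order.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
    r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-[?*!@=]",
    r"-.*(/.+?)/.*?\1/.*?\1/",
    r"+^http://([a-z0-9]*\.)*360buy.com/",
    r"+^http://www.360buy.com/product/([0-9]{6}).html",
    r"-.",
]


def first_match(url, rules=RULES):
    """Hypothetical helper: return (accepted, rule) for the first matching rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+", rule
    # No pattern matched: per the file's comments, the URL is ignored.
    return False, None


for url in (
    "http://www.360buy.com/",
    "http://www.360buy.com/help/index.html",
    "http://www.360buy.com/product/123456.html",
):
    accepted, rule = first_match(url)
    print(url, "->", "accept" if accepted else "reject", "via", rule)
```

In this sketch every 360buy.com URL, including the product page, stops at the broad `+^http://([a-z0-9]*\.)*360buy.com/` rule, so the more specific product rule below it is never reached.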

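Later I found out that urlfilter.order has to be changed to specify the order in which the URL filters are applied. I prefer regular expressions, so I set it to org.apache.nutch.urlfilter.regex.RegexURLFilter.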
<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

 

Adding this property to nutch-site.xml is all it takes.
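
To double-check that the property actually ended up in the file, a small Python sketch like the one below can read it back. It assumes nutch-site.xml uses the usual `<configuration>` root with `<property>` children. The `SAMPLE` string and the helper name `get_property` are illustrative stand-ins, not part of Nutch; a real check would read the file from the installation's conf directory instead.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of nutch-site.xml (assumed layout).
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>urlfilter.order</name>
    <value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Hypothetical helper: return the value of the named property, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return (prop.findtext("value") or "").strip()
    return None


print(get_property(SAMPLE, "urlfilter.order"))
```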

 
