1. vi conf/crawl-urlfilter.txt
    # skip URLs containing certain characters as probable queries, etc.
    # (comment out the default skip rule so URLs carrying ?, = and & are no longer rejected)
    #-[?*!@=]

    # accept URLs containing certain characters as probable queries, etc.
    +[?=&]

    # the application link /apps/application.php?id= appears in the pages as a dynamic relative URL, so accept it explicitly
    +^http://www.test01.com/apps/application.php?id=([0-9])
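Before launching a crawl it is worth checking that the rules really accept the dynamic URL. Nutch 1.x ships a small helper class, org.apache.nutch.net.URLFilterChecker, which reads URLs from stdin and echoes them back prefixed with + (accepted) or - (rejected); the exact class name and flags can differ between versions, so treat the following as a sketch rather than a guaranteed invocation:

    echo "http://www.test01.com/apps/application.php?id=1" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

When run outside the crawl command this check typically picks up regex-urlfilter.txt rather than crawl-urlfilter.txt, which is one more reason to keep both files in sync, as step 2 notes.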
2. vi conf/regex-urlfilter.txt
## add here the same rules that were added in step 1 (repeated below for reference)
Note: both files need to be modified, because Nutch loads the filter rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt.
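For concreteness, the additions to conf/regex-urlfilter.txt mirror step 1; remember to comment out the default -[?*!@=] skip rule in this file as well, since the first matching rule wins:

    # accept URLs containing certain characters as probable queries, etc.
    +[?=&]
    +^http://www.test01.com/apps/application.php?id=([0-9])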
3. vi conf/nutch-default.xml or conf/nutch-site.xml
    urlfilter.order = org.apache.nutch.urlfilter.regex.RegexURLFilter

    The order by which url filters are applied. If empty, all available url filters
    (as dictated by properties plugin-includes and plugin-excludes above) are loaded
    and applied in system defined order. If not empty, only named filters are loaded
    and applied in given order. For example, if this property has value:
    org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
    then RegexURLFilter is applied first, and PrefixURLFilter second. Since all
    filters are AND'ed, filter ordering does not have impact on end result, but it
    may have performance implication, depending on relative expensiveness of filters.
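In the XML config files this is written as a standard Nutch property element. A sketch of the entry as it could be added to conf/nutch-site.xml (the description is optional when overriding):

    <property>
      <name>urlfilter.order</name>
      <value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
    </property>

RegexURLFilter is only loaded if the urlfilter-regex plugin appears in plugin.includes, which is the case in the stock nutch-default.xml; verify this if you have trimmed the plugin list.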
4. Modify conf/nutch-default.xml
    db.max.outlinks.per.page = -1

    The maximum number of outlinks that we'll process for a page. If this value is
    nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed
    for a page; otherwise, all outlinks will be processed.
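In conf/nutch-default.xml this property already exists (its stock value is typically 100), so only the <value> needs changing to -1 to lift the per-page outlink cap. If you prefer to leave the defaults untouched, the same element can instead go in conf/nutch-site.xml; a sketch:

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
      <description>The maximum number of outlinks that we'll process for a page.
      If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
      will be processed for a page; otherwise, all outlinks will be processed.
      </description>
    </property>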