Dissecting The Nutch Crawler - Summary: Nutch crawler extension points

Original English source: DissectingTheNutchCrawler
If you repost this article, please credit the source: http://blog.csdn.net/pwlazy

Summary: Nutch crawler extension points

The main ways to configure the Nutch crawler are as follows:

  1. Configuration files. Default values are in nutch-default.xml, and you should override them in nutch-site.xml.

  2. URLFilter interface. By default, the class net.nutch.net.RegexURLFilter is used, which reads regular expression patterns from regex-urlfilter.txt. So, you can:

    • Edit that file to tune its behavior

    • Or, write a new class that implements net.nutch.net.URLFilter, and change nutch-site.xml to use it.

  3. Protocol interface. To add support for a new protocol, write a plugin (or take an existing one) and add it to the "plugins" directory. To change how an existing protocol behaves, modify the appropriate plugin.

  4. Parser interface. As with Protocol, you should create or add a plugin for any new content type. To modify how an existing content type is parsed, replace the appropriate plugin.

  5. If you need to make other changes, refer to our discussion of Fetcher and FetchListTool. Consider subclassing these classes, overriding the appropriate method, then calling your class from the "nutch" script using its fully qualified class name.
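
As a rough illustration of items 1 and 2, a nutch-site.xml override might look like the fragment below. This is a sketch under assumptions, not a verified configuration: the root element name, the property names (http.agent.name, urlfilter.class) and the class com.example.crawl.ImageSkippingURLFilter are not taken from the text above, so check them against the nutch-default.xml that ships with your Nutch version.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: put only the values you want to change here;
     everything else keeps its default from nutch-default.xml. -->
<!-- Assumption: the root element matches the one used in your nutch-default.xml. -->
<nutch-conf>

  <!-- Assumed property name: identifies the crawler to remote servers. -->
  <property>
    <name>http.agent.name</name>
    <value>ExampleCrawler</value>
  </property>

  <!-- Assumed property name: selects the URLFilter implementation.
       The class named here is hypothetical. -->
  <property>
    <name>urlfilter.class</name>
    <value>com.example.crawl.ImageSkippingURLFilter</value>
  </property>

</nutch-conf>
```

Keeping nutch-default.xml untouched and overriding only in nutch-site.xml means an upgrade can replace the defaults file without losing local settings.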


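Summary: extension points of the Nutch crawler

The main ways to configure the Nutch crawler are:

  1. Configuration files. nutch-default.xml sets the default values; override the ones you need in nutch-site.xml.
  2. The URLFilter interface. By default the system uses the class net.nutch.net.RegexURLFilter, which reads regular expressions from regex-urlfilter.txt, so you can:

    • Edit regex-urlfilter.txt to adjust the behavior of RegexURLFilter
    • Or write a new class that implements the net.nutch.net.URLFilter interface, then change nutch-site.xml so that it is used

  3. The Protocol interface. To add support for a new protocol, write a plugin or place a suitable existing one in the plugins directory; to change a protocol's behavior, modify the relevant plugin.
  4. The Parser interface. As for parsers (translator's note: the original says "Protocol" here, which is probably a slip), add a plugin for each new content type. If instead you want to change the behavior of an existing plugin, you need to replace it.
  5. For any other change, see our discussion of Fetcher and FetchListTool. You can subclass these classes, override the appropriate methods, and then invoke your class from the nutch script by its fully qualified class name.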
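
To make the second option under item 2 more concrete, here is a minimal sketch of a custom URL filter. It assumes that net.nutch.net.URLFilter declares a single method that returns the URL to accept it and null to reject it; the local URLFilter interface below is a stand-in for that assumption so the example compiles on its own, and ImageSkippingURLFilter and URLFilterDemo are hypothetical names.

```java
import java.util.Locale;

// Stand-in for net.nutch.net.URLFilter so this sketch is self-contained.
// Assumption: the real interface has one method that returns the URL to
// accept it, or null to reject it.
interface URLFilter {
    String filter(String url);
}

// Hypothetical filter that rejects URLs pointing at images and archives.
// It has an implicit no-argument constructor, on the assumption that Nutch
// instantiates the configured filter class by name.
class ImageSkippingURLFilter implements URLFilter {
    private static final String[] BLOCKED_SUFFIXES = {".gif", ".jpg", ".png", ".zip"};

    @Override
    public String filter(String url) {
        if (url == null) {
            return null;
        }
        // Drop the query string and fragment so only the path is inspected.
        String path = url;
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        int fragment = path.indexOf('#');
        if (fragment >= 0) {
            path = path.substring(0, fragment);
        }
        // Compare case-insensitively so ".GIF" is treated like ".gif".
        String lower = path.toLowerCase(Locale.ROOT);
        for (String suffix : BLOCKED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return null; // rejected
            }
        }
        return url; // accepted unchanged
    }
}

class URLFilterDemo {
    public static void main(String[] args) {
        URLFilter filter = new ImageSkippingURLFilter();
        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/logo.GIF",
            "http://example.com/photo.jpg?size=large"
        };
        for (String url : urls) {
            String verdict = filter.filter(url) == null ? "rejected" : "accepted";
            System.out.println(url + " -> " + verdict);
        }
    }
}
```

In a real deployment the class would implement net.nutch.net.URLFilter directly, be placed on the crawler's classpath, and be named in nutch-site.xml. For simple cases, editing regex-urlfilter.txt is likely less work than writing a class.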
