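This is an original article; when reposting, please cite the source: http://guoyunsky.iteye.com/blog/1744452
My Sina Weibo: http://weibo.com/guoyunwb
The new features in Heritrix 3.0 are impressive. Performance, functionality, configuration flexibility, and runtime control have all improved considerably, which arguably makes it a better fit for vertical (focused) crawling.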
I. English original
1. Ability to run multiple crawl jobs simultaneously. The only limit on the number of crawl jobs that can run concurrently is the memory allocated to Heritrix.
2. Single XML configuration file based on the Spring framework. This file replaces order.xml and other Heritrix 1.x configuration files.
3. Ability to browse and modify the configured Spring beans through an easy-to-use browser based utility. See Bean Browser.
4. Enhanced extensibility through the Spring framework. For example, domain overrides can be set at a very fine-grained level. See Sheets.
5. More secure user control console. HTTPS is used to access and manipulate the user control console.
6. Increased scalability. Previously, crawls with large seed values (tens or hundreds of millions) might attempt to utilize more memory than allocated to Heritrix.
This would cause the crawl to crash. Heritrix 3.0 eliminates these problems, allowing stable processing of large scale crawls.
7. Increased flexibility when modifying a running crawl. Running crawls can be modified by using the Bean Browser or by using the Action Directory.
8. Introduction of parallel queues. When crawling specific sites that can handle large amounts of traffic, the parallel queues option can be used to open many concurrent crawling connections to a single site.
9. A Scripting Console that accepts script input in various formats such as AppleScript and ECMAScript. Scripting can be used to programmatically access and manipulate the core components of Heritrix.
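As a rough illustration of item 7, here is a minimal sketch of feeding extra seeds to a running crawl through the Action Directory. It assumes the job has an `action` directory that Heritrix scans periodically and that files ending in `.seeds` are treated as additional seeds. The helper name `drop_seeds` and the file name `extra` are made up for this example. The file is composed next to the action directory and then moved in, so Heritrix should never see a half-written file.

```python
import os
import tempfile
from pathlib import Path


def drop_seeds(action_dir, urls, name="extra"):
    """Write urls to a staging file, then move it into action_dir as <name>.seeds."""
    action_dir = Path(action_dir)
    target = action_dir / f"{name}.seeds"
    # Stage the file in the parent directory (same filesystem), outside the
    # directory that Heritrix is assumed to be watching.
    fd, tmp_path = tempfile.mkstemp(dir=action_dir.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        for url in urls:
            fh.write(url + "\n")
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, target)
    return target
```

Usage would look like `drop_seeds("jobs/myjob/action", ["http://example.com/"])`, where `jobs/myjob` is a hypothetical job directory.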
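II. Translation
1. Multiple crawl jobs can run at the same time; the only limit on the number of concurrent jobs is the memory allocated to Heritrix.
2. A single XML configuration file, based on the Spring framework, replaces order.xml and the other configuration files from Heritrix 1.x.
3. Spring beans can be browsed and modified conveniently through a browser-based tool (the Bean Browser).
4. Better extensibility through the Spring framework. Settings can be overridden at a very fine-grained level. See Sheets.
5. A more secure control console, accessed and operated over HTTPS.
6. Better scalability. In earlier versions, a crawl with tens of millions of seeds or more would load them all into memory first, which could exceed the memory allocated to Heritrix and cause an out-of-memory error. Heritrix 3.0 fixes this and supports crawls at that scale.
7. A running crawl job can be modified flexibly, either through the Bean Browser or through the Action Directory.
8. Parallel queues. Previously a given site had only one queue, so that queue could grow very large and the crawl would be slow. With parallel queues, the URLs of a single site are split across several queues and crawled in parallel.
9. A scripting console that accepts scripts in various languages, such as AppleScript, ECMAScript (JS), and Python, to access and control Heritrix's core components (very interesting).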
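To make item 8 more concrete, here is a conceptual sketch of how URLs from one site might be spread over several queues. This is not Heritrix's actual implementation. The function names, the `host#N` queue naming, and the choice of SHA-1 over the path are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def queue_key(url, parallel_queues=1):
    """Return the name of the queue a URL would be assigned to."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parallel_queues <= 1:
        # Classic behaviour: one queue per host.
        return host
    # A stable hash of path and query picks one of N sub-queues for the host.
    rest = parts.path + "?" + parts.query
    digest = hashlib.sha1(rest.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % parallel_queues
    return f"{host}#{bucket}"


def assign(urls, parallel_queues=1):
    """Group URLs by queue name."""
    queues = defaultdict(list)
    for url in urls:
        queues[queue_key(url, parallel_queues)].append(url)
    return dict(queues)
```

With `parallel_queues=1` every URL of `example.com` lands in a single queue. With a larger value the same URLs are spread over up to that many queues, each of which can be worked on by a different thread. The trade-off is more load on the target site, which is why the wiki limits this to sites that can handle large amounts of traffic.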
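III. Additional notes
What follows is only my own understanding, adding to the list of Heritrix 3.0 changes from the perspective of usage and source code. The features above come from the wiki; I think the new features also include the following:
1. A very strong addition: incremental crawling, which is also easy to extend.
2. Heritrix is now controlled through a REST interface (built on the Restlet framework). Previously control was Servlet-based, with a JSP UI.
3. Crawls can be changed dynamically, without a restart. In earlier versions, changing a crawl (for example adding classes or editing the order.xml configuration) meant stopping Heritrix first; 3.0 allows such changes on the fly, in several respects.
4. More complete reporting. The various log files give a clearer, more direct view of how the crawl is going. I will cover this in detail later, since I have found that many people do not know how to monitor a crawl through the logs.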
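As a small illustration of the REST-based control in item 2, the sketch below builds (but does not send) a request that asks the engine to perform an action on a job. It assumes the engine listens at `https://localhost:8443/engine` and accepts a form field named `action` on the job resource, as I understand the Heritrix 3 REST interface. The helper name `job_action_request` and the job name `myjob` are hypothetical.

```python
import urllib.parse
import urllib.request

# Job actions assumed to be accepted by the REST interface.
JOB_ACTIONS = {"build", "launch", "pause", "unpause",
               "checkpoint", "terminate", "teardown"}


def job_action_request(job, action, base="https://localhost:8443/engine"):
    """Build a POST request that applies `action` to the crawl job `job`."""
    if action not in JOB_ACTIONS:
        raise ValueError(f"unknown job action: {action}")
    url = f"{base}/job/{urllib.parse.quote(job, safe='')}"
    body = urllib.parse.urlencode({"action": action}).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Accept": "application/xml"},  # ask for XML, not the HTML UI
    )
```

For example, `job_action_request("myjob", "launch")` yields a POST to `https://localhost:8443/engine/job/myjob` with the body `action=launch`. Actually sending it would also need the console's credentials (for instance through `urllib.request.HTTPDigestAuthHandler`) and a decision about how to treat the console's HTTPS certificate. Both are left out here.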