Heritrix 插件 DeDuplicator

DeDuplicator for Heritrix 3 - 27/07/2010


Version 3.0.0-SNAPSHOT-20100727 is now available here.

This version is compiled against Heritrix 3.0.0.

It also updates to use Lucene 3.0.2 (from 2.0.0). Please note that changes in the Lucene library mean that memory usage will be approximately 40% greater than before. Memory usage appears to be approximately 5 bytes per URL in index, as compared to 3.6 bytes per URL previously. Query times have however improved significantly and are now fixed time without regard for the index size. For large indexes this can mean as much as 10-30 times shorter query times. Building indexes is also much faster (approximately 3-4 times as fast).

Currently the DeDupFetchHTTP processor has not been converted.

This release heralds the end of the existing DeDuplicator, built against Heritrix 1.14. One final release (1.0.0) will be released soon with some accumulated bugfixes. A release candidate is available here.

你可能感兴趣的:(Lucene)