I have been using Heritrix to crawl websites lately. I left it running overnight, but oddly it crawled very slowly: after more than 8 hours a single site still had not finished. So I went through the logs to figure out why.
-----===== SNOOZED QUEUES =====-----
SNOOZED#0:
Queue us,imageshack,img245,+2 (p1)
  1 items
    wakes in: 99m19s74ms
    last enqueued: http://img245.xxx.us/img245/596/193183637x01ss500sclzzzbx0.jpg
    last peeked: http://img245.xxxx.us/img245/596/193183637x01ss500sclzzzbx0.jpg
   total expended: 12 (total budget: -1)
   active balance: 2988
   last(avg) cost: 1(1)
   totalScheduled fetchSuccesses fetchFailures fetchDisregards fetchResponses robotsDenials successBytes totalBytes fetchNonResponses
   2 1 0 0 1 0 59 59 12
   SimplePrecedenceProvider
   1
The SNOOZED queue held a handful of image URIs that had been stuck there for a very long time. Opening them in a browser showed that the images no longer exist, which is why those URIs kept sitting in the queue.
Next I looked at the Heritrix source. WorkQueueFrontier contains the code below; because the image cannot be fetched, processing falls into the needsRetrying branch.
if (needsRetrying(curi)) {
    // Consider errors which can be retried, leaving uri atop queue
    if (curi.getFetchStatus() != S_DEFERRED) {
        wq.expend(curi.getHolderCost()); // all retries but DEFERRED cost
    }
    long delay_sec = retryDelayFor(curi);
    curi.processingCleanup(); // lose state that shouldn't burden retry
    wq.unpeek(curi);
    // TODO: consider if this should happen automatically inside unpeek()
    wq.update(this, curi); // rewrite any changes
    if (delay_sec > 0) {
        long delay_ms = delay_sec * 1000;
        snoozeQueue(wq, now, delay_ms);
    } else {
        reenqueueQueue(wq);
    }
    // Let everyone interested know that it will be retried.
    appCtx.publishEvent(
        new CrawlURIDispositionEvent(this, curi, DEFERRED_FOR_RETRY));
    doJournalRescheduled(curi);
    return;
}
retryDelayFor() computes how long to wait before retrying a failed fetch:
/**
 * Return a suitable value to wait before retrying the given URI.
 *
 * @param curi
 *            CrawlURI to be retried
 * @return millisecond delay before retry
 */
protected long retryDelayFor(CrawlURI curi) {
    int status = curi.getFetchStatus();
    return (status == S_CONNECT_FAILED || status == S_CONNECT_LOST ||
            status == S_DOMAIN_UNRESOLVABLE)
            ? getRetryDelaySeconds()
            : 0; // no delay for most
}

public int getRetryDelaySeconds() {
    return (Integer) kp.get("retryDelaySeconds");
}
Heritrix waits 900 seconds by default, i.e. 15 minutes. A URI that keeps failing can therefore only be retried about 4 times per hour, or roughly 32 times in 8 hours, which is why the crawl never finished (the rough calculation after the next snippet makes this concrete).
/** for retryable problems, seconds to wait before a retry */
{
    setRetryDelaySeconds(900);
}
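A quick back-of-the-envelope check (plain Java, not Heritrix code; the class and method names are made up for illustration) shows how hard the default delay caps the retry rate over an 8-hour crawl, and what lowering it to the 90 seconds used below buys:

// Rough estimate of how many times a snoozed queue can retry a failing URI
// within a fixed crawl window, given the frontier's retryDelaySeconds.
public class RetryBudget {

    static long maxRetries(long crawlWindowSeconds, long retryDelaySeconds) {
        // each failed attempt snoozes the queue for retryDelaySeconds
        return crawlWindowSeconds / retryDelaySeconds;
    }

    public static void main(String[] args) {
        long eightHours = 8 * 3600L;
        System.out.println(maxRetries(eightHours, 900)); // default 900s  -> 32 retries
        System.out.println(maxRetries(eightHours, 90));  // lowered to 90s -> 320 retries
    }
}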
Once the cause was clear, the fix was simple: adjust the frontier configuration.
<!-- FRONTIER: Record of all URIs discovered and queued-for-collection -->
<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
    <!-- <property name="holdQueues" value="true" /> -->
    <!-- <property name="queueTotalBudget" value="-1" /> -->
    <!-- <property name="balanceReplenishAmount" value="3000" /> -->
    <!-- <property name="errorPenaltyAmount" value="100" /> -->
    <!-- <property name="precedenceFloor" value="255" /> -->
    <!-- <property name="queuePrecedencePolicy">
            <bean class="org.archive.crawler.frontier.precedence.BaseQueuePrecedencePolicy" />
         </property> -->
    <!-- <property name="snoozeLongMs" value="300000" /> -->
    <property name="retryDelaySeconds" value="90" />
    <!-- <property name="maxRetries" value="30" /> -->
    <!-- <property name="recoveryDir" value="logs" /> -->
    <!-- <property name="recoveryLogEnabled" value="true" /> -->
    <!-- <property name="maxOutlinks" value="6000" /> -->
    <!-- <property name="outboundQueueCapacity" value="50" /> -->
    <!-- <property name="inboundQueueMultiple" value="3" /> -->
    <!-- <property name="dumpPendingAtClose" value="false" /> -->
</bean>
This is the Heritrix 3 configuration; retryDelaySeconds is lowered to 90, so a failing queue only waits a minute and a half before retrying.
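In a standard Heritrix 3 setup this frontier bean lives in the job's crawler-beans.cxml, so after editing it the job has to be re-built and re-launched for the new value to take effect.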
With Heritrix 1, the equivalent setting can be changed through the admin web console instead.
After this change the crawl sped up dramatically: a site that previously took 8 hours now finishes in about 2.
On top of that, if you use Heritrix's incremental crawling, the next crawl of the same site will be much faster still. Problem solved.