Heritrix 3.1.0 Source Code Analysis (Part 17)

Next we look at the methods surrounding the BdbFrontier's void finished(CrawlURI cURI) method, starting with processFinish:

    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     * TODO: make as many decisions about what happens to the CrawlURI
     * (success, failure, retry) and queue (retire, snooze, ready) as 
     * possible elsewhere, such as in DispositionProcessor. Then, break
     * this into simple branches or focused methods for each case. 
     *  
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    protected void processFinish(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;        
        long now = System.currentTimeMillis();

        // bump the fetch-attempt counter
        curi.incrementFetchAttempts();
        logNonfatalErrors(curi);
        
        WorkQueue wq = (WorkQueue) curi.getHolder();
        // always refresh budgeting values from current curi
        // (whose overlay settings should be active here)
        wq.setSessionBudget(getBalanceReplenishAmount());
        wq.setTotalBudget(getQueueTotalBudget());
        
        assert (wq.peek(this) == curi) : "unexpected peek " + wq;

        int holderCost = curi.getHolderCost();
        // does the URI need to be reprocessed?
        if (needsReenqueuing(curi)) {
            // codes/errors which don't consume the URI, leaving it atop queue
            if(curi.getFetchStatus()!=S_DEFERRED) {
                wq.expend(holderCost); // all retries but DEFERRED cost
            }
            // retry delay (ms)
            long delay_ms = retryDelayFor(curi) * 1000;
            curi.processingCleanup(); // lose state that shouldn't burden retry
            wq.unpeek(curi);
            // write the changes back to the WorkQueue wq
            wq.update(this, curi); // rewrite any changes
            // re-handle the queue's placement
            handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DEFERRED_FOR_RETRY));
            doJournalReenqueued(curi);
            wq.makeDirty();
            return; // no further dequeueing, logging, rescheduling to occur
        }

        // Curi will definitely be disposed of without retry, so remove from queue
        // remove this CrawlURI curi from the WorkQueue wq
        wq.dequeue(this,curi);
        decrementQueuedCount(1);
        largestQueues.update(wq.getClassKey(), wq.getCount());
        log(curi);
        
        if (curi.isSuccess()) {
            // codes deemed 'success' 
            incrementSucceededFetchCount();
            totalProcessedBytes.addAndGet(curi.getRecordedSize());
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,SUCCEEDED));
            doJournalFinishedSuccess(curi);
           
        } else if (isDisregarded(curi)) {
            // codes meaning 'undo' (even though URI was enqueued, 
            // we now want to disregard it from normal success/failure tallies)
            // (eg robots-excluded, operator-changed-scope, etc)
            incrementDisregardedUriCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DISREGARDED));
            holderCost = 0; // no charge for disregarded URIs
            // TODO: consider reinstating forget-URI capability, so URI could be
            // re-enqueued if discovered again
            doJournalDisregarded(curi);
            
        } else {
            // codes meaning 'failure'
            incrementFailedFetchCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,FAILED));
            // if exception, also send to crawlErrors
            if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
                Object[] array = { curi };
                loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                        .toString(), array);
            }        
            // charge queue any extra error penalty
            wq.noteError(getErrorPenaltyAmount());
            doJournalFinishedFailure(curi);
            
        }

        wq.expend(holderCost); // successes & failures charge cost to queue
        // politeness delay (ms)
        long delay_ms = curi.getPolitenessDelay();
        //long delay_ms = 0;
        // re-handle the queue's placement
        handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
        wq.makeDirty();
        
        if(curi.getRescheduleTime()>0) {
            // marked up for forced-revisit at a set time
            curi.processingCleanup();
            curi.resetForRescheduling(); 
            futureUris.put(curi.getRescheduleTime(),curi);
            futureUriCount.incrementAndGet(); 
        } else {
            curi.stripToMinimal();
            curi.processingCleanup();
        }
    }
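Before looking at the helper methods, note the budgeting calls: wq.setSessionBudget(...) and wq.setTotalBudget(...) refresh the queue's spending limits from the current URI's overlay settings, and wq.expend(holderCost) / wq.noteError(...) then charge the finished URI against them. The following is a simplified, hypothetical model of that accounting (the field and method names are illustrative, not Heritrix's actual WorkQueue implementation): once the session balance is spent, the queue becomes a candidate for rotation out of the active set.

    /** Minimal sketch of per-queue budget accounting, loosely modeled on WorkQueue. */
    class BudgetedQueue {
        private int sessionBalance;     // replenished when the queue (re)enters the active set
        private long totalExpenditure;  // lifetime spend, compared against the total budget
        private long totalBudget = -1;  // -1 means "unlimited" in this sketch

        void setSessionBudget(int amount) { this.sessionBalance = amount; }
        void setTotalBudget(long budget)  { this.totalBudget = budget; }

        /** Charge the cost of one finished URI against both budgets. */
        void expend(int cost) {
            sessionBalance -= cost;
            totalExpenditure += cost;
        }

        /** Extra charge applied on failures (cf. wq.noteError(getErrorPenaltyAmount())). */
        void noteError(int penalty) { expend(penalty); }

        /** True when the queue should be deactivated or retired rather than stay ready. */
        boolean overBudget() {
            return sessionBalance <= 0
                    || (totalBudget >= 0 && totalExpenditure >= totalBudget);
        }
    }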

The first helper, needsReenqueuing, decides whether the CrawlURI curi object needs to be put back on its queue:

    /**
     * Checks if a recently processed CrawlURI that did not finish successfully
     * needs to be reenqueued (and thus possibly, processed again after some 
     * time elapses)
     * 
     * @param curi
     *            The CrawlURI to check
     * @return True if we need to retry.
     */
    protected boolean needsReenqueuing(CrawlURI curi) {
        // has the maximum number of fetch attempts (30 by default) been exceeded?
        if (overMaxRetries(curi)) {
            return false;
        }
        // decide based on the fetch status
        switch (curi.getFetchStatus()) {
        case HttpStatus.SC_UNAUTHORIZED:
            // We can get here though usually a positive status code is
            // a success. We get here if there is rfc2617 credential data
            // loaded and we're supposed to go around again. See if any
            // rfc2617 credential present and if there, assume it got
            // loaded in FetchHTTP on expectation that we're to go around
            // again. If no rfc2617 loaded, we should not be here.
            boolean loaded = curi.hasRfc2617Credential();
            if (!loaded && logger.isLoggable(Level.FINE)) {
                logger.fine("Have 401 but no creds loaded " + curi);
            }
            return loaded;
        case S_DEFERRED:
        case S_CONNECT_FAILED:
        case S_CONNECT_LOST:
        case S_DOMAIN_UNRESOLVABLE:
            // these are all worth a retry
            // TODO: consider if any others (S_TIMEOUT in some cases?) deserve
            // retry
            return true;
        case S_UNATTEMPTED:
            if(curi.includesRetireDirective()) {
                return true;
            } // otherwise, fall-through: no status is an error without queue-directive
        default:
            return false;
        }
    }
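The very first check, overMaxRetries(curi), is not reproduced in this post. A minimal sketch of what it presumably does, assuming a configurable retry ceiling (30 by default, as the comment above notes) and CrawlURI's fetch-attempt counter:

    // Hypothetical sketch only; the real Heritrix implementation may differ.
    protected boolean overMaxRetries(CrawlURI curi) {
        // getMaxRetries() is assumed to be a frontier setting, 30 by default
        return curi.getFetchAttempts() >= getMaxRetries();
    }

Because processFinish calls curi.incrementFetchAttempts() before needsReenqueuing, a URI that keeps failing eventually exhausts its retries and falls through to the failure branch of processFinish.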

The retryDelayFor(CrawlURI curi) method supplies the delay (in seconds) used to snooze the WorkQueue wq before the retry:

    /**
     * Return a suitable value to wait before retrying the given URI.
     * 
     * @param curi
     *            CrawlURI to be retried
     * @return millisecond delay before retry
     */
    protected long retryDelayFor(CrawlURI curi) {
        int status = curi.getFetchStatus();
        return (status == S_CONNECT_FAILED || status == S_CONNECT_LOST ||
                status == S_DOMAIN_UNRESOLVABLE)? getRetryDelaySeconds() : 0;
                // no delay for most
    }

getRetryDelaySeconds() defaults to 900 seconds (15 minutes); processFinish multiplies the returned value by 1000 to obtain milliseconds.
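A worked example of how this feeds back into processFinish (values taken from the code above, with default settings; this is not additional Heritrix code):

    // S_CONNECT_FAILED / S_CONNECT_LOST / S_DOMAIN_UNRESOLVABLE with defaults:
    long retryDelaySeconds = 900;                 // getRetryDelaySeconds()
    long delay_ms = retryDelaySeconds * 1000;     // 900,000 ms = 15 minutes
    // handleQueue(wq, ..., now, delay_ms) then takes the snoozeQueue branch.

    // Any other retryable status (e.g. S_DEFERRED, or a 401 awaiting credentials):
    // retryDelayFor(curi) returns 0, so handleQueue falls through to
    // reenqueueQueue(wq) and the queue is eligible for the next next() call.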

In the retry path, the CrawlURI curi object is then written back to the WorkQueue wq via wq.update(this, curi), and handleQueue finally resets the queue's placement: it is retired, snoozed, or handed to reenqueueQueue(wq) for further handling:

    /**
     * Reset the placement of the WorkQueue wq.
     * Send an active queue to its next state, based on the supplied 
     * parameters.
     * 
     * @param wq
     * @param forceRetire
     * @param now
     * @param delay_ms
     */
    protected void handleQueue(WorkQueue wq, boolean forceRetire, long now, long delay_ms) {
        
        inProcessQueues.remove(wq);
        if(forceRetire) {
            retireQueue(wq);
        } else if (delay_ms > 0) {
            snoozeQueue(wq, now, delay_ms);
        } else {
            // Enqueue the given queue to either readyClassQueues or inactiveQueues, as appropriate
            reenqueueQueue(wq);
        }
    }
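Of the three destinations, snoozeQueue is the politeness/retry-wait mechanism: the queue is parked until roughly now + delay_ms and only becomes ready again afterwards. The following is a minimal, hypothetical illustration of that wake-time bookkeeping; the class, methods, and data structure are my own choices, not Heritrix's actual snoozed-queue implementation.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    class SnoozeSketch {
        // queues parked until their wake time; earliest wake time first
        private final PriorityQueue<long[]> snoozed =
                new PriorityQueue<>(Comparator.comparingLong((long[] entry) -> entry[0]));

        /** Park a queue until roughly now + delayMs. */
        void snooze(long queueId, long now, long delayMs) {
            snoozed.add(new long[] { now + delayMs, queueId });
        }

        /** Return the id of a queue whose wake time has passed, or -1 if none is due. */
        long nextDueQueue(long now) {
            long[] head = snoozed.peek();
            if (head != null && head[0] <= now) {
                snoozed.poll();
                return head[1];
            }
            return -1;
        }
    }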

Moving on to the non-retry path: wq.dequeue(this, curi) removes the CrawlURI curi object from the WorkQueue wq.

Finally, the queue placement of the WorkQueue wq is reset once more, this time using the politeness delay:

 long delay_ms = curi.getPolitenessDelay(); 
 handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);

The handleQueue method was shown above.
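Note that the politeness delay itself is not computed in the frontier; as the TODO in the javadoc suggests, that kind of decision belongs in DispositionProcessor, and processFinish simply reads the result back via curi.getPolitenessDelay(). The sketch below shows, in rough and hypothetical form, what such a computation typically looks like; the parameter names, formula, and example values are assumptions for illustration, not a quotation of DispositionProcessor:

    // Hypothetical politeness-delay computation, for illustration only.
    long computePolitenessDelay(long fetchDurationMs, double delayFactor,
                                long minDelayMs, long maxDelayMs) {
        // wait some multiple of how long the last fetch took...
        long delay = (long) (fetchDurationMs * delayFactor);
        // ...but never less than the floor or more than the ceiling
        delay = Math.max(delay, minDelayMs);
        delay = Math.min(delay, maxDelayMs);
        return delay;
    }

    // e.g. a 500 ms fetch with delayFactor = 5.0, clamped to [3000, 30000] ms,
    // yields a 3000 ms delay before the queue wakes again.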

---------------------------------------------------------------------------

This Heritrix 3.1.0 source code analysis series is my own original work.

Please credit the source when reposting: 博客园 刺猬的温驯

Original post: http://www.cnblogs.com/chenying99/archive/2013/04/21/3033520.html
