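Heritrix 3.1.0 Source Code Analysis (13)

Next we look at the void finished(CrawlURI curi) method of BdbFrontier, which wraps up the processing of a CrawlURI object.

BdbFrontier does not define this method itself; it lives in AbstractFrontier, the parent of BdbFrontier's parent class.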

org.archive.crawler.frontier.BdbFrontier

      org.archive.crawler.frontier.AbstractFrontier

    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     *  (non-Javadoc)
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    public void finished(CrawlURI curi) {
        try {
            KeyedProperties.loadOverridesFrom(curi);
            processFinish(curi);
        } finally {
            KeyedProperties.clearOverridesFrom(curi);
        }
    }
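
To make the call chain concrete, here is a minimal sketch of the pattern finished() follows: a base class wraps an abstract step in try/finally, so the per-URI overrides are cleared even when that step throws. BaseFrontier, QueueFrontier, StoreFrontier and the trace list are hypothetical stand-ins that only mimic the roles of AbstractFrontier, WorkQueueFrontier and BdbFrontier; none of this is Heritrix code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT Heritrix classes.
final class FinishedSketch {

    /** Records calls so their order can be inspected. */
    static final List<String> trace = new ArrayList<>();

    /** Plays the role of AbstractFrontier: owns finished(), defers the real work. */
    abstract static class BaseFrontier {
        public final void finished(String uri) {
            try {
                trace.add("loadOverrides:" + uri);   // stands in for KeyedProperties.loadOverridesFrom
                processFinish(uri);
            } finally {
                trace.add("clearOverrides:" + uri);  // runs even if processFinish throws
            }
        }

        protected abstract void processFinish(String uri);
    }

    /** Plays the role of WorkQueueFrontier: supplies processFinish(). */
    static class QueueFrontier extends BaseFrontier {
        @Override
        protected void processFinish(String uri) {
            if (uri.isEmpty()) {
                throw new IllegalArgumentException("empty URI");
            }
            trace.add("processFinish:" + uri);
        }
    }

    /** Plays the role of BdbFrontier: inherits both methods unchanged. */
    static class StoreFrontier extends QueueFrontier { }

    public static void main(String[] args) {
        BaseFrontier frontier = new StoreFrontier();
        frontier.finished("http://example.com/");
        try {
            frontier.finished("");
        } catch (IllegalArgumentException expected) {
            // by this point the finally block has already cleared the overrides
        }
        trace.forEach(System.out::println);
    }
}
```

Calling finished() on the most derived class runs the grandparent's wrapper and the parent's processFinish(), and the last trace entry is a clearOverrides record even for the call that failed.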

finished() then calls void processFinish(CrawlURI curi), which is defined in WorkQueueFrontier, the parent class of BdbFrontier.

org.archive.crawler.frontier.BdbFrontier

                org.archive.crawler.frontier.WorkQueueFrontier

    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     * TODO: make as many decisions about what happens to the CrawlURI
     * (success, failure, retry) and queue (retire, snooze, ready) as 
     * possible elsewhere, such as in DispositionProcessor. Then, break
     * this into simple branches or focused methods for each case. 
     *  
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    protected void processFinish(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;

        long now = System.currentTimeMillis();

        curi.incrementFetchAttempts();
        logNonfatalErrors(curi);

        WorkQueue wq = (WorkQueue) curi.getHolder();
        // always refresh budgeting values from current curi
        // (whose overlay settings should be active here)
        wq.setSessionBudget(getBalanceReplenishAmount());
        wq.setTotalBudget(getQueueTotalBudget());

        assert (wq.peek(this) == curi) : "unexpected peek " + wq;

        int holderCost = curi.getHolderCost();

        if (needsReenqueuing(curi)) {
            // codes/errors which don't consume the URI, leaving it atop queue
            if(curi.getFetchStatus()!=S_DEFERRED) {
                wq.expend(holderCost); // all retries but DEFERRED cost
            }
            long delay_ms = retryDelayFor(curi) * 1000;
            curi.processingCleanup(); // lose state that shouldn't burden retry
            wq.unpeek(curi);
            wq.update(this, curi); // rewrite any changes
            handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DEFERRED_FOR_RETRY));
            doJournalReenqueued(curi);
            wq.makeDirty();
            return; // no further dequeueing, logging, rescheduling to occur
        }

        // Curi will definitely be disposed of without retry, so remove from queue
        wq.dequeue(this,curi);
        decrementQueuedCount(1);
        largestQueues.update(wq.getClassKey(), wq.getCount());
        log(curi);

        if (curi.isSuccess()) {
            // codes deemed 'success' 
            incrementSucceededFetchCount();
            totalProcessedBytes.addAndGet(curi.getRecordedSize());
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,SUCCEEDED));
            doJournalFinishedSuccess(curi);

        } else if (isDisregarded(curi)) {
            // codes meaning 'undo' (even though URI was enqueued, 
            // we now want to disregard it from normal success/failure tallies)
            // (eg robots-excluded, operator-changed-scope, etc)
            incrementDisregardedUriCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DISREGARDED));
            holderCost = 0; // no charge for disregarded URIs
            // TODO: consider reinstating forget-URI capability, so URI could be
            // re-enqueued if discovered again
            doJournalDisregarded(curi);

        } else {
            // codes meaning 'failure'
            incrementFailedFetchCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,FAILED));
            // if exception, also send to crawlErrors
            if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
                Object[] array = { curi };
                loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                        .toString(), array);
            }
            // charge queue any extra error penalty
            wq.noteError(getErrorPenaltyAmount());
            doJournalFinishedFailure(curi);

        }

        wq.expend(holderCost); // successes & failures charge cost to queue

        long delay_ms = curi.getPolitenessDelay();
        handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
        wq.makeDirty();

        if(curi.getRescheduleTime()>0) {
            // marked up for forced-revisit at a set time
            curi.processingCleanup();
            curi.resetForRescheduling(); 
            futureUris.put(curi.getRescheduleTime(),curi);
            futureUriCount.incrementAndGet(); 
        } else {
            curi.stripToMinimal();
            curi.processingCleanup();
        }
    }
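
The branch order in processFinish is easier to follow once the logging, journaling and event publishing are stripped away. The sketch below keeps only the bookkeeping: a retry leaves the URI on the queue and is charged unless the status is a deferral, while a final outcome removes the URI and charges the queue according to success, disregard or failure. Result, QueueState and settle are hypothetical names invented for this illustration, and the way the error penalty is recorded is a simplification, not what WorkQueue.noteError actually does.

```java
// Hypothetical, simplified model of the branches in processFinish; NOT Heritrix code.
final class DispositionSketch {

    enum Disposition { DEFERRED_FOR_RETRY, SUCCEEDED, DISREGARDED, FAILED }

    /** The few facts about a finished URI that drive the branch choice. */
    static final class Result {
        final boolean needsReenqueuing;
        final boolean deferred;      // mirrors the S_DEFERRED status check
        final boolean success;
        final boolean disregarded;
        final int holderCost;

        Result(boolean needsReenqueuing, boolean deferred, boolean success,
               boolean disregarded, int holderCost) {
            this.needsReenqueuing = needsReenqueuing;
            this.deferred = deferred;
            this.success = success;
            this.disregarded = disregarded;
            this.holderCost = holderCost;
        }
    }

    /** Minimal stand-in for the budget bookkeeping of a work queue. */
    static final class QueueState {
        int count;          // URIs still held by the queue
        int expended;       // budget charged so far
        int errorPenalty;   // extra penalties recorded for failures (simplified)

        QueueState(int count) {
            this.count = count;
        }
    }

    static Disposition settle(Result r, QueueState q, int errorPenaltyAmount) {
        if (r.needsReenqueuing) {
            if (!r.deferred) {
                q.expended += r.holderCost;   // every retry except a deferral is charged
            }
            return Disposition.DEFERRED_FOR_RETRY;   // URI stays at the head of the queue
        }

        q.count--;                    // the URI leaves the queue for good
        int cost = r.holderCost;
        Disposition disposition;
        if (r.success) {
            disposition = Disposition.SUCCEEDED;
        } else if (r.disregarded) {
            cost = 0;                 // no charge for disregarded URIs
            disposition = Disposition.DISREGARDED;
        } else {
            q.errorPenalty += errorPenaltyAmount;
            disposition = Disposition.FAILED;
        }
        q.expended += cost;           // successes and failures charge the queue
        return disposition;
    }

    public static void main(String[] args) {
        QueueState q = new QueueState(2);
        System.out.println(settle(new Result(true, false, false, false, 1), q, 3));
        System.out.println(settle(new Result(false, false, true, false, 1), q, 3));
        System.out.println(settle(new Result(false, false, false, false, 1), q, 3));
        System.out.println("count=" + q.count + " expended=" + q.expended
                + " errorPenalty=" + q.errorPenalty);
    }
}
```

Note that the retry branch returns early, which is why the real method never reaches dequeue, the success/failure tallies or the rescheduling step for a URI that will be tried again.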

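processFinish first reads the holder property of the CrawlURI curi. This is the BdbWorkQueue object that matches the CrawlURI's classkey; it touches on work-queue scheduling in Heritrix 3.1.0, which will be analyzed in a later article.

If the URI does not need to be re-enqueued for a retry, processFinish then calls the BdbWorkQueue object's synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method, which is inherited from WorkQueue.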

org.archive.crawler.frontier.BdbWorkQueue

      org.archive.crawler.frontier.WorkQueue

    /**
     * Remove the peekItem from the queue and adjusts the count.
     * 
     * @param frontier  Work queues manager.
     */
    protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
        try {
            deleteItem(frontier, peekItem);
        } catch (IOException e) {
            //FIXME better exception handling
            e.printStackTrace();
            throw new RuntimeException(e);
        }
        unpeek(expected);
        count--;
        lastDequeueTime = System.currentTimeMillis();
    }

org.archive.crawler.frontier.BdbWorkQueue

    protected void deleteItem(final WorkQueueFrontier frontier,
            final CrawlURI peekItem) throws IOException {
        try {
            final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
            queues.delete(peekItem);
        } catch (DatabaseException e) {
            throw new IOException(e);
        }
    }

Finally, the void delete(CrawlURI item) method of the BdbMultipleWorkQueues object is called. It was covered in an earlier article in this series, so it is not repeated here.
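
The peek / unpeek / dequeue bookkeeping can be sketched with an in-memory queue. This is only an illustration under stated assumptions: SimpleWorkQueue is a hypothetical class, an ArrayDeque stands in for the BDB-backed store behind deleteItem, and the explicit "unexpected peek item" check is my own simplification rather than the behavior of the real unpeek.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical in-memory analogue of the peek/unpeek/dequeue contract; NOT Heritrix code.
final class PeekDequeueSketch {

    static final class SimpleWorkQueue {
        private final Deque<String> store = new ArrayDeque<>(); // stands in for the BDB store
        private String peekItem;        // item handed out but not yet resolved
        private long count;
        private long lastDequeueTime;

        synchronized void enqueue(String uri) {
            store.addLast(uri);
            count++;
        }

        /** Returns the head item and remembers it until unpeek or dequeue. */
        synchronized String peek() {
            if (peekItem == null) {
                peekItem = store.peekFirst();
            }
            return peekItem;
        }

        /** Forgets the remembered item without removing it (the retry path). */
        synchronized void unpeek(String expected) {
            if (peekItem == null || !peekItem.equals(expected)) {
                throw new IllegalStateException("unexpected peek item: " + expected);
            }
            peekItem = null;
        }

        /** Removes the remembered item for good and adjusts the count. */
        synchronized void dequeue(String expected) {
            if (peekItem == null || !peekItem.equals(expected)) {
                throw new IllegalStateException("unexpected peek item: " + expected);
            }
            store.removeFirstOccurrence(peekItem);  // plays the role of deleteItem
            unpeek(expected);
            count--;
            lastDequeueTime = System.currentTimeMillis();
        }

        synchronized long count() {
            return count;
        }

        synchronized long lastDequeueTime() {
            return lastDequeueTime;
        }
    }

    public static void main(String[] args) {
        SimpleWorkQueue q = new SimpleWorkQueue();
        q.enqueue("http://example.com/a");
        q.enqueue("http://example.com/b");

        String head = q.peek();
        q.unpeek(head);                       // retry path: the item stays at the head
        System.out.println(q.peek().equals(head));

        q.dequeue(head);                      // final path: the item is removed
        System.out.println(q.count() + " left, next is " + q.peek());
    }
}
```

The sketch shows why the retry branch of processFinish only calls unpeek: the URI remains the head of its queue and is handed out again on the next peek, whereas dequeue deletes it from the store first and then clears the remembered item.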

---------------------------------------------------------------------------

This Heritrix 3.1.0 source code analysis series is my own original work.

Please credit the source when reposting: 博客园, 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html
