Heritrix3 source code analysis (outbound and inbound)

Heritrix3 differs substantially from Heritrix 1.14. Heritrix3 defines blocking FIFO queues, which makes the frontier a typical producer-consumer model.
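To make the producer-consumer idea concrete, here is a minimal sketch that is not taken from Heritrix. It uses a bounded, fair ArrayBlockingQueue shared by one producer thread and one consumer thread. The class name BlockingFifoSketch, the method run, and the example URIs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical demo class, not part of Heritrix.
class BlockingFifoSketch {

    /**
     * Runs one producer and one consumer over a bounded queue and
     * returns the items in the order the consumer received them.
     */
    static List<String> run(List<String> uris, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity, true);
        List<String> consumed = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (String uri : uris) {
                    queue.put(uri); // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < uris.size(); i++) {
                    consumed.add(queue.take()); // blocks while the queue is empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join(); // join makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return consumed;
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList(
                "http://a.example/", "http://b.example/", "http://c.example/");
        // The queue holds at most 2 items, so the producer has to wait
        // for the consumer; the order is still first-in, first-out.
        System.out.println(run(uris, 2));
    }
}
```

With a single producer and a single consumer the items come out in exactly the order they went in, even though the queue is smaller than the input.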


AbstractFrontier defines two containers, inbound and outbound.

The inbound container holds the CrawlURIs that are about to be handled. Links that Heritrix has crawled, and links waiting to be processed, all go into inbound first.

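The outbound container holds the CrawlURIs that are ready to be processed right now. The Frontier, acting as the URI factory, takes URIs from outbound and hands them out for processing.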


    /** inbound updates: URIs to be scheduled, finished; requested state changes */
    transient protected ArrayBlockingQueue<InEvent> inbound;

    /** outbound URIs */
    transient protected ArrayBlockingQueue<CrawlURI> outbound;



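AbstractFrontier instantiates inbound and outbound when it starts.

The capacity of inbound is the capacity of outbound multiplied by getInboundQueueMultiple(), which makes it ten times the size of outbound when that multiple is 10:
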
    public void start() {
        if(isRunning()) {
            return; 
        }
        
        if (getRecoveryLogEnabled()) try {
            initJournal(loggerModule.getPath().getFile().getAbsolutePath());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        
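        // inbound is sized as a multiple of the outbound capacity
        this.outboundCapacity = getOutboundQueueCapacity();
        this.inboundCapacity = outboundCapacity * 
            getInboundQueueMultiple();
        // the second constructor argument (true) requests fair, FIFO
        // ordering for threads blocked on the queue
        outbound = new ArrayBlockingQueue<CrawlURI>(outboundCapacity, true);
        inbound = new ArrayBlockingQueue<InEvent>(inboundCapacity, true);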
        pause();
        startManagerThread();
    }
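The sizing rule in start() can be tried out in isolation. The sketch below is an illustration only: the class QueueSizingSketch, its helper methods, and the values 5 and 10 are assumptions chosen for the demo, not Heritrix defaults. It shows that a bounded ArrayBlockingQueue refuses further offer() calls once it reaches its capacity.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical demo class, not part of Heritrix.
class QueueSizingSketch {

    // Illustrative values only; the real ones come from the crawl configuration.
    static final int OUTBOUND_CAPACITY = 5;
    static final int INBOUND_MULTIPLE = 10;

    /** Same arithmetic as in start(): inbound = outbound * multiple. */
    static int inboundCapacity(int outboundCapacity, int multiple) {
        return outboundCapacity * multiple;
    }

    /** Offers items until the bounded queue refuses one; returns how many fit. */
    static int fillUntilFull(ArrayBlockingQueue<Integer> queue) {
        int accepted = 0;
        // offer() returns false instead of blocking when the queue is full
        while (queue.offer(accepted)) {
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        ArrayBlockingQueue<Integer> outbound =
                new ArrayBlockingQueue<>(OUTBOUND_CAPACITY, true);
        ArrayBlockingQueue<Integer> inbound = new ArrayBlockingQueue<>(
                inboundCapacity(OUTBOUND_CAPACITY, INBOUND_MULTIPLE), true);

        System.out.println(fillUntilFull(outbound));      // prints 5
        System.out.println(fillUntilFull(inbound));       // prints 50
        System.out.println(inbound.remainingCapacity());  // prints 0
    }
}
```

A producer that calls put() instead of offer() would block at this point until a consumer removes an element, which is what gives the bounded queue its back-pressure.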



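inbound holds individual events (InEvent) that are waiting to be processed.
drainInbound() takes these events from inbound and processes them: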

    /**
     * Drain the inbound queue of update events, or at the very least
     * wait until some additional delayed-queue URI becomes available. 
     * 
     * @throws InterruptedException
     */
    protected void drainInbound() throws InterruptedException {
        int batch = inbound.size();
        for(int i = 0; i < batch; i++) {
            inbound.take().process();
        }
        if(batch==0) {
            // always do at least one timed try
            InEvent toProcess = inbound.poll(getMaxInWait(),
                    TimeUnit.MILLISECONDS);
            if (toProcess != null) {
                toProcess.process();
            }
        }
    }
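The same drain pattern can be reproduced outside Heritrix. In the sketch below, the Event interface is a hypothetical stand-in for InEvent (only process() is modelled), and DrainSketch, drain, and the logged strings are invented for the demo. My reading, which is an assumption, is that the size snapshot keeps one pass bounded: events that arrive while the pass is running are left for the next pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Heritrix.
class DrainSketch {

    /** Hypothetical stand-in for Heritrix's InEvent. */
    interface Event {
        void process();
    }

    /**
     * Processes every event present at entry. If there was none, waits up
     * to maxWaitMs for a single event. Returns the number processed.
     */
    static int drain(BlockingQueue<Event> inbound, long maxWaitMs) {
        int processed = 0;
        try {
            int batch = inbound.size(); // snapshot of the current backlog
            for (int i = 0; i < batch; i++) {
                inbound.take().process();
                processed++;
            }
            if (batch == 0) {
                // nothing was queued: do one timed wait instead of spinning
                Event next = inbound.poll(maxWaitMs, TimeUnit.MILLISECONDS);
                if (next != null) {
                    next.process();
                    processed++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        BlockingQueue<Event> inbound = new ArrayBlockingQueue<>(10, true);
        List<String> log = new ArrayList<>();
        inbound.add(() -> log.add("schedule http://a.example/"));
        inbound.add(() -> log.add("finished http://b.example/"));

        System.out.println(drain(inbound, 50)); // prints 2
        System.out.println(drain(inbound, 50)); // prints 0 after the timed wait
        System.out.println(log);
    }
}
```

The timed poll() in the empty case is what lets the calling loop wake up periodically even when no events arrive.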


To be continued...
