Heritrix 3.1.0 源码解析(十五)

本文分析Heritrix3.1.0系统里面的WorkQueue队列(具体是BdbWorkQueue)的调度机制,这部分是系统里面比较复杂的,我只能是尝试分析(本文可能会修改)

我在Heritrix 3.1.0 源码解析(六)一文中涉及BdbFrontier对象的初始化,现在回顾一下

我们看到在WorkQueueFrontier类中的初始化方法void start()里面进一步调用了void initInternalQueues()方法

而void initInternalQueues()方法 里面进一步调用子类BdbFrontier的void initOtherQueues()方法与void initAllQueues()方法(父类为抽象方法)

@Override

    protected void initOtherQueues() throws DatabaseException {

        boolean recycle = (recoveryCheckpoint != null);

        

        // tiny risk of OutOfMemoryError: if giant number of snoozed

        // queues all wake-to-ready at once

        readyClassQueues = new LinkedBlockingQueue<String>();



        inactiveQueuesByPrecedence = new ConcurrentSkipListMap<Integer,Queue<String>>();

        

        retiredQueues = bdb.getStoredQueue("retiredQueues", String.class, recycle);



        // primary snoozed queues

        snoozedClassQueues = new DelayQueue<DelayedWorkQueue>();

        // just in case: overflow for extreme situations

        snoozedOverflow = bdb.getStoredMap(

                "snoozedOverflow", Long.class, DelayedWorkQueue.class, true, false);

            

        this.futureUris = bdb.getStoredMap(

                "futureUris", Long.class, CrawlURI.class, true, recoveryCheckpoint!=null);

        

        // initialize master map in which other queues live

        this.pendingUris = createMultipleWorkQueues();

    }

上面方法主要是初始化队列,这里解释一下:

readyClassQueues存储着已经准备好被爬取的队列的key;[Queue类型]

inactiveQueuesByPrecedence用Map类型存储着优先级存与非活动状态的队列(队列存储着key);[Map类型]

retiredQueues存储着不再激活的url队列的key;[Queue类型]

snoozedClassQueues存储着所有休眠的url队列的key,它们都按唤醒时间排序;[Queue类型]

snoozedOverflow用Map类型存储着休眠到期时间与过载的休眠状态的队列(队列存储着key)[Map类型]

futureUris用Map类型存储着调度时间与CrawlURI对象[Map类型]

这里我们需要注意的是snoozedClassQueues队列的类型DelayQueue<DelayedWorkQueue>,用于放置实现了Delayed接口的对象,其中的对象只能在其到期时才能从队列中取走。这种队列是有序的,即队头对象的延迟到期时间最长

DelayedWorkQueue类的源码如下

/**

 * A named WorkQueue wrapped with a wake time, perhaps referenced only

 * by name. 

 * 

 * @contributor gojomo

 */

class DelayedWorkQueue implements Delayed, Serializable {

    private static final long serialVersionUID = 1L;



    public String classKey;

    public long wakeTime;

    

    /**

     * Reference to the WorkQueue, perhaps saving a deserialization

     * from allQueues.

     */

    protected transient WorkQueue workQueue;

    

    public DelayedWorkQueue(WorkQueue queue) {

        this.classKey = queue.getClassKey();

        this.wakeTime = queue.getWakeTime();

        this.workQueue = queue;

    }

    

    // TODO: consider if this should be method on WorkQueueFrontier

    public WorkQueue getWorkQueue(WorkQueueFrontier wqf) {

        if (workQueue == null) {

            // This is a recently deserialized DelayedWorkQueue instance

            WorkQueue result = wqf.getQueueFor(classKey);

            this.workQueue = result;

        }

        return workQueue;

    }



    public long getDelay(TimeUnit unit) {

        return unit.convert(

                wakeTime - System.currentTimeMillis(),

                TimeUnit.MILLISECONDS);

    }

    

    public String getClassKey() {

        return classKey;

    }

    

    public long getWakeTime() {

        return wakeTime;

    }

    

    public void setWakeTime(long time) {

        this.wakeTime = time;

    }

    

    public int compareTo(Delayed obj) {

        if (this == obj) {

            return 0; // for exact identity only

        }

        DelayedWorkQueue other = (DelayedWorkQueue) obj;

        if (wakeTime > other.getWakeTime()) {

            return 1;

        }

        if (wakeTime < other.getWakeTime()) {

            return -1;

        }

        // at this point, the ordering is arbitrary, but still

        // must be consistent/stable over time

        return this.classKey.compareTo(other.getClassKey());        

    }

    

}

该对象必须实现long getDelay(TimeUnit unit) 方法和int compareTo(Delayed obj)方法,用于队列的排序(我们可以看到,DelayedWorkQueue对象是对WorkQueue queue对象的封装,里面按WorkQueue queue设置的延迟时间排序)

@Override

    protected void initAllQueues() throws DatabaseException {

        boolean isRecovery = (recoveryCheckpoint != null);

        this.allQueues = bdb.getObjectCache("allqueues", isRecovery, WorkQueue.class, BdbWorkQueue.class);

        //后面部分的代码略

    }

上面方法主要是初始化ObjectIdentityCache<WorkQueue> allQueues变量,可以理解为BdbWorkQueue队列工厂

接下来分析与BdbFrontier对象void schedule(CrawlURI curi)方法相关的方法

/**

     * Send a CrawlURI to the appropriate subqueue.

     * 

     * @param curi

     */

    protected void sendToQueue(CrawlURI curi) {

//        assert Thread.currentThread() == managerThread;

        

        WorkQueue wq = getQueueFor(curi.getClassKey());

        synchronized(wq) {

            int originalPrecedence = wq.getPrecedence();

            wq.enqueue(this, curi);

            // always take budgeting values from current curi

            // (whose overlay settings should be active here)

            wq.setSessionBudget(getBalanceReplenishAmount());

            wq.setTotalBudget(getQueueTotalBudget());

            

            if(!wq.isRetired()) {

                incrementQueuedUriCount();

                int currentPrecedence = wq.getPrecedence();

                if(!wq.isManaged() || currentPrecedence < originalPrecedence) {

                    // queue newly filled or bumped up in precedence; ensure enqueuing

                    // at precedence level (perhaps duplicate; if so that's handled elsewhere)

                    deactivateQueue(wq);

                }

            }

        }

        // Update recovery log.

        doJournalAdded(curi);

        wq.makeDirty();

        largestQueues.update(wq.getClassKey(), wq.getCount());

    }

首先是根据classkey从ObjectIdentityCache<WorkQueue> allQueues里面获取BdbWorkQueue队列工厂,WorkQueue getQueueFor(final String classKey) 方法在BdbFrontier类里面

/**

     * Return the work queue for the given classKey, or null

     * if no such queue exists.

     * 

     * @param classKey key to look for

     * @return the found WorkQueue

     */

    protected WorkQueue getQueueFor(final String classKey) {      

        WorkQueue wq = allQueues.getOrUse(

                classKey,

                new Supplier<WorkQueue>() {

                    public BdbWorkQueue get() {

                        String qKey = new String(classKey); // ensure private minimal key

                        BdbWorkQueue q = new BdbWorkQueue(qKey, BdbFrontier.this);

                        q.setTotalBudget(getQueueTotalBudget()); 

                        getQueuePrecedencePolicy().queueCreated(q);

                        return q;

                    }});

        return wq;

    }

在初始化对应classkey的BdbWorkQueue对象同时,设置long totalBudget成员和 PrecedenceProvider precedenceProvider成员属性值

再接着void sendToQueue(CrawlURI curi)方法分析,后面部分是先锁定WorkQueue wq对象(防止多线程同时写入),写入BDB数据库,设置属性int sessionBudget  long totalBudget

如果队列不是移除的队列,再判断该队列是否在生命周期内,如果不在生命周期或者活动队列的数量超过设定的阈值(currentPrecedence < originalPrecedence),将指定队列归入非活动状态队列 (重置highestPrecedenceWaiting值 非活动状态队列里面的precedence最小值)

    /**

     * Put the given queue on the inactiveQueues queue

     * @param wq

     */

    protected void deactivateQueue(WorkQueue wq) {

        int precedence = wq.getPrecedence();



        synchronized(wq) {

            wq.noteDeactivated();//active = false; 活动状态   isManaged = true; 被管理

            inProcessQueues.remove(wq);//从进程中的队列移除该队列

            if(wq.getCount()==0) {

                System.err.println("deactivate empty queue?");

            }



            synchronized (getInactiveQueuesByPrecedence()) {

                getInactiveQueuesForPrecedence(precedence).add(wq.getClassKey());

                if(wq.getPrecedence() < highestPrecedenceWaiting ) {

                    highestPrecedenceWaiting = wq.getPrecedence();

                }

            }



            if(logger.isLoggable(Level.FINE)) {

                logger.log(Level.FINE,

                        "queue deactivated to p" + precedence 

                        + ": " + wq.getClassKey());

            }

        }

    }

getInactiveQueuesByPrecedence()方法是获取用Map类型存储着优先级存与非活动状态的队列(队列存储着key);[Map类型]

getInactiveQueuesForPrecedence(precedence)方法是按指定优先级获取非活动状态的队列(如果没有则创建),然后将该队列的classkey添加到该非活动状态的队列里面 

 /**

     * 按指定优先级获取非活动状态的队列

     * Get the queue of inactive uri-queue names at the given precedence. 

     * 

     * @param precedence

     * @return queue of inacti

     */

    protected Queue<String> getInactiveQueuesForPrecedence(int precedence) {

        //优先级 /非活动状态的队列的map容器

        Map<Integer,Queue<String>> inactiveQueuesByPrecedence = 

            getInactiveQueuesByPrecedence();

        Queue<String> candidate = inactiveQueuesByPrecedence.get(precedence);

        if(candidate==null) {

            candidate = createInactiveQueueForPrecedence(precedence);

            inactiveQueuesByPrecedence.put(precedence,candidate);

        }

        return candidate;

    }

相关方法在其子类BdbFrontier里面

/* (non-Javadoc)

     * 创建非活动状态的队列

     * @see org.archive.crawler.frontier.WorkQueueFrontier#createInactiveQueueForPrecedence(int)

     */

    @Override

    Queue<String> createInactiveQueueForPrecedence(int precedence) {

        return createInactiveQueueForPrecedence(precedence, false);

    }

    

    /** 

     * inactiveQueues存储着所有非活动状态的url队列的key;

     * Optionally reuse prior data, for use when resuming from a checkpoint

     */

    Queue<String> createInactiveQueueForPrecedence(int precedence, boolean usePriorData) {

        return bdb.getStoredQueue("inactiveQueues-"+precedence, String.class, usePriorData);

    }

---------------------------------------------------------------------------

本系列Heritrix 3.1.0 源码解析系本人原创

转载请注明出处 博客园 刺猬的温驯

本文链接 http://www.cnblogs.com/chenying99/archive/2013/04/21/3033437.html

你可能感兴趣的:(Heritrix)