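Frontier is one of the most central components of Heritrix, and also one of the most complex. Its main job is to supply URLs to the threads that process links, and to handle the follow-up scheduling once a link has been processed. For efficiency it uses Berkeley DB internally. This section takes a detailed look at how it works inside.
The official Heritrix documentation includes a sample Frontier. It is very simple, yet it explains the basic principle behind a Frontier implementation. It is not discussed here; interested readers can consult that documentation. Its three core methods do need to be mentioned, however:
(1) next(int timeout): supplies a link to a processing thread. All of Heritrix's processing threads (ToeThread) obtain their links by calling this method.
(2) schedule(CandidateURI caURI): schedules a link that is waiting to be processed.
(3) finished(CrawlURI cURI): marks a processed link as finished.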
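To make the division of labour between these three methods concrete, here is a minimal toy sketch. It is not the sample from the Heritrix documentation and not Heritrix code: the class ToyFrontier, its use of plain String URLs, and the helper methods pendingCount() and finishedCount() are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical toy frontier for illustration only; not Heritrix's implementation.
class ToyFrontier {
    private final Queue<String> pending = new ArrayDeque<>(); // URLs waiting to be fetched
    private final Set<String> seen = new HashSet<>();         // every URL ever scheduled
    private final Set<String> inProcess = new HashSet<>();    // handed out, not yet finished
    private int finishedCount = 0;

    // schedule: accept a URL only the first time it is seen.
    synchronized void schedule(String url) {
        if (seen.add(url)) {
            pending.add(url);
            notifyAll(); // wake any worker blocked in next()
        }
    }

    // next: hand one URL to a worker, waiting up to timeoutMillis.
    // Returns null on timeout or if the waiting thread is interrupted.
    synchronized String next(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (pending.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return null;
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        String url = pending.remove();
        inProcess.add(url);
        return url;
    }

    // finished: record that a worker is done with a URL it obtained from next().
    synchronized void finished(String url) {
        if (inProcess.remove(url)) {
            finishedCount++;
        }
    }

    synchronized int pendingCount() { return pending.size(); }
    synchronized int finishedCount() { return finishedCount; }
}

class ToyFrontierDemo {
    public static void main(String[] args) {
        ToyFrontier frontier = new ToyFrontier();
        frontier.schedule("http://example.com/");
        frontier.schedule("http://example.com/"); // duplicate, ignored
        String url = frontier.next(100);
        System.out.println("fetching " + url);
        frontier.finished(url);
        System.out.println("finished so far: " + frontier.finishedCount());
    }
}
```

A worker thread would simply loop over next(), process the link, call schedule() for every link it discovers, and then call finished(). The real Frontier adds per-host queues, persistence and politeness on top of this basic cycle.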
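The overall structure is as follows:
(1) BdbFrontier, the link factory
BdbFrontier extends WorkQueueFrontier and is the only link factory in Heritrix of practical significance. Its initQueue() method initializes the pending queue.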
- package org.archive.crawler.frontier;
- public class BdbFrontier extends WorkQueueFrontier implements Serializable
- {
- /** All links waiting to be crawled */
- protected transient BdbMultipleWorkQueues pendingUris;
- // Initializes pendingUris; this method is abstract in the parent class
- protected void initQueue() throws IOException {
- try {
- this.pendingUris = createMultipleWorkQueues();
- } catch(DatabaseException e) {
- throw (IOException)new IOException(e.getMessage()).initCause(e);
- }
- }
- private BdbMultipleWorkQueues createMultipleWorkQueues()
- throws DatabaseException {
- return new BdbMultipleWorkQueues(this.controller.getBdbEnvironment(),
- this.controller.getBdbEnvironment().getClassCatalog(),
- this.controller.isCheckpointRecover());
- }
- protected BdbMultipleWorkQueues getWorkQueues() {
- return pendingUris;
- }
- ..............................
- }
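(2) next(): supplies a link to a processing thread. All of Heritrix's processing threads (ToeThread) obtain their links by calling this method.
This is the next() method of WorkQueueFrontier.
*Note: WorkQueueFrontier's next() actually calls WorkQueue's peek(). WorkQueue's peek() is in turn implemented by BdbWorkQueue's peekItem(), which calls BdbFrontier's getWorkQueues() to obtain the BdbMultipleWorkQueues instance, that is, the pending queue. It then calls BdbMultipleWorkQueues' get() method, which calls getNextNearestItem() to fetch a link from the pending queue. Back in next(), the queue that supplied the link is added to the in-process queues (inProcessQueues).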
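The delegation chain in the note is easier to follow with the layers stripped down. The sketch below is only an approximation under stated assumptions: ChainFrontier, ChainWorkQueue and ChainMultiQueues are hypothetical stand-ins for BdbFrontier, WorkQueue/BdbWorkQueue (merged into one class) and BdbMultipleWorkQueues, links are plain strings, and an in-memory map replaces Berkeley DB.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Stand-in for BdbMultipleWorkQueues: holds the pending links of every queue.
class ChainMultiQueues {
    private final Map<String, Deque<String>> byQueue = new HashMap<>();

    void put(String queueKey, String url) {
        byQueue.computeIfAbsent(queueKey, k -> new ArrayDeque<>()).addLast(url);
    }

    // Returns the head link of the named queue without removing it, or null.
    String get(String queueKey) {
        Deque<String> q = byQueue.get(queueKey);
        return (q == null) ? null : q.peekFirst();
    }
}

// Stand-in for BdbFrontier: owns the shared store and exposes it.
class ChainFrontier {
    private final ChainMultiQueues pendingUris = new ChainMultiQueues();

    ChainMultiQueues getWorkQueues() {
        return pendingUris;
    }
}

// Stand-in for WorkQueue and BdbWorkQueue, merged for brevity.
class ChainWorkQueue {
    private final String classKey; // all links in this queue share this key
    private String peeked;         // cached head, so repeated peeks are cheap

    ChainWorkQueue(String classKey) {
        this.classKey = classKey;
    }

    // The method next() calls: return the head link, loading it on first use.
    String peek(ChainFrontier frontier) {
        if (peeked == null) {
            peeked = peekItem(frontier);
        }
        return peeked;
    }

    // The storage-specific step: ask the frontier's shared store for the head.
    private String peekItem(ChainFrontier frontier) {
        return frontier.getWorkQueues().get(classKey);
    }
}

class ChainDemo {
    public static void main(String[] args) {
        ChainFrontier frontier = new ChainFrontier();
        frontier.getWorkQueues().put("example.com", "http://example.com/index.html");
        ChainWorkQueue queue = new ChainWorkQueue("example.com");
        System.out.println(queue.peek(frontier));
    }
}
```

Note that the queue object itself stores no links in this sketch; it only knows its key and asks the shared store for the head. The actual WorkQueueFrontier.next() listing follows.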
- public CrawlURI next()
- throws InterruptedException, EndedException {
- while (true) { // loop left only by explicit return or exception
- long now = System.currentTimeMillis();
- // Do common checks for pause, terminate, bandwidth-hold
- preNext(now);
- synchronized(readyClassQueues) {
- int activationsNeeded = targetSizeForReadyQueues() - readyClassQueues.size();
- while(activationsNeeded > 0 && !inactiveQueues.isEmpty()) {
- activateInactiveQueue();
- activationsNeeded--;
- }
- }
- WorkQueue readyQ = null;
- Object key = readyClassQueues.poll(DEFAULT_WAIT,TimeUnit.MILLISECONDS);
- if (key != null) {
- readyQ = (WorkQueue)this.allQueues.get(key);
- }
- if (readyQ != null) {
- while(true) { // loop left by explicit return or break on empty
- CrawlURI curi = null;
- synchronized(readyQ) {
- /** Fetch one URL; it ultimately comes from
- * pendingUris in the subclass BdbFrontier
- */
- curi = readyQ.peek(this);
- if (curi != null) {
- // check if curi belongs in different queue
- String currentQueueKey = getClassKey(curi);
- if (currentQueueKey.equals(curi.getClassKey())) {
- // curi was in right queue, emit
- noteAboutToEmit(curi, readyQ);
- // add the queue to the in-process queues
- inProcessQueues.add(readyQ);
- return curi; // hand the link to the caller
- }
- // URI's assigned queue has changed since it
- // was queued (eg because its IP has become
- // known). Requeue to new queue.
- curi.setClassKey(currentQueueKey);
- readyQ.dequeue(this); // remove from the queue
- decrementQueuedCount(1);
- curi.setHolderKey(null);
- // curi will be requeued to true queue after lock
- // on readyQ is released, to prevent deadlock
- } else {
- // readyQ is empty and ready: it's exhausted
- // release held status, allowing any subsequent
- // enqueues to again put queue in ready
- readyQ.clearHeld();
- break;
- }
- }
- if(curi!=null) {
- // complete the requeuing begun earlier
- sendToQueue(curi);
- }
- }
- } else {
- // ReadyQ key wasn't in all queues: unexpected
- if (key != null) {
- logger.severe("Key "+ key +
- " in readyClassQueues but not allQueues");
- }
- }
- if(shouldTerminate) {
- // skip subsequent steps if already on last legs
- throw new EndedException("shouldTerminate is true");
- }
- if(inProcessQueues.size()==0) {
- // Nothing was ready or in progress or imminent to wake; ensure
- // any piled-up pending-scheduled URIs are considered
- this.alreadyIncluded.requestFlush();
- }
- }
- }
- // Adds the URL to the queue of links waiting to be processed
- public void schedule(CandidateURI caUri) {
- // Canonicalization may set forceFetch flag. See
- // #canonicalization(CandidateURI) javadoc for circumstance.
- String canon = canonicalize(caUri);
- if (caUri.forceFetch()) {
- alreadyIncluded.addForce(canon, caUri);
- } else {
- alreadyIncluded.add(canon, caUri);
- }
- }
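(3) schedule(CandidateURI caURI): puts caURI into the pending queue. The pending queue is in fact managed by BdbMultipleWorkQueues, a thin wrapper around Berkeley DB. Internally it holds a Berkeley Database that stores all links waiting to be processed.
Berkeley Database: the database that stores the waiting links.
BdbWorkQueue: represents a single queue of links, all of which share the same key. It actually obtains a link from the database of pending links by calling BdbMultipleWorkQueues' get() method.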
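One way to picture how a single database can serve many queues is a sorted map in which every record key begins with the key of the queue it belongs to. The sketch below is a rough illustration under assumptions: a TreeMap stands in for Berkeley DB, the class SortedStoreQueues is hypothetical, and the key layout (queue key, a tab separator, then an insertion counter) is invented for the example and is not Heritrix's actual byte layout. Queue keys are assumed not to contain a tab.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a multi-queue store backed by one sorted database.
class SortedStoreQueues {
    private final TreeMap<String, String> db = new TreeMap<>();
    private long counter = 0; // global insertion counter keeps each queue FIFO

    // Every record of a queue starts with this prefix.
    private static String prefix(String queueKey) {
        return queueKey + "\t";
    }

    void put(String queueKey, String url) {
        // Zero-padded counter so that string order equals insertion order.
        db.put(prefix(queueKey) + String.format("%019d", counter++), url);
    }

    // Finds the first record at or after the queue's prefix.
    private Map.Entry<String, String> head(String queueKey) {
        Map.Entry<String, String> e = db.ceilingEntry(prefix(queueKey));
        if (e == null || !e.getKey().startsWith(prefix(queueKey))) {
            return null; // nearest record belongs to another queue
        }
        return e;
    }

    // Returns the head link of the queue without removing it, or null.
    String get(String queueKey) {
        Map.Entry<String, String> e = head(queueKey);
        return (e == null) ? null : e.getValue();
    }

    // Removes and returns the head link of the queue, or null if it is empty.
    String remove(String queueKey) {
        Map.Entry<String, String> e = head(queueKey);
        if (e == null) {
            return null;
        }
        db.remove(e.getKey());
        return e.getValue();
    }
}

class SortedStoreDemo {
    public static void main(String[] args) {
        SortedStoreQueues store = new SortedStoreQueues();
        store.put("a.com", "http://a.com/1");
        store.put("b.com", "http://b.com/1");
        store.put("a.com", "http://a.com/2");
        System.out.println(store.get("a.com"));
        System.out.println(store.get("b.com"));
    }
}
```

The lookup in head() plays the part that the text attributes to getNextNearestItem(): seek to the position where the queue's records would start, and take the nearest record if it really belongs to that queue.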
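(4) BdbUriUniqFilter: this is in effect a filter. It checks whether a link that is about to enter the pending queue has already been seen, so that the same link is not crawled twice.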
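The schedule() listing earlier hands every link to alreadyIncluded through add() or addForce(), which is where a filter of this kind sits. A minimal sketch of that idea follows, with several assumptions: SketchUniqFilter and NewUrlReceiver are hypothetical names, a HashSet replaces the Berkeley DB backed store, and links are plain strings rather than CandidateURI objects.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical callback: whoever wants to be told about links that pass the filter.
interface NewUrlReceiver {
    void receive(String url);
}

// Hypothetical in-memory uniqueness filter; illustration only.
class SketchUniqFilter {
    private final Set<String> seen = new HashSet<>();
    private final NewUrlReceiver receiver;

    SketchUniqFilter(NewUrlReceiver receiver) {
        this.receiver = receiver;
    }

    // Pass the link on only the first time its canonical key is seen.
    void add(String canonicalKey, String url) {
        if (seen.add(canonicalKey)) {
            receiver.receive(url);
        }
    }

    // Always pass the link on, but still remember its key.
    void addForce(String canonicalKey, String url) {
        seen.add(canonicalKey);
        receiver.receive(url);
    }

    // Number of distinct keys remembered so far.
    int count() {
        return seen.size();
    }
}

class UniqFilterDemo {
    public static void main(String[] args) {
        SketchUniqFilter filter =
                new SketchUniqFilter(url -> System.out.println("queued " + url));
        filter.add("example.com/", "http://example.com/");
        filter.add("example.com/", "http://EXAMPLE.com/"); // same key, dropped
        filter.addForce("example.com/", "http://example.com/"); // forced through
    }
}
```

The design point worth noticing is that schedule() never touches the pending queue directly: the filter decides, and only links that pass are delivered to the receiver, which in this sketch would be the component that enqueues them.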
(5) finished(CrawlURI cURI): marks a processed link as finished.
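Since the text gives no listing for finished(), the following sketch shows one plausible way it fits together with next(): a queue that has handed out a link is parked as "in process" and only becomes ready again once that link is finished, so each queue has at most one link in flight. This is a simplified assumption-based model, not the WorkQueueFrontier code: LifecycleFrontier and LifecycleItem are hypothetical, and retries, politeness delays and inactive queues are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical pairing of a link with the key of the queue it came from.
class LifecycleItem {
    final String queueKey;
    final String url;

    LifecycleItem(String queueKey, String url) {
        this.queueKey = queueKey;
        this.url = url;
    }
}

// Hypothetical model of the ready / in-process queue rotation.
class LifecycleFrontier {
    private final Map<String, Deque<String>> allQueues = new HashMap<>();
    private final Deque<String> readyKeys = new ArrayDeque<>(); // queues that may emit
    private final Set<String> inProcessKeys = new HashSet<>();  // queues with a link in flight

    void schedule(String queueKey, String url) {
        Deque<String> q = allQueues.computeIfAbsent(queueKey, k -> new ArrayDeque<>());
        boolean wasEmpty = q.isEmpty();
        q.addLast(url);
        if (wasEmpty) {
            readyKeys.addLast(queueKey); // an empty queue is neither ready nor in process
        }
    }

    // Emits the head of the next ready queue and parks that queue as in process.
    LifecycleItem next() {
        String key = readyKeys.pollFirst();
        if (key == null) {
            return null; // nothing ready right now
        }
        inProcessKeys.add(key);
        return new LifecycleItem(key, allQueues.get(key).peekFirst());
    }

    // Removes the finished link and makes its queue ready again if links remain.
    void finished(LifecycleItem item) {
        if (!inProcessKeys.remove(item.queueKey)) {
            throw new IllegalStateException("queue not in process: " + item.queueKey);
        }
        Deque<String> q = allQueues.get(item.queueKey);
        q.removeFirst();
        if (!q.isEmpty()) {
            readyKeys.addLast(item.queueKey);
        }
    }
}

class LifecycleDemo {
    public static void main(String[] args) {
        LifecycleFrontier frontier = new LifecycleFrontier();
        frontier.schedule("a.com", "http://a.com/1");
        frontier.schedule("a.com", "http://a.com/2");
        LifecycleItem item = frontier.next();
        System.out.println("in flight: " + item.url);
        System.out.println("a.com ready again before finish? " + (frontier.next() != null));
        frontier.finished(item);
        System.out.println("after finish: " + frontier.next().url);
    }
}
```

In this model the link stays at the head of its queue until finished() removes it, which mirrors the peek-then-dequeue split seen in the next() listing.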