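This article continues the analysis of the processor-related source code in Heritrix 3.1.0.
As usual, let's start with a look at the class UML diagram.
All processors inherit from the abstract parent class Processor. Its key method is shown below: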
/**
 * Processes the given URI.  First checks {@link #ENABLED} and
 * {@link #DECIDE_RULES}.  If ENABLED is false, then nothing happens.
 * If the DECIDE_RULES indicate REJECT, then the
 * {@link #innerRejectProcess(ProcessorURI)} method is invoked, and
 * the process method returns.
 *
 * <p>Next, the {@link #shouldProcess(ProcessorURI)} method is
 * consulted to see if this Processor knows how to handle the given
 * URI.  If it returns false, then nothing futher occurs.
 *
 * <p>FIXME: Should innerRejectProcess be called when ENABLED is false,
 * or when shouldProcess returns false?  The previous Processor
 * implementation didn't handle it that way.
 *
 * <p>Otherwise, the URI is considered valid.  This processor's count
 * of handled URIs is incremented, and the
 * {@link #innerProcess(ProcessorURI)} method is invoked to actually
 * perform the process.
 *
 * @param uri  The URI to process
 * @throws  InterruptedException   if the thread is interrupted
 */
public ProcessResult process(CrawlURI uri)
throws InterruptedException {
    if (!getEnabled()) {
        return ProcessResult.PROCEED;
    }

    if (getShouldProcessRule().decisionFor(uri) == DecideResult.REJECT) {
        innerRejectProcess(uri);
        return ProcessResult.PROCEED;
    }

    if (shouldProcess(uri)) {
        uriCount.incrementAndGet();
        return innerProcessResult(uri);
    } else {
        return ProcessResult.PROCEED;
    }
}
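Once the enabled check and the DecideRule check have passed, process() asks shouldProcess(CrawlURI uri) whether this processor should handle the URI. shouldProcess(CrawlURI uri) is an abstract method implemented by subclasses: each concrete processor decides for itself whether the current CrawlURI needs to pass through it.
If it does, process() goes on to call ProcessResult innerProcessResult(CrawlURI uri) (some subclasses override this method):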
protected ProcessResult innerProcessResult(CrawlURI uri)
throws InterruptedException {
    innerProcess(uri);
    return ProcessResult.PROCEED;
}
That method in turn calls void innerProcess(CrawlURI uri), which is abstract and implemented by subclasses:
/**
 * Actually performs the process.  By the time this method is invoked,
 * it is known that the given URI passes the {@link #ENABLED}, the
 * {@link #DECIDE_RULES} and the {@link #shouldProcess(ProcessorURI)}
 * tests.
 *
 * @param uri    the URI to process
 * @throws InterruptedException   if the thread is interrupted
 */
protected abstract void innerProcess(CrawlURI uri)
throws InterruptedException;
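The call sequence process() -> shouldProcess() -> innerProcessResult() -> innerProcess() is a template-method arrangement: the parent class fixes the control flow and subclasses fill in the two abstract hooks. Below is a minimal, self-contained sketch of that shape. It does not use the Heritrix classes: SketchResult, SketchProcessor and HttpOnlyProcessor are hypothetical names made up for illustration, a plain String stands in for CrawlURI, and the DecideRule check is left out.

```java
// Simplified stand-ins for illustration only; NOT the real Heritrix classes.
enum SketchResult { PROCEED }

abstract class SketchProcessor {
    private boolean enabled = true;
    private int uriCount = 0;

    void setEnabled(boolean enabled) { this.enabled = enabled; }
    int getUriCount() { return uriCount; }

    // Template method: the control flow is fixed here, mirroring the shape
    // of Processor.process (the DecideRule check is omitted in this sketch).
    final SketchResult process(String uri) {
        if (!enabled) {
            return SketchResult.PROCEED;
        }
        if (shouldProcess(uri)) {
            uriCount++;
            return innerProcessResult(uri);
        }
        return SketchResult.PROCEED;
    }

    // Default hook: delegate to innerProcess and keep going.
    // Subclasses may override this to return a different result.
    protected SketchResult innerProcessResult(String uri) {
        innerProcess(uri);
        return SketchResult.PROCEED;
    }

    // Hooks that every concrete processor must supply.
    protected abstract boolean shouldProcess(String uri);
    protected abstract void innerProcess(String uri);
}

// Hypothetical concrete processor: only handles http:// URIs.
class HttpOnlyProcessor extends SketchProcessor {
    final StringBuilder seen = new StringBuilder();

    @Override
    protected boolean shouldProcess(String uri) {
        return uri.startsWith("http://");
    }

    @Override
    protected void innerProcess(String uri) {
        seen.append(uri).append('\n');
    }
}

class ProcessorSketchDemo {
    public static void main(String[] args) {
        HttpOnlyProcessor p = new HttpOnlyProcessor();
        p.process("http://example.org/");
        p.process("ftp://example.org/");
        System.out.println("handled URIs: " + p.getUriCount());
    }
}
```

Only the URI accepted by shouldProcess is counted and reaches innerProcess; the other one passes through untouched, which is the same behaviour as the PROCEED branches in the real method above.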
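The subclasses of Processor fall logically into several categories. At runtime they belong to different processor chains, and each category also has its own place in the class inheritance hierarchy.
In this article and the ones that follow I can only pick a subset of the processors to analyze.
CandidatesProcessor: this processor holds a CandidateChain candidateChain member and runs each candidate URI through the processors of that chain.
A CrawlURI cURI that passes through this processor is ultimately handed to BdbFrontier's schedule(CrawlURI cURI) method and added to the BDB database.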
/**
 * Candidate chain
 */
protected CandidateChain candidateChain;
public CandidateChain getCandidateChain() {
    return this.candidateChain;
}
@Autowired
public void setCandidateChain(CandidateChain candidateChain) {
    this.candidateChain = candidateChain;
}

/**
 * The frontier to use.
 */
protected Frontier frontier;
public Frontier getFrontier() {
    return this.frontier;
}
@Autowired
public void setFrontier(Frontier frontier) {
    this.frontier = frontier;
}
The processing method that actually gets invoked is as follows:
/* (non-Javadoc)
 * @see org.archive.modules.Processor#innerProcess(org.archive.modules.CrawlURI)
 */
@Override
protected void innerProcess(final CrawlURI curi) throws InterruptedException {
    // Handle any prerequisites when S_DEFERRED for prereqs
    if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
        CrawlURI prereq = curi.getPrerequisiteUri();
        prereq.setFullVia(curi);
        sheetOverlaysManager.applyOverlaysTo(prereq);
        try {
            KeyedProperties.clearOverridesFrom(curi);
            KeyedProperties.loadOverridesFrom(prereq);

            getCandidateChain().process(prereq, null);

            if(prereq.getFetchStatus()>=0) {
                frontier.schedule(prereq);
            } else {
                curi.setFetchStatus(S_PREREQUISITE_UNSCHEDULABLE_FAILURE);
            }
        } finally {
            KeyedProperties.clearOverridesFrom(prereq);
            KeyedProperties.loadOverridesFrom(curi);
        }
        return;
    }

    // Don't consider candidate links of error pages
    if (curi.getFetchStatus() < 200 || curi.getFetchStatus() >= 400) {
        curi.getOutLinks().clear();
        return;
    }

    for (Link wref: curi.getOutLinks()) {
        CrawlURI candidate;
        try {
            candidate = curi.createCrawlURI(curi.getBaseURI(),wref);
            // at least for duration of candidatechain, offer
            // access to full CrawlURI of via
            candidate.setFullVia(curi);
        } catch (URIException e) {
            loggerModule.logUriError(e, curi.getUURI(),
                    wref.getDestination().toString());
            continue;
        }

        sheetOverlaysManager.applyOverlaysTo(candidate);
        try {
            KeyedProperties.clearOverridesFrom(curi);
            KeyedProperties.loadOverridesFrom(candidate);

            if(getSeedsRedirectNewSeeds() && curi.isSeed()
                    && wref.getHopType() == Hop.REFER
                    && candidate.getHopCount() < SEEDS_REDIRECT_NEW_SEEDS_MAX_HOPS) {
                candidate.setSeed(true);
            }
            getCandidateChain().process(candidate, null);
            if(candidate.getFetchStatus()>=0) {
                if(checkForSeedPromotion(candidate)) {
                    /*
                     * We want to guarantee crawling of seed version of
                     * CrawlURI even if same url has already been enqueued,
                     * see https://webarchive.jira.com/browse/HER-1891
                     */
                    candidate.setForceFetch(true);
                    getSeeds().addSeed(candidate);
                } else {
                    frontier.schedule(candidate);
                }
                curi.getOutCandidates().add(candidate);
            }
        } finally {
            KeyedProperties.clearOverridesFrom(candidate);
            KeyedProperties.loadOverridesFrom(curi);
        }
    }
    curi.getOutLinks().clear();
}
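Stripped of the overlay and seed-handling details, the main loop of the method above does three things: it skips the out-links of error pages, runs every out-link through the candidate chain, and schedules only those candidates whose fetch status is still non-negative afterwards. The following is a rough sketch of just that flow, under stated assumptions: SketchCandidate, CandidateFlowSketch, scopeByPrefix and the REJECTED value are all hypothetical, a list of Consumer steps stands in for CandidateChain, and a returned list stands in for frontier.schedule. How the real chain reacts to each processor's result is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a candidate CrawlURI.
class SketchCandidate {
    final String uri;
    int fetchStatus = 0; // a negative value means "rejected" in this sketch
    SketchCandidate(String uri) { this.uri = uri; }
}

class CandidateFlowSketch {
    // Hypothetical marker value, not a real Heritrix status code.
    static final int REJECTED = -1;

    // A chain step that rejects URIs outside the given prefix
    // (a very rough stand-in for a scoping processor).
    static Consumer<SketchCandidate> scopeByPrefix(String prefix) {
        return c -> {
            if (!c.uri.startsWith(prefix)) {
                c.fetchStatus = REJECTED;
            }
        };
    }

    // Returns the URIs that would be handed to the frontier.
    static List<String> scheduleOutLinks(int parentStatus,
                                         List<String> outLinks,
                                         List<Consumer<SketchCandidate>> chain) {
        List<String> scheduled = new ArrayList<>();
        // Don't consider candidate links of error pages.
        if (parentStatus < 200 || parentStatus >= 400) {
            return scheduled;
        }
        for (String link : outLinks) {
            SketchCandidate candidate = new SketchCandidate(link);
            // Run the candidate through every step of the chain, in order.
            for (Consumer<SketchCandidate> step : chain) {
                step.accept(candidate);
            }
            // Only candidates that were not rejected get scheduled.
            if (candidate.fetchStatus >= 0) {
                scheduled.add(candidate.uri);
            }
        }
        return scheduled;
    }

    public static void main(String[] args) {
        List<Consumer<SketchCandidate>> chain = new ArrayList<>();
        chain.add(scopeByPrefix("http://example.org/"));
        List<String> links = new ArrayList<>();
        links.add("http://example.org/a");
        links.add("http://other.test/b");
        System.out.println(scheduleOutLinks(200, links, chain));
    }
}
```

The sketch keeps the same ordering as the original: the status check on the parent comes first, then the chain, then the scheduling decision based on the candidate's status.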
Looking at the crawl job configuration file crawler-beans.cxml, the processors that make up the CandidateChain candidateChain are declared as follows:
<!-- CANDIDATE CHAIN -->
<!-- first, processors are declared as top-level named beans -->
<bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
</bean>
<bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer">
  <!-- <property name="preferenceDepthHops" value="-1" /> -->
  <!-- <property name="preferenceEmbedHops" value="1" /> -->
  <!-- <property name="canonicalizationPolicy">
        <ref bean="canonicalizationPolicy" />
       </property> -->
  <!-- <property name="queueAssignmentPolicy">
        <ref bean="queueAssignmentPolicy" />
       </property> -->
  <!-- <property name="uriPrecedencePolicy">
        <ref bean="uriPrecedencePolicy" />
       </property> -->
  <!-- <property name="costAssignmentPolicy">
        <ref bean="costAssignmentPolicy" />
       </property> -->
</bean>
<!-- now, processors are assembled into ordered CandidateChain bean -->
<bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
  <property name="processors">
    <list>
      <!-- apply scoping rules to each individual candidate URI... -->
      <ref bean="candidateScoper"/>
      <!-- ...then prepare those ACCEPTed to be enqueued to frontier. -->
      <ref bean="preparer"/>
    </list>
  </property>
</bean>
---------------------------------------------------------------------------
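This Heritrix 3.1.0 source code analysis series is my own original work.
When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/23/3036954.html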