Continuing from the previous article, this post analyzes the processors associated with the CandidateChain candidateChain processor chain.
The CandidateChain processor chain contains two processors:
org.archive.crawler.prefetch.CandidateScoper
org.archive.crawler.prefetch.FrontierPreparer
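In the stock crawler-beans.cxml that ships with Heritrix 3, these two processors are typically wired into the candidate chain roughly as follows (a sketch based on the default configuration; bean ids and comments may differ in your setup):

```xml
<bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
</bean>
<bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer">
</bean>
<bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
 <property name="processors">
  <list>
   <!-- apply scoping rules to each candidate URI... -->
   <ref bean="candidateScoper"/>
   <!-- ...then prepare those ACCEPTed for enqueuing to the frontier -->
   <ref bean="preparer"/>
  </list>
 </property>
</bean>
```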
To understand the processors above, we first need to look at another abstract class, Scoper, which extends the abstract parent class Processor. This class controls the scope of a CrawlURI caUri object and holds a member variable DecideRule scope:

protected DecideRule scope;

public DecideRule getScope() {
    return this.scope;
}

@Autowired
public void setScope(DecideRule scope) {
    this.scope = scope;
}
The key method of this class is shown below; it delegates to the DecideResult decisionFor(CrawlURI uri) method of the member variable DecideRule scope:

/**
 * Schedule the given {@link CrawlURI CrawlURI} with the Frontier.
 * @param caUri The CrawlURI to be scheduled.
 * @return true if CrawlURI was accepted by crawl scope, false
 * otherwise.
 */
protected boolean isInScope(CrawlURI caUri) {
    boolean result = false;
    //System.out.println(this.getClass().getName()+":"+"scope name:"+scope.getClass().getName());
    DecideResult dr = scope.decisionFor(caUri);
    if (dr == DecideResult.ACCEPT) {
        result = true;
        if (fileLogger != null) {
            fileLogger.info("ACCEPT " + caUri);
        }
    } else {
        outOfScope(caUri);
    }
    return result;
}

/**
 * Called when a CrawlURI is ruled out of scope.
 * Override if you don't want logs as coming from this class.
 * @param caUri CrawlURI that is out of scope.
 */
protected void outOfScope(CrawlURI caUri) {
    if (fileLogger != null) {
        fileLogger.info("REJECT " + caUri);
    }
}
Subclasses call the method above to decide whether a CrawlURI caUri object falls outside the crawl scope. Both the CandidateScoper and FrontierPreparer classes are subclasses of Scoper, as is the Preselector class, among others.
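The accept/reject pattern above can be illustrated with a small self-contained sketch (SimpleScoper and its rule are hypothetical names for illustration only, not the actual Heritrix API; the real Scoper delegates to a configured DecideRule):

```java
import java.util.function.Predicate;

// ACCEPT/REJECT decision, mirroring Heritrix's DecideResult idea
enum Decision { ACCEPT, REJECT }

public class SimpleScoper {
    private final Predicate<String> scopeRule;

    public SimpleScoper(Predicate<String> scopeRule) {
        this.scopeRule = scopeRule;
    }

    // Plays the role of DecideRule.decisionFor(uri)
    Decision decisionFor(String uri) {
        return scopeRule.test(uri) ? Decision.ACCEPT : Decision.REJECT;
    }

    // Mirrors Scoper.isInScope(): only an explicit ACCEPT passes
    public boolean isInScope(String uri) {
        return decisionFor(uri) == Decision.ACCEPT;
    }

    public static void main(String[] args) {
        // Scope rule: only URIs containing example.com are accepted
        SimpleScoper scoper = new SimpleScoper(u -> u.contains("example.com"));
        System.out.println(scoper.isInScope("http://example.com/a.html")); // true
        System.out.println(scoper.isInScope("http://other.org/b.html"));   // false
    }
}
```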
The CandidateScoper class is very simple: it overrides the ProcessResult innerProcessResult(CrawlURI curi) method of the Processor class:

@Override
protected ProcessResult innerProcessResult(CrawlURI curi) throws InterruptedException {
    if (!isInScope(curi)) {
        // Scope rejected
        curi.setFetchStatus(S_OUT_OF_SCOPE);
        return ProcessResult.FINISH;
    }
    return ProcessResult.PROCEED;
}
The expression !isInScope(curi) calls the method of the abstract parent class Scoper to determine whether the current CrawlURI curi object is out of scope.
The FrontierPreparer class mainly sets related values on the CrawlURI curi object, preparing it for fetching (this class does not appear to call any method of the abstract parent class Scoper):

/* (non-Javadoc)
 * @see org.archive.modules.Processor#innerProcess(org.archive.modules.CrawlURI)
 */
@Override
protected void innerProcess(CrawlURI curi) {
    prepare(curi);
}

/**
 * Apply all configured policies to CrawlURI
 *
 * @param curi CrawlURI
 */
public void prepare(CrawlURI curi) {
    // set schedulingDirective
    curi.setSchedulingDirective(getSchedulingDirective(curi));
    // set canonicalized version
    curi.setCanonicalString(canonicalize(curi));
    // set queue key
    curi.setClassKey(getClassKey(curi));
    // set cost
    curi.setHolderCost(getCost(curi));
    // set URI precedence
    getUriPrecedencePolicy().uriScheduled(curi);
}
The void prepare(CrawlURI curi) method above sets the related values on the CrawlURI curi object; the methods that compute those values are as follows:
/**
 * Calculate the coarse, original 'schedulingDirective' prioritization
 * for the given CrawlURI
 *
 * @param curi
 * @return
 */
protected int getSchedulingDirective(CrawlURI curi) {
    if (StringUtils.isNotEmpty(curi.getPathFromSeed())) {
        char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length() - 1);
        if (lastHop == 'R') {
            // refer
            return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
        }
    }
    if (getPreferenceDepthHops() == 0) {
        return HIGH;
        // this implies seed redirects are treated as path
        // length 1, which I believe is standard.
        // curi.getPathFromSeed() can never be null here, because
        // we're processing a link extracted from curi
    } else if (getPreferenceDepthHops() > 0
            && curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
        return HIGH;
    } else {
        // optionally preferencing embeds up to MEDIUM
        int prefHops = getPreferenceEmbedHops();
        if (prefHops > 0) {
            int embedHops = curi.getTransHops();
            if (embedHops > 0 && embedHops <= prefHops
                    && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                // number of embed hops falls within the preferenced range, and
                // uri is not already MEDIUM -- so promote it
                return MEDIUM;
            }
        }
        // Everything else stays as previously assigned
        // (probably NORMAL, at least for now)
        return curi.getSchedulingDirective();
    }
}

/**
 * Canonicalize passed CrawlURI. This method differs from
 * {@link #canonicalize(UURI)} in that it takes a look at
 * the CrawlURI context, possibly overriding any canonicalization effect if
 * it could make us miss content. If canonicalization produces an URL that
 * was 'alreadyseen', but the entry in the 'alreadyseen' database did
 * nothing but redirect to the current URL, we won't get the current URL;
 * we'll think we've already seen it. Examples would be archive.org
 * redirecting to www.archive.org or the inverse, www.netarkivet.net
 * redirecting to netarkivet.net (assuming stripWWW rule enabled).
 * <p>Note, this method under circumstance sets the forceFetch flag.
 *
 * @param cauri CrawlURI to examine.
 * @return Canonicalized <code>cauri</code>.
 */
protected String canonicalize(CrawlURI cauri) {
    String canon = getCanonicalizationPolicy().canonicalize(cauri.getURI());
    if (cauri.isLocation()) {
        // If the via is not the same as where we're being redirected (i.e.
        // we're not being redirected back to the same page), AND the
        // canonicalization of the via is equal to the current cauri,
        // THEN forcefetch (so there is no chance of our not crawling
        // content because the alreadyseen check thinks it has seen the url before).
        // An example of an URL that redirects to itself is:
        // http://bridalelegance.com/images/buttons3/tuxedos-off.gif.
        // An example of an URL whose canonicalization equals its via's
        // canonicalization, and we want to fetch content at the
        // redirection (i.e. need to set forcefetch), is netarkivet.dk.
        if (!cauri.toString().equals(cauri.getVia().toString())
                && getCanonicalizationPolicy().canonicalize(
                        cauri.getVia().toCustomString()).equals(canon)) {
            cauri.setForceFetch(true);
        }
    }
    return canon;
}

/**
 * @param curi CrawlURI we're to get a key for.
 * @return a String token representing a queue
 */
public String getClassKey(CrawlURI curi) {
    assert KeyedProperties.overridesActiveFrom(curi);
    String queueKey = getQueueAssignmentPolicy().getClassKey(curi);
    return queueKey;
}

/**
 * Return the 'cost' of a CrawlURI (how much of its associated
 * queue's budget it depletes upon attempted processing)
 *
 * @param curi
 * @return the associated cost
 */
protected int getCost(CrawlURI curi) {
    assert KeyedProperties.overridesActiveFrom(curi);
    int cost = curi.getHolderCost();
    if (cost == CrawlURI.UNCALCULATED) {
        cost = getCostAssignmentPolicy().costOf(curi);
    }
    return cost;
}
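The alreadyseen collision that canonicalize() guards against can be shown with a tiny standalone sketch (the canonicalize helper below is a crude, hypothetical stand-in for Heritrix's canonicalization policy with a stripWWW-style rule, for illustration only):

```java
public class CanonicalizeDemo {
    // Crude stripWWW-style rule: drop a leading "www." after the scheme
    static String canonicalize(String uri) {
        return uri.replaceFirst("://www\\.", "://");
    }

    public static void main(String[] args) {
        String via = "http://www.archive.org/";      // the redirecting URL
        String target = "http://archive.org/";       // where it redirects to
        // Both canonicalize to the same string, so without forceFetch the
        // alreadyseen check would treat the redirect target as already seen
        // and the crawler would never fetch its content.
        System.out.println(canonicalize(via).equals(canonicalize(target))); // true
    }
}
```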
These methods involve the corresponding policy classes; that is a large topic in itself, so I will leave it for a later article.
The Preselector class is used to filter CrawlURI curi objects with configured regular expressions:

@Override
protected ProcessResult innerProcessResult(CrawlURI puri) {
    CrawlURI curi = (CrawlURI) puri;

    // Check if uris should be blocked
    if (getBlockAll()) {
        curi.setFetchStatus(S_BLOCKED_BY_USER);
        return ProcessResult.FINISH;
    }

    // Check if allowed by regular expression
    String regex = getAllowByRegex();
    if (regex != null && !regex.equals("")) {
        if (!TextUtils.matches(regex, curi.toString())) {
            curi.setFetchStatus(S_BLOCKED_BY_USER);
            return ProcessResult.FINISH;
        }
    }

    // Check if blocked by regular expression
    regex = getBlockByRegex();
    if (regex != null && !regex.equals("")) {
        if (TextUtils.matches(regex, curi.toString())) {
            curi.setFetchStatus(S_BLOCKED_BY_USER);
            return ProcessResult.FINISH;
        }
    }

    // Possibly recheck scope
    if (getRecheckScope()) {
        if (!isInScope(curi)) {
            // Scope rejected
            curi.setFetchStatus(S_OUT_OF_SCOPE);
            return ProcessResult.FINISH;
        }
    }

    return ProcessResult.PROCEED;
}
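The allow-then-block ordering above can be sketched as a standalone gate (RegexGate is a hypothetical class written for illustration; it uses plain java.util String matching rather than Heritrix's TextUtils):

```java
public class RegexGate {
    private final String allowByRegex;
    private final String blockByRegex;

    public RegexGate(String allowByRegex, String blockByRegex) {
        this.allowByRegex = allowByRegex;
        this.blockByRegex = blockByRegex;
    }

    // Same order as Preselector: allow pattern first, then block pattern
    public boolean permits(String uri) {
        if (allowByRegex != null && !allowByRegex.isEmpty()
                && !uri.matches(allowByRegex)) {
            return false; // not covered by the allow pattern
        }
        if (blockByRegex != null && !blockByRegex.isEmpty()
                && uri.matches(blockByRegex)) {
            return false; // explicitly blocked
        }
        return true;
    }

    public static void main(String[] args) {
        // Allow only example.com, block anything ending in .pdf
        RegexGate gate = new RegexGate("https?://example\\.com/.*", ".*\\.pdf$");
        System.out.println(gate.permits("http://example.com/page.html")); // true
        System.out.println(gate.permits("http://example.com/doc.pdf"));   // false
        System.out.println(gate.permits("http://other.org/page.html"));   // false
    }
}
```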
An example of the corresponding configuration in the crawler-beans.cxml configuration file:

<!-- FETCH CHAIN -->
<!-- first, processors are declared as top-level named beans -->
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
 <!-- <property name="recheckScope" value="false" /> -->
 <!-- <property name="blockAll" value="false" /> -->
 <!-- <property name="blockByRegex" value="" /> -->
 <!-- <property name="allowByRegex" value="" /> -->
</bean>
---------------------------------------------------------------------------
This Heritrix 3.1.0 source code analysis series is my original work.
Please credit the source when reposting: 博客园 (cnblogs), 刺猬的温驯
Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037360.html