Heritrix 3.1.0 Source Code Analysis (Part 20)

Continuing from the previous article, this post analyzes the processors associated with the CandidateChain candidateChain processor chain.

The CandidateChain processor chain contains two processors:

org.archive.crawler.prefetch.CandidateScoper

org.archive.crawler.prefetch.FrontierPreparer

To understand these processors, we first need to look at another abstract class, Scoper, which extends the abstract parent class Processor. Scoper controls whether a CrawlURI caUri object falls within the crawl scope, via its member variable DecideRule scope:

    protected DecideRule scope;

    public DecideRule getScope() {
        return this.scope;
    }

    @Autowired
    public void setScope(DecideRule scope) {
        this.scope = scope;
    }

The key method of this class is shown below; it delegates to the member variable DecideRule scope by calling its DecideResult decisionFor(CrawlURI uri) method:

    /**
     * Check if the given {@link CrawlURI CrawlURI} is within the crawl scope.
     * @param caUri The CrawlURI to be checked.
     * @return true if CrawlURI was accepted by crawl scope, false
     * otherwise.
     */
    protected boolean isInScope(CrawlURI caUri) {
        boolean result = false;
        DecideResult dr = scope.decisionFor(caUri);
        if (dr == DecideResult.ACCEPT) {
            result = true;
            if (fileLogger != null) {
                fileLogger.info("ACCEPT " + caUri);
            }
        } else {
            outOfScope(caUri);
        }
        return result;
    }

    /**
     * Called when a CrawlURI is ruled out of scope.
     * Override if you don't want logs as coming from this class.
     * @param caUri CrawlURI that is out of scope.
     */
    protected void outOfScope(CrawlURI caUri) {
        if (fileLogger != null) {
            fileLogger.info("REJECT " + caUri);
        }
    }
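The gating pattern above can be sketched independently of Heritrix: a decide rule returns ACCEPT or REJECT for a URI, and the scoper only lets ACCEPTed URIs proceed. The class and helper names below are illustrative stand-ins, not the actual Heritrix API:

```java
// Minimal sketch of the Scoper gating pattern (illustrative names,
// not the real Heritrix classes).
import java.util.function.Predicate;

public class ScopeSketch {
    enum DecideResult { ACCEPT, REJECT, NONE }

    // Stand-in for DecideRule.decisionFor(): decide based on the URI string.
    static DecideResult decisionFor(String uri, Predicate<String> rule) {
        return rule.test(uri) ? DecideResult.ACCEPT : DecideResult.REJECT;
    }

    // Mirrors Scoper.isInScope(): only an ACCEPT result lets the URI through.
    static boolean isInScope(String uri, Predicate<String> rule) {
        return decisionFor(uri, rule) == DecideResult.ACCEPT;
    }

    public static void main(String[] args) {
        // A toy "scope": accept only URIs on example.com.
        Predicate<String> sameHost = u -> u.contains("example.com");
        System.out.println(isInScope("http://example.com/a", sameHost)); // true
        System.out.println(isInScope("http://other.org/b", sameHost));   // false
    }
}
```

In Heritrix the rule is usually a whole DecideRuleSequence configured in crawler-beans.cxml, but the accept/reject contract is the same.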

Subclasses of Scoper call this method to decide whether a CrawlURI caUri object is out of scope. Both CandidateScoper and FrontierPreparer are subclasses, as is the Preselector class, among others.

The CandidateScoper class is very simple: it overrides the ProcessResult innerProcessResult(CrawlURI curi) method of the Processor class:

    @Override
    protected ProcessResult innerProcessResult(CrawlURI curi) throws InterruptedException {
        if (!isInScope(curi)) {
            // Scope rejected
            curi.setFetchStatus(S_OUT_OF_SCOPE);
            return ProcessResult.FINISH;
        }
        return ProcessResult.PROCEED;
    }

The expression !isInScope(curi) calls the inherited method of the abstract parent class Scoper to check whether the current CrawlURI curi object is out of scope; if it is, processing finishes here.

The FrontierPreparer class mainly sets values on the CrawlURI curi object to prepare it for fetching (it does not appear to call any methods of its abstract parent class Scoper):

    /* (non-Javadoc)
     * @see org.archive.modules.Processor#innerProcess(org.archive.modules.CrawlURI)
     */
    @Override
    protected void innerProcess(CrawlURI curi) {
        prepare(curi);
    }

    /**
     * Apply all configured policies to CrawlURI
     *
     * @param curi CrawlURI
     */
    public void prepare(CrawlURI curi) {
        // set schedulingDirective
        curi.setSchedulingDirective(getSchedulingDirective(curi));

        // set canonicalized version
        curi.setCanonicalString(canonicalize(curi));

        // set queue key
        curi.setClassKey(getClassKey(curi));

        // set cost
        curi.setHolderCost(getCost(curi));

        // set URI precedence
        getUriPrecedencePolicy().uriScheduled(curi);
    }

The void prepare(CrawlURI curi) method above sets the relevant values on the CrawlURI curi object; the methods that compute those values are shown below.

    /**
     * Calculate the coarse, original 'schedulingDirective' prioritization
     * for the given CrawlURI
     *
     * @param curi
     * @return
     */
    protected int getSchedulingDirective(CrawlURI curi) {
        if(StringUtils.isNotEmpty(curi.getPathFromSeed())) {
            char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length()-1);
            if(lastHop == 'R') {
                // refer
                return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
            }
        }
        if (getPreferenceDepthHops() == 0) {
            return HIGH;
            // this implies seed redirects are treated as path
            // length 1, which I believe is standard.
            // curi.getPathFromSeed() can never be null here, because
            // we're processing a link extracted from curi
        } else if (getPreferenceDepthHops() > 0 &&
            curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
            return HIGH;
        } else {
            // optionally preferencing embeds up to MEDIUM
            int prefHops = getPreferenceEmbedHops();
            if (prefHops > 0) {
                int embedHops = curi.getTransHops();
                if (embedHops > 0 && embedHops <= prefHops
                        && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                    // number of embed hops falls within the preferenced range, and
                    // uri is not already MEDIUM -- so promote it
                    return MEDIUM;
                }
            }
            // Everything else stays as previously assigned
            // (probably NORMAL, at least for now)
            return curi.getSchedulingDirective();
        }
    }
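The decision logic above can be condensed into a standalone sketch. The constants mirror SchedulingConstants (lower values mean higher priority), and the method signature is ours, passing in explicitly what the real method reads from the CrawlURI and its configuration:

```java
// Simplified sketch of the schedulingDirective decision: hop path,
// preference depth, and embed hops (constants mirror SchedulingConstants).
public class DirectiveSketch {
    static final int HIGH = 1, MEDIUM = 2, NORMAL = 3;

    static int directiveFor(String pathFromSeed, int preferenceDepthHops,
                            int preferenceEmbedHops, int transHops, int current) {
        // A trailing 'R' (redirect) hop is preferenced.
        if (!pathFromSeed.isEmpty()
                && pathFromSeed.charAt(pathFromSeed.length() - 1) == 'R') {
            return preferenceDepthHops >= 0 ? HIGH : MEDIUM;
        }
        // Within the preferenced distance from the seed: HIGH.
        if (preferenceDepthHops == 0
                || (preferenceDepthHops > 0
                    && pathFromSeed.length() + 1 <= preferenceDepthHops)) {
            return HIGH;
        }
        // Embeds within the preferenced embed-hop range may be promoted.
        if (preferenceEmbedHops > 0 && transHops > 0
                && transHops <= preferenceEmbedHops && current == NORMAL) {
            return MEDIUM;
        }
        return current; // otherwise keep the previously assigned directive
    }

    public static void main(String[] args) {
        System.out.println(directiveFor("LR", -1, 0, 0, NORMAL)); // 2 (MEDIUM)
        System.out.println(directiveFor("LL", 3, 0, 0, NORMAL));  // 1 (HIGH)
    }
}
```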

    /**
     * Canonicalize passed CrawlURI. This method differs from
     * {@link #canonicalize(UURI)} in that it takes a look at
     * the CrawlURI context, possibly overriding any canonicalization effect if
     * it could make us miss content. If canonicalization produces an URL that
     * was 'alreadyseen', but the entry in the 'alreadyseen' database did
     * nothing but redirect to the current URL, we won't get the current URL;
     * we'll think we've already seen it. Examples would be archive.org
     * redirecting to www.archive.org or the inverse, www.netarkivet.net
     * redirecting to netarkivet.net (assuming stripWWW rule enabled).
     * <p>Note, this method under circumstance sets the forceFetch flag.
     *
     * @param cauri CrawlURI to examine.
     * @return Canonicalized <code>cauri</code>.
     */
    protected String canonicalize(CrawlURI cauri) {
        String canon = getCanonicalizationPolicy().canonicalize(cauri.getURI());
        if (cauri.isLocation()) {
            // If the via is not the same as where we're being redirected (i.e.
            // we're not being redirected back to the same page), AND the
            // canonicalization of the via is equal to the current cauri,
            // THEN forceFetch (so there is no chance of not crawling content
            // because the alreadyseen check thinks it has seen the url before).
            // An example of an URL that redirects to itself is:
            // http://bridalelegance.com/images/buttons3/tuxedos-off.gif.
            // An example of an URL whose canonicalization equals its via's
            // canonicalization, and we want to fetch content at the
            // redirection (i.e. need to set forcefetch), is netarkivet.dk.
            if (!cauri.toString().equals(cauri.getVia().toString()) &&
                    getCanonicalizationPolicy().canonicalize(
                            cauri.getVia().toCustomString()).equals(canon)) {
                cauri.setForceFetch(true);
            }
        }
        return canon;
    }
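To make the archive.org example in the javadoc concrete, here is one canonicalization rule in isolation: lower-casing plus stripping a leading "www.". This is only an illustration of the kind of rule a CanonicalizationPolicy chain applies; the helper name and regex are ours, not Heritrix's:

```java
// Illustrative canonicalization: lower-case the URI and strip a leading
// "www." from the host (the stripWWW idea mentioned in the javadoc).
public class CanonSketch {
    static String canonicalize(String uri) {
        return uri.toLowerCase().replaceFirst("^(https?://)www\\.", "$1");
    }

    public static void main(String[] args) {
        // www.archive.org and archive.org collapse to the same key --
        // exactly the collision that forces canonicalize() to sometimes
        // set the forceFetch flag on redirects.
        System.out.println(canonicalize("http://www.archive.org/")); // http://archive.org/
        System.out.println(canonicalize("http://archive.org/"));     // http://archive.org/
    }
}
```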

    

    /**
     * @param curi CrawlURI we're to get a key for.
     * @return a String token representing a queue
     */
    public String getClassKey(CrawlURI curi) {
        assert KeyedProperties.overridesActiveFrom(curi);
        String queueKey = getQueueAssignmentPolicy().getClassKey(curi);
        return queueKey;
    }
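The class key determines which frontier queue a URI lands in; by default Heritrix groups URIs by hostname. A rough sketch of that idea (simplified; the real policy also handles ports, IP addresses, and parallel-queue suffixes):

```java
// Sketch of a hostname-based class key, in the spirit of Heritrix's
// default hostname queue assignment (simplified, illustrative only).
import java.net.URI;

public class ClassKeySketch {
    static String classKeyFor(String uri) {
        String host = URI.create(uri).getHost();
        // Fall back to a shared key when the URI has no host part.
        return host == null ? "default" : host;
    }

    public static void main(String[] args) {
        // All URIs on the same host share one queue key.
        System.out.println(classKeyFor("http://www.archive.org/a/b")); // www.archive.org
        System.out.println(classKeyFor("http://www.archive.org/c"));   // www.archive.org
    }
}
```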

    

    /**
     * Return the 'cost' of a CrawlURI (how much of its associated
     * queue's budget it depletes upon attempted processing)
     *
     * @param curi
     * @return the associated cost
     */
    protected int getCost(CrawlURI curi) {
        assert KeyedProperties.overridesActiveFrom(curi);

        int cost = curi.getHolderCost();
        if (cost == CrawlURI.UNCALCULATED) {
            cost = getCostAssignmentPolicy().costOf(curi);
        }
        return cost;
    }

These methods rely on corresponding policy classes; that is a larger topic, which will be analyzed in a later article.

The Preselector class supports configurable regex-based filtering of CrawlURI curi objects:

    @Override
    protected ProcessResult innerProcessResult(CrawlURI puri) {
        CrawlURI curi = (CrawlURI)puri;

        // Check if uris should be blocked
        if (getBlockAll()) {
            curi.setFetchStatus(S_BLOCKED_BY_USER);
            return ProcessResult.FINISH;
        }

        // Check if allowed by regular expression
        String regex = getAllowByRegex();
        if (regex != null && !regex.equals("")) {
            if (!TextUtils.matches(regex, curi.toString())) {
                curi.setFetchStatus(S_BLOCKED_BY_USER);
                return ProcessResult.FINISH;
            }
        }

        // Check if blocked by regular expression
        regex = getBlockByRegex();
        if (regex != null && !regex.equals("")) {
            if (TextUtils.matches(regex, curi.toString())) {
                curi.setFetchStatus(S_BLOCKED_BY_USER);
                return ProcessResult.FINISH;
            }
        }

        // Possibly recheck scope
        if (getRecheckScope()) {
            if (!isInScope(curi)) {
                // Scope rejected
                curi.setFetchStatus(S_OUT_OF_SCOPE);
                return ProcessResult.FINISH;
            }
        }

        return ProcessResult.PROCEED;
    }
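The filter ordering matters: blockAll wins outright, then allowByRegex must match (full match, like TextUtils.matches), then blockByRegex must not match. A standalone sketch of that ordering, with property names mirroring the Preselector's but using java.util.regex directly:

```java
// Standalone sketch of the Preselector filter ordering: blockAll,
// then allowByRegex (must match), then blockByRegex (must not match).
import java.util.regex.Pattern;

public class PreselectSketch {
    static boolean passes(String uri, boolean blockAll,
                          String allowByRegex, String blockByRegex) {
        if (blockAll) return false;
        if (allowByRegex != null && !allowByRegex.isEmpty()
                && !Pattern.matches(allowByRegex, uri)) {
            return false; // not on the allow list
        }
        if (blockByRegex != null && !blockByRegex.isEmpty()
                && Pattern.matches(blockByRegex, uri)) {
            return false; // explicitly blocked
        }
        return true;
    }

    public static void main(String[] args) {
        String allow = "http://example\\.com/.*", block = ".*\\.pdf";
        // Allowed host, but blocked extension: rejected.
        System.out.println(passes("http://example.com/a.pdf", false, allow, block));  // false
        // Allowed host, extension not blocked: accepted.
        System.out.println(passes("http://example.com/a.html", false, allow, block)); // true
    }
}
```

Note that Pattern.matches(), like Heritrix's TextUtils.matches(), requires the regex to match the entire URI string, not just a substring.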

A configuration example from the corresponding crawler-beans.cxml file:

 <!-- FETCH CHAIN -->
 <!-- first, processors are declared as top-level named beans -->
 <bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
      <!-- <property name="recheckScope" value="false" /> -->
      <!-- <property name="blockAll" value="false" /> -->
      <!-- <property name="blockByRegex" value="" /> -->
      <!-- <property name="allowByRegex" value="" /> -->
 </bean>
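To actually activate the filters, uncomment the properties and supply values; for example (the regex values here are illustrative, not defaults):

```xml
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
     <property name="recheckScope" value="true" />
     <property name="blockAll" value="false" />
     <property name="blockByRegex" value=".*\.exe$" />
     <property name="allowByRegex" value="" />
</bean>
```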

---------------------------------------------------------------------------

This Heritrix 3.1.0 source code analysis series is the author's original work.

When reposting, please credit the source: 博客园 刺猬的温驯

Original link: http://www.cnblogs.com/chenying99/archive/2013/04/23/3037360.html
