Heritrix 3.1.0 Source Code Analysis (Part 30)

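As for the life cycle of a CrawlURI uri object in the processor chains, I think it logically begins at the FrontierPreparer processor and then continues through the subsequent processors. (Strictly speaking, a CrawlURI's life cycle already starts to take shape while the extractor processors are handling its parent CrawlURI, so the life cycles of a parent CrawlURI and its children are interleaved. I described the processor flow in earlier posts.)

Only after the FrontierPreparer processor has handled a CrawlURI uri object does it move on to the schedule method of the BdbFrontier object, which adds it to a BdbWorkQueue work queue.

This processor mainly initializes the settings of the CrawlURI uri object: it sets the scheduling directive, canonicalizes the URL, generates the class key, sets holderCost, and applies the precedence policy, all in preparation for scheduling by the BdbFrontier object.

When I covered the processors attached to the CandidateChain candidateChain in Heritrix 3.1.0 Source Code Analysis (Part 20), I mentioned the FrontierPreparer processor, but that post did not analyze what it does, so let's revisit it now.

The first step is to set the scheduling directive of the CrawlURI curi object. It is derived from the object's pathFromSeed property (the sequence of hops from the seed to the current CrawlURI curi, where each link type has its own character code).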

    /**
     * Calculate the coarse, original 'schedulingDirective' prioritization
     * for the given CrawlURI
     *
     * @param curi
     * @return
     */
    protected int getSchedulingDirective(CrawlURI curi) {
        if(StringUtils.isNotEmpty(curi.getPathFromSeed())) {
            char lastHop = curi.getPathFromSeed().charAt(curi.getPathFromSeed().length()-1);
            if(lastHop == 'R') {
                // refer
                return getPreferenceDepthHops() >= 0 ? HIGH : MEDIUM;
            }
        }
        if (getPreferenceDepthHops() == 0) {
            return HIGH;
            // this implies seed redirects are treated as path
            // length 1, which I believe is standard.
            // curi.getPathFromSeed() can never be null here, because
            // we're processing a link extracted from curi
        } else if (getPreferenceDepthHops() > 0 &&
            curi.getPathFromSeed().length() + 1 <= getPreferenceDepthHops()) {
            return HIGH;
        } else {
            // optionally preferencing embeds up to MEDIUM
            int prefHops = getPreferenceEmbedHops();
            if (prefHops > 0) {
                int embedHops = curi.getTransHops();
                if (embedHops > 0 && embedHops <= prefHops
                        && curi.getSchedulingDirective() == SchedulingConstants.NORMAL) {
                    // number of embed hops falls within the preferenced range, and
                    // uri is not already MEDIUM -- so promote it
                    return MEDIUM;
                }
            }
            // Everything else stays as previously assigned
            // (probably NORMAL, at least for now)
            return curi.getSchedulingDirective();
        }
    }
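
To make the branches above concrete, here is a minimal standalone sketch that mirrors the same decision logic on plain values instead of a real CrawlURI. The class name SchedulingDirectiveSketch, the helper transHops, and the numeric constants are assumptions made for illustration (the sketch assumes a lower number means a more urgent directive and that 'L' marks a navigational link hop). It is not Heritrix code.

```java
public class SchedulingDirectiveSketch {
    // Assumed numeric values for illustration; lower means more urgent.
    static final int HIGH = 1, MEDIUM = 2, NORMAL = 3;

    // Counts trailing hops that are not navigational links ('L'),
    // an approximation of what getTransHops() is used for above.
    static int transHops(String path) {
        int n = 0;
        for (int i = path.length() - 1; i >= 0 && path.charAt(i) != 'L'; i--) {
            n++;
        }
        return n;
    }

    // Same branch order as getSchedulingDirective, with the CrawlURI
    // replaced by its hop path and its current directive.
    static int directive(String path, int current, int prefDepthHops, int prefEmbedHops) {
        if (!path.isEmpty() && path.charAt(path.length() - 1) == 'R') {
            return prefDepthHops >= 0 ? HIGH : MEDIUM;
        }
        if (prefDepthHops == 0) {
            return HIGH;
        } else if (prefDepthHops > 0 && path.length() + 1 <= prefDepthHops) {
            return HIGH;
        } else {
            if (prefEmbedHops > 0) {
                int embedHops = transHops(path);
                if (embedHops > 0 && embedHops <= prefEmbedHops && current == NORMAL) {
                    return MEDIUM;
                }
            }
            return current;
        }
    }

    public static void main(String[] args) {
        System.out.println(directive("LLR", NORMAL, -1, 1)); // 2: redirect, depth preference off
        System.out.println(directive("LLE", NORMAL, -1, 1)); // 2: one embed hop, promoted
        System.out.println(directive("LL", NORMAL, -1, 1));  // 3: plain link, unchanged
        System.out.println(directive("L", NORMAL, 3, 1));    // 1: within the preferred depth
    }
}
```

The last call shows the depth preference at work: with a preferred depth of 3, a URI one hop from the seed is scheduled as HIGH regardless of its link type.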

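UriCanonicalizationPolicy, which I'll call the URL canonicalization policy class, is an abstract class. It declares the abstract method for canonicalizing a URL, which concrete subclasses implement.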

/**
 * URI Canonicalization Policy
 *
 * @contributor stack
 * @contributor gojomo
 */
public abstract class UriCanonicalizationPolicy {
    public abstract String canonicalize(String uri);
}

RulesCanonicalizationPolicy extends the abstract class UriCanonicalizationPolicy and implements the canonicalization method.

/**
 * URI Canonicalization Policy
 *
 * @contributor stack
 * @contributor gojomo
 */
public class RulesCanonicalizationPolicy
    extends UriCanonicalizationPolicy
    implements HasKeyedProperties {
    private static Logger logger =
        Logger.getLogger(RulesCanonicalizationPolicy.class.getName());

    protected KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }

    {
        setRules(getDefaultRules());
    }
    @SuppressWarnings("unchecked")
    public List<CanonicalizationRule> getRules() {
        return (List<CanonicalizationRule>) kp.get("rules");
    }
    public void setRules(List<CanonicalizationRule> rules) {
        kp.put("rules", rules);
    }

    /**
     * Run the passed uuri through the list of rules.
     * @param context Url to canonicalize.
     * @param rules Iterator of canonicalization rules to apply (Get one
     * of these on the url-canonicalizer-rules element in order files or
     * create a list externally).  Rules must implement the Rule interface.
     * @return Canonicalized URL.
     */
    public String canonicalize(String before) {
        String canonical = before;
        if (logger.isLoggable(Level.FINER)) {
            logger.finer("Canonicalizing: "+before);
        }
        for (CanonicalizationRule rule : getRules()) {
            if(rule.getEnabled()) {
                canonical = rule.canonicalize(canonical);
            }
            if (logger.isLoggable(Level.FINER)) {
                logger.finer(
                    "Rule " + rule.getClass().getName() + " "
                    + (rule.getEnabled()
                            ? canonical :" (disabled)"));
            }
        }
        return canonical;
    }

    /**
     * A reasonable set of default rules to use, if no others are
     * provided by operator configuration.
     */
    public static List<CanonicalizationRule> getDefaultRules() {
        List<CanonicalizationRule> rules = new ArrayList<CanonicalizationRule>(6);
        rules.add(new LowercaseRule());
        rules.add(new StripUserinfoRule());
        rules.add(new StripWWWNRule());
        rules.add(new StripSessionIDs());
        rules.add(new StripSessionCFIDs());
        rules.add(new FixupQueryString());
        return rules;
    }
}

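The canonicalization method iterates over the list of CanonicalizationRule objects and calls each member's String canonicalize(String url) method in turn.

CanonicalizationRule is an interface that declares the String canonicalize(String url) method. Its implementations include the classes added in the static method List<CanonicalizationRule> getDefaultRules() above. This approach somewhat resembles a combination of the Composite and Iterator patterns, except that the branch node and the leaf nodes do not implement a common interface type.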
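
The chaining idea can be tried out in isolation. The following is a minimal sketch, not the Heritrix implementation: RuleChainSketch, LOWERCASE, and STRIP_WWW are hypothetical names, and the two rules are simplified stand-ins that only imitate what LowercaseRule and StripWWWNRule are named for.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class RuleChainSketch {
    // Hypothetical, simplified stand-ins for two canonicalization rules.
    static final UnaryOperator<String> LOWERCASE =
        s -> s.toLowerCase(Locale.ROOT);
    static final UnaryOperator<String> STRIP_WWW =
        s -> s.replaceFirst("^(https?://)www\\d*\\.", "$1");

    // Feed the output of each rule into the next one, in list order.
    static String canonicalize(String before, List<UnaryOperator<String>> rules) {
        String canonical = before;
        for (UnaryOperator<String> rule : rules) {
            canonical = rule.apply(canonical);
        }
        return canonical;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> rules = Arrays.asList(LOWERCASE, STRIP_WWW);
        // prints http://example.com/index.html
        System.out.println(canonicalize("HTTP://WWW2.Example.COM/Index.html", rules));
    }
}
```

Because each rule sees the output of the previous one, order matters: in this sketch the www-stripping pattern only matches lower-case input, so it has to run after the lower-casing rule.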

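QueueAssignmentPolicy is the policy that generates the class key for a URL object. It is also an abstract class and declares the method that generates the class key (the identifier of a work queue is derived from this class key).

The default class key policy is the SurtAuthorityQueueAssignmentPolicy implementation, which builds the key string from the URL's domain name. All URLs under the same domain therefore share a single class key, which means they share a single work queue.

We can extend the class key generation policy. A classic approach is to use the ELFHash algorithm to assign a key to each CrawlURI curi object. As an example, I create a MyQueueAssignmentPolicy class that extends the abstract class QueueAssignmentPolicy. The relevant source is as follows: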
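
To illustrate why every URL of one site ends up in the same queue, here is a rough sketch of a host-based key. HostKeySketch and classKey are hypothetical names, and the reversed, comma-separated host form only approximates a SURT-style authority; the real policy handles more cases (for example ports and the options shown in the configuration).

```java
import java.net.URI;
import java.util.Locale;

public class HostKeySketch {
    // Builds a key from the host only, with labels reversed:
    // www.example.com -> com,example,www,
    static String classKey(String url) {
        String host = URI.create(url).getHost().toLowerCase(Locale.ROOT);
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]).append(',');
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // Both URLs share a host, so they map to the same key (and queue).
        System.out.println(classKey("http://www.example.com/a/b.html"));
        System.out.println(classKey("http://www.example.com/c?x=1"));
        // A different host yields a different key.
        System.out.println(classKey("http://news.example.com/"));
    }
}
```

Since the path and query never enter the key, the number of queues in this scheme grows with the number of hosts, not with the number of URLs.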

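// package names as in Heritrix 3.x
import org.archive.crawler.frontier.QueueAssignmentPolicy;
import org.archive.modules.CrawlURI;

public class MyQueueAssignmentPolicy extends QueueAssignmentPolicy {

    private static final long serialVersionUID = 1L;

    @Override
    public String getClassKey(CrawlURI cauri) {
        String uri = cauri.getURI().toString();
        // derive the key from the ELF hash of the URI
        long hash = ELFHash(uri);
        // modulo 50 yields 50 distinct keys, matching 50 threads
        return Long.toString(hash % 50);
    }

    public long ELFHash(String str) {
        long hash = 0;
        long x = 0;
        for (int i = 0; i < str.length(); i++) {
            // shift the hash left by four bits and add the next character
            hash = (hash << 4) + str.charAt(i);
            // if any of bits 28-31 are set, fold them back into the
            // low bits with XOR, then clear them
            if ((x = hash & 0xF0000000L) != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }
}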
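
The hashing part does not depend on Heritrix, so it can be checked on its own. Below is a minimal standalone sketch of the same ELF hash and modulo step; ElfHashSketch, elfHash, and queueKey are names chosen for this example, and the sample URLs are made up.

```java
public class ElfHashSketch {
    // Same ELF hash as in the policy above, on a plain String.
    static long elfHash(String s) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash << 4) + s.charAt(i);
            long high = hash & 0xF0000000L;
            if (high != 0) {
                hash ^= (high >> 24); // fold the top nibble into the low bits
                hash &= ~high;        // then clear it
            }
        }
        return hash & 0x7FFFFFFFL;
    }

    // Maps a URI to one of `queues` keys, "0" .. String.valueOf(queues - 1).
    static String queueKey(String uri, int queues) {
        return Long.toString(elfHash(uri) % queues);
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://www.example.com/",
            "http://www.example.com/a.html",
            "http://www.example.com/b.html"
        };
        for (String uri : uris) {
            System.out.println(queueKey(uri, 50) + "  " + uri);
        }
    }
}
```

One design consequence worth keeping in mind: because the key comes from the whole URI, URLs of the same host are spread over up to 50 queues. That gives more parallelism, but as far as I understand politeness is tracked per queue, so a single host may then be fetched from several queues at once.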

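Then, in the crawler-beans.cxml configuration file, we simply set the queueAssignmentPolicy property of the FrontierPreparer processor bean to a bean of our MyQueueAssignmentPolicy class.

UriPrecedencePolicy is the precedence policy for CrawlURI curi objects. It is also an abstract class and declares the abstract method that sets a CrawlURI curi object's precedence.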

abstract public class UriPrecedencePolicy implements Serializable {

    /**
     * Add a precedence value to the supplied CrawlURI, which is being
     * scheduled onto a frontier queue for the first time.
     * @param curi CrawlURI to assign a precedence value
     */
    abstract public void uriScheduled(CrawlURI curi);

}

The default is CostUriPrecedencePolicy, which sets a CrawlURI curi object's precedence from its holder cost.

/**
 * UriPrecedencePolicy which sets a URI's precedence to its 'cost' -- which
 * simulates the in-queue sorting order in Heritrix 1.x, where cost
 * contributed the same bits to the queue-insert-key that precedence now does.
 */
public class CostUriPrecedencePolicy extends UriPrecedencePolicy {
    private static final long serialVersionUID = -8164425278358540710L;

    /* (non-Javadoc)
     * @see org.archive.crawler.frontier.precedence.UriPrecedencePolicy#uriScheduled(org.archive.crawler.datamodel.CrawlURI)
     */
    @Override
    public void uriScheduled(CrawlURI curi) {
        curi.setPrecedence(curi.getHolderCost());
    }
}
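
The effect of copying the cost into the precedence can be sketched without Heritrix. In the example below, PrecedenceSketch and its nested Uri class are hypothetical stand-ins for the policy and for CrawlURI, the costs are invented, and the sketch assumes that a lower precedence value is served first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PrecedenceSketch {
    // Hypothetical stand-in for CrawlURI with only the fields used here.
    static class Uri {
        final String url;
        final int holderCost;
        int precedence;

        Uri(String url, int holderCost) {
            this.url = url;
            this.holderCost = holderCost;
        }
    }

    // Mirrors CostUriPrecedencePolicy: precedence is simply the cost.
    static void uriScheduled(Uri uri) {
        uri.precedence = uri.holderCost;
    }

    // Returns the URLs in the order a precedence-sorted queue would yield
    // them, assuming lower precedence values come out first.
    static List<String> order(List<Uri> uris) {
        List<Uri> sorted = new ArrayList<>(uris);
        for (Uri uri : sorted) {
            uriScheduled(uri);
        }
        sorted.sort(Comparator.comparingInt((Uri u) -> u.precedence));
        List<String> urls = new ArrayList<>();
        for (Uri uri : sorted) {
            urls.add(uri.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Uri> uris = Arrays.asList(
            new Uri("http://example.com/expensive", 3),
            new Uri("http://example.com/cheap", 1),
            new Uri("http://example.com/medium", 2));
        System.out.println(order(uris)); // cheap, medium, expensive
    }
}
```

With a cost policy that gives every URI the same cost, all precedences are equal and the sort is a no-op, so URIs keep their insertion order.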

The policies used by the FrontierPreparer processor bean are configured in the crawler-beans.cxml configuration file as follows:

 <!--
   OPTIONAL BEANS
    Uncomment and expand as needed, or if non-default alternate
    implementations are preferred.
  -->

 <!-- CANONICALIZATION POLICY -->
 <bean id="canonicalizationPolicy"
   class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">
   <property name="rules">
    <list>
     <bean class="org.archive.modules.canonicalize.LowercaseRule" />
     <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />
     <bean class="org.archive.modules.canonicalize.StripWWWNRule" />
     <bean class="org.archive.modules.canonicalize.StripSessionIDs" />
     <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />
     <bean class="org.archive.modules.canonicalize.FixupQueryString" />
    </list>
  </property>
 </bean>

 <!-- QUEUE ASSIGNMENT POLICY -->
 <bean id="queueAssignmentPolicy"
   class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">
  <property name="forceQueueAssignment" value="" />
  <property name="deferToPrevious" value="true" />
  <property name="parallelQueues" value="1" />
 </bean>

 <!-- URI PRECEDENCE POLICY -->
 <bean id="uriPrecedencePolicy"
   class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy">
 </bean>

 <!-- COST ASSIGNMENT POLICY -->
 <bean id="costAssignmentPolicy"
   class="org.archive.crawler.frontier.UnitCostAssignmentPolicy">
 </bean>

---------------------------------------------------------------------------

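This Heritrix 3.1.0 source code analysis series is my own original work.

If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/29/3050992.html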
