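Reading the Heritrix 3.3.0 Source: URI Filtering Rules

In "Reading the Heritrix 3.3.0 Source: Configuring URI Filtering Rules in crawler-beans.cxml", we saw the classes that Heritrix 3.3.0 configures to decide whether a URI is accepted. The goal of this article is to read the source and understand:

(1) how a single URI rule class works;

(2) how a series of URI rule classes work together.

Let's start with the first question.

(I)

Every URI rule class must extend the abstract class DecideRule: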

package org.archive.modules.deciderules;


import java.io.Serializable;

import org.archive.modules.CrawlURI;
import org.archive.spring.HasKeyedProperties;
import org.archive.spring.KeyedProperties;

public abstract class DecideRule implements Serializable, HasKeyedProperties {
    // A thread-safe map that holds this rule's key-value settings
    protected KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }
    
    {
        setEnabled(true);
    }
    public boolean getEnabled() {
        return (Boolean) kp.get("enabled");
    }
    public void setEnabled(boolean enabled) {
        kp.put("enabled",enabled);
    }

    protected String comment = "";
    public String getComment() {
        return comment;
    }
    public void setComment(String comment) {
        this.comment = comment;
    }
    
    public DecideRule() {

    }
    
    /**
     * Makes a decision for a URI.
     * @param uri the URI to decide on
     * @return the decision, or DecideResult.NONE if this rule is disabled
     */
    public DecideResult decisionFor(CrawlURI uri) {
        // A disabled rule always returns DecideResult.NONE
        if (!getEnabled()) {
            return DecideResult.NONE;
        }
        // innerDecide is the method that actually makes the decision
        DecideResult result = innerDecide(uri);
        
        // This check looks redundant, since both paths return result;
        // if anyone knows its purpose, please let me know
        if (result == DecideResult.NONE) {
            return result;
        }

        return result;
    }
    
    /**
     * The method that actually makes the decision.
     * @param uri the URI to decide on
     * @return the decision for the URI
     */
    protected abstract DecideResult innerDecide(CrawlURI uri);
    
    /**
     * Useful when this rule can only ever produce one decision.
     * @param uri the URI to decide on
     * @return that single decision, or null (the default here)
     */
    public DecideResult onlyDecision(CrawlURI uri) {
        return null;
    }

    /**
     * Tells whether a URI is accepted.
     * @param uri the URI to decide on
     * @return true if the decision for the URI is ACCEPT
     */
    public boolean accepts(CrawlURI uri) {
        // Compare the result of decisionFor with DecideResult.ACCEPT
        // to determine whether the URI is accepted
        return DecideResult.ACCEPT == decisionFor(uri);
    }

}
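The value of enabled determines whether a rule processes URIs: true means it does, false means it does not. decisionFor is the method that yields the rule's result for a given URI. When enabled is false, it returns NONE immediately (the meaning of NONE is given just below); when enabled is true, it calls innerDecide, which subclasses implement, to process the URI. The onlyDecision method also deserves a mention: it is useful when a rule can only ever return one kind of result.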
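The effect of the enabled flag can be tried out in isolation. The following is only a minimal sketch, not Heritrix code: Result, SketchRule and AcceptAllSketchRule are hypothetical stand-ins invented for illustration, a plain boolean field replaces KeyedProperties, and URIs are plain strings rather than CrawlURI objects.

```java
/** Local stand-in for Heritrix's DecideResult (assumption: same three values). */
enum Result { ACCEPT, NONE, REJECT }

/** Simplified stand-in for DecideRule; a boolean field replaces KeyedProperties. */
abstract class SketchRule {
    private boolean enabled = true;

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    // Template method: a disabled rule abstains with NONE,
    // an enabled rule delegates to the subclass.
    final Result decisionFor(String uri) {
        return enabled ? innerDecide(uri) : Result.NONE;
    }

    boolean accepts(String uri) {
        return decisionFor(uri) == Result.ACCEPT;
    }

    protected abstract Result innerDecide(String uri);
}

/** Hypothetical rule that accepts everything; it exists only to exercise the flag. */
class AcceptAllSketchRule extends SketchRule {
    @Override
    protected Result innerDecide(String uri) {
        return Result.ACCEPT;
    }
}

class EnabledFlagDemo {
    public static void main(String[] args) {
        SketchRule rule = new AcceptAllSketchRule();
        System.out.println(rule.decisionFor("http://example.com/")); // prints ACCEPT
        rule.setEnabled(false);
        System.out.println(rule.decisionFor("http://example.com/")); // prints NONE
    }
}
```

Note that a disabled rule does not reject anything; it simply abstains, which is why accepts also returns false for it.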


Next, let's look at DecideResult, which keeps appearing in DecideRule:

package org.archive.modules.deciderules;


/**
 * The decision of a DecideRule.
 * 
 * @author pjack
 */
public enum DecideResult {

    /** Indicates the URI was accepted. */
    ACCEPT, 
    
    /** Indicates the URI was neither accepted nor rejected. */
    NONE, 
    
    /** Indicates the URI was rejected. */
    REJECT;

    
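    /**
     * Inverts a decision: ACCEPT becomes REJECT and REJECT becomes ACCEPT;
     * NONE is returned unchanged.
     * @param result the decision to invert
     * @return the inverted decision
     */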
    public static DecideResult invert(DecideResult result) {
        switch (result) {
            case ACCEPT:
                return REJECT;
            case REJECT:
                return ACCEPT;
            default:
                return result;
        }
    }
}
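Its purpose is clear at a glance, so there is not much more to say.

Next, let's pick two concrete subclasses of DecideRule to discuss. First, RejectDecideRule, which is the first concrete rule in the configuration: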

package org.archive.modules.deciderules;

import org.archive.modules.CrawlURI;

/**
 * This class returns DecideResult.REJECT for every URI.
 *
 */
public class RejectDecideRule extends DecideRule {

    private static final long serialVersionUID = 3L;


    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        return DecideResult.REJECT;
    }
    
    
    @Override
    public DecideResult onlyDecision(CrawlURI uri) {
        return DecideResult.REJECT;
    }
}

This rule overrides DecideRule's innerDecide and onlyDecision methods. Its short code makes it obvious that it returns REJECT for every URI.

Next, let's look at TooManyHopsDecideRule:

/**
 * Rule REJECTs any CrawlURIs whose total number of hops (length of the 
 * hopsPath string, traversed links of any type) is over a threshold.
 * Otherwise returns PASS.
 *
 * In other words, CrawlURIs whose hop count (depth) exceeds the threshold are
 * rejected; all other CrawlURIs are neither accepted nor rejected.
 *
 * @author gojomo
 */
public class TooManyHopsDecideRule extends PredicatedDecideRule {

    private static final long serialVersionUID = 3L;

    /** The default decision for this class is REJECT. */
    {
        setDecision(DecideResult.REJECT);
    }
    
    /**
     * Max path depth for which this filter will match.
     */
    // Default maximum hop count
    {
        setMaxHops(20);
    }
    public int getMaxHops() {
        return (Integer) kp.get("maxHops");
    }
    public void setMaxHops(int maxHops) {
        kp.put("maxHops", maxHops);
    }
    
    /**
     * Usual constructor. 
     */
    public TooManyHopsDecideRule() {
    }

    /**
     * Evaluate whether given object is over the threshold number of
     * hops.
     * 
     * @param uri the CrawlURI to test
     * @return true if the max-hops is exceeded
     */
    @Override
    protected boolean evaluate(CrawlURI uri) {
        return uri.getHopCount() > getMaxHops();
    }

}
To explain this class, we also need to look at the code of its direct parent class:

/**
 * Rule which applies the configured decision only if a 
 * test evaluates to true. Subclasses override evaluate()
 * to establish the test. 
 *
 * @author gojomo
 */
public abstract class PredicatedDecideRule extends DecideRule {

    {
        setDecision(DecideResult.ACCEPT);
    }
    public DecideResult getDecision() {
        return (DecideResult) kp.get("decision");
    }
    public void setDecision(DecideResult decision) {
        kp.put("decision",decision);
    }
    
    public PredicatedDecideRule() {
    }

    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        if (evaluate(uri)) {
            return getDecision();
        }
        return DecideResult.NONE;
    }

    protected abstract boolean evaluate(CrawlURI object);
}
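PredicatedDecideRule overrides DecideRule's innerDecide, and innerDecide in turn delegates the test to evaluate, which TooManyHopsDecideRule overrides. When evaluate returns true, innerDecide returns the configured decision; otherwise it returns NONE. TooManyHopsDecideRule therefore returns REJECT when a URI's hop count exceeds the configured maximum (20 by default), and NONE for every other URI. It never returns ACCEPT unless its decision is reconfigured.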
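The predicate pattern can be sketched without any Heritrix dependency. This is an illustrative simplification, not the library's code: Result, PredicatedSketchRule and TooManyHopsSketchRule are hypothetical names, and a URI is represented only by its hops path string, whose length is taken as the hop count (as the Javadoc above describes).

```java
/** Local stand-in for Heritrix's DecideResult (assumption: same three values). */
enum Result { ACCEPT, NONE, REJECT }

/** Stand-in for PredicatedDecideRule: apply the configured decision only if the test matches. */
abstract class PredicatedSketchRule {
    private Result decision = Result.ACCEPT; // same default as the original class

    void setDecision(Result decision) {
        this.decision = decision;
    }

    Result innerDecide(String hopsPath) {
        // A matching test yields the configured decision; otherwise the rule abstains.
        return evaluate(hopsPath) ? decision : Result.NONE;
    }

    protected abstract boolean evaluate(String hopsPath);
}

/** Stand-in for TooManyHopsDecideRule, with the limit passed to the constructor. */
class TooManyHopsSketchRule extends PredicatedSketchRule {
    private final int maxHops;

    TooManyHopsSketchRule(int maxHops) {
        this.maxHops = maxHops;
        setDecision(Result.REJECT); // this rule exists to reject deep URIs
    }

    @Override
    protected boolean evaluate(String hopsPath) {
        // One character per traversed link, so the length is the hop count.
        return hopsPath.length() > maxHops;
    }
}

class TooManyHopsDemo {
    public static void main(String[] args) {
        TooManyHopsSketchRule rule = new TooManyHopsSketchRule(2);
        System.out.println(rule.innerDecide("LL"));  // prints NONE (2 hops, not over the limit)
        System.out.println(rule.innerDecide("LLL")); // prints REJECT (3 hops, over the limit)
    }
}
```

A URI exactly at the limit is not rejected, because the comparison is strictly greater than.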

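As the code of these classes shows, innerDecide is the method a rule uses to produce its result. To write our own URI rule, we only need to extend DecideRule and override innerDecide (or extend PredicatedDecideRule and override evaluate).

Next, let's see how the rules in a sequence work together.

(II)

Let's look at the DecideRuleSequence class: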

package org.archive.modules.deciderules;

import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.archive.modules.CrawlURI;
import org.archive.modules.SimpleFileLoggerProvider;
import org.archive.modules.net.CrawlHost;
import org.archive.modules.net.ServerCache;
import org.json.JSONObject;
import org.springframework.beans.factory.BeanNameAware;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.Lifecycle;

public class DecideRuleSequence extends DecideRule implements BeanNameAware, Lifecycle {
    final private static Logger LOGGER = 
            Logger.getLogger(DecideRuleSequence.class.getName());
    private static final long serialVersionUID = 3L;

    protected transient Logger fileLogger = null;

    /**
     * If enabled, log decisions to file named logs/{spring-bean-id}.log. Format
     * is: [timestamp] [decisive-rule-num] [decisive-rule-class] [decision]
     * [uri] [extraInfo]
     * 
     * Relies on Spring Lifecycle to initialize the log. Only top-level
     * beans get the Lifecycle treatment from Spring, so bean must be top-level
     * for logToFile to work. (This is true of other modules that support
     * logToFile, and anything else that uses Lifecycle, as well.)
     */
    {
        setLogToFile(false);
    }
    public boolean getLogToFile() {
        return (Boolean) kp.get("logToFile");
    }
    public void setLogToFile(boolean enabled) {
        kp.put("logToFile",enabled);
    }

    /**
     * Whether to include the "extra info" field for each entry in crawl.log.
     * "Extra info" is a json object with entries "host", "via", "source" and
     * "hopPath".
     */
    protected boolean logExtraInfo = false;
    public boolean getLogExtraInfo() {
        return logExtraInfo;
    }
    public void setLogExtraInfo(boolean logExtraInfo) {
        this.logExtraInfo = logExtraInfo;
    }

    // provided by CrawlerLoggerModule which is in heritrix-engine, inaccessible
    // from here, thus the need for the SimpleFileLoggerProvider interface
    protected SimpleFileLoggerProvider loggerModule;
    public SimpleFileLoggerProvider getLoggerModule() {
        return this.loggerModule;
    }
    @Autowired
    public void setLoggerModule(SimpleFileLoggerProvider loggerModule) {
        this.loggerModule = loggerModule;
    }

    @SuppressWarnings("unchecked")
    public List<DecideRule> getRules() {
        return (List<DecideRule>) kp.get("rules");
    }
    /**
     * The list of rules is injected here.
     * @param rules the rules, in evaluation order
     */
    public void setRules(List<DecideRule> rules) {
        kp.put("rules", rules);
    }

    protected ServerCache serverCache;
    public ServerCache getServerCache() {
        return this.serverCache;
    }
    @Autowired
    public void setServerCache(ServerCache serverCache) {
        this.serverCache = serverCache;
    }

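    /**
     * The method that actually makes the decision.
     * As this method shows, a non-NONE decision from a rule later in the
     * chain overrides the decision reached by earlier rules.
     */
    public DecideResult innerDecide(CrawlURI uri) {
        // The rule that actually produced the final decision
        DecideRule decisiveRule = null;
        int decisiveRuleNumber = -1;
        // Default: neither reject nor accept
        DecideResult result = DecideResult.NONE;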
        List<DecideRule> rules = getRules();
        int max = rules.size();

        for (int i = 0; i < max; i++) {
            DecideRule rule = rules.get(i);
            if (rule.onlyDecision(uri) != result) {
                DecideResult r = rule.decisionFor(uri);
                if (LOGGER.isLoggable(Level.FINEST)) {
                    LOGGER.finest("DecideRule #" + i + " " + 
                            rule.getClass().getName() + " returned " + r + " for url: " + uri);
                }
                if (r != DecideResult.NONE) {
                    result = r;
                    decisiveRule = rule;
                    decisiveRuleNumber = i;
                }
            }
        }

        decisionMade(uri, decisiveRule, decisiveRuleNumber, result);

        return result;
    }

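    /**
     * Called after a decision has been reached for a CrawlURI.
     * @param uri the URI that was decided
     * @param decisiveRule the rule whose decision won
     * @param decisiveRuleNumber index of that rule in the sequence
     * @param result the final decision
     */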
    protected void decisionMade(CrawlURI uri, DecideRule decisiveRule,
            int decisiveRuleNumber, DecideResult result) {
        if (fileLogger != null) {
            JSONObject extraInfo = null;
            if (logExtraInfo) {
                CrawlHost crawlHost = getServerCache().getHostFor(uri.getUURI());
                String host = "-";
                if (crawlHost != null) {
                    host  = crawlHost.fixUpName();
                }

                extraInfo = new JSONObject();
                extraInfo.put("hopPath", uri.getPathFromSeed());
                extraInfo.put("via", uri.getVia());
                extraInfo.put("seed", uri.getSourceTag());
                extraInfo.put("host", host);
            }

            fileLogger.info(decisiveRuleNumber 
                    + " " + decisiveRule.getClass().getSimpleName() 
                    + " " + result 
                    + " " + uri
                    + (extraInfo != null ? " " + extraInfo : ""));
        }
    }

    protected String beanName;
    public String getBeanName() {
        return this.beanName;
    }
    @Override
    public void setBeanName(String name) {
        this.beanName = name;
    }

    protected boolean isRunning = false;
    @Override
    public boolean isRunning() {
        return isRunning;
    }
    @Override
    public void start() {
        // Set up the file logger on first start, if logToFile is enabled
        if (getLogToFile() && fileLogger == null) {
            fileLogger = loggerModule.setupSimpleLog(getBeanName());
        }
        isRunning = true;
    }
    @Override
    public void stop() {
        isRunning = false;
    }
}
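This class is also a subclass of DecideRule. It overrides innerDecide, and the implementation shows that whenever a later rule returns a result other than NONE, the new result replaces the old one. (A rule whose onlyDecision equals the current result is skipped, since it could not change the outcome.) Now the following comment in the configuration file finally makes sense: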

<!-- SCOPE: rules for which discovered URIs to crawl; order is very  
      important because last decision returned other than 'NONE' wins. -->


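So when we customize our own URI filtering rules, we need to control not only the behavior of innerDecide but also the order of the rules.

(Since each class is short and simple, I have pasted all of the code here.)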
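Why order matters can be seen in a small stand-alone sketch of the "last non-NONE decision wins" loop. This is a simplification under stated assumptions, not the Heritrix implementation: rules are modelled as plain functions from a URI string to a hypothetical Result enum, the onlyDecision shortcut and the logging are left out, and REJECT_ALL and ACCEPT_EXAMPLE are made-up rules for illustration.

```java
import java.util.List;
import java.util.function.Function;

/** Local stand-in for Heritrix's DecideResult (assumption: same three values). */
enum Result { ACCEPT, NONE, REJECT }

class SequenceSketch {
    /** Hypothetical rule: rejects every URI. */
    static final Function<String, Result> REJECT_ALL = uri -> Result.REJECT;

    /** Hypothetical rule: accepts URIs mentioning example.com, abstains otherwise. */
    static final Function<String, Result> ACCEPT_EXAMPLE =
            uri -> uri.contains("example.com") ? Result.ACCEPT : Result.NONE;

    /** Runs every rule in order; the last decision other than NONE wins. */
    static Result decide(List<Function<String, Result>> rules, String uri) {
        Result result = Result.NONE;
        for (Function<String, Result> rule : rules) {
            Result r = rule.apply(uri);
            if (r != Result.NONE) {
                result = r; // a later non-NONE decision replaces the earlier one
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Function<String, Result>> rejectFirst = List.of(REJECT_ALL, ACCEPT_EXAMPLE);
        List<Function<String, Result>> acceptFirst = List.of(ACCEPT_EXAMPLE, REJECT_ALL);

        System.out.println(decide(rejectFirst, "http://example.com/a")); // prints ACCEPT
        System.out.println(decide(rejectFirst, "http://other.org/"));    // prints REJECT
        System.out.println(decide(acceptFirst, "http://example.com/a")); // prints REJECT
    }
}
```

The same two rules give opposite answers for the same URI depending on their order, which is why a sequence typically starts with a blanket reject and lets later rules carve out exceptions.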
