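In "Heritrix 3.3.0 Source Reading: Configuring URI Filtering Rules in crawler-beans.cxml", we saw the classes that Heritrix 3.3.0 configures to decide whether a URI is accepted. The goal of this post is to read the source code and understand:
(1) how a single URI decision rule works;
(2) how a series of URI decision rules work together.
Let's start with the first question.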
(I)
Every URI decision rule must extend the abstract class DecideRule:
    package org.archive.modules.deciderules;

    import java.io.Serializable;

    import org.archive.modules.CrawlURI;
    import org.archive.spring.HasKeyedProperties;
    import org.archive.spring.KeyedProperties;

    public abstract class DecideRule implements Serializable, HasKeyedProperties {

        // A thread-safe map used to hold key-value properties
        protected KeyedProperties kp = new KeyedProperties();
        public KeyedProperties getKeyedProperties() {
            return kp;
        }

        {
            setEnabled(true);
        }
        public boolean getEnabled() {
            return (Boolean) kp.get("enabled");
        }
        public void setEnabled(boolean enabled) {
            kp.put("enabled", enabled);
        }

        protected String comment = "";
        public String getComment() {
            return comment;
        }
        public void setComment(String comment) {
            this.comment = comment;
        }

        public DecideRule() {
        }

        /**
         * Makes a decision for a URI.
         */
        public DecideResult decisionFor(CrawlURI uri) {
            // If the rule is disabled, return DecideResult.NONE
            if (!getEnabled()) {
                return DecideResult.NONE;
            }
            // innerDecide is the method that actually decides
            DecideResult result = innerDecide(uri);
            // This check looks redundant to me; if anyone knows what it is for, please let me know
            if (result == DecideResult.NONE) {
                return result;
            }
            return result;
        }

        /**
         * The method that actually makes the decision.
         */
        protected abstract DecideResult innerDecide(CrawlURI uri);

        /**
         * Useful when this rule can only ever produce one decision.
         */
        public DecideResult onlyDecision(CrawlURI uri) {
            return null;
        }

        /**
         * Tells whether a URI is accepted.
         */
        public boolean accepts(CrawlURI uri) {
            // Compare the result of decisionFor with DecideResult.ACCEPT
            return DecideResult.ACCEPT == decisionFor(uri);
        }
    }

The value of enabled determines whether a rule processes URIs: true means it does, false means it does not. The method that returns a rule's result for a given URI is decisionFor. When enabled is false it returns NONE immediately (the meaning of NONE is given just below); when enabled is true it calls innerDecide to process the URI. innerDecide is implemented in subclasses. onlyDecision also deserves a mention: it is useful when a rule can only ever return one result. The base implementation returns null, and a subclass with a single possible result overrides it.
Next, let's look at DecideResult, which keeps showing up in DecideRule:
    package org.archive.modules.deciderules;

    /**
     * The decision of a DecideRule.
     *
     * @author pjack
     */
    public enum DecideResult {

        /** Indicates the URI was accepted. */
        ACCEPT,

        /** Indicates the URI was neither accepted nor rejected. */
        NONE,

        /** Indicates the URI was rejected. */
        REJECT;

        /**
         * Inverts the result: ACCEPT becomes REJECT, REJECT becomes ACCEPT,
         * and NONE stays NONE.
         */
        public static DecideResult invert(DecideResult result) {
            switch (result) {
                case ACCEPT:
                    return REJECT;
                case REJECT:
                    return ACCEPT;
                default:
                    return result;
            }
        }
    }

Its purpose is clear at a glance, so I won't say more about it.
Next, let's pick two concrete subclasses of DecideRule. First, RejectDecideRule, which is the first concrete rule in the configuration:
    package org.archive.modules.deciderules;

    import org.archive.modules.CrawlURI;

    /**
     * This rule returns DecideResult.REJECT for every URI.
     */
    public class RejectDecideRule extends DecideRule {

        private static final long serialVersionUID = 3L;

        @Override
        protected DecideResult innerDecide(CrawlURI uri) {
            return DecideResult.REJECT;
        }

        @Override
        public DecideResult onlyDecision(CrawlURI uri) {
            return DecideResult.REJECT;
        }
    }
Then let's look at TooManyHopsDecideRule:
    /**
     * Rule REJECTs any CrawlURIs whose total number of hops (length of the
     * hopsPath string, traversed links of any type) is over a threshold.
     * Otherwise returns PASS.
     *
     * (In other words: URIs whose hop count, i.e. depth, exceeds the threshold
     * are rejected; all other URIs are neither accepted nor rejected.)
     *
     * @author gojomo
     */
    public class TooManyHopsDecideRule extends PredicatedDecideRule {

        private static final long serialVersionUID = 3L;

        /** default for this class is to REJECT */
        {
            setDecision(DecideResult.REJECT);
        }

        /**
         * Max path depth for which this filter will match.
         * The default maximum depth is set here.
         */
        {
            setMaxHops(20);
        }
        public int getMaxHops() {
            return (Integer) kp.get("maxHops");
        }
        public void setMaxHops(int maxHops) {
            kp.put("maxHops", maxHops);
        }

        /**
         * Usual constructor.
         */
        public TooManyHopsDecideRule() {
        }

        /**
         * Evaluate whether given object is over the threshold number of
         * hops.
         *
         * @param object
         * @return true if the mx-hops is exceeded
         */
        @Override
        protected boolean evaluate(CrawlURI uri) {
            return uri.getHopCount() > getMaxHops();
        }
    }

To discuss this class, we also have to look at the code of its direct parent class:
    /**
     * Rule which applies the configured decision only if a
     * test evaluates to true. Subclasses override evaluate()
     * to establish the test.
     *
     * @author gojomo
     */
    public abstract class PredicatedDecideRule extends DecideRule {

        {
            setDecision(DecideResult.ACCEPT);
        }
        public DecideResult getDecision() {
            return (DecideResult) kp.get("decision");
        }
        public void setDecision(DecideResult decision) {
            kp.put("decision", decision);
        }

        public PredicatedDecideRule() {
        }

        @Override
        protected DecideResult innerDecide(CrawlURI uri) {
            if (evaluate(uri)) {
                return getDecision();
            }
            return DecideResult.NONE;
        }

        protected abstract boolean evaluate(CrawlURI object);
    }

PredicatedDecideRule overrides DecideRule's innerDecide, and innerDecide in turn delegates the test to evaluate, which TooManyHopsDecideRule overrides. Because TooManyHopsDecideRule sets its decision to REJECT, it returns REJECT when a URI's hop count exceeds the configured maximum (20 by default) and NONE for every other URI. It never returns ACCEPT.
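To make the predicate pattern concrete, here is a minimal, self-contained sketch. It does not use the Heritrix classes: FakeUri, PredicatedRule and TooManyHops are simplified stand-ins invented for this illustration (plain fields instead of KeyedProperties, and no enabled flag), so treat it as a model of the control flow only.

```java
// Simplified stand-ins, NOT the real Heritrix classes.
class PredicatedRuleSketch {

    enum DecideResult { ACCEPT, NONE, REJECT }

    // Hypothetical stand-in for CrawlURI: it only carries a hop count.
    static final class FakeUri {
        final int hopCount;
        FakeUri(int hopCount) { this.hopCount = hopCount; }
    }

    // Mirrors the shape of PredicatedDecideRule.innerDecide:
    // the configured decision when the test matches, NONE otherwise.
    static abstract class PredicatedRule {
        DecideResult decision = DecideResult.ACCEPT;

        DecideResult innerDecide(FakeUri uri) {
            return evaluate(uri) ? decision : DecideResult.NONE;
        }

        abstract boolean evaluate(FakeUri uri);
    }

    // Mirrors TooManyHopsDecideRule: decision REJECT, default threshold 20.
    static final class TooManyHops extends PredicatedRule {
        int maxHops = 20;

        TooManyHops() { decision = DecideResult.REJECT; }

        @Override
        boolean evaluate(FakeUri uri) { return uri.hopCount > maxHops; }
    }

    public static void main(String[] args) {
        TooManyHops rule = new TooManyHops();
        System.out.println(rule.innerDecide(new FakeUri(21))); // REJECT
        System.out.println(rule.innerDecide(new FakeUri(20))); // NONE
        System.out.println(rule.innerDecide(new FakeUri(3)));  // NONE
    }
}
```

Note that a URI with exactly 20 hops is not rejected, because the comparison is strictly greater than.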
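From reading these classes we can see that the method a rule uses to produce its result is innerDecide. To write our own URI decision rule, we only need to extend DecideRule and override innerDecide.
Next, let's see how the rules in a sequence work together.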
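As an illustration of writing a custom rule, the sketch below defines a rule that rejects URIs ending in a given suffix. Everything in it is hypothetical: RejectSuffixRule is not part of Heritrix, and DecideRule and FakeUri are cut-down stand-ins so the example compiles on its own. In a real crawl the rule would extend org.archive.modules.deciderules.DecideRule and receive a CrawlURI.

```java
// Simplified stand-ins, NOT the real Heritrix classes.
class CustomRuleSketch {

    enum DecideResult { ACCEPT, NONE, REJECT }

    // Hypothetical stand-in for CrawlURI: it only carries the URL string.
    static final class FakeUri {
        final String url;
        FakeUri(String url) { this.url = url; }
    }

    // Cut-down model of DecideRule: an enabled flag plus the
    // decisionFor -> innerDecide delegation.
    static abstract class DecideRule {
        private boolean enabled = true;

        void setEnabled(boolean enabled) { this.enabled = enabled; }

        DecideResult decisionFor(FakeUri uri) {
            return enabled ? innerDecide(uri) : DecideResult.NONE;
        }

        protected abstract DecideResult innerDecide(FakeUri uri);
    }

    // Hypothetical custom rule: REJECT URIs with the given suffix,
    // stay neutral (NONE) on everything else.
    static final class RejectSuffixRule extends DecideRule {
        private final String suffix;

        RejectSuffixRule(String suffix) { this.suffix = suffix; }

        @Override
        protected DecideResult innerDecide(FakeUri uri) {
            return uri.url.endsWith(suffix) ? DecideResult.REJECT : DecideResult.NONE;
        }
    }

    public static void main(String[] args) {
        RejectSuffixRule rule = new RejectSuffixRule(".exe");
        System.out.println(rule.decisionFor(new FakeUri("http://example.org/setup.exe"))); // REJECT
        System.out.println(rule.decisionFor(new FakeUri("http://example.org/index.html"))); // NONE

        rule.setEnabled(false); // a disabled rule never gets to decide
        System.out.println(rule.decisionFor(new FakeUri("http://example.org/setup.exe"))); // NONE
    }
}
```

Returning NONE rather than ACCEPT for non-matching URIs keeps the rule neutral, so it does not override what other rules have decided.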
(II)
Let's look at the DecideRuleSequence class:
    package org.archive.modules.deciderules;

    import java.util.List;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    import org.archive.modules.CrawlURI;
    import org.archive.modules.SimpleFileLoggerProvider;
    import org.archive.modules.net.CrawlHost;
    import org.archive.modules.net.ServerCache;
    import org.json.JSONObject;
    import org.springframework.beans.factory.BeanNameAware;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.Lifecycle;

    public class DecideRuleSequence extends DecideRule implements BeanNameAware, Lifecycle {

        final private static Logger LOGGER =
            Logger.getLogger(DecideRuleSequence.class.getName());
        private static final long serialVersionUID = 3L;

        protected transient Logger fileLogger = null;

        /**
         * If enabled, log decisions to file named logs/{spring-bean-id}.log. Format
         * is: [timestamp] [decisive-rule-num] [decisive-rule-class] [decision]
         * [uri] [extraInfo]
         *
         * Relies on Spring Lifecycle to initialize the log. Only top-level
         * beans get the Lifecycle treatment from Spring, so bean must be top-level
         * for logToFile to work. (This is true of other modules that support
         * logToFile, and anything else that uses Lifecycle, as well.)
         */
        {
            setLogToFile(false);
        }
        public boolean getLogToFile() {
            return (Boolean) kp.get("logToFile");
        }
        public void setLogToFile(boolean enabled) {
            kp.put("logToFile", enabled);
        }

        /**
         * Whether to include the "extra info" field for each entry in crawl.log.
         * "Extra info" is a json object with entries "host", "via", "source" and
         * "hopPath".
         */
        protected boolean logExtraInfo = false;
        public boolean getLogExtraInfo() {
            return logExtraInfo;
        }
        public void setLogExtraInfo(boolean logExtraInfo) {
            this.logExtraInfo = logExtraInfo;
        }

        // provided by CrawlerLoggerModule which is in heritrix-engine, inaccessible
        // from here, thus the need for the SimpleFileLoggerProvider interface
        protected SimpleFileLoggerProvider loggerModule;
        public SimpleFileLoggerProvider getLoggerModule() {
            return this.loggerModule;
        }
        @Autowired
        public void setLoggerModule(SimpleFileLoggerProvider loggerModule) {
            this.loggerModule = loggerModule;
        }

        @SuppressWarnings("unchecked")
        public List<DecideRule> getRules() {
            return (List<DecideRule>) kp.get("rules");
        }
        /**
         * The list of rules is injected here.
         */
        public void setRules(List<DecideRule> rules) {
            kp.put("rules", rules);
        }

        protected ServerCache serverCache;
        public ServerCache getServerCache() {
            return this.serverCache;
        }
        @Autowired
        public void setServerCache(ServerCache serverCache) {
            this.serverCache = serverCache;
        }

        /**
         * The method that actually decides.
         * As the loop shows, a non-NONE decision from a rule later in the chain
         * overrides the decision of any earlier rule.
         */
        public DecideResult innerDecide(CrawlURI uri) {
            DecideRule decisiveRule = null;      // the rule whose decision wins
            int decisiveRuleNumber = -1;
            DecideResult result = DecideResult.NONE; // by default, neither accept nor reject
            List<DecideRule> rules = getRules();
            int max = rules.size();

            for (int i = 0; i < max; i++) {
                DecideRule rule = rules.get(i);
                if (rule.onlyDecision(uri) != result) {
                    DecideResult r = rule.decisionFor(uri);
                    if (LOGGER.isLoggable(Level.FINEST)) {
                        LOGGER.finest("DecideRule #" + i + " "
                                + rule.getClass().getName() + " returned " + r
                                + " for url: " + uri);
                    }
                    if (r != DecideResult.NONE) {
                        result = r;
                        decisiveRule = rule;
                        decisiveRuleNumber = i;
                    }
                }
            }

            decisionMade(uri, decisiveRule, decisiveRuleNumber, result);
            return result;
        }

        /**
         * Called after the decision for a CrawlURI has been made.
         */
        protected void decisionMade(CrawlURI uri, DecideRule decisiveRule,
                int decisiveRuleNumber, DecideResult result) {
            if (fileLogger != null) {
                JSONObject extraInfo = null;
                if (logExtraInfo) {
                    CrawlHost crawlHost = getServerCache().getHostFor(uri.getUURI());
                    String host = "-";
                    if (crawlHost != null) {
                        host = crawlHost.fixUpName();
                    }

                    extraInfo = new JSONObject();
                    extraInfo.put("hopPath", uri.getPathFromSeed());
                    extraInfo.put("via", uri.getVia());
                    extraInfo.put("seed", uri.getSourceTag());
                    extraInfo.put("host", host);
                }

                fileLogger.info(decisiveRuleNumber
                        + " " + decisiveRule.getClass().getSimpleName()
                        + " " + result
                        + " " + uri
                        + (extraInfo != null ? " " + extraInfo : ""));
            }
        }

        protected String beanName;
        public String getBeanName() {
            return this.beanName;
        }
        @Override
        public void setBeanName(String name) {
            this.beanName = name;
        }

        protected boolean isRunning = false;
        @Override
        public boolean isRunning() {
            return isRunning;
        }
        @Override
        public void start() {
            // Set up the file logger
            if (getLogToFile() && fileLogger == null) {
                fileLogger = loggerModule.setupSimpleLog(getBeanName());
            }
            isRunning = true;
        }
        @Override
        public void stop() {
            isRunning = false;
        }
    }

This class is also a subclass of DecideRule. It overrides innerDecide, and the implementation shows that whenever a later rule returns a result other than NONE, the new result overwrites the old one. A rule is skipped outright when its onlyDecision equals the current result, since consulting it could not change anything. At this point we finally understand this remark in the configuration file:
<!-- SCOPE: rules for which discovered URIs to crawl; order is very
important because last decision returned other than 'NONE' wins. -->
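So, to customize our own URI filtering rules, we need to control not only what innerDecide does but also the order of the rules in the sequence.
(Since each class is short and simple, I have pasted all of the code here.)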
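The effect of ordering can be seen in a small self-contained sketch. It models only the core loop of DecideRuleSequence.innerDecide; the Rule interface, the two sample rules and the URLs are assumptions made up for this illustration, and logging and the onlyDecision shortcut are left out.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model, NOT the real Heritrix classes.
class SequenceOrderSketch {

    enum DecideResult { ACCEPT, NONE, REJECT }

    // Hypothetical stand-in for a DecideRule; the URI is a plain String.
    interface Rule {
        DecideResult decide(String uri);
    }

    // Core of the sequence loop: the last non-NONE decision wins.
    static DecideResult decide(List<Rule> rules, String uri) {
        DecideResult result = DecideResult.NONE;
        for (Rule rule : rules) {
            DecideResult r = rule.decide(uri);
            if (r != DecideResult.NONE) {
                result = r;
            }
        }
        return result;
    }

    // Sample rules: one rejects everything, one accepts a single host.
    static final Rule REJECT_ALL = uri -> DecideResult.REJECT;
    static final Rule ACCEPT_EXAMPLE_ORG =
        uri -> uri.contains("example.org") ? DecideResult.ACCEPT : DecideResult.NONE;

    public static void main(String[] args) {
        List<Rule> rejectFirst = Arrays.asList(REJECT_ALL, ACCEPT_EXAMPLE_ORG);
        List<Rule> rejectLast = Arrays.asList(ACCEPT_EXAMPLE_ORG, REJECT_ALL);

        System.out.println(decide(rejectFirst, "http://example.org/a")); // ACCEPT
        System.out.println(decide(rejectFirst, "http://other.test/a"));  // REJECT
        System.out.println(decide(rejectLast, "http://example.org/a"));  // REJECT
    }
}
```

With the reject-everything rule first, later rules can carve out exceptions; with it last, it overrides every earlier decision. This is consistent with RejectDecideRule being the first rule in the configured sequence.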