Heritrix 3.1.0 Source Code Analysis (Part 31)

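A CrawlURI returned by the BdbFrontier's next method (that is, taken from the BdbWorkQueue identified by some classKey) first enters the Preselector processor. This processor filters the CrawlURI against the regular expressions set in the configuration file, and only a CrawlURI that passes the filter moves on to the next processor. Preselector extends the Scoper class (analyzed in an earlier article in this series, so not repeated here). The class is simple, so I only show the method that does the work: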


    protected ProcessResult innerProcessResult(CrawlURI puri) {
        CrawlURI curi = (CrawlURI)puri;

        // Check if uris should be blocked
        if (getBlockAll()) {
            return ProcessResult.FINISH;
        }

        // Check if allowed by regular expression
        String regex = getAllowByRegex();
        if (regex != null && !regex.equals("")) {
            if (!TextUtils.matches(regex, curi.toString())) {
                return ProcessResult.FINISH;
            }
        }

        // Check if blocked by regular expression
        regex = getBlockByRegex();
        if (regex != null && !regex.equals("")) {
            if (TextUtils.matches(regex, curi.toString())) {
                return ProcessResult.FINISH;
            }
        }

        // Possibly recheck scope
        if (getRecheckScope()) {
            if (!isInScope(curi)) {
                // Scope rejected
                return ProcessResult.FINISH;
            }
        }

        return ProcessResult.PROCEED;
    }


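The last step of the method above decides whether to recheck scope (by calling boolean isInScope(CrawlURI caUri) on the parent Scoper class). The recheckScope setting defaults to false.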
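The decision order of that method can be tried out in isolation. The following is a minimal sketch, not Heritrix code: the class `PreselectorSketch`, its fields, and the `Predicate` that stands in for the scope check are hypothetical, and it assumes the expression must match the whole URI string, which is how `TextUtils.matches` is used here.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the Preselector decision order; not Heritrix code. */
final class PreselectorSketch {
    enum Result { PROCEED, FINISH }

    boolean blockAll = false;      // mirrors the blockAll property
    String allowByRegex = "";      // mirrors allowByRegex; empty means "not set"
    String blockByRegex = "";      // mirrors blockByRegex; empty means "not set"
    boolean recheckScope = false;  // mirrors recheckScope (false by default)
    Predicate<String> inScope = uri -> true; // assumed stand-in for isInScope

    Result decide(String uri) {
        if (blockAll) {
            return Result.FINISH;
        }
        // Assumption: the whole URI string must match the expression.
        if (!allowByRegex.isEmpty() && !Pattern.matches(allowByRegex, uri)) {
            return Result.FINISH;
        }
        if (!blockByRegex.isEmpty() && Pattern.matches(blockByRegex, uri)) {
            return Result.FINISH;
        }
        if (recheckScope && !inScope.test(uri)) {
            return Result.FINISH;
        }
        return Result.PROCEED;
    }

    public static void main(String[] args) {
        PreselectorSketch p = new PreselectorSketch();
        p.allowByRegex = "https?://example\\.com/.*";
        p.blockByRegex = ".*\\.pdf";
        System.out.println(p.decide("http://example.com/index.html"));
        System.out.println(p.decide("http://example.com/a.pdf"));
        System.out.println(p.decide("http://other.org/"));
    }
}
```

Note that the allow expression is checked before the block expression, so a URI has to pass both to proceed.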

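The point to understand here is that this processor's regex filtering happens after the CrawlURI has already been added to a BdbWorkQueue, just before it is actually fetched. The CandidatesProcessor, by contrast, filters a CrawlURI before it enters a BdbWorkQueue. For rules that filter CrawlURIs, I recommend configuring them in the modules associated with the CandidatesProcessor.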


 <!-- first, processors are declared as top-level named beans -->
 <bean id="preselector" class="org.archive.crawler.prefetch.MyPreselector">
      <!-- <property name="recheckScope" value="false" /> -->
      <!-- <property name="blockAll" value="false" /> -->
      <!-- <property name="blockByRegex" value="" /> -->
      <!-- <property name="allowByRegex" value="" /> -->
 </bean>


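A CrawlURI that passes the Preselector processor next enters the PreconditionEnforcer processor, which can be called the precondition processor. It mainly checks DNS, checks the robots rules, and handles authentication. Its processing method is as follows: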


    protected ProcessResult innerProcessResult(CrawlURI puri) {
        CrawlURI curi = (CrawlURI)puri;

        if (considerDnsPreconditions(curi)) {
            return ProcessResult.FINISH;
        }

        // make sure we only process schemes we understand (i.e. not dns):
        // the scheme of the current CrawlURI is neither http nor https
        String scheme = curi.getUURI().getScheme().toLowerCase();
        if (! (scheme.equals("http") || scheme.equals("https"))) {
            logger.fine("PolitenessEnforcer doesn't understand uri's of type " +
                scheme + " (ignoring)");
            return ProcessResult.PROCEED;
        }

        if (considerRobotsPreconditions(curi)) {
            return ProcessResult.FINISH;
        }

//        System.out.println("!curi.isPrerequisite():"+!curi.isPrerequisite());

        if (!curi.isPrerequisite() && credentialPrecondition(curi)) {
            return ProcessResult.FINISH;
        }

        // OK, it's allowed
        // For all curis that will in fact be fetched, set appropriate delays.
        // TODO: SOMEDAY: allow per-host, per-protocol, etc. factors
        // curi.setDelayFactor(getDelayFactorFor(curi));
        // curi.setMinimumDelay(getMinimumDelayFor(curi));
        return ProcessResult.PROCEED;
    }


If a precondition exists, the method sets it as the prerequisite of the current CrawlURI and leaves the current processor chain (the FetchChain).
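The short-circuit order of the three checks can be sketched on its own. This is an illustrative sketch only: `PreconditionOrderSketch` and its `Checks` interface are hypothetical stand-ins, each check is assumed to return true when it has set a prerequisite (or otherwise wants processing to stop), and nothing here touches the real Heritrix classes.

```java
import java.util.Locale;

/** Hypothetical sketch of the check order in PreconditionEnforcer; not Heritrix code. */
final class PreconditionOrderSketch {
    enum Result { PROCEED, FINISH }

    /** Assumed stand-ins for the three checks; true means "stop processing this URI for now". */
    interface Checks {
        boolean dns(String uri);
        boolean robots(String uri);
        boolean credentials(String uri);
    }

    static Result enforce(String scheme, String uri, boolean isPrerequisite, Checks checks) {
        if (checks.dns(uri)) {
            return Result.FINISH;          // wait for the DNS lookup first
        }
        String s = scheme.toLowerCase(Locale.ROOT);
        if (!(s.equals("http") || s.equals("https"))) {
            return Result.PROCEED;         // e.g. dns: URIs skip the remaining checks
        }
        if (checks.robots(uri)) {
            return Result.FINISH;          // wait for robots.txt, or precluded by it
        }
        if (!isPrerequisite && checks.credentials(uri)) {
            return Result.FINISH;          // wait for the login prerequisite
        }
        return Result.PROCEED;
    }

    public static void main(String[] args) {
        Checks robotsPending = new Checks() {
            public boolean dns(String uri) { return false; }
            public boolean robots(String uri) { return true; }
            public boolean credentials(String uri) { return false; }
        };
        System.out.println(enforce("http", "http://example.com/", false, robotsPending));
        System.out.println(enforce("dns", "dns:example.com", false, robotsPending));
    }
}
```

The credential check is skipped for a URI that is itself a prerequisite, so a login page or robots.txt fetch is not held up by the login it belongs to.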

Let us first look at the first precondition, DNS resolution, which is checked by boolean considerDnsPreconditions(CrawlURI curi):


    /**
     * @param curi CrawlURI whose dns prerequisite we're to check.
     * @return true if no further processing in this module should occur
     */
    protected boolean considerDnsPreconditions(CrawlURI curi) {
        if (curi.getUURI().getScheme().equals("dns")) {
            // DNS URIs never have a DNS precondition
            return false;
        } else if (curi.getUURI().getScheme().equals("whois")) {
            return false;
        }

        CrawlServer cs = serverCache.getServerFor(curi.getUURI());
        if(cs == null) {
//            curi.skipToPostProcessing();
            return true;
        }

        // If we've done a dns lookup and it didn't resolve a host
        // cancel further fetch-processing of this URI, because
        // the domain is unresolvable
        CrawlHost ch = serverCache.getHostFor(curi.getUURI());
        if (ch == null || ch.hasBeenLookedUp() && ch.getIP() == null) {
            if (logger.isLoggable(Level.FINE)) {
                logger.fine( "no dns for " + ch +
                    " cancelling processing for CrawlURI " + curi.toString());
            }
//            curi.skipToPostProcessing();
            return true;
        }

        // If we haven't done a dns lookup  and this isn't a dns uri
        // shoot that off and defer further processing
        // (the IP has expired and the scheme of the current CrawlURI is not itself dns)
        if (isIpExpired(curi) && !curi.getUURI().getScheme().equals("dns")) {
            logger.fine("Deferring processing of CrawlURI " + curi.toString()
                + " for dns lookup.");
            String preq = "dns:" + ch.getHostName();
            try {
                // prerequisite: DNS resolution
                curi.markPrerequisite(preq);
            } catch (URIException e) {
                throw new RuntimeException(e); // shouldn't ever happen
            }
            return true;
        }

        // DNS preconditions OK
        return false;
    }


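The boolean isIpExpired(CrawlURI curi) method decides whether the IP needs to be looked up, that is, whether the CrawlHost for the current CrawlURI either has never been looked up or holds an IP that has expired: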

    /** Return true if ip should be looked up.
     *
     * @param curi the URI to check.
     * @return true if ip should be looked up.
     */
    public boolean isIpExpired(CrawlURI curi) {
        CrawlHost host = serverCache.getHostFor(curi.getUURI());
        if (!host.hasBeenLookedUp()) {
            // IP has not been looked up yet.
            return true;
        }

        if (host.getIpTTL() == CrawlHost.IP_NEVER_EXPIRES) {
            // IP never expires (numeric IP)
            return false;
        }

        long duration = getIpValidityDurationSeconds();
        if (duration == 0) {
            // Never expire ip if duration is null (set by user or more likely,
            // set to zero in case where we tried in FetchDNS but failed).
            return false;
        }

        long ttl = host.getIpTTL();
        if (ttl > duration) {
            // Use the larger of the operator-set minimum duration
            // or the DNS record TTL
            duration = ttl;
        }

        // Duration and ttl are in seconds.  Convert to millis.
        if (duration > 0) {
            duration *= 1000;
        }

        return (duration + host.getIpFetched()) < System.currentTimeMillis();
    }
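To make the arithmetic concrete, here is a small sketch of the same expiry rule as a pure function. It is an illustration under assumptions, not the Heritrix method: `IpExpirySketch` is a hypothetical class, the sentinel value chosen for `IP_NEVER_EXPIRES` is assumed, and the current time is passed in as a parameter so that the result is reproducible.

```java
/** Hypothetical restatement of the IP expiry rule as a pure function; not Heritrix code. */
final class IpExpirySketch {
    /** Assumed sentinel; the real constant is CrawlHost.IP_NEVER_EXPIRES. */
    static final long IP_NEVER_EXPIRES = -1;

    static boolean isExpired(boolean lookedUp, long ttlSeconds, long validitySeconds,
                             long fetchedAtMillis, long nowMillis) {
        if (!lookedUp) {
            return true;                 // never looked up: a lookup is needed
        }
        if (ttlSeconds == IP_NEVER_EXPIRES) {
            return false;                // numeric IP: nothing to refresh
        }
        long duration = validitySeconds;
        if (duration == 0) {
            return false;                // operator disabled expiry
        }
        if (ttlSeconds > duration) {
            duration = ttlSeconds;       // use the larger of the two lifetimes
        }
        if (duration > 0) {
            duration *= 1000;            // seconds to milliseconds
        }
        return duration + fetchedAtMillis < nowMillis;
    }

    public static void main(String[] args) {
        // Configured validity of 6 hours, DNS TTL of 5 minutes, fetched at t = 0.
        System.out.println(isExpired(true, 300, 21_600, 0, 21_600_001L));
        System.out.println(isExpired(true, 300, 21_600, 0, 21_600_000L));
    }
}
```

With a configured validity of 21,600 seconds and a shorter DNS TTL, the configured value wins, so the IP counts as expired only once more than 21,600,000 ms have passed since it was fetched.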


If the IP needs to be looked up, the prerequisite of the current CrawlURI is set to dns:host. The CrawlURI markPrerequisite(String preq) method of the CrawlURI class is as follows:


    /**
     * Do all actions associated with setting a <code>CrawlURI</code> as
     * requiring a prerequisite.
     *
     * @param lastProcessorChain Last processor chain reference.  This chain is
     * where this <code>CrawlURI</code> goes next.
     * @param preq Object to set a prerequisite.
     * @return the newly created prerequisite CrawlURI
     * @throws URIException
     */
    public CrawlURI markPrerequisite(String preq)
    throws URIException {
        UURI src = getUURI();
        UURI dest = UURIFactory.getInstance(preq);
        LinkContext lc = LinkContext.PREREQ_MISC;
        Hop hop = Hop.PREREQ;
        Link link = new Link(src, dest, lc, hop);
        CrawlURI caUri = createCrawlURI(getBaseURI(), link);
        // TODO: consider moving some of this to candidate-handling
        int prereqPriority = getSchedulingDirective() - 1;
        if (prereqPriority < 0) {
            prereqPriority = 0;
            logger.severe("Unable to promote prerequisite " + caUri + " above " + this);
        }
        caUri.setSchedulingDirective(prereqPriority);
        setPrerequisiteUri(caUri);

        return caUri;
    }


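The method above first creates the prerequisite CrawlURI caUri, gives it a scheduling directive one level higher than that of the current URI, and finally stores it in the current CrawlURI's Map<String,Object> data member under the key String A_PREREQUISITE_URI = "prerequisite-uri":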


    /**
     * Set a prerequisite for this URI.
     * <p>
     * A prerequisite is a URI that must be crawled before this URI can be
     * crawled.
     *
     * @param link Link to set as prereq.
     */
    public void setPrerequisiteUri(CrawlURI pre) {
        getData().put(A_PREREQUISITE_URI, pre);
    }
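The two ideas involved here, promoting the prerequisite by one scheduling level and remembering it in the data map, can be shown with a stripped-down model. This is a sketch under assumptions: `PrerequisiteSketch` is a hypothetical class rather than CrawlURI, and it assumes that a smaller scheduling directive means earlier scheduling, with 0 as the highest level.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical miniature of the prerequisite bookkeeping; not the Heritrix CrawlURI. */
final class PrerequisiteSketch {
    static final String A_PREREQUISITE_URI = "prerequisite-uri";

    final String uri;
    final int schedulingDirective;   // assumption: smaller value = scheduled earlier
    final Map<String, Object> data = new HashMap<>();

    PrerequisiteSketch(String uri, int schedulingDirective) {
        this.uri = uri;
        this.schedulingDirective = schedulingDirective;
    }

    /** Creates the prerequisite one level above this URI and stores it in the data map. */
    PrerequisiteSketch markPrerequisite(String preq) {
        int prereqPriority = Math.max(0, schedulingDirective - 1); // cannot go above level 0
        PrerequisiteSketch pre = new PrerequisiteSketch(preq, prereqPriority);
        data.put(A_PREREQUISITE_URI, pre);
        return pre;
    }

    boolean hasPrerequisiteUri() {
        return data.containsKey(A_PREREQUISITE_URI);
    }

    public static void main(String[] args) {
        PrerequisiteSketch page = new PrerequisiteSketch("http://example.com/", 3);
        PrerequisiteSketch dns = page.markPrerequisite("dns:example.com");
        System.out.println(dns.uri + " at level " + dns.schedulingDirective);
        System.out.println(page.hasPrerequisiteUri());
    }
}
```

A URI already at level 0 cannot be outranked, which is the case the real method reports with its "Unable to promote prerequisite" log message.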




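The prerequisite stored this way is picked up later by the innerProcess method below (from CandidatesProcessor), which checks hasPrerequisiteUri() and passes the prerequisite through the candidate chain:

    protected void innerProcess(final CrawlURI curi) throws InterruptedException {
        // Handle any prerequisites when S_DEFERRED for prereqs
        if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
            CrawlURI prereq = curi.getPrerequisiteUri();
            try {
                getCandidateChain().process(prereq, null);
                if(prereq.getFetchStatus()>=0) {
                    // ...
                } else {
                    // ...
                }
            } finally {
                // ...
            }
        }
        // ...
    }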

Back in the PreconditionEnforcer processor, the second precondition is the robots rules check. Its logic is similar to the DNS check. The boolean considerRobotsPreconditions(CrawlURI curi) method is as follows:


    /**
     * Consider the robots precondition.
     *
     * @param curi CrawlURI we're checking for any required preconditions.
     * @return True, if this <code>curi</code> has a precondition or processing
     *         should be terminated for some other reason.  False if
     *         we can proceed to process this url.
     */
    protected boolean considerRobotsPreconditions(CrawlURI curi) {
        // treat /robots.txt fetches specially
        // (to skip this check entirely: return false;)
        UURI uuri = curi.getUURI();
        try {
            if (uuri != null && uuri.getPath() != null &&
                    curi.getUURI().getPath().equals("/robots.txt")) {
                // allow processing to continue
                return false;
            }
        } catch (URIException e) {
            logger.severe("Failed get of path for " + curi);
        }

        CrawlServer cs = serverCache.getServerFor(curi.getUURI());
        // require /robots.txt if not present
        if (cs.isRobotsExpired(getRobotsValidityDurationSeconds())) {
            // Need to get robots
            if (logger.isLoggable(Level.FINE)) {
                logger.fine( "No valid robots for " + cs  +
                    "; deferring " + curi);
            }

            // Robots expired - should be refetched even though its already
            // crawled.
            try {
                String prereq = curi.getUURI().resolve("/robots.txt").toString();
                curi.markPrerequisite(prereq);
            }
            catch (URIException e1) {
                logger.severe("Failed resolve using " + curi);
                throw new RuntimeException(e1); // shouldn't ever happen
            }
            return true;
        }

        // test against robots.txt if available
        if (cs.isValidRobots()) {
            String ua = metadata.getUserAgent();
            RobotsPolicy robots = metadata.getRobotsPolicy();
            if(!robots.allows(ua, curi, cs.getRobotstxt())) {
                if(getCalculateRobotsOnly()) {
                    // annotate URI as excluded, but continue to process normally
                    return false;
                }
                // mark as precluded; in FetchHTTP, this will
                // prevent fetching and cause a skip to the end
                // of processing (unless an intervening processor
                // overrules)
                curi.setError("robots.txt exclusion");
                logger.fine("robots.txt precluded " + curi);
                return true;
            }
            return false;
        }

        // No valid robots found => Attempt to get robots.txt failed
//        curi.skipToPostProcessing();
        curi.setError("robots.txt prerequisite failed");
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("robots.txt prerequisite failed " + curi);
        }
        return true;
    }
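The branches of the robots check reduce to a small decision table, sketched below. This is an illustrative model only: `RobotsDecisionSketch`, its `Outcome` names, and the boolean parameters are hypothetical and simply label the paths visible in the method above.

```java
/** Hypothetical decision table for the robots precondition; not Heritrix code. */
final class RobotsDecisionSketch {
    enum Outcome { FETCH, DEFER_FOR_ROBOTS, PRECLUDED, ANNOTATED_ONLY, PREREQUISITE_FAILED }

    static Outcome decide(String path, boolean robotsExpired, boolean robotsValid,
                          boolean allowed, boolean calculateRobotsOnly) {
        if ("/robots.txt".equals(path)) {
            return Outcome.FETCH;                 // robots.txt itself is always let through
        }
        if (robotsExpired) {
            return Outcome.DEFER_FOR_ROBOTS;      // fetch /robots.txt first
        }
        if (robotsValid) {
            if (allowed) {
                return Outcome.FETCH;
            }
            // Disallowed: either only record the fact, or stop the fetch.
            return calculateRobotsOnly ? Outcome.ANNOTATED_ONLY : Outcome.PRECLUDED;
        }
        return Outcome.PREREQUISITE_FAILED;       // robots.txt could not be obtained
    }

    /** Mirrors the boolean return value: true means processing stops here. */
    static boolean stopsProcessing(Outcome o) {
        return o == Outcome.DEFER_FOR_ROBOTS
            || o == Outcome.PRECLUDED
            || o == Outcome.PREREQUISITE_FAILED;
    }

    public static void main(String[] args) {
        System.out.println(decide("/index.html", true, false, false, false));
        System.out.println(decide("/private/a", false, true, false, false));
        System.out.println(decide("/private/a", false, true, false, true));
    }
}
```

Only three of the five outcomes stop processing; a disallowed URI still proceeds when calculateRobotsOnly is set.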


The third precondition is authentication. The boolean credentialPrecondition(final CrawlURI curi) method is as follows:


    /**
    * Consider credential preconditions.
    *
    * Looks to see if any credential preconditions (e.g. html form login
    * credentials) for this <code>CrawlServer</code>. If there are, have they
    * been run already? If not, make the running of these logins a precondition
    * of accessing any other url on this <code>CrawlServer</code>.
    *
    * <p>
    * One day, do optimization and avoid running the bulk of the code below.
    * Argument for running the code everytime is that overrides and refinements
    * may change what comes back from credential store.
    *
    * @param curi CrawlURI we're checking for any required preconditions.
    * @return True, if this <code>curi</code> has a precondition that needs to
    *         be met before we can proceed. False if we can precede to process
    *         this url.
    */
    /**
     * Consider the case of different classKeys that share the same host;
     * the host should be the basis.
     * @param curi
     * @return
     */
    protected boolean credentialPrecondition(final CrawlURI curi) {
        boolean result = false;

        CredentialStore cs = getCredentialStore();
        if (cs == null) {
            logger.severe("No credential store for " + curi);
            return result;
        }

        // iterate over the Collection<Credential> held by the CredentialStore
        for (Credential c: cs.getAll()) {
            // check whether the current CrawlURI is itself the prerequisite
            if (c.isPrerequisite(curi)) {
                // This credential has a prereq. and this curi is it.  Let it
                // through.  Add its avatar to the curi as a mark.  Also, does
                // this curi need to be posted?  Note, we do this test for
                // is it a prereq BEFORE we do the check that curi is of the
                // credential domain because such as yahoo have you go to
                // another domain altogether to login.
                // attach the current credential to the CrawlURI
                // ...
            }

            // check that the domain of Credential c matches the serverName of the
            // CrawlServer for the current CrawlURI, that is, that the credential's
            // domain is the same as the domain of the current CrawlURI
            if (!c.rootUriMatch(serverCache, curi)) {
                continue;
            }

            // check that Credential c has a prerequisite
            // (for form authentication, the login URL)
            if (!c.hasPrerequisite(curi)) {
                continue;
            }

            // At this point the domain of the current CrawlURI is known to match
            // the domain of Credential c.
            // authenticated() gets the CrawlServer for the current CrawlURI and
            // walks its Set<Credential> looking for one that matches Credential c.
            // The outer loop runs over all credentials from the configuration file;
            // the inner loop runs over the credentials already stored on the server
            // for the current CrawlURI.
            // CrawlURIs under the same domain must therefore map to the same
            // CrawlServer object (otherwise other CrawlURIs under that domain
            // would have to be authenticated again).
            if (!authenticated(c, curi)) {
                // Han't been authenticated.  Queue it and move on (Assumption
                // is that we can do one authentication at a time -- usually one
                // html form).
                String prereq = c.getPrerequisite(curi);
                if (prereq == null || prereq.length() <= 0) {
                    CrawlServer server = serverCache.getServerFor(curi.getUURI());
                    logger.severe(server.getName() + " has "
                        + " credential(s) of type " + c + " but prereq"
                        + " is null.");
                } else {
                    try {
                        curi.markPrerequisite(prereq);
                    } catch (URIException e) {
                        logger.severe("unable to set credentials prerequisite "+prereq);
                        return false;
                    }
                    result = true;
                    if (logger.isLoggable(Level.FINE)) {
                        logger.fine("Queueing prereq " + prereq + " of type " +
                            c + " for " + curi);
                    }
                }
            }
        }
        return result;
    }


The boolean authenticated(final Credential credential, final CrawlURI curi) method decides whether authentication has already taken place, that is, whether the CrawlServer already holds the given Credential:


    /**
     * Has passed credential already been authenticated.
     *
     * @param credential Credential to test.
     * @param curi CrawlURI.
     * @return True if already run.
     */
    protected boolean authenticated(final Credential credential, final CrawlURI curi) {
        // get the CrawlServer for the CrawlURI
        CrawlServer server = serverCache.getServerFor(curi.getUURI());
        if (!server.hasCredentials()) {
            return false;
        }

        // the Set<Credential> already persisted on the CrawlServer
        Set<Credential> credentials = server.getCredentials();
        for (Credential cred: credentials) {
            if (cred.getKey().equals(credential.getKey())
                    && cred.getClass().isInstance(credential)) {
                return true;
            }
        }
        return false;
    }
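The matching rule in that loop, same key and compatible class, can be illustrated with a toy model. This is a sketch, not the Heritrix credential classes: `AuthenticatedSketch` and the nested `Credential`, `FormLogin`, and `BasicAuth` types are hypothetical names introduced only for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the "already authenticated" test; not Heritrix code. */
final class AuthenticatedSketch {
    static class Credential {
        final String key;
        Credential(String key) { this.key = key; }
    }
    static class FormLogin extends Credential {
        FormLogin(String key) { super(key); }
    }
    static class BasicAuth extends Credential {
        BasicAuth(String key) { super(key); }
    }

    /** True when the server already holds a credential with the same key and a compatible type. */
    static boolean authenticated(Credential wanted, Set<Credential> onServer) {
        for (Credential cred : onServer) {
            if (cred.key.equals(wanted.key) && cred.getClass().isInstance(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Credential> onServer = new HashSet<>();
        onServer.add(new FormLogin("example.com"));
        System.out.println(authenticated(new FormLogin("example.com"), onServer));
        System.out.println(authenticated(new BasicAuth("example.com"), onServer));
    }
}
```

Comparing the class as well as the key means that a form login stored for a server does not count as proof that a different kind of credential with the same key has been run.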



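This Heritrix 3.1.0 source code analysis series is my own original work.

If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/30/3052319.html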
