Heritrix 3.1.0 Source Code Analysis (Part 31)

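The first processor that a CrawlURI enters after being taken from BdbFrontier's next method (that is, from the BdbWorkQueue identified by some classKey) is the Preselector. It filters the CrawlURI against the regular expressions set in the configuration file, and only a CrawlURI that passes the filter moves on to the next processor. Preselector extends Scoper (which I covered in an earlier post and will not repeat here). The class is fairly simple, so I only paste the relevant processing method: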

@Override

    protected ProcessResult innerProcessResult(CrawlURI puri) {

        CrawlURI curi = (CrawlURI)puri;

        

        // Check if uris should be blocked

        if (getBlockAll()) {

            curi.setFetchStatus(S_BLOCKED_BY_USER);

            return ProcessResult.FINISH;

        }



        // Check if allowed by regular expression

        String regex = getAllowByRegex();

        if (regex != null && !regex.equals("")) {

            if (!TextUtils.matches(regex, curi.toString())) {

                curi.setFetchStatus(S_BLOCKED_BY_USER);

                return ProcessResult.FINISH;

            }

        }



        // Check if blocked by regular expression

        regex = getBlockByRegex();

        if (regex != null && !regex.equals("")) {

            if (TextUtils.matches(regex, curi.toString())) {

                curi.setFetchStatus(S_BLOCKED_BY_USER);

                return ProcessResult.FINISH;

            }

        }



        // Possibly recheck scope

        if (getRecheckScope()) {

            if (!isInScope(curi)) {

                // Scope rejected

                curi.setFetchStatus(S_OUT_OF_SCOPE);

                return ProcessResult.FINISH;

            }

        }

        

        return ProcessResult.PROCEED;

    }

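The last step in the method above decides whether to recheck the scope (by calling the parent Scoper class's boolean isInScope(CrawlURI caUri) method). The recheckScope setting defaults to false.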
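
To make the order of the checks easier to see, here is a minimal sketch that reproduces it outside Heritrix. The class PreselectorSketch, its Decision enum and the decide method are hypothetical names made up for this illustration. It also assumes that the regex has to match the whole URI string, which is what Pattern.matches does.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for Preselector's decision order; not Heritrix code.
class PreselectorSketch {
    enum Decision { PROCEED, BLOCKED_BY_USER }

    // Mirrors the "regex != null && !regex.equals("")" guard in the original method.
    static boolean isSet(String regex) {
        return regex != null && !regex.isEmpty();
    }

    static Decision decide(String uri, boolean blockAll,
                           String allowByRegex, String blockByRegex) {
        // 1. blockAll rejects everything.
        if (blockAll) {
            return Decision.BLOCKED_BY_USER;
        }
        // 2. If an allow pattern is set, the URI must match it.
        if (isSet(allowByRegex) && !Pattern.matches(allowByRegex, uri)) {
            return Decision.BLOCKED_BY_USER;
        }
        // 3. If a block pattern is set, the URI must not match it.
        if (isSet(blockByRegex) && Pattern.matches(blockByRegex, uri)) {
            return Decision.BLOCKED_BY_USER;
        }
        return Decision.PROCEED;
    }

    public static void main(String[] args) {
        System.out.println(decide("http://example.com/a.html", false, ".*\\.html", ""));
        System.out.println(decide("http://example.com/a.jpg", false, ".*\\.html", ""));
    }
}
```

Note that the allow pattern is checked before the block pattern, so a URI has to pass both to proceed.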

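The point to understand here is that this processor's regex filtering runs after the CrawlURI has already been added to a BdbWorkQueue, just before the actual fetch. The CandidatesProcessor, by contrast, filters a CrawlURI before it enters a BdbWorkQueue. For rules that filter CrawlURI objects, I recommend configuring them in the modules associated with the CandidatesProcessor.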

The regular expressions can be set as properties of the Preselector bean in the crawler-beans.cxml configuration file:

 <!-- first, processors are declared as top-level named beans -->

 <bean id="preselector" class="org.archive.crawler.prefetch.MyPreselector">

      <!-- <property name="recheckScope" value="false" />-->

     <!--  <property name="blockAll" value="false" />-->

     <!--  <property name="blockByRegex" value="" />-->

     <!--  <property name="allowByRegex" value="" />-->

 </bean>

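A CrawlURI that passes the Preselector next enters the PreconditionEnforcer, which can be thought of as the precondition processor. It mainly checks DNS resolution, checks the robots rules, and handles authentication. Its processing method is as follows: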

@Override

    protected ProcessResult innerProcessResult(CrawlURI puri) {

        CrawlURI curi = (CrawlURI)puri;

        // DNS resolution check

        if (considerDnsPreconditions(curi)) {

            return ProcessResult.FINISH;

        }



        // make sure we only process schemes we understand (i.e. not dns); this branch applies when the scheme is neither http nor https

        String scheme = curi.getUURI().getScheme().toLowerCase();

        if (! (scheme.equals("http") || scheme.equals("https"))) {

            logger.fine("PolitenessEnforcer doesn't understand uri's of type " +

                scheme + " (ignoring)");

            return ProcessResult.PROCEED;

        }

        // robots check

        if (considerRobotsPreconditions(curi)) {

            return ProcessResult.FINISH;

        }

//        System.out.println("!curi.isPrerequisite():"+!curi.isPrerequisite());

        // authentication

        if (!curi.isPrerequisite() && credentialPrecondition(curi)) {

            return ProcessResult.FINISH;

        }



        // OK, it's allowed



        // For all curis that will in fact be fetched, set appropriate delays.

        // TODO: SOMEDAY: allow per-host, per-protocol, etc. factors

        // curi.setDelayFactor(getDelayFactorFor(curi));

        // curi.setMinimumDelay(getMinimumDelayFor(curi));



        return ProcessResult.PROCEED;

    }

If a precondition exists, the method sets the prerequisite on the current CrawlURI and leaves the current processor chain (the FetchChain).

Let's start with the first precondition, the DNS resolution check, handled by the boolean considerDnsPreconditions(CrawlURI curi) method:

/**

     * @param curi CrawlURI whose dns prerequisite we're to check.

     * @return true if no further processing in this module should occur

     */

    protected boolean considerDnsPreconditions(CrawlURI curi) {

        if(curi.getUURI().getScheme().equals("dns")){

            // DNS URIs never have a DNS precondition

            // a dns: URI is itself a prerequisite

            curi.setPrerequisite(true);

            return false; 

        } else if (curi.getUURI().getScheme().equals("whois")) {

            return false;

        }



        // serverCache: org.archive.modules.net.BdbServerCache

        CrawlServer cs = serverCache.getServerFor(curi.getUURI());        

        if(cs == null) {

            curi.setFetchStatus(S_UNFETCHABLE_URI);

//            curi.skipToPostProcessing();

            return true;

        }



        // If we've done a dns lookup and it didn't resolve a host

        // cancel further fetch-processing of this URI, because

        // the domain is unresolvable

        CrawlHost ch = serverCache.getHostFor(curi.getUURI());        

        if (ch == null || ch.hasBeenLookedUp() && ch.getIP() == null) {

            if (logger.isLoggable(Level.FINE)) {

                logger.fine( "no dns for " + ch +

                    " cancelling processing for CrawlURI " + curi.toString());

            }

            curi.setFetchStatus(S_DOMAIN_PREREQUISITE_FAILURE);

//            curi.skipToPostProcessing();

            return true;

        }



        // If we haven't done a dns lookup  and this isn't a dns uri

        // shoot that off and defer further processing

        // check whether the IP has expired and the scheme of the current CrawlURI is not itself dns

        if (isIpExpired(curi) && !curi.getUURI().getScheme().equals("dns")) {

            logger.fine("Deferring processing of CrawlURI " + curi.toString()

                + " for dns lookup.");

            String preq = "dns:" + ch.getHostName();

            try {

                // prerequisite: DNS resolution

                curi.markPrerequisite(preq);

            } catch (URIException e) {

                throw new RuntimeException(e); // shouldn't ever happen

            }

            return true;

        }

        

        // DNS preconditions OK

        return false;

    }

The boolean isIpExpired(CrawlURI curi) method decides whether the IP needs to be looked up, based on the IP information recorded in the CrawlHost that corresponds to the current CrawlURI:

/** Return true if ip should be looked up.

     *

     * @param curi the URI to check.

     * @return true if ip should be looked up.

     */

    public boolean isIpExpired(CrawlURI curi) {

        CrawlHost host = serverCache.getHostFor(curi.getUURI());

        if (!host.hasBeenLookedUp()) {

            // IP has not been looked up yet.

            return true;

        }



        if (host.getIpTTL() == CrawlHost.IP_NEVER_EXPIRES) {

            // IP never expires (numeric IP)

            return false;

        }



        long duration = getIpValidityDurationSeconds();

        if (duration == 0) {

            // Never expire ip if duration is null (set by user or more likely,

            // set to zero in case where we tried in FetchDNS but failed).

            return false;

        }

        

        long ttl = host.getIpTTL();

        if (ttl > duration) {

            // Use the larger of the operator-set minimum duration 

            // or the DNS record TTL

            duration = ttl;

        }



        // Duration and ttl are in seconds.  Convert to millis.

        if (duration > 0) {

            duration *= 1000;

        }



        return (duration + host.getIpFetched()) < System.currentTimeMillis();

    }
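
The expiry arithmetic is easier to follow with concrete numbers. Below is a small sketch that takes the same inputs as plain parameters instead of reading them from a CrawlHost. The class IpExpirySketch and its parameter names are invented for this example, and the value -1 for IP_NEVER_EXPIRES is an assumption made for illustration.

```java
// Hypothetical restatement of the isIpExpired arithmetic; not Heritrix code.
class IpExpirySketch {
    // Assumed sentinel, standing in for CrawlHost.IP_NEVER_EXPIRES.
    static final long IP_NEVER_EXPIRES = -1;

    static boolean isExpired(boolean lookedUp, long ttlSeconds,
                             long minValiditySeconds,
                             long fetchedAtMillis, long nowMillis) {
        if (!lookedUp) {
            return true;  // never looked up: a lookup is needed
        }
        if (ttlSeconds == IP_NEVER_EXPIRES) {
            return false; // numeric IP: never expires
        }
        if (minValiditySeconds == 0) {
            return false; // configured duration of zero: never expire
        }
        // Use the larger of the configured minimum and the DNS record TTL.
        long durationSeconds = Math.max(ttlSeconds, minValiditySeconds);
        // Both values are in seconds; timestamps are in milliseconds.
        return fetchedAtMillis + durationSeconds * 1000L < nowMillis;
    }

    public static void main(String[] args) {
        long sixHours = 6 * 60 * 60;
        // TTL of 60 s, configured minimum of 6 h, checked 1 h after the lookup.
        System.out.println(isExpired(true, 60, sixHours, 0L, 3_600_000L));
    }
}
```

In this sketch a short DNS TTL does not force an early re-lookup, because the configured minimum wins whenever it is larger.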

If the IP has not been looked up yet or has expired, the prerequisite of the current CrawlURI is set to dns:host. The CrawlURI markPrerequisite(String preq) method of CrawlURI is as follows:

/**

     * Do all actions associated with setting a <code>CrawlURI</code> as

     * requiring a prerequisite.

     *

     * @param lastProcessorChain Last processor chain reference.  This chain is

     * where this <code>CrawlURI</code> goes next.

     * @param preq Object to set a prerequisite.

     * @return the newly created prerequisite CrawlURI

     * @throws URIException

     */

    public CrawlURI markPrerequisite(String preq) 

    throws URIException {

        UURI src = getUURI();

        UURI dest = UURIFactory.getInstance(preq);

        LinkContext lc = LinkContext.PREREQ_MISC;

        Hop hop = Hop.PREREQ;

        Link link = new Link(src, dest, lc, hop);

        CrawlURI caUri = createCrawlURI(getBaseURI(), link);

        // TODO: consider moving some of this to candidate-handling

        int prereqPriority = getSchedulingDirective() - 1;

        if (prereqPriority < 0) {

            prereqPriority = 0;

            logger.severe("Unable to promote prerequisite " + caUri + " above " + this);

        }

        caUri.setSchedulingDirective(prereqPriority);

        caUri.setForceFetch(true);

        setPrerequisiteUri(caUri);

        incrementDeferrals();

        setFetchStatus(S_DEFERRED);

        

        return caUri;

    }

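The method above first creates the prerequisite CrawlURI caUri and gives it a scheduling priority one level higher than the current URI (a scheduling directive one lower). It then stores caUri in the current CrawlURI's Map<String,Object> data member under the key String A_PREREQUISITE_URI = "prerequisite-uri":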

/**

     * Set a prerequisite for this URI.

     * <p>

     * A prerequisite is a URI that must be crawled before this URI can be

     * crawled.

     *

     * @param link Link to set as prereq.

     */

    public void setPrerequisiteUri(CrawlURI pre) {

        getData().put(A_PREREQUISITE_URI, pre);

    }
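
The bookkeeping done by markPrerequisite and setPrerequisiteUri can be condensed into a small model. This is only a sketch: the class PrerequisiteSketch is hypothetical, it keeps just the fields needed for the illustration, and the numeric value used for S_DEFERRED is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, stripped-down model of a CrawlURI; not Heritrix code.
class PrerequisiteSketch {
    static final String A_PREREQUISITE_URI = "prerequisite-uri";
    static final int S_DEFERRED = -50; // illustrative value, an assumption

    final String uri;
    final int schedulingDirective; // a lower number means a higher priority
    boolean forceFetch;
    int fetchStatus;
    final Map<String, Object> data = new HashMap<>();

    PrerequisiteSketch(String uri, int schedulingDirective) {
        this.uri = uri;
        this.schedulingDirective = schedulingDirective;
    }

    PrerequisiteSketch markPrerequisite(String preq) {
        // Promote the prerequisite one level above this URI, but not below zero.
        int prereqPriority = Math.max(0, schedulingDirective - 1);
        PrerequisiteSketch prereq = new PrerequisiteSketch(preq, prereqPriority);
        prereq.forceFetch = true;
        // Remember the prerequisite on this URI and defer this URI.
        data.put(A_PREREQUISITE_URI, prereq);
        fetchStatus = S_DEFERRED;
        return prereq;
    }

    public static void main(String[] args) {
        PrerequisiteSketch page = new PrerequisiteSketch("http://example.com/", 3);
        PrerequisiteSketch dns = page.markPrerequisite("dns:example.com");
        System.out.println(dns.uri + " priority " + dns.schedulingDirective);
    }
}
```

The deferred URI keeps a reference to its prerequisite, which is what lets a later processor find and schedule it.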

After leaving the current processor chain (the FetchChain), the CrawlURI enters the next chain (the DispositionChain), where the CandidatesProcessor adds the prerequisite to a BdbWorkQueue. The relevant code:

@Override

    protected void innerProcess(final CrawlURI curi) throws InterruptedException {

        // Handle any prerequisites when S_DEFERRED for prereqs

        if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {

            CrawlURI prereq = curi.getPrerequisiteUri();

            prereq.setFullVia(curi); 

            sheetOverlaysManager.applyOverlaysTo(prereq);

            try {

                KeyedProperties.clearOverridesFrom(curi); 

                KeyedProperties.loadOverridesFrom(prereq);

                getCandidateChain().process(prereq, null);

                

                if(prereq.getFetchStatus()>=0) {

                    // add to the BdbWorkQueue

                    frontier.schedule(prereq);

                } else {

                    curi.setFetchStatus(S_PREREQUISITE_UNSCHEDULABLE_FAILURE);

                }

            } finally {

                KeyedProperties.clearOverridesFrom(prereq); 

                KeyedProperties.loadOverridesFrom(curi);

            }

            return;

        }

    // remainder of the method omitted

}
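
Putting the pieces together, the defer-then-retry cycle can be simulated with a plain queue. This is a deliberately simplified, single-host sketch. The class DeferralLoopSketch is hypothetical, and putting the prerequisite at the head of a deque is only a stand-in for the real priority-based scheduling in the frontier.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical single-host simulation of prerequisite deferral; not Heritrix code.
class DeferralLoopSketch {
    static List<String> fetchOrder(String host, String path) {
        Set<String> resolvedHosts = new HashSet<>();
        boolean robotsKnown = false;
        Deque<String> queue = new ArrayDeque<>();
        List<String> fetched = new ArrayList<>();
        String robots = "http://" + host + "/robots.txt";
        queue.add("http://" + host + path);

        while (!queue.isEmpty()) {
            String uri = queue.poll();
            if (uri.startsWith("dns:")) {
                // A dns: URI has no DNS precondition of its own.
                resolvedHosts.add(uri.substring("dns:".length()));
                fetched.add(uri);
            } else if (!resolvedHosts.contains(host)) {
                // Defer: put the URI back and schedule the DNS lookup ahead of it.
                queue.addFirst(uri);
                queue.addFirst("dns:" + host);
            } else if (uri.equals(robots)) {
                // A robots.txt fetch has no robots precondition of its own.
                robotsKnown = true;
                fetched.add(uri);
            } else if (!robotsKnown) {
                // Defer: put the URI back and schedule robots.txt ahead of it.
                queue.addFirst(uri);
                queue.addFirst(robots);
            } else {
                fetched.add(uri);
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        System.out.println(fetchOrder("example.com", "/index.html"));
    }
}
```

In this model the page is seen three times by the loop: twice it is deferred, and only on the third pass is it actually fetched.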

Back in the PreconditionEnforcer, the second precondition is the robots check. Its logic is similar to the DNS check. The boolean considerRobotsPreconditions(CrawlURI curi) method is as follows:

/**

     * Consider the robots precondition.

     *

     * @param curi CrawlURI we're checking for any required preconditions.

     * @return True, if this <code>curi</code> has a precondition or processing

     *         should be terminated for some other reason.  False if

     *         we can proceed to process this url.

     */

    protected boolean considerRobotsPreconditions(CrawlURI curi) {

        // treat /robots.txt fetches specially

        // to skip the robots check entirely: return false;

        UURI uuri = curi.getUURI();

        try {

            if (uuri != null && uuri.getPath() != null &&

                    curi.getUURI().getPath().equals("/robots.txt")) {

                // allow processing to continue

                // a robots.txt URI is itself a prerequisite

                curi.setPrerequisite(true);

                return false;

            }

        } catch (URIException e) {

            logger.severe("Failed get of path for " + curi);

        }

        

        CrawlServer cs = serverCache.getServerFor(curi.getUURI());

        // require /robots.txt if not present

        // check whether the robots.txt information has expired

        if (cs.isRobotsExpired(getRobotsValidityDurationSeconds())) {

            // Need to get robots

            if (logger.isLoggable(Level.FINE)) {

                logger.fine( "No valid robots for " + cs  +

                    "; deferring " + curi);

            }



            // Robots expired - should be refetched even though its already

            // crawled.

            try {

                String prereq = curi.getUURI().resolve("/robots.txt").toString();

                // set the prerequisite

                curi.markPrerequisite(prereq);

            }

            catch (URIException e1) {

                logger.severe("Failed resolve using " + curi);

                throw new RuntimeException(e1); // shouldn't ever happen

            }

            return true;

        }

        // test against robots.txt if available

        if (cs.isValidRobots()) {

            String ua = metadata.getUserAgent();

            RobotsPolicy robots = metadata.getRobotsPolicy();

            if(!robots.allows(ua, curi, cs.getRobotstxt())) {

                if(getCalculateRobotsOnly()) {

                    // annotate URI as excluded, but continue to process normally

                    curi.getAnnotations().add("robotExcluded");

                    return false; 

                }

                // mark as precluded; in FetchHTTP, this will

                // prevent fetching and cause a skip to the end

                // of processing (unless an intervening processor

                // overrules)

                curi.setFetchStatus(S_ROBOTS_PRECLUDED);

                curi.setError("robots.txt exclusion");

                logger.fine("robots.txt precluded " + curi);

                return true;

            }

            return false;

        }

        // No valid robots found => Attempt to get robots.txt failed

//        curi.skipToPostProcessing();

        curi.setFetchStatus(S_ROBOTS_PREREQUISITE_FAILURE);

        curi.setError("robots.txt prerequisite failed");

        if (logger.isLoggable(Level.FINE)) {

            logger.fine("robots.txt prerequisite failed " + curi);

        }

        return true;

    }
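
The prerequisite URI is obtained by resolving "/robots.txt" against the URI being processed. Heritrix does this with its own UURI class. The sketch below uses the JDK's java.net.URI instead, purely to show what the resolution produces. The class RobotsPrereqSketch and its method names are invented for this example.

```java
import java.net.URI;

// Illustration with java.net.URI rather than Heritrix's UURI; not Heritrix code.
class RobotsPrereqSketch {
    // Resolve the absolute path /robots.txt against the page URI.
    static String robotsUrlFor(String pageUrl) {
        return URI.create(pageUrl).resolve("/robots.txt").toString();
    }

    // A URI whose path is exactly /robots.txt has no robots precondition.
    static boolean isRobotsFetch(String url) {
        return "/robots.txt".equals(URI.create(url).getPath());
    }

    public static void main(String[] args) {
        System.out.println(robotsUrlFor("http://example.com/a/b.html?x=1"));
        System.out.println(isRobotsFetch("http://example.com/robots.txt"));
    }
}
```

Scheme, host and port are kept, while the original path and query are dropped, so every page on the same server maps to the same robots.txt prerequisite.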

The third precondition is authentication. The boolean credentialPrecondition(final CrawlURI curi) method is as follows:

/**

    * Consider credential preconditions.

    *

    * Looks to see if any credential preconditions (e.g. html form login

    * credentials) for this <code>CrawlServer</code>. If there are, have they

    * been run already? If not, make the running of these logins a precondition

    * of accessing any other url on this <code>CrawlServer</code>.

    *

    * <p>

    * One day, do optimization and avoid running the bulk of the code below.

    * Argument for running the code everytime is that overrides and refinements

    * may change what comes back from credential store.

    *

    * @param curi CrawlURI we're checking for any required preconditions.

    * @return True, if this <code>curi</code> has a precondition that needs to

    *         be met before we can proceed. False if we can proceed to process

    *         this url.

    */

    /**

     * Consider the case where the classKey differs but the host is actually the same; the host should be the basis.

     * @param curi

     * @return

     */

    protected boolean credentialPrecondition(final CrawlURI curi) {



        boolean result = false;

        

        CredentialStore cs = getCredentialStore();

        if (cs == null) {

            logger.severe("No credential store for " + curi);

            return result;

        }

        //System.out.println(cs.getAll().size());

        // iterate over the Collection<Credential> held by the CredentialStore cs

        for (Credential c: cs.getAll()) {

            // check whether the current CrawlURI is itself the prerequisite

            if (c.isPrerequisite(curi)) {

                // This credential has a prereq. and this curi is it.  Let it

                // through.  Add its avatar to the curi as a mark.  Also, does

                // this curi need to be posted?  Note, we do this test for

                // is it a prereq BEFORE we do the check that curi is of the

                // credential domain because such as yahoo have you go to

                // another domain altogether to login.

                // attach the current credential to the CrawlURI

                c.attach(curi);

                curi.setFetchType(CrawlURI.FetchType.HTTP_POST);

                break;

            }

            // the domain of Credential c must match the serverName of the CrawlServer for the current CrawlURI,

            // i.e. the credential's domain must be the same as the domain of the current CrawlURI

            if (!c.rootUriMatch(serverCache, curi)) {

                continue;

            }

            // Credential c must have a prerequisite (for form authentication, the login URL)

            if (!c.hasPrerequisite(curi)) {

                continue;

            }

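            // check whether authentication has already happened

            // (by this point the domain of the current CrawlURI is known to match the domain of Credential c)

            

            // get the CrawlServer for the current CrawlURI,

            // then iterate over its Set<Credential> credentials looking for one that matches Credential c

            // the outer loop covers every Credential from the configuration; the inner loop covers the Credentials already stored on the CrawlServer for the current CrawlURI

            // this relies on all CrawlURIs under the same domain sharing the same CrawlServer

            // (otherwise other CrawlURIs under the same domain would have to authenticate again)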

            if (!authenticated(c, curi)) {

                // Hasn't been authenticated.  Queue it and move on (Assumption

                // is that we can do one authentication at a time -- usually one

                // html form).

                // login URL

                String prereq = c.getPrerequisite(curi);

                if (prereq == null || prereq.length() <= 0) {

                    CrawlServer server = serverCache.getServerFor(curi.getUURI());

                    logger.severe(server.getName() + " has "

                        + " credential(s) of type " + c + " but prereq"

                        + " is null.");

                } else {

                    try {

                        // add the prerequisite

                        curi.markPrerequisite(prereq);

                    } catch (URIException e) {

                        logger.severe("unable to set credentials prerequisite "+prereq);

                        loggerModule.logUriError(e,curi.getUURI(),prereq);

                        return false; 

                    }

                    result = true;

                    if (logger.isLoggable(Level.FINE)) {

                        logger.fine("Queueing prereq " + prereq + " of type " +

                            c + " for " + curi);

                    }

                    // leave the loop

                    break;

                }

            }

        }

        return result;

    }

The boolean authenticated(final Credential credential, final CrawlURI curi) method checks whether authentication has already been done, that is, whether the CrawlServer already holds the given Credential:

/**

     * Has passed credential already been authenticated.

     *

     * @param credential Credential to test.

     * @param curi CrawlURI.

     * @return True if already run.

     */

    protected boolean authenticated(final Credential credential, final CrawlURI curi) {

        // get the CrawlServer for the CrawlURI

        CrawlServer server = serverCache.getServerFor(curi.getUURI());

        if (!server.hasCredentials()) {

            return false;

        }

        // the Set<Credential> credentials already persisted in the CrawlServer

        Set<Credential> credentials = server.getCredentials();

        for (Credential cred: credentials) {

            // same key and same type

            if (cred.getKey().equals(credential.getKey()) 

                    && cred.getClass().isInstance(credential)) {

                return true; 

            }

        }

        return false;

    }
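
The matching rule (same key, compatible class) can be tried out in isolation. In the sketch below, the classes Credential, FormLogin and BasicAuth are hypothetical stand-ins defined only for this example. They are not the Heritrix credential classes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical credential types for illustration; not Heritrix code.
class CredentialMatchSketch {
    static class Credential {
        final String key;
        Credential(String key) { this.key = key; }
    }
    static class FormLogin extends Credential {
        FormLogin(String key) { super(key); }
    }
    static class BasicAuth extends Credential {
        BasicAuth(String key) { super(key); }
    }

    // True if the server already holds a credential with the same key
    // whose class the candidate is an instance of.
    static boolean authenticated(Set<Credential> serverCredentials,
                                 Credential candidate) {
        for (Credential cred : serverCredentials) {
            if (cred.key.equals(candidate.key)
                    && cred.getClass().isInstance(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Credential> onServer = new HashSet<>();
        onServer.add(new FormLogin("login-1"));
        System.out.println(authenticated(onServer, new FormLogin("login-1")));
        System.out.println(authenticated(onServer, new BasicAuth("login-1")));
    }
}
```

Two credentials with the same key but unrelated types do not count as the same credential, which is why the class check sits next to the key comparison.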

---------------------------------------------------------------------------

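This Heritrix 3.1.0 source code analysis series is the author's original work.

When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

Link to this post: http://www.cnblogs.com/chenying99/archive/2013/04/30/3052319.html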
