Heritrix 3.1.0 Source Code Analysis (Part 3)

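If we approach Heritrix 3.1.0 only through its static logical structure, we miss how its objects interact; if we look only at the dynamic object structure, we lose sight of the system's overall logical outline.

Source code analysis therefore needs to cover both the static and the dynamic view, which makes the logic and the interactions easier to understand. This article takes that approach.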

This article examines how Spring manages the beans of Heritrix 3.1.0. The previous article gave a first look at the Spring container's configuration file.

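First, let us see how the Spring container loads the configuration file and how it is initialized. When we run the build action of a crawl job, the validateConfiguration() method of the CrawlJob object is called: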

    /**
     * Does the assembled ApplicationContext self-validate? Any failures
     * are reported as WARNING log events in the job log.
     *
     * TODO: make these severe?
     */
    public synchronized void validateConfiguration() {
        instantiateContainer();
        if(ac==null) {
            // fatal errors already encountered and reported
            return;
        }
        ac.validate();
        HashMap<String,Errors> allErrors = ac.getAllErrors();
        for(String name : allErrors.keySet()) {
            for(Object err : allErrors.get(name).getAllErrors()) {
               LOGGER.log(Level.WARNING,err.toString());
            }
        }
    }

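This method first loads the Spring configuration file and initializes the Spring container (instantiateContainer()), then validates the container (ac.validate()). The instantiateContainer() method looks like this: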
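The reporting loop in validateConfiguration() can be illustrated without Spring. The sketch below is only an approximation under stated assumptions: plain strings stand in for Spring's Errors objects, the class ValidationReportSketch, the bean names and the error texts are all made up, and, unlike the quoted method, each message is prefixed with the bean name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class ValidationReportSketch {

    /** Logs every error of every bean at WARNING level and returns how many were logged. */
    static int report(Map<String, List<String>> allErrors, Logger logger) {
        int count = 0;
        // entrySet() avoids the extra lookup that keySet() + get(name) would need
        for (Map.Entry<String, List<String>> entry : allErrors.entrySet()) {
            for (String err : entry.getValue()) {
                logger.log(Level.WARNING, entry.getKey() + ": " + err);
                count++;
            }
        }
        return count;
    }

    /** Runs report() on made-up data and returns the captured log lines. */
    static List<String> demo() {
        Map<String, List<String>> allErrors = new LinkedHashMap<>();
        allErrors.put("exampleBean", Arrays.asList("first problem", "second problem")); // hypothetical
        allErrors.put("cleanBean", new ArrayList<>()); // a bean without errors logs nothing

        final List<String> captured = new ArrayList<>();
        Logger logger = Logger.getAnonymousLogger();
        logger.setUseParentHandlers(false); // keep the demo off the console
        logger.addHandler(new Handler() {
            @Override public void publish(LogRecord record) {
                captured.add(record.getLevel() + " " + record.getMessage());
            }
            @Override public void flush() { }
            @Override public void close() { }
        });

        report(allErrors, logger);
        return captured;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

As in the original, a failed validation here only produces warnings; nothing is thrown, so the caller has to check the result separately.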

    /**
     * Can the configuration yield an assembled ApplicationContext?
     */
    public synchronized void instantiateContainer() {
        checkXML();
        if(ac==null) {
            try {
                ac = new PathSharingContext(new String[] {"file:"+primaryConfig.getAbsolutePath()},false,null);
                ac.addApplicationListener(this);
                ac.refresh();
                getCrawlController(); // trigger NoSuchBeanDefinitionException if no CC
                getJobLogger().log(Level.INFO,"Job instantiated");
            } catch (BeansException be) {
                // Calling doTeardown() and therefore ac.close() here sometimes
                // triggers an IllegalStateException and logs stack trace from
                // within spring, even if ac.isActive(). So, just null it.
                ac = null;
                beansException(be);
            }
        }
    }

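The method above loads the configuration file and registers the CrawlJob object as a listener on the context.

The Spring container in Heritrix 3.1.0 is a PathSharingContext, the system's own wrapper class. PathSharingContext extends Spring's FileSystemXmlApplicationContext and passes the configuration file locations to it through its constructor: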

    public PathSharingContext(String[] configLocations, boolean refresh, ApplicationContext parent) throws BeansException {
        super(configLocations, refresh, parent);
    }
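Because instantiateContainer() passes false as the refresh argument, the context is not refreshed during construction, so the CrawlJob can be registered as a listener before ac.refresh() is called explicitly. The toy sketch below shows the effect of that ordering: only a listener registered before the refresh sees the event the refresh publishes. ToyContext and its string events are hypothetical stand-ins, not Spring classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class DeferredRefreshSketch {

    /** Hypothetical miniature context: refresh() notifies whoever is registered at that moment. */
    static class ToyContext {
        private final List<Consumer<String>> listeners = new ArrayList<>();

        ToyContext(boolean refresh) {
            if (refresh) {
                refresh(); // nobody can have registered a listener yet
            }
        }

        void addListener(Consumer<String> listener) {
            listeners.add(listener);
        }

        void refresh() {
            for (Consumer<String> listener : listeners) {
                listener.accept("refreshed");
            }
        }
    }

    /** Returns the events a listener receives when it is added right after construction. */
    static List<String> eventsSeen(boolean refreshInConstructor) {
        List<String> seen = new ArrayList<>();
        ToyContext ctx = new ToyContext(refreshInConstructor);
        ctx.addListener(seen::add);
        if (!refreshInConstructor) {
            ctx.refresh(); // same order as instantiateContainer(): listener first, then refresh
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("deferred refresh: " + eventsSeen(false));
        System.out.println("refresh in constructor: " + eventsSeen(true));
    }
}
```

With the deferred refresh the listener receives one event; with the refresh inside the constructor it receives none.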

When we run the launch action of a crawl job, the launch() method of the CrawlJob object is called:

    /**
     * Launch a crawl into 'running' status, assembling if necessary.
     *
     * (Note the crawl may have been configured to start in a 'paused'
     * state.)
     */
    public synchronized void launch() {
        if (isProfile()) {
            throw new IllegalArgumentException("Can't launch profile" + this);
        }

        if(isRunning()) {
            getJobLogger().log(Level.SEVERE,"Can't relaunch running job");
            return;
        } else {
            CrawlController cc = getCrawlController();
            if(cc!=null && cc.hasStarted()) {
                getJobLogger().log(Level.SEVERE,"Can't relaunch previously-launched assembled job");
                return;
            }
        }

        validateConfiguration();
        if(!hasValidApplicationContext()) {
            getJobLogger().log(Level.SEVERE,"Can't launch problem configuration");
            return;
        }

        //final String job = changeState(j, ACTIVE);

        // this temporary thread ensures all crawl-created threads
        // land in the AlertThreadGroup, to assist crawl-wide
        // logging/alerting
        alertThreadGroup = new AlertThreadGroup(getShortName());
        alertThreadGroup.addLogger(getJobLogger());
        Thread launcher = new Thread(alertThreadGroup, getShortName()+" launchthread") {
            public void run() {
                CrawlController cc = getCrawlController();
                startContext();
                if(cc!=null) {
                    cc.requestCrawlStart();
                }
            }
        };
        getJobLogger().log(Level.INFO,"Job launched");
        scanJobLog();
        launcher.start();
        // look busy (and give startContext/crawlStart a chance)
        try {
            Thread.sleep(1500);
        } catch (InterruptedException e) {
            // do nothing
        }
    }
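The temporary launcher thread works because of a JDK rule: a Thread created without an explicit group joins the group of the thread that creates it. The sketch below demonstrates that rule with a plain java.lang.ThreadGroup standing in for Heritrix's AlertThreadGroup; the class ThreadGroupInheritanceSketch and its method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

class ThreadGroupInheritanceSketch {

    /**
     * Starts a launcher thread inside a new group, lets it spawn a worker
     * without naming any group, and returns the name of the worker's group.
     */
    static String groupOfWorkerSpawnedIn(String groupName) {
        ThreadGroup group = new ThreadGroup(groupName); // stand-in for AlertThreadGroup
        AtomicReference<String> workerGroup = new AtomicReference<>();

        Thread launcher = new Thread(group, () -> {
            // No group is passed here, so the worker inherits the launcher's group.
            Thread worker = new Thread(() ->
                    workerGroup.set(Thread.currentThread().getThreadGroup().getName()));
            worker.start();
            joinQuietly(worker);
        }, groupName + " launchthread");

        launcher.start();
        joinQuietly(launcher);
        return workerGroup.get();
    }

    private static void joinQuietly(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(groupOfWorkerSpawnedIn("myjob"));
    }
}
```

Any thread that the crawl later creates from inside the launcher, directly or indirectly, ends up in the same group, which is what lets a group-level component observe all of them.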

The important call here is startContext(), made inside the launcher thread's run() method:

    /**
     * Start the context, catching and reporting any BeansExceptions.
     */
    protected synchronized void startContext() {
        try {
            ac.start();

            // job log file covering just this launch
            getJobLogger().removeHandler(currentLaunchJobLogHandler);
            File f = new File(ac.getCurrentLaunchDir(), "job.log");
            currentLaunchJobLogHandler = new FileHandler(f.getAbsolutePath(), true);
            currentLaunchJobLogHandler.setFormatter(new JobLogFormatter());
            getJobLogger().addHandler(currentLaunchJobLogHandler);

        } catch (BeansException be) {
            doTeardown();
            beansException(be);
        } catch (Exception e) {
            LOGGER.log(Level.SEVERE,e.getClass().getSimpleName()+": "+e.getMessage(),e);
            try {
                doTeardown();
            } catch (Exception e2) {
                e2.printStackTrace(System.err);
            }
        }
    }
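Besides starting the context, startContext() points the job logger at a job.log file inside the current launch directory. A minimal sketch of that handler swap, using only java.util.logging, might look as follows. PerLaunchLogSketch is a hypothetical class, SimpleFormatter stands in for Heritrix's JobLogFormatter, temporary directories stand in for launch directories, and closing the old handler is an addition of this sketch (the quoted method only removes it).

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.logging.FileHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

class PerLaunchLogSketch {
    private final Logger jobLogger = Logger.getAnonymousLogger();
    private Handler currentLaunchHandler; // null before the first launch

    PerLaunchLogSketch() {
        jobLogger.setUseParentHandlers(false); // keep the demo off the console
    }

    /** Detaches the previous launch's handler (if any) and logs to launchDir/job.log. */
    void switchToLaunchDir(File launchDir) throws IOException {
        closeCurrentHandler();
        File f = new File(launchDir, "job.log");
        currentLaunchHandler = new FileHandler(f.getAbsolutePath(), true); // append mode
        currentLaunchHandler.setFormatter(new SimpleFormatter());
        jobLogger.addHandler(currentLaunchHandler);
    }

    /** Removes and closes the active handler, if there is one. */
    void closeCurrentHandler() {
        if (currentLaunchHandler != null) {
            jobLogger.removeHandler(currentLaunchHandler);
            currentLaunchHandler.close();
            currentLaunchHandler = null;
        }
    }

    /** Simulates two launches and returns the contents of both job.log files. */
    static String[] demo() {
        try {
            File first = Files.createTempDirectory("launch1").toFile();
            File second = Files.createTempDirectory("launch2").toFile();
            PerLaunchLogSketch job = new PerLaunchLogSketch();
            job.switchToLaunchDir(first);
            job.jobLogger.log(Level.INFO, "first launch");
            job.switchToLaunchDir(second);
            job.jobLogger.log(Level.INFO, "second launch");
            job.closeCurrentHandler();
            return new String[] { read(first), read(second) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String read(File launchDir) throws IOException {
        byte[] bytes = Files.readAllBytes(new File(launchDir, "job.log").toPath());
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Each launch directory ends up with a log covering only its own launch, which matches the "job log file covering just this launch" comment in the quoted code.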

This method calls the start() method of the PathSharingContext object:

    @Override
    public void start() {
        initLaunchDir();
        super.start();
    }

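Inside this method, super.start() calls start() on every bean in the Spring container that implements the Lifecycle interface.

The Lifecycle interface declares the following methods, which define the lifecycle of a bean component: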

public interface Lifecycle {

    /**
     * Start this component.
     * Should not throw an exception if the component is already running.
     * <p>In the case of a container, this will propagate the start signal
     * to all components that apply.
     */
    void start();

    /**
     * Stop this component.
     * Should not throw an exception if the component isn't started yet.
     * <p>In the case of a container, this will propagate the stop signal
     * to all components that apply.
     */
    void stop();

    /**
     * Check whether this component is currently running.
     * <p>In the case of a container, this will return <code>true</code>
     * only if <i>all</i> components that apply are currently running.
     * @return whether the component is currently running
     */
    boolean isRunning();

}

From this we can see that Heritrix 3.1.0 relies on the Spring container to manage the lifecycle of its beans in one place (mainly their initialization state).
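To make the propagation concrete, here is a toy container that forwards start() to every registered bean implementing a Lifecycle interface with the same three methods. It is a simplified stand-in, not Spring's implementation: MiniContext, SimpleBean and the dependency declared in the demo are hypothetical, and the bean names are borrowed from the crawler configuration purely for illustration. The sketch also starts a bean's declared dependencies before the bean itself, which is an assumption about ordering that the code quoted above does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LifecycleSketch {

    /** Same three methods as the Lifecycle interface quoted above. */
    interface Lifecycle {
        void start();
        void stop();
        boolean isRunning();
    }

    /** Hypothetical bean that only records whether it is running. */
    static class SimpleBean implements Lifecycle {
        private boolean running;
        public void start() { running = true; }
        public void stop() { running = false; }
        public boolean isRunning() { return running; }
    }

    /** Toy container: named beans plus, per bean, the names it depends on. */
    static class MiniContext {
        private final Map<String, Object> beans = new LinkedHashMap<>();
        private final Map<String, List<String>> dependencies = new LinkedHashMap<>();
        private final List<String> startOrder = new ArrayList<>();

        void register(String name, Object bean, String... dependsOn) {
            beans.put(name, bean);
            dependencies.put(name, Arrays.asList(dependsOn));
        }

        /** Starts all Lifecycle beans and returns their names in start order. */
        List<String> start() {
            for (String name : beans.keySet()) {
                doStart(name);
            }
            return startOrder;
        }

        // Assumes the dependency graph has no cycles.
        private void doStart(String name) {
            Object bean = beans.get(name);
            if (!(bean instanceof Lifecycle)) {
                return; // unknown name or a bean without a lifecycle
            }
            Lifecycle lifecycle = (Lifecycle) bean;
            if (lifecycle.isRunning()) {
                return; // already started, possibly as someone's dependency
            }
            for (String dependency : dependencies.get(name)) {
                doStart(dependency);
            }
            lifecycle.start();
            startOrder.add(name);
            System.out.println("name:" + name + "||" + bean.getClass().getName());
        }
    }

    static List<String> demoStartOrder() {
        MiniContext ctx = new MiniContext();
        ctx.register("scope", new SimpleBean(), "loggerModule"); // assumed dependency
        ctx.register("loggerModule", new SimpleBean());
        ctx.register("description", "not a Lifecycle bean");     // skipped by start()
        return ctx.start();
    }

    public static void main(String[] args) {
        System.out.println(demoStartOrder());
    }
}
```

In the demo, loggerModule is started before scope even though scope was registered first, and the plain String bean is skipped.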

The printed output below shows which beans had their start method called:

name:scope
name:loggerModule||org.archive.crawler.reporting.CrawlerLoggerModule
name:scope||org.archive.modules.deciderules.DecideRuleSequence

name:candidateScoper
name:candidateScoper||org.archive.crawler.prefetch.CandidateScoper

name:preparer
name:preparer||org.archive.crawler.prefetch.FrontierPreparer

name:candidateProcessors
name:candidateProcessors||org.archive.modules.CandidateChain

name:preselector
name:preselector||org.archive.crawler.prefetch.MyPreselector

name:preconditions
name:bdb||org.archive.bdb.BdbModule
name:serverCache||org.archive.modules.net.BdbServerCache
name:preconditions||org.archive.crawler.prefetch.PreconditionEnforcer

name:fetchDns
name:fetchDns||org.archive.modules.fetcher.FetchDNS

name:fetchHttp
name:cookieStorage||org.archive.modules.fetcher.BdbCookieStorage
name:fetchHttp||org.archive.modules.fetcher.FetchHTTP

name:extractorHttp
name:statisticsTracker||org.archive.crawler.reporting.StatisticsTracker
name:extractorHtml||org.archive.modules.extractor.ExtractorHTML
name:extractorCss||org.archive.modules.extractor.ExtractorCSS
name:extractorJs||org.archive.modules.extractor.ExtractorJS
name:extractorSwf||org.archive.modules.extractor.ExtractorSWF
name:fetchProcessors||org.archive.modules.FetchChain
name:warcWriter||org.archive.modules.writer.MyWriterProcessor
name:candidates||org.archive.crawler.postprocessor.CandidatesProcessor
name:disposition||org.archive.crawler.postprocessor.DispositionProcessor
name:dispositionProcessors||org.archive.modules.DispositionChain
name:crawlController||org.archive.crawler.framework.CrawlController
name:uriUniqFilter||org.archive.crawler.util.BdbUriUniqFilter
name:frontier||org.archive.crawler.frontier.BdbFrontier

name:actionDirectory
name:actionDirectory||org.archive.crawler.framework.ActionDirectory

name:checkpointService
name:checkpointService||org.archive.crawler.framework.CheckpointService

--------------------------------------------------------------------------- 

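This Heritrix 3.1.0 source code analysis series is my own original work.

When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯.

Article link: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025410.html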
