Heritrix 3.1.0 Source Code Analysis (Part 4)

Looking at a class's methods in isolation makes their intent hard to grasp; once we trace how objects interact with one another, the purpose of each method becomes much easier to understand.

When studying how objects communicate, the first thing to understand is each object's state. The most basic starting point is the constructor (or initialization method) and how the state changes after the relevant methods run; after that come the methods' input parameters (the messages being sent).

When we create a crawl job in the web console, Heritrix 3.1.0 represents it with a crawl job class; all of the job's properties and behavior are encapsulated in that class.

That class is CrawlJob (org.archive.crawler.framework). Let's first get familiar with its members and methods.

The CrawlJob class implements two interfaces, Comparable<CrawlJob> and ApplicationListener<ApplicationEvent>. The former is clearly for ordering jobs; the latter is Spring's event listener interface (the event listener pattern).
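As a quick reminder of that listener contract, here is a minimal sketch using only plain Spring types (this is not Heritrix code): any object registered with the container is called back through onApplicationEvent() for every event the container publishes.

import org.springframework.context.ApplicationEvent;
import org.springframework.context.ApplicationListener;

// Minimal sketch, not a Heritrix class: a listener registered with the
// container receives every event the container publishes.
public class LoggingListener implements ApplicationListener<ApplicationEvent> {
    public void onApplicationEvent(ApplicationEvent event) {
        // CrawlJob's own onApplicationEvent() presumably filters for the event
        // types it cares about (such as crawl state changes).
        System.out.println("received event: " + event.getClass().getSimpleName());
    }
}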

CrawlJob has the following fields:

File primaryConfig;
PathSharingContext ac;
int launchCount;
boolean isLaunchInfoPartial;
DateTime lastLaunch;
AlertThreadGroup alertThreadGroup;
DateTime xmlOkAt = new DateTime(0L);
Logger jobLogger;

The declarations alone do not tell us what these fields are for, so let's look at the constructor.

public CrawlJob(File cxml) {
    primaryConfig = cxml;
    isLaunchInfoPartial = false;
    scanJobLog(); // XXX look at launch directories instead/first?
    alertThreadGroup = new AlertThreadGroup(getShortName());
}

The constructor stores the job configuration file crawler-beans.cxml in primaryConfig, sets isLaunchInfoPartial (whether only partial launch information is available) to false, calls scanJobLog() to scan the job log, and creates the AlertThreadGroup alertThreadGroup, a thread group that is also used for publishing log records and alerts.
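So a CrawlJob is essentially a handle on one crawler-beans.cxml file. A minimal usage sketch follows; the path is hypothetical (in a normal installation the Heritrix Engine creates CrawlJob instances for the job directories it manages), and it assumes getShortName() is accessible, as its use in the constructor above suggests.

import java.io.File;

import org.archive.crawler.framework.CrawlJob;

public class CrawlJobConstructionSketch {
    public static void main(String[] args) {
        // Hypothetical path; point it at a real job's crawler-beans.cxml.
        File cxml = new File("/opt/heritrix/jobs/myjob/crawler-beans.cxml");
        CrawlJob job = new CrawlJob(cxml);
        System.out.println(job.getShortName());
    }
}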

When we run the job's build operation, what actually executes is the CrawlJob object's void validateConfiguration() method:

/**
 * Does the assembled ApplicationContext self-validate? Any failures
 * are reported as WARNING log events in the job log.
 *
 * TODO: make these severe?
 */
public synchronized void validateConfiguration() {
    instantiateContainer();
    if(ac==null) {
        // fatal errors already encountered and reported
        return;
    }
    ac.validate();
    HashMap<String,Errors> allErrors = ac.getAllErrors();
    for(String name : allErrors.keySet()) {
        for(Object err : allErrors.get(name).getAllErrors()) {
            LOGGER.log(Level.WARNING,err.toString());
        }
    }
}

validateConfiguration() first calls void instantiateContainer(), which instantiates the PathSharingContext ac (a wrapper around the Spring container) and registers the current CrawlJob object as a Spring application listener:

/**
 * Can the configuration yield an assembled ApplicationContext?
 */
public synchronized void instantiateContainer() {
    checkXML();
    if(ac==null) {
        try {
            ac = new PathSharingContext(new String[] {"file:"+primaryConfig.getAbsolutePath()},false,null);
            ac.addApplicationListener(this);
            ac.refresh();
            getCrawlController(); // trigger NoSuchBeanDefinitionException if no CC
            getJobLogger().log(Level.INFO,"Job instantiated");
        } catch (BeansException be) {
            // Calling doTeardown() and therefore ac.close() here sometimes
            // triggers an IllegalStateException and logs stack trace from
            // within spring, even if ac.isActive(). So, just null it.
            ac = null;
            beansException(be);
        }
    }
}
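The construct/listen/refresh sequence is ordinary Spring usage. A rough analogue using only standard Spring classes (a sketch of the pattern, not how Heritrix itself builds its context, and with a placeholder path) would be:

import org.springframework.context.ApplicationEvent;
import org.springframework.context.ApplicationListener;
import org.springframework.context.support.FileSystemXmlApplicationContext;

public class ContainerSketch {
    public static void main(String[] args) {
        // refresh=false, as in the PathSharingContext call above, so listeners
        // can be registered before any beans are instantiated
        FileSystemXmlApplicationContext ctx = new FileSystemXmlApplicationContext(
                new String[] {"file:/path/to/crawler-beans.cxml"}, false);
        ctx.addApplicationListener(new ApplicationListener<ApplicationEvent>() {
            public void onApplicationEvent(ApplicationEvent event) {
                System.out.println("event: " + event.getClass().getSimpleName());
            }
        });
        ctx.refresh(); // parse the XML, instantiate singletons, publish the refresh event
        ctx.close();
    }
}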

Next the PathSharingContext ac is validated; the following is PathSharingContext's own validate() method:

//
// Cascading self-validation
//
HashMap<String,Errors> allErrors; // bean name -> Errors

public void validate() {
    allErrors = new HashMap<String,Errors>();

    for(Entry<String, HasValidator> entry : getBeansOfType(HasValidator.class).entrySet()) {
        String name = entry.getKey();
        HasValidator hv = entry.getValue();
        Validator v = hv.getValidator();
        Errors errors = new BeanPropertyBindingResult(hv,name);
        v.validate(hv, errors);
        if(errors.hasErrors()) {
            allErrors.put(name,errors);
        }
    }
    for(String name : allErrors.keySet()) {
        for(Object obj : allErrors.get(name).getAllErrors()) {
            LOGGER.fine("validation error for '"+name+"': "+obj);
        }
    }
}
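The loop above asks every bean that exposes a validator (through Heritrix's HasValidator interface) to check itself with a standard Spring Validator, collecting the failures per bean name. The following stand-alone sketch shows that per-bean step with hypothetical names (SeedListConfig and seedListValidator() are illustrations, not Heritrix classes):

import org.springframework.validation.BeanPropertyBindingResult;
import org.springframework.validation.Errors;
import org.springframework.validation.Validator;

public class ValidationSketch {
    // hypothetical settings bean; in Heritrix the beans implement HasValidator
    static class SeedListConfig {
        String seedsFile; // pretend this is a required setting
    }

    static Validator seedListValidator() {
        return new Validator() {
            public boolean supports(Class<?> clazz) {
                return SeedListConfig.class.isAssignableFrom(clazz);
            }
            public void validate(Object target, Errors errors) {
                SeedListConfig cfg = (SeedListConfig) target;
                if (cfg.seedsFile == null || cfg.seedsFile.isEmpty()) {
                    errors.reject("seedsFile.missing", "no seeds file configured");
                }
            }
        };
    }

    public static void main(String[] args) {
        SeedListConfig cfg = new SeedListConfig();                      // invalid: no seeds file
        Errors errors = new BeanPropertyBindingResult(cfg, "seedList"); // same class validate() uses
        seedListValidator().validate(cfg, errors);
        if (errors.hasErrors()) {
            System.out.println(errors.getAllErrors());                  // Heritrix logs these as WARNINGs
        }
    }
}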

If no exception occurs, the CrawlJob's getJobStatusDescription() now reports Ready.

The next step is the job's launch operation, which actually executes the CrawlJob object's void launch() method:

/**
 * Launch a crawl into 'running' status, assembling if necessary.
 *
 * (Note the crawl may have been configured to start in a 'paused'
 * state.)
 */
public synchronized void launch() {
    if (isProfile()) {
        throw new IllegalArgumentException("Can't launch profile" + this);
    }

    if(isRunning()) {
        getJobLogger().log(Level.SEVERE,"Can't relaunch running job");
        return;
    } else {
        CrawlController cc = getCrawlController();
        if(cc!=null && cc.hasStarted()) {
            getJobLogger().log(Level.SEVERE,"Can't relaunch previously-launched assembled job");
            return;
        }
    }

    validateConfiguration();
    if(!hasValidApplicationContext()) {
        getJobLogger().log(Level.SEVERE,"Can't launch problem configuration");
        return;
    }

    //final String job = changeState(j, ACTIVE);

    // this temporary thread ensures all crawl-created threads
    // land in the AlertThreadGroup, to assist crawl-wide
    // logging/alerting
    alertThreadGroup = new AlertThreadGroup(getShortName());
    alertThreadGroup.addLogger(getJobLogger());
    Thread launcher = new Thread(alertThreadGroup, getShortName()+" launchthread") {
        public void run() {
            CrawlController cc = getCrawlController();
            startContext();
            if(cc!=null) {
                cc.requestCrawlStart();
            }
        }
    };
    getJobLogger().log(Level.INFO,"Job launched");
    scanJobLog();
    launcher.start();
    // look busy (and give startContext/crawlStart a chance)
    try {
        Thread.sleep(1500);
    } catch (InterruptedException e) {
        // do nothing
    }
}
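The comment about the "temporary thread" relies on plain java.lang.ThreadGroup behavior: a thread created without an explicit group joins its creator's group, so everything spawned while the crawl starts up ends up in the AlertThreadGroup. A small sketch with only standard classes (no Heritrix types) shows the mechanism:

public class ThreadGroupSketch {
    public static void main(String[] args) throws InterruptedException {
        ThreadGroup group = new ThreadGroup("myjob");   // stand-in for AlertThreadGroup
        Thread launcher = new Thread(group, new Runnable() {
            public void run() {
                // a thread created without an explicit group inherits the
                // creating thread's group, i.e. "myjob"
                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        System.out.println("worker belongs to group: "
                                + Thread.currentThread().getThreadGroup().getName());
                    }
                });
                worker.start();
            }
        }, "myjob launchthread");
        launcher.start();
        launcher.join();
    }
}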

The key calls are void startContext() inside the launcher thread and the CrawlController object's void requestCrawlStart() method.

startContext() starts the Spring container, which in turn starts every bean implementing the Lifecycle interface by calling its start() method:

/**
 * Start the context, catching and reporting any BeansExceptions.
 */
protected synchronized void startContext() {
    try {
        ac.start();

        // job log file covering just this launch
        getJobLogger().removeHandler(currentLaunchJobLogHandler);
        File f = new File(ac.getCurrentLaunchDir(), "job.log");
        currentLaunchJobLogHandler = new FileHandler(f.getAbsolutePath(), true);
        currentLaunchJobLogHandler.setFormatter(new JobLogFormatter());
        getJobLogger().addHandler(currentLaunchJobLogHandler);

    } catch (BeansException be) {
        doTeardown();
        beansException(be);
    } catch (Exception e) {
        LOGGER.log(Level.SEVERE,e.getClass().getSimpleName()+": "+e.getMessage(),e);
        try {
            doTeardown();
        } catch (Exception e2) {
            e2.printStackTrace(System.err);
        }
    }
}
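What ac.start() triggers is Spring's Lifecycle callback: every singleton bean implementing org.springframework.context.Lifecycle gets start() called, and stop() when the context is stopped or closed. A minimal sketch of such a bean (an illustration, not an actual Heritrix class):

import org.springframework.context.Lifecycle;

public class ExampleLifecycleBean implements Lifecycle {
    private volatile boolean running;

    public void start() {            // invoked via ac.start()
        running = true;
        System.out.println("component started");
    }

    public void stop() {             // invoked when the context is stopped/closed
        running = false;
        System.out.println("component stopped");
    }

    public boolean isRunning() {
        return running;
    }
}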

The CrawlController object's void requestCrawlStart() method:

/**
 * Operator requested crawl begin
 */
public void requestCrawlStart() {
    hasStarted = true;
    sendCrawlStateChangeEvent(State.PREPARING, CrawlStatus.PREPARING);

    if(recoveryCheckpoint==null) {
        // only announce (trigger scheduling of) seeds
        // when doing a cold (non-recovery) start
        getSeeds().announceSeeds();
    }

    setupToePool();

    // A proper exit will change this value.
    this.sExit = CrawlStatus.FINISHED_ABNORMAL;

    if (getPauseAtStart()) {
        // frontier is already paused unless started, so just
        // 'complete'/ack pause
        completePause();
    } else {
        getFrontier().run();
    }
}

This method announces the seeds from the seed file and then sets up the worker (toe) thread pool:

protected void setupToePool() {
    toePool = new ToePool(alertThreadGroup,this);
    // TODO: make # of toes self-optimizing
    toePool.setSize(getMaxToeThreads());
    toePool.waitForAll();
}
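ToePool is Heritrix's own pool of "toe threads", each of which repeatedly takes the next URI from the frontier and processes it. The sketch below illustrates only the general fixed-size worker-pool idea with standard java.util.concurrent classes; it is not ToePool's implementation, and the queue of strings merely stands in for the frontier:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class WorkerPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<String> frontier = new LinkedBlockingQueue<String>();
        frontier.add("http://example.com/");
        frontier.add("http://example.org/");

        int maxToeThreads = 2; // analogous in spirit to getMaxToeThreads() above
        ExecutorService pool = Executors.newFixedThreadPool(maxToeThreads);
        for (int i = 0; i < maxToeThreads; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    String uri;
                    // each worker repeatedly takes the next URI and "processes" it
                    while ((uri = frontier.poll()) != null) {
                        System.out.println(Thread.currentThread().getName() + " fetched " + uri);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}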

When we run the job's unpause operation, what actually executes is the CrawlController object's void requestCrawlResume() method:

/**
 * Resume crawl from paused state
 */
public void requestCrawlResume() {
    if (state != State.PAUSING && state != State.PAUSED) {
        // Can't resume if not been told to pause
        return;
    }

    assert toePool != null;

    Frontier f = getFrontier();
    f.unpause();
    sendCrawlStateChangeEvent(State.RUNNING, CrawlStatus.RUNNING);
}

The pause operation maps to CrawlController's void requestCrawlPause():

/**
 * Stop the crawl temporarly.
 */
public synchronized void requestCrawlPause() {
    if (state == State.PAUSING || state == State.PAUSED) {
        // Already about to pause
        return;
    }
    sExit = CrawlStatus.WAITING_FOR_PAUSE;
    getFrontier().pause();
    sendCrawlStateChangeEvent(State.PAUSING, this.sExit);
    // wait for pause to come via frontier changes
}

The terminate operation maps to the CrawlJob object's void terminate():

public void terminate() {
    getCrawlController().requestCrawlStop();
}

which in turn calls the CrawlController object's void requestCrawlStop() method:

/**
 * Operator requested for crawl to stop.
 */
public synchronized void requestCrawlStop() {
    if(state == State.STOPPING) {
        // second stop request; nudge the threads with interrupts
        getToePool().cleanup();
    }
    requestCrawlStop(CrawlStatus.ABORTED);
}

The teardown operation maps to the CrawlJob object's boolean teardown():

/**
 * Ensure a fresh start for any configuration changes or relaunches,
 * by stopping and discarding an existing ApplicationContext.
 *
 * @return true if teardown is complete when method returns, false if still in progress
 */
public synchronized boolean teardown() {
    CrawlController cc = getCrawlController();
    if (cc != null) {
        cc.requestCrawlStop();
        needTeardown = true;

        // wait up to 3 seconds for stop
        for(int i = 0; i < 11; i++) {
            if(cc.isStopComplete()) {
                break;
            }
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                // do nothing
            }
        }

        if (cc.isStopComplete()) {
            doTeardown();
        }
    }

    assert needTeardown == (ac != null);
    return !needTeardown;
}
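Putting the operations together: using only the methods quoted in this post, driving a job programmatically might look roughly like the sketch below. The path is hypothetical, error handling and timing are glossed over, and in practice the Heritrix web UI / Engine performs these steps for you.

import java.io.File;

import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.framework.CrawlJob;

public class JobLifecycleSketch {
    public static void main(String[] args) throws Exception {
        // hypothetical job directory; normally managed by the Heritrix Engine
        CrawlJob job = new CrawlJob(new File("/opt/heritrix/jobs/myjob/crawler-beans.cxml"));

        job.validateConfiguration();      // "build": instantiateContainer() + ac.validate()
        job.launch();                     // "launch": startContext() + requestCrawlStart()

        CrawlController cc = job.getCrawlController();
        cc.requestCrawlPause();           // "pause"
        cc.requestCrawlResume();          // "unpause"

        job.terminate();                  // "terminate": requestCrawlStop(CrawlStatus.ABORTED)
        while (!job.teardown()) {         // "teardown": returns false while stop is still in progress
            Thread.sleep(500);
        }
    }
}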

---------------------------------------------------------------------------

This Heritrix 3.1.0 source code analysis series is my original work.

Please credit the source when reposting: 博客园 刺猬的温驯

Original post: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025413.html
