Nutch 1.7 Source Code Revisited --- 17: More on Parse

The previous section mainly covered HTML parsing. Here we continue tracing the code.

----------------------------------------------------  

parseResult = runParser(parsers[i], content);

The previous section stopped here, with runParser returning parseResult.

That result is then checked and returned:

 if (parseResult != null && !parseResult.isEmpty())

        return parseResult; 

In other words, the first parser that yields a non-null, non-empty parseResult determines what is returned.
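
The "try each parser, return the first usable result" pattern above can be sketched in isolation. This is only an illustration under stated assumptions: the class `FirstNonEmptyParse`, the use of `Function` as a stand-in for a parser, and the `Map<String, String>` result type are all hypothetical and are not Nutch types. Returning an empty map when every parser fails is also an assumption of this sketch, not a description of what Nutch does in that case.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the parser fallback loop; not Nutch code.
class FirstNonEmptyParse {

    /**
     * Tries each parser in order and returns the first result that is
     * neither null nor empty. Returns an empty map if none qualifies
     * (an assumption of this sketch).
     */
    static Map<String, String> parse(List<Function<String, Map<String, String>>> parsers,
                                     String content) {
        for (Function<String, Map<String, String>> parser : parsers) {
            Map<String, String> result = parser.apply(content);
            if (result != null && !result.isEmpty()) {
                return result; // first usable result wins
            }
        }
        return Collections.emptyMap();
    }

    public static void main(String[] args) {
        Function<String, Map<String, String>> failing = c -> null;
        Function<String, Map<String, String>> empty = c -> Collections.emptyMap();
        Function<String, Map<String, String>> working =
            c -> Collections.singletonMap("text", c.trim());

        List<Function<String, Map<String, String>>> parsers =
            Arrays.asList(failing, empty, working);

        // The first two parsers are skipped; the third one supplies the result.
        System.out.println(parse(parsers, "  hello  "));
    }
}
```

The order of the list matters: a later parser is only consulted when every earlier one produced nothing usable.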

------------------------------------------

With that, the following code has finished executing:

 

ParseResult parseResult = null;

    try {

      parseResult = new ParseUtil(getConf()).parse(content);

    } catch (Exception e) {

      LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));

      return;

    }

Finally, the various pieces of data are output. The code is as follows:

for (Entry<Text, Parse> entry : parseResult) {

     

      Text url = entry.getKey();

      Parse parse = entry.getValue();

      ParseStatus parseStatus = parse.getData().getStatus();

      long start = System.currentTimeMillis();

      reporter.incrCounter("ParserStatus", ParseStatus.majorCodes[parseStatus.getMajorCode()], 1);

      if (!parseStatus.isSuccess()) {

        LOG.warn("Error parsing: " + key + ": " + parseStatus);

        parse = parseStatus.getEmptyParse(getConf());

      }

      // pass segment name to parse data

      parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,

                                           getConf().get(Nutch.SEGMENT_NAME_KEY));

      // compute the new signature

      byte[] signature = 

        SignatureFactory.getSignature(getConf()).calculate(content, parse); 

      parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,

          StringUtil.toHexString(signature));

      try {

        scfilters.passScoreAfterParsing(url, content, parse);

      } catch (ScoringFilterException e) {

        if (LOG.isWarnEnabled()) {

          LOG.warn("Error passing score: "+ url +": "+e.getMessage());

        }

      }

      long end = System.currentTimeMillis();

      LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);

      output.collect(url, new ParseImpl(new ParseText(parse.getText()), 

                                        parse.getData(), parse.isCanonical()));

    }
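
The "compute the new signature" step above turns the page into a byte array and then stores it as a hex string. A minimal sketch of that idea follows. The choice of MD5 is an assumption: which algorithm `SignatureFactory` actually selects depends on the configuration. The class `SignatureSketch` and its methods `md5` and `toHex` are hypothetical helpers written for this example, not Nutch APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of "hash the content, store it as hex"; not Nutch code.
class SignatureSketch {

    /** Computes an MD5 digest of the raw bytes (MD5 is an assumption of this sketch). */
    static byte[] md5(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Converts bytes to a lowercase hex string, two characters per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        // A 16-byte digest always prints as 32 hex characters.
        System.out.println(toHex(md5(page)));
    }
}
```

Because the digest depends only on the bytes fed in, two fetches of an unchanged page produce the same string, which is what makes such a value useful for detecting duplicates.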

Setting the remaining details aside, let's look directly at what information actually gets output.

--------------------
url---http://www.redis.io/
parse.getText()---Redis Commands Clients Documentation Community Download Issues Support License Redis is an open source, BSD licensed, advanced key-value cache and store . It is often referred to as a data structure server since keys can contain strings , hashes , lists , sets , sorted sets , bitmaps and hyperloglogs . Learn more → Try it Ready for a test drive? Check this interactive tutorial that will walk you through the most important features of Redis. Download it Redis 2.8.17 is the latest stable version. Interested in release candidates or unstable versions? Check the downloads page. Quick links The Redis Twitter account is a good source of fresh info. Our code is at Github for you to follow the development daily. Get help or help other users subscribing to the Redis Google Group , we are 5000 and counting! What people are saying More... This website is open source software developed by Citrusbyte . The Redis logo was designed by Carlos Prioglio . See more credits . Sponsored by
parse.getData()---Version: 5
Status: success(1,0)
Title: Redis
Outlinks: 37
  outlink: toUrl: http://www.redis.io/styles.css?1333128600 anchor:
  outlink: toUrl: http://www.redis.io/images/favicon.png anchor:
  outlink: toUrl: http://www.redis.io/opensearch.xml anchor:
  outlink: toUrl: http://www.redis.io/ anchor: Redis
  outlink: toUrl: http://www.redis.io/images/redis.png anchor: Redis
  outlink: toUrl: http://www.redis.io/commands anchor: Commands
  outlink: toUrl: http://www.redis.io/clients anchor: Clients
  outlink: toUrl: http://www.redis.io/documentation anchor: Documentation
  outlink: toUrl: http://www.redis.io/community anchor: Community
  outlink: toUrl: http://www.redis.io/download anchor: Download
  outlink: toUrl: https://github.com/antirez/redis/issues anchor: Issues
  outlink: toUrl: http://www.redis.io/support anchor: Support
  outlink: toUrl: http://www.redis.io/topics/license anchor: License
  outlink: toUrl: http://www.redis.io/topics/data-types-intro#strings anchor: strings
  outlink: toUrl: http://www.redis.io/topics/data-types-intro#hashes anchor: hashes
  outlink: toUrl: http://www.redis.io/topics/data-types-intro#lists anchor: lists
  outlink: toUrl: http://www.redis.io/topics/data-types-intro#sets anchor: sets
  outlink: toUrl: http://www.redis.io/topics/data-types-intro#sorted-sets anchor: sorted sets
  outlink: toUrl: http://www.redis.io/topics/data-types-intro#bitmaps anchor: bitmaps
  outlink: toUrl: http://www.redis.io/topics/data-types-intro#hyperloglogs anchor: hyperloglogs
  outlink: toUrl: http://www.redis.io/topics/introduction anchor: Learn more →
  outlink: toUrl: http://try.redis.io anchor: interactive tutorial
  outlink: toUrl: http://download.redis.io/releases/redis-2.8.17.tar.gz anchor: Redis 2.8.17 is the latest stable version.
  outlink: toUrl: http://www.redis.io/download anchor: Check the downloads page.
  outlink: toUrl: http://twitter.com/redisfeed anchor: Redis Twitter account
  outlink: toUrl: http://github.com/antirez/redis anchor: code is at Github
  outlink: toUrl: https://groups.google.com/forum/?fromgroups#!forum/redis-db anchor: the Redis Google Group
  outlink: toUrl: http://www.redis.io/buzz anchor: More...
  outlink: toUrl: https://github.com/antirez/redis-io anchor: open source software
  outlink: toUrl: http://citrusbyte.com anchor: Citrusbyte
  outlink: toUrl: http://www.carlosprioglio.com/ anchor: Carlos Prioglio
  outlink: toUrl: http://redis.io/topics/sponsors anchor: credits
  outlink: toUrl: http://www.pivotal.io/big-data/redis anchor: Redis Support
  outlink: toUrl: http://www.redis.io/images/pivotal.png anchor: Redis Support
  outlink: toUrl: http://ajax.googleapis.com/ajax/libs/jquery/1.4/jquery.min.js anchor:
  outlink: toUrl: http://www.redis.io/app.js?1375789679 anchor:
  outlink: toUrl: http://demo.lloogg.com/l.js?c=20bb9c026e anchor:
Content Metadata: Status=200 OK nutch.content.digest=7d9b7315cecba5db6b579f64967988c6 Vary=Accept-Encoding Date=Fri, 17 Oct 2014 14:08:39 GMT Content-Length=1872 nutch.crawl.score=100.0 Content-Encoding=gzip _fst_=33 Via=1.0 redis.io nutch.segment.name=20141017221045 Connection=close Content-Type=text/html
Parse Metadata: CharEncodingForConversion=windows-1252 OriginalCharEncoding=windows-1252

--------------------

The content above is the complete result of parsing. It is passed to output.collect and ends up stored in files.

 

Once the map phase is over, the reduce phase begins.

The code here is a gift: it is very simple!

 public void reduce(Text key, Iterator<Writable> values,

                     OutputCollector<Text, Writable> output, Reporter reporter)

    throws IOException {

    output.collect(key, values.next()); // collect first value

  }
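
The effect of "collect first value" can be shown without Hadoop. The sketch below is a hypothetical simulation: `FirstValueReduce` and `reduceAll` are names invented for this example, and a plain `Map<K, List<V>>` stands in for the grouped input that the framework would hand to reduce. Unlike the real method, it skips keys with no values instead of calling `next()` unconditionally.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of a "keep the first value per key" reducer; not Nutch code.
class FirstValueReduce {

    /** For every key, keeps only the first value and drops the rest. */
    static <K, V> Map<K, V> reduceAll(Map<K, List<V>> grouped) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
            Iterator<V> values = entry.getValue().iterator();
            if (values.hasNext()) {
                out.put(entry.getKey(), values.next()); // collect first value
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        grouped.put("http://www.redis.io/", Arrays.asList("parse-1", "parse-2"));
        grouped.put("http://try.redis.io", Arrays.asList("parse-3"));

        // Each URL ends up with exactly one value, whichever came first.
        System.out.println(reduceAll(grouped));
    }
}
```

So if the map phase emitted more than one value for the same URL, only one of them survives the reduce.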

----------------------

As for the output part, please refer to this good article online:

http://www.cnblogs.com/ibook360/archive/2011/10/24/2222171.html

 

 
