Nutch 1.7 Source Code Revisited, Part 16: HtmlParser.getParse() Source Analysis

This part explains how the concrete parser parses the page content held in content.

----------------------First, construct a few variables

 HTMLMetaTags metaTags = new HTMLMetaTags();

    URL base;

    try {

      base = new URL(content.getBaseUrl());

    } catch (MalformedURLException e) {

      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());

    }

    String text = "";

    String title = "";

    Outlink[] outlinks = new Outlink[0];

    Metadata metadata = new Metadata();

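There is not much to say about these. base is the page's base URL; if it is malformed, the method returns an empty parse result right away. The other variables start out as empty defaults.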

-----------------

 

 // parse the content

    DocumentFragment root;

    try {

      byte[] contentInOctets = content.getContent();

      InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

      EncodingDetector detector = new EncodingDetector(conf);

      detector.autoDetectClues(content, true);

      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");

      String encoding = detector.guessEncoding(content, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);

      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      input.setEncoding(encoding);

      if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }

      root = parse(input);

    } catch (IOException e) {

      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());

    } catch (DOMException e) {

      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());

    } catch (SAXException e) {

      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());

    } catch (Exception e) {

      LOG.error("Error: ", e);

      return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());

    }

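This step guesses the character encoding, then hands the bytes to other classes (through parse(input)) to build a DOM object. Any exception along the way yields an empty parse result. It needs no further explanation and is not hard to follow.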
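One of the encoding clues above comes from sniffCharacterEncoding(contentInOctets). The block below is a minimal sketch of that idea, not the Nutch implementation: it looks for a charset declaration in a <meta> tag near the start of the raw bytes. The class name, the regular expression, and the 2000-byte limit are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffSketch {
    // Assumed limit: only the head of the document is inspected.
    private static final int CHUNK_SIZE = 2000;

    // Matches charset=... inside a <meta> tag, for example
    // <meta charset="utf-8"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_\\-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or null if no declaration is found. */
    public static String sniff(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);
        // Decode as ASCII: the declaration itself is plain ASCII, and any
        // other bytes simply turn into replacement characters.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=gb2312\"></head></html>")
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page));
    }
}
```

A sniffed value like this is only one clue. In the code above it is combined with the other clues by detector.guessEncoding(), which falls back to defaultCharEncoding.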

----------------------------------------------------------------------

Next, get the meta tags.

 

// get meta directives

    HTMLMetaProcessor.getMetaTags(metaTags, root, base);

    if (LOG.isTraceEnabled()) {

      LOG.trace("Meta tags for " + base + ": " + metaTags.toString());

    } 

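This extracts the meta directives from the root object and stores them in metaTags. The later steps read them through getNoIndex(), getNoFollow(), getRefresh() and getNoCache().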

A quick note on the robots meta tag:

 

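  <meta name="Robots" content="All|None|Index|Noindex|Follow|Nofollow">
     all: the page may be indexed, and the links on it may be followed;
     none: the page may not be indexed, and the links on it may not be followed; it has the same effect as "noindex, nofollow";
     index: the page may be indexed (the robot/spider is allowed to record it);
     follow: the links on the page may be followed;
     noindex: the page may not be indexed, but the links on it may still be followed (the robot/spider is not allowed to record it);
     nofollow: the links on the page may not be followed (the robot/spider should not crawl onward through this page's links).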
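To make the directive list concrete, here is a minimal sketch of turning the content attribute into two boolean flags. It only illustrates the semantics listed above; it is not the code of HTMLMetaProcessor, and the class and field names are hypothetical.

```java
import java.util.Locale;

public class RobotsMetaSketch {
    public boolean noIndex;   // true when the page must not be indexed
    public boolean noFollow;  // true when the links must not be followed

    /** Parses a robots content value such as "noindex, follow". */
    public static RobotsMetaSketch parse(String content) {
        RobotsMetaSketch flags = new RobotsMetaSketch();
        for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
            switch (token.trim()) {
                case "none":        // same as "noindex, nofollow"
                    flags.noIndex = true;
                    flags.noFollow = true;
                    break;
                case "noindex":
                    flags.noIndex = true;
                    break;
                case "nofollow":
                    flags.noFollow = true;
                    break;
                default:            // all, index, follow: keep the defaults
                    break;
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        RobotsMetaSketch f = parse("Noindex, Follow");
        System.out.println("noIndex=" + f.noIndex + " noFollow=" + f.noFollow);
    }
}
```

The two flags correspond to what getParse() later checks through metaTags.getNoIndex() and metaTags.getNoFollow().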

--------------------------------------- 

  Then, if indexing is allowed, extract the title/text

 // check meta directives

    if (!metaTags.getNoIndex()) {               // okay to index

      StringBuffer sb = new StringBuffer();

      if (LOG.isTraceEnabled()) { LOG.trace("Getting text..."); }

      utils.getText(sb, root);          // extract text

      text = sb.toString();

      sb.setLength(0);

      if (LOG.isTraceEnabled()) { LOG.trace("Getting title..."); }

      utils.getTitle(sb, root);         // extract title

      title = sb.toString().trim();

    }

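Both calls walk the DOM tree under root: one collects the page text, the other the title. Neither runs when the page is marked noindex, in which case text and title stay empty.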
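The traversal can be sketched with the standard org.w3c.dom API. This is an illustration under assumptions, not the code behind utils.getText() and utils.getTitle(): the input is a small well-formed XHTML string parsed with the JDK's XML parser, script and style elements are skipped, and all class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomTextSketch {
    /** Depth-first walk that appends every non-empty text node to sb. */
    static void collectText(StringBuilder sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
                return; // skip non-visible content
            }
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(t);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i));
        }
    }

    /** Finds the first <title> element and collects its text; true if found. */
    static boolean collectTitle(StringBuilder sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            collectText(sb, node);
            return true;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (collectTitle(sb, children.item(i))) return true;
        }
        return false;
    }

    /** Parses well-formed XHTML (an assumption of this sketch) into a DOM root. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("not well-formed: " + e.getMessage(), e);
        }
    }

    static String textOf(String xhtml) {
        StringBuilder sb = new StringBuilder();
        collectText(sb, parse(xhtml));
        return sb.toString();
    }

    static String titleOf(String xhtml) {
        StringBuilder sb = new StringBuilder();
        collectTitle(sb, parse(xhtml));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title> Redis </title></head>"
            + "<body><p>Hello</p><script>var x;</script><p>world</p></body></html>";
        System.out.println(titleOf(page));
        System.out.println(textOf(page));
    }
}
```

As in getParse(), one buffer per extraction is enough: the title walk stops at the first match, while the text walk visits the whole tree.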

------------------------------------

Next, get the outlinks. The code is as follows:

 

    if (!metaTags.getNoFollow()) {              // okay to follow links

      ArrayList<Outlink> l = new ArrayList<Outlink>();   // extract outlinks

      URL baseTag = utils.getBase(root);

      if (LOG.isTraceEnabled()) {

        LOG.trace("Getting links...");

      }

      utils.getOutlinks(baseTag != null ? baseTag : base, l, root);

      outlinks = l.toArray(new Outlink[l.size()]);

      if (LOG.isTraceEnabled()) {

        LOG.trace("found " + outlinks.length + " outlinks in "

            + content.getUrl());

      }

    }

 Taking www.redis.io/ as an example, the outlinks are:

toUrl: http://www.redis.io/styles.css?1333128600 anchor:
toUrl: http://www.redis.io/images/favicon.png anchor:
toUrl: http://www.redis.io/opensearch.xml anchor:
toUrl: http://www.redis.io/ anchor: Redis
toUrl: http://www.redis.io/images/redis.png anchor: Redis
toUrl: http://www.redis.io/commands anchor: Commands
toUrl: http://www.redis.io/clients anchor: Clients
toUrl: http://www.redis.io/documentation anchor: Documentation
toUrl: http://www.redis.io/community anchor: Community
toUrl: http://www.redis.io/download anchor: Download
toUrl: https://github.com/antirez/redis/issues anchor: Issues
toUrl: http://www.redis.io/support anchor: Support
toUrl: http://www.redis.io/topics/license anchor: License
toUrl: http://www.redis.io/topics/data-types-intro#strings anchor: strings
toUrl: http://www.redis.io/topics/data-types-intro#hashes anchor: hashes
toUrl: http://www.redis.io/topics/data-types-intro#lists anchor: lists
toUrl: http://www.redis.io/topics/data-types-intro#sets anchor: sets
toUrl: http://www.redis.io/topics/data-types-intro#sorted-sets anchor: sorted sets
toUrl: http://www.redis.io/topics/data-types-intro#bitmaps anchor: bitmaps
toUrl: http://www.redis.io/topics/data-types-intro#hyperloglogs anchor: hyperloglogs
toUrl: http://www.redis.io/topics/introduction anchor: Learn more →
toUrl: http://try.redis.io anchor: interactive tutorial
toUrl: http://download.redis.io/releases/redis-2.8.17.tar.gz anchor: Redis 2.8.17 is the latest stable version.
toUrl: http://www.redis.io/download anchor: Check the downloads page.
toUrl: http://twitter.com/redisfeed anchor: Redis Twitter account
toUrl: http://github.com/antirez/redis anchor: code is at Github
toUrl: https://groups.google.com/forum/?fromgroups#!forum/redis-db anchor: the Redis Google Group
toUrl: http://www.redis.io/buzz anchor: More...
toUrl: https://github.com/antirez/redis-io anchor: open source software
toUrl: http://citrusbyte.com anchor: Citrusbyte
toUrl: http://www.carlosprioglio.com/ anchor: Carlos Prioglio
toUrl: http://redis.io/topics/sponsors anchor: credits
toUrl: http://www.pivotal.io/big-data/redis anchor: Redis Support
toUrl: http://www.redis.io/images/pivotal.png anchor: Redis Support
toUrl: http://ajax.googleapis.com/ajax/libs/jquery/1.4/jquery.min.js anchor:
toUrl: http://www.redis.io/app.js?1375789679 anchor:
toUrl: http://demo.lloogg.com/l.js?c=20bb9c026e anchor:
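The listing shows that relative links on the page end up as absolute URLs. The outlink code passes baseTag != null ? baseTag : base to getOutlinks(), so a <base> tag takes priority over the page URL when one is present. Below is a minimal sketch of that resolution rule. It is an assumption-laden illustration, not the Nutch code: it uses java.net.URI.resolve() rather than the java.net.URL objects that getParse() works with, and the class and method names are hypothetical.

```java
import java.net.URI;

public class OutlinkResolveSketch {
    /**
     * Resolves href against the <base> tag href if there is one,
     * otherwise against the page URL. Returns null for unusable input.
     */
    static String resolve(String pageUrl, String baseTagHref, String href) {
        try {
            URI page = URI.create(pageUrl);
            // The <base> href may itself be relative to the page URL.
            URI base = (baseTagHref != null) ? page.resolve(baseTagHref) : page;
            return base.resolve(href).toString();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL or link
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.redis.io/", null, "/commands"));
        System.out.println(resolve("http://www.redis.io/", null,
            "topics/data-types-intro#strings"));
        System.out.println(resolve("http://www.redis.io/a/b.html",
            "http://example.com/docs/", "x.html"));
    }
}
```

Absolute links such as https://github.com/antirez/redis/issues pass through unchanged, which matches the mix of hosts in the listing.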

-----------------------------------------------------------------------------------------------------

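 Then the initial parseResult is constructed. If the page carries a meta refresh, the status is marked as a redirect, with the target URL and the delay as its arguments.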

 

    ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);

    if (metaTags.getRefresh()) {

      status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);

      status.setArgs(new String[] { metaTags.getRefreshHref().toString(),

          Integer.toString(metaTags.getRefreshTime()) });

    }

    ParseData parseData = new ParseData(status, title, outlinks,

        content.getMetadata(), metadata);

    ParseResult parseResult = ParseResult.createParseResult(

        content.getUrl(), new ParseImpl(text, parseData));
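The two status arguments come from a meta refresh tag such as <meta http-equiv="refresh" content="5; url=http://example.com/new">. The block below is a minimal sketch of splitting that content value into a delay and a target. It is not the HTMLMetaProcessor code; the class, its fields, and the example URL are hypothetical.

```java
public class MetaRefreshSketch {
    public final int seconds;  // delay before the refresh
    public final String href;  // target, or null when the page reloads itself

    MetaRefreshSketch(int seconds, String href) {
        this.seconds = seconds;
        this.href = href;
    }

    /** Parses values like "5; url=http://example.com/new"; null if invalid. */
    static MetaRefreshSketch parse(String content) {
        int semi = content.indexOf(';');
        String timePart = (semi < 0 ? content : content.substring(0, semi)).trim();
        int seconds;
        try {
            seconds = Integer.parseInt(timePart);
        } catch (NumberFormatException e) {
            return null; // no usable delay
        }
        String href = null;
        if (semi >= 0) {
            String rest = content.substring(semi + 1).trim();
            // Strip an optional, case-insensitive "url=" prefix.
            if (rest.regionMatches(true, 0, "url", 0, 3)) {
                int eq = rest.indexOf('=');
                if (eq >= 0) rest = rest.substring(eq + 1).trim();
            }
            if (!rest.isEmpty()) href = rest;
        }
        return new MetaRefreshSketch(seconds, href);
    }

    public static void main(String[] args) {
        MetaRefreshSketch r = parse("5; url=http://example.com/new");
        System.out.println(r.seconds + " -> " + r.href);
    }
}
```

In getParse() the equivalent pair is read from metaTags.getRefreshHref() and metaTags.getRefreshTime() and stored with status.setArgs().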

-------------The rest is where your own parse plugins hook in. For how to write a plugin, see my earlier articles.

The code is as follows:

 

    // run filters on parse

    ParseResult filteredParse = this.htmlParseFilters.filter(content,

        parseResult, metaTags, root);

    if (metaTags.getNoCache()) {             // not okay to cache

      for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)

        entry.getValue().getData().getParseMeta()

            .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);

    }

    return filteredParse;
