Now let's walk through how the parser actually parses the fetched page content (the content object).
---------------------- First, set up a few variables
HTMLMetaTags metaTags = new HTMLMetaTags();
URL base;
try {
  base = new URL(content.getBaseUrl());
} catch (MalformedURLException e) {
  return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
}
String text = "";
String title = "";
Outlink[] outlinks = new Outlink[0];
Metadata metadata = new Metadata();
Nothing much to say about these: they are just empty defaults for the text, title, outlinks, and metadata that the rest of the method will fill in.
-----------------
// parse the content
DocumentFragment root;
try {
  byte[] contentInOctets = content.getContent();
  InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
  EncodingDetector detector = new EncodingDetector(conf);
  detector.autoDetectClues(content, true);
  detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
  String encoding = detector.guessEncoding(content, defaultCharEncoding);
  metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
  metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);
  input.setEncoding(encoding);
  if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
  root = parse(input);
} catch (IOException e) {
  return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
} catch (DOMException e) {
  return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
} catch (SAXException e) {
  return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
} catch (Exception e) {
  LOG.error("Error: ", e);
  return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
}
This delegates to another class to parse the raw bytes into a DOM object (root); the details aren't hard to follow. One thing worth noting: before parsing, the EncodingDetector combines several clues, including a sniff of the raw bytes by sniffCharacterEncoding, to guess the character encoding, records it in the metadata, and sets it on the InputSource.
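As a rough sketch of the sniffing step: look for a meta content-type declaration in the first couple of thousand bytes. This is an illustrative reimplementation, not Nutch's actual sniffCharacterEncoding; the chunk size and regexes are simplified assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
  private static final int CHUNK_SIZE = 2000; // only the head of the page is inspected
  private static final Pattern META_PATTERN = Pattern.compile(
      "<meta\\s+([^>]*http-equiv=[\"']?content-type[\"']?[^>]*)>",
      Pattern.CASE_INSENSITIVE);
  private static final Pattern CHARSET_PATTERN = Pattern.compile(
      "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);

  public static String sniff(byte[] content) {
    int length = Math.min(content.length, CHUNK_SIZE);
    // Decode the head as Latin-1 so every byte maps to a char losslessly.
    String head = new String(content, 0, length,
        java.nio.charset.StandardCharsets.ISO_8859_1);
    Matcher metaMatcher = META_PATTERN.matcher(head);
    if (metaMatcher.find()) {
      Matcher charsetMatcher = CHARSET_PATTERN.matcher(metaMatcher.group(1));
      if (charsetMatcher.find()) {
        return charsetMatcher.group(1);
      }
    }
    return null; // no declaration found; the detector falls back to other clues
  }

  public static void main(String[] args) {
    byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
        + "content=\"text/html; charset=GBK\"></head></html>").getBytes();
    System.out.println(CharsetSniffer.sniff(page)); // GBK
  }
}
```

The sniffed value is only one clue; guessEncoding weighs it against the HTTP headers and the configured default.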
----------------------------------------------------------------------
Next, the meta directives are extracted:
// get meta directives
HTMLMetaProcessor.getMetaTags(metaTags, root, base);
if (LOG.isTraceEnabled()) {
  LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
}
This walks the root DOM object and pulls the relevant meta tags out into metaTags.
A quick aside:
<meta name="Robots" content="All|None|Index|Noindex|Follow|Nofollow">
- all: the page may be indexed, and its links may be followed;
- none: the page will not be indexed and its links will not be followed (same as "noindex, nofollow");
- index: the page may be indexed (lets the robot/spider in);
- follow: the links on the page may be followed;
- noindex: the page will not be indexed, but its links may still be followed;
- nofollow: the page may be indexed, but the robot/spider must not follow its links onward.
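These directives are what getNoIndex() and getNoFollow() later report. As an illustrative sketch (not Nutch's actual HTMLMetaProcessor code), mapping a robots content value onto those two flags might look like:

```java
public class RobotsMeta {
  public boolean noIndex = false;
  public boolean noFollow = false;

  public static RobotsMeta parse(String content) {
    RobotsMeta meta = new RobotsMeta();
    for (String directive : content.toLowerCase().split(",")) {
      switch (directive.trim()) {
        case "none":              // equivalent to "noindex, nofollow"
          meta.noIndex = true;
          meta.noFollow = true;
          break;
        case "noindex":
          meta.noIndex = true;
          break;
        case "nofollow":
          meta.noFollow = true;
          break;
        // "all", "index", "follow" leave the permissive defaults in place
      }
    }
    return meta;
  }

  public static void main(String[] args) {
    RobotsMeta m = RobotsMeta.parse("Noindex, Follow");
    System.out.println(m.noIndex + " " + m.noFollow); // true false
  }
}
```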
---------------------------------------
Then, if indexing is allowed, the text and title are extracted:
// check meta directives
if (!metaTags.getNoIndex()) { // okay to index
  StringBuffer sb = new StringBuffer();
  if (LOG.isTraceEnabled()) { LOG.trace("Getting text..."); }
  utils.getText(sb, root); // extract text
  text = sb.toString();
  sb.setLength(0);
  if (LOG.isTraceEnabled()) { LOG.trace("Getting title..."); }
  utils.getTitle(sb, root); // extract title
  title = sb.toString().trim();
}
Both extractions are simple traversals of the DOM tree, collecting the page text and the <title> element respectively.
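A minimal sketch of such a traversal (illustrative only; Nutch's DOMContentUtils additionally skips <script>/<style> elements and normalizes whitespace):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TextExtractor {
  // Append every text node under 'node' to the buffer, depth-first.
  static void getText(StringBuffer sb, Node node) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue());
    }
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      getText(sb, child);
    }
  }

  // Convenience wrapper: parse a well-formed (X)HTML string and collect its text.
  static String extract(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
    StringBuffer sb = new StringBuffer();
    getText(sb, doc);
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(extract(
        "<html><head><title>Redis</title></head>"
        + "<body><p>It works</p></body></html>")); // RedisIt works
  }
}
```

getTitle works the same way, except the traversal stops at the first <title> element it finds.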
------------------------------------
Next come the outlinks; the code is:
if (!metaTags.getNoFollow()) { // okay to follow links
  ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
  URL baseTag = utils.getBase(root);
  if (LOG.isTraceEnabled()) {
    LOG.trace("Getting links...");
  }
  utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
  outlinks = l.toArray(new Outlink[l.size()]);
  if (LOG.isTraceEnabled()) {
    LOG.trace("found " + outlinks.length + " outlinks in "
        + content.getUrl());
  }
}
Taking www.redis.io/ as an example, outlinks ends up containing:
toUrl: http://www.redis.io/styles.css?1333128600 anchor:
toUrl: http://www.redis.io/images/favicon.png anchor:
toUrl: http://www.redis.io/opensearch.xml anchor:
toUrl: http://www.redis.io/ anchor: Redis
toUrl: http://www.redis.io/images/redis.png anchor: Redis
toUrl: http://www.redis.io/commands anchor: Commands
toUrl: http://www.redis.io/clients anchor: Clients
toUrl: http://www.redis.io/documentation anchor: Documentation
toUrl: http://www.redis.io/community anchor: Community
toUrl: http://www.redis.io/download anchor: Download
toUrl: https://github.com/antirez/redis/issues anchor: Issues
toUrl: http://www.redis.io/support anchor: Support
toUrl: http://www.redis.io/topics/license anchor: License
toUrl: http://www.redis.io/topics/data-types-intro#strings anchor: strings
toUrl: http://www.redis.io/topics/data-types-intro#hashes anchor: hashes
toUrl: http://www.redis.io/topics/data-types-intro#lists anchor: lists
toUrl: http://www.redis.io/topics/data-types-intro#sets anchor: sets
toUrl: http://www.redis.io/topics/data-types-intro#sorted-sets anchor: sorted sets
toUrl: http://www.redis.io/topics/data-types-intro#bitmaps anchor: bitmaps
toUrl: http://www.redis.io/topics/data-types-intro#hyperloglogs anchor: hyperloglogs
toUrl: http://www.redis.io/topics/introduction anchor: Learn more →
toUrl: http://try.redis.io anchor: interactive tutorial
toUrl: http://download.redis.io/releases/redis-2.8.17.tar.gz anchor: Redis 2.8.17 is the latest stable version.
toUrl: http://www.redis.io/download anchor: Check the downloads page.
toUrl: http://twitter.com/redisfeed anchor: Redis Twitter account
toUrl: http://github.com/antirez/redis anchor: code is at Github
toUrl: https://groups.google.com/forum/?fromgroups#!forum/redis-db anchor: the Redis Google Group
toUrl: http://www.redis.io/buzz anchor: More...
toUrl: https://github.com/antirez/redis-io anchor: open source software
toUrl: http://citrusbyte.com anchor: Citrusbyte
toUrl: http://www.carlosprioglio.com/ anchor: Carlos Prioglio
toUrl: http://redis.io/topics/sponsors anchor: credits
toUrl: http://www.pivotal.io/big-data/redis anchor: Redis Support
toUrl: http://www.redis.io/images/pivotal.png anchor: Redis Support
toUrl: http://ajax.googleapis.com/ajax/libs/jquery/1.4/jquery.min.js anchor:
toUrl: http://www.redis.io/app.js?1375789679 anchor:
toUrl: http://demo.lloogg.com/l.js?c=20bb9c026e anchor:
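The key step inside utils.getOutlinks is resolving each href/src attribute against the effective base URL (the <base> tag if present, otherwise the page URL); that is how a relative styles.css?1333128600 becomes the absolute first entry in the dump above. A minimal sketch of that resolution using java.net.URL:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class OutlinkResolver {
  // Resolve a (possibly relative) href against the base, per RFC URL rules.
  public static String resolve(URL base, String href) throws MalformedURLException {
    return new URL(base, href).toString();
  }

  public static void main(String[] args) throws Exception {
    URL base = new URL("http://www.redis.io/");
    // A relative link is joined onto the base.
    System.out.println(resolve(base, "styles.css?1333128600"));
    // An absolute link is kept as-is.
    System.out.println(resolve(base, "https://github.com/antirez/redis/issues"));
  }
}
```

The anchor text shown next to each toUrl is simply the text content of the corresponding <a> element.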
-----------------------------------------------------------------------------------------------------
Then the initial parseResult is constructed; a meta refresh is reported as a SUCCESS_REDIRECT, with the target URL and the refresh delay as the status arguments:
ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
if (metaTags.getRefresh()) {
  status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
  status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
      Integer.toString(metaTags.getRefreshTime()) });
}
ParseData parseData = new ParseData(status, title, outlinks,
    content.getMetadata(), metadata);
ParseResult parseResult = ParseResult.createParseResult(
    content.getUrl(), new ParseImpl(text, parseData));
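Where do getRefreshHref() and getRefreshTime() come from? A meta refresh tag such as <meta http-equiv="refresh" content="5; url=http://example.com/"> carries both. A hypothetical sketch of parsing that content value (not Nutch's actual code; real pages are messier about whitespace and quoting):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RefreshParser {
  // Matches "<seconds>; url=<target>", e.g. "5; url=http://example.com/".
  private static final Pattern REFRESH = Pattern.compile(
      "^\\s*(\\d+)\\s*;\\s*url\\s*=\\s*(\\S+)", Pattern.CASE_INSENSITIVE);

  // Returns { refreshHref, refreshTime } in the same order the status args use.
  public static String[] parse(String content) {
    Matcher m = REFRESH.matcher(content);
    if (m.find()) {
      return new String[] { m.group(2), m.group(1) };
    }
    return null; // not a redirecting refresh directive
  }

  public static void main(String[] args) {
    String[] refresh = parse("5; url=http://example.com/");
    System.out.println(refresh[0] + " " + refresh[1]); // http://example.com/ 5
  }
}
```

The two values end up as the status args, which later stages read back to follow the redirect.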
------------- The rest is where your own parse filter plugins run; see my earlier article on how to write such a plugin.
The code is:
// run filters on parse
ParseResult filteredParse = this.htmlParseFilters.filter(content,
    parseResult, metaTags, root);
if (metaTags.getNoCache()) { // not okay to cache
  for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
    entry.getValue().getData().getParseMeta()
        .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
}
return filteredParse;