Today let's look at how Nutch parses web pages:
Nutch uses two HTML parser libraries (NekoHTML and TagSoup) for HTML extraction; which one is used can be selected through configuration.
Of course, if you want to implement a Parser yourself, you could also choose HTMLParser [based on the Visitor pattern, and it also provides an event-driven interface] to
extract pages. If you are used to the XML family of processing techniques, NekoHTML and TagSoup should feel quite natural.
Let's look at the implementation of public class HtmlParser implements Parser:
First, to make the code below easier to follow, here are the member variables:
private static final int CHUNK_SIZE = 2000;
private static Pattern metaPattern =
    Pattern.compile("<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
                    Pattern.CASE_INSENSITIVE);
private static Pattern charsetPattern =
    Pattern.compile("charset=\\s*([a-z][_\\-0-9a-z]*)",
                    Pattern.CASE_INSENSITIVE);
private String parserImpl;
CHUNK_SIZE is the length of the HTML chunk from which the meta tags are extracted; meta tags rarely occur beyond the first 2000 bytes, so extracting from this chunk alone is enough.
metaPattern is the regular expression for matching the meta tag.
charsetPattern is the regular expression for the character-set encoding.
parserImpl selects whether NekoHTML or TagSoup is used to parse the HTML: if parserImpl is "tagsoup", TagSoup is used; otherwise NekoHTML is used.
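The choice is driven by Nutch configuration. Assuming the standard property name from nutch-default.xml (the exact name and wording may differ between Nutch versions, so treat this as a sketch and check your own configuration files), the selection looks roughly like this:

```
<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. "neko" uses NekoHTML,
  "tagsoup" uses TagSoup.</description>
</property>
```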
The following method extracts the encoding specified by charset or Content-Type from the HTML meta tags:
length restricts the extraction to the meta-tag chunk, from which the encoding is easy to pull out with the regular expressions above.
private static String sniffCharacterEncoding(byte[] content) {
    int length = content.length < CHUNK_SIZE ?
                 content.length : CHUNK_SIZE;

    // We don't care about non-ASCII parts so that it's sufficient
    // to just inflate each byte to a 16-bit value by padding.
    // For instance, the sequence {0x41, 0x82, 0xb7} will be turned into
    // {U+0041, U+0082, U+00B7}.
    String str = new String(content, 0, 0, length);

    Matcher metaMatcher = metaPattern.matcher(str);
    String encoding = null;
    if (metaMatcher.find()) {
        Matcher charsetMatcher = charsetPattern.matcher(metaMatcher.group(1));
        if (charsetMatcher.find())
            encoding = new String(charsetMatcher.group(1));
    }
    return encoding;
}
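To see the two regular expressions in action, here is a small standalone sketch (my own demo class, not Nutch code) that applies the same patterns to a sample document head:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone demo of the charset-sniffing logic: the same two regexes
// as in HtmlParser, applied to the beginning of an HTML document.
public class CharsetSniffDemo {
    private static final Pattern META = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET = Pattern.compile(
        "charset=\\s*([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    // returns the sniffed charset name, or null if no meta tag matched
    static String sniff(String head) {
        Matcher m = META.matcher(head);
        if (m.find()) {
            Matcher c = CHARSET.matcher(m.group(1));
            if (c.find()) return c.group(1);
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                    + "content=\"text/html; charset=gb2312\"></head>";
        System.out.println(sniff(html)); // prints gb2312
    }
}
```

Note that the meta pattern matches case-insensitively and tolerates missing quotes around content-type, which is common in real-world HTML.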
The most important method is:
public Parse getParse(Content content)
This method returns a Parse object containing all the extraction results.
The method is fairly long, close to 100 lines, but it really decomposes into a few small steps:
extract the base URL, detect the encoding, decode the content using that encoding, extract the meta tags, extract the outlinks, and finally construct the Parse object from the extracted text and ParseData.
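The decomposition can be sketched as the following pseudocode outline (the helper names are hypothetical; the real getParse inlines all of these steps):

```
public Parse getParse(Content content) {
    base     = extract base URL from content
    encoding = detect encoding (HTTP header -> meta tag -> default)
    root     = parse content into a DocumentFragment using encoding
    metaTags = extract meta tags from root        // noindex / nofollow / refresh
    if indexing allowed:  extract text and title from root
    if following allowed: extract outlinks from root
    return new ParseImpl(text, new ParseData(status, title, outlinks, ...))
}
```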
Let's go through them one by one:
Extracting the base URL:
URL base;
try {
    base = new URL(content.getBaseUrl());
} catch (MalformedURLException e) {
    return new ParseStatus(e).getEmptyParse(getConf());
}
Detecting the encoding:
// first, try to extract it directly from the content's metadata
byte[] contentInOctets = content.getContent();
InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
String contentType = content.getMetadata().get(Response.CONTENT_TYPE);
String encoding = StringUtil.parseCharacterEncoding(contentType);
if ((encoding != null) && !("".equals(encoding))) {
    metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
    if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
        metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);
        if (LOG.isTraceEnabled()) {
            LOG.trace(base + ": setting encoding to " + encoding);
        }
    }
}
// if nothing was found in the metadata, fall back to sniffCharacterEncoding,
// which sniffs the 'charset' value from the beginning of the document
if ((encoding == null) || ("".equals(encoding))) {
    encoding = sniffCharacterEncoding(contentInOctets);
    if (encoding != null) {
        metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
        if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
            metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);
            if (LOG.isTraceEnabled()) {
                LOG.trace(base + ": setting encoding to " + encoding);
            }
        }
    }
}
// if the encoding is still unknown, use the default
if (encoding == null) {
    // fallback encoding.
    // FIXME : In addition to the global fallback value,
    // we should make it possible to specify fallback encodings for each ccTLD.
    // (e.g. se: windows-1252, kr: x-windows-949, cn: gb18030, tw: big5
    // doesn't work for jp because euc-jp and shift_jis have about the
    // same share)
    encoding = defaultCharEncoding;
    metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, defaultCharEncoding);
    if (LOG.isTraceEnabled()) {
        LOG.trace(base + ": falling back to " + defaultCharEncoding);
    }
}
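Stripped of the metadata bookkeeping and logging, the three-step fallback above boils down to a simple priority chain. This is an illustrative condensation of my own, not the actual Nutch code:

```java
// Condensed restatement of the encoding resolution order:
// HTTP Content-Type header -> sniffed meta tag -> configured default.
public class EncodingResolution {

    // each of the first two arguments may be null/empty if that source yielded nothing
    static String resolve(String fromHeader, String fromMetaTag, String fallback) {
        if (fromHeader != null && !fromHeader.isEmpty()) return fromHeader;
        if (fromMetaTag != null && !fromMetaTag.isEmpty()) return fromMetaTag;
        return fallback;
    }

    public static void main(String[] args) {
        // the header wins when present
        System.out.println(resolve("utf-8", "gb2312", "windows-1252")); // prints utf-8
        // otherwise the sniffed meta-tag charset
        System.out.println(resolve(null, "gb2312", "windows-1252"));    // prints gb2312
        // otherwise the configured default
        System.out.println(resolve(null, null, "windows-1252"));        // prints windows-1252
    }
}
```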
With the encoding settled, a DocumentFragment is parsed from the content:
    input.setEncoding(encoding);
    if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
    root = parse(input);
} catch (IOException e) {
    return new ParseStatus(e).getEmptyParse(getConf());
} catch (DOMException e) {
    return new ParseStatus(e).getEmptyParse(getConf());
} catch (SAXException e) {
    return new ParseStatus(e).getEmptyParse(getConf());
} catch (Exception e) {
    e.printStackTrace(LogUtil.getWarnStream(LOG));
    return new ParseStatus(e).getEmptyParse(getConf());
}
Extracting the meta tags and checking the meta directives:
HTMLMetaProcessor.getMetaTags(metaTags, root, base);
if (LOG.isTraceEnabled()) {
    LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
}

// check meta directives
if (!metaTags.getNoIndex()) {               // okay to index
    StringBuffer sb = new StringBuffer();
    if (LOG.isTraceEnabled()) { LOG.trace("Getting text..."); }
    utils.getText(sb, root);                // extract text
    text = sb.toString();
    sb.setLength(0);
    if (LOG.isTraceEnabled()) { LOG.trace("Getting title..."); }
    utils.getTitle(sb, root);               // extract title
    title = sb.toString().trim();
}
Extracting the outlinks:
if (!metaTags.getNoFollow()) {              // okay to follow links
    ArrayList l = new ArrayList();          // extract outlinks
    URL baseTag = utils.getBase(root);
    if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
    utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
    outlinks = (Outlink[]) l.toArray(new Outlink[l.size()]);
    if (LOG.isTraceEnabled()) {
        LOG.trace("found " + outlinks.length + " outlinks in " + content.getUrl());
    }
}
Building the Parse object:
ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
if (metaTags.getRefresh()) {
    status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
    status.setMessage(metaTags.getRefreshHref().toString());
}
ParseData parseData = new ParseData(status, title, outlinks,
                                    content.getMetadata(), metadata);
parseData.setConf(this.conf);
Parse parse = new ParseImpl(text, parseData);

// run filters on parse
parse = this.htmlParseFilters.filter(content, parse, metaTags, root);
if (metaTags.getNoCache()) {                // not okay to cache
    parse.getData().getParseMeta().set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
}
The following method uses either NekoHTML or TagSoup, depending on the parserImpl field, to parse the content into a DocumentFragment:
private DocumentFragment parse(InputSource input) throws Exception {
    if (parserImpl.equalsIgnoreCase("tagsoup"))
        return parseTagSoup(input);
    else
        return parseNeko(input);
}
That basically wraps up the page-parsing part; I'll fill in the remaining pieces as needed, and will continue with the other parts after studying Google's map-reduce.