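I've recently been writing a small crawler and using htmlparser to parse the HTML. Parsing the Object tag turned out to be awkward, though: I can't reliably get the child tags back as the objects I want.
Take a piece of HTML like this: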
<object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" height="406" width="980">
  <param name="quality" value="high" />
  <param name="movie" value="/flash/index.swf" />
  <param name="quality" value="high" />
  <param name="wmode" value="transparent" />
  <param name="movie" value="/flash/index.swf" />
  <embed height="406" pluginspage="http://www.macromedia.com/go/getflashplayer" quality="high" src="/flashRepository/d973f054-ae5d-453d-bbfb-9b9c825fd7df" type="application/x-shockwave-flash" width="980" wmode="transparent"></embed>
</object>
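After parsing it with HtmlParser I can get the object for the Object tag, but nothing below it: the Param and Embed tags all come back as TagNode rather than the ParamTag and EmbedTag I wanted. Both of those classes are my own and are shown below.
The parsing code looks like this: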
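import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.PrototypicalNodeFactory;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.util.DefaultParserFeedback;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Register the custom tag prototypes with the node factory.
PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
factory.registerTag(new LocalObjectTag());
factory.registerTag(new EmbedTag());
factory.registerTag(new ParamTag());

Parser parser = new Parser();
parser.setNodeFactory(factory);
try {
    parser.setInputHTML(testHTML); // testHTML: the HTML string to parse
} catch (ParserException e) {
    e.printStackTrace();
}
parser.setFeedback(new DefaultParserFeedback(DefaultParserFeedback.QUIET));

// Accept a node if it is an instance of any of the three custom tag classes.
NodeFilter[] srcFilters = {
    new NodeClassFilter(EmbedTag.class),
    new NodeClassFilter(LocalObjectTag.class),
    new NodeClassFilter(ParamTag.class)
};
OrFilter linkFilter = new OrFilter(srcFilters);

// Collect every tag that passes the filter.
try {
    NodeList list = parser.extractAllNodesThatMatch(linkFilter);
    for (int i = 0; i < list.size(); i++) {
        Node n = list.elementAt(i);
        if (n instanceof ParamTag) {
            ParamTag p = (ParamTag) n;
            System.out.println("src: " + p.getSrc());
        }
    }
} catch (ParserException e) {
    e.printStackTrace();
}
System.out.println("exit");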
Since Parser doesn't ship with an EmbedTag or a ParamTag, I wrote these two classes myself.
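import org.apache.commons.lang.StringUtils; // assumed: Apache Commons Lang 2.x (lang3 uses org.apache.commons.lang3)
import org.htmlparser.tags.CompositeTag;

public class ParamTag extends CompositeTag {
    /** Returns the absolute URL from the src attribute, or null if there is none. */
    public String getSrc() {
        // Check whether the src attribute has a value.
        // Note: the <param> elements in the sample carry their URL in "value", not "src".
        String srcValue = getAttribute("SRC");
        if (StringUtils.isNotBlank(srcValue)) {
            return getPage().getAbsoluteURL(srcValue);
        }
        return null;
    }

    /** True for <param name="movie" ...>. */
    public boolean isMovie() {
        // "movie" is the value of the name attribute, not an attribute name.
        return "movie".equalsIgnoreCase(getAttribute("name"));
    }
}

public class EmbedTag extends CompositeTag {
    /** Returns the absolute URL from the src attribute, or null if there is none. */
    public String getSrc() {
        // Check whether the src attribute has a value.
        String srcValue = getAttribute("SRC");
        if (StringUtils.isNotBlank(srcValue)) {
            return getPage().getAbsoluteURL(srcValue);
        }
        return null;
    }
}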
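One possible explanation for the TagNode results, offered as an assumption rather than something the code above confirms: to my understanding, PrototypicalNodeFactory.registerTag files a prototype under the names returned by its getIds(), and neither ParamTag nor EmbedTag overrides getIds(). The sketch below is a self-contained toy model of that kind of registry. RegistrySketch, ProtoTag and the other names are hypothetical, and none of it is htmlparser's own code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model of a prototype registry keyed by tag name; not htmlparser code. */
public class RegistrySketch {

    /** Hypothetical stand-in for a tag prototype. */
    static class ProtoTag {
        // Default: no ids, so the registry has nothing to file this tag under.
        String[] getIds() { return new String[0]; }
    }

    /** A prototype that declares which tag names it handles. */
    static class ParamProto extends ProtoTag {
        @Override
        String[] getIds() { return new String[] { "PARAM" }; }
    }

    /** A prototype that does not override getIds(). */
    static class EmbedProto extends ProtoTag { }

    static class Registry {
        private final Map<String, ProtoTag> blueprints = new HashMap<>();

        // File the prototype under every id it reports.
        void registerTag(ProtoTag tag) {
            for (String id : tag.getIds()) {
                blueprints.put(id.toUpperCase(Locale.ENGLISH), tag);
            }
        }

        // Name of the class that would be used for this tag name,
        // falling back to a generic node when nothing is registered.
        String classFor(String tagName) {
            ProtoTag proto = blueprints.get(tagName.toUpperCase(Locale.ENGLISH));
            return proto == null ? "GenericTag" : proto.getClass().getSimpleName();
        }
    }

    static Registry demoRegistry() {
        Registry registry = new Registry();
        registry.registerTag(new ParamProto());
        registry.registerTag(new EmbedProto()); // registers nothing: the id list is empty
        return registry;
    }

    public static void main(String[] args) {
        Registry registry = demoRegistry();
        System.out.println(registry.classFor("param"));
        System.out.println(registry.classFor("embed"));
    }
}
```

If that is indeed what is happening here, overriding getIds() in ParamTag and EmbedTag so that they return "PARAM" and "EMBED" would be the first thing to try.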
Also, to make ObjectTag easier to use, I subclassed it as a new class, LocalObjectTag.
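import org.apache.commons.lang.StringUtils; // assumed: Apache Commons Lang 2.x (lang3 uses org.apache.commons.lang3)
import org.htmlparser.Node;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.tags.ObjectTag;
import org.htmlparser.util.NodeList;

public class LocalObjectTag extends ObjectTag {
    /** Returns the URL of the embedded resource, or null if none is found. */
    public String extractUrl() {
        // First check whether the data attribute has a value.
        String dataValue = getAttribute("data");
        if (StringUtils.isNotBlank(dataValue)) {
            return getPage().getAbsoluteURL(dataValue);
        }
        // Otherwise fall back to the child tags.
        return fromChildren();
    }

    /** Looks for <param name="movie" value="..."> or <embed src="..."> among the children. */
    private String fromChildren() {
        NodeList nList = this.getChildren();
        if (nList == null) {
            return null; // the tag has no children
        }
        for (int i = 0; i < nList.size(); i++) {
            Node n = nList.elementAt(i);
            if (n instanceof TagNode) {
                TagNode tNode = (TagNode) n;
                // The value is returned as-is; it is not resolved against the page URL.
                String value = tNode.getAttribute("VALUE");
                String nameAttri = tNode.getAttribute("name");
                if (StringUtils.isNotBlank(value) && "movie".equalsIgnoreCase(nameAttri)) {
                    return value;
                }
                String src = tNode.getAttribute("src");
                String name = tNode.getTagName();
                if (StringUtils.isNotBlank(src) && "embed".equalsIgnoreCase(name)) {
                    return src;
                }
            }
        }
        return null;
    }
}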
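One thing to watch in extractUrl: the data attribute is resolved against the page URL, while the value found in the child tags is returned unchanged, so for the sample HTML it would be the relative "/flash/index.swf". As a minimal sketch of what resolving such a value involves, here is a standalone version that uses java.net.URI from the JDK rather than htmlparser's Page. ResolveSketch, toAbsolute and the example.com page address are made up for illustration.

```java
import java.net.URI;

public class ResolveSketch {
    // Resolve a possibly relative link against the address of the page it came from.
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/site/index.html"; // hypothetical page address
        // A link starting with "/" is resolved from the site root.
        System.out.println(toAbsolute(page, "/flash/index.swf"));
        // A link without a leading "/" is resolved from the page's directory.
        System.out.println(toAbsolute(page, "flash/index.swf"));
    }
}
```

Inside LocalObjectTag the equivalent step would presumably be to pass the child value through getPage().getAbsoluteURL(...), the same call extractUrl already uses for the data attribute.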