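To meet a company requirement, I wrote a crawler that scrapes data from the web, for example city and industry information from Dianping (点评网), Alibaba and HC360 (慧聪网). The technology used is htmlparser. This article briefly walks through commonly used htmlparser scraping code; for details, see the API documentation included in the htmlparser download package.
First, an overview of how the Node types relate to each other and of all NodeFilter implementations. For Node, the implementing classes are listed in parentheses after each subinterface.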
Interface Node
All Known Subinterfaces:
Remark(RemarkNode),
Tag(AppletTag, BaseHrefTag, BodyTag, Bullet, BulletList, CompositeTag, DefinitionList, DefinitionListBullet, Div, DoctypeTag, FormTag, FrameSetTag, FrameTag, HeadingTag, HeadTag, Html, ImageTag, InputTag, JspTag, LabelTag, LinkTag, MetaTag, ObjectTag, OptionTag, ParagraphTag, ProcessingInstructionTag, ScriptTag, SelectTag, Span, StyleTag, TableColumn, TableHeader, TableRow, TableTag, TagNode, TextareaTag, TitleTag),
Text(TextNode)
Interface NodeFilter
All Known Implementing Classes:
AndFilter, AndFilterWrapper, CssSelectorNodeFilter, Filter, HasAttributeFilter, HasAttributeFilterWrapper, HasChildFilter, HasChildFilterWrapper, HasParentFilter, HasParentFilterWrapper, HasSiblingFilter, HasSiblingFilterWrapper, IsEqualFilter, LinkRegexFilter, LinkStringFilter, NodeClassFilter, NodeClassFilterWrapper, NotFilter, NotFilterWrapper, OrFilter, OrFilterWrapper, RegexFilter, RegexFilterWrapper, StringFilter, StringFilterWrapper, TagNameFilter, TagNameFilterWrapper
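Basic approach: start by analyzing the HTML of the whole page, and in particular the part of the HTML you want to extract.
Step 1: Create the Parser object and set its encoding: parser.setEncoding("UTF-8"); // must match the encoding the HTML document actually uses
Step 2: Create a suitable filter.
Step 3: Parse to obtain a NodeList. Calling toHtml() on the NodeList yields a string from which a new Parser can be created. Locating the target content in a single pass is best; when that is not possible, narrow the scope step by step.
Step 4: Post-process the extracted content (string handling, database operations and so on). NodeList.toNodeArray() returns a Node[] array, e.g. Node[] nodes = nodeList.toNodeArray(); LinkTag link = (LinkTag) nodes[0]; link.getLinkText(); // the link text  link.getLink(); // the link URL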
Details:
1. Ways to create a Parser object (a network exception is sometimes thrown; the three approaches below can be tried in turn)
1.1 The standard way
Parser(String resource)
Creates a Parser object with the location of the resource (URL or file).
Parser(URLConnection connection)
Construct a parser using the provided URLConnection.
static Parser createParser(String html, String charset)
Creates the parser on an input string.
1.2 Using a java.net connection that sends a browser User-Agent header
public static URLConnection getUrlAgent(String strUrl) {
    HttpURLConnection connection = null;
    try {
        URL url = new URL(strUrl);
        connection = (HttpURLConnection) url.openConnection();
        // Identify as a browser; some sites reject the default Java User-Agent
        connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return connection; // null if the URL could not be opened
}
Parser parser = new Parser(getUrlAgent(strUrl));
or, when the URL contains percent-encoded Chinese characters:
String url = "http://localhost:8081/company/kw/%CB%FE%B5%F5.html";
url = java.net.URLDecoder.decode(url, "gb2312"); // decode with the charset the site uses
System.out.println(url);
URLConnection conn = getUrlAgent(url);
Parser parser = new Parser(conn);
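The two snippets in 1.2 can be tried without htmlparser or a running server. The following is a minimal sketch using only the JDK: it decodes the percent-encoded segment in isolation and prepares a connection that fails fast instead of handing a possibly null connection to Parser. The class name UrlAgentSketch, the timeout values and the use of unchecked exceptions are assumptions made for illustration, and the Charset overload of URLDecoder.decode requires Java 10 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;
import java.net.URLDecoder;
import java.nio.charset.Charset;

final class UrlAgentSketch {
    // Same User-Agent string as in getUrlAgent above.
    static final String USER_AGENT =
            "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)";

    /** Decodes percent-encoded GB2312 bytes; throws if the JDK lacks that charset. */
    static String decodeGb2312(String encoded) {
        return URLDecoder.decode(encoded, Charset.forName("GB2312"));
    }

    /** Prepares a connection with a browser-like User-Agent; no network I/O happens yet. */
    static URLConnection open(String strUrl) {
        try {
            // URI.create rejects malformed input with an unchecked exception
            URLConnection connection = URI.create(strUrl).toURL().openConnection();
            connection.setRequestProperty("User-Agent", USER_AGENT);
            connection.setConnectTimeout(10_000); // assumed value, in milliseconds
            connection.setReadTimeout(30_000);    // assumed value, in milliseconds
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot prepare connection to " + strUrl, e);
        }
    }

    public static void main(String[] args) {
        String keyword = decodeGb2312("%CB%FE%B5%F5");
        System.out.println(keyword.length() + " characters decoded");

        URLConnection connection = open("http://localhost:8081/company/kw/%CB%FE%B5%F5.html");
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

Whether the server expects the decoded or the percent-encoded form of the path depends on the site, so it is worth testing both before settling on one.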
1.3 Fetching the page content as a stream with HttpClient
public static String convertStreamToString(InputStream is)
        throws UnsupportedEncodingException {
    // "gbk" must match the encoding of the page being fetched
    BufferedReader reader = new BufferedReader(new InputStreamReader(is, "gbk"));
    StringBuilder sb = new StringBuilder();
    String line = null;
    try {
        while ((line = reader.readLine()) != null) {
            sb.append(line).append("\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            is.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return sb.toString();
}
// Download the page content (Commons HttpClient 3.x API)
public static String urlContent(String urlString) throws HttpException, IOException {
    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod(urlString);
    try {
        client.executeMethod(get);
        // get.getResponseCharSet() reports the response charset (GBK for this site)
        InputStream iStream = get.getResponseBodyAsStream();
        return convertStreamToString(iStream);
    } finally {
        get.releaseConnection(); // release the connection even if reading fails
    }
}
String url = "http://localhost:8081/company/c-1031646_province-%B9%E3%B6%AB_n-y.html/";
// urlContent() returns HTML text, not a location, so build the parser with createParser
Parser parser = Parser.createParser(urlContent(url), "gbk");
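The stream helper in 1.3 hard-codes the charset. A small variation, sketched here with only the JDK, takes the charset as a parameter (for example the one the response reports) and lets try-with-resources do the closing. The class name StreamText, the method name readAll and the choice to rethrow as UncheckedIOException are illustrative assumptions, not part of the original code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamText {
    /** Reads the whole stream as text in the given charset, then closes it. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader, and with it the stream, on every path
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n'); // line breaks are normalized to '\n'
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An in-memory page stands in for the HTTP response body.
        byte[] page = "<html>\n<body>ok</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

With HttpClient, the second argument could be obtained via Charset.forName(get.getResponseCharSet()), assuming the server declares its charset correctly.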
2. Obtaining a NodeList
2.1 Filtering on a single tag itself
TagNameFilter filter = new TagNameFilter(tag);
NodeList nodeList = parser.parse(filter);
2.2 Filtering on the sibling level (nodes that have a sibling matching the tag)
TagNameFilter filter = new TagNameFilter(tag);
HasSiblingFilter hasSiblingFilter = new HasSiblingFilter(filter);
NodeList nodeList = parser.parse(hasSiblingFilter);
2.3 Filtering on the parent level (nodes that have a child matching the tag)
TagNameFilter filter = new TagNameFilter(tag);
HasChildFilter hasChildFilter = new HasChildFilter(filter);
NodeList nodeList = parser.parse(hasChildFilter);
2.4 Filtering on the child level (nodes whose parent matches the tag)
TagNameFilter filter = new TagNameFilter(tag);
HasParentFilter hasParentFilter = new HasParentFilter(filter);
NodeList nodeList = parser.parse(hasParentFilter);
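3. Combining two tags. The combinators are AndFilter, OrFilter and NotFilter. As above, each can be applied to the tag itself, to the sibling level (HasSiblingFilter), to the parent level (HasChildFilter) and to the child level (HasParentFilter). The filter declarations below are alternatives; pick the one that fits.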
// AND on the tag itself (matches nothing when tag and tagother differ, since a node has only one tag name)
AndFilter filter = new AndFilter(
    new TagNameFilter(tag),
    new TagNameFilter(tagother)
);
// AND on the sibling level
AndFilter filter = new AndFilter(
    new HasSiblingFilter(new TagNameFilter(tag)),
    new HasSiblingFilter(new TagNameFilter(tagother))
);
// AND on the parent level
AndFilter filter = new AndFilter(
    new HasChildFilter(new TagNameFilter(tag)),
    new HasChildFilter(new TagNameFilter(tagother))
);
// AND on the child level
AndFilter filter = new AndFilter(
    new HasParentFilter(new TagNameFilter(tag)),
    new HasParentFilter(new TagNameFilter(tagother))
);
// OR on the tag itself
OrFilter filter = new OrFilter(
    new TagNameFilter(tag),
    new TagNameFilter(tagother)
);
// OR on the sibling level
OrFilter filter = new OrFilter(
    new HasSiblingFilter(new TagNameFilter(tag)),
    new HasSiblingFilter(new TagNameFilter(tagother))
);
// OR on the parent level
OrFilter filter = new OrFilter(
    new HasChildFilter(new TagNameFilter(tag)),
    new HasChildFilter(new TagNameFilter(tagother))
);
// OR on the child level
OrFilter filter = new OrFilter(
    new HasParentFilter(new TagNameFilter(tag)),
    new HasParentFilter(new TagNameFilter(tagother))
);
// NOT, combined through AndFilter: matches tag but not tagother
AndFilter filter = new AndFilter(
    new TagNameFilter(tag),
    new NotFilter(new TagNameFilter(tagother))
);
// Has a sibling matching tag, and is not itself tagother
AndFilter filter = new AndFilter(
    new HasSiblingFilter(new TagNameFilter(tag)),
    new NotFilter(new TagNameFilter(tagother))
);
// Has a child matching tag, and is not itself tagother
AndFilter filter = new AndFilter(
    new HasChildFilter(new TagNameFilter(tag)),
    new NotFilter(new TagNameFilter(tagother))
);
// Has a parent matching tag, and is not itself tagother
AndFilter filter = new AndFilter(
    new HasParentFilter(new TagNameFilter(tag)),
    new NotFilter(new TagNameFilter(tagother))
);
NodeList nodeList = parser.parse(filter);
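The boolean logic behind these combinations can be checked without the library. The sketch below is an analogy only: it models a tag-name filter as a java.util.function.Predicate over plain tag-name strings and does not use the htmlparser API. The class name FilterLogicSketch and its helper methods are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class FilterLogicSketch {
    /** Stand-in for a tag-name filter: accepts one tag name, ignoring case. */
    static Predicate<String> tagName(String name) {
        return candidate -> candidate.equalsIgnoreCase(name);
    }

    /** Returns the tag names accepted by the predicate, in input order. */
    static List<String> select(List<String> tags, Predicate<String> filter) {
        return tags.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tags = List.of("DIV", "SPAN", "A", "DIV");
        Predicate<String> div = tagName("div");
        Predicate<String> span = tagName("span");

        System.out.println(select(tags, div.and(span)));          // AND of two different names
        System.out.println(select(tags, div.or(span)));           // OR of two names
        System.out.println(select(tags, div.and(span.negate()))); // AND combined with NOT
    }
}
```

The first selection is always empty, which mirrors why an AND of two different tag names is only useful when at least one side is a relationship filter such as HasChildFilter.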
4. Filtering by attribute, or by attribute and value
HasAttributeFilter filter = new HasAttributeFilter(attribute);
or
HasAttributeFilter filter = new HasAttributeFilter(attribute, value);
NodeList nodeList = parser.parse(filter);
5. Filtering by node class
NodeFilter filter = new NodeClassFilter(LinkTag.class); // e.g. link tags
or
NodeFilter filter = new NodeClassFilter(TextNode.class); // e.g. text nodes
NodeList nodeList = parser.parse(filter);
Node[] nodes = nodeList.toNodeArray(); // get the result as a Node[] array
6. Extracting data from tables
NodeClassFilter filter = new NodeClassFilter(TableTag.class);
NodeList nodeList = parser.parse(filter);
TableTag tableTag = (TableTag) nodeList.elementAt(0); // the first table on the page
TableRow[] rows = tableTag.getRows();
for (int j = 0; j < rows.length; j++) {
    TableColumn[] td = rows[j].getColumns();
    for (int k = 0; k < td.length; k++) {
        Node first = td[k].getFirstChild();
        if (first instanceof LinkTag) { // the first child may also be text, or missing
            LinkTag lt = (LinkTag) first;
            …… // string handling, database operations
        }
    }
}