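HtmlCleaner is an open-source HTML document parser written in Java. It is quite powerful and easy to use. This post does not cover how to use it; for that, see the official site (http://htmlcleaner.sourceforge.net/javause.php).
This post describes a bug in HtmlCleaner.
Symptoms:
When fetching page content with HtmlCleaner, you can leave the encoding unset if you do not know the page's encoding. The code looks like this: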
    HtmlCleaner cleaner = new HtmlCleaner();
    URL url = new URL("http://www.qq.com/");
    TagNode node = cleaner.clean(url);
HtmlCleaner then detects the page encoding automatically, but its detection misses one case: a page that declares its encoding in the following form.
<meta charset="UTF-8" />
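In this case HtmlCleaner cannot detect the page encoding and falls back to the system encoding. If the system encoding differs from the page encoding, the text comes out garbled.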
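To see why this form is missed, here is a minimal sketch that runs two regular expressions against both kinds of declaration. The first pattern is equivalent to the one in the getCharsetFromContent method shown in this post; the second pattern, the class name MetaCharsetDemo and the helper find are assumptions made for illustration and are not part of HtmlCleaner.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of HtmlCleaner.
class MetaCharsetDemo {
    // Pattern for the older http-equiv declaration (equivalent to the one in getCharsetFromContent).
    static final Pattern HTTP_EQUIV = Pattern.compile(
        "<meta\\s*http-equiv=[\"']content-type[\"']\\s*content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]*)[\"'>]",
        Pattern.CASE_INSENSITIVE);

    // Assumed pattern for the short form <meta charset="...">.
    static final Pattern HTML5 = Pattern.compile(
        "<meta\\s+charset\\s*=\\s*[\"']?([a-z\\d][a-z\\d_\\-]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the captured charset name, or null when the pattern does not match.
    static String find(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String oldForm = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">";
        String newForm = "<meta charset=\"UTF-8\" />";
        System.out.println(find(HTTP_EQUIV, oldForm)); // gb2312
        System.out.println(find(HTTP_EQUIV, newForm)); // null: the short form is not recognized
        System.out.println(find(HTML5, newForm));      // UTF-8
    }
}
```

The http-equiv pattern requires the literal http-equiv and content attributes, so it can never match the short form; a separate pattern is needed for it.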
Solution:
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.htmlcleaner.HtmlCleaner;

    // Tries the HTTP header first, then the http-equiv meta tag, then <meta charset>,
    // and finally falls back to HtmlCleaner's default charset.
    public static String getCharset(URL url) throws Exception {
        URLConnection urlConnection = url.openConnection();
        String charset = getCharsetFromContentTypeString(urlConnection.getHeaderField("Content-Type"));
        if (charset == null) {
            charset = getCharsetFromContent(url);
        }
        if (charset == null) {
            charset = getCharsetFromMeta(url);
        }
        if (charset == null) {
            charset = HtmlCleaner.DEFAULT_CHARSET;
        }
        return charset;
    }

    // Extracts the charset from a Content-Type header such as "text/html; charset=UTF-8".
    public static String getCharsetFromContentTypeString(String contentType) {
        if (contentType == null) {
            return null;
        }
        return findCharset("charset=([a-z\\d][a-z\\d_\\-]*)", contentType);
    }

    // Looks for <meta http-equiv="Content-Type" content="text/html; charset=..."> at the start of the page.
    public static String getCharsetFromContent(URL url) throws IOException {
        String pattern = "<meta\\s*http-equiv=[\"']content-type[\"']\\s*content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d][a-z\\d_\\-]*)[\"'>]";
        return findCharset(pattern, readStart(url));
    }

    // Looks for the short form <meta charset="..."> at the start of the page.
    public static String getCharsetFromMeta(URL url) throws IOException {
        String pattern = "<meta\\s+charset\\s*=\\s*[\"']?([a-z\\d][a-z\\d_\\-]*)";
        return findCharset(pattern, readStart(url));
    }

    // Reads the start of the page with a single read call (at most 2048 bytes) and closes the stream.
    // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable whatever the real encoding is.
    private static String readStart(URL url) throws IOException {
        try (InputStream stream = url.openStream()) {
            byte[] chunk = new byte[2048];
            int bytesRead = stream.read(chunk);
            return bytesRead > 0 ? new String(chunk, 0, bytesRead, StandardCharsets.ISO_8859_1) : "";
        }
    }

    // Returns the first captured charset name if the JVM supports it, otherwise null.
    private static String findCharset(String pattern, String text) {
        Matcher matcher = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(text);
        if (matcher.find() && Charset.isSupported(matcher.group(1))) {
            return matcher.group(1);
        }
        return null;
    }
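Note: getCharsetFromContentTypeString and getCharsetFromContent are methods that come from the HtmlCleaner package; getCharsetFromMeta is the addition that handles the <meta charset> form.
Use the getCharset method to obtain the encoding, and pass the page encoding to HtmlCleaner when cleaning the page: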
    HtmlCleaner cleaner = new HtmlCleaner();
    URL url = new URL("http://www.qq.com/");
    TagNode node = cleaner.clean(url, getCharset(url));
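The detection order used by getCharset (HTTP header first, then the meta tags, then a default) can also be checked without any network access. The following is only a sketch under stated assumptions: the class CharsetSniffer, the method detect and its parameters are hypothetical names, a single relaxed pattern stands in for the two meta patterns, and the page is assumed to use an ASCII-compatible encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of HtmlCleaner.
class CharsetSniffer {
    private static final Pattern HEADER = Pattern.compile(
        "charset=([a-z\\d][a-z\\d_\\-]*)", Pattern.CASE_INSENSITIVE);
    // Relaxed pattern (an assumption): any charset=... inside a <meta> tag,
    // which covers both the http-equiv form and the short form.
    private static final Pattern META = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([a-z\\d][a-z\\d_\\-]*)", Pattern.CASE_INSENSITIVE);

    // contentType: value of the Content-Type header, may be null.
    // head/length: the first bytes of the page. fallback: returned when nothing is found.
    static String detect(String contentType, byte[] head, int length, String fallback) {
        String charset = match(HEADER, contentType);
        if (charset == null && head != null && length > 0) {
            // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
            charset = match(META, new String(head, 0, length, StandardCharsets.ISO_8859_1));
        }
        return charset != null ? charset : fallback;
    }

    // Returns the captured name only if the JVM supports that charset.
    private static String match(Pattern p, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = p.matcher(text);
        return (m.find() && Charset.isSupported(m.group(1))) ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"UTF-8\" /></head>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(null, html, html.length, "ISO-8859-1"));                       // UTF-8
        System.out.println(detect("text/html; charset=ISO-8859-1", html, html.length, "UTF-8")); // ISO-8859-1
    }
}
```

Working on bytes that have already been read means the page only needs to be fetched once, and the returned name can be passed to cleaner.clean(url, charset) in the same way as the result of getCharset.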