While crawling a Chinese website with Nutch 1.7 today, I found that the fetched data came out garbled, and none of the material I found online solved the problem. Digging into the source code, I saw that Nutch parses pages with the HtmlParser class, which contains the following code for sniffing the page encoding:
// NUTCH-1006 Meta equiv with single quotes not accepted
private static Pattern metaPattern = Pattern.compile(
    "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
    Pattern.CASE_INSENSITIVE);
private static Pattern charsetPattern = Pattern.compile(
    "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);
private static String sniffCharacterEncoding(byte[] content) {
  int length = content.length < CHUNK_SIZE ? content.length : CHUNK_SIZE;

  // We don't care about non-ASCII parts so that it's sufficient
  // to just inflate each byte to a 16-bit value by padding.
  // For instance, the sequence {0x41, 0x82, 0xb7} will be turned into
  // {U+0041, U+0082, U+00B7}.
  String str = "";
  try {
    str = new String(content, 0, length, Charset.forName("ASCII").toString());
  } catch (UnsupportedEncodingException e) {
    // code should never come here, but just in case...
    return null;
  }

  Matcher metaMatcher = metaPattern.matcher(str);
  String encoding = null;
  if (metaMatcher.find()) {
    Matcher charsetMatcher = charsetPattern.matcher(metaMatcher.group(1));
    if (charsetMatcher.find())
      encoding = new String(charsetMatcher.group(1));
  }
  return encoding;
}
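As a quick illustration of what this sniffing step extracts, here is a self-contained snippet (not part of Nutch) that runs the same two patterns against a sample page head that declares GBK:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone illustration: run the two patterns quoted above against a
// sample <head> to see what charset value gets sniffed out.
public class SniffDemo {
  private static final Pattern META = Pattern.compile(
      "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
      Pattern.CASE_INSENSITIVE);
  private static final Pattern CHARSET = Pattern.compile(
      "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);

  public static void main(String[] args) {
    String head = "<html><head>"
        + "<meta http-equiv='Content-Type' content='text/html; charset=GBK'>"
        + "</head>";
    Matcher meta = META.matcher(head);
    if (meta.find()) {
      Matcher cs = CHARSET.matcher(meta.group(1));
      if (cs.find()) {
        System.out.println(cs.group(1)); // prints: GBK
      }
    }
  }
}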
After the charset declared in the page has been sniffed, detector.guessEncoding is called to pick the best-matching encoding:
EncodingDetector detector = new EncodingDetector(conf);
detector.autoDetectClues(content, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
String encoding = detector.guessEncoding(content, defaultCharEncoding);
The guessEncoding method of the EncodingDetector class looks like this:
public String guessEncoding(Content content, String defaultValue) {
  /*
   * This algorithm could be replaced by something more sophisticated;
   * ideally we would gather a bunch of data on where various clues
   * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
   * the correct answer, and use machine learning/some statistical method
   * to generate a better heuristic.
   */
  String base = content.getBaseUrl();

  if (LOG.isTraceEnabled()) {
    findDisagreements(base, clues);
  }

  /*
   * Go down the list of encoding "clues". Use a clue if:
   *  1. Has a confidence value which meets our confidence threshold, OR
   *  2. Doesn't meet the threshold, but is the best try,
   *     since nothing else is available.
   */
  EncodingClue defaultClue = new EncodingClue(defaultValue, "default");
  EncodingClue bestClue = defaultClue;

  for (EncodingClue clue : clues) {
    if (LOG.isTraceEnabled()) {
      LOG.trace(base + ": charset " + clue);
    }
    String charset = clue.value;
    if (minConfidence >= 0 && clue.confidence >= minConfidence) {
      if (LOG.isTraceEnabled()) {
        LOG.trace(base + ": Choosing encoding: " + charset
            + " with confidence " + clue.confidence);
      }
      return resolveEncodingAlias(charset).toLowerCase();
    } else if (clue.confidence == NO_THRESHOLD && bestClue == defaultClue) {
      bestClue = clue;
    }
  }

  if (LOG.isTraceEnabled()) {
    LOG.trace(base + ": Choosing encoding: " + bestClue);
  }
  return bestClue.value.toLowerCase();
}
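The selection policy boils down to two rules: take the first clue whose confidence meets minConfidence, otherwise fall back to the first clue that was recorded without a confidence value, and only use the default when no clue qualifies. Here is a stripped-down, standalone sketch of that policy (not Nutch code; the clue values, the confidence numbers, and NO_THRESHOLD = -1 are assumptions made purely for illustration, and the alias resolution step is omitted):

import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the selection loop in guessEncoding(), illustration only.
public class GuessDemo {
  static final int NO_THRESHOLD = -1; // assumed sentinel for "no confidence value"

  static class Clue {
    final String value;
    final int confidence;
    Clue(String value, int confidence) { this.value = value; this.confidence = confidence; }
  }

  static String guess(List<Clue> clues, String defaultValue, int minConfidence) {
    Clue best = new Clue(defaultValue, NO_THRESHOLD);
    boolean bestIsDefault = true;
    for (Clue clue : clues) {
      // Rule 1: a clue that meets the confidence threshold wins immediately.
      if (minConfidence >= 0 && clue.confidence >= minConfidence) {
        return clue.value.toLowerCase();
      }
      // Rule 2: otherwise remember the first "no threshold" clue as the fallback.
      if (clue.confidence == NO_THRESHOLD && bestIsDefault) {
        best = clue;
        bestIsDefault = false;
      }
    }
    return best.value.toLowerCase();
  }

  public static void main(String[] args) {
    List<Clue> clues = Arrays.asList(
        new Clue("GB18030", 40),         // hypothetical autodetected clue below the threshold
        new Clue("GBK", NO_THRESHOLD));  // hypothetical sniffed <meta> clue
    System.out.println(guess(clues, "utf-8", 50)); // prints: gbk
  }
}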
Debugging showed that the charset declared in the page is GBK, but the encoding that finally comes back is GB18030. The cause is EncodingDetector's default setup, whose alias table makes pages labelled GBK be parsed as GB18030:
static {
  DETECTABLES.add("text/html");
  DETECTABLES.add("text/plain");
  DETECTABLES.add("text/richtext");
  DETECTABLES.add("text/rtf");
  DETECTABLES.add("text/sgml");
  DETECTABLES.add("text/tab-separated-values");
  DETECTABLES.add("text/xml");
  DETECTABLES.add("application/rss+xml");
  DETECTABLES.add("application/xhtml+xml");
  /*
   * the following map is not an alias mapping table, but
   * maps character encodings which are often used in mislabelled
   * documents to their correct encodings. For instance,
   * there are a lot of documents labelled 'ISO-8859-1' which contain
   * characters not covered by ISO-8859-1 but covered by windows-1252.
   * Because windows-1252 is a superset of ISO-8859-1 (sharing code points
   * for the common part), it's better to treat ISO-8859-1 as
   * synonymous with windows-1252 than to reject, as invalid, documents
   * labelled as ISO-8859-1 that have characters outside ISO-8859-1.
   */
  ALIASES.put("ISO-8859-1", "windows-1252");
  ALIASES.put("EUC-KR", "x-windows-949");
  ALIASES.put("x-EUC-CN", "GB18030");
  ALIASES.put("GBK", "GB18030");
  //ALIASES.put("Big5", "Big5HKSCS");
  //ALIASES.put("TIS620", "Cp874");
  //ALIASES.put("ISO-8859-11", "Cp874");
}
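The practical effect of that map on this case can be seen with a small standalone sketch (this only illustrates the alias lookup; it is not Nutch's actual resolveEncodingAlias implementation):

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Illustration of the GBK -> GB18030 substitution shown in the map above.
public class AliasDemo {
  public static void main(String[] args) {
    Map<String, String> aliases = new HashMap<String, String>();
    aliases.put("GBK", "GB18030");

    String sniffed = "GBK";                              // charset declared by the page
    String canonical = Charset.forName(sniffed).name();  // "GBK" on a standard JDK
    String resolved = aliases.containsKey(canonical)
        ? aliases.get(canonical) : canonical;

    // Prints: GB18030 -- the encoding the parse ends up labelled with,
    // even though the page itself declared GBK.
    System.out.println(resolved);
  }
}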
Modify the code as follows:
static {
  DETECTABLES.add("text/html");
  DETECTABLES.add("text/plain");
  DETECTABLES.add("text/richtext");
  DETECTABLES.add("text/rtf");
  DETECTABLES.add("text/sgml");
  DETECTABLES.add("text/tab-separated-values");
  DETECTABLES.add("text/xml");
  DETECTABLES.add("application/rss+xml");
  DETECTABLES.add("application/xhtml+xml");
  /*
   * the following map is not an alias mapping table, but
   * maps character encodings which are often used in mislabelled
   * documents to their correct encodings. For instance,
   * there are a lot of documents labelled 'ISO-8859-1' which contain
   * characters not covered by ISO-8859-1 but covered by windows-1252.
   * Because windows-1252 is a superset of ISO-8859-1 (sharing code points
   * for the common part), it's better to treat ISO-8859-1 as
   * synonymous with windows-1252 than to reject, as invalid, documents
   * labelled as ISO-8859-1 that have characters outside ISO-8859-1.
   */
  ALIASES.put("ISO-8859-1", "windows-1252");
  ALIASES.put("EUC-KR", "x-windows-949");
  ALIASES.put("x-EUC-CN", "GB18030");
  ALIASES.put("GBK", "GBK");
  //ALIASES.put("Big5", "Big5HKSCS");
  //ALIASES.put("TIS620", "Cp874");
  //ALIASES.put("ISO-8859-11", "Cp874");
}
With this change, the garbled-text problem is solved.
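Since EncodingDetector lives in the Nutch source tree, the change only takes effect after rebuilding (a 1.7 source checkout builds with ant) and re-parsing the affected content. As a quick sanity check on the result, the raw fetched bytes can be round-tripped through the chosen charset; the snippet below is only a sketch with made-up sample bytes, not part of Nutch:

import java.io.UnsupportedEncodingException;

// Quick sanity check: decode raw bytes with a candidate charset and re-encode;
// if the bytes survive the round trip, the charset is at least plausible.
public class RoundTripCheck {
  static boolean roundTrips(byte[] raw, String charset) throws UnsupportedEncodingException {
    String decoded = new String(raw, charset);
    return java.util.Arrays.equals(raw, decoded.getBytes(charset));
  }

  public static void main(String[] args) throws Exception {
    byte[] raw = "中文测试".getBytes("GBK"); // stand-in for raw page bytes
    System.out.println("GBK:   " + roundTrips(raw, "GBK"));   // true
    System.out.println("UTF-8: " + roundTrips(raw, "UTF-8")); // false: decode corrupts the bytes
  }
}

A charset that corrupts the bytes fails the round trip, so this is a cheap way to rule out obviously wrong decodes on pages that still look garbled.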