Garbled text when crawling Chinese websites with Nutch


    Today, while crawling a Chinese website with Nutch 1.7, I found the fetched data came back garbled, and none of the material I found online solved it. Looking at the source, Nutch parses pages with the HtmlParser class, which contains the code that sniffs the page's character encoding:

 

 // NUTCH-1006 Meta equiv with single quotes not accepted
  private static Pattern metaPattern =
    Pattern.compile("<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
                    Pattern.CASE_INSENSITIVE);
  private static Pattern charsetPattern =
    Pattern.compile("charset=\\s*([a-z][_\\-0-9a-z]*)",
                    Pattern.CASE_INSENSITIVE);

 

private static String sniffCharacterEncoding(byte[] content) {
    int length = content.length < CHUNK_SIZE ? 
                 content.length : CHUNK_SIZE;

    // We don't care about non-ASCII parts so that it's sufficient
    // to just inflate each byte to a 16-bit value by padding. 
    // For instance, the sequence {0x41, 0x82, 0xb7} will be turned into 
    // {U+0041, U+0082, U+00B7}. 
    String str = "";
    try {
      str = new String(content, 0, length,
                       Charset.forName("ASCII").toString());
    } catch (UnsupportedEncodingException e) {
      // code should never come here, but just in case... 
      return null;
    }

    Matcher metaMatcher = metaPattern.matcher(str);
    String encoding = null;
    if (metaMatcher.find()) {
      Matcher charsetMatcher = charsetPattern.matcher(metaMatcher.group(1));
      if (charsetMatcher.find()) 
        encoding = new String(charsetMatcher.group(1));
    }

    return encoding;
  }
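
 To see what this sniffing returns for a typical Chinese page, here is a minimal standalone sketch (the class name SniffDemo and the sample HTML are my own, not Nutch code) that runs the same two patterns over a meta tag declaring GBK:

import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SniffDemo {
  // same patterns as in HtmlParser
  private static final Pattern metaPattern =
      Pattern.compile("<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
                      Pattern.CASE_INSENSITIVE);
  private static final Pattern charsetPattern =
      Pattern.compile("charset=\\s*([a-z][_\\-0-9a-z]*)",
                      Pattern.CASE_INSENSITIVE);

  public static void main(String[] args) {
    // head of a typical Chinese page declaring GBK
    byte[] content =
        ("<html><head><meta http-equiv='Content-Type' "
         + "content='text/html; charset=GBK'></head><body></body></html>")
        .getBytes(StandardCharsets.ISO_8859_1);

    // inflate the bytes to chars, as sniffCharacterEncoding does
    String str = new String(content, StandardCharsets.US_ASCII);

    Matcher metaMatcher = metaPattern.matcher(str);
    if (metaMatcher.find()) {
      Matcher charsetMatcher = charsetPattern.matcher(metaMatcher.group(1));
      if (charsetMatcher.find()) {
        System.out.println(charsetMatcher.group(1)); // prints: GBK
      }
    }
  }
}

 So the sniffed clue itself is correct: the method hands "GBK" to the EncodingDetector.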

 After the charset has been sniffed from the page, the best-matching encoding is chosen through detector.guessEncoding:

 

 

EncodingDetector detector = new EncodingDetector(conf);
detector.autoDetectClues(content, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
String encoding = detector.guessEncoding(content, defaultCharEncoding);
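
 For reference, the defaultCharEncoding passed in above comes from the Nutch configuration; if I read the 1.7 source correctly, HtmlParser picks it up like this, with windows-1252 as the fallback:

defaultCharEncoding = getConf().get("parser.character.encoding.default", "windows-1252");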

 The guessEncoding method of the EncodingDetector class looks like this:

public String guessEncoding(Content content, String defaultValue) {
    /*
     * This algorithm could be replaced by something more sophisticated;
     * ideally we would gather a bunch of data on where various clues
     * (autodetect, HTTP headers, HTML meta tags, etc.) disagree, tag each with
     * the correct answer, and use machine learning/some statistical method
     * to generate a better heuristic.
     */

    String base = content.getBaseUrl();

    if (LOG.isTraceEnabled()) {
      findDisagreements(base, clues);
    }

    /*
     * Go down the list of encoding "clues". Use a clue if:
     *  1. Has a confidence value which meets our confidence threshold, OR
     *  2. Doesn't meet the threshold, but is the best try,
     *     since nothing else is available.
     */
    EncodingClue defaultClue = new EncodingClue(defaultValue, "default");
    EncodingClue bestClue = defaultClue;

    for (EncodingClue clue : clues) {
      if (LOG.isTraceEnabled()) {
        LOG.trace(base + ": charset " + clue);
      }
      String charset = clue.value;
      if (minConfidence >= 0 && clue.confidence >= minConfidence) {
        if (LOG.isTraceEnabled()) {
          LOG.trace(base + ": Choosing encoding: " + charset +
                    " with confidence " + clue.confidence);
        }
        return resolveEncodingAlias(charset).toLowerCase();
      } else if (clue.confidence == NO_THRESHOLD && bestClue == defaultClue) {
        bestClue = clue;
      }
    }

    if (LOG.isTraceEnabled()) {
      LOG.trace(base + ": Choosing encoding: " + bestClue);
    }
    return bestClue.value.toLowerCase();
  }

 Debugging showed that the charset declared in the page is GBK, yet the encoding finally returned is GB18030. The cause is EncodingDetector's default alias table, which resolves GBK to GB18030:

static {
    DETECTABLES.add("text/html");
    DETECTABLES.add("text/plain");
    DETECTABLES.add("text/richtext");
    DETECTABLES.add("text/rtf");
    DETECTABLES.add("text/sgml");
    DETECTABLES.add("text/tab-separated-values");
    DETECTABLES.add("text/xml");
    DETECTABLES.add("application/rss+xml");
    DETECTABLES.add("application/xhtml+xml");
    /*
     * the following map is not an alias mapping table, but
     * maps character encodings which are often used in mislabelled
     * documents to their correct encodings. For instance,
     * there are a lot of documents labelled 'ISO-8859-1' which contain
     * characters not covered by ISO-8859-1 but covered by windows-1252.
     * Because windows-1252 is a superset of ISO-8859-1 (sharing code points
     * for the common part), it's better to treat ISO-8859-1 as
     * synonymous with windows-1252 than to reject, as invalid, documents
     * labelled as ISO-8859-1 that have characters outside ISO-8859-1.
     */
    ALIASES.put("ISO-8859-1", "windows-1252");
    ALIASES.put("EUC-KR", "x-windows-949");
    ALIASES.put("x-EUC-CN", "GB18030");
    ALIASES.put("GBK", "GB18030");
    //ALIASES.put("Big5", "Big5HKSCS");
    //ALIASES.put("TIS620", "Cp874");
    //ALIASES.put("ISO-8859-11", "Cp874");

  }
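
 To confirm where the substitution happens, here is a simplified standalone mirror of the alias lookup that resolveEncodingAlias performs (the class AliasDemo is mine; only the two entries relevant to Chinese pages are copied from the table above):

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class AliasDemo {
  private static final Map<String, String> ALIASES = new HashMap<String, String>();
  static {
    // copied from EncodingDetector's default table
    ALIASES.put("x-EUC-CN", "GB18030");
    ALIASES.put("GBK", "GB18030");
  }

  // simplified sketch of EncodingDetector.resolveEncodingAlias
  static String resolveEncodingAlias(String encoding) {
    if (encoding == null || !Charset.isSupported(encoding)) {
      return null;
    }
    String canonical = Charset.forName(encoding).name(); // "GBK" stays "GBK"
    return ALIASES.containsKey(canonical) ? ALIASES.get(canonical) : canonical;
  }

  public static void main(String[] args) {
    System.out.println(resolveEncodingAlias("GBK"));   // GB18030 with the stock table
    System.out.println(resolveEncodingAlias("UTF-8")); // UTF-8, no alias applies
  }
}

 So even though the sniffed clue is GBK, the value returned by guessEncoding comes out as gb18030 (it is also lower-cased on the way out).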

 Modify the code so that GBK is no longer remapped:

  static {
    DETECTABLES.add("text/html");
    DETECTABLES.add("text/plain");
    DETECTABLES.add("text/richtext");
    DETECTABLES.add("text/rtf");
    DETECTABLES.add("text/sgml");
    DETECTABLES.add("text/tab-separated-values");
    DETECTABLES.add("text/xml");
    DETECTABLES.add("application/rss+xml");
    DETECTABLES.add("application/xhtml+xml");
    /*
     * the following map is not an alias mapping table, but
     * maps character encodings which are often used in mislabelled
     * documents to their correct encodings. For instance,
     * there are a lot of documents labelled 'ISO-8859-1' which contain
     * characters not covered by ISO-8859-1 but covered by windows-1252.
     * Because windows-1252 is a superset of ISO-8859-1 (sharing code points
     * for the common part), it's better to treat ISO-8859-1 as
     * synonymous with windows-1252 than to reject, as invalid, documents
     * labelled as ISO-8859-1 that have characters outside ISO-8859-1.
     */
    ALIASES.put("ISO-8859-1", "windows-1252");
    ALIASES.put("EUC-KR", "x-windows-949");
    ALIASES.put("x-EUC-CN", "GB18030");
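    // changed: keep GBK as GBK instead of remapping it to GB18030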
    ALIASES.put("GBK", "GBK");
    //ALIASES.put("Big5", "Big5HKSCS");
    //ALIASES.put("TIS620", "Cp874");
    //ALIASES.put("ISO-8859-11", "Cp874");

  }

 With this change the garbled-text problem is solved.

 
