Java charset detector

https://code.google.com/p/juniversalchardet/downloads/list

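juniversalchardet is a Java port of Mozilla's automatic encoding detection library (the original is written in C++), and its detection accuracy is high.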

Check out a read-only copy of the code via svn:

# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout http://juniversalchardet.googlecode.com/svn/trunk/ juniversalchardet-read-only

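package myjava;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws IOException {
        String folder = "/home/hadoop/test/charset/";
        File[] files = new File(folder).listFiles();
        if (files == null) {
            // listFiles() returns null when the path is not a readable directory
            System.err.println("Not a readable directory: " + folder);
            return;
        }
        for (File file : files) {
            // Skip subdirectories; only regular files can be opened for reading
            if (file.isFile()) {
                detectCharset(file.getAbsolutePath());
            }
        }
    }

    static void detectCharset(String fileName) throws IOException {
        byte[] buf = new byte[4096];

        // (1) Construct a detector; null means no listener callback is registered
        UniversalDetector detector = new UniversalDetector(null);

        // try-with-resources closes the stream even if reading fails
        try (FileInputStream fis = new FileInputStream(fileName)) {
            // (2) Feed data until the detector is done or the file ends
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }

        // (3) Tell the detector that there is no more data
        detector.dataEnd();

        // (4) Read the result; null means no charset could be determined
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector before reusing it
        detector.reset();
    }
}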

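For better results, this can be combined with another Java charset detection library, such as cpdetector (linked below), because on short texts the detection method above may fail to reach a conclusion.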

Also, because this algorithm comes from Mozilla, it should work better when detecting the encoding of HTML and other markup files.

http://cpdetector.sourceforge.net/usage.shtml
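
The idea of falling back to a second detector when the first one returns no result can be sketched without relying on either library's API. The following is only a minimal sketch under assumptions: `FallbackCharsetDetector`, `CharsetGuesser`, `BOM_GUESSER` and `STRICT_UTF8_GUESSER` are hypothetical names introduced for illustration, and the two guessers use only the JDK as stand-ins for real detectors. In practice the first entry in the chain would presumably wrap `UniversalDetector` and a later one would wrap the second library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class FallbackCharsetDetector {

    /** Hypothetical interface: returns a charset name, or null if undecided. */
    public interface CharsetGuesser {
        String guess(byte[] data);
    }

    /**
     * Stand-in detector 1: recognizes a byte order mark (BOM) only.
     * Simplification: UTF-32 BOMs are not distinguished from UTF-16 ones.
     */
    public static final CharsetGuesser BOM_GUESSER = data -> {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    };

    /** Stand-in detector 2: accepts non-empty data only if it is strictly valid UTF-8. */
    public static final CharsetGuesser STRICT_UTF8_GUESSER = data -> {
        if (data.length == 0) {
            return null;
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return null;
        }
    };

    /** Tries each guesser in order and returns the first non-null answer. */
    public static String detect(byte[] data, List<CharsetGuesser> guessers) {
        for (CharsetGuesser guesser : guessers) {
            String name = guesser.guess(data);
            if (name != null) {
                return name;
            }
        }
        return null; // every guesser was undecided
    }

    public static void main(String[] args) {
        List<CharsetGuesser> chain = Arrays.asList(BOM_GUESSER, STRICT_UTF8_GUESSER);
        // Two CJK characters encoded as UTF-8, without a BOM
        byte[] shortText = "\u77ed\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(shortText, chain)); // prints UTF-8
    }
}
```

The order of the chain matters: the detector that is trusted most goes first, and later entries are consulted only when the earlier ones return null.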
