rchardet is a port of python-chardet, which is itself a port of the encoding auto-detection implementation in the Mozilla browser. For details, see: http://nextlib.lifegoo.com/user/sishen/article/2605 : A composite approach to language/encoding detection
Installation:
$ gem install rchardet
Usage:
$ irb -rubygems
irb(main):001:0> require 'rchardet'
=> true
irb(main):002:0> CharDet.detect("\xA4\xCF")
=> {"encoding"=>"EUC-JP", "confidence"=>0.99}
irb(main):003:0> CharDet.detect("中国")
=> {"encoding"=>"utf-8", "confidence"=>0.7525}
For a web page, send an HTTP request to get the raw response data, then pass that data to rchardet's detect.
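
The web page case could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names fetch_raw and detect_encoding and the example URL are hypothetical, and it assumes the page is fetched with Ruby's standard Net::HTTP and that CharDet.detect returns the "encoding" and "confidence" keys shown in the irb session above.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch the raw, undecoded response body for a URL.
def fetch_raw(url)
  Net::HTTP.get(URI.parse(url))
end

# Hypothetical helper: run a detector over raw bytes and return
# [encoding, confidence]. When no detector is given, it falls back to
# rchardet's CharDet.detect (assumes the rchardet gem is installed).
def detect_encoding(raw, detector = nil)
  detector ||= begin
    require 'rchardet'
    CharDet.method(:detect)
  end
  result = detector.call(raw)
  [result['encoding'], result['confidence']]
end

# Usage (needs network access and the rchardet gem; the URL is a placeholder):
#   raw = fetch_raw('http://www.example.com/')
#   encoding, confidence = detect_encoding(raw)
#   puts "#{encoding} (confidence #{confidence})"
```

The detector is passed in as an optional argument only so the helper can be exercised without the gem or a network connection; in normal use the default CharDet.detect is what runs.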