Jsoup: Fixing Garbled Text When Scraping Web Pages by Converting GB2312 to UTF-8

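A recent project of mine needs to scrape stock announcements and store them in MongoDB for display. While using jsoup to scrape announcements from Sina Finance, I ran into garbled text. The page is http://vip.stock.finance.sina.com.cn/corp/view/vCB_AllBulletinDetail.php?stockid=600958&id=3735125. Opening the browser console shows that Sina Finance serves the page in GB2312, whereas MongoDB uses UTF-8 by default. In my case Simplified Chinese could actually be stored without any conversion, but Traditional Chinese and special characters came out garbled. So I wrote a small program to unify the encoding. The code is as follows: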

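import java.io.InputStream;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the announcement text
public String getAnnouncementFromSina() {
	String text = "";
	String url = "http://vip.stock.finance.sina.com.cn/corp/view/vCB_AllBulletinDetail.php?stockid=600958&id=3735125";
	// try-with-resources closes the stream even if parsing fails
	try (InputStream in = new URL(url).openStream()) {
		// Decode the response as GBK, which is a superset of GB2312
		Document doc = Jsoup.parse(in, "GBK", url);
		Element element = doc.select("div#content").first().getElementsByTag("pre").first();

		// Call the conversion method
		text = getUTF8BytesFromGBKString(element.text());
	} catch (Exception e) {
		// Also reached when the expected elements are missing (NullPointerException)
		e.printStackTrace();
		return null;
	}
	return text;
}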
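
The key step in the method above is telling the parser which charset the response bytes are in. As a minimal, network-free sketch of that idea, the snippet below decodes the same bytes once as GBK and once as UTF-8. The class name `GbkDecodeSketch`, its helper methods, and the sample text are illustrative assumptions rather than part of the original project, and the JVM is assumed to ship the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original project.
class GbkDecodeSketch {
    static final Charset GBK = Charset.forName("GBK");

    // Simulates the raw body of a GBK-encoded page for the given text.
    static byte[] gbkBody(String text) {
        return text.getBytes(GBK);
    }

    // Decodes raw bytes with the given charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String sample = "\u80A1\u7968\u516C\u544A"; // "stock announcement" in Chinese
        byte[] body = gbkBody(sample);

        // Correct charset: the original text comes back.
        System.out.println(decode(body, GBK).equals(sample));                    // true
        // Wrong charset: the bytes are not valid UTF-8 for this text.
        System.out.println(decode(body, StandardCharsets.UTF_8).equals(sample)); // false
    }
}
```

Once the bytes have been decoded with the right charset, the resulting `String` is plain Unicode and no longer carries any trace of GBK.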


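// Manual UTF-8 encoding followed by a decode back to a String.
// Note: gbkStr is already a decoded Java (UTF-16) string, not raw GBK bytes.
// The conversion is lossy only for unpaired surrogates, which decode to U+FFFD.
public String getUTF8BytesFromGBKString(String gbkStr) {
    int n = gbkStr.length();
    // 3 bytes per UTF-16 code unit is enough: a surrogate pair (2 units) needs only 4 bytes
    byte[] utfBytes = new byte[3 * n];
    int k = 0;
    for (int i = 0; i < n; i++) {
        int m = gbkStr.charAt(i);
        if (m < 0x80) {
            // 1 byte: ASCII
            utfBytes[k++] = (byte) m;
        } else if (m < 0x800) {
            // 2 bytes: U+0080..U+07FF (a 3-byte form here would be an invalid overlong encoding)
            utfBytes[k++] = (byte) (0xc0 | (m >> 6));
            utfBytes[k++] = (byte) (0x80 | (m & 0x3f));
        } else if (Character.isHighSurrogate((char) m) && i + 1 < n
                && Character.isLowSurrogate(gbkStr.charAt(i + 1))) {
            // 4 bytes: supplementary character stored as a surrogate pair
            int cp = Character.toCodePoint((char) m, gbkStr.charAt(++i));
            utfBytes[k++] = (byte) (0xf0 | (cp >> 18));
            utfBytes[k++] = (byte) (0x80 | ((cp >> 12) & 0x3f));
            utfBytes[k++] = (byte) (0x80 | ((cp >> 6) & 0x3f));
            utfBytes[k++] = (byte) (0x80 | (cp & 0x3f));
        } else {
            // 3 bytes: the rest of the BMP, which includes most Chinese characters
            utfBytes[k++] = (byte) (0xe0 | (m >> 12));
            utfBytes[k++] = (byte) (0x80 | ((m >> 6) & 0x3f));
            utfBytes[k++] = (byte) (0x80 | (m & 0x3f));
        }
    }
    // Decode only the k bytes that were actually written
    return new String(utfBytes, 0, k, java.nio.charset.StandardCharsets.UTF_8);
}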
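
For comparison, the standard library can perform the same encoding through `String.getBytes(StandardCharsets.UTF_8)`. The sketch below shows how many bytes each kind of character takes, which is the case analysis any hand-written encoder has to cover. The class name `Utf8LengthSketch` and its methods are illustrative assumptions, not part of the original project.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original project.
class Utf8LengthSketch {
    // Number of bytes the JDK's UTF-8 encoder produces for the string.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Encodes to UTF-8 and decodes back again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1 (ASCII)
        System.out.println(utf8Length("\u00B7"));       // 2 (middle dot, U+00B7)
        System.out.println(utf8Length("\u80A1"));       // 3 (a common Chinese character)
        System.out.println(utf8Length("\uD842\uDFB7")); // 4 (U+20BB7, a surrogate pair)

        String mixed = "A\u00B7\u80A1\uD842\uDFB7";
        System.out.println(roundTrip(mixed).equals(mixed)); // true
    }
}
```

A well-formed string survives the round trip unchanged, so when the goal is only to hand a `String` to a driver that stores UTF-8, the built-in encoder is usually sufficient.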
If you spot any mistakes, corrections are welcome!