Encoding problems with URLConnection

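A while ago I ran into encoding problems while scraping web pages, so I wrote a method that checks a page's encoding before each fetch. The code is as follows: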

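import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// `log` and `HttpClient.encode` (the default encoding) come from the enclosing class.
private static String getEncode(String strUrl) {
    String encode = HttpClient.encode; // returned when nothing can be detected
    BufferedReader reader = null;
    HttpURLConnection con = null;
    try {
        log.debug("Checking encoding of URL: " + strUrl);
        URL url = new URL(strUrl);
        con = (HttpURLConnection) url.openConnection();
        // Timeouts must be set before the connection is opened, otherwise they have no effect.
        con.setConnectTimeout(5 * 1000);
        con.setReadTimeout(10 * 1000);

        // getContentEncoding() returns the Content-Encoding header (e.g. "gzip"),
        // not the character set. The charset is a parameter of Content-Type.
        String contentType = con.getContentType();
        log.debug("Content-Type: " + contentType);
        if (contentType != null) {
            Matcher hm = Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)",
                    Pattern.CASE_INSENSITIVE).matcher(contentType);
            if (hm.find()) {
                return hm.group(1);
            }
        }

        // Fall back to the <meta http-equiv="Content-Type"> tag in the page itself.
        // ISO-8859-1 maps every byte to one char, so the ASCII markup stays readable
        // for any ASCII-compatible page encoding.
        reader = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.ISO_8859_1));
        String reg = "meta http-equiv=\"Content-Type\" content=\".*?charset=(.*?)\"";
        Pattern p = Pattern.compile(reg, Pattern.CASE_INSENSITIVE);

        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                encode = m.group(1);
                log.debug("code:" + encode);
                break;
            }
        }
    } catch (Exception e) {
        log.error(e.getMessage(), e);
    } finally {
        // Close each resource separately so a failure (or a null stream)
        // does not prevent the connection from being released.
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException ignored) {
                // nothing useful to do here
            }
        }
        if (con != null) {
            con.disconnect();
        }
    }
    return encode;
}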
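
The pattern in getEncode only matches the `http-equiv` form of the meta tag. Many pages declare the encoding with the shorter `<meta charset="...">` form, and an HTTP response declares it as the `charset` parameter of the Content-Type header. Below is a minimal sketch of how both lookups could be separated from the network code so they are easy to test. The class name `CharsetSniffer` and its method names are made up for this illustration and are not part of the original code. It also assumes that a name the JVM does not support should be treated as "not found".

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative only.
class CharsetSniffer {

    // charset parameter of a Content-Type value, e.g. "text/html; charset=GBK"
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Covers <meta charset="..."> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type header value, or null. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        return firstSupported(HEADER_CHARSET.matcher(contentType));
    }

    /** Returns the charset declared by a meta tag in the given HTML, or null. */
    static String fromMeta(String html) {
        if (html == null) {
            return null;
        }
        return firstSupported(META_CHARSET.matcher(html));
    }

    // Returns the first matched name that this JVM can actually decode.
    private static String firstSupported(Matcher m) {
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: skip it and keep looking.
            }
        }
        return null;
    }
}
```

With this helper, `CharsetSniffer.fromContentType(con.getContentType())` could be tried first, and `CharsetSniffer.fromMeta(...)` applied to the first part of the page only when the header carries no charset. Both methods return the name exactly as the server or page wrote it, so `fromMeta("<meta charset=\"utf-8\">")` gives `utf-8` in lower case.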
