Fixing garbled text when downloading web pages

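I've seen many people run into garbled text when downloading web pages, and many have proposed fixes, but most of them strike me as workarounds rather than proper solutions, for example the very common approach of pulling the charset out of the page with a regular expression. We can instead rely on HttpWebRequest and HttpWebResponse. The code is as follows: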

using System.IO;
using System.Net;
using System.Text;

private static string DownloadHtml(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Timeout = 600000; // milliseconds (10 minutes)
    request.AllowAutoRedirect = true;
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2";

    // The using blocks release the response, stream, and reader
    // even if reading throws.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    // Decode the body with the charset the server reported.
    using (StreamReader srHtml = new StreamReader(stream,
        Encoding.GetEncoding(response.CharacterSet)))
    {
        return srHtml.ReadToEnd();
    }
}

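The page's encoding is already available in response.CharacterSet, which is taken from the charset parameter of the Content-Type response header, so there is no need to extract it with a regular expression.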
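
One caveat: a server may omit the charset or send a name the runtime does not recognize, and Encoding.GetEncoding can throw in that case. Below is a minimal sketch of a defensive wrapper, assuming UTF-8 is an acceptable default. CharsetHelper and ResolveEncoding are hypothetical names introduced for illustration; they are not part of the original code.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper: maps a charset name (such as the value of
    // response.CharacterSet) to an Encoding, falling back to UTF-8
    // (an assumed default) when the name is missing or unrecognized.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }

        try
        {
            // Some servers wrap the value in quotes, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Thrown for names that are invalid or unsupported on this platform.
            return Encoding.UTF8;
        }
    }
}
```

In DownloadHtml this would replace the direct call, i.e. new StreamReader(stream, CharsetHelper.ResolveEncoding(response.CharacterSet)).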
