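HtmlAgilityPack is an open-source HTML parser written in C#. Some aspects of its design are less than ideal, however: fetching a Chinese web page in the usual way often yields garbled text. Take the home page of Xinhua News (新华网) as an example. Modeled on the HtmlAgilityPack samples, the fetching code looks like this: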
HtmlWeb hw = new HtmlWeb();
string url = @"";
HtmlDocument doc = hw.Load(url);
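After tracing through the maze of HtmlAgilityPack's source code, I found that the problem lies in the Get(Uri uri, string method, string path, HtmlDocument doc) method of the HtmlWeb class. That method contains the following code: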
HttpWebResponse resp;
try
{
    resp = req.GetResponse() as HttpWebResponse;
}
……
if ((resp.ContentEncoding != null) && (resp.ContentEncoding.Length > 0))
{
    respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding);
}
else
{
    respenc = null;
}
……
Stream s = resp.GetResponseStream();
if (s != null)
{
    if (UsingCache)
    {
        // NOTE: LastModified does not contain milliseconds, so we remove them to the file
        SaveStream(s, cachePath, RemoveMilliseconds(resp.LastModified), _streamBufferSize);

        // save headers
        SaveCacheHeaders(req.RequestUri, resp);

        if (path != null)
        {
            // copy and touch the file
            IOLibrary.CopyAlways(cachePath, path);
            File.SetLastWriteTime(path, File.GetLastWriteTime(cachePath));
        }
    }
    else
    {
        // try to work in-memory
        if ((doc != null) && (html))
        {
            if (respenc != null)
            {
                doc.Load(s, respenc);
            }
            else
            {
                doc.Load(s, true);
            }
        }
    }
}
resp.Close();
}
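Here resp is the response to the HTTP request. Setting a breakpoint shows that resp.ContentEncoding is empty, so respenc is null and the final load falls through to doc.Load(s, true), where the second argument asks the loader to detect the encoding from byte order marks. That Load overload may itself be at fault, and the end result is garbled text.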
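One detail worth checking at that breakpoint: as far as I know, HttpWebResponse.ContentEncoding mirrors the Content-Encoding header (compression such as gzip), while the charset named in the Content-Type header is exposed separately as HttpWebResponse.CharacterSet. The sketch below shows one way a charset name could be turned into an Encoding, returning null for "unknown" in the same spirit as respenc above. CharsetHelper and TryGetEncoding are hypothetical names for illustration, not part of HtmlAgilityPack.

```csharp
using System;
using System.Text;

static class CharsetHelper
{
    // Hypothetical helper (not part of HtmlAgilityPack): maps a charset name,
    // such as the value of HttpWebResponse.CharacterSet, to an Encoding.
    // Returns null when the name is missing or unknown, so the caller can fall back.
    public static Encoding TryGetEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return null;

        try
        {
            // Some servers wrap the charset value in quotes; strip them first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Thrown for names this runtime does not recognise or support.
            return null;
        }
    }
}
```

A caller might write Encoding enc = CharsetHelper.TryGetEncoding(resp.CharacterSet); and pass enc to doc.Load(stream, enc) only when it is not null. On .NET Core and later, legacy code pages such as gb2312 are likely to need the code-pages encoding provider registered before GetEncoding can resolve them.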
// Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
HttpWebRequest req = WebRequest.Create(new Uri(@"")) as HttpWebRequest;
req.Method = "GET";
WebResponse rs = req.GetResponse();
Stream rss = rs.GetResponseStream();
try
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(rss);              // no encoding argument is passed here
    doc.Save("output.html");
}
catch (Exception e)
{
    Console.WriteLine(e.Message);
    Console.WriteLine(e.StackTrace);
}
finally
{
    rs.Close();                 // release the connection
}
In the code above, doc.Load(…) uses System.Text.Encoding.Default, which on my machine is gb2312.
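Why the choice of encoding matters can be shown without any network access. The sketch below decodes the same bytes twice, once with the encoding they were written in and once with a different one. UTF-8 and ISO-8859-1 are used only as stand-ins because every .NET runtime ships them; the page in this article involves gb2312 instead. DecodeDemo and ReadAll are hypothetical names. Note also that, to my knowledge, Encoding.Default is always UTF-8 on .NET Core and later, so relying on it is not portable.

```csharp
using System;
using System.IO;
using System.Text;

static class DecodeDemo
{
    // Reads a stream to the end as text using the given encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // "新华网" written with escapes so the source file's own encoding is irrelevant.
        const string text = "\u65b0\u534e\u7f51";
        byte[] bytes = Encoding.UTF8.GetBytes(text);

        // Decoding with the encoding the bytes were written in round-trips.
        string right = ReadAll(new MemoryStream(bytes), Encoding.UTF8);

        // Decoding with a different encoding yields one wrong character per byte.
        string wrong = ReadAll(new MemoryStream(bytes), Encoding.GetEncoding("iso-8859-1"));

        Console.WriteLine(right == text);
        Console.WriteLine(wrong == text);
    }
}
```

The first comparison holds and the second does not, which is the same mismatch that produces garbled output when a gb2312 page is read with the wrong encoding.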
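HtmlDocument can also load a stream with an explicitly specified encoding. There are two ways to obtain that encoding:
(1) read the charset declared for the HTML from the HttpWebResponse object;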