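To keep the implementation simple, no extra request information is attached, such as the browser details (User-Agent) normally sent in the request headers.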
I. Using the GET method
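The GET method is probably the simplest and easiest to work with. Take a user home page on Kaixin001 (开心网) as an example. These pages all share the URL pattern http://www.kaixin001.com/home/?uid=xxxxxxx, where xxxxxxx is the user's ID. When a user home page is requested with GET and no other request data (such as login cookies) is included, the request is redirected to the Kaixin001 login page.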
Create a new console application and add the GET code below. It needs using directives for System, System.IO and System.Net.
static void Main(string[] args)
{
    string getUrl = "http://www.kaixin001.com/home/?uid=66878410";
    //string getUrl = "http://www.jueceji.com/login/AjaxRegValid?type=email&[email protected]";
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(getUrl);
    req.Method = "GET";
    HttpWebResponse res = null;
    Stream st = null;
    StreamReader sr = null;
    string html = string.Empty;
    try
    {
        res = (HttpWebResponse)req.GetResponse();
        st = res.GetResponseStream();
        sr = new StreamReader(st, System.Text.Encoding.UTF8);
        Console.WriteLine(sr.CurrentEncoding);
        html = sr.ReadToEnd();
    }
    catch (IOException ex)
    {
        html = ex.Message;
    }
    catch (Exception ex)
    {
        html = ex.Message;
    }
    finally
    {
        if (res != null) { res.Close(); }
        if (st != null) { st.Close(); }
        if (sr != null) { sr.Close(); }
    }
    Console.WriteLine(html);
    Console.ReadKey();
}
Because we are not logged in, the output is the HTML source of the login page.
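To confirm that the login-page HTML really is the result of a redirect, one option is to compare the URL that was requested with the URL that finally answered. The sketch below assumes that HttpWebRequest follows redirects automatically (its default behavior) and uses the RequestUri and ResponseUri properties. The class name RedirectCheck and both method names are made up for this illustration.

```csharp
using System;
using System.Net;

static class RedirectCheck
{
    // True when the URI that finally answered differs from the one requested.
    public static bool WasRedirected(Uri requested, Uri responded)
    {
        return !requested.Equals(responded);
    }

    // Performs the GET and reports whether the server sent us elsewhere.
    // Needs network access; exceptions are left to the caller.
    public static bool FetchWasRedirected(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "GET";
        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        {
            // RequestUri keeps the original URL; ResponseUri is the one that replied.
            return WasRedirected(req.RequestUri, res.ResponseUri);
        }
    }
}
```

Calling RedirectCheck.FetchWasRedirected with the home page URL from the example would then be expected to return true while not logged in.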
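II. Using the POST method to fetch a page
POST works much like GET, except that Method is set to "POST", the data to post is encoded into a byte[], and ContentLength is set to the length of that array. Before sending, obtain the request stream of the HttpWebRequest, write the POST data into it, and then execute the request. The example additionally needs a using directive for System.Text: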
static void Main(string[] args)
{
    string postUrl = "http://www.XXXXX.com/login/AjaxRegValid";
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(postUrl);
    req.Method = "POST";
    req.ContentType = "application/x-www-form-urlencoded";
    byte[] postData = Encoding.UTF8.GetBytes("type=email&[email protected]");
    req.ContentLength = postData.Length;
    Stream st = null, postSt = null;
    StreamReader sr = null;
    HttpWebResponse res = null;
    string html = string.Empty;
    try
    {
        postSt = req.GetRequestStream(); // get the request stream
        postSt.Write(postData, 0, postData.Length);
        postSt.Close();
        res = (HttpWebResponse)req.GetResponse();
        st = res.GetResponseStream();
        sr = new StreamReader(st, System.Text.Encoding.UTF8);
        html = sr.ReadToEnd();
    }
    catch (IOException ex)
    {
        html = ex.Message;
    }
    catch (Exception ex)
    {
        html = ex.Message;
    }
    finally
    {
        if (postSt != null) { postSt.Close(); }
        if (st != null) { st.Close(); }
        if (sr != null) { sr.Close(); }
        if (res != null) { res.Close(); }
    }
    Console.WriteLine(html);
    Console.ReadKey();
}
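Note: to try the code above, find a URL of your own that accepts POST validation requests.
Also note that the following line is required when posting form data. Without it, the server will not parse the posted fields: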
req.ContentType = "application/x-www-form-urlencoded";
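The POST steps described in this section (encode the fields, set ContentType and ContentLength, write to the request stream, read the response) can be sketched as a small helper. This is only an illustration: the FormPost class and its method names are invented here, and Uri.EscapeDataString is used for encoding, which writes a space as %20 rather than +. Servers generally accept both, but that is an assumption worth checking against the target site.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

static class FormPost
{
    // Builds "k1=v1&k2=v2" with every key and value percent-encoded.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (var field in fields)
        {
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts.ToArray());
    }

    // Sends the body as a form POST and returns the response text.
    // Needs network access; exceptions are left to the caller.
    public static string Post(string url, string body)
    {
        byte[] data = Encoding.UTF8.GetBytes(body);
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        req.Method = "POST";
        req.ContentType = "application/x-www-form-urlencoded";
        req.ContentLength = data.Length;

        // The body must be written before GetResponse is called.
        using (Stream postSt = req.GetRequestStream())
        {
            postSt.Write(data, 0, data.Length);
        }

        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        using (StreamReader sr = new StreamReader(res.GetResponseStream(), Encoding.UTF8))
        {
            return sr.ReadToEnd();
        }
    }
}
```

The using blocks close the streams and the response even when an exception is thrown, which replaces the manual null checks in a finally block.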
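III. Fetching pages that require login
Whether the site authenticates by session or by cookie, the request has to carry the relevant cookie values (a session is typically tracked through a session ID stored in a cookie). To obtain those values, first log in with a browser and then inspect the cookies that the login relies on. The example fetches the Kaixin001 personal home page without an ID, http://www.kaixin001.com/home/.
Without the cookies, the HTML fetched is that of the redirect target, i.e. the login page. The example code, with the cookies attached, is as follows: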
static void Main(string[] args)
{
    string getUrl = "http://www.kaixin001.com/home/";
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(getUrl);
    req.Method = "GET";
    req.ContentType = "text/html";
    req.Timeout = 6000; // milliseconds
    CookieContainer container = new CookieContainer();
    CookieCollection cc = new CookieCollection();
    cc.Add(new Cookie("_ref", "a7e407367420c2d58daf212seebb40974.hao123.Fd866.4ddff06777d201"));
    cc.Add(new Cookie("_cpmuid", "10069406920"));
    cc.Add(new Cookie("SERVERID", "_srv102-143_"));
    cc.Add(new Cookie("_vid", "C4C97DEEB480FE0001D27DD17B01D1B10BA"));
    cc.Add(new Cookie("noBirthList", "1"));
    cc.Add(new Cookie("_uid", "668728415"));
    cc.Add(new Cookie("_email", "ansenwork%40gmail.com"));
    cc.Add(new Cookie("_laid", "0"));
    cc.Add(new Cookie("_kx", "7e48ed27593905scfade21a76e4c27f19f_668784015"));
    cc.Add(new Cookie("onlinenum", "c%3A0"));
    cc.Add(new Cookie("_user", "cbb2eme02bcdfbv0ef2864c5a974fd340f9f_668378415_13061467630"));
    cc.Add(new Cookie("_openname", "%D6%DC%BB%AA"));
    cc.Add(new Cookie("_openlogo", "http%3A%2F%2Fimg.kaixin001.com.cn%2Fi%2F20_0_0.gif"));
    cc.Add(new Cookie("_kxt", "0"));
    cc.Add(new Cookie("presence", "VXfToHl9FfU8Pq6GV1Qf6VzgywAxQz5Vs1ATd8dNg.NjY4Nzg0MTU"));
    // Bind the cookies to the target site and attach them to the request
    Uri root = new Uri(getUrl);
    container.Add(root, cc);
    req.CookieContainer = container;
    HttpWebResponse res = null;
    Stream st = null;
    string html = "";
    StreamReader sr = null;
    try
    {
        res = (HttpWebResponse)req.GetResponse();
        st = res.GetResponseStream();
        sr = new StreamReader(st);
        html = sr.ReadToEnd();
    }
    catch (IOException ex)
    {
        html = ex.Message; // report the error instead of silently swallowing it
    }
    catch (Exception ex)
    {
        html = ex.Message;
    }
    finally
    {
        if (res != null) { res.Close(); }
        if (st != null) { st.Close(); }
        if (sr != null) { sr.Close(); }
    }
    Console.WriteLine(html);
    Console.ReadKey();
}
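Because cookies contain private information, the cookie values above have had some characters altered. If you need to use this approach, log in yourself and look up your own cookie values.
The result of running the program is as follows: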
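Browser developer tools usually show the cookies of a request as one header string of the form name=value; name2=value2. Rather than adding each cookie by hand, that string could be split into a CookieContainer. The following is a minimal sketch under that assumption. CookieHelper and FromHeader are hypothetical names, and the sample values in the usage note are placeholders, not real cookies.

```csharp
using System;
using System.Net;

static class CookieHelper
{
    // Splits a "name=value; name2=value2" string and binds each cookie to the given site.
    public static CookieContainer FromHeader(Uri site, string cookieHeader)
    {
        var container = new CookieContainer();
        string[] pairs = cookieHeader.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string raw in pairs)
        {
            string pair = raw.Trim();
            int eq = pair.IndexOf('=');
            if (eq <= 0) { continue; } // skip fragments without a name

            string name = pair.Substring(0, eq).Trim();
            string value = pair.Substring(eq + 1).Trim();
            if (name.Length == 0) { continue; }

            // Values containing ',' or ';' are rejected by CookieContainer,
            // so they should stay percent-encoded as copied from the browser.
            container.Add(site, new Cookie(name, value));
        }
        return container;
    }
}
```

A request would then use it as req.CookieContainer = CookieHelper.FromHeader(new Uri(getUrl), "_uid=123; _kxt=0"); with the string replaced by the value copied from your own logged-in session.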