Analyzing JD search requests
Search keyword: 女装 (women's clothing)
Page 2 is loaded in two requests. rt=1&stop=1&click=&psort=&page=3
http://search.jd.com/Search?keyword=%E5%A5%B3%E8%A3%85&enc=utf-8#keyword=%E5%A5%B3%E8%A3%85&enc=utf-8&qrst=UNEXPAND&as=1&qk=title_key%2C%2C%E5%A5%B3%E8%A3%85&rt=1&stop=1&sttr=1&cid2=1343&click=2-1343&psort=&page=3
http://search.jd.com/s.php?keyword=%E5%A5%B3%E8%A3%85&enc=utf-8&qrst=UNEXPAND&as=1&qk=title_key%2C%2C%E5%A5%B3%E8%A3%85&rt=1&stop=1&sttr=1&cid2=1343&click=2-1343&psort=&page=4&scrolling=y&start=30&log_id=1422445952.81302&tpl=3_L&vt=2
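Main parameters:
keyword=女装 // search keyword
enc=utf-8
qrst=UNEXPAND
pr=9706%28288%2C303%29%3B9712%28179%2C242%29%3B // some keywords also carry a pr parameter
as=1
qk=title_key%2C%2C女装 // "title_key%2C%2C" followed by the keyword
rt=1
stop=1
sttr=1
cid2=1343 // category-related; our use case does not need it
click=2-1343
psort=
page=4 // page index: 3 is the first half of page 2, 4 is the second half; 30 items each
scrolling=y
start=30 // had no effect in testing; just leave it at 30
log_id=1422445952.81302 // timestamp
tpl=3_L // layout template; replace it with 1_M here to make scraping easier
vt=2 // layout; always use 2
How to simulate and parse the requests:
Step 1: request http://search.jd.com/Search?keyword=<search keyword>&enc=utf-8
From the response body, extract the link in "SEARCH.top_url = 'search?keyword=……'" and the value of "SEARCH.click = ''".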
SEARCH.top_url = 'search?keyword=%E5%A5%B3%E8%A3%85&enc=utf-8&qrst=UNEXPAND&as=1&qk=title_key%2C%2C%E5%A5%B3%E8%A3%85&rt=1&stop=1&sttr=1';
SEARCH.list_category = '';
SEARCH.click = '2-1343';
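The extraction in step 1 can be sketched with two regular expressions. This is only a minimal sketch: it assumes the page body contains the two assignments in exactly the single-quoted form shown above, and SearchPageParser with its method names is a hypothetical helper, not part of the original tool.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: pulls SEARCH.top_url and SEARCH.click out of the page body.
public static class SearchPageParser
{
    // Matches: SEARCH.top_url = '...';
    private static readonly Regex TopUrlRegex =
        new Regex(@"SEARCH\.top_url\s*=\s*'(?<v>[^']*)'");

    // Matches: SEARCH.click = '...';
    private static readonly Regex ClickRegex =
        new Regex(@"SEARCH\.click\s*=\s*'(?<v>[^']*)'");

    // Returns the relative link (for example "search?keyword=..."), or null if absent.
    public static string GetTopUrl(string html)
    {
        Match m = TopUrlRegex.Match(html);
        return m.Success ? m.Groups["v"].Value : null;
    }

    // Returns the click value (for example "2-1343"), or null if absent.
    public static string GetClick(string html)
    {
        Match m = ClickRegex.Match(html);
        return m.Success ? m.Groups["v"].Value : null;
    }
}
```

Both methods return null when the assignment is missing, so the caller can tell a changed page layout apart from an empty value.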
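Step 2: modify the request URL. Remove the duplicated fragment #keyword=123&enc=utf-8 (required),
replace "search?" with "s.php?", and add the parameters cid2, click, page, scrolling, start, log_id, tpl, and vt.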
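The URL rewrite in step 2 might look like the sketch below. ScrollUrlBuilder is a hypothetical name, cid2 is passed in by the caller because the text does not say how to derive it at this point, and the empty psort= is kept only to mirror the captured s.php URL above.

```csharp
using System;
using System.Text;

// Hypothetical helper: turns the relative SEARCH.top_url link into an s.php URL.
public static class ScrollUrlBuilder
{
    public static string Build(string topUrl, string cid2, string click,
                               int page, string logId, string tpl = "1_M")
    {
        // Drop the duplicated "#keyword=...&enc=utf-8" fragment if present.
        int hash = topUrl.IndexOf('#');
        if (hash >= 0) topUrl = topUrl.Substring(0, hash);

        // Keep only the query string; "search?" is replaced by "s.php?".
        int q = topUrl.IndexOf('?');
        string query = q >= 0 ? topUrl.Substring(q + 1) : topUrl;

        var sb = new StringBuilder("http://search.jd.com/s.php?");
        sb.Append(query);
        sb.Append("&cid2=").Append(cid2);
        sb.Append("&click=").Append(click);
        sb.Append("&psort=");                    // empty, as in the captured URL
        sb.Append("&page=").Append(page);
        sb.Append("&scrolling=y&start=30");      // start had no effect in testing
        sb.Append("&log_id=").Append(logId);
        sb.Append("&tpl=").Append(tpl);          // 1_M is the value suggested above
        sb.Append("&vt=2");
        return sb.ToString();
    }
}
```

Called with the sample top_url, cid2 "1343", click "2-1343", page 4, log_id "1422445952.81302" and tpl "3_L", it reproduces the captured s.php URL.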
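Step 3: request the new URL with the header HOST: search.jd.com and read the response body (it contains the product titles, image URLs, reviews, shop_id, and sku).
Extracting <a> tags and similar data from the HTML with regular expressions gets too complicated; I recommend the HtmlAgilityPack library (version 1.4.6). Its usage is not covered here.
Step 4: request http://search.jd.com/ShopName.php with the parameter ids; separate multiple values with "%2C".
Example: http://search.jd.com/ShopName.php?ids=11111%2C2222, where each value is a shop_id.
The response is JSON; deserialize it to get the shop names.
/// <summary>
/// Shop model
/// </summary>
public class ShopName
{
    public string id { get; set; } // shop_id
    public string title { get; set; } // shop name
    public string url { get; set; } // shop URL
    public int venderId { get; set; } // vendor id
}
Note: passing 30 shop_ids does not necessarily return 30 records. Products from the same shop share one shop_id, and duplicate shop_ids are filtered out automatically.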
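Because duplicates are dropped on the server side anyway, the request URL can be built from the distinct shop_ids. A minimal sketch follows; ShopNameUrl is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: builds the ShopName.php URL from the shop_ids of one result page.
public static class ShopNameUrl
{
    public static string Build(IEnumerable<string> shopIds)
    {
        // Skip empty ids and remove duplicates, keeping first-seen order.
        string[] ids = shopIds
            .Where(id => !string.IsNullOrEmpty(id))
            .Distinct()
            .ToArray();

        // Multiple values are separated by "%2C" (a URL-encoded comma).
        return "http://search.jd.com/ShopName.php?ids=" + string.Join("%2C", ids);
    }
}
```

Since the number of returned records can be smaller than the number of products, match each record to its products by id rather than by position.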
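Step 5: request http://p.3.cn/prices/mgets with the parameters skuids, area, type, callback, and _; separate multiple skuids values with ",".
Example: http://p.3.cn/prices/mgets?skuids=J_1229193627,J_1083650580&area=&type=1&callback=&_=1422292699860
The response is JSON; deserialize it to get each product's price and list price.
/// <summary>
/// Price model
/// </summary>
public class Price
{
    public string id { get; set; } // price_id
    public string p { get; set; } // price
    public string m { get; set; } // list price
}
Note: before deserializing, strip the parentheses wrapping the response and remove the BOM. A response of "skuids input error\n" means the query failed.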
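The clean-up described in the note can be sketched as follows. PriceResponse and its members are hypothetical names, and the exact wrapper format (an optional callback name, parentheses, an optional trailing semicolon) is an assumption; the text only says that the surrounding parentheses and the BOM must go.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for the p.3.cn price request and its response.
public static class PriceResponse
{
    private static readonly DateTime Epoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Milliseconds since the Unix epoch, the same shape as the "_" value in the example URL.
    public static long UnixMillis(DateTime utc)
    {
        return (long)(utc - Epoch).TotalMilliseconds;
    }

    // skuids values are joined with a plain comma.
    public static string BuildUrl(IEnumerable<string> skuIds, long unixMillis)
    {
        return "http://p.3.cn/prices/mgets?skuids=" + string.Join(",", skuIds)
             + "&area=&type=1&callback=&_=" + unixMillis;
    }

    // Removes the BOM, surrounding whitespace, and a "(...)" or "name(...);" wrapper.
    public static string Clean(string body)
    {
        string s = body.Trim('\uFEFF', ' ', '\r', '\n', '\t');
        if (s.EndsWith(";", StringComparison.Ordinal))
            s = s.Substring(0, s.Length - 1).TrimEnd();

        int open = s.IndexOf('(');
        int firstJson = s.IndexOfAny(new[] { '[', '{' });
        bool wrapped = open >= 0
            && s.EndsWith(")", StringComparison.Ordinal)
            && (firstJson < 0 || open < firstJson);
        if (wrapped)
            s = s.Substring(open + 1, s.Length - open - 2);
        return s.Trim();
    }

    // True when the service reported "skuids input error".
    public static bool IsError(string body)
    {
        return Clean(body).StartsWith("skuids input error", StringComparison.Ordinal);
    }
}
```

After IsError returns false, the string from Clean can be passed to whichever JSON deserializer the project already uses, with the Price model as the target type.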
=======================================================
Analyzing JD search requests (updated 2015-08-11)
Search keyword: 女装 (women's clothing)
Search for 女装: http://search.jd.com/Search?keyword=%E5%A5%B3%E8%A3%85&enc=utf-8&wq=%E5%A5%B3%E8%A3%85&pvid=aro2g7di.6zerm5
Page 2 link: <a onclick="searchlog(1, 2, 0, 56);" href="search?keyword=%E5%A5%B3%E8%A3%85&enc=utf-8&qrst=1&ps=addr&rt=1&stop=1&sttr=1&cid2=1343&page=2#filter">2</a>
Redirect URL: http://search.jd.com/Search?keyword=%E5%A5%B3%E8%A3%85&enc=utf-8&wq=%E5%A5%B3%E8%A3%85&pvid=aro2g7di.6zerm5#keyword=%E5%A5%B3%E8%A3%85&enc=utf-8&qrst=1&ps=addr&rt=1&stop=1&sttr=1&cid2=1343&click=2-1343&psort=&page=3
Second-half URL: http://search.jd.com/s.php?keyword=%E5%A5%B3%E8%A3%85&enc=utf-8&qrst=1&ps=addr&rt=1&stop=1&sttr=1&cid2=1343&click=2-1343&psort=&page=4&scrolling=y&start=30&log_id=1439304095.81302&tpl=3_L&vt=2
Each page is loaded in two requests: 4 items per row, 60 items per page, 30 items per request. The default first page and the first page you reach by navigating back to it do not always return the same results; some keywords add extra parameters such as ev, which changes the results.
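The relationship between the page number a user sees and the page parameter can be written down as simple arithmetic. The text only confirms the mapping for page 2 (page=3 and page=4); applying the same pattern to every page n (2n-1 and 2n) is an extrapolation, and JdPaging is a hypothetical helper name.

```csharp
using System;

// Hypothetical helper: maps a visible result page to the two "page" values that load it.
public static class JdPaging
{
    // Each request returns 30 items, so a 60-item page takes two requests.
    public const int ItemsPerRequest = 30;

    // First half of visible page n (assumed pattern: 2n - 1).
    public static int FirstHalf(int visiblePage)
    {
        if (visiblePage < 1) throw new ArgumentOutOfRangeException("visiblePage");
        return 2 * visiblePage - 1;
    }

    // Second half of visible page n (assumed pattern: 2n).
    public static int SecondHalf(int visiblePage)
    {
        if (visiblePage < 1) throw new ArgumentOutOfRangeException("visiblePage");
        return 2 * visiblePage;
    }

    // Inverse mapping: which visible page a "page" value belongs to.
    public static int VisiblePage(int pageParam)
    {
        if (pageParam < 1) throw new ArgumentOutOfRangeException("pageParam");
        return (pageParam + 1) / 2;
    }
}
```

For page 2 this gives 3 and 4, matching the captured URLs.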
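Main parameters of Search:
keyword=女装 // search keyword
enc=utf-8
wq=女装
pvid= // added in the 2015 version; in testing it had no effect on the search results
#keyword=女装
enc=utf-8
qrst=1
ps=addr
rt=1
stop=1
sttr=1
cid2=1343 // "All results" => navigation; omitting it changes the final search results. It drives product filtering and is important
click=2-1343
psort=
page=4 // page index: 3 is the first half of page 2, 4 is the second half; 30 items each
ev= // some keywords also produce an ev parameter, which changes the current search results
Main parameters of s.php:
keyword=女装 //* search keyword
enc=utf-8 //*
qrst=1
ps=addr
rt=1
stop=1
sttr=1
cid2=1343 //* taken from the response to the keyword search
click=2-1343 // in testing this parameter did not change the search results
psort=
page=4 //* page index: 3 is the first half of page 2, 4 is the second half; 30 items each
scrolling=y
start=30 // had no effect in testing
log_id=1439234928.52874 // timestamp
tpl=3_L //* default layout; some keywords use 3_M. Always use 3_L here to make scraping easier
vt=2
Note: parameters marked with * are required.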
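How to simulate and parse the requests:
s.php can also return the first half of a page, with the same results. The response is smaller, which makes it better suited to fast batch queries.
Step 1: request http://search.jd.com/Search?keyword=<search keyword>&enc=utf-8&pvid=kx1x86di.u737zi
From the response body, take the hyperlink to page 2: use a regular expression to read its href value and capture the part between "search?" and "&page=2#filter".
Then append the s.php parameters: &psort=&page=4&scrolling=y&start=30&log_id=1439304095.81302&tpl=3_L&vt=2
<a[\s]+onclick[\s]*=[\s]*"[\s]*searchlog\(1, 2, 0, 56\);"[\s]*href[\s]*=[\s]*"search\?(?<url>[^"]+?)&page=2#filter">2</a>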
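Step 1 can be sketched in C# as below. Page2Link is a hypothetical helper. Its pattern accepts any searchlog(...) arguments instead of the literal "1, 2, 0, 56", on the assumption that those numbers can vary between searches.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: reads the page-2 link and builds the s.php URL from it.
public static class Page2Link
{
    private static readonly Regex LinkRegex = new Regex(
        @"<a\s+onclick\s*=\s*""\s*searchlog\([^)]*\);""\s*href\s*=\s*""search\?(?<url>[^""]+?)&page=2#filter"">2</a>",
        RegexOptions.IgnoreCase);

    // Returns the query string between "search?" and "&page=2#filter", or null if not found.
    public static string ExtractQuery(string html)
    {
        Match m = LinkRegex.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }

    // Appends the s.php parameters listed in the text to the extracted query string.
    public static string BuildScrollUrl(string query, int page, string logId)
    {
        return "http://search.jd.com/s.php?" + query
             + "&psort=&page=" + page
             + "&scrolling=y&start=30"
             + "&log_id=" + logId
             + "&tpl=3_L&vt=2";
    }
}
```

The resulting URL omits click, which according to the parameter notes does not change the search results.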
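Step 2: request http://search.jd.com/s.php?<new query string>
Send the header HOST: search.jd.com and read the response body (it contains the product titles, image URLs, reviews, shop_id, and sku).
Extracting <a> tags and similar data from the HTML with regular expressions gets too complicated; I recommend the HtmlAgilityPack library (version 1.4.6). Its usage is not covered here.
Step 3: request http://search.jd.com/ShopName.php with the parameter ids; separate multiple values with "%2C".
Example: http://search.jd.com/ShopName.php?ids=11111%2C2222, where each value is a shop_id.
The response is JSON; deserialize it to get the shop names.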
http://search.jd.com/ShopName.php?ids=56210
[{"id":"56210","title":"MOMO\u5973\u88c5\u4e13\u8425\u5e97","url":"http:\/\/mengbuluo.jd.com","venderId":60335}]
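/// <summary>
/// Shop model
/// </summary>
public class ShopName
{
    public string id { get; set; } // shop_id
    public string title { get; set; } // shop name
    public string url { get; set; } // shop URL
    public int venderId { get; set; } // vendor id
}
Note: passing 30 shop_ids does not necessarily return 30 records. Products from the same shop share one shop_id, and duplicate shop_ids are filtered out automatically.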
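Deserializing the sample response and indexing it by id could look like the sketch below. It assumes a runtime that ships System.Text.Json (.NET Core 3.0 or later); on older frameworks a library such as Json.NET would fill the same role. ShopRecord mirrors the ShopName model above under a different name so the block stands alone, and ShopNameParser is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Mirrors the ShopName model; property names match the JSON keys exactly.
public class ShopRecord
{
    public string id { get; set; }
    public string title { get; set; }
    public string url { get; set; }
    public int venderId { get; set; }
}

// Hypothetical helper: parses the ShopName.php response into a lookup by shop_id.
public static class ShopNameParser
{
    public static Dictionary<string, ShopRecord> ParseById(string json)
    {
        List<ShopRecord> shops = JsonSerializer.Deserialize<List<ShopRecord>>(json);
        var byId = new Dictionary<string, ShopRecord>();
        foreach (ShopRecord shop in shops)
        {
            // The service may return fewer records than ids sent, so index by id.
            byId[shop.id] = shop;
        }
        return byId;
    }
}
```

Looking shops up by id avoids relying on the order or count of the returned records.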
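Note: before deserializing, strip the parentheses wrapping the response and remove the BOM. A response of "skuids input error\n" means the query failed.
That's it: by simulating requests for a given keyword, we can now read every item's title, price, shop name, reviews, and so on from the returned data.
The next step is a small C# tool that implements our business logic: filtering by item keyword, shop name, price range, page range, and so on.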
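The filtering part of such a tool could be sketched as follows. ProductRow and ProductFilter are hypothetical types, not the original tool's code, and the sketch assumes the price has already been parsed into a decimal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type holding the fields collected in the previous steps.
public class ProductRow
{
    public string Title { get; set; }
    public string ShopName { get; set; }
    public decimal Price { get; set; }
}

// Hypothetical filter: an empty keyword or shop name means "do not filter on it".
public static class ProductFilter
{
    public static List<ProductRow> Apply(IEnumerable<ProductRow> rows,
        string titleKeyword, string shopName, decimal minPrice, decimal maxPrice)
    {
        return rows
            // Title must contain the keyword (case-insensitive).
            .Where(r => string.IsNullOrEmpty(titleKeyword)
                || (r.Title ?? "").IndexOf(titleKeyword, StringComparison.OrdinalIgnoreCase) >= 0)
            // Shop name must match exactly (case-insensitive).
            .Where(r => string.IsNullOrEmpty(shopName)
                || string.Equals(r.ShopName, shopName, StringComparison.OrdinalIgnoreCase))
            // Price must fall inside the inclusive range.
            .Where(r => r.Price >= minPrice && r.Price <= maxPrice)
            .ToList();
    }
}
```

The page range is handled separately, by choosing which page values to request in the first place.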
HttpHelper.cs
using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
using System.IO.Compression;
using System.Security.Cryptography.X509Certificates;
using System.Net.Security;
using System.Linq;

namespace SelectedProduct
{
    /// <summary>
    /// Helper class for HTTP requests
    /// </summary>
    public class HttpHelper
    {
        #region Predefined fields and methods

        // Default encoding
        private Encoding encoding = Encoding.Default;
        // HttpWebRequest object used to send the request
        private HttpWebRequest request = null;
        // Response object used to read the response stream
        private HttpWebResponse response = null;

        /// <summary>
        /// Sends the prepared request and returns the page data
        /// </summary>
        /// <param name="objhttpitem">Request parameters</param>
        /// <returns>An HttpResult</returns>
        private HttpResult GetHttpRequestData(HttpItem objhttpitem)
        {
            // Return value
            HttpResult result = new HttpResult();
            try
            {
                #region Get the response
                using (response = (HttpWebResponse)request.GetResponse())
                {
                    result.StatusCode = response.StatusCode;
                    result.StatusDescription = response.StatusDescription;
                    result.Header = response.Headers;
                    if (response.Cookies != null)
                    {
                        result.CookieCollection = response.Cookies;
                    }
                    if (response.Headers["set-cookie"] != null)
                    {
                        result.Cookie = response.Headers["set-cookie"];
                    }

                    MemoryStream _stream;
                    // GZip handling
                    if (response.ContentEncoding != null && response.ContentEncoding.Equals("gzip", StringComparison.InvariantCultureIgnoreCase))
                    {
                        // Read the (decompressed) stream. On .NET 4.0 and later this also works:
                        //new GZipStream(response.GetResponseStream(), CompressionMode.Decompress).CopyTo(_stream, 10240);
                        // Version for .NET below 4.0
                        _stream = GetMemoryStream(new GZipStream(response.GetResponseStream(), CompressionMode.Decompress));
                    }
                    else
                    {
                        // Read the stream. On .NET 4.0 and later this also works:
                        //response.GetResponseStream().CopyTo(_stream, 10240);
                        // Version for .NET below 4.0
                        _stream = GetMemoryStream(response.GetResponseStream());
                    }

                    // Get the raw bytes
                    byte[] RawResponse = _stream.ToArray();
                    _stream.Close();
                    // Also return the raw bytes if requested
                    if (objhttpitem.ResultType == ResultType.Byte) result.ResultByte = RawResponse;

                    // No encoding was specified, so detect it from the page or the response headers
                    if (encoding == null)
                    {
                        Match meta = Regex.Match(Encoding.Default.GetString(RawResponse), "<meta([^<]*)charset=([^<]*)[\"']", RegexOptions.IgnoreCase);
                        string charter = (meta.Groups.Count > 1) ? meta.Groups[2].Value.ToLower() : string.Empty;
                        charter = charter.Replace("\"", "").Replace("'", "").Replace(";", "").Replace("iso-8859-1", "gbk");
                        if (charter.Length > 2)
                        {
                            encoding = Encoding.GetEncoding(charter.Trim());
                        }
                        else
                        {
                            if (string.IsNullOrEmpty(response.CharacterSet)) encoding = Encoding.UTF8;
                            else encoding = Encoding.GetEncoding(response.CharacterSet);
                        }
                    }
                    // Decode the returned HTML
                    result.Html = encoding.GetString(RawResponse);
                }
                #endregion
            }
            catch (WebException ex)
            {
                // Error information returned when the request fails
                response = (HttpWebResponse)ex.Response;
                result.Html = ex.Message;
                if (response != null)
                {
                    result.StatusCode = response.StatusCode;
                    result.StatusDescription = response.StatusDescription;
                }
            }
            catch (Exception ex)
            {
                result.Html = ex.Message;
            }
            if (objhttpitem.IsToLower) result.Html = result.Html.ToLower();
            return result;
        }

        /// <summary>
        /// Reads a stream into memory; used on .NET versions below 4.0
        /// </summary>
        /// <param name="streamResponse">The stream to read</param>
        private static MemoryStream GetMemoryStream(Stream streamResponse)
        {
            MemoryStream _stream = new MemoryStream();
            int Length = 256;
            Byte[] buffer = new Byte[Length];
            int bytesRead = streamResponse.Read(buffer, 0, Length);
            // Write the bytes that were read
            while (bytesRead > 0)
            {
                _stream.Write(buffer, 0, bytesRead);
                bytesRead = streamResponse.Read(buffer, 0, Length);
            }
            return _stream;
        }

        /// <summary>
        /// Prepares the request parameters
        /// </summary>
        /// <param name="objhttpItem">Request parameters</param>
        private void SetRequest(HttpItem objhttpItem)
        {
            // Set up the certificate (this also creates the request object)
            SetCer(objhttpItem);
            // Set the header parameters
            if (objhttpItem.Header != null && objhttpItem.Header.Count > 0)
            {
                foreach (string item in objhttpItem.Header.AllKeys)
                {
                    request.Headers.Add(item, objhttpItem.Header[item]);
                }
            }
            // Set the proxy
            SetProxy(objhttpItem);
            // Request method: GET or POST
            request.Method = objhttpItem.Method;
            request.Timeout = objhttpItem.Timeout;
            request.ReadWriteTimeout = objhttpItem.ReadWriteTimeout;
            // Accept
            request.Accept = objhttpItem.Accept;
            // ContentType
            request.ContentType = objhttpItem.ContentType;
            // UserAgent: client type, including browser version and operating system
            request.UserAgent = objhttpItem.UserAgent;
            // Encoding
            encoding = objhttpItem.Encoding;
            // Set the cookie
            SetCookie(objhttpItem);
            // Referer
            request.Referer = objhttpItem.Referer;
            // Whether to follow redirects
            request.AllowAutoRedirect = objhttpItem.Allowautoredirect;
            // Set the POST data
            SetPostData(objhttpItem);
            // Set the connection limit
            if (objhttpItem.Connectionlimit > 0) request.ServicePoint.ConnectionLimit = objhttpItem.Connectionlimit;
        }

        /// <summary>
        /// Sets the certificate
        /// </summary>
        /// <param name="objhttpItem">Request parameters</param>
        private void SetCer(HttpItem objhttpItem)
        {
            if (!string.IsNullOrEmpty(objhttpItem.CerPath))
            {
                // This must be set before the connection is created. Certificate validation goes through the callback.
                ServicePointManager.ServerCertificateValidationCallback = new System.Net.Security.RemoteCertificateValidationCallback(CheckValidationResult);
                // Create the request object for the URL
                request = (HttpWebRequest)WebRequest.Create(objhttpItem.URL);
                // Add the certificate to the request
                request.ClientCertificates.Add(new X509Certificate(objhttpItem.CerPath));
            }
            else
            {
                // Create the request object for the URL
                request = (HttpWebRequest)WebRequest.Create(objhttpItem.URL);
            }
        }

        /// <summary>
        /// Sets the cookie
        /// </summary>
        /// <param name="objhttpItem">Request parameters</param>
        private void SetCookie(HttpItem objhttpItem)
        {
            // Cookie header string
            if (!string.IsNullOrEmpty(objhttpItem.Cookie))
                request.Headers[HttpRequestHeader.Cookie] = objhttpItem.Cookie;
            // Cookie collection
            if (objhttpItem.CookieCollection != null)
            {
                request.CookieContainer = new CookieContainer();
                request.CookieContainer.Add(objhttpItem.CookieCollection);
            }
        }

        /// <summary>
        /// Sets the POST data
        /// </summary>
        /// <param name="objhttpItem">Request parameters</param>
        private void SetPostData(HttpItem objhttpItem)
        {
            // Only POST requests carry a body
            if (request.Method.Trim().ToLower().Contains("post"))
            {
                byte[] buffer = null;
                // Byte data
                if (objhttpItem.PostDataType == PostDataType.Byte && objhttpItem.PostdataByte != null && objhttpItem.PostdataByte.Length > 0)
                {
                    buffer = objhttpItem.PostdataByte;
                }
                // File contents
                else if (objhttpItem.PostDataType == PostDataType.FilePath && !string.IsNullOrEmpty(objhttpItem.Postdata))
                {
                    using (StreamReader r = new StreamReader(objhttpItem.Postdata, encoding))
                    {
                        buffer = Encoding.Default.GetBytes(r.ReadToEnd());
                    }
                }
                // String data
                else if (!string.IsNullOrEmpty(objhttpItem.Postdata))
                {
                    buffer = Encoding.Default.GetBytes(objhttpItem.Postdata);
                }
                if (buffer != null)
                {
                    request.ContentLength = buffer.Length;
                    // Dispose the request stream once the body has been written
                    using (Stream requestStream = request.GetRequestStream())
                    {
                        requestStream.Write(buffer, 0, buffer.Length);
                    }
                }
            }
        }

        /// <summary>
        /// Sets the proxy
        /// </summary>
        /// <param name="objhttpItem">Request parameters</param>
        private void SetProxy(HttpItem objhttpItem)
        {
            if (!string.IsNullOrEmpty(objhttpItem.ProxyIp))
            {
                // Configure the proxy server
                if (objhttpItem.ProxyIp.Contains(":"))
                {
                    string[] plist = objhttpItem.ProxyIp.Split(':');
                    WebProxy myProxy = new WebProxy(plist[0].Trim(), Convert.ToInt32(plist[1].Trim()));
                    // Proxy credentials
                    myProxy.Credentials = new NetworkCredential(objhttpItem.ProxyUserName, objhttpItem.ProxyPwd);
                    // Assign the proxy to the current request
                    request.Proxy = myProxy;
                }
                else
                {
                    WebProxy myProxy = new WebProxy(objhttpItem.ProxyIp, false);
                    // Proxy credentials
                    myProxy.Credentials = new NetworkCredential(objhttpItem.ProxyUserName, objhttpItem.ProxyPwd);
                    // Assign the proxy to the current request
                    request.Proxy = myProxy;
                }
                // Set the security credentials
                request.Credentials = CredentialCache.DefaultNetworkCredentials;
            }
        }

        /// <summary>
        /// Certificate validation callback
        /// </summary>
        /// <param name="sender">Sender object</param>
        /// <param name="certificate">Certificate</param>
        /// <param name="chain">X509Chain</param>
        /// <param name="errors">SslPolicyErrors</param>
        /// <returns>bool</returns>
        public bool CheckValidationResult(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors errors)
        {
            // Always accept (this disables server certificate validation)
            return true;
        }

        #endregion

        #region General

        /// <summary>
        /// Requests the given URL over HTTP or HTTPS and returns the response data
        /// </summary>
        /// <param name="objhttpItem">Request parameters</param>
        /// <returns>An HttpResult containing the response</returns>
        public HttpResult GetHtml(HttpItem objhttpItem)
        {
            try
            {
                // Prepare the parameters
                SetRequest(objhttpItem);
            }
            catch (Exception ex)
            {
                HttpResult Result = new HttpResult()
                {
                    Cookie = "",
                    Header = null,
                    Html = ex.Message,
                    StatusDescription = "Error while configuring the request parameters"
                };
                return Result;
            }
            // Send the request and read the data
            return GetHttpRequestData(objhttpItem);
        }

        #endregion
    }

    /// <summary>
    /// HTTP request parameters
    /// </summary>
    public class HttpItem
    {
        string _URL = string.Empty;
        /// <summary>Request URL (required)</summary>
        public string URL { get { return _URL; } set { _URL = value; } }

        string _Method = "GET";
        /// <summary>Request method, GET by default; for POST, Postdata must be set</summary>
        public string Method { get { return _Method; } set { _Method = value; } }

        int _Timeout = 100000;
        /// <summary>Default request timeout</summary>
        public int Timeout { get { return _Timeout; } set { _Timeout = value; } }

        int _ReadWriteTimeout = 30000;
        /// <summary>Default timeout for writing POST data</summary>
        public int ReadWriteTimeout { get { return _ReadWriteTimeout; } set { _ReadWriteTimeout = value; } }

        string _Accept = "text/html, application/xhtml+xml, */*";
        /// <summary>Accept header; defaults to text/html, application/xhtml+xml, */*</summary>
        public string Accept { get { return _Accept; } set { _Accept = value; } }

        string _ContentType = "text/html";
        /// <summary>Content type; defaults to text/html</summary>
        public string ContentType { get { return _ContentType; } set { _ContentType = value; } }

        string _UserAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)";
        /// <summary>User agent; defaults to Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)</summary>
        public string UserAgent { get { return _UserAgent; } set { _UserAgent = value; } }

        Encoding _Encoding = null;
        /// <summary>Response encoding; null by default, which enables auto-detection. Usually utf-8, gbk, or gb2312</summary>
        public Encoding Encoding { get { return _Encoding; } set { _Encoding = value; } }

        private PostDataType _PostDataType = PostDataType.String;
        /// <summary>Type of the POST data</summary>
        public PostDataType PostDataType { get { return _PostDataType; } set { _PostDataType = value; } }

        string _Postdata = string.Empty;
        /// <summary>String data to send with a POST request</summary>
        public string Postdata { get { return _Postdata; } set { _Postdata = value; } }

        private byte[] _PostdataByte = null;
        /// <summary>Byte data to send with a POST request</summary>
        public byte[] PostdataByte { get { return _PostdataByte; } set { _PostdataByte = value; } }

        CookieCollection cookiecollection = null;
        /// <summary>Cookie collection</summary>
        public CookieCollection CookieCollection { get { return cookiecollection; } set { cookiecollection = value; } }

        string _Cookie = string.Empty;
        /// <summary>Cookie sent with the request</summary>
        public string Cookie { get { return _Cookie; } set { _Cookie = value; } }

        string _Referer = string.Empty;
        /// <summary>Referer: the previously visited address</summary>
        public string Referer { get { return _Referer; } set { _Referer = value; } }

        string _CerPath = string.Empty;
        /// <summary>Absolute path of the certificate</summary>
        public string CerPath { get { return _CerPath; } set { _CerPath = value; } }

        private Boolean isToLower = false;
        /// <summary>Whether to convert the whole body to lowercase; off by default</summary>
        public Boolean IsToLower { get { return isToLower; } set { isToLower = value; } }

        private Boolean allowautoredirect = false;
        /// <summary>Whether to follow redirects, in which case the result is the page redirected to; off by default</summary>
        public Boolean Allowautoredirect { get { return allowautoredirect; } set { allowautoredirect = value; } }

        private int connectionlimit = 1024;
        /// <summary>Maximum number of connections</summary>
        public int Connectionlimit { get { return connectionlimit; } set { connectionlimit = value; } }

        private string proxyusername = string.Empty;
        /// <summary>Proxy server user name</summary>
        public string ProxyUserName { get { return proxyusername; } set { proxyusername = value; } }

        private string proxypwd = string.Empty;
        /// <summary>Proxy server password</summary>
        public string ProxyPwd { get { return proxypwd; } set { proxypwd = value; } }

        private string proxyip = string.Empty;
        /// <summary>Proxy server IP</summary>
        public string ProxyIp { get { return proxyip; } set { proxyip = value; } }

        private ResultType resulttype = ResultType.String;
        /// <summary>Result type: String or Byte</summary>
        public ResultType ResultType { get { return resulttype; } set { resulttype = value; } }

        private WebHeaderCollection header = new WebHeaderCollection();
        /// <summary>Header collection</summary>
        public WebHeaderCollection Header { get { return header; } set { header = value; } }
    }

    /// <summary>
    /// HTTP response data
    /// </summary>
    public class HttpResult
    {
        string _Cookie = string.Empty;
        /// <summary>Cookie returned by the request</summary>
        public string Cookie { get { return _Cookie; } set { _Cookie = value; } }

        CookieCollection cookiecollection = new CookieCollection();
        /// <summary>Cookie collection</summary>
        public CookieCollection CookieCollection { get { return cookiecollection; } set { cookiecollection = value; } }

        private string html = string.Empty;
        /// <summary>Response body as a string (or the error message if the request failed)</summary>
        public string Html { get { return html; } set { html = value; } }

        private byte[] resultbyte = null;
        /// <summary>Response body as a byte array; only populated when ResultType is Byte</summary>
        public byte[] ResultByte { get { return resultbyte; } set { resultbyte = value; } }

        private WebHeaderCollection header = new WebHeaderCollection();
        /// <summary>Header collection</summary>
        public WebHeaderCollection Header { get { return header; } set { header = value; } }

        private string statusDescription = "";
        /// <summary>Status description</summary>
        public string StatusDescription { get { return statusDescription; } set { statusDescription = value; } }

        private HttpStatusCode statusCode = HttpStatusCode.OK;
        /// <summary>Status code; defaults to OK</summary>
        public HttpStatusCode StatusCode { get { return statusCode; } set { statusCode = value; } }
    }

    /// <summary>
    /// Result type
    /// </summary>
    public enum ResultType
    {
        /// <summary>Return a string only; only Html is populated</summary>
        String,
        /// <summary>Return both a string and bytes; ResultByte and Html are both populated</summary>
        Byte
    }

    /// <summary>
    /// Format of the POST data; string by default
    /// </summary>
    public enum PostDataType
    {
        /// <summary>String; Encoding may be left unset</summary>
        String,
        /// <summary>Bytes; PostdataByte must be set, Encoding may be null</summary>
        Byte,
        /// <summary>File; Postdata must be the absolute path of the file, and Encoding must be set</summary>
        FilePath
    }
}