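I have recently been spending my spare time learning to build a local desktop client with CefSharp. Along the way I found that CefSharp makes it easy to scrape data from websites, so here are some notes on the process.
The Pinduoduo product search link is: Pinduoduo. We add log statements that print the request details whenever CefSharp issues a request. Analyzing the log shows that the keyword data arrives in an HTTP resource request whose response MimeType is application/json. The code and the resulting log line are as follows: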
protected override IResponseFilter GetResourceResponseFilter(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response)
{
    // Log the URL, request id, MIME type, charset and status of every response.
    logger.Debug(" request_url=" + request.Url + ";request_id=" + request.Identifier + ";response_MimeType=" + response.MimeType + ";response_charset=" + response.Charset + ";response_status=" + response.StatusText);
    // No filtering yet: fall through to the default behaviour.
    return base.GetResourceResponseFilter(chromiumWebBrowser, browser, frame, request, response);
}
2022-07-09 09:46:18.6335 DEBUG 20076-12 Chrome.MyChrome.CefHandlers.MyResourceRequestHandler.GetResourceResponseFilter request_url=https://mobile.yangkeduo.com/proxy/api/search_hotquery?pdduid=0&plat=h5&source=index;request_id=759816;response_MimeType=application/json;response_charset=utf-8;response_status=
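The resource downloader I uploaded can also load the corresponding JSON file: "A simple URL resource downloader built with CefSharp and vue3" (C# document resources, CSDN download).
Once we know which HTTP request carries the keywords, we can intercept and analyze that resource request in C#. The key steps are as follows.
Override the GetResourceRequestHandler method of the CefSharp.Handler.RequestHandler class to return a custom resource handler class.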
protected override IResourceRequestHandler GetResourceRequestHandler(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, bool isNavigation, bool isDownload, string requestInitiator, ref bool disableDefaultHandling)
{
    // Log the request details so the interesting URL can be found in the log.
    chrome.Logger.Debug("request_url=" + request.Url + ";request_id=" + request.Identifier + ";TransitionType=" + request.TransitionType + ";ReferrerUrl=" + request.ReferrerUrl + ";Method=" + request.Method + ";IsReadOnly=" + request.IsReadOnly + ";isNavigation=" + isNavigation + ";isDownload=" + isDownload + ";requestInitiator=" + requestInitiator);
    // Return a new handler per request (webRequest is not defined in this snippet).
    return new MyResourceRequestHandler(webRequest);
}
Override the GetResourceResponseFilter method of the CefSharp.Handler.ResourceRequestHandler class and pass in our own Stream.
// Buffer that receives a copy of the response body.
Stream DataStream = new MemoryStream();
protected override IResponseFilter GetResourceResponseFilter(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response)
{
    logger.Debug(" request_url=" + request.Url + ";request_id=" + request.Identifier + ";response_MimeType=" + response.MimeType + ";response_charset=" + response.Charset + ";response_status=" + response.StatusText);
    // StreamResponseFilter copies the response bytes into DataStream as they pass through unchanged.
    return new CefSharp.ResponseFilter.StreamResponseFilter(DataStream);
}
Override the OnResourceLoadComplete method of the CefSharp.Handler.ResourceRequestHandler class to parse the data stored in the Stream.
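protected override void OnResourceLoadComplete(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response, UrlRequestStatus status, long receivedContentLength)
{
    logger.Debug("request_url=" + request.Url + ";request_id=" + request.Identifier + ";request_status=" + status + ";recv_length=" + receivedContentLength);
    // This handler sees every resource; only JSON bodies can be parsed below.
    string mimeType = response.MimeType ?? "";
    if (mimeType.IndexOf("json", System.StringComparison.OrdinalIgnoreCase) < 0)
    {
        return;
    }
    // Copy out everything the response filter wrote into the buffer.
    var ms = (MemoryStream)DataStream;
    var bytes = ms.ToArray();
    // Read the charset from the response rather than from an undeclared variable.
    string charset = response.Charset ?? "";
    string Response2String;
    if (charset.IndexOf("utf-8", System.StringComparison.OrdinalIgnoreCase) >= 0)
    {
        Response2String = System.Text.Encoding.UTF8.GetString(bytes);
    }
    else if (charset.IndexOf("gbk", System.StringComparison.OrdinalIgnoreCase) >= 0)
    {
        Response2String = System.Text.Encoding.GetEncoding("GB2312").GetString(bytes);
    }
    else
    {
        // Unknown or missing charset: fall back to UTF-8 and record it.
        Response2String = System.Text.Encoding.UTF8.GetString(bytes);
        logger.Error("unknown_charset Charset=" + charset);
    }
    JObject jsonObj = JObject.Parse(Response2String);
    // "items" may be absent, so use ?. to avoid a NullReferenceException.
    logger.Debug("parse_json_success json_str=" + jsonObj["items"]?.ToString());
}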
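The charset branching in OnResourceLoadComplete can also be pulled out into a small helper. The following is only a sketch, not part of the original handler: the class name ResponseDecoder and its two methods are hypothetical names introduced here. It asks .NET to resolve whatever charset label the response reports and falls back to UTF-8 when the label is empty or unknown.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of CefSharp or the original post).
public static class ResponseDecoder
{
    // Maps a charset label such as "utf-8" or "gbk" to a .NET Encoding.
    // Returns UTF-8 when the label is null, blank, or not recognised.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            return Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return Encoding.UTF8;
        }
    }

    // Decodes the buffered response bytes using the resolved encoding.
    public static string Decode(byte[] bytes, string charset)
    {
        return ResolveEncoding(charset).GetString(bytes);
    }
}
```

With this in place the handler could call ResponseDecoder.Decode(bytes, response.Charset). One caveat: on .NET Core and .NET 5+, legacy code pages such as GBK are only available after registering CodePagesEncodingProvider, so without that step a "gbk" label would silently fall back to UTF-8 here.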
For JSON handling I chose JObject from Newtonsoft.Json.Linq, which parses the document dynamically.
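As a rough illustration of dynamic parsing with JObject, the sketch below counts the entries under an "items" key without declaring any model class. It assumes the Newtonsoft.Json package is referenced; the class JsonDemo and the method CountItems are hypothetical names, and the actual shape of the Pinduoduo response is not shown in this post, so only the "items" key used above is relied on.

```csharp
using System;
using Newtonsoft.Json.Linq;

// Hypothetical demo class (names are illustrative only).
public static class JsonDemo
{
    // Returns the number of elements in the top-level "items" array,
    // or 0 when "items" is missing or is not an array.
    public static int CountItems(string json)
    {
        // JObject.Parse throws JsonReaderException on malformed JSON.
        JObject root = JObject.Parse(json);
        // The indexer returns null for a missing key; "as" yields null for non-arrays.
        JArray items = root["items"] as JArray;
        return items == null ? 0 : items.Count;
    }
}
```

For example, JsonDemo.CountItems("{\"items\":[1,2]}") returns 2, and an object with no "items" key gives 0 instead of throwing. The payload in this example is made up purely to exercise the code.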