Learning Web Scraping from Scratch (46), Advanced: Getting Network Data with selenium

This article shows how to get network data (what the browser's Network panel records) through selenium.

1 Requirement

[Image 1]

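I want to use selenium to read the network data and pull out the request headers, so that later requests can be sent directly over HTTP with those headers, which is more efficient than driving the browser.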
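As a rough sketch of that idea (not code from this article), the `requestHeadersText` field of a `Network.responseReceived` event can be split into name/value pairs and attached to a plain HTTP request. The class name `HeaderReplay` and its methods are made up for illustration, and the sketch assumes the JDK 11+ `java.net.http` client instead of Apache HttpClient. That client manages some headers itself (for example `Host` and `Connection`) and rejects them, so the sketch simply skips whatever it refuses.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns captured raw request headers into a replayable request.
final class HeaderReplay {

    /** Parses a raw header block such as requestHeadersText into name -> value pairs. */
    static Map<String, String> parse(String rawHeaders) {
        Map<String, String> headers = new LinkedHashMap<>();
        String[] lines = rawHeaders.split("\r\n");
        // lines[0] is the request line ("GET / HTTP/1.1"), so start at index 1.
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim(),
                        lines[i].substring(colon + 1).trim());
            }
        }
        return headers;
    }

    /** Builds a GET request that carries the captured headers the client accepts. */
    static HttpRequest toRequest(String url, Map<String, String> headers) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        headers.forEach((name, value) -> {
            try {
                builder.header(name, value);
            } catch (IllegalArgumentException rejected) {
                // java.net.http sets some headers itself (e.g. Host, Connection); skip them.
            }
        });
        return builder.build();
    }

    public static void main(String[] args) {
        String raw = "GET / HTTP/1.1\r\nHost: www.baidu.com\r\nConnection: keep-alive\r\n"
                + "Accept-Language: zh-CN,zh;q=0.9\r\n";
        HttpRequest request = toRequest("https://www.baidu.com/", parse(raw));
        System.out.println(request.headers().map());
    }
}
```

One thing to watch: if `Accept-Encoding: gzip` is replayed, the server may answer with a compressed body, and `java.net.http` does not decompress it automatically.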

2 Technical difficulties

I went through a lot of material and also read the source code, but found no utility class or interface for the network data.

Approaching it from a different angle, though, I found that dev-tools can be reached through an HTTP interface.
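To make the shape of that HTTP call concrete, here is a small sketch that only builds the request and does not send it. It uses the chromedriver endpoint `/session/{sessionId}/goog/cdp/execute` and the `cmd` / `params` body that the full listing in this article also relies on. The class name `CdpRequest`, the session id `abc` and the request id `42` are placeholders, and the JDK 11+ `java.net.http` API is an assumption of this sketch, not what the article's own code uses.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: describes a DevTools command sent through chromedriver.
final class CdpRequest {

    /** URL of chromedriver's command endpoint for one session. */
    static String endpoint(String port, String sessionId) {
        return String.format("http://localhost:%s/session/%s/goog/cdp/execute", port, sessionId);
    }

    /** JSON body: the DevTools method name plus its parameters (already serialized). */
    static String payload(String cmd, String paramsJson) {
        return "{\"cmd\":\"" + cmd + "\",\"params\":" + paramsJson + "}";
    }

    /** Assembles the POST request without sending it. */
    static HttpRequest build(String port, String sessionId, String cmd, String paramsJson) {
        return HttpRequest.newBuilder(URI.create(endpoint(port, sessionId)))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload(cmd, paramsJson)))
                .build();
    }

    public static void main(String[] args) {
        // "abc" and "42" are placeholder ids; real ones come from the driver and the log.
        HttpRequest request = build("1527", "abc", "Network.getResponseBody", "{\"requestId\":\"42\"}");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(payload("Network.getResponseBody", "{\"requestId\":\"42\"}"));
    }
}
```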

References

https://stackoverflow.com/questions/6509628/how-to-get-http-response-code-using-selenium-webdriver?utm_source=hacpai.com
https://hacpai.com/article/1546004185689#1563761318517
--source code
https://github.com/bayandin/chromedriver/blob/master/server/http_handler.cc
--webdriver operation API
https://w3c.github.io/webdriver/
--webdriver execute API
https://chromedevtools.github.io/devtools-protocol/tot/Network

3 Code

import java.io.File;
import java.io.IOException;
import java.util.Iterator;
import java.util.logging.Level;

import org.apache.http.HttpResponse;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
import org.json.JSONException;
import org.json.JSONObject;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.logging.LogEntries;
import org.openqa.selenium.logging.LogEntry;
import org.openqa.selenium.logging.LogType;
import org.openqa.selenium.logging.LoggingPreferences;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

/**
 * chromedriver
 * dev-tool
 */
public class ChromedriverNetwork {
	public static final String port = "1527";
	public static final String filePath = "C:\\Users\\xxx\\Downloads\\chromedriver2.44\\chromedriver.exe";
	public static void main(String[] args) {
		
		String sessionId ;
		String url = "https://www.baidu.com";
		//System.setProperty("webdriver.chrome.driver",filePath);
		ChromeDriver driver = null;
		try {
			ChromeOptions options = new ChromeOptions();
			DesiredCapabilities cap = new DesiredCapabilities();
			ChromeDriverService service = new ChromeDriverService.Builder().usingPort(Integer.valueOf(port))
					.usingDriverExecutable(new File(filePath))
					.build();
			options.addArguments("disable-infobars");
			cap.setCapability(ChromeOptions.CAPABILITY, options);
			// Configure logging: enable the performance log, which carries the Network events
			LoggingPreferences logPrefs = new LoggingPreferences();
			logPrefs.enable(LogType.PERFORMANCE, Level.ALL);
			/** Image and Flash plugin settings */
			cap.setCapability("profile.managed_default_content_settings.images",1);
			cap.setCapability("profile.content_settings.plugin_whitelist.adobe-flash-player",1);
			cap.setCapability("profile.content_settings.exceptions.plugins.*,*.per_resource.adobe-flash-player",1);
			cap.setCapability(CapabilityType.LOGGING_PREFS, logPrefs);
			driver = new ChromeDriver(service, cap);
			//or driver.get(url)
			driver.navigate().to(url);
			System.out.println("session id :" + driver.getSessionId());
			sessionId = driver.getSessionId().toString();
			LogEntries logs = driver.manage().logs().get(LogType.PERFORMANCE);
			for (Iterator<LogEntry> it = logs.iterator(); it.hasNext();) {
				LogEntry entry = it.next();
				try {
					//System.out.println(entry.toString());
					JSONObject json = new JSONObject(entry.getMessage());
					System.out.println(json.toString());
					JSONObject message = json.getJSONObject("message");
					String method = message.getString("method");
					// Only handle response events
					if (method != null && "Network.responseReceived".equals(method)) {
						JSONObject params = message.getJSONObject("params");
						JSONObject response = params.getJSONObject("response");
						String messageUrl = response.getString("url");
						System.out.println("-----------------------------");
						System.err.println("url:" + url + " params: " + params.toString() + " response: " + response.toString());
						// Print the result of calling chrome's HTTP API through chromedriver
						System.out.println(getResponseBody(params.getString("requestId"),port,sessionId));
						System.out.println("-----------------------------");
					}
				} catch (Exception e) {
					e.printStackTrace();
				}
			} 
		} catch (Exception e) {
			e.printStackTrace();
		}finally{
            if (driver != null)
            {
                driver.quit();
            }
        }
	}
	
	
	// Fetch the response content for a given request ID
    public static String getResponseBody(String requestId,String port,String sessionId) {
        try {
            // port: the port that chromeDriver listens on
            String url = String.format("http://localhost:%s/session/%s/goog/cdp/execute", 
            		port, sessionId);
    
            HttpPost httpPost = new HttpPost(url);
            JSONObject object = new JSONObject();
            JSONObject params = new JSONObject();
            params.put("requestId", requestId);
            //api https://chromedevtools.github.io/devtools-protocol/tot/Network
			object.put("cmd", "Network.getResponseBody");

            object.put("params", params);
    
            httpPost.setEntity(new StringEntity(object.toString()));
    
            RequestConfig requestConfig = RequestConfig
                    .custom()
                    .setSocketTimeout(60 * 1000)
                    .setConnectTimeout(60 * 1000).build();

            // try-with-resources closes the client once the body has been read
            try (CloseableHttpClient httpClient = HttpClientBuilder.create()
                    .setDefaultRequestConfig(requestConfig).build()) {
                HttpResponse response = httpClient.execute(httpPost);
                return EntityUtils.toString(response.getEntity());
            }
        } catch (Exception e) {
           e.printStackTrace();
        }
        return null;
    }
}

Result

[Image 2]

4 Analysis

First, the request activity is recorded through the performance log, and each log entry is parsed as JSON.

[Image 3]

Next, an HTTP request is sent to chromedriver, which asks chrome for the content we need (see the references for the interface).

url:https://www.baidu.com params: {"frameId":"1E338F62626FC6EB5904C1C65A8955D0","requestId":"35C0B1A63F4485F2DA6DB9A6B79BE1C8","response":{"headers":{"Transfer-Encoding":"chunked","Server":"BWS/1.1","X-Ua-Compatible":"IE=Edge,chrome=1","Connection":"Keep-Alive","P3p":"CP=\" OTI DSP COR IVA OUR IND COM \"","Date":"Wed, 28 Aug 2019 07:50:34 GMT","Strict-Transport-Security":"max-age=172800","Cache-Control":"private","Bdpagetype":"1","Content-Encoding":"gzip","Cxy_all":"baidu+5822756d9d974a76a68fa39c039eb113","Set-Cookie":"BAIDUID=5B4A7B7A1C7BF82D739D8EE8EDC7F493:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\nBIDUPSID=5B4A7B7A1C7BF82D739D8EE8EDC7F493; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\nPSTM=1566978634; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\ndelPer=0; path=/; domain=.baidu.com\nBDSVRTM=0; path=/\nBD_HOME=0; path=/\nH_PS_PSSID=1427_21126_29523_29521_29098_29567_29220_26350_29588; path=/; domain=.baidu.com","Vary":"Accept-Encoding","Bdqid":"0xebefc6660002f5ba","Expires":"Wed, 28 Aug 2019 07:49:44 GMT","Content-Type":"text/html"},"securityDetails":{"cipher":"AES_128_GCM","signedCertificateTimestampList":[{"signatureData":"304502202C7B4DC0F985478A2D0AC0793BD6B4B566F8AAFB8258AD2336FE16BCA6839921022100C02FCD9C9920CB7D915FD28BC6131073B5C1540333419FA66AC51493CF692B6B","logDescription":"Google 'Skydiver' log","origin":"Embedded in certificate","logId":"BBD9DFBC1F8A71B593942397AA927B473857950AAB52E81A909664368E1ED185","hashAlgorithm":"SHA-256","signatureAlgorithm":"ECDSA","status":"Verified","timestamp":1.557364924826E12},{"signatureData":"304502200332689E39D0EB5F1961DBA712696F28448102A53CC2A313D57E98265F201AA0022100A78B62B3B0B44432E211FF458D55112C36AB299344C8345CCE7C355731AEAB12","logDescription":"Comodo 'Mammoth' CT log","origin":"Embedded in 
certificate","logId":"6F5376AC31F03119D89900A45115FF77151C11D902C10029068DB2089A37D913","hashAlgorithm":"SHA-256","signatureAlgorithm":"ECDSA","status":"Verified","timestamp":1.557364923983E12}],"protocol":"TLS 1.2","certificateId":0,"certificateTransparencyCompliance":"unknown","sanList":["baidu.com","click.hm.baidu.com","cm.pos.baidu.com","log.hm.baidu.com","update.pan.baidu.com","wn.pos.baidu.com","*.91.com","*.aipage.cn","*.aipage.com","*.apollo.auto","*.baidu.com","*.baidubce.com","*.baiducontent.com","*.baidupcs.com","*.baidustatic.com","*.baifae.com","*.baifubao.com","*.bce.baidu.com","*.bcehost.com","*.bdimg.com","*.bdstatic.com","*.bdtjrcv.com","*.bj.baidubce.com","*.chuanke.com","*.dlnel.com","*.dlnel.org","*.dueros.baidu.com","*.eyun.baidu.com","*.fanyi.baidu.com","*.gz.baidubce.com","*.hao123.baidu.com","*.hao123.com","*.hao222.com","*.im.baidu.com","*.map.baidu.com","*.mbd.baidu.com","*.mipcdn.com","*.news.baidu.com","*.nuomi.com","*.safe.baidu.com","*.smartapps.cn","*.ssl2.duapps.com","*.su.baidu.com","*.trustgo.com","*.xueshu.baidu.com","apollo.auto","baifae.com","baifubao.com","dwz.cn","mct.y.nuomi.com","www.baidu.cn","www.baidu.com.cn"],"validFrom":1557364922,"issuer":"GlobalSign Organization Validation CA - SHA256 - G2","keyExchange":"ECDHE_RSA","keyExchangeGroup":"P-256","subjectName":"baidu.com","validTo":1593063062},"connectionReused":true,"timing":{"pushEnd":0,"workerStart":-1,"proxyEnd":-1,"workerReady":-1,"sslEnd":-1,"pushStart":0,"requestTime":1392123.590895,"sslStart":-1,"dnsStart":-1,"sendEnd":901.956,"connectEnd":-1,"connectStart":-1,"sendStart":901.859,"dnsEnd":-1,"receiveHeadersEnd":921.037,"proxyStart":-1},"encodedDataLength":1081,"remotePort":443,"mimeType":"text/html","headersText":"HTTP/1.1 200 OK\r\nBdpagetype: 1\r\nBdqid: 0xebefc6660002f5ba\r\nCache-Control: private\r\nConnection: Keep-Alive\r\nContent-Encoding: gzip\r\nContent-Type: text/html\r\nCxy_all: baidu+5822756d9d974a76a68fa39c039eb113\r\nDate: Wed, 28 Aug 2019 07:50:34 
GMT\r\nExpires: Wed, 28 Aug 2019 07:49:44 GMT\r\nP3p: CP=\" OTI DSP COR IVA OUR IND COM \"\r\nServer: BWS/1.1\r\nSet-Cookie: BAIDUID=5B4A7B7A1C7BF82D739D8EE8EDC7F493:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\r\nSet-Cookie: BIDUPSID=5B4A7B7A1C7BF82D739D8EE8EDC7F493; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\r\nSet-Cookie: PSTM=1566978634; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\r\nSet-Cookie: delPer=0; path=/; domain=.baidu.com\r\nSet-Cookie: BDSVRTM=0; path=/\r\nSet-Cookie: BD_HOME=0; path=/\r\nSet-Cookie: H_PS_PSSID=1427_21126_29523_29521_29098_29567_29220_26350_29588; path=/; domain=.baidu.com\r\nStrict-Transport-Security: max-age=172800\r\nVary: Accept-Encoding\r\nX-Ua-Compatible: IE=Edge,chrome=1\r\nTransfer-Encoding: chunked\r\n\r\n","securityState":"secure","requestHeadersText":"GET / HTTP/1.1\r\nHost: www.baidu.com\r\nConnection: keep-alive\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, br\r\nAccept-Language: zh-CN,zh;q=0.9\r\n","url":"https://www.baidu.com/","protocol":"http/1.1","fromDiskCache":false,"fromServiceWorker":false,"requestHeaders":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8","Upgrade-Insecure-Requests":"1","Connection":"keep-alive","User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36","Host":"www.baidu.com","Accept-Encoding":"gzip, deflate, 
br","Accept-Language":"zh-CN,zh;q=0.9"},"remoteIPAddress":"112.80.248.75","statusText":"OK","connectionId":19,"status":200},"loaderId":"35C0B1A63F4485F2DA6DB9A6B79BE1C8","type":"Document","timestamp":1392124.512758} response: {"headers":{"Transfer-Encoding":"chunked","Server":"BWS/1.1","X-Ua-Compatible":"IE=Edge,chrome=1","Connection":"Keep-Alive","P3p":"CP=\" OTI DSP COR IVA OUR IND COM \"","Date":"Wed, 28 Aug 2019 07:50:34 GMT","Strict-Transport-Security":"max-age=172800","Cache-Control":"private","Bdpagetype":"1","Content-Encoding":"gzip","Cxy_all":"baidu+5822756d9d974a76a68fa39c039eb113","Set-Cookie":"BAIDUID=5B4A7B7A1C7BF82D739D8EE8EDC7F493:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\nBIDUPSID=5B4A7B7A1C7BF82D739D8EE8EDC7F493; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\nPSTM=1566978634; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\ndelPer=0; path=/; domain=.baidu.com\nBDSVRTM=0; path=/\nBD_HOME=0; path=/\nH_PS_PSSID=1427_21126_29523_29521_29098_29567_29220_26350_29588; path=/; domain=.baidu.com","Vary":"Accept-Encoding","Bdqid":"0xebefc6660002f5ba","Expires":"Wed, 28 Aug 2019 07:49:44 GMT","Content-Type":"text/html"},"securityDetails":{"cipher":"AES_128_GCM","signedCertificateTimestampList":[{"signatureData":"304502202C7B4DC0F985478A2D0AC0793BD6B4B566F8AAFB8258AD2336FE16BCA6839921022100C02FCD9C9920CB7D915FD28BC6131073B5C1540333419FA66AC51493CF692B6B","logDescription":"Google 'Skydiver' log","origin":"Embedded in certificate","logId":"BBD9DFBC1F8A71B593942397AA927B473857950AAB52E81A909664368E1ED185","hashAlgorithm":"SHA-256","signatureAlgorithm":"ECDSA","status":"Verified","timestamp":1.557364924826E12},{"signatureData":"304502200332689E39D0EB5F1961DBA712696F28448102A53CC2A313D57E98265F201AA0022100A78B62B3B0B44432E211FF458D55112C36AB299344C8345CCE7C355731AEAB12","logDescription":"Comodo 'Mammoth' CT log","origin":"Embedded in 
certificate","logId":"6F5376AC31F03119D89900A45115FF77151C11D902C10029068DB2089A37D913","hashAlgorithm":"SHA-256","signatureAlgorithm":"ECDSA","status":"Verified","timestamp":1.557364923983E12}],"protocol":"TLS 1.2","certificateId":0,"certificateTransparencyCompliance":"unknown","sanList":["baidu.com","click.hm.baidu.com","cm.pos.baidu.com","log.hm.baidu.com","update.pan.baidu.com","wn.pos.baidu.com","*.91.com","*.aipage.cn","*.aipage.com","*.apollo.auto","*.baidu.com","*.baidubce.com","*.baiducontent.com","*.baidupcs.c-----------------------------
om","*.baidustatic.com","*.baifae.com","*.baifubao.com","*.bce.baidu.com","*.bcehost.com","*.bdimg.com","*.bdstatic.com","*.bdtjrcv.com","*.bj.baidubce.com","*.chuanke.com","*.dlnel.com","*.dlnel.org","*.dueros.baidu.com","*.eyun.baidu.com","*.fanyi.baidu.com","*.gz.baidubce.com","*.hao123.baidu.com","*.hao123.com","*.hao222.com","*.im.baidu.com","*.map.baidu.com","*.mbd.baidu.com","*.mipcdn.com","*.news.baidu.com","*.nuomi.com","*.safe.baidu.com","*.smartapps.cn","*.ssl2.duapps.com","*.su.baidu.com","*.trustgo.com","*.xueshu.baidu.com","apollo.auto","baifae.com","baifubao.com","dwz.cn","mct.y.nuomi.com","www.baidu.cn","www.baidu.com.cn"],"validFrom":1557364922,"issuer":"GlobalSign Organization Validation CA - SHA256 - G2","keyExchange":"ECDHE_RSA","keyExchangeGroup":"P-256","subjectName":"baidu.com","validTo":1593063062},"connectionReused":true,"timing":{"pushEnd":0,"workerStart":-1,"proxyEnd":-1,"workerReady":-1,"sslEnd":-1,"pushStart":0,"requestTime":1392123.590895,"sslStart":-1,"dnsStart":-1,"sendEnd":901.956,"connectEnd":-1,"connectStart":-1,"sendStart":901.859,"dnsEnd":-1,"receiveHeadersEnd":921.037,"proxyStart":-1},"encodedDataLength":1081,"remotePort":443,"mimeType":"text/html","headersText":"HTTP/1.1 200 OK\r\nBdpagetype: 1\r\nBdqid: 0xebefc6660002f5ba\r\nCache-Control: private\r\nConnection: Keep-Alive\r\nContent-Encoding: gzip\r\nContent-Type: text/html\r\nCxy_all: baidu+5822756d9d974a76a68fa39c039eb113\r\nDate: Wed, 28 Aug 2019 07:50:34 GMT\r\nExpires: Wed, 28 Aug 2019 07:49:44 GMT\r\nP3p: CP=\" OTI DSP COR IVA OUR IND COM \"\r\nServer: BWS/1.1\r\nSet-Cookie: BAIDUID=5B4A7B7A1C7BF82D739D8EE8EDC7F493:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\r\nSet-Cookie: BIDUPSID=5B4A7B7A1C7BF82D739D8EE8EDC7F493; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\r\nSet-Cookie: PSTM=1566978634; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com\r\nSet-Cookie: 
delPer=0; path=/; domain=.baidu.com\r\nSet-Cookie: BDSVRTM=0; path=/\r\nSet-Cookie: BD_HOME=0; path=/\r\nSet-Cookie: H_PS_PSSID=1427_21126_29523_29521_29098_29567_29220_26350_29588; path=/; domain=.baidu.com\r\nStrict-Transport-Security: max-age=172800\r\nVary: Accept-Encoding\r\nX-Ua-Compatible: IE=Edge,chrome=1\r\nTransfer-Encoding: chunked\r\n\r\n","securityState":"secure","requestHeadersText":"GET / HTTP/1.1\r\nHost: www.baidu.com\r\nConnection: keep-alive\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, br\r\nAccept-Language: zh-CN,zh;q=0.9\r\n","url":"https://www.baidu.com/","protocol":"http/1.1","fromDiskCache":false,"fromServiceWorker":false,"requestHeaders":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8","Upgrade-Insecure-Requests":"1","Connection":"keep-alive","User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36","Host":"www.baidu.com","Accept-Encoding":"gzip, deflate, br","Accept-Language":"zh-CN,zh;q=0.9"},"remoteIPAddress":"112.80.248.75","statusText":"OK","connectionId":19,"status":200}

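That gives the result. (P.S. The two requests above have different request IDs, so the content returned for each is different as well.)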
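According to the DevTools Protocol, `Network.getResponseBody` returns two fields: `body` and `base64Encoded`. Binary content such as images comes back base64-encoded, so the string should not be used as-is in that case. The sketch below shows one way to handle both cases once the two fields have been read out of the JSON reply. The class name `ResponseBodyDecoder` is hypothetical, and the assumption that text bodies are UTF-8 is mine.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: turns the "body" field of Network.getResponseBody into raw bytes.
final class ResponseBodyDecoder {

    /**
     * @param body          value of the "body" field
     * @param base64Encoded value of the "base64Encoded" field
     */
    static byte[] decode(String body, boolean base64Encoded) {
        return base64Encoded
                ? Base64.getDecoder().decode(body)        // binary payload
                : body.getBytes(StandardCharsets.UTF_8);  // text payload, assumed UTF-8
    }

    public static void main(String[] args) {
        byte[] text = decode("<html></html>", false);
        byte[] binary = decode("aGVsbG8=", true);
        System.out.println(text.length + " bytes of text, " + binary.length + " bytes of binary");
    }
}
```

With org.json, as used in the listing above, the two fields would typically be read from the object nested under `value` in chromedriver's reply. That nesting is an assumption here, so check it against the actual output before relying on it.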

Using the same API, I can also look at the session's detailed configuration parameters.

[Image 4]

5 Summary


To make full use of chrome's features, learn to fetch what you need by issuing API commands.
