[Crawler] Scraping a news site whose content can only be fetched with a cookie

Task:

Today my boss asked me to crawl a news site: https://www.yidaiyilu.gov.cn/

Pitfall log:

  • The site uses HTTPS; requesting it with a plain-HTTP client fails with the message shown below. (The original post includes a screenshot of the error here.)

Error: the SSLHandshakeException tells the story — when the client connects to the server, it has to complete an SSL/TLS handshake first.

(Pitfall) Workaround: subclass DefaultHttpClient so that it supports SSL:

 

package httpsParse;

import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import org.apache.http.conn.ClientConnectionManager;
import org.apache.http.conn.scheme.Scheme;
import org.apache.http.conn.scheme.SchemeRegistry;
import org.apache.http.conn.ssl.SSLSocketFactory;
import org.apache.http.impl.client.DefaultHttpClient;

// An HttpClient capable of making HTTPS requests
public class SSLClient extends DefaultHttpClient {
    public SSLClient() throws Exception {
        super();
        // Pick the TLS version that matches what the target server supports
        SSLContext ctx = SSLContext.getInstance("TLSv1.2");
        // Trust-all manager: accepts any certificate chain without validation
        X509TrustManager tm = new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain,
                    String authType) throws CertificateException {
            }
            @Override
            public void checkServerTrusted(X509Certificate[] chain,
                    String authType) throws CertificateException {
            }
            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0]; // the contract expects non-null
            }
        };
        ctx.init(null, new TrustManager[]{tm}, null);
        // Also skip hostname verification
        SSLSocketFactory ssf = new SSLSocketFactory(ctx, SSLSocketFactory.ALLOW_ALL_HOSTNAME_VERIFIER);
        // Register the SSL socket factory for the https scheme on port 443
        ClientConnectionManager ccm = this.getConnectionManager();
        SchemeRegistry sr = ccm.getSchemeRegistry();
        sr.register(new Scheme("https", 443, ssf));
    }
}
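
Note that the trust-all X509TrustManager plus ALLOW_ALL_HOSTNAME_VERIFIER disables certificate and hostname validation entirely — fine for a throwaway scrape like this, unsafe for anything that handles real data.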

(Pitfall) Then use this HttpClient to fetch the page source:


    public static void main(String[] args) throws Exception {
        HttpClientUtil httpClientUtil = new HttpClientUtil();
        String url = "https://www.yidaiyilu.gov.cn/zchj.htm";
        String html = httpClientUtil.doGet(url);
        System.out.println(html);
    }
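
The HttpClientUtil referenced here is not shown in the original post. A minimal sketch of what its doGet might look like on top of the SSLClient above, assuming the Apache HttpClient 4.x API (only the class and method names come from the post; the body is a reconstruction):

package httpsParse;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

// Hypothetical reconstruction -- the original post does not show this class
public class HttpClientUtil {
    public String doGet(String url) throws Exception {
        HttpClient client = new SSLClient();        // SSL-enabled client from above
        HttpGet get = new HttpGet(url);
        get.setHeader("User-Agent", "Mozilla/5.0"); // present ourselves as a browser
        HttpResponse response = client.execute(get);
        HttpEntity entity = response.getEntity();
        return entity == null ? null : EntityUtils.toString(entity, "UTF-8");
    }
}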

 

The result: the response body was a chunk of JavaScript code rather than the page HTML.

  • First suspicion: cookies. Copying the cookie out of the browser and attaching it to the request did return the real page, but cookies have a limited lifetime and expire after a while, so this approach is not sustainable.
  • Further analysis showed what the browser actually does: on the first request the server returns JavaScript, the browser executes it to generate a cookie, then retries the request with that cookie in the header. That is why the request above came back as JS. Since the cookie is produced by dynamically executed JS, we need Java to emulate a browser.
  • The final working code, via HtmlUnit (Selenium's HtmlUnitDriver; an alternative sketch against HtmlUnit's own WebClient follows after this code):
package cn.server;

import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class GFDynamicWeb {
    public static HtmlUnitDriver driver = new HtmlUnitDriver();
    public static boolean isGetCookie = false;

    public static String GetContent(String url) {
        if (!isGetCookie) {
            driver.setJavascriptEnabled(true);
            // First load: execute the JS so the driver obtains the cookie
            driver.get(url);
        }
        driver.setJavascriptEnabled(false);
        // Second load: fetch the real page source, cookie now attached
        driver.get(url);
        String pageSource = driver.getPageSource();
        isGetCookie = true;
        return pageSource;
    }

    public static void renewIsGetCookie() {
        isGetCookie = false;
    }

    public static void closeDriver() {
        driver.close();
    }

    public static void main(String[] args) {
        long s = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            String url = "https://www.yidaiyilu.gov.cn/";
            String content = GetContent(url);
            System.out.println(content);
        }
        long e = System.currentTimeMillis();
        System.out.println((e - s) / 1000 + " s");
        renewIsGetCookie();
        closeDriver();
    }
}
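
For comparison, the same two-phase trick written directly against HtmlUnit's WebClient rather than through Selenium. This is a sketch, not from the original post; it assumes an htmlunit 2.x dependency, and the class name WebClientSketch and the 3-second background-JS wait are arbitrary choices:

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;

public class WebClientSketch {
    public static String getContent(String url) throws Exception {
        try (WebClient client = new WebClient()) {
            // The first response is the anti-bot challenge, so don't throw on a bad status or JS errors
            client.getOptions().setThrowExceptionOnFailingStatusCode(false);
            client.getOptions().setThrowExceptionOnScriptError(false);
            // First load: run the challenge JS so the cookie lands in the client's cookie jar
            client.getOptions().setJavaScriptEnabled(true);
            client.getPage(url);
            client.waitForBackgroundJavaScript(3000);
            // Second load: JS off; the generated cookie is sent automatically
            client.getOptions().setJavaScriptEnabled(false);
            Page page = client.getPage(url);
            return page.getWebResponse().getContentAsString();
        }
    }
}

Unlike the static HtmlUnitDriver above, which keeps the cookie alive across calls, this version holds the cookie only for the lifetime of the try block, so repeated calls pay the JS-challenge cost each time.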

Links used along the way:

Online API testing

What the 521 status code is for

How to fix 521 errors
