建立https链接的SLL验证证书失效问题

爬取网页遇到的目标站点证书不合法问题。

使用jsoup爬取解析网页时,出现了如下的异常情况。

javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
        at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1627)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:204)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:198)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:994)
        at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:142)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:533)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:471)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:904)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1132)
        at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:643)

查明是无效的SSL证书问题。由于现在很多网站由http全站升级到https,可能是原站点SSL没有部署好,导致证书无效,也有可能是其证书本身就不被认可。 对于爬取其网页就会出现证书验证出错的问题。
对于使用Jsoup自带接口来下载网页的,最新版本的1.9.2有validateTLSCertificates(boolean false)接口即可。
Jsoup.connect(url).timeout(30000).userAgent(UA).validateTLSCertificates(false).get()
java默认的证书集合里面不存在对于多数自注册的证书,对于不使用第三方库来做http请求的话,我们可以手动
创建 TrustManager 来解决。确定要建立的链接的站点,否则不推荐这种方式
public static InputStream getByDisableCertValidation(String url) {
		TrustManager[] trustAllCerts = new TrustManager[] {new X509TrustManager() {
			public X509Certificate[] getAcceptedIssuers() {
				return new X509Certificate[0];
			}
			public void checkClientTrusted(X509Certificate[] certs, String authType) {
			}
			public void checkServerTrusted(X509Certificate[] certs, String authType) {
			}
		} };

		HostnameVerifier hv = new HostnameVerifier() {
			public boolean verify(String hostname, SSLSession session) {
				return true;
			}
		};

		try {
			SSLContext sc = SSLContext.getInstance("SSL");
			sc.init(null, trustAllCerts, new SecureRandom());
			HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
			HttpsURLConnection.setDefaultHostnameVerifier(hv);

			URL uRL = new URL(url);
			HttpsURLConnection urlConnection = (HttpsURLConnection) uRL.openConnection();
			InputStream is = urlConnection.getInputStream();
			return is;
		} catch (Exception e) {
		}
		return null;
	}


refer:

http://snowolf.iteye.com/blog/391931

http://stackoverflow.com/questions/1828775/how-to-handle-invalid-ssl-certificates-with-apache-httpclient

你可能感兴趣的:(Java,spider)