Scraping HTTPS pages with HttpClient

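I was recently given a task that required scraping part of an HTTPS website, and I used HttpClient (4.5.x) for it. HttpClient started as a subproject of Apache Jakarta Commons and is now maintained under Apache HttpComponents. It is an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol that follows the current HTTP standards and recommendations. First, a look at basic HttpClient usage.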

  • Wrap an HttpClient request in a helper method:
public String getHtml(String url) {
        String html = null;
        for (int i = 1; i <= 3; i++) {
            CloseableHttpClient httpclient = HttpClients.createDefault();// create the HttpClient instance
            HttpGet httpget;
            CloseableHttpResponse response = null;
            httpget = new HttpGet(url);// request the URL with GET
            httpget.addHeader(HttpHeaders.USER_AGENT,
                    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0");
            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(10000).setConnectTimeout(10000).build();// socket (read) and connect timeouts, in milliseconds
            httpget.setConfig(requestConfig);
            try {
                response = httpclient.execute(httpget);// execute the request and get the response
                int resStatu = response.getStatusLine().getStatusCode();// status code
                System.out.println("Status code: " + resStatu);
                if (resStatu == HttpStatus.SC_OK) {// 200 OK
                    // get the response entity
                    HttpEntity entity = response.getEntity();
                    if (entity != null) {
                        html = EntityUtils.toString(entity, "UTF-8");
                        html = html.replace("&nbsp;", " ");// normalize &nbsp; entities to plain spaces
                        break;
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
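            finally {
                // release the response and the client; getConnectionManager() is deprecated in 4.5.x
                try {
                    if (response != null) {
                        response.close();
                    }
                    httpclient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }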
        }
        return html;
    }
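
The retry loop in the method above can be separated from the HTTP details. The following is a minimal sketch using only the JDK; the class name `Retry` and the method `withRetries` are hypothetical names introduced here for illustration, not part of HttpClient.

```java
import java.util.concurrent.Callable;

final class Retry {
    /**
     * Runs the task up to maxAttempts times and returns the first non-null result.
     * Returns null if every attempt throws or yields null, mirroring getHtml above.
     */
    static <T> T withRetries(int maxAttempts, Callable<T> task) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = task.call();
                if (result != null) {
                    return result; // success: stop retrying
                }
            } catch (Exception e) {
                // log and fall through to the next attempt
                System.err.println("Attempt " + attempt + " failed: " + e);
            }
        }
        return null;
    }
}
```

With such a helper, the fetch logic would be written once without the loop and called as `Retry.withRetries(3, () -> fetchOnce(url))`, where `fetchOnce` is likewise a hypothetical single-attempt method.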

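This method works fine for ordinary HTTP pages. For some HTTPS pages, however, it throws the following exception:
unable to find valid certification path to requested target
The exception means the JVM does not trust the site's certificate, so the certificate has to be imported. Now for the actual HTTPS scraping, using a Google mirror site I have been using often lately (https://www.xichuan.pub/scholar?hl=zh-CN&q=hand&btnG=&lr=) as the example.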
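
The message above usually arrives wrapped inside a `javax.net.ssl.SSLHandshakeException`, so it can be useful to tell a certificate problem apart from a timeout or other I/O failure. A minimal sketch, assuming it is enough to look for certificate-related exception types anywhere in the cause chain (the class and method names are hypothetical):

```java
import java.security.cert.CertPathBuilderException;
import java.security.cert.CertificateException;

final class CertFailureCheck {
    /** Returns true if any exception in the cause chain relates to certificate validation. */
    static boolean isCertificateProblem(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof CertificateException
                    || cur instanceof CertPathBuilderException) {
                return true;
            }
        }
        return false;
    }
}
```

Calling this from the `catch` block of the helper method would let the code stop retrying when the cause is a missing certificate, since repeating the request cannot fix that.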
1. Export the site's certificate (Chrome):
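Click the lock icon in the browser's address bar, click View certificate in the pop-up on the right, open the Details tab, and export the certificate as Base64-encoded X.509 (.CER).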
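2. Import the certificate into a keystore:
Use the keytool utility that ships with Java to import the exported .cer certificate into a keystore that HttpClient can use. Open cmd and change to the JDK's bin directory.
Then run: keytool -import -alias Root -file d:/Root.cer -keystore "d:/trust.keystore" -storepass 123456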
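
The same import can also be done from Java with the standard `java.security.KeyStore` API, which may help when keytool is not at hand. This is a sketch under assumptions rather than a replacement for the command above: the class `TrustStoreBuilder` and its method names are hypothetical, and the keystore type is whatever the running JDK uses by default.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

final class TrustStoreBuilder {
    /** Creates an empty, initialized keystore of the JDK's default type. */
    static KeyStore newEmptyStore() throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null); // a null stream starts from an empty store
        return ks;
    }

    /** Reads one X.509 certificate from the stream and stores it under the alias. */
    static void importCertificate(KeyStore ks, String alias, InputStream certStream)
            throws Exception {
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        Certificate cert = cf.generateCertificate(certStream);
        ks.setCertificateEntry(alias, cert);
    }

    /** Writes the keystore to the stream, protected by the given password. */
    static void save(KeyStore ks, OutputStream out, char[] password) throws Exception {
        ks.store(out, password);
    }
}
```

For the files used in this article, the calls would be `importCertificate(ks, "Root", new FileInputStream("d:/Root.cer"))` followed by `save(ks, new FileOutputStream("d:/trust.keystore"), "123456".toCharArray())`.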
3. Access the HTTPS site with an SSL-enabled HttpClient instance.

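import java.io.File;
import java.io.IOException;

import javax.net.ssl.SSLContext;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHeaders;
import org.apache.http.HttpStatus;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.ssl.SSLContexts;
import org.apache.http.util.EntityUtils;

public class SSLHttpClient {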

    public static String gethtml(String url) {
        String html = "";
        CloseableHttpClient httpclient = null;
        CloseableHttpResponse response = null;
        try {
            SSLConnectionSocketFactory sslsf = createSSLConnSocketFactory();
            httpclient = HttpClients.custom()
                .setSSLSocketFactory(sslsf).build();
            HttpGet httpget = new HttpGet(url);
            httpget.addHeader(HttpHeaders.USER_AGENT,
                    "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0");
            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(10000).setConnectTimeout(10000).build();// socket (read) and connect timeouts, in milliseconds
            httpget.setConfig(requestConfig);
            System.out.println("Executing request " + httpget.getRequestLine());
            response = httpclient.execute(httpget);
            HttpEntity entity = response.getEntity();
            System.out.println("----------------------------------------");
            System.out.println(response.getStatusLine());
            int resStatu = response.getStatusLine().getStatusCode();// status code
            if (resStatu == HttpStatus.SC_OK) {// 200 OK; anything else is treated as a failure
                // get the response entity
                if (entity != null) {
                    html = EntityUtils.toString(entity, "UTF-8");
                    html = html.replace("&nbsp;", " ");// normalize &nbsp; entities to plain spaces
                }
            }
            EntityUtils.consume(entity);
        } catch(Exception e){
            e.printStackTrace();
        }finally{
            if(response!=null){
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if(httpclient!=null){
                try {
                    httpclient.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return html;
    }

    // build the SSL socket factory that trusts the certificates in the keystore
    private static SSLConnectionSocketFactory createSSLConnSocketFactory()
            throws Exception {
        SSLContext sslcontext = SSLContexts
                .custom()
                .loadTrustMaterial(
                        new File(
                                "C://Users//cloud//Desktop//证书//trust.keystore"),
                        "123456".toCharArray(), new TrustSelfSignedStrategy())
                .build();
        // allow TLSv1.2 only (TLSv1 is disabled by default in current JDKs)
        SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(
                sslcontext, new String[] { "TLSv1.2" }, null,
                SSLConnectionSocketFactory.getDefaultHostnameVerifier());
        return sslsf;
    }
}
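
For readers who want to see what loading trust material amounts to without HttpClient, here is a rough plain-JSSE sketch of turning a keystore into an `SSLContext`. It is an approximation under assumptions, not HttpClient's actual implementation: the class `TrustStoreContext` and its methods are hypothetical names, the protocol is fixed to TLSv1.2, and the extra behavior of `TrustSelfSignedStrategy` is not reproduced.

```java
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;

final class TrustStoreContext {
    /** Builds trust managers that accept only the certificates in the given store. */
    static TrustManager[] trustManagersFor(KeyStore trustStore) throws Exception {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        return tmf.getTrustManagers();
    }

    /** Creates a TLSv1.2 context backed by the given trust store (no client key material). */
    static SSLContext contextFor(KeyStore trustStore) throws Exception {
        SSLContext ctx = SSLContext.getInstance("TLSv1.2");
        ctx.init(null, trustManagersFor(trustStore), null);
        return ctx;
    }
}
```

A context built this way can be passed to the `SSLConnectionSocketFactory` constructor in place of `sslcontext`, after loading the keystore with `KeyStore.load(new FileInputStream(...), password)`.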

Test method:

public static void main(String[] args){
        String html = SSLHttpClient.gethtml("https://www.xichuan.pub/scholar?hl=zh-CN&q=hand&btnG=&lr=");
        if(html!=null&&!html.equals("")){
            Document doc = Jsoup.parse(html);
            if(doc!=null){
                Elements eles = doc.select("#gs_ccl_results div.gs_r h3.gs_rt a");
                if(eles!=null&&eles.size()!=0){
                    for(int i=0;i<eles.size();i++){
                        System.out.println(i+1+"-"+eles.get(i).text());
                    }
                }
            }
        }
    }


All done.
