Java Crawler Series, Part 1: An HttpClient Request Utility in IP-Proxy Mode

IP-proxy mode is just what it sounds like: requesting the target data from an IP address other than your own machine's. Two big benefits:

  • 1. For a crawler project, it effectively guards against IP-based risk controls (bans and rate limits)
  • 2. As for the rest, no need to spell it out; you know why~

Disclaimer: all of my articles are provided for learning purposes only. No individual or organization may use the techniques in them, directly or indirectly, to run any business that violates national law. I take no responsibility whatsoever for any consequences arising from such use.

See also the Cybersecurity Law of the People's Republic of China. Treat it as your baseline, always keep your professional ethics, and be a good, law-abiding citizen.


Enough talk; straight to the source code.

1. Maven dependency

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>
2. To support the HTTPS protocol, we also need a small utility that bypasses SSL certificate validation

import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;

import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;

import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.ssl.TrustStrategy;

// Hostname verifier that skips hostname verification entirely
private final static HostnameVerifier DO_NOT_VERIFY = new HostnameVerifier() {
   public boolean verify(String hostname, SSLSession session) {
      return true;
   }
};

/**
 * Create an SSL connection socket factory that trusts every certificate.
 *
 * @return the factory, or null if SSL initialization fails
 */
private static SSLConnectionSocketFactory createSSLConnSocketFactory() {
    SSLConnectionSocketFactory sslsf = null;
    try {
        SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null,
                new TrustStrategy() {
                    public boolean isTrusted(X509Certificate[] chain, String authType) {
                        // Trust every certificate chain, whoever issued it
                        return true;
                    }
                }).build();
        // Reuse the no-op hostname verifier declared above
        sslsf = new SSLConnectionSocketFactory(sslContext, DO_NOT_VERIFY);
    } catch (GeneralSecurityException e) {
        e.printStackTrace();
    }
    return sslsf;
}
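Note that the DO_NOT_VERIFY constant above is not actually referenced in the rest of the shown code; it is presumably intended for requests made through the JDK's plain HttpsURLConnection rather than through HttpClient. A minimal JDK-only sketch of wiring it in (class name SslDemo and the helper method are my own, not from the article):

```java
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;

public class SslDemo {
    // Same shape as the article's DO_NOT_VERIFY constant: accept any hostname
    static final HostnameVerifier DO_NOT_VERIFY = (hostname, session) -> true;

    /** Install the no-op verifier as the JVM-wide default for HttpsURLConnection. */
    static void disableHostnameVerification() {
        HttpsURLConnection.setDefaultHostnameVerifier(DO_NOT_VERIFY);
    }

    public static void main(String[] args) {
        disableHostnameVerification();
        System.out.println(
                HttpsURLConnection.getDefaultHostnameVerifier().verify("any.host", null));
        // prints: true
    }
}
```

Because this changes a JVM-wide default, it affects every HttpsURLConnection in the process; keep it confined to crawler code.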

3. To avoid a lot of baffling failures, it is worth catching the likely exceptions individually and choosing whether to rethrow or return them, so they can be handled downstream.

  • ConnectTimeoutException, SocketTimeoutException: the connection or read timed out
  • The rest matter less and can be caught together as a plain Exception
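Both timeout types listed above extend java.io.InterruptedIOException (the JDK's SocketTimeoutException as well as HttpClient's ConnectTimeoutException), so a single instanceof check can route them to retry logic. A small JDK-only sketch; the class and the classify helper are my own names, not from the article:

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

public class CrawlErrors {
    /** Coarse error category so the caller can decide to retry or switch proxies. */
    static String classify(IOException e) {
        // SocketTimeoutException and HttpClient's ConnectTimeoutException
        // both extend InterruptedIOException, so one check covers both
        if (e instanceof InterruptedIOException) {
            return "timeout";
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify(new SocketTimeoutException("read timed out")));
        // prints: timeout
        System.out.println(classify(new IOException("connection reset")));
        // prints: other
    }
}
```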

4. GET request

Set the timeout globally; adjust the value to your own situation.

private final static int CONNECTION_TIME_OUT = 6000;

    /**
     * GET request
     * @param pageUrl request URL
     * @param charset response encoding
     * @param params  request headers (added via addHeader)
     * @param proxyIp proxy address in host:port form
     * @return map holding the response
     */
    public static Map<String, Object> doGet(String pageUrl, String charset, Map<String, String> params, String proxyIp) {
        Map<String, Object> map = new HashMap<String, Object>();
        String result = null;
        if (null == charset) {
            charset = "utf-8";
        }
        // Build a client that bypasses SSL certificate validation
        CloseableHttpClient httpclient = HttpClients.custom().setSSLSocketFactory(createSSLConnSocketFactory()).build();
        try {
            URL url = new URL(pageUrl);
            // Target host and proxy host
            HttpHost target = new HttpHost(url.getHost(), url.getDefaultPort(), url.getProtocol());
            HttpHost proxy = new HttpHost(proxyIp.split(":")[0], Integer.parseInt(proxyIp.split(":")[1]));
            RequestConfig config = RequestConfig.custom().setProxy(proxy).setConnectTimeout(CONNECTION_TIME_OUT)
                    .setConnectionRequestTimeout(CONNECTION_TIME_OUT).setSocketTimeout(CONNECTION_TIME_OUT).build();
            HttpGet httpget = new HttpGet(url.toString());
            httpget.setConfig(config);
            // Null-check instead of swallowing a NullPointerException in an empty catch
            if (params != null) {
                for (Map.Entry<String, String> entry : params.entrySet()) {
                    httpget.addHeader(entry.getKey(), entry.getValue());
                }
            }
            CloseableHttpResponse response = null;
            try {
                response = httpclient.execute(target, httpget);
                if (response != null) {
                    HttpEntity resEntity = response.getEntity();
                    if (resEntity != null) {
                        result = EntityUtils.toString(resEntity, charset);
                        // Hand the status code and body back to the caller
                        map.put("code", response.getStatusLine().getStatusCode());
                        map.put("result", result);
                    }
                }
            } catch (ConnectTimeoutException | SocketTimeoutException e) {
                // Timeouts get their own branch so the caller can retry or switch proxies
                map.put("error", "timeout");
            } catch (Exception e) {
                map.put("error", e.getMessage());
            } finally {
                if (response != null) {
                    response.close();
                }
            }
        } catch (Exception e) {
            map.put("error", e.getMessage());
        } finally {
            try {
                httpclient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return map;
    }
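One fragile spot in doGet is splitting proxyIp on ":" without validation: a malformed string such as "1.2.3.4" (no port) throws an ArrayIndexOutOfBoundsException deep inside the request path. A defensive, JDK-only parsing sketch (the ProxyParser class and parseProxy helper are hypothetical, not part of the article's code):

```java
import java.net.InetSocketAddress;

public class ProxyParser {
    /** Parse a "host:port" proxy string such as "127.0.0.1:8888". */
    static InetSocketAddress parseProxy(String proxyIp) {
        int colon = proxyIp.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected host:port, got: " + proxyIp);
        }
        String host = proxyIp.substring(0, colon);
        int port = Integer.parseInt(proxyIp.substring(colon + 1));
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        // createUnresolved avoids a DNS lookup at parse time
        return InetSocketAddress.createUnresolved(host, port);
    }

    public static void main(String[] args) {
        InetSocketAddress p = parseProxy("127.0.0.1:8888");
        System.out.println(p.getHostName() + " / " + p.getPort());
        // prints: 127.0.0.1 / 8888
    }
}
```

Failing fast here means a bad entry from a proxy-IP pool surfaces as a clear error before any request is attempted.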
