Study Notes: Java Crawler Rejected by the Server with 403 Errors

While using HttpClient to scrape the content of specific pages on a website, my code for fetching the response body from a given URL was as follows.

This is the basic pattern HttpClient recommends for requesting page content. On the very first run, the server rejected it with 403 Forbidden.

public final static String getByString(String url) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

                public String handleResponse(
                        final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        System.out.println(status);
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        System.out.println(status);
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);
            return responseBody;
        } finally {
            httpclient.close();
        }
    }


Since the site could be opened in a browser while the method above could not reach it, I tried adding headers to the HttpGet so that, from the server's point of view, the request would look more like a direct user visit.

public final static String getByString(String url) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            httpget.addHeader("Accept", "text/html");
            httpget.addHeader("Accept-Charset", "utf-8");
            httpget.addHeader("Accept-Encoding", "gzip");
            httpget.addHeader("Accept-Language", "en-US,en");
            httpget.addHeader("User-Agent",
                    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {

                public String handleResponse(
                        final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        System.out.println(status);
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        System.out.println(status);
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);
            return responseBody;
        } finally {
            httpclient.close();
        }
    }
 
  

After this change the requests were no longer rejected. However, after fetching a large number of pages in a short time, the server started returning 403 again, and at that point even the browser showed 403 for the site. When this happens, the only remedies are to reconnect the broadband connection or to change IP address through a VPN or proxy. Since the server had evidently blocked my machine's IP address, I tried lowering the request rate: I added a sleep() call so the crawler waits for a while after every request. And because a ban cannot be fixed from inside the program, whenever the server returns an abnormal response status I now call

System.exit(0);

to terminate the program immediately.
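As an alternative to killing the JVM with System.exit(0), the crawl loop itself can catch the failure and stop cleanly, so that finally blocks still run and resources get closed. The sketch below is not from the original post; it is dependency-free and uses a hypothetical per-page status function in place of the real HttpClient call, just to show the control flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

public class CrawlLoop {
    /** Thrown when the server starts rejecting requests; signals the loop to stop. */
    static class BannedException extends RuntimeException {
        BannedException(int status) {
            super("Unexpected response status: " + status);
        }
    }

    /**
     * Fetches pages 0..n-1 through the supplied (hypothetical) status source,
     * stopping at the first non-2xx status instead of calling System.exit(0).
     * Returns the indices of the pages that were fetched successfully.
     */
    static List<Integer> crawl(int n, IntUnaryOperator statusForPage) {
        List<Integer> fetched = new ArrayList<>();
        try {
            for (int page = 0; page < n; page++) {
                int status = statusForPage.applyAsInt(page);
                if (status < 200 || status >= 300) {
                    throw new BannedException(status); // e.g. a 403 ban
                }
                fetched.add(page);
            }
        } catch (BannedException e) {
            // Log and fall through: the method returns normally,
            // so callers can still close clients and flush output.
            System.out.println("Stopping crawl: " + e.getMessage());
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Simulated server: pages 0-4 succeed, page 5 returns 403.
        List<Integer> ok = crawl(10, page -> page < 5 ? 200 : 403);
        System.out.println("Fetched " + ok.size() + " pages");
    }
}
```

In the real crawler, the ClientProtocolException thrown by handleResponse plays the role of BannedException here: catch it in the loop that calls getByString and break out, rather than exiting mid-request.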

public final static String getByString(String url) throws Exception {
        CloseableHttpClient httpclient = HttpClients.createDefault();

        try {
            HttpGet httpget = new HttpGet(url);
            httpget.addHeader("Accept", "text/html");
            httpget.addHeader("Accept-Charset", "utf-8");
            httpget.addHeader("Accept-Encoding", "gzip");
            httpget.addHeader("Accept-Language", "en-US,en");
            httpget.addHeader("User-Agent",
                    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
            ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
 
                public String handleResponse(
                        final HttpResponse response) throws ClientProtocolException, IOException {
                    int status = response.getStatusLine().getStatusCode();
                    if (status >= 200 && status < 300) {
                        HttpEntity entity = response.getEntity();
                        System.out.println(status);
                        return entity != null ? EntityUtils.toString(entity) : null;
                    } else {
                        System.out.println(status);
                        // Log the time the ban hit; it cannot be lifted from inside the program.
                        Date date = new Date();
                        System.out.println(date);
                        System.exit(0);
                        throw new ClientProtocolException("Unexpected response status: " + status);
                    }
                }
            };
            String responseBody = httpclient.execute(httpget, responseHandler);
            Thread.sleep(200); // pause after each request to lower the request rate
            return responseBody;
        } finally {
            httpclient.close();
        }
    }

The acceptable interval between requests presumably differs from site to site. For the site I was crawling, setting the sleep to 2 seconds let the crawler run continuously for 12 hours without a single 403 rejection from the server.
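The fixed Thread.sleep(...) above pauses for the full interval even when the request itself was already slow. A slightly more precise variant (my own sketch, not part of the original code) enforces a minimum spacing between the starts of consecutive requests, so slow responses are not penalized twice:

```java
public class RateLimiter {
    private final long minIntervalMillis;
    private long lastRequestAt = 0; // System.nanoTime() of the previous acquire; 0 = never

    public RateLimiter(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Blocks until at least minIntervalMillis have passed since the previous call. */
    public synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        if (lastRequestAt != 0) {
            long elapsedMillis = (now - lastRequestAt) / 1_000_000;
            long waitMillis = minIntervalMillis - elapsedMillis;
            if (waitMillis > 0) {
                Thread.sleep(waitMillis);
            }
        }
        lastRequestAt = System.nanoTime();
    }

    public static void main(String[] args) throws InterruptedException {
        RateLimiter limiter = new RateLimiter(100); // 100 ms between request starts
        long start = System.nanoTime();
        for (int i = 0; i < 3; i++) {
            limiter.acquire();
            // a call like getByString(url) would go here
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("3 acquires took ~" + elapsedMillis + " ms");
    }
}
```

Calling limiter.acquire() before each getByString(url) with a 2000 ms interval reproduces the 2-second spacing that worked for this site, while wasting less time when individual requests are slow.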
