A blocking problem when reading HttpURLConnection.getInputStream() with BufferedReader

Background:

I have a list of words and want to check whether each one has an entry on Baidu Baike, which means requesting URLs that contain Chinese characters. (As an aside: because the URL contains Chinese, requesting it directly produces garbled results, so the Chinese part must be URL-encoded first.) Baike also redirects many entries (301, 302, etc.), so redirects have to be handled as well; that is not the focus of this article, so I will skip the details. I wrap the resulting input stream in a BufferedReader, but readLine() is a blocking call: if, for network reasons, no line terminator (or end of stream) ever arrives, readLine() blocks indefinitely.
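To make the encoding step concrete: only the Chinese query value is encoded, not the whole URL. A minimal sketch (the helper name buildSearchUrl is mine, not from the project code; requires Java 10+ for the Charset overload):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    // Encode only the query value; the rest of the URL is already ASCII.
    // buildSearchUrl is a hypothetical helper, not part of the original code.
    public static String buildSearchUrl(String word) {
        return "https://baike.baidu.com/search/word?word="
                + URLEncoder.encode(word, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("词条"));
        // prints https://baike.baidu.com/search/word?word=%E8%AF%8D%E6%9D%A1
    }
}
```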

For example:

		URL url = new URL("https://baike.baidu.com/search/word?word=Lamy");
		HttpURLConnection httpUrlConn = (HttpURLConnection) url.openConnection();
		httpUrlConn.connect();		
		InputStream inputStream = httpUrlConn.getInputStream();
		InputStreamReader inputStreamReader = new InputStreamReader(inputStream, "utf-8");
		BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
		StringBuilder sb = new StringBuilder();
		String str = "";
		while ((str = bufferedReader.readLine()) != null) { // readLine() blocks; if no line terminator (or end of stream) arrives, the program is stuck on this line
			sb.append(str);
		}
		bufferedReader.close();
		inputStreamReader.close();
		inputStream.close();
		httpUrlConn.disconnect();
		String res = sb.toString();
Many posts online suggest dropping the while loop and reading only once, or having the server close the socket to notify the client's BufferedReader. Neither fits my case.

First, I cannot know in advance how much data Baidu Baike will return, so a single readLine() call is not enough. (Most of the advice online assumes you control the server, with client and server having agreed on the data length beforehand.)

Second, as for closing the socket: with an unstable network I cannot tell whether the connection has actually dropped or the transfer is merely slow.

Third, the commonly suggested timeout settings:

httpUrlConn.setConnectTimeout(300); // connection timeout
httpUrlConn.setReadTimeout(100);    // timeout between establishing the connection and data becoming available
did not help me: the stall appears after part of the data has already been received, and the read timeout bounds each individual read call rather than the whole transfer, so it never fires in my case.

void java.net.URLConnection.setReadTimeout(int timeout)

Sets the read timeout to a specified timeout, in milliseconds. A non-zero value specifies the timeout when reading from the input stream once a connection is established to a resource. If the timeout expires before there is data available to read, a java.net.SocketTimeoutException is raised. A timeout of zero is interpreted as an infinite timeout.

Some non-standard implementations of this method ignore the specified timeout. To see the read timeout set, call getReadTimeout().

Parameters: timeout - an int that specifies the timeout value to be used in milliseconds
Throws: IllegalArgumentException - if the timeout parameter is negative
Since: 1.5
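Note the semantics in the quote above: the read timeout restarts on every read call, so it never caps total transfer time. One illustrative workaround (a sketch only, not the fix this article ends up with) is to enforce an overall deadline inside the read loop itself; it caps slow-trickle transfers, though as the comment says it cannot help when readLine() blocks with no data arriving at all:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class DeadlineRead {
    // Enforce an overall time budget across the whole read loop.
    // Limitation: if readLine() itself blocks with no data at all, this
    // loop never gets a chance to check the clock - exactly the failure
    // mode described in this article.
    public static String readWithDeadline(Reader in, long deadlineMillis) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder sb = new StringBuilder();
        long deadline = System.currentTimeMillis() + deadlineMillis;
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line);
            if (System.currentTimeMillis() > deadline) {
                break; // total time budget exhausted; give up on the rest
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readWithDeadline(new StringReader("a\nb\n"), 1000));
        // prints ab
    }
}
```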

The failed approach follows Alper Akture's answer on Stack Overflow: use a second thread with a CountDownLatch to watch the data-reading thread. When latch.await(2000, TimeUnit.MILLISECONDS) returns false due to timeout, assume the reading thread's connection is abnormally blocked (or reading far too slowly because of the network), close the HttpURLConnection's data stream to force the BufferedReader out of its blocked state, and then request the data again. (Note: because I/O and waits on synchronized blocks are not interruptible, calling Thread.interrupt() alone cannot unblock a BufferedReader; the example on page 696, chapter 21 of Thinking in Java, 4th edition, demonstrates this. In my program, interrupt() merely tells a too-slow reader thread that its data should be discarded.)

(No need to read this code closely, since it did not work, alas.)

class Read implements Runnable {
	public String url;
	BlockingQueue<String> queue;
	private CountDownLatch latch;
	String word;
	InputStream inputStream = null;
	InputStreamReader inputStreamReader = null;
	BufferedReader bufferedReader = null;
	HttpURLConnection httpUrlConn = null;
		
	public Read(String s, BlockingQueue<String> queue, CountDownLatch latch, String word) {
		url = s;
		this.queue = queue;
		this.latch = latch;
		this.word = word;
	}

	public void run() {
		while (true) {
			StringBuffer buffer = new StringBuffer();			
			try {
				URL url = new URL(this.url);
				httpUrlConn = (HttpURLConnection) url.openConnection();
				httpUrlConn.setRequestProperty("User-agent",
						"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.215 Safari/535.1");
				// httpUrlConn.setRequestProperty("accept-language", "zh-CN");
				// httpUrlConn.setRequestMethod();
				httpUrlConn.setConnectTimeout(300);
				httpUrlConn.setReadTimeout(100);

				System.out.println("Request URL ... " + URLDecoder.decode(this.url, "utf-8"));

				boolean redirect = false;

				// normally, 3xx is redirect
				int status = httpUrlConn.getResponseCode();
				if (status != HttpURLConnection.HTTP_OK) {
					if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM
							|| status == HttpURLConnection.HTTP_SEE_OTHER)
						redirect = true;
				}

				System.out.println("Response Code ... " + status);

				while (redirect) { // handle redirects
					// get redirect url from "location" header field
					String newUrl = httpUrlConn.getHeaderField("Location");
					// get the cookie if needed, for login
					// String cookies =
					// httpUrlConn.getHeaderField("Set-Cookie");
					// open the new connection again
					httpUrlConn = (HttpURLConnection) new URL(newUrl).openConnection();
					// httpUrlConn.setRequestProperty("Cookie", cookies);
					httpUrlConn.setRequestProperty("User-agent",
							"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.215 Safari/535.1");
					System.out.println("Redirect to URL : " + URLDecoder.decode(newUrl, "utf-8"));
					status = httpUrlConn.getResponseCode();
					if (status != HttpURLConnection.HTTP_OK) {
						if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM
								|| status == HttpURLConnection.HTTP_SEE_OTHER)
							redirect = true;
					} else {
						redirect = false;
					}
					System.out.println("Response Code ... " + status);
				}
				httpUrlConn.connect();
				inputStream = httpUrlConn.getInputStream();
				inputStreamReader = new InputStreamReader(inputStream, "utf-8");
				bufferedReader = new BufferedReader(inputStreamReader);
				int countline = 0;
				String str = null;
				while ((str = bufferedReader.readLine()) != null) {
					countline++;
					buffer.append(str);
					if (countline > 100) // business-specific: I only parse part of the page, so the first 100 lines are enough
						break;
				}
				String res = buffer.toString();
				if(!Thread.interrupted()){
					queue.add(res);
					System.out.println("queue has add : "+word);
				}				
				latch.countDown();
				break;
			} catch (SocketTimeoutException e) {
				System.err.println("Socket time out!");
				continue;
			} catch (IOException e) {
				e.printStackTrace();
			} finally {
				try {
					if (bufferedReader != null)
						bufferedReader.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
				try {
					if (inputStreamReader != null)
						inputStreamReader.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
				try {
					if (inputStream != null)
						inputStream.close();
				} catch (IOException e) {
					e.printStackTrace();
				}
				if (httpUrlConn != null)
					httpUrlConn.disconnect();
			}
		}
	}
}

The supervising code, run in another thread; queue is the blocking queue defined earlier, used for downstream, business-specific data processing:

			boolean success = false;
			int temp0 = 0;
			while (!success) {
				System.out.println("**************in new location************");
				CountDownLatch latch = new CountDownLatch(1);
				Read r = new Read(str, queue, latch,word);
				Thread t =new Thread(r);
				t.setName(word+"-"+temp0++);
				t.start();
				System.out.println("Thread:"+t.getName()+" start");
				try {
					success = latch.await(2000, TimeUnit.MILLISECONDS);
					if(!success){
						t.interrupt();
						r.httpUrlConn.getInputStream().close();				
						System.out.println(t.getName()+" call interrupt");
						continue;
					}
				} catch (InterruptedException e) {
					System.err.println("latch.await: InterruptedException");
				} catch (IOException e) {
					System.err.println("httpUrlConn.getInputStream().close() Exception");
				}
			}

Why it failed: the reading thread holds the lock on the input stream, so when the monitoring thread tries to close that stream it blocks as well, and both threads end up stuck. You can observe this with the JConsole tool in the bin folder under the JDK installation path (it is easy to use: on Windows, double-click it and connect to the process you want to inspect).
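For reference, the same watchdog idea can be restated with an ExecutorService and Future.get(timeout) instead of a raw thread plus CountDownLatch. This is a sketch of the pattern only: cancel(true) merely interrupts the worker, which (as shown above) does not unblock a BufferedReader stuck in I/O, so the real program would additionally have to tear down the connection, e.g. via HttpURLConnection.disconnect(), which closes the underlying socket; whether that reliably unblocks the read is something I have not verified here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class WatchdogSketch {
    // Run a read task under an overall deadline; return null on timeout so
    // the caller can retry. cancel(true) only interrupts the worker thread,
    // which does not unblock I/O; the caller must also tear down the
    // connection (e.g. disconnect()) before retrying.
    public static String readWithWatchdog(Callable<String> readTask, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(readTask);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true);
                return null; // signal the caller to clean up and retry
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readWithWatchdog(() -> "page-body", 500));
        // prints page-body
    }
}
```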


The approach that worked: drop URLConnection and use Apache HttpClient instead.

Many thanks to struggleee_luo's blog (linked below): what I am doing is similar, and so were the problems encountered. The code below includes redirect handling, which you can delete if you don't need it.

The code:

	public static String staticDownloadByHttpClient(String urlstr, String encoding, String param) {
		String bufferStr = null;

		// Build an HTTP client that can follow redirects
		HttpClientBuilder builder = HttpClients.custom().disableAutomaticRetries() // disable automatic request retries
				.setRedirectStrategy(new LaxRedirectStrategy()); // LaxRedirectStrategy also follows redirects for POST

		CloseableHttpClient httpclient = builder.build();

		// A default client would be:
		// CloseableHttpClient httpclient = HttpClients.createDefault();

		// Create the HttpPost
		HttpPost httppost = new HttpPost(urlstr);

		// Set the socket timeout; this is the crucial part!
		RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(6000).setConnectTimeout(6000).build(); // request and transfer timeouts
		httppost.setConfig(requestConfig);

		// Build the form parameter list; split only on the first '=' in case the value contains one
		List<NameValuePair> formparams = new ArrayList<>();
		String name = param.split("=", 2)[0];
		String value = param.split("=", 2)[1];
		formparams.add(new BasicNameValuePair(name, value));
		UrlEncodedFormEntity uefEntity;
		try {
			uefEntity = new UrlEncodedFormEntity(formparams, "UTF-8");
			httppost.setEntity(uefEntity);

			CloseableHttpResponse response = httpclient.execute(httppost);

			if (response == null) {
				httpclient.close();
				return bufferStr;
			}

			try {
				HttpEntity entity = response.getEntity();
				if (entity != null) {
					// without the socket timeout set above, this call could block indefinitely
					bufferStr = EntityUtils.toString(entity, encoding);
				}
				try {
					EntityUtils.consume(entity);
				} catch (final IOException ignore) {
				}
			} finally {
				response.close();
			}
		} catch (ClientProtocolException e) {
			e.printStackTrace();
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// close the connection and release resources
			try {
				httpclient.close();
			} catch (IOException e) {
				e.printStackTrace();
			}
		}
		return bufferStr;
	}
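As an aside (an assumption on my part, not something tested in this project): on Java 11+ the JDK's built-in java.net.http.HttpClient can play a similar role without the Apache dependency, including a whole-request deadline via HttpRequest.Builder.timeout(). A minimal sketch, reusing the article's URL and the 6-second value from the RequestConfig above:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class JdkHttpRequestSketch {
    // Build a GET request with a deadline. Sending it (not shown here) would
    // use HttpClient.newBuilder().followRedirects(...).build().send(...),
    // and the exchange fails with HttpTimeoutException when the deadline passes.
    public static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(6)) // deadline for the request
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("https://baike.baidu.com/search/word?word=Lamy");
        System.out.println(req.uri());
    }
}
```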

Because Baidu's pages are rendered dynamically, parsing the raw HTML directly loses some information; as a next step I plan to use HtmlUnit to parse the rendered page instead.

Suggestions for better approaches are welcome.

Thanks again to:

struggleee_luo

http://blog.csdn.net/u010695420/article/details/53898526

