The task:
I have a list of words and want to check whether each one has a corresponding entry on Baidu Baike, which means requesting URLs that contain Chinese characters. (As an aside: because the URL contains Chinese, requesting it directly produces garbled results, so the Chinese part has to be URL-encoded first.) Baike also redirects heavily between entries (301, 302, etc.), so that has to be handled too (it is not the focus of this post, so I will skip it). I wrap the resulting input stream in a BufferedReader, but readLine() is a blocking call: if, due to network problems, it never receives a line terminator, it can block indefinitely.
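As a minimal sketch of the encoding step (the word 词条 and the helper name here are just illustrative), the Chinese portion of the URL can be percent-encoded with java.net.URLEncoder while the ASCII part is left alone:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeDemo {
    // Percent-encode only the query value; the ASCII part of the URL stays as-is.
    static String baikeSearchUrl(String word) throws UnsupportedEncodingException {
        return "https://baike.baidu.com/search/word?word="
                + URLEncoder.encode(word, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // "词条" (UTF-8 bytes E8 AF 8D E6 9D A1) becomes %E8%AF%8D%E6%9D%A1
        System.out.println(baikeSearchUrl("词条"));
        // ASCII words like "Lamy" pass through unchanged
        System.out.println(baikeSearchUrl("Lamy"));
    }
}
```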
For example:
URL url = new URL("https://baike.baidu.com/search/word?word=Lamy");
HttpURLConnection httpUrlConn = (HttpURLConnection) url.openConnection();
httpUrlConn.connect();
InputStream inputStream = httpUrlConn.getInputStream();
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, "utf-8");
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
StringBuilder sb = new StringBuilder();
String str = "";
while ((str = bufferedReader.readLine()) != null) {// readLine() blocks; if no line terminator (or end of stream) ever arrives, execution hangs on this line
sb.append(str);
}
bufferedReader.close();
inputStreamReader.close();
inputStream.close();
httpUrlConn.disconnect();
String res = sb.toString();
Much of the advice online says to drop the while loop and read only once, or to have the server close the socket to notify the client's BufferedReader. Neither fits my situation.
First, I cannot know in advance how much data Baidu Baike will return, so a single readLine() call cannot do the job. (Most of the advice online assumes you control the server, so client and server have agreed on the data length beforehand.)
Second, as for closing the socket: with an unstable network I cannot tell whether the connection has actually dropped or the transfer is just slow.
Third, the commonly suggested fix of setting timeout options, e.g.:
httpUrlConn.setConnectTimeout(300);// connection timeout
httpUrlConn.setReadTimeout(100);// timeout while waiting for data once the connection is established
is useless in my case: the block occurs after part of the data has already arrived, and because bytes keep trickling in, each individual read completes within the timeout, so the read timeout never fires.
void java.net.URLConnection.setReadTimeout(int timeout)
Sets the read timeout to a specified timeout, in milliseconds. A non-zero value specifies the timeout when reading from the input stream once a connection is established to a resource. If the timeout expires before there is data available for read, a java.net.SocketTimeoutException is raised. A timeout of zero is interpreted as an infinite timeout.
Some non-standard implementations of this method ignore the specified timeout. To see the read timeout set, please call getReadTimeout().
Parameters: timeout - an int that specifies the timeout value to be used in milliseconds
Throws: IllegalArgumentException - if the timeout parameter is negative
Since: 1.5
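To illustrate why the read timeout alone did not save me: the timeout applies to each individual read, not to the whole transfer. The following local sketch (plain sockets with setSoTimeout standing in for HttpURLConnection, all names mine) drips one byte every 50 ms to a client with a 200 ms per-read timeout; no SocketTimeoutException is ever raised even though the whole transfer takes far longer than 200 ms:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class SlowReadDemo {
    // Drips `bytes` bytes from a local server, one every `dripMs` ms, and reads
    // them back with a per-read timeout of `soTimeoutMs` ms. Returns bytes read.
    static int dripRead(int bytes, int dripMs, int soTimeoutMs) throws Exception {
        final ServerSocket server = new ServerSocket(0); // ephemeral local port
        Thread sender = new Thread(() -> {
            try (Socket s = server.accept(); OutputStream out = s.getOutputStream()) {
                for (int i = 0; i < bytes; i++) {
                    out.write('x');
                    out.flush();
                    Thread.sleep(dripMs); // stall between bytes, but never for long
                }
            } catch (Exception ignored) {
            }
        });
        sender.start();
        try (Socket client = new Socket("localhost", server.getLocalPort())) {
            client.setSoTimeout(soTimeoutMs); // analogous to URLConnection.setReadTimeout
            InputStream in = client.getInputStream();
            int total = 0;
            while (in.read() != -1) total++; // each single read finishes within the timeout
            return total;
        } finally {
            server.close();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        int total = dripRead(10, 50, 200);
        long elapsed = System.currentTimeMillis() - start;
        // the transfer takes roughly 500 ms, far beyond the 200 ms read timeout,
        // yet no SocketTimeoutException occurs: the timer restarts on every read
        System.out.println("read " + total + " bytes in " + elapsed + " ms without a timeout");
    }
}
```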
The failed approach: following Alper Akture's answer on Stack Overflow, I used a second thread with a CountDownLatch to watch the thread that reads the data. When latch.await(2000, TimeUnit.MILLISECONDS) returns false on timeout, the watcher assumes the reading thread's connection has blocked abnormally (or that the network is so bad that reading is extremely slow), closes the HttpURLConnection's input stream to force the BufferedReader out of its blocked state, and then re-requests the data. (Note: blocking I/O and waits on a synchronized block are not interruptible, so calling Thread.interrupt() alone cannot unblock the BufferedReader; the example on page 696, chapter 21 of Thinking in Java, 4th edition, demonstrates this. In my program the interrupt merely tells the too-slow thread that its data is no longer wanted.) After that, the data is requested again.
(No need to read the code closely; it did not work, alas.)
class Read implements Runnable {
public String url;
BlockingQueue<String> queue;
private CountDownLatch latch;
String word;
InputStream inputStream = null;
InputStreamReader inputStreamReader = null;
BufferedReader bufferedReader = null;
HttpURLConnection httpUrlConn = null;
public Read(String s, BlockingQueue<String> queue, CountDownLatch latch, String word) {
url = s;
this.queue = queue;
this.latch = latch;
this.word = word;
}
public void run() {
while (true) {
StringBuffer buffer = new StringBuffer();
try {
URL url = new URL(this.url);
httpUrlConn = (HttpURLConnection) url.openConnection();
httpUrlConn.setRequestProperty("User-agent",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.215 Safari/535.1");
// httpUrlConn.setRequestProperty("accept-language", "zh-CN");
// httpUrlConn.setRequestMethod();
httpUrlConn.setConnectTimeout(300);
httpUrlConn.setReadTimeout(100);
System.out.println("Request URL ... " + URLDecoder.decode(this.url, "utf-8"));
boolean redirect = false;
// normally, 3xx is redirect
int status = httpUrlConn.getResponseCode();
if (status != HttpURLConnection.HTTP_OK) {
if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM
|| status == HttpURLConnection.HTTP_SEE_OTHER)
redirect = true;
}
System.out.println("Response Code ... " + status);
while (redirect) {// handle redirects
// get redirect url from "location" header field
String newUrl = httpUrlConn.getHeaderField("Location");
// get the cookie if needed, for login
// String cookies =
// httpUrlConn.getHeaderField("Set-Cookie");
// open the new connection again
httpUrlConn = (HttpURLConnection) new URL(newUrl).openConnection();
// httpUrlConn.setRequestProperty("Cookie", cookies);
httpUrlConn.setRequestProperty("User-agent",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.215 Safari/535.1");
System.out.println("Redirect to URL : " + URLDecoder.decode(newUrl, "utf-8"));
status = httpUrlConn.getResponseCode();
// only keep looping on another 3xx; any other status (200, 4xx, ...) exits
redirect = (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM
|| status == HttpURLConnection.HTTP_SEE_OTHER);
System.out.println("Response Code ... " + status);
}
httpUrlConn.connect();
inputStream = httpUrlConn.getInputStream();
inputStreamReader = new InputStreamReader(inputStream, "utf-8");
bufferedReader = new BufferedReader(inputStreamReader);
int countline = 0;
String str = null;
while ((str = bufferedReader.readLine()) != null) {
countline++;
buffer.append(str);
if (countline > 100)// business requirement: I only parse part of the page, so the first 100 lines are enough
break;
}
String res = buffer.toString();
if(!Thread.interrupted()){
queue.add(res);
System.out.println("queue has add : "+word);
}
latch.countDown();
break;
} catch (SocketTimeoutException e) {
System.err.println("Socket time out!");
continue;
} catch (IOException e) {
e.printStackTrace();
} finally{
try {
if(bufferedReader!=null)
bufferedReader.close();
} catch (IOException e) {
e.printStackTrace();
}
try {
if(inputStreamReader!=null)
inputStreamReader.close();
} catch (IOException e) {
e.printStackTrace();
}
try {
if(inputStream!=null)
inputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
httpUrlConn.disconnect();
}
}
}
}
The supervising part of the program, which runs in a separate thread. The queue is the blocking queue defined earlier for downstream data processing; it is business-specific:
boolean success = false;
int temp0 = 0;
while (!success) {
System.out.println("**************in new location************");
CountDownLatch latch = new CountDownLatch(1);
Read r = new Read(str, queue, latch,word);
Thread t =new Thread(r);
t.setName(word+"-"+temp0++);
t.start();
System.out.println("Thread:"+t.getName()+" start");
try {
success = latch.await(2000, TimeUnit.MILLISECONDS);
if(!success){
t.interrupt();
r.httpUrlConn.getInputStream().close();
System.out.println(t.getName()+" call interrupt");
continue;
}
} catch (InterruptedException e) {
System.err.println("latch.await:InterruptedException");
} catch (IOException e) {
System.err.println("httpUrlConn.getInputStream().close() Exception");
}
}
Why it failed: the reading thread holds the lock on the input stream, so when the monitoring thread tries to close that stream it blocks as well, and both threads end up stuck. You can observe this with the JConsole tool in the bin folder of the JDK installation (usage is simple: on Windows, double-click it, connect to the process you want to inspect, and look at its threads).
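For reference, here is a minimal local sketch (plain sockets standing in for HttpURLConnection; all names are mine) of both halves of the problem: Thread.interrupt() does not unblock a readLine() stuck on socket I/O, but closing the underlying Socket itself, which is effectively what HttpURLConnection.disconnect() does, does break it out, because Socket.close() does not go through the BufferedReader's lock:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

public class UnblockDemo {
    // Returns {aliveAfterInterrupt, aliveAfterSocketClose} for a thread blocked in readLine().
    static boolean[] demo() throws Exception {
        ServerSocket server = new ServerSocket(0);
        Socket client = new Socket("localhost", server.getLocalPort());
        Socket accepted = server.accept(); // server side: accept but never send anything

        BufferedReader reader = new BufferedReader(
                new InputStreamReader(client.getInputStream(), "utf-8"));
        Thread blocked = new Thread(() -> {
            try {
                reader.readLine(); // blocks forever: no data, no line terminator
            } catch (Exception e) {
                // SocketException once the socket is closed underneath it
            }
        });
        blocked.start();
        Thread.sleep(200);                 // give the reader time to block

        blocked.interrupt();               // no effect on blocking socket I/O
        Thread.sleep(200);
        boolean aliveAfterInterrupt = blocked.isAlive();

        client.close();                    // closing the socket itself fails the pending read
        blocked.join(2000);
        boolean aliveAfterClose = blocked.isAlive();

        accepted.close();
        server.close();
        return new boolean[] { aliveAfterInterrupt, aliveAfterClose };
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = demo();
        System.out.println("still blocked after interrupt(): " + r[0]);
        System.out.println("still blocked after socket close: " + r[1]);
    }
}
```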
The approach that worked: use HttpClient instead of URLConnection.
Many thanks to struggleee_luo's blog; that author's task, and the problems hit along the way, were very similar to mine. The code below includes redirect handling, which you can remove if you do not need it.
Code:
public static String staticDownloadByHttpClient(String urlstr, String encoding, String param) {
String bufferStr = null;
// build an HTTP client that follows redirects
HttpClientBuilder builder = HttpClients.custom().disableAutomaticRetries() // disable automatic request retries
.setRedirectStrategy(new LaxRedirectStrategy());// LaxRedirectStrategy also follows redirects for POST requests
CloseableHttpClient httpclient = builder.build();
// to create a default httpClient instance instead:
// CloseableHttpClient httpclient = HttpClients.createDefault();
// create the HttpPost
HttpPost httppost = new HttpPost(urlstr);
// set the socket timeout (this is the crucial part!)
RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(6000).setConnectTimeout(6000).build();// connect timeout and socket (read) timeout
httppost.setConfig(requestConfig);
// build the form parameter list
List<NameValuePair> formparams = new ArrayList<NameValuePair>();
String name = param.split("=")[0];
String value = param.split("=")[1];
formparams.add(new BasicNameValuePair(name, value));
UrlEncodedFormEntity uefEntity;
try {
uefEntity = new UrlEncodedFormEntity(formparams, "UTF-8");
httppost.setEntity(uefEntity);
CloseableHttpResponse response = httpclient.execute(httppost);
if (response == null) {
httpclient.close();
return bufferStr;
}
try {
HttpEntity entity = response.getEntity();
if (entity != null) {
// without a socket (read) timeout, this call can block indefinitely
bufferStr = EntityUtils.toString(entity, encoding);
}
try {
EntityUtils.consume(entity);
} catch (final IOException ignore) {
}
} finally {
response.close();
}
} catch (ClientProtocolException e) {
e.printStackTrace();
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
// close the connection and release resources
try {
httpclient.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return bufferStr;
}
Because Baidu's pages are rendered dynamically, directly parsing the raw HTML loses some information; as a next step I plan to use HtmlUnit to parse the rendered page.
Better approaches and corrections are welcome.
Thanks again.