Nutch Study Notes 10---A Bug That Led Me to Study the HTTP Protocol

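I was modifying the protocol-http code to add a connection pool, and no matter what I tried I kept getting exceptions.

After stepping through it in the debugger I found what looked like a bug. Below is the email I sent to the Nutch mailing list to check whether it really is one.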

Hello, I am using Nutch 1.7

I find that sometimes the field "Content-Length" does not exist in the HTTP Header of Response.
But the code  [protocol-http  HttpResponse.java  readPlainContent(...)]  can not handle this .

for example, if I set the
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
  <description> </description>
</property>

and socket timeout time is 10 seconds

so Nutch will try to read 1048576 bytes when no "Content-Length".
then after 10 seconds , the timeout exception will occur.

is it a bug?
I find it in 1.7 and 1.8 .

An example of the HTTP response headers that trigger the problem:

HTTP/1.1 200 OK
Server: Tengine/1.4.6
Date: Wed, 09 Jul 2014 06:54:22 GMT
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: must-revalidate, no-cache, private
Pragma: no-cache
Expires: Sun, 1 Jan 2000 01:00:00 GMT
Content-Encoding: gzip

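So I loaded the same URL in a browser and captured the traffic. The captured response had no Content-Length either, yet the browser handled it correctly.

This shows that HTTP does allow a response without a Content-Length field in certain cases, so there must be another way to tell where the body ends!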

~~~~~~~~~~~~~~~~~~~

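After searching online, this is what the material I found says:

1. In HTTP 1.0 and earlier, the Content-Length field is optional.

2. In HTTP 1.1 and later: on a keep-alive connection, a response must use exactly one of Content-Length or chunked transfer encoding. On a non-keep-alive connection it behaves like HTTP 1.0, and Content-Length is optional.

For details, see: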

http://zh.wikipedia.org/wiki/分块传输编码
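The rules above can be sketched as a small decision function. This is only an illustration of the idea, not Nutch code: the names `BodyFraming`, `Framing`, and `choose` are made up here, and the sketch ignores responses that never carry a body (for example replies to HEAD, or 204/304 status codes).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Nutch): decides how a client should
// find the end of a response body, based on the response headers.
class BodyFraming {

    enum Framing { CHUNKED, CONTENT_LENGTH, UNTIL_CLOSE }

    static Framing choose(Map<String, String> headers) {
        String te = headers.get("Transfer-Encoding");
        // "chunked" must be the last transfer coding when it is present.
        if (te != null && te.trim().toLowerCase(Locale.ROOT).endsWith("chunked")) {
            return Framing.CHUNKED;
        }
        if (headers.get("Content-Length") != null) {
            return Framing.CONTENT_LENGTH;
        }
        // Neither header: the body ends when the server closes the connection.
        return Framing.UNTIL_CLOSE;
    }

    public static void main(String[] args) {
        // Case-insensitive map, because HTTP header names are case-insensitive.
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.put("Transfer-Encoding", "chunked");
        headers.put("Connection", "keep-alive");
        System.out.println(choose(headers)); // prints CHUNKED
    }
}
```

With the headers from the example response above, the function picks `CHUNKED`. A reader that instead waits for a fixed number of bytes on a keep-alive connection never sees the end of the body, which matches the timeout I observed.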

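Now let's modify HttpResponse.java.

I spent a long time writing my own code before noticing that Nutch already provides a readChunkedContent function. Why not use it?

So I changed the source myself to call it.

The main change is: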

// Pick the body reader based on the Transfer-Encoding response header.
String transferEncodingString = headers.get("Transfer-Encoding");
if (transferEncodingString != null
    && "chunked".equalsIgnoreCase(transferEncodingString.trim())) {
  // Chunked body: there is no Content-Length, so decode chunk by chunk.
  readChunkedContent(in, new StringBuffer());
  LOG.info("Finished reading chunked content");
} else {
  // Otherwise keep the original behavior.
  readPlainContent(in);
}

I recompiled the source, and the test passed!

:)
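For readers wondering what a chunked reader actually has to do, here is a minimal standalone sketch. It is not Nutch's readChunkedContent; `ChunkedDecoder`, `readLine`, and `decode` are hypothetical names, and the sketch leaves out things a real fetcher needs, such as a content-size limit and careful validation of malformed input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone decoder for a chunked HTTP body (illustration only).
class ChunkedDecoder {

    /** Reads one line up to LF and returns it without the CR/LF terminator. */
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') return sb.toString();
            if (c != '\r') sb.append((char) c);
        }
        throw new EOFException("stream ended inside a line");
    }

    /** Decodes a chunked body; "in" must be positioned just after the headers. */
    static byte[] decode(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            // Each chunk starts with its size in hex, optionally followed by ";extensions".
            String sizeLine = readLine(in);
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int size = Integer.parseInt(sizeLine.trim(), 16);

            if (size == 0) {
                // Last chunk: skip optional trailer lines up to the empty line.
                while (!readLine(in).isEmpty()) { /* ignore trailers */ }
                return out.toByteArray();
            }

            // Read exactly "size" bytes of chunk data.
            byte[] buf = new byte[size];
            int off = 0;
            while (off < size) {
                int n = in.read(buf, off, size - off);
                if (n == -1) throw new EOFException("stream ended inside a chunk");
                off += n;
            }
            out.write(buf, 0, size);
            readLine(in); // consume the CRLF that follows the chunk data
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] body = decode(new ByteArrayInputStream(wire));
        System.out.println(new String(body, StandardCharsets.US_ASCII)); // prints Wikipedia
    }
}
```

Because the terminating zero-size chunk marks the end of the body, the reader returns right away instead of waiting for the socket timeout. Note that the example response above also has Content-Encoding: gzip, so the de-chunked bytes are still gzip-compressed and have to be decompressed as a separate step.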
