http://code.google.com/p/pyv8/
http://wwwsearch.sourceforge.net/mechanize/
http://code.google.com/p/multi-mechanize/
http://xiudaima.appspot.com/code/detail/14001
"""
urllib2 is a very handy HTTP client library, but using it to write a crawler can run into
a couple of problems: first, when making consecutive requests to the same site, cookies
have to be added to the HTTP request headers by hand; second, the Referer page has to be
managed manually. The code below handles both Cookie and Referer automatically:
each request carries the cookie and referer from the previous one.
Parts of it are based on the ClientCookie library.
"""
import urllib2
from cookielib import CookieJar


class HTTPRefererProcessor(urllib2.BaseHandler):
    def __init__(self):
        self.referer = None

    def http_request(self, request):
        if ((self.referer is not None) and
                not request.has_header("Referer")):
            request.add_unredirected_header("Referer", self.referer)
        return request

    def http_response(self, request, response):
        # Remember this response's URL; it becomes the Referer of the next request.
        self.referer = response.geturl()
        return response

    https_request = http_request
    https_response = http_response


def main():
    cj = CookieJar()
    opener = urllib2.build_opener(
        urllib2.HTTPCookieProcessor(cj),
        HTTPRefererProcessor(),
    )
    urllib2.install_opener(opener)

    # url1 and url2 are assumed to be defined elsewhere.
    urllib2.urlopen(url1)  # open the first URL
    urllib2.urlopen(url2)  # open the second URL


if "__main__" == __name__:
    main()
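The Referer logic can be exercised in isolation, without touching the network. The sketch below re-states the same handler using the Python 3 module names (urllib.request instead of urllib2); DummyResponse is a hypothetical stand-in that only supplies the geturl() method the handler actually uses.

```python
from urllib.request import BaseHandler, Request


class HTTPRefererProcessor(BaseHandler):
    """Same logic as above, written against the Python 3 names."""

    def __init__(self):
        self.referer = None

    def http_request(self, request):
        if self.referer is not None and not request.has_header("Referer"):
            request.add_unredirected_header("Referer", self.referer)
        return request

    def http_response(self, request, response):
        self.referer = response.geturl()
        return response

    https_request = http_request
    https_response = http_response


class DummyResponse:
    """Hypothetical minimal response: only geturl() is needed here."""

    def geturl(self):
        return "http://example.com/page1"


proc = HTTPRefererProcessor()
proc.http_response(None, DummyResponse())  # first response is recorded
req = proc.http_request(Request("http://example.com/page2"))
print(req.get_header("Referer"))  # the previous URL is now the Referer
```

Note that add_unredirected_header is used so the Referer is not carried across redirects, matching what a browser does.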
From the cookielib documentation for CookieJar.extract_cookies(response, request):
Extract cookies from HTTP response and store them in the CookieJar, where allowed by policy.
The CookieJar will look for allowable Set-Cookie and Set-Cookie2 headers in the response argument, and store cookies as appropriate (subject to the CookiePolicy.set_ok() method’s approval).
The response object (usually the result of a call to urllib2.urlopen(), or similar) should support an info() method, which returns an object with a getallmatchingheaders() method (usually a mimetools.Message instance).
The request object (usually a urllib2.Request instance) must support the methods get_full_url(), get_host(), unverifiable(), and get_origin_req_host(), as documented by urllib2. The request is used to set default values for cookie-attributes as well as for checking that the cookie is allowed to be set.
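The interface described above can be driven directly, which is handy for testing cookie handling without a live server. A minimal sketch, using the Python 3 names (http.cookiejar for cookielib, urllib.request for urllib2); FakeResponse is a hypothetical stand-in that supplies the info() method extract_cookies requires:

```python
from http.cookiejar import CookieJar
from urllib.request import Request
from email.message import Message


class FakeResponse:
    """Hypothetical response object: info() returns a headers object
    with get_all(), which is what extract_cookies needs."""

    def __init__(self, url, headers):
        self._url = url
        self._msg = Message()
        for name, value in headers:
            self._msg[name] = value

    def info(self):
        return self._msg

    def geturl(self):
        return self._url


cj = CookieJar()
req = Request("http://example.com/")
resp = FakeResponse("http://example.com/",
                    [("Set-Cookie", "sid=abc123; Path=/")])

# Store the Set-Cookie header in the jar, subject to the default policy.
cj.extract_cookies(resp, req)
print([c.name for c in cj])  # the stored cookie names
```

Because the default CookiePolicy approves this host-only cookie, it ends up in the jar; a policy rejecting the domain or an unverifiable third-party request would silently drop it instead.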
http://fly5.com.cn/p/p-like/python_https.html
http://www.cnblogs.com/xiaoxia/archive/2010/08/04/1792461.html?login=1