urllib's urlopen function is easy to use. Going through a proxy looks like this:
    import urllib
    opener = urllib.urlopen('http://www.baidu.com/', proxies={'http': 'http://root:[email protected]:808'})
    print opener.code
    import urllib

    def reporthook(bk, bs, total):
        # bk: blocks transferred so far, bs: block size in bytes, total: total size
        print bk * bs, 'b'

    filename, message = urllib.urlretrieve("http://www.baidu.com/", None, reporthook)
    print message.getheader('Content-Length')

Result:
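In the urlretrieve source:

    fp = self.open(url, data)
    headers = fp.info()
    if "content-length" in headers:
        size = int(headers["Content-Length"])

This puzzled me. What type is headers, exactly? It looks a lot like a dictionary, yet it handles upper and lower case automatically.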
    class Message:
        def __contains__(self, name):
            """Determine whether a message contains the named header."""
            return name.lower() in self.dict

        def __getitem__(self, name):
            """Get a specific header, as from a dictionary."""
            return self.dict[name.lower()]
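The trick in the Message class above is simply to lowercase the header name on every lookup. A minimal standalone sketch of the same idea follows; the class name CaseInsensitiveHeaders and the sample header are made up for illustration and are not part of urllib.

```python
class CaseInsensitiveHeaders:
    """Hypothetical stand-in for the header object: keys are stored lowercased."""

    def __init__(self, pairs):
        # Normalize every header name once, at construction time.
        self._dict = {name.lower(): value for name, value in pairs}

    def __contains__(self, name):
        # Lowercase the queried name so "Content-Length" and "content-length" match.
        return name.lower() in self._dict

    def __getitem__(self, name):
        return self._dict[name.lower()]


headers = CaseInsensitiveHeaders([("Content-Length", "1024")])
print("content-length" in headers)
print(int(headers["CONTENT-LENGTH"]))
```

Both lookups succeed regardless of the case used, which is why the urlretrieve code can mix "content-length" and "Content-Length" freely.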
    GET / HTTP/1.1
    Host: 127.0.0.1

    HTTP/1.1 401 Authorization Required
    WWW-Authenticate: Basic realm="my site"
    ----------------------------------------------------------
    GET / HTTP/1.1
    Host: 127.0.0.1
    Authorization: Basic cm9vdDpyb290

    HTTP/1.1 200 OK

We can set this up with a simple Apache configuration: edit httpd.conf and add AllowOverride authconfig to the Directory that requires authentication:
    <Directory "c:/wamp/www/">
        AllowOverride authconfig
        Order Deny,Allow
        Deny from all
        Allow from 127.0.0.1
    </Directory>
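Then put the authentication directives in a .htaccess file in that directory:

    authname "my site"
    authtype basic
    authuserfile c:/wamp/www/user.txt
    require valid-user

Here authname is the prompt text; it shows up as the realm in the WWW-Authenticate header (I had long wondered what realm actually was). authuserfile is best given as an absolute path.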
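The Authorization value in the exchange shown earlier is nothing more than the base64 form of "username:password". As a small sketch in Python 3 (the helper name basic_auth_header is hypothetical), the header value can be rebuilt like this:

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an Authorization header for HTTP Basic auth."""
    # Credentials are joined with a colon, then base64-encoded as bytes.
    credentials = "{}:{}".format(user, password).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


print(basic_auth_header("root", "root"))  # Basic cm9vdDpyb290
```

With user root and password root this reproduces the cm9vdDpyb290 token seen in the captured request.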
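    user_passwd, realhost = splituser(realhost)  # realhost looks like root:[email protected]
    if user_passwd:
        import base64
        auth = base64.b64encode(user_passwd).strip()
    else:
        auth = None
    h = httplib.HTTP(host)
    if auth:
        h.putheader('Authorization', 'Basic %s' % auth)

When the URL carries no username:password but the site does require authentication, the server returns a 401 status code and http_error_401 is called. If the response has a "www-authenticate" header, the code checks whether it asks for Basic authentication. If so, it calls retry_http_basic_auth, which looks in a cache for the username:password of that site. If nothing is cached, it calls prompt_user_passwd to ask the user for input (in a GUI environment this function has to be overridden) and caches the username:password, as follows: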
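    GET / HTTP/1.1
    Host: www.baidu.com

    HTTP/1.0 407 Unauthorized
    Server: Proxy
    Proxy-Authenticate: Basic realm="CCProxy Authorization"

    GET / HTTP/1.1
    Host: www.baidu.com
    Proxy-Authorization: Basic cm9vdDpyb290Mg==

    HTTP/1.1 200 OK

On the first request to Baidu, the configured proxy requires authentication, so it answers with a 407 status code; the Basic realm value is the title shown in the authentication dialog. If the proxy requires "username:password" authentication, it works just like HTTP basic authentication: every request carries the base64-encoded "username:password" in the Proxy-Authorization header. Note that base64 is an encoding, not encryption.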
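Because base64 is only an encoding, the credentials in a Proxy-Authorization (or Authorization) header can be read back by anyone who sees the request. A short Python 3 sketch, with a hypothetical helper name decode_basic_credentials, applied to the value from the exchange above:

```python
import base64


def decode_basic_credentials(header_value):
    """Split a 'Basic <token>' header value back into (user, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    # Reverse the base64 step, then split on the first colon only,
    # since the password itself may contain colons.
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password


print(decode_basic_credentials("Basic cm9vdDpyb290Mg=="))  # ('root', 'root2')
```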
    def __init__(self, proxies=None, **x509):
        if proxies is None:
            proxies = getproxies()
        assert hasattr(proxies, 'has_key'), "proxies must be a mapping"
        self.proxies = proxies

As we can see, if no proxies are passed in, getproxies() is called to pick up the proxies from the local environment variables.
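    def getproxies_environment():
        proxies = {}
        for name, value in os.environ.items():
            name = name.lower()
            if value and name[-6:] == '_proxy':
                proxies[name[:-6]] = value
        return proxies

These variables (any name ending in _proxy, such as http_proxy) are usually set in the local environment, for example in .bashrc on Linux.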
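To see what that scan produces, here is a sketch of the same logic written for Python 3 and fed a plain dictionary instead of os.environ. The function name proxies_from_environ and the sample values are assumptions for illustration only.

```python
def proxies_from_environ(environ):
    """Collect <scheme>_proxy variables into a {scheme: proxy_url} mapping."""
    suffix = "_proxy"
    proxies = {}
    for name, value in environ.items():
        name = name.lower()  # HTTP_PROXY and http_proxy are treated alike
        if value and name.endswith(suffix):
            proxies[name[:-len(suffix)]] = value
    return proxies


# Hypothetical environment: one usable proxy, one empty value, one unrelated variable.
sample_env = {
    "HTTP_PROXY": "http://127.0.0.1:808",
    "ftp_proxy": "",
    "HOME": "/home/user",
}
print(proxies_from_environ(sample_env))  # {'http': 'http://127.0.0.1:808'}
```

Empty values are skipped, so an exported but blank variable does not register as a proxy.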
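1. quote(s, safe = '/'): URL-encodes s, turning reserved characters into %XX escapes (a space becomes %20, for example). It is mainly used for the part of the URL up to and including the path. urllib applies it to the full URL itself when opening it:
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]|")
Addendum: this automatic encoding only happens from Python 2.6 onward; 2.5 and earlier versions do not do it.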
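2. unquote(s): the inverse of quote.
3. quote_plus(s, safe = ''): encodes query strings. Unlike quote, it represents a space as "+". It is mainly used by urlencode.
4. unquote_plus(s): the inverse of quote_plus.
5. urlencode(query, doseq=0): encodes a dictionary or a sequence of two-element tuples into a query string. doseq allows the second element to be a sequence as well. Keys and values are always encoded with quote_plus.
Note: quote, quote_plus and urlencode deserve a word of explanation.
quote is meant for encoding URLs, while quote_plus is meant for encoding POST data. Since urlencode is implemented with quote_plus internally, urlencode is generally used for POST data rather than for GET URLs.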
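The difference between the two encoders is easiest to see side by side. The sketch below uses Python 3, where these functions live in urllib.parse instead of urllib; the sample string is arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a b/c"

path_encoded = quote(text)       # space becomes %20, "/" is kept (safe='/')
form_encoded = quote_plus(text)  # space becomes "+", "/" is escaped (safe='')

print(path_encoded)   # a%20b/c
print(form_encoded)   # a+b%2Fc

# Each decoder reverses its own encoder.
print(unquote(path_encoded))       # a b/c
print(unquote_plus(form_encoded))  # a b/c

# unquote leaves "+" alone; only unquote_plus turns it back into a space.
print(unquote("a+b"))       # a+b
print(unquote_plus("a+b"))  # a b
```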
From the official documentation:
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
urllib.unquote_plus(string)
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
urllib.urlencode(query[, doseq])
Convert a mapping object or a sequence of two-element tuples to a “percent-encoded” string, suitable to pass to urlopen() above as the optional data argument. This is useful to pass a dictionary of form fields to a POST request. The resulting string is a series of key=value pairs separated by '&' characters, where both key and value are quoted using quote_plus() above. When a sequence of two-element tuples is used as the query argument, the first element of each tuple is a key and the second is a value. The value element in itself can be a sequence and in that case, if the optional parameter doseq evaluates to True, individual key=value pairs separated by '&' are generated for each element of the value sequence for the key. The order of parameters in the encoded string will match the order of parameter tuples in the sequence. The urlparse module provides the functions parse_qs() and parse_qsl() which are used to parse query strings into Python data structures.
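The effect of doseq is easier to see with a value that is itself a list. This sketch uses Python 3, where urlencode and parse_qs are both in urllib.parse (in Python 2, parse_qs is in urlparse, as the excerpt says); the field names are invented for the example.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields: "tag" carries two values.
fields = [("q", "a b"), ("tag", ["x", "y"])]

# With doseq=True each element of the list becomes its own key=value pair.
expanded = urlencode(fields, doseq=True)
print(expanded)  # q=a+b&tag=x&tag=y

# Without doseq the whole list is converted to a string and quoted as one value.
flattened = urlencode(fields)
print(flattened)

# parse_qs reverses the doseq form, collecting repeated keys into lists.
print(parse_qs(expanded))  # {'q': ['a b'], 'tag': ['x', 'y']}
```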
6. url2pathname: converts a file-scheme URL into a local path.
    _typeprog = None
    def splittype(url):
        """splittype('type:opaquestring') --> 'type', 'opaquestring'."""
        global _typeprog
        if _typeprog is None:
            import re
            _typeprog = re.compile('^([^/:]+):')

        match = _typeprog.match(url)
        if match:
            scheme = match.group(1)
            return scheme.lower(), url[len(scheme) + 1:]
        return None, url
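To make the behaviour of splittype concrete, here is a standalone Python 3 restatement of the same regex logic with a few sample inputs. The name split_scheme is hypothetical, chosen so it is not mistaken for the library function.

```python
import re

# Same pattern as above: one or more characters that are neither "/" nor ":",
# followed by a colon, anchored at the start of the string.
_SCHEME_RE = re.compile(r"^([^/:]+):")


def split_scheme(url):
    """Return (scheme, rest); scheme is lowercased, or None if absent."""
    match = _SCHEME_RE.match(url)
    if match:
        scheme = match.group(1)
        return scheme.lower(), url[len(scheme) + 1:]
    return None, url


print(split_scheme("HTTP://www.baidu.com/"))  # ('http', '//www.baidu.com/')
print(split_scheme("/no/scheme/here"))        # (None, '/no/scheme/here')
```

Only the scheme is lowercased; the remainder of the URL is returned untouched.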