Reference: http://www.starming.com/index.php?action=plugin&v=wave&tpl=union&ac=viewgrouppost&gid=73&tid=3714
You may also find the following article on fetching web resources with Python helpful:
* Basic Authentication - A tutorial on Basic Authentication, with examples in Python.
urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, and is capable of fetching URLs using many different protocols. It also offers a slightly more complex interface for handling common situations, such as basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string before the ':' in the URL; for example 'ftp' is the URL scheme of 'ftp://python.org/') using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations urlopen is very easy to use, but as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HTTP protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616, which is a technical document and not easy to read. This HOWTO aims to illustrate the use of urllib2, with enough detail about HTTP to help you through. It is not intended to replace the urllib2 documentation, but to supplement it.
Fetching URLs
The simplest way to use urllib2 is as follows:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we could have used a URL starting with 'ftp:', 'file:', and so on). However, the purpose of this tutorial is to explain the more complicated cases, concentrating on HTTP.
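For instance, a minimal sketch of the same call with other schemes (assuming a Unix-like system for the file: URL; the FTP host below is purely illustrative):
import urllib2
# The same urlopen interface works for other URL schemes.
local_copy = urllib2.urlopen('file:///etc/hosts').read()
ftp_listing = urllib2.urlopen('ftp://ftp.example.com/').read()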
HTTP is based on requests and responses: the client makes requests and the server sends responses. urllib2 mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. The response is a file-like object, which means you can, for example, call .read() on it:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:
req = urllib2.Request('ftp://example.com/')
In the case of HTTP, there are two extra things that Request objects allow you to do. First, you can pass data to be sent to the server. Second, you can pass extra information ('metadata') about the data or about the request itself to the server; this information is sent as HTTP "headers". Let's look at each of these in turn.
Data
Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or another web application). With HTTP, this is often done using what is known as a POST request. This is often what your browser does when you submit an HTML form that you have filled in. Not all POSTs come from forms: you can use POST to transmit arbitrary data to your own application. In the common case, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done using a function from urllib, not from urllib2:
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
Note that other encodings are sometimes required (for example, for file upload from HTML forms; see the HTML Specification, Form Submission, for details).
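Purely as an illustration (not something this tutorial depends on), the following sketch hand-rolls a multipart/form-data body of the kind used for file uploads; the URL, field names and file contents are all made up:
import urllib2
import uuid

# Build a minimal multipart/form-data body by hand.
# Real code should handle binary data and escaping carefully.
boundary = uuid.uuid4().hex
body = ('--%(b)s\r\n'
        'Content-Disposition: form-data; name="name"\r\n\r\n'
        'Michael Foord\r\n'
        '--%(b)s\r\n'
        'Content-Disposition: form-data; name="upload"; filename="hello.txt"\r\n'
        'Content-Type: text/plain\r\n\r\n'
        'Hello, world!\r\n'
        '--%(b)s--\r\n') % {'b': boundary}
headers = {'Content-Type': 'multipart/form-data; boundary=%s' % boundary}
req = urllib2.Request('http://www.someserver.com/cgi-bin/upload.cgi', body, headers)
response = urllib2.urlopen(req)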
If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example, by placing an order with the website for a tin of spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects and GET requests never to, nothing prevents a GET request from having side-effects, nor a POST request from having none. Data can also be passed in an HTTP GET request by encoding it in the URL itself.
This is done as follows:
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print(url_values)
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Notice that the full URL is created by adding a '?' to the URL, followed by the encoded values.
Headers
We will discuss one particular HTTP header here, to illustrate how to add headers to your HTTP request.
Some websites [2] dislike being browsed by programs, or send different versions to different browsers [3]. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release), which may confuse the site, or simply not work. The way a browser identifies itself is through the User-Agent header [4]. When you create a Request object you can pass in a dictionary of headers. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [5]:
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
The response also has two useful methods. See the section info and geturl, which comes after we have a look at what happens when things go wrong.
Handling Exceptions
urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError and so on may also be raised).
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
URLError
Often, URLError is raised because there is no network connection (no route to the specified server), or because the specified server does not exist. In this case, the exception raised has a 'reason' attribute, which is a tuple containing an error code and a text error message.
For example:
>>> import urllib2
>>> from urllib2 import URLError
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try:
...     urllib2.urlopen(req)
... except URLError, e:
...     print e.reason
...
(4, 'getaddrinfo failed')
HTTPError
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server was unable to fulfil the request. The default handlers will deal with some of these responses for you (for example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib2 will handle that automatically). For those it cannot handle, urlopen raises an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden) and '401' (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
The HTTPError instance raised has an integer 'code' attribute, which corresponds to the error sent by the server.
Error Codes
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes which shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience:
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
'Switching to new protocol; obey Upgrade header'),
200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),
300: ('Multiple Choices',
'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
'Document has not changed since given time'),
305: ('Use Proxy',
'You must use proxy specified in Location to access this '
'resource.'),
307: ('Temporary Redirect',
'Object moved temporarily -- see URI list'),
400: ('Bad Request',
'Bad request syntax or unsupported method'),
401: ('Unauthorized',
'No permission -- see authorization schemes'),
402: ('Payment Required',
'No payment -- see charging schemes'),
403: ('Forbidden',
'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
'Cannot satisfy request range.'),
417: ('Expectation Failed',
'Expect condition could not be satisfied.'),
500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
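As a quick illustration of using this dictionary (a sketch, assuming Python 2, where the BaseHTTPServer module provides it):
import BaseHTTPServer

# Look up the short and long messages for a status code.
short_msg, long_msg = BaseHTTPServer.BaseHTTPRequestHandler.responses[404]
print short_msg   # Not Found
print long_msg    # Nothing matches the given URI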
When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response for the page returned. This means that, as well as the code attribute, it also has read, geturl and info methods:
>>> import urllib2
>>> from urllib2 import URLError
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
...     urllib2.urlopen(req)
... except URLError, e:
...     print e.code
...     print e.read()
...
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
...... etc...
Wrapping it Up
So if you want to be prepared for HTTPError or URLError, there are two basic approaches. I prefer the second.
Number 1
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    pass
Note
The except HTTPError clause must come first, otherwise except URLError will also catch an HTTPError.
Number 2
from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    pass
info and geturl
The response returned by urlopen (or the HTTPError instance) has two useful methods, info and geturl.
geturl - this returns the real URL of the page fetched. It is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page fetched may not be the same as the URL requested.
info - this returns a dictionary-like object describing the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.
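A short sketch exercising both methods; www.python.org is used purely as a convenient example host:
import urllib2

response = urllib2.urlopen('http://www.python.org/')
print response.geturl()        # the final URL, after any redirects
headers = response.info()      # an httplib.HTTPMessage instance
print headers['Content-Type']  # e.g. text/html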
Openers and Handlers
When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib2.OpenerDirector). Normally we have been using the default opener - via urlopen - but you can create custom openers. Openers use handlers. All the “heavy lifting” is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.
To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.
Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
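For example, a minimal sketch of an opener that handles cookies, built with build_opener and the standard cookielib module (the URL is illustrative):
import urllib2
import cookielib

# build_opener keeps the default handlers and adds ours on top.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
response = opener.open('http://www.example.com/')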
Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.
install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch urls in the same way as the urlopen function: there’s no need to call install_opener, except as a convenience.
Basic Authentication
To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject, including an explanation of how Basic Authentication works, see the Basic Authentication Tutorial.
When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like: Www-authenticate: SCHEME realm="REALM".
For example:
Www-authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is ‘basic authentication’. In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.
The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a HTTPPasswordMgr. Frequently one doesn’t care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.
The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match:
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of ``None``.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
Note
In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number), e.g. "http://example.com/", or an "authority" (i.e. the hostname, optionally including the port number), e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component; for example "joe:password@example.com" is not correct.
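In other words, both of the following calls would be acceptable; this sketch reuses the password_mgr from the example above, with placeholder credentials:
# A full URL, or just an authority (hostname, optionally with a port).
# Neither form may include userinfo such as 'joe:password@'.
password_mgr.add_password(None, "http://example.com/foo/", "joe", "password")
password_mgr.add_password(None, "example.com:8080", "joe", "password")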
Proxies
urllib2 will auto-detect your proxy settings and use those. This happens through the ProxyHandler, which is part of the normal handler chain. Normally that is a good thing, but there are occasions when it may not be helpful [6], in which case you may want to disable proxy handling altogether. One way to do this is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler:
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
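Conversely, a sketch of the opposite case, forcing requests through one specific proxy; the proxy address here is hypothetical:
>>> proxy_support = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128/'})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)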
Note
Currently urllib2 does not support fetching of https locations through a proxy. However, this can be enabled by extending urllib2 as shown in the recipe [7].
Sockets and Layers
The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.
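To make the layering concrete, here is a sketch of a simple GET request made one level down, directly with httplib; www.python.org serves again as the example host:
import httplib

# httplib sits below urllib2 and above socket.
conn = httplib.HTTPConnection('www.python.org')
conn.request('GET', '/')
resp = conn.getresponse()
print resp.status, resp.reason   # e.g. 200 OK
page = resp.read()
conn.close()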
As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using:
import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)