urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations - like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL - for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib2, with enough detail about HTTP to help you through. It is not intended to replace the urllib2 docs, but is supplementary to them.
The simplest way to use urllib2 is as follows:
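A minimal sketch of this simplest case is given here. The helper name fetch is hypothetical and http://python.org/ is only a placeholder URL. The import fallback to urllib.request is an assumption made so that the sketch also runs on Python 3, where urllib2 no longer exists.

```python
try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

def fetch(url):
    """Open url with urlopen and return the whole body."""
    response = urllib2.urlopen(url)
    try:
        return response.read()
    finally:
        response.close()

# Typical use (needs network access, so it is left commented out):
# html = fetch('http://python.org/')
```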
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP.
HTTP is based on requests and responses - the client makes requests and servers send responses. urllib2 mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can, for example, call .read() on the response:
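The Request-based form could look roughly like this. The helper name read_page and the URL are placeholders, and the urllib.request fallback is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

# A Request in its simplest form only names the URL to fetch.
req = urllib2.Request('http://www.example.com/')   # placeholder URL

def read_page(request):
    """Send the request with urlopen and return the body of the response."""
    response = urllib2.urlopen(request)
    try:
        return response.read()
    finally:
        response.close()

# the_page = read_page(req)   # needs network access
```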
Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:
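As a small sketch, an FTP request is built exactly like an HTTP one. The host name is a placeholder, and the urllib.request fallback is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

ftp_req = urllib2.Request('ftp://example.com/')   # placeholder FTP server
# response = urllib2.urlopen(ftp_req)   # same call as for HTTP; needs network access
```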
In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself, to the server - this information is sent as HTTP "headers". Let's look at each of these in turn.
Data
Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or other web application). With HTTP, this is often done using what's known as a POST request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib library, not from urllib2.
Note that other encodings are sometimes required (e.g. for file upload from HTML forms - see HTML Specification, Form Submission for more details).
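A sketch of such a POST request is shown here. The URL and the form values are placeholders. The urllib.request and urllib.parse fallbacks, and the conversion of the encoded data to bytes, are assumptions made so the sketch also runs on Python 3.

```python
try:
    import urllib2
    from urllib import urlencode             # Python 2: urlencode lives in urllib
except ImportError:
    import urllib.request as urllib2         # assumption: Python 3 stand-ins
    from urllib.parse import urlencode

url = 'http://www.example.com/cgi-bin/register.cgi'   # hypothetical form handler
values = {'name': 'Somebody Here',
          'location': 'Northampton',
          'language': 'Python'}

data = urlencode(values)                  # standard form encoding
if not isinstance(data, bytes):           # Python 3 wants the body as bytes
    data = data.encode('ascii')

req = urllib2.Request(url, data)          # passing data makes this a POST
# response = urllib2.urlopen(req)         # needs network access
# the_page = response.read()
```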
If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.
This is done as follows.
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.
We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.
Some websites [2] dislike being browsed by programs, or send different versions to different browsers [3]. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [4]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [5].
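A sketch of a request carrying a User-Agent header is shown here. The URL, the form values and the exact User-Agent string are placeholders, and the Python 3 import fallbacks are assumptions.

```python
try:
    import urllib2
    from urllib import urlencode             # Python 2
except ImportError:
    import urllib.request as urllib2         # assumption: Python 3 stand-ins
    from urllib.parse import urlencode

url = 'http://www.example.com/cgi-bin/register.cgi'   # hypothetical form handler
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Somebody Here',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}      # extra headers go in a plain dict

data = urlencode(values)
if not isinstance(data, bytes):           # Python 3 wants the body as bytes
    data = data.encode('ascii')

req = urllib2.Request(url, data, headers)
# response = urllib2.urlopen(req)         # needs network access
# the_page = response.read()
```

Request stores header names in capitalized form, so the header can be read back with req.get_header('User-agent').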
The response also has two useful methods. See the section on info and geturl which comes after we have a look at what happens when things go wrong.
urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError etc. may also be raised). HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute, which is a tuple containing an error code and a text error message. For example:
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.reason
...
(4, 'getaddrinfo failed')
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes. The HTTPError instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary that lists all the response codes used by RFC 2616, mapping each code to a short message and a longer explanation.
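As a rough illustration, the responses dictionary can be used to turn a numeric code into its short message. The helper name describe is hypothetical, and the fallback to http.server is an assumption for Python 3, where BaseHTTPServer was renamed.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # assumption: Python 3 stand-in

def describe(code):
    """Return e.g. '404 Not Found' for a numeric HTTP status code."""
    # Each entry is a (short message, long explanation) pair.
    short, _long = BaseHTTPRequestHandler.responses.get(code, ('Unknown', ''))
    return '%d %s' % (code, short)

for code in (401, 403, 404):
    print(describe(code))
```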
When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response on the page returned. This means that as well as the code attribute, it also has read, geturl, and info methods.
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.code
...     print e.read()
...
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css"
    type="text/css"?>
<html><head><title>Error 404: File Not Found</title>
...... etc...
So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach.
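The first approach uses two except clauses, one per exception class. The sketch here is not the author's original listing: the function name try_fetch and its opener parameter are hypothetical (the parameter exists only so the error handling can be exercised without a network), and the urllib.request fallback is an assumption for Python 3. The "except ... as e" spelling needs Python 2.6 or later.

```python
try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

def try_fetch(url, opener=urllib2.urlopen):
    """Return ('ok', response), ('http error', code) or ('unreachable', reason)."""
    try:
        response = opener(url)
    except urllib2.HTTPError as e:
        # The server answered, but with an error status.
        return ('http error', e.code)
    except urllib2.URLError as e:
        # No answer at all: no connection, unknown host, and so on.
        return ('unreachable', e.reason)
    else:
        return ('ok', response)
```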
Note
The except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
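The second approach uses a single except clause and inspects the exception afterwards. This is again only a sketch with hypothetical names (try_fetch_single, opener). It checks for the code attribute first, on the assumption that newer interpreters also give HTTPError a reason attribute, in which case testing reason first would misreport HTTP errors as connection failures.

```python
try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

def try_fetch_single(url, opener=urllib2.urlopen):
    """Same result tuples as the two-clause version, with one except clause."""
    try:
        response = opener(url)
    except urllib2.URLError as e:
        if hasattr(e, 'code'):
            # Only HTTPError carries a status code.
            return ('http error', e.code)
        return ('unreachable', e.reason)
    else:
        return ('ok', response)
```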
Note
URLError is a subclass of the built-in exception IOError.
This means that you can avoid importing URLError and use:
from urllib2 import Request, urlopen
req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    # Check 'code' first: later Python 2.7 releases also give HTTPError a 'reason'.
    if hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    pass  # everything is fine
Under rare circumstances urllib2 can raise socket.error.
The response returned by urlopen (or the HTTPError instance) has two useful methods info and geturl.
geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.
info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.
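A short sketch of how geturl and info might be used together follows. The helper name describe_response and the URL in the commented usage are placeholders, and the urllib.request fallback is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

def describe_response(response):
    """Return (final URL, Content-type, Content-length) for a response."""
    headers = response.info()                 # dictionary-like header object
    return (response.geturl(),                # real URL, after any redirects
            headers.get('Content-type'),
            headers.get('Content-length'))

# response = urllib2.urlopen('http://www.example.com/')   # needs network access
# print(describe_response(response))
```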
When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib2.OpenerDirector). Normally we have been using the default opener - via urlopen - but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.
To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.
Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.
install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch urls in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.
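The two ways of creating an opener can be sketched as follows. The choice of HTTPCookieProcessor and HTTPHandler is only illustrative, the URL is a placeholder, and the urllib.request fallback is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

# build_opener adds the default handlers plus any handlers passed in.
cookie_handler = urllib2.HTTPCookieProcessor()
opener = urllib2.build_opener(cookie_handler)

# The longer route: start from an empty OpenerDirector and add handlers by hand.
manual_opener = urllib2.OpenerDirector()
manual_opener.add_handler(urllib2.HTTPHandler())

# response = opener.open('http://www.example.com/')   # use the opener directly
# urllib2.install_opener(opener)                      # or make it urlopen's default
```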
To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject - including an explanation of how Basic Authentication works - see the Basic Authentication Tutorial.
When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like: Www-authenticate: SCHEME realm="REALM". For example:
Www-authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is 'basic authentication'. In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.
The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use an HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.
The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.
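A sketch of setting up basic authentication along these lines is shown here. The URL, username and password are placeholders, and the urllib.request fallback is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

# Create a password manager that does not care about the realm.
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

top_level_url = 'http://example.com/foo/'      # placeholder URL
username, password = 'joe', 'secret'           # placeholder credentials
# None as the realm means "use these credentials for any realm".
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)

# opener.open('http://example.com/foo/page.html')   # fetch with the opener
# urllib2.install_opener(opener)                    # or make urlopen use it
```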
Note
In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations - ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number) e.g. "http://example.com/" or an "authority" (i.e. the hostname, optionally including the port number) e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component - for example "joe:password@example.com" is not correct.
urllib2 will auto-detect your proxy settings and use those. This is through the ProxyHandler which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [6]. One way to turn proxy handling off is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler:
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
Note
Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using :
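Setting the global default can be sketched like this. The ten-second value and the URL in the commented usage are placeholders, and the urllib.request fallback is an assumption for Python 3.

```python
import socket

try:
    import urllib2                      # Python 2, as used in this tutorial
except ImportError:
    import urllib.request as urllib2    # assumption: Python 3 stand-in

timeout = 10                            # timeout in seconds
socket.setdefaulttimeout(timeout)       # applies to every socket created afterwards

# This call to urlopen would now use the default timeout set in the socket module.
# req = urllib2.Request('http://www.example.com/')
# response = urllib2.urlopen(req)
```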