[Original article] https://doc.scrapy.org/en/latest/topics/request-response.html
Scrapy uses Request and Response objects for crawling web sites.
Typically, Request objects are generated in the spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
Both the Request and Response classes have subclasses which add functionality not required in the base classes. These are described below in Request subclasses and Response subclasses.
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])
A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a Response.
url
A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the constructor.
This attribute is read-only. To change the URL of a Request use replace().
method
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.
headers
A dictionary-like object which contains the request headers.
body
A str that contains the request body.
This attribute is read-only. To change the body of a Request use replace().
meta
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc), so the data contained in it depends on the extensions you have enabled.
See Request.meta special keys for a list of special meta keys recognized by Scrapy.
This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
copy()
Return a new Request which is a copy of this Request. See also: Passing additional data to callback functions.
replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])
Return a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Request.meta is copied by default (unless a new value is given in the meta argument). See also Passing additional data to callback functions.
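For instance, here is a minimal sketch (the URLs and the item_id meta key are purely illustrative) of cloning a request with replace():
import scrapy

original = scrapy.Request("http://www.example.com/some_page.html",
                          meta={'item_id': 42})

# replace() returns a new Request; members that are not overridden,
# including meta, are copied from the original request
clone = original.replace(url="http://www.example.com/other_page.html")

print(clone.url)    # http://www.example.com/other_page.html
print(clone.meta)   # {'item_id': 42}  (shallow-copied from the original)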
The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
Example:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
In some cases you may want to pass arguments to those callback functions so you can receive them later, in the second callback. You can use the Request.meta attribute for that.
Here's an example of how to pass an item using this mechanism, to populate different fields from different pages:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
The errback of a request is a function that will be called when an exception is raised while processing it.
It receives a Twisted Failure instance as first parameter and can be used to track connection establishment timeouts, DNS errors, etc.
Here's an example spider logging all errors and catching some specific errors if needed:
import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.
Those are:
dont_redirect
dont_retry
handle_httpstatus_list
handle_httpstatus_all
dont_merge_cookies
cookiejar
dont_cache
redirect_urls
bindaddress
dont_obey_robotstxt
download_timeout
download_maxsize
download_latency
download_fail_on_dataloss
proxy
ftp_user (See FTP_USER for more info)
ftp_password (See FTP_PASSWORD for more info)
referrer_policy
max_retry_times
bindaddress
The outgoing IP address to use for performing the request.
download_timeout
The amount of time (in secs) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.
download_latency
The amount of time spent to fetch the response, since the request has been started, i.e. the HTTP message sent over the network. This meta key only becomes available once the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.
download_fail_on_dataloss
Whether or not to fail on broken responses. See: DOWNLOAD_FAIL_ON_DATALOSS.
max_retry_times
This meta key is used to set retry times per request. When initialized, the max_retry_times meta key takes higher precedence over the RETRY_TIMES setting.
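As an illustration, here is a minimal sketch (the target URL and proxy address are hypothetical) of setting a few of these special meta keys on a request:
import scrapy

def build_request():
    # download_timeout, max_retry_times, proxy and handle_httpstatus_list
    # are special meta keys recognized by Scrapy's built-in components
    return scrapy.Request(
        "http://www.example.com/slow_page.html",
        meta={
            "download_timeout": 30,            # per-request timeout in seconds
            "max_retry_times": 2,              # overrides the RETRY_TIMES setting
            "proxy": "http://proxy.example.com:8080",
            "handle_httpstatus_list": [404],   # let the spider see 404 responses
        },
    )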
Here is the list of built-in Request subclasses. You can also subclass it to implement your own custom functionality.
Using FormRequest to send data via HTTP POST
If you want to simulate an HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Using FormRequest.from_response() to simulate a user login
It is usual for websites to provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (for login pages). When scraping, you'll want these fields to be automatically pre-populated and only override a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here's an example spider which uses it:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        # (response.body is bytes, so compare against a bytes literal)
        if b"authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
url
A string containing the URL of the response.
This attribute is read-only. To change the URL of a Response use replace().
status
An integer representing the HTTP status of the response. Example: 200, 404.
headers
A dictionary-like object which contains the response headers. Values can be accessed using get() to return the first header value with the specified name, or getlist() to return all header values with the specified name. For example, this call will give you all cookies in the headers:
response.headers.getlist('Set-Cookie')
body
The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
This attribute is read-only. To change the body of a Response use replace().
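As a brief sketch (assuming the callback receives an HTML page, hence a TextResponse subclass), the difference looks like this:
def parse(self, response):
    raw = response.body    # always bytes, e.g. b'<html>...'
    text = response.text   # unicode str, decoded with the response encoding
                           # (only available on TextResponse and subclasses)
    self.logger.info("Got %d bytes from %s", len(raw), response.url)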
request
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all Downloader Middlewares. In particular, this means that:
HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response.
Response.request.url doesn't always equal Response.url.
This attribute is only available in the spider code and in the Spider Middlewares, but not in Downloader Middlewares (although you have the Request available there by other means) and handlers of the response_downloaded signal.
meta
A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).
Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
See also
Request.meta attribute
flags
A list that contains the flags for this response. Flags are labels used for tagging Responses. For example: 'cached', 'redirected', etc. They are shown in the string representation of the Response (__str__ method) which is used by the engine for logging.
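For instance, a minimal sketch of checking a flag inside a callback (assuming the HTTP cache middleware is enabled, which adds the 'cached' flag to responses served from cache):
def parse(self, response):
    if 'cached' in response.flags:
        self.logger.debug("Response for %s came from the local HTTP cache", response.url)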
copy()
Returns a new Response which is a copy of this Response.
replace([url, status, headers, body, request, flags, cls])
Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.
urljoin(url)
Constructs an absolute url by combining the Response's url with a possible relative url.
This is a wrapper over urlparse.urljoin, it's merely an alias for making this call:
urlparse.urljoin(response.url, url)
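A small sketch of the typical use inside a callback, where extracted href values (the CSS selector here is just illustrative) may be relative URLs:
import scrapy

def parse(self, response):
    for href in response.css('a::attr(href)').extract():
        # turn relative links such as "/page/2" into absolute URLs
        yield scrapy.Request(response.urljoin(href), callback=self.parse)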
follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None)
Return a Request instance to follow a link url. It accepts the same arguments as the Request.__init__ method, but url can be a relative URL or a scrapy.link.Link object, not only an absolute URL.
TextResponse provides a follow() method which supports selectors in addition to absolute/relative URLs and Link objects.
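As a sketch, the loop shown above for urljoin() becomes shorter with follow(), since relative URLs are handled for you (the selector is again illustrative):
def parse(self, response):
    for href in response.css('a::attr(href)').extract():
        # follow() builds a Request instance from a relative or absolute URL
        yield response.follow(href, callback=self.parse)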
Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.