Request
Request is the request object. Its signature is Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
url: required; the target URL of the request
callback: the callback function; by default it receives a single parameter, the response. Other request-related information can be passed along via meta.
method: the HTTP method, 'GET' by default
headers: the request headers, as a dict
body: the request body, as bytes or str
cookies: the cookies for the request, either a list of cookies or a cookie dict
meta: a dict of arbitrary metadata for this request. This dict is empty for new requests and is usually populated by different Scrapy components (extensions, middlewares, etc.). The built-in keys are as follows:
dont_redirect: if Request.meta contains the dont_redirect key set to True, the request will be ignored by the RedirectMiddleware
dont_retry: if Request.meta contains the dont_retry key, the request will be ignored by the RetryMiddleware
handle_httpstatus_list: the handle_httpstatus_list key in Request.meta can be used to specify which response codes to allow for a given request
handle_httpstatus_all: set handle_httpstatus_all to True to allow any response code for the request
dont_merge_cookies: set dont_merge_cookies to True in Request.meta to avoid merging with existing cookies
cookiejar: Scrapy supports keeping multiple cookie sessions per spider through the cookiejar key in Request.meta. By default it uses a single cookie jar (session), but you can pass an identifier to use several
dont_cache: set the dont_cache meta key to True to avoid caching the response with the active cache policy
redirect_urls: the URLs through which a (redirected) request has passed can be found in the redirect_urls key of Request.meta
bindaddress: the outgoing IP address to bind to when performing the request
dont_obey_robotstxt: if the dont_obey_robotstxt key in Request.meta is set to True, the RobotsTxtMiddleware will ignore the request even when ROBOTSTXT_OBEY is enabled
download_timeout: the amount of time (in seconds) the downloader will wait before timing out
download_maxsize: the maximum response size (in bytes) the downloader will download for this request
download_latency: the amount of time spent fetching the response since the request started, i.e. since the HTTP message was sent over the network. This meta key only becomes available once the response has been downloaded; while most other meta keys are used to control Scrapy's behaviour, this one is supposed to be read-only
download_fail_on_dataloss: whether or not to fail on broken responses (data loss)
proxy: sets the proxy to use for this request, with a value like http://some_proxy_server:port
ftp_user: the username to use for FTP connections
ftp_password: the password to use for FTP connections
referrer_policy: sets the referrer policy to apply for this request
max_retry_times: the number of retries to use for this request; when set, the max_retry_times meta key takes precedence over the RETRY_TIMES setting
Of these, proxy, used for setting a proxy, is the most commonly used; the remaining keys can be looked up in the documentation when needed.
encoding: the encoding of this request ('utf-8' by default)
priority: the priority of this request (0 by default); requests with higher priority values are executed earlier
dont_filter: False by default, meaning the URL takes part in duplicate filtering; set it to True to exempt the URL from filtering
errback: a callback for handling exceptions raised while processing the request, including pages that fail with a 404 HTTP error and the like. A sketch covering these arguments and the meta keys follows below.
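To see how these arguments and the common meta keys fit together, here is a minimal sketch; the URL, the proxy address, the callback/errback names, and the page_number meta key are hypothetical:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # A GET request with a proxy and a timeout set via meta,
        # extra context passed to the callback, and an errback
        # for failures. The URL and proxy address are placeholders.
        yield scrapy.Request(
            url='http://www.example.com/page/1',
            callback=self.parse_page,
            errback=self.handle_error,
            meta={
                'proxy': 'http://some_proxy_server:port',  # placeholder proxy
                'download_timeout': 30,                    # seconds
                'page_number': 1,                          # custom key, read in the callback
            },
            priority=10,        # executed earlier than priority-0 requests
            dont_filter=False,  # keep duplicate filtering enabled
        )

    def parse_page(self, response):
        # Custom meta keys survive the round trip to the response.
        self.logger.info('Got page %s', response.meta['page_number'])

    def handle_error(self, failure):
        # Called for DNS errors, timeouts, non-2xx responses, etc.
        if failure.check(HttpError):
            self.logger.error('HTTP error on %s', failure.value.response.url)
        else:
            self.logger.error('Request failed: %s', repr(failure))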
The Request subclass FormRequest
FormRequest is used for form operations. The original documentation reads as follows:
FormRequest objects
The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.

class scrapy.http.FormRequest(url[, formdata, ...])
The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as for the Request class and are not documented here.

Parameters: formdata (dict or iterable of tuples) – is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.

The FormRequest objects support the following class method in addition to the standard Request methods:

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])
Returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response. For an example see Using FormRequest.from_response() to simulate a user login.

The policy is to automatically simulate a click, by default, on any form control that looks clickable, like a <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems which could be hard to debug. For example, when working with forms that are filled and/or submitted using javascript, the default from_response() behaviour may not be the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it) you can also use the clickdata argument.

Parameters:
response (Response object) – the response containing a HTML form which will be used to pre-populate the form fields
formname (string) – if given, the form with name attribute set to this value will be used.
formid (string) – if given, the form with id attribute set to this value will be used.
formxpath (string) – if given, the first form that matches the xpath will be used.
formcss (string) – if given, the first form that matches the css selector will be used.
formnumber (integer) – the number of form to use, when the response contains multiple forms. The first one (and also the default) is 0.
formdata (dict) – fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter.
clickdata (dict) – attributes to lookup the control clicked. If it's not given, the form data will be submitted simulating a click on the first clickable element. In addition to html attributes, the control can be identified by its zero-based index relative to other submittable inputs inside the form, via the nr attribute.
dont_click (boolean) – If True, the form data will be submitted without clicking in any element.

The other parameters of this class method are passed directly to the FormRequest constructor.

New in version 0.10.3: The formname parameter.
New in version 0.17: The formxpath parameter.
New in version 1.1.0: The formcss parameter.
New in version 1.1.0: The formid parameter.
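Before the official examples, a minimal sketch of the form-selection arguments may help; the URL, form id, field name, and button name below are assumptions for illustration:

import scrapy


class SearchSpider(scrapy.Spider):
    name = 'search_example'
    start_urls = ['http://www.example.com/search']  # placeholder URL

    def parse(self, response):
        # Pick the form via an XPath expression and choose which submit
        # control is "clicked" via clickdata. The form id, field name,
        # and button name are hypothetical.
        return scrapy.FormRequest.from_response(
            response,
            formxpath='//form[@id="search-form"]',
            formdata={'q': 'scrapy'},
            clickdata={'name': 'search_button'},
            callback=self.parse_results,  # passed through to the constructor
        )

    def parse_results(self, response):
        self.logger.info('Results page fetched: %s', response.url)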
Request usage examples
Using FormRequest to send data via HTTP POST
If you want to simulate a HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
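The snippet refers to an after_post callback that the documentation does not show; a minimal hypothetical version would be:

def after_post(self, response):
    # Hypothetical callback: log that the POST was processed.
    self.logger.info('POST finished: %s (%s)', response.url, response.status)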
Using FormRequest.from_response() to simulate a user login
It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (for login pages). When scraping, you'll want these fields to be automatically pre-populated and only override a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here's an example spider which uses it:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # response.body is bytes; use response.text for a str comparison
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
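Since the session cookies set during login are kept by Scrapy's cookie middleware, the spider can go on to request pages that require authentication. A hypothetical continuation of after_login (the profile URL and parse_profile are made up):

    def after_login(self, response):
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
        # The login cookies are sent automatically, so this request
        # runs under the authenticated session.
        yield scrapy.Request(
            url='http://www.example.com/users/profile.php',  # hypothetical URL
            callback=self.parse_profile,
        )

    def parse_profile(self, response):
        self.logger.info('Profile page title: %s',
                         response.xpath('//title/text()').extract_first())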
If you're interested, have a look at the original documentation (https://doc.scrapy.org); concrete cases will follow later.
Response
Response is the response object. Its concrete subclasses are TextResponse, HtmlResponse, and XmlResponse, all of which inherit from the Response base class; the subclass is chosen according to the Content-Type of the response, and HtmlResponse is the one used most often.
The HtmlResponse subclass has the following commonly used attributes: url (the response URL), status (the HTTP status code), headers (the response headers), body (the raw response bytes), text (the body decoded as str), encoding (the response encoding), request (the Request that produced this response), meta (a shortcut to request.meta), and selector (a Selector built on the body), plus the shortcut methods xpath(), css(), and urljoin().
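A small sketch of how these attributes are typically read inside a callback (the URL is a placeholder):

import scrapy


class ResponseDemoSpider(scrapy.Spider):
    name = 'response_demo'
    start_urls = ['http://www.example.com']  # placeholder URL

    def parse(self, response):
        # response here is an HtmlResponse (Content-Type: text/html)
        self.logger.info('URL: %s, status: %s', response.url, response.status)
        self.logger.info('Encoding: %s', response.encoding)
        # text is body decoded with the detected encoding
        title = response.xpath('//title/text()').extract_first()
        self.logger.info('Title: %s', title)
        # urljoin resolves relative links against response.url
        for href in response.css('a::attr(href)').extract():
            self.logger.debug('Absolute link: %s', response.urljoin(href))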
Request and Response are the two most central objects in Scrapy: they are the key medium connecting the spider, the engine, the downloader, and the scheduler, and they are two objects you have to understand in depth.