Passing values from a Request to its callback function
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
Passing cookies with a Request, and restricting the domain/path scope under which cookies received afterwards are stored.
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])
By customizing the domain and path attributes, cookies received from those pages are stored and attached to later requests automatically, so you don't have to add them by hand afterwards. If a site requires a cookie to be accessed, you only need to send it once at the start.
Setting a Request's dont_filter to True prevents it from being dropped by the duplicate-request filter; this is useful when you want to request the same page multiple times. It defaults to False. A minimal sketch follows.
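A minimal sketch of issuing a repeated request, assuming it runs inside a spider callback; parse_again is a hypothetical callback name used only for illustration.

yield scrapy.Request('http://www.example.com/index.html',
                     callback=self.parse_again,
                     dont_filter=True)  # bypass the built-in duplicate-request filter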
Use a Request's errback to catch exceptions.
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Use the Request.replace() method to build a new Request that keeps all of the original's values except the ones you redefine.
replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])
It returns a Request object with the same members, except for those given new values through the specified keyword arguments. By default, the Request.cb_kwargs and Request.meta attributes are shallow-copied (unless new values are passed as arguments).
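A minimal sketch of re-issuing a request with only a couple of members changed; self.other_parse is a hypothetical callback used for illustration.

# The new request reuses the original URL, headers, cookies, meta, etc.;
# only the members overridden here take new values.
new_request = request.replace(callback=self.other_parse, dont_filter=True)
yield new_request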
scrapy.FormRequest and scrapy.FormRequest.from_response
https://blog.csdn.net/qq_33472765/article/details/80958820
See the documentation for details. from_response differs from FormRequest: the latter simply sends a POST with the data you supply, while the former pre-fills the form fields found in an HTML page; you can specify which form to use (if the page has several) and then simulate a click to submit it. For forms handled by JavaScript, you can also tell it not to simulate the click.
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Websites usually provide pre-populated form fields through hidden <input> elements, such as session-related data or authentication tokens (for login pages). When scraping, you want those fields to be filled in automatically and only override a few of them, such as the username and password. You can use the FormRequest.from_response() method for this job.
import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...