Sometimes, while writing a Scrapy spider, you want to pause it at a certain point and inspect the response being processed, to check that it is the response you expect.
This can be achieved by using the scrapy.shell.inspect_response function.
Here is an example of how to call it from your spider:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.
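If you prefer a full debugger over the Scrapy shell, the same pause-only-on-this-response pattern works with the standard library's pdb. A minimal sketch; parse_like below is a hypothetical stand-in for the parse() method above, not part of Scrapy:

```python
def parse_like(url):
    """Hypothetical stand-in for parse(): pause only on the response we care about."""
    if ".org" in url:
        # Uncomment to drop into an interactive debugger at this point:
        # import pdb; pdb.set_trace()
        return "inspected"
    return "skipped"


print(parse_like("http://example.org"))  # inspected
print(parse_like("http://example.com"))  # skipped
```

Unlike inspect_response, pdb gives you stepping and breakpoints, but not the Scrapy shell shortcuts such as view(response).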
When you run the spider, you will get output similar to this:
2017-10-23 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2017-10-23 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
Then, you can check if the extraction code is working:
>>> response.xpath('//h1[@class="fn"]')
[]
Nope, it isn't there. You can open the response in your web browser and see if it is the response you were expecting:
>>> view(response)
True
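An empty result like the one above simply means the page contains no element matching the XPath. You can reproduce the same kind of query outside Scrapy using only the standard library; a sketch with made-up HTML, not the real example.org page:

```python
import xml.etree.ElementTree as ET

# Made-up HTML standing in for the page being inspected.
html = "<html><body><h1>Example Domain</h1></body></html>"
root = ET.fromstring(html)

# ElementTree supports a limited XPath subset, including attribute predicates.
matches = root.findall(".//h1[@class='fn']")
print(matches)  # [] -- no <h1 class="fn"> anywhere in the document
```

Scrapy's own selectors (response.xpath) support full XPath 1.0 via lxml, so they handle far more expressions than this stdlib subset.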
Finally, you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:
>>> ^D
2017-10-23 17:50:03-0400 [myspider] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...