This section documents some common practices when using Scrapy. It covers many topics that don't often fall into any other specific section.
Besides the typical way of starting Scrapy via the scrapy crawl command, you can also use the API to run Scrapy from a script.
Keep in mind that Scrapy is built on top of the Twisted asynchronous networking library, so it must be run inside the Twisted reactor. See the following example:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
For more ways to use CrawlerProcess, see the official documentation.
If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.
Here's a working example of how to do that, using the testspiders project as reference.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
There's another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won't start or interfere with existing reactors in any way.

Using this class the reactor should be explicitly run after scheduling your spiders. It's recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.

Here's an example of its usage, along with a callback to manually stop the reactor after MySpider has finished running.
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
See also: Twisted Reactor Overview.
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
Here is an example that runs multiple spiders simultaneously:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
Same example using CrawlerRunner:
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished
Same example but running the spiders sequentially by chaining the deferreds:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished
See also: Run Scrapy from a script.
Scrapy doesn't provide any built-in facility for distributed (multi-server) crawls. However, there are some ways to distribute crawls, which vary depending on how you plan to distribute them.
If you have many spiders, the obvious way to distribute the load is to set up many Scrapyd instances and distribute spider runs among those machines.
If you instead want to run a single (big) spider across many machines, what you usually do is partition the URLs to crawl and send them to each separate spider. For example:
First, you prepare the list of URLs to crawl and put them into separate files/URLs:
http://somedomain.com/urls-to-crawl/spider1/part1.list
http://somedomain.com/urls-to-crawl/spider1/part2.list
http://somedomain.com/urls-to-crawl/spider1/part3.list
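The original text doesn't show how those partition files are produced; here is a minimal sketch of one way to do it (the urls.txt input filename and the round-robin split are illustrative assumptions):

# Sketch: split a flat list of URLs into 3 partition files (part1.list .. part3.list).
# 'urls.txt' and the round-robin assignment are assumptions for illustration.
NUM_PARTS = 3

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

parts = [[] for _ in range(NUM_PARTS)]
for i, url in enumerate(urls):
    parts[i % NUM_PARTS].append(url)

for n, chunk in enumerate(parts, start=1):
    with open('part%d.list' % n, 'w') as out:
        out.write('\n'.join(chunk) + '\n')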
Then you fire a spider run on 3 different Scrapyd servers. The spider would receive a (spider) argument part with the number of the partition to crawl:
curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
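On the receiving side, Scrapyd passes -d part=N through to the spider as a keyword argument. Below is a minimal sketch of how spider1 could use it to fetch and crawl its own partition; the partition URL pattern follows the file list above, while the one-URL-per-line format and the empty parse method are illustrative assumptions, not part of the original example:

import scrapy

class Spider1(scrapy.Spider):
    name = 'spider1'

    def __init__(self, part=None, *args, **kwargs):
        super(Spider1, self).__init__(*args, **kwargs)
        # 'part' comes from the -d part=N argument passed to schedule.json
        self.part = part

    def start_requests(self):
        # Fetch the partition file assigned to this spider run
        list_url = 'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
        yield scrapy.Request(list_url, callback=self.parse_url_list)

    def parse_url_list(self, response):
        # Assumes one URL per line in the partition file
        for url in response.text.splitlines():
            if url.strip():
                yield scrapy.Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # Placeholder: your actual parsing logic goes here
        pass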
Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
- Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot crawler behaviour.
- Use download delays; see the DOWNLOAD_DELAY setting. (A settings sketch follows this list.)
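As a minimal sketch, the two settings mentioned above could look like this in a project's settings.py (the concrete values are illustrative assumptions, not recommendations from the original text):

# settings.py (sketch) -- illustrative values, adjust to your needs
COOKIES_ENABLED = False   # some sites use cookies to spot crawler behaviour
DOWNLOAD_DELAY = 2        # delay (in seconds) between requests to the same site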
For more techniques on dealing with anti-crawler/bot measures, see these two articles:
常见爬虫/BOT对抗技术介绍(一)
常见爬虫/BOT 对抗技术简介(二)