Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Even though Scrapy was originally designed for screen scraping (more precisely, web scraping), it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
The purpose of this document is to introduce you to the concepts behind Scrapy so you can get an idea of how it works and decide if Scrapy is what you need.
When you’re ready to start a project, you can start with the tutorial.
So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information.
Let’s say we want to extract the URL, name, description and size of all torrent files added today in the Mininova site.
The list of all torrents added today can be found on this page:
http://www.mininova.org/today
The first thing is to define the data we want to scrape. In Scrapy, this is done through Scrapy Items (torrent files, in this case).
This would be our Item:
from scrapy.item import Item, Field

class Torrent(Item):
    url = Field()
    name = Field()
    description = Field()
    size = Field()
The next thing is to write a Spider which defines the start URL (http://www.mininova.org/today), the rules for following links and the rules for extracting the data from pages.
If we take a look at that page content we’ll see that all torrent URLs are like http://www.mininova.org/tor/NUMBER where NUMBER is an integer. We’ll use that to construct the regular expression for the links to follow: /tor/\d+.
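As a quick sanity check, the follow-link pattern can be tried against a few URLs with Python's re module. This is just an illustration of which links the rule would follow; the sample URLs below are made up for the example:

```python
import re

# The pattern used by the link extractor: torrent pages look like /tor/NUMBER.
TOR_LINK = re.compile(r"/tor/\d+")

# Hypothetical sample links, for illustration only.
links = [
    "http://www.mininova.org/tor/13204203",   # a torrent detail page -> followed
    "http://www.mininova.org/today",          # the listing page -> not followed
    "http://www.mininova.org/cat/4",          # a category page -> not followed
]

followed = [url for url in links if TOR_LINK.search(url)]
print(followed)  # only the /tor/NUMBER link survives
```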
We’ll use XPath for selecting the data to extract from the web page HTML source. Let’s take one of those torrent pages:
http://www.mininova.org/tor/13204203
And look at the page HTML source to construct the XPath to select the data we want, which is: torrent name, description and size.
By looking at the page HTML source we can see that the file name is contained inside a <h1> tag:
<h1>Home[2009][Eng]XviD-ovd</h1>
An XPath expression to extract the name could be:
//h1/text()
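The same selection can be sketched offline with the standard library's ElementTree, as a simplified stand-in for Scrapy's selectors (real pages are rarely well-formed XML, so Scrapy uses an HTML-tolerant parser; the snippet below is trimmed to stay parseable):

```python
import xml.etree.ElementTree as ET

# A minimal, well-formed stand-in for the torrent page markup.
snippet = "<html><body><h1>Home[2009][Eng]XviD-ovd</h1></body></html>"

root = ET.fromstring(snippet)
# ElementTree supports a limited XPath subset; .//h1 plus .text
# approximates //h1/text() for this simple case.
name = root.find(".//h1").text
print(name)  # Home[2009][Eng]XviD-ovd
```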
The description is contained inside a <div> tag with id="description":

<h2>Description:</h2>

<div id="description">
"HOME" - a documentary film by Yann Arthus-Bertrand
<br/>
<br/>
***
<br/>
<br/>
"We are living in exceptional times. Scientists tell us that we have 10 years to change the way we live, avert the depletion of natural resources and the catastrophic evolution of the Earth's climate. ...
An XPath expression to select the description could be:

//div[@id='description']
Finally, the file size is contained in the second <p> tag inside the <div> tag with id="specifications":

<div id="specifications">

<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a>
</p>

<p>
<strong>Total size:</strong>
699.79 megabyte</p>
An XPath expression to select the file size could be:

//div[@id='specifications']/p[2]/text()[2]
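The text()[2] part is what skips the "Total size:" label: the first text node of the <p> element comes before the <strong> tag, and the size string is the text node that follows it. With ElementTree that trailing text is exposed as the tail of the <strong> element, which sketches the same selection:

```python
import xml.etree.ElementTree as ET

# Stand-in for the second <p> inside <div id="specifications">.
snippet = "<p><strong>Total size:</strong> 699.79 megabyte</p>"

p = ET.fromstring(snippet)
# In XPath terms, text()[1] is the text before <strong> and text()[2]
# is the text after it; ElementTree exposes the latter as .tail.
size = p.find("strong").tail.strip()
print(size)  # 699.79 megabyte
```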
Finally, here’s the spider code:
class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent
Finally, we’ll run the spider to crawl the site and produce an output file, scraped_data.json, with the scraped data in JSON format:
scrapy crawl mininova.org -o scraped_data.json -t json
You can also write an item pipeline to store the items in a database very easily.
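A minimal sketch of such a pipeline, using the standard library's sqlite3 in place of a real database server. The class and table names here are made up for illustration; Scrapy simply calls process_item for every scraped item:

```python
import sqlite3

class SqliteTorrentPipeline:
    """Hypothetical item pipeline: stores each scraped item in a SQLite table."""

    def open_spider(self, spider=None):
        # An in-memory database keeps the sketch self-contained.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE torrents (url TEXT, name TEXT, size TEXT)"
        )

    def process_item(self, item, spider=None):
        self.conn.execute(
            "INSERT INTO torrents VALUES (?, ?, ?)",
            (item["url"], item["name"], item["size"]),
        )
        return item  # pipelines must return the item for later stages

    def close_spider(self, spider=None):
        self.conn.commit()

# Standalone usage, feeding the pipeline one plain-dict item by hand:
pipeline = SqliteTorrentPipeline()
pipeline.open_spider()
pipeline.process_item(
    {"url": "http://www.mininova.org/tor/2657665",
     "name": "Home[2009][Eng]XviD-ovd",
     "size": "699.69 megabyte"},
)
count = pipeline.conn.execute("SELECT COUNT(*) FROM torrents").fetchone()[0]
print(count)  # 1
```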
If you check the scraped_data.json file after the process finishes, you’ll see the scraped items there:
[{"url": "http://www.mininova.org/tor/2657665",
  "name": ["Home[2009][Eng]XviD-ovd"],
  "description": ["HOME - a documentary film by ..."],
  "size": ["699.69 megabyte"]},
 # ... other items ...
]
You’ll notice that all field values (except for the url which was assigned directly) are actually lists. This is because the selectors return lists. You may want to store single values, or perform some additional parsing/cleansing to the values. That’s what Item Loaders are for.
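Before reaching for Item Loaders, the effect of a TakeFirst-style processor can be sketched in plain Python: take the first element of each list-valued field and strip surrounding whitespace. The sample item below mirrors the JSON output above:

```python
# A scraped item as the selectors produce it: list-valued fields.
raw_item = {
    "url": "http://www.mininova.org/tor/2657665",   # assigned directly
    "name": ["Home[2009][Eng]XviD-ovd"],
    "size": [" 699.69 megabyte"],
}

def take_first(value):
    """Mimic an Item Loader's TakeFirst + strip: lists become single values."""
    if isinstance(value, list):
        value = value[0] if value else None
    return value.strip() if isinstance(value, str) else value

clean_item = {field: take_first(value) for field, value in raw_item.items()}
print(clean_item["size"])  # 699.69 megabyte
```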
You’ve seen how to extract and store items from a website using Scrapy, but this is just the surface. Scrapy provides many other powerful features for making scraping easy and efficient.
The next obvious steps are for you to download Scrapy, read the tutorial and join the community. Thanks for your interest!
Source of T:\mininova\mininova\items.py:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field

class MininovaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    url = Field()
    name = Field()
    description = Field()
    size = Field()
Source of T:\mininova\mininova\spiders\spider_mininova.py:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from mininova.items import MininovaItem

class MininovaSpider(CrawlSpider):

    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    #start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(SgmlLinkExtractor(allow=['/tor/\d+']), 'parse_item')]

    # def parse_item(self, response):
    #     filename = response.url.split("/")[-1] + ".html"
    #     open(filename, 'wb').write(response.body)

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        item = MininovaItem()
        item['url'] = response.url
        #item['name'] = x.select('''//*[@id="content"]/h1''').extract()
        item['name'] = x.select("//h1/text()").extract()
        #item['description'] = x.select("//div[@id='description']").extract()
        item['description'] = x.select('''//*[@id="specifications"]/p[7]/text()''').extract()  # download
        #item['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        item['size'] = x.select('''//*[@id="specifications"]/p[3]/text()''').extract()
        return item