Documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html
1. Create a project:
C:\Users\Gunner>scrapy startproject tutorial
This automatically creates the project files under the current directory.
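The generated layout looks roughly like this (the exact files vary slightly across Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders live
            __init__.py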
2. Define a spider:
Add a quotes_spider.py file under the tutorial/spiders directory.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Here:
- name defines the spider's name;
- the start_requests() method defines the URLs to start crawling and registers parse as the callback (a shortcut is sketched below);
- the parse() method handles the response of each crawled page.
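As a shortcut, the explicit start_requests() can usually be replaced by a start_urls class attribute; Scrapy then builds the initial requests itself and routes the responses to parse() by default. A minimal sketch of the same spider:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Shortcut: Scrapy generates the initial Requests from start_urls
    # and calls parse() on each response by default.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        with open('quotes-%s.html' % page, 'wb') as f:
            f.write(response.body)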
3. Run the spider:
scrapy crawl quotes
4. Extract data:
The best way to learn data extraction is to use the Scrapy shell, for example:
scrapy shell "http://quotes.toscrape.com/page/1/"
2017-05-30 16:46:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>
You can use the response's css() method to select different elements:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
>>> response.css('title').extract()
[u'<title>Quotes to Scrape</title>']
>>> response.css('title').extract_first()
u'<title>Quotes to Scrape</title>'
>>> response.css('title')[0].extract()
u'<title>Quotes to Scrape</title>'
>>>
In addition, the re() method extracts using regular expressions.
>>> response.css('title::text').re(r'Quotes.*')
[u'Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
[u'Quotes']
>>>
To view the response in a browser:
>>> view(response)
True
5. Using XPath:
Besides CSS, Scrapy also supports locating and extracting data with XPath. In fact, CSS selectors are converted to XPath internally.
>>> response.xpath('//title')
[<Selector xpath='//title' data=u'<title>Quotes to Scrape</title>'>]
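To see the equivalence, the same title text can be selected either way (a shell sketch; the u'' prefixes appear on Python 2):

>>> response.css('title::text').extract_first()
u'Quotes to Scrape'
>>> response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'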
6. Extract data with yield, as in quotes1.py:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes1"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Running scrapy crawl quotes1 prints the extracted fields: text, author, and tags.
7. Save the data:
scrapy crawl quotes1 -o quotes.json
Or save in the streamable JSON Lines format:
scrapy crawl quotes1 -o quotes.jl
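Because each line of a .jl file is a standalone JSON object, the output can be processed as a stream. A minimal sketch of reading it back, assuming the quotes.jl produced by the command above:

import json

# Process the JSON Lines output one record at a time,
# without loading the whole file into memory.
with open('quotes.jl') as f:
    for line in f:
        item = json.loads(line)
        print(item['author'], item['tags'])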
8. Get the next-page link:
response.css('li.next a::attr(href)').extract_first()
::attr(...) extracts the given attribute of the matched tag.
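On newer Scrapy/parsel versions (an assumption; the extract_first() style used here predates it) there is also an attrib shortcut that returns the attributes of the first matched element. A shell sketch; '/page/2/' is what page 1 of quotes.toscrape.com links to:

>>> response.css('li.next a').attrib['href']  # requires parsel >= 1.5
'/page/2/'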
9. Follow the next-page link, quotes2.py:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
You can also use response.follow(next_page, callback=self.parse), as sketched below.
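A minimal sketch of parse() rewritten with response.follow (available since Scrapy 1.4); unlike scrapy.Request, it accepts relative URLs directly, so response.urljoin() is no longer needed:

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            # response.follow builds an absolute Request from a relative href.
            yield response.follow(next_page, callback=self.parse)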