1. Installing Scrapy
On Windows with Anaconda, open a command window and run conda install scrapy, type y to confirm, and wait for the installation to finish. Afterwards, run scrapy version; if a version number is printed, Scrapy is ready to use.
2. Scrapy Commands
Running scrapy -h lists the available commands; the command line is summarized in more detail later.
Scrapy 1.3.3 - project: quotetutorial
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
commands
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy -h" to see more info about a command
3. Creating a New Project
We will crawl http://quotes.toscrape.com/, a site built for testing Scrapy.
Goal: extract each quote's text, author, and tags.
1. In the command window, use cd to move into the folder where you want to keep the project.
2. Still in the command window, run scrapy startproject followed by your project name; here that is scrapy startproject quotetutorial.
The output ends with two hints, cd quotetutorial and scrapy genspider example example.com (that is, cd into your project folder, then scrapy genspider with your spider name and the target domain). Follow these hints to continue.
C:\Users\m1812>scrapy startproject quotetutorial
New Scrapy project 'quotetutorial', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Users\m1812\quotetutorial
You can start your first spider with:
cd quotetutorial
scrapy genspider example example.com
3. Run cd quotetutorial to move into the newly created project folder.
C:\Users\m1812>cd quotetutorial
4. scrapy genspider quotes quotes.toscrape.com generates a spider file named quotes.py with quotes.toscrape.com as its target domain.
C:\Users\m1812\quotetutorial>scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
quotetutorial.spiders.quotes
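At this point the project layout looks roughly like this (exact files vary slightly between Scrapy versions); the generated spider sits under spiders/:
quotetutorial/
    scrapy.cfg            # deploy configuration
    quotetutorial/
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            quotes.py     # the spider created by genspider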
4. A First Look at Scrapy
1. Modify the parse function in quotes.py so that it prints the page's HTML. For this page, print(response.text) raises an encoding error on the Windows console, so print the raw bytes instead. parse is the default callback: Scrapy calls it once the response for each start URL has been downloaded, and response holds the result of that request.
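A minimal sketch of this step, assuming the default spider generated by genspider (only the body of parse is changed):
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # print the raw bytes; print(response.text) trips over the console encoding here
        print(response.body)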
2. In the command window, run scrapy crawl quotes to start the spider. Besides the page's HTML, Scrapy prints plenty of other information.
C:\Users\m1812\quotetutorial>scrapy crawl quotes
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'SPIDER_MODULES': ['quotetutorial.spi
ders'], 'BOT_NAME': 'quotetutorial', 'ROBOTSTXT_OBEY': True}
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 19:50:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 19:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 19:50:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
b'... (raw HTML of http://quotes.toscrape.com/: the ten quotes with their authors and tags; the quotation marks appear as escaped byte sequences because the body is printed as a bytes object; output truncated) ...'
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 19:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 5, 11, 50, 12, 560342),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 5, 11, 50, 11, 713697)}
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Spider closed (finished)
5. Start Crawling
First, look at the HTML structure of the target data: each quote sits in a div with class quote, which contains a span with class text, a small element with class author, and a div with class tags holding several a elements with class tag.
1. Edit items.py and declare the three fields to be extracted in the required format; a sketch follows.
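The items.py screenshot is not reproduced here; the following minimal sketch is consistent with the spider code in the next step (the field names come from the item['text'], item['author'] and item['tags'] assignments):
# -*- coding: utf-8 -*-
import scrapy


class QuotetutorialItem(scrapy.Item):
    # the three fields filled in by the spider
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()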
2. Edit quotes.py, add the extraction rules, and map them onto the fields configured in items.py in step 1.
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
This uses Scrapy's built-in CSS selectors.
In the command window, the shell command gives you an interactive console for experimenting: scrapy shell "quotes.toscrape.com" (note the double quotes). There we can see exactly what the CSS selectors above pick out and what the difference between extract_first() and extract() is.
C:\Users\m1812\quotetutorial>scrapy shell "quotes.toscrape.com"
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilte
r', 'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'quotetutorial.spiders'}
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 20:08:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 20:08:39 [scrapy.core.engine] INFO: Spider opened
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x...>
[s] item {}
[s] request <GET http://quotes.toscrape.com>
[s] response <200 http://quotes.toscrape.com>
[s] settings <scrapy.settings.Settings object at 0x...>
[s] spider <QuotesSpider 'quotes' at 0x...>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: response
Out[1]: <200 http://quotes.toscrape.com>
In [2]: quotes = response.css('.quote')
In [3]: quotes
Out[3]:
[<Selector xpath=... data='<div class="quote" itemscope itemtype="h'>,
 <Selector xpath=... data='<div class="quote" itemscope itemtype="h'>,
 ...
 <Selector xpath=... data='<div class="quote" itemscope itemtype="h'>]
In [4]: quotes[0]
Out[4]: <Selector xpath=... data='<div class="quote" itemscope itemtype="h'>
In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath=... data='<span class="text" itemprop="text">“The '>]
In [6]: quotes[0].css('.text::text')
Out[6]: [<Selector xpath=... data='“The world as we have created it is a pr'>]
In [7]: quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
In [8]: quotes[0].css('.text').extract()
Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>']
In [9]: quotes[0].css('.text::text').extract_first()
Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
In [10]: quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']
In [11]: exit()
Now run the spider again with scrapy crawl quotes; the extracted items appear in the terminal alongside the log output.
3. Crawling a single page works; next comes pagination. The page URLs follow the pattern /page/2/, /page/3/ and so on, and the address of the next page can be read from the href attribute of the Next button.
Modify the code in quotes.py:
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()  # behaves like a dict
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # relative href of the Next button, e.g. /page/2/
        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)  # build the absolute URL
        # request the next page with the same callback; on the last page next is None,
        # urljoin returns the current URL and the duplicate filter drops the request
        yield scrapy.Request(url=url, callback=self.parse)
4. Saving the data
On the command line, scrapy crawl quotes -o quotes.json saves the scraped items as JSON.
The other feed formats work the same way:
scrapy crawl quotes -o quotes.jl (JSON lines)
scrapy crawl quotes -o quotes.csv (CSV)
scrapy crawl quotes -o quotes.xml (XML)
scrapy crawl quotes -o quotes.pickle (pickle)
scrapy crawl quotes -o quotes.marshal (marshal)
5. Filtering out unwanted items or saving them to a database
For this, modify the code in pipelines.py.
The first pipeline below caps the text at 50 characters and appends ... to anything longer; the second saves the items to MongoDB.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


# Cap the text at 50 characters; anything longer gets an ellipsis appended
class QuotetutorialPipeline(object):
    def __init__(self):
        self.limit = 50

    # process_item must either return the item or raise DropItem
    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # one collection per item class, named after the class
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # close the MongoDB connection
        self.client.close()
You also need to adjust the related settings in settings.py; a sketch of the additions follows.
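The settings screenshot is not reproduced here. A minimal sketch of the additions, assuming both pipelines are enabled and MongoDB runs locally (the priority numbers 300/400 and the MONGO_URI / MONGO_DB values are placeholders; the setting names themselves come from MongoPipeline.from_crawler above):
# settings.py (additions)
ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'quotetutorial'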
Re-run scrapy crawl quotes from the command line.
The saved data now also shows up in MongoDB.
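A quick way to check the stored data from Python, assuming the MONGO_URI and MONGO_DB values sketched above (the collection name is the item class name, as written by MongoPipeline.process_item):
import pymongo

client = pymongo.MongoClient('localhost')
db = client['quotetutorial']
# print a few stored quotes
for doc in db['QuotetutorialItem'].find().limit(3):
    print(doc['author'], '-', doc['text'])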
Adapted from the Scrapy tutorial by 崔庆才 (Cui Qingcai).