Knowledge point 1. Create a project
scrapy startproject testproject
# testproject is the project name; choose any name you like
Output:
C:\Users\qs418>scrapy startproject testproject
New Scrapy project 'testproject', using template directory 'd:\\python_exe\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Users\qs418\testproject
You can start your first spider with:
cd testproject
scrapy genspider example example.com
Knowledge point 2. Enter the project directory:
cd testproject
Knowledge point 3. Generate a spider
scrapy genspider baidu www.baidu.com
# generates a spider named baidu for the domain www.baidu.com
Output:
Created spider 'baidu' using template 'basic' in module:
testproject.spiders.baidu
Knowledge point 4. List the available templates
scrapy genspider -l
Output:
Available templates:
basic
crawl
csvfeed
xmlfeed
Knowledge point 5. Specify a template
scrapy genspider -t crawl zhihu www.zhihu.com
Output:
C:\Users\qs418>scrapy genspider -t crawl zhihu www.zhihu.com
Created spider 'zhihu' using template 'crawl'
Knowledge point 6. Study notes
crawl: runs a spider inside the project; it takes the spider's name, not the filename
For example:
scrapy crawl zhihu
check: runs contract checks to verify the spider code for errors
scrapy check zhihu
scrapy list: prints the names of all spiders in the project
scrapy edit: opens a spider in an editor from the command line, e.g. scrapy edit zhihu
fetch: downloads a URL with the Scrapy downloader and prints the response body (the page source)
scrapy fetch http://www.baidu.com
Suppress the log and print only the response headers:
scrapy fetch --nolog --headers http://www.baidu.com
Output:
C:\Users\qs418>scrapy fetch --nolog --headers http://www.baidu.com
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.5.1 (+https://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Content-Type: text/html
< Date: Thu, 02 Aug 2018 04:36:31 GMT
< Last-Modified: Mon, 23 Jan 2017 13:27:32 GMT
< Pragma: no-cache
< Server: bfe/1.0.8.18
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
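Under the hood this is just an HTTP GET whose response headers get printed. A minimal stdlib sketch of the same idea (not Scrapy's implementation, and without Scrapy's default request headers such as User-Agent: Scrapy/...):

```python
from urllib.request import urlopen

def fetch_headers(url):
    # Send a GET request and return the response headers as a dict,
    # similar in spirit to `scrapy fetch --nolog --headers <url>`.
    with urlopen(url) as resp:
        return dict(resp.headers.items())
```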
Disable redirects: --no-redirect
scrapy fetch --no-redirect http://www.baidu.com
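To illustrate what disabling redirects means, here is a stdlib sketch (again, not Scrapy's code): with redirects disabled, a 3xx response is surfaced as-is instead of being followed to the new location.

```python
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None tells urllib not to follow the redirect, so the
    # 3xx response surfaces as an HTTPError instead of being chased.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

# opener.open(url) now raises HTTPError on 301/302 instead of following it
opener = urllib.request.build_opener(NoRedirect)
```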
view: saves the page to a file and opens it in the browser, showing the page as Scrapy "sees" it; useful for debugging and automated testing
scrapy view http://www.baidu.com
shell: opens an interactive command-line session for the URL, with useful preloaded objects such as response and request
scrapy shell http://www.baidu.com
parse: fetches a URL and runs it through a spider's callback, printing the extracted results; handy for checking output, e.g. scrapy parse http://www.baidu.com --spider=baidu
settings: prints the current configuration values
scrapy settings -h
# -h shows the help message
Output:
C:\Users\qs418\quotetutorial>scrapy settings -h
Usage
=====
scrapy settings [options]
Get settings values
Options
=======
--help, -h show this help message and exit
--get=SETTING print raw setting value
--getbool=SETTING print setting value, interpreted as a boolean
--getint=SETTING print setting value, interpreted as an integer
--getfloat=SETTING print setting value, interpreted as a float
--getlist=SETTING print setting value, interpreted as a list
Global Options
--------------
--logfile=FILE log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
log level (default: DEBUG)
--nolog disable logging completely
--profile=FILE write python cProfile stats to FILE
--pidfile=FILE write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
set/override setting (may be repeated)
--pdb enable pdb on failure
C:\Users\qs418\quotetutorial>
runspider: runs a self-contained spider file directly, without needing a project
scrapy runspider baidu.py
version: prints the Scrapy version
scrapy version -v
# -v also prints the versions of Python, Twisted, and other dependencies
bench: runs a quick local benchmark to test crawling speed
scrapy bench