1. Install Scrapy.
2. Open CMD and run: scrapy. It prints:
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
If you see the output above, Scrapy has been installed successfully.
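You can also verify the installation from Python rather than the command line. A small sketch using the standard library (it works whether or not Scrapy is present):

```python
# Check whether the scrapy package is importable, without importing it.
import importlib.util

spec = importlib.util.find_spec("scrapy")
if spec is None:
    print("Scrapy is not installed")
else:
    print("Scrapy found:", spec.origin)
```

find_spec returns None when the package cannot be located, so this never raises ImportError.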
3. Create a folder to hold the spiders: d:\crapy.
4. Change into d:\crapy:
d:\crapy>
5. Create a spider project: scrapy startproject cnblog
New Scrapy project 'cnblog', using template directory 'd:\python\python37\lib\site-packages\scrapy\templates\project', created in:
D:\crapy\cnblog
You can start your first spider with:
cd cnblog
scrapy genspider example example.com
The message above shows that a spider project named cnblog was created, and reports the template used and the location: a folder with the same name as the project was created under the current directory. To start writing spiders, you must first change into that newly created folder (cnblog).
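For reference, this is the skeleton that startproject generates (the standard Scrapy 1.8 project layout):

```
cnblog/
    scrapy.cfg            # deploy configuration file
    cnblog/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py
```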
6. Create the first spider.
D:\crapy>cd cnblog
D:\crapy\cnblog>scrapy genspider cnblog cnblogs.com    # naming the spider the same as the project (cnblog) fails:
Cannot create a spider with the same name as your project
D:\crapy\cnblog>scrapy genspider cnbloga cnblogs.com
Created spider 'cnbloga' using template 'basic' in module
cnblog.spiders.cnbloga
# This created the first spider, named "cnbloga". The domain to crawl is "cnblogs.com"; the spider will only crawl pages within that domain, which limits the crawl scope. The template applied is "basic".
7. Open the generated spider file: d:\crapy\cnblog\cnblog\spiders\cnbloga.py
# -*- coding: utf-8 -*-
import scrapy


class CnblogaSpider(scrapy.Spider):
    name = 'cnbloga'
    allowed_domains = ['cnblogs.com']
    start_urls = ['http://cnblogs.com/']

    def parse(self, response):
        pass
The first line imports scrapy. It then declares a class, CnblogaSpider, that inherits from scrapy.Spider: name is the spider's name ("cnbloga"), allowed_domains restricts the crawl scope to 'cnblogs.com', and start_urls lists the URLs where crawling begins (http://cnblogs.com/).
parse is the default callback method: every response Scrapy downloads is handed to it for processing.
8. Run the spider:
d:\crapy\cnblog>scrapy crawl cnbloga    # 'cnbloga' is the spider's name