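Scrapy is a web-crawling framework written in Python for crawling websites and extracting structured data. It is mainly used for data mining, information processing, data storage, and automated testing. With Scrapy, a working crawler takes only a small amount of code.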
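The five core components of the Scrapy framework (architecture): Engine, Scheduler, Downloader, Spiders, and Item Pipeline.
Other components: Downloader Middlewares and Spider Middlewares.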
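As a rough picture of how the core components cooperate, here is a toy, single-threaded simulation in plain Python. It is not Scrapy code: FAKE_WEB, the "next:" token convention, and every function name are made up for illustration, and the real engine is asynchronous.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> page body.
FAKE_WEB = {
    "page1": "item-a item-b next:page2",
    "page2": "item-c",
}


def downloader(url):
    """Downloader: fetch the response for a request (here, a dict lookup)."""
    return FAKE_WEB[url]


def spider_parse(body):
    """Spider: turn a response into items and follow-up requests."""
    for token in body.split():
        if token.startswith("next:"):
            yield ("request", token[len("next:"):])
        else:
            yield ("item", token)


def pipeline(item, store):
    """Item pipeline: post-process an item and persist it."""
    store.append(item.upper())


def engine(start_url):
    """Engine: move requests, responses, and items between the components."""
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    seen = {start_url}              # skip URLs that were already queued
    store = []
    while scheduler:
        url = scheduler.popleft()
        body = downloader(url)
        for kind, value in spider_parse(body):
            if kind == "request" and value not in seen:
                seen.add(value)
                scheduler.append(value)
            elif kind == "item":
                pipeline(value, store)
    return store


print(engine("page1"))  # ['ITEM-A', 'ITEM-B', 'ITEM-C']
```

The point of the sketch is the division of labour: the spider only yields items and requests, and the engine decides where each one goes.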
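Official documentation: https://docs.scrapy.org
Getting-started tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html
1) Create a new crawler project named ScrapyDemo
2) Install the required modules in the Terminal
Scrapy is built on Twisted, an asynchronous networking framework; its non-blocking I/O is what lets the crawler download many pages concurrently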
pip install scrapy
pip install twisted
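If the installation fails with:
ERROR: Failed building wheel for twisted
error: Microsoft Visual C++ 14.0 or greater is required
download a prebuilt whl file and install that instead:
Python extension package whl downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs/#
Press Ctrl+F
to find the whl file you need, then download the build that matches your Python version and platform
Install it:
pip install <absolute path to the whl file>
For example: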
pip install F:\PyWhl\Twisted-20.3.0-cp38-cp38m-win_amd64.whl
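Before installing a downloaded wheel, it can help to compare the tags in its file name with the interpreter you are running. The following is a minimal sketch that assumes CPython and the standard wheel naming scheme; wheel_tags and current_python_tag are hypothetical helper names, not part of pip.

```python
import sys


def wheel_tags(filename):
    """Split a wheel file name into (python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name")
    stem = filename[:-len(".whl")]
    # The last three dash-separated fields are the compatibility tags.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def current_python_tag():
    """CPython tag of the running interpreter, e.g. 'cp38' for Python 3.8."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


tags = wheel_tags("Twisted-20.3.0-cp38-cp38m-win_amd64.whl")
print("wheel tags:", tags)
print("interpreter tag:", current_python_tag())
```

If the first tag differs from the interpreter tag, pip will refuse the file. Running pip debug --verbose should list the full set of tags your installation accepts.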
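3) Create the crawler project ScrapyDemo in the Terminal
scrapy startproject ScrapyDemo
This generates the project directory structure
4) Create the core spider file SpiderDemo.py in the spiders folder
Final project structure, with notes:
ScrapyDemo/                      project root
├── ScrapyDemo/                  project package
│   ├── spiders/                 spider files
│   │   ├── __init__.py
│   │   └── SpiderDemo.py        custom spider with the core logic
│   ├── __init__.py
│   ├── items.py                 item definitions (the data to scrape)
│   ├── middlewares.py           middlewares, proxies
│   ├── pipelines.py             pipelines that process the scraped data
│   └── settings.py              crawler settings
└── scrapy.cfg                   project configuration file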
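1) Define the target
Identify the website to crawl
Define the entity (fields) to scrape: items.py
Declare each field as: field_name = scrapy.Field()
2) Write the spider
Implement the spider's core logic in spiders/SpiderDemo.py
3) Store the data
Design pipelines to store what was scraped: settings.py, pipelines.py
4) Run the spider
Option 1: run it from the Terminal (in cmd, switch to the project root first)
scrapy crawl dangdang    (dangdang is the spider name)
Switching location in cmd:
Change drive: F:
Change directory: cd A/B/...
Option 2: run a script from PyCharm
Create a run script, run.py, in the project directory, then right-click it and choose Run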
1) Scrape mobile phone listings from Dangdang: https://category.dangdang.com/cid4004279.html
2) Define the entity fields to scrape: items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


# 1) Define the target
# 1.2) Define the entity fields to scrape
class ScrapyDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Product name
    name = scrapy.Field()
    # Price
    price = scrapy.Field()
SpiderDemo.py
# Getting-started example
# 1) Define the target
# 1.1) Scrape mobile phone listings from Dangdang: https://category.dangdang.com/cid4004279.html
# 2) Write the spider
import scrapy
from scrapy.http import Response

from ..items import ScrapyDemoItem


class SpiderDemo(scrapy.Spider):
    # Spider name; this is the value passed to "scrapy crawl"
    name = "dangdang"
    # Domains the spider is allowed to request
    allowed_domains = ['category.dangdang.com']
    # Start URLs: the first page requested
    start_urls = ['https://category.dangdang.com/cid4004279.html']
    # Pagination pattern
    # Page 1: https://category.dangdang.com/cid4004279.html
    # Page 2: https://category.dangdang.com/pg2-cid4004279.html
    # Page 3: https://category.dangdang.com/pg3-cid4004279.html
    # ......
    page = 1

    # Handle each response
    def parse(self, response: Response):
        li_list = response.xpath('//ul[@id="component_47"]/li')
        for li in li_list:
            # Product name
            name = li.xpath('.//img/@alt').extract_first()
            print(name)
            # Product price
            price = li.xpath('.//p[@class="price"]/span[1]/text()').extract_first()
            print(price)
            # Wrap each entity in an item and hand it to the pipelines
            demo = ScrapyDemoItem(name=name, price=price)
            # yield passes the item to the engine, which runs it through
            # the pipelines; parse() then resumes with the next <li>
            yield demo
        # Every page is parsed the same way, so request the next page
        # with parse() as the callback again (stops after page 10)
        if self.page < 10:
            self.page += 1
            url = f"https://category.dangdang.com/pg{self.page}-cid4004279.html"
            # scrapy.Request is Scrapy's request object;
            # yielding it schedules the request
            yield scrapy.Request(url=url, callback=self.parse)
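The pagination pattern noted in the spider can be checked in isolation. This small sketch only builds the URLs; page_url, BASE, and CATEGORY are hypothetical names introduced here, and the pattern itself is taken from the three example URLs in the spider's comments.

```python
BASE = "https://category.dangdang.com"
CATEGORY = "cid4004279"


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        # The first page has no "pgN-" prefix.
        return f"{BASE}/{CATEGORY}.html"
    return f"{BASE}/pg{page}-{CATEGORY}.html"


for p in range(1, 4):
    print(page_url(p))
```

Keeping the URL construction in one function makes it easy to see exactly which pages a given limit will request.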
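Supplement: attributes and methods of the Response object
'''
1) The response body as a string
response.text
2) The response body as raw bytes
response.body
3) Parse the response content
response.xpath()
'''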
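To try the XPath expressions from the spider without running a crawl, the sketch below applies the same paths to a small, made-up snippet. Note the swap: it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, so it only supports a subset of XPath and needs well-formed markup. The SAMPLE markup is hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE = """
<div>
  <ul id="component_47">
    <li>
      <a><img alt="Phone A" /></a>
      <p class="price"><span>100.00</span><span>120.00</span></p>
    </li>
    <li>
      <a><img alt="Phone B" /></a>
      <p class="price"><span>200.00</span></p>
    </li>
  </ul>
</div>
""".strip()


def extract(markup):
    """Return one {'name', 'price'} dict per <li> in the product list."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall('.//ul[@id="component_47"]/li'):
        img = li.find(".//img")
        span = li.find('.//p[@class="price"]/span[1]')
        rows.append({
            # ElementTree has no /@alt or /text() steps, so read them here.
            "name": img.get("alt") if img is not None else None,
            "price": span.text if span is not None else None,
        })
    return rows


print(extract(SAMPLE))
```

In the spider, the equivalent values come from response.xpath(...) with the /@alt and /text() steps, which return None from extract_first() when nothing matches.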
settings.py
# Scrapy settings for ScrapyDemo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# 3) Store the data
# 3.1) Configure the crawler and enable the pipelines
# Bot (project) name
BOT_NAME = "ScrapyDemo"
SPIDER_MODULES = ["ScrapyDemo.spiders"]
NEWSPIDER_MODULE = "ScrapyDemo.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "ScrapyDemo (+http://www.yourdomain.com)"
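# User-Agent setting
USER_AGENT = 'Mozilla/5.0'
# Obey robots.txt rules
# Whether to obey robots.txt (the generated project template sets True); set to False here to avoid some crawl restrictions
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests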
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay in seconds; controls how often requests are sent
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# Whether cookies are enabled (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}
# Request headers
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# "ScrapyDemo.middlewares.ScrapydemoSpiderMiddleware": 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# "ScrapyDemo.middlewares.ScrapydemoDownloaderMiddleware": 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# "ScrapyDemo.pipelines.ScrapydemoPipeline": 300,
#}
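# Item pipelines
ITEM_PIPELINES = {
    # Several pipelines may be enabled; the number is the priority (customarily 0-1000), and lower values run first
    # The paths must start with this project's package name, ScrapyDemo
    # Print scraped items
    'ScrapyDemo.pipelines.ScrapyDemoPipeline': 300,
    # Save the data
    'ScrapyDemo.pipelines.ScrapyDemoSinkPiepline': 301,
}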
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
# Log level (default DEBUG) and log file path
LOG_LEVEL = 'INFO'
# LOG_FILE = "spider.log"
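The priority numbers in ITEM_PIPELINES decide the order in which pipelines receive each item. The sketch below only illustrates that ordering rule with a plain sort; it is not how Scrapy loads pipelines internally, run_order is a hypothetical helper, and the paths assume the package is named ScrapyDemo as in the project tree.

```python
ITEM_PIPELINES = {
    "ScrapyDemo.pipelines.ScrapyDemoSinkPiepline": 301,
    "ScrapyDemo.pipelines.ScrapyDemoPipeline": 300,
}


def run_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in run_order(ITEM_PIPELINES):
    print(path)
```

Here the printing pipeline (300) sees each item before the CSV pipeline (301), regardless of the order in which the dictionary lists them.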
pipelines.py
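# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
# 3) Store the data
# 3.2) Store the data through pipelines
# A pipeline only runs if it is enabled in settings.py
import os
import csv


# Print each scraped item
class ScrapyDemoPipeline:
    # Items yielded by the spider arrive here
    def process_item(self, item, spider):
        print(item)
        # Return the item so the next pipeline receives it
        return item


# Save the data
class ScrapyDemoSinkPiepline:
    # Output path from the original example; adjust it for your machine
    path = r'C:\Users\cc\Desktop\scrapy_test.csv'

    # item is the ScrapyDemoItem yielded by the spider (dict-like)
    def process_item(self, item, spider):
        # Write the header only when the file is new or empty
        need_header = not os.path.exists(self.path) or os.path.getsize(self.path) == 0
        with open(self.path, 'a', newline='', encoding='utf-8') as csvfile:
            # Column names
            fields = ['name', 'price']
            writer = csv.DictWriter(csvfile, fieldnames=fields)
            if need_header:
                writer.writeheader()
            # Write the row
            writer.writerow(ItemAdapter(item).asdict())
        return item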
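One detail worth isolating when a pipeline appends to a CSV file once per item: the header row should be written only once, otherwise it repeats before every data row. A minimal, standard-library sketch of that idea follows; append_row and FIELDS are hypothetical names, and it writes to a temporary directory instead of the Desktop path used in the pipeline.

```python
import csv
import os
import tempfile

FIELDS = ["name", "price"]


def append_row(path, row):
    """Append one row; write the header only when the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if need_header:
            writer.writeheader()
        writer.writerow(row)


demo_path = os.path.join(tempfile.mkdtemp(), "scrapy_test.csv")
append_row(demo_path, {"name": "Phone A", "price": "100.00"})
append_row(demo_path, {"name": "Phone B", "price": "200.00"})

with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

An alternative design is to open the file once in the pipeline's open_spider method and close it in close_spider, which avoids reopening the file for every item.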
run.py
# 4) Run the spider
from scrapy import cmdline
cmdline.execute('scrapy crawl dangdang'.split())
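With the other files left unchanged, two errors came up while working through this example:
ERROR: Twisted-20.3.0-cp38-cp38m-win_amd64.whl is not a supported wheel on this platform
builtins.ModuleNotFoundError: No module named 'scrapy_dangdang'
The first is raised by pip when the tags in the wheel's file name (Python version, ABI, platform) do not match the installed interpreter, so the fix is to download the wheel built for your Python version and architecture. The second is unrelated to Twisted: it appears when the ITEM_PIPELINES paths in settings.py start with a package name that does not exist in the project (scrapy_dangdang); they must start with this project's package name, ScrapyDemo.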
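A quick way to catch a wrong package name in ITEM_PIPELINES before starting a crawl is to ask Python whether each module path can be located. This is a rough sketch, not a Scrapy feature: unresolved_pipelines is a hypothetical helper, and the json.decoder entry is only a standard-library stand-in for a path that resolves.

```python
import importlib.util


def unresolved_pipelines(pipelines):
    """Return the pipeline paths whose module part cannot be located."""
    missing = []
    for path in pipelines:
        module_name = path.rsplit(".", 1)[0]  # drop the class name
        try:
            found = importlib.util.find_spec(module_name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (e.g. 'scrapy_dangdang') is absent.
            found = False
        if not found:
            missing.append(path)
    return missing


example = {
    "json.decoder.JSONDecoder": 300,                      # resolves (stdlib stand-in)
    "scrapy_dangdang.pipelines.ScrapyDemoPipeline": 301,  # package not present
}
print(unresolved_pipelines(example))
```

Run from the project root with the real ITEM_PIPELINES dictionary, an empty result suggests that every module path can at least be found.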