Operating system: Windows 10 Pro
Virtual environment: Anaconda
Python version: 3.7
XPath tool: xpath-helper
IDE: PyCharm 2020.1
References
Scrapy official site: https://scrapy.org/
Scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
Scrapy architecture: https://docs.scrapy.org/en/latest/topics/architecture.html
Scrapy settings: https://docs.scrapy.org/en/latest/topics/settings.html
XPath: https://www.w3school.com.cn/xpath/index.asp
Anaconda tutorial: https://blog.csdn.net/u011424614/article/details/105579502
PyMySQL: https://pypi.org/project/PyMySQL/
Component overview:
Data flow:
(1) (2) The engine forwards requests from the spiders component to the scheduler component to be queued
(3) (4) Once queued, the engine forwards the requests to the downloader component, which downloads the pages
(5) (6) The engine forwards the pages downloaded by the downloader component to the spiders component for parsing
(7) (8) The engine forwards the data parsed by the spiders component to the item pipeline component for processing and storage
(1) (2) The spiders component returns two kinds of results: items, which the engine forwards to the item pipeline for processing and storage, and new requests, which the engine forwards to the scheduler to be queued (see the sketch after this list)
(3) (4) If a page download fails, the engine re-submits the request to the scheduler to be queued
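This loop maps directly onto spider code: a single callback can yield both items and new requests, and the engine routes each to the right component. A minimal sketch with a hypothetical spider (not part of the Douban project):

```python
import scrapy

class FlowDemoSpider(scrapy.Spider):
    """Hypothetical spider, for illustration only."""
    name = "flow_demo"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Parsed data: the engine forwards this item to the item pipeline
        yield {"title": response.xpath("//title/text()").get()}
        # New requests: the engine forwards these to the scheduler for queuing
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse)
```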
Scenario: crawl the Douban Movie Top 250 data
For installing and using Anaconda, see the reference link in the introduction above
# Create the scrapy environment
> conda create -n scrapy_env python=3.7 scrapy
# Activate the scrapy environment
> activate scrapy_env
# Create a template project: scrapy startproject [project name]
> scrapy startproject scrapy_douban
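The startproject command generates Scrapy's standard project skeleton, which should look like this:

```
scrapy_douban/
├── scrapy.cfg              # deployment configuration
└── scrapy_douban/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── middlewares.py      # spider / downloader middlewares
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/            # spiders live here
        └── __init__.py
```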
# Scrapy settings for scrapy_douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'scrapy_douban'
SPIDER_MODULES = ['scrapy_douban.spiders']
NEWSPIDER_MODULE = 'scrapy_douban.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'scrapy_douban.middlewares.ScrapyDoubanSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
# 'scrapy_douban.middlewares.ScrapyDoubanDownloaderMiddleware': 543,
#    'scrapy_douban.middlewares.proxy_ip': 544,           # proxy IP middleware, priority 544
    'scrapy_douban.middlewares.random_user_agent': 545,  # random User-Agent middleware
}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'scrapy_douban.pipelines.ScrapyDoubanPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
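# Custom MySQL connection settings (not built-in Scrapy settings; they are imported by the item pipeline)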
mysql_host = '127.0.0.1'
mysql_port = 3306
mysql_dbname = 'python-db'
mysql_username = 'root'
mysql_pwd = 'root2020'
Setting | Description |
---|---|
USER_AGENT | Client (User-Agent) string |
ROBOTSTXT_OBEY | Whether to obey the robots.txt protocol |
CONCURRENT_REQUESTS | Maximum concurrent requests |
DOWNLOAD_DELAY | Delay between downloads |
CONCURRENT_REQUESTS_PER_DOMAIN | Maximum concurrent requests per domain |
CONCURRENT_REQUESTS_PER_IP | Maximum concurrent requests per IP |
COOKIES_ENABLED | Whether cookies are enabled; needed for login flows |
DEFAULT_REQUEST_HEADERS | Default request headers |
SPIDER_MIDDLEWARES | Spider middlewares |
DOWNLOADER_MIDDLEWARES | Downloader middlewares |
EXTENSIONS | Extensions |
ITEM_PIPELINES | Item pipeline components |
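All of the settings in the table are project-wide. If a single spider needs different values, Scrapy also supports per-spider overrides through the `custom_settings` class attribute. A minimal sketch (the spider and URL below are hypothetical, not part of the Douban project):

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                        # hypothetical spider, for illustration only
    start_urls = ["https://example.com/"]

    # Per-spider overrides; these take precedence over the values in settings.py
    custom_settings = {
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
    }

    def parse(self, response):
        yield {"title": response.xpath("//title/text()").get()}
```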
# Generate a spider (run inside the project directory): scrapy genspider [spider name] [start domain]
> scrapy genspider douban_spider movie.douban.com
from scrapy import cmdline
# Run the scrapy crawl command to start the spider
cmdline.execute('scrapy crawl douban_spider'.split())
# Export a CSV file instead (use Notepad++ to change the file encoding to UTF-8-BOM)
# cmdline.execute('scrapy crawl douban_spider -o douban.csv'.split())
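The launcher above (saved, for example, as `main.py` in the project root; the original does not name the file) starts the crawl from within PyCharm. Running `scrapy crawl douban_spider` directly in the activated scrapy_env environment works just as well.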
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class ScrapyDoubanItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
    serial_number = scrapy.Field()  # ranking
    movie_name = scrapy.Field()     # movie title
    introduce = scrapy.Field()      # introduction
    star = scrapy.Field()           # rating (stars)
    evaluate = scrapy.Field()       # number of ratings
    slogan = scrapy.Field()         # tagline
import scrapy
from scrapy_douban.items import ScrapyDoubanItem
class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'                            # spider name; must differ from the project name
    allowed_domains = ['movie.douban.com']            # only follow links under this domain
    start_urls = ['https://movie.douban.com/top250']  # entry URL

    # Parse the response returned by the downloader component
def parse(self, response):
# print(response.text)
        # Get the list of li tags (one per movie)
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']//li")
        # Parse the data in each li tag
        for item in movie_list:
# print(item)
douban_item = ScrapyDoubanItem()
            # Extract each field with XPath
douban_item["serial_number"] = item.xpath(".//div[@class='item']//em//text()").extract_first()
douban_item["movie_name"] = item.xpath(".//div[@class='info']//div[@class='hd']//a//span[1]//text()").extract_first()
            # The introduction spans several text nodes; strip the whitespace and join them
            introduces = item.xpath(".//div[@class='info']//div[@class='bd']//p[1]//text()").extract()
            introduce_parts = ["".join(part.split()) for part in introduces if part.strip()]
            douban_item["introduce"] = ";".join(introduce_parts)
douban_item["star"] = item.xpath(".//span[@class='rating_num']//text()").extract_first()
douban_item["evaluate"] = item.xpath(".//div[@class='star']//span[4]//text()").extract_first()
douban_item["slogan"] = item.xpath(".//p[@class='quote']//span//text()").extract_first()
# print(douban_item)
            # Hand the item to pipelines.py (ITEM_PIPELINES must be configured in settings.py)
yield douban_item
        # Get the link to the next page
        next_link = response.xpath("//span[@class='next']//link//@href").extract()
        # If there is no next link, this is the last page
        if next_link:
            next_link = next_link[0]
            # Submit the request to the scheduler; the response is passed back to parse
            yield scrapy.Request(self.start_urls[0] + next_link, callback=self.parse)
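Before wiring XPath expressions into `parse()`, it can help to try them interactively in the Scrapy shell, run from inside the project directory so the project's USER_AGENT is applied. A sketch of such a session:

```
> scrapy shell https://movie.douban.com/top250
>>> response.xpath("//div[@class='article']//ol[@class='grid_view']//li")
>>> response.xpath("//span[@class='next']//link//@href").extract()
```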
# Install PyMySQL into the scrapy_env environment
> conda install pymysql
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql
from scrapy_douban.settings import mysql_dbname,mysql_host,mysql_port,mysql_pwd,mysql_username
class ScrapyDoubanPipeline:
    # Process each item yielded by the spider
    def process_item(self, item, spider):
        # Insert the item into the database
        self.insert(item)
        return item

    # Insert one row into the database
    def insert(self, item):
        # Pack the SQL parameter values
values = (int(item["serial_number"]), item["movie_name"], item["introduce"]
, float(item["star"]), item["evaluate"], item["slogan"])
        # Connect to the database
        conn = pymysql.connect(host=mysql_host, user=mysql_username, password=mysql_pwd, port=mysql_port,
                               db=mysql_dbname)
        # Get a cursor
        cursor = conn.cursor()
        # INSERT statement
        sql = 'INSERT INTO douban(serial_number, movie_name, introduce, star, evaluate, slogan) VALUES (%s, %s, %s, %s, %s, %s)'
        try:
            cursor.execute(sql, values)
            conn.commit()
            print("Inserted row: " + str(values))
        except Exception as ex:
            print("Exception occurred: %s" % ex)
            conn.rollback()
            print("Rolled back: " + str(values))
        finally:
            # Close the database connection
            conn.close()
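The pipeline assumes a `douban` table already exists in the `python-db` database. The original does not show the table definition; the one-off helper below is a sketch whose column names follow the INSERT statement above and whose column types are assumptions:

```python
# One-off helper: create the target table before starting the crawl.
# Column types are assumptions inferred from the item fields; adjust as needed.
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', password='root2020', port=3306, db='python-db')
try:
    with conn.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS douban (
                id            INT AUTO_INCREMENT PRIMARY KEY,
                serial_number INT,
                movie_name    VARCHAR(255),
                introduce     VARCHAR(1024),
                star          FLOAT,
                evaluate      VARCHAR(64),
                slogan        VARCHAR(512)
            ) DEFAULT CHARSET = utf8mb4
        """)
    conn.commit()
finally:
    conn.close()
```

Note that `process_item` opens and closes a new connection for every item, which keeps the example simple; for larger crawls it is common to open the connection once in the pipeline's `open_spider` hook and close it in `close_spider`.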
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import base64
import random
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class ScrapyDoubanSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class ScrapyDoubanDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
# Proxy IP middleware (commented out in DOWNLOADER_MIDDLEWARES above)
class proxy_ip(object):
    def process_request(self, request, spider):
        # Placeholder proxy address and credentials; replace them with a real proxy
        request.meta['proxy'] = 'aaaaaaaaaa:1234'
        proxy_name_pwd = b'pppppppppp:xxxxxxxxx'
        encode_name_pwd = base64.b64encode(proxy_name_pwd)
        request.headers['Proxy-Authorization'] = 'Basic ' + encode_name_pwd.decode()
# Random User-Agent middleware
class random_user_agent(object):
    def process_request(self, request, spider):
USER_AGENT_LIST = [
'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
]
        user_agent = random.choice(USER_AGENT_LIST)
        # Set the User-Agent header on the outgoing request
        request.headers['User-Agent'] = user_agent
Method 1:
Click the xpath-helper icon in the Chrome toolbar (above the bookmarks bar)
Enter the XPath expression in the QUERY input box
Method 2: