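Incremental crawling
1. Incremental crawler (CrawlSpider)
1) Create an incremental crawler: scrapy genspider -t crawl xxx xxx.xx
2) About the incremental crawler:
Scrapy ships several spider templates (basic, crawl, csvfeed, and xmlfeed). The templates other than basic extend the basic spider so that more complex tasks are easier to implement. CrawlSpider, generated by the crawl template, is the most commonly used of them.
3) How the incremental crawler runs:
basic template: the start URLs are taken from start_urls and placed in the scheduler queue.
crawl template: starting from the URLs in start_urls, a batch of URLs is extracted from each response page according to the configured rules and placed in the scheduler queue. Each newly downloaded page is matched against the same rules, and any URL that has not been seen before is added to the queue. This repeats until no new URLs can be matched.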
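The loop described above can be sketched without Scrapy. This is only an illustration under stated assumptions: FAKE_SITE is a made-up in-memory site, crawl is a hypothetical helper, and a simple first-in-first-out queue stands in for the scheduler. Real Scrapy scheduling order and duplicate filtering differ in detail.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
FAKE_SITE = {
    "/book/1002.html": ["/book/1002_2.html", "/book/1002_3.html", "/about.html"],
    "/book/1002_2.html": ["/book/1002.html", "/book/1002_3.html"],
    "/book/1002_3.html": ["/book/1002_2.html", "/book/1002_4.html"],
    "/book/1002_4.html": [],
}


def crawl(start_urls, allow):
    """Return the URLs in the order this simplified crawler fetches them."""
    pattern = re.compile(allow)
    queue = deque(start_urls)   # stand-in for the scheduler queue
    seen = set(start_urls)      # URLs that were already scheduled
    fetched = []
    while queue:
        url = queue.popleft()
        fetched.append(url)
        for link in FAKE_SITE.get(url, []):
            # Schedule a link only if it matches the rule and is new.
            if pattern.search(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


order = crawl(["/book/1002.html"], r"/book/1002_\d+\.html")
print(order)
```

The crawl stops by itself once every matching link has been seen, which is the "until no new URLs can be matched" condition.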
#dushu
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
# LinkExtractor extracts new links from a page according to a set of rules
from scrapy.spiders import CrawlSpider, Rule
# CrawlSpider is a Spider subclass that extends the basic spider
# Rule describes how matching URLs are extracted, followed, and scheduled
from DushuPro.items import DushuproItem


class DushuSpider(CrawlSpider):
    name = 'dushu'
    allowed_domains = ['dushu.com']
    start_urls = ['https://www.dushu.com/book/1002.html']

    rules = (
        # \d+ (not \d) so that page numbers with more than one digit also match
        Rule(LinkExtractor(allow=r'/book/1002_\d+\.html'), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_xpaths="//div[@class='pages']//a"), callback='parse_item', follow=True),
        # Rule(LinkExtractor(restrict_css=".pages a"), callback='parse_item', follow=True),
    )
    # rules is a tuple containing one or more Rule objects.
    # Each Rule here takes three arguments: the LinkExtractor whose rules decide which URLs are matched and extracted; callback, the callback (written as the method name in a string) invoked once the matched URL has been downloaded; and follow, which controls whether links are also extracted from the pages reached through this rule.
    # LinkExtractor: extracts links according to a rule. Three kinds of rule are shown here:
    # Rule 1: allow="xxx" matches links against the regular expression xxx
    # Rule 2: restrict_xpaths="xxx" matches links inside the region selected by the xpath xxx
    # Rule 3: restrict_css="xxx" matches links inside the region selected by the css selector xxx
    # [Note] With xpath or css it is enough to select the a tag of the target link; do not put the href attribute in the xpath or css selector.

    def parse_item(self, response):
        booklist = response.xpath("//div[@class='bookslist']//li")
        for book in booklist:
            item = DushuproItem()
            item["title"] = book.xpath(".//h3/a/text()").extract_first()
            # extract_first() returns the first matched string, or None if nothing matched
            item["author"] = "".join(book.xpath(".//div[@class='book-info']/p[1]//text()").extract())
            # print(item)
            # Extract the link to the detail (second-level) page
            href = book.xpath(".//h3/a/@href").extract_first()
            if href is None:
                # No detail link in this entry, so there is nothing to request
                continue
            next_url = response.urljoin(href)
            # Request the detail page and pass the partly filled item along in meta
            yield scrapy.Request(url=next_url, callback=self.parse_Info, meta={"item": item})

    # Callback that parses the detail page
    def parse_Info(self, response):
        # Take out the item passed down from the list page
        item = response.meta["item"]
        # Continue filling in the item
        item["price"] = response.xpath("//span[@class='num']/text()").extract_first()
        item["publisher"] = response.xpath("//div[@class='book-details-left']/table//tr[2]//a/text()").extract_first()
        item["authorInfo"] = response.xpath("//div[@class='text txtsummary']//text()").extract()[1]
        item["content"] = response.xpath("//div[@class='text txtsummary']//text()").extract()[0]
        item["mulu"] = "\n".join(response.xpath("//div[@class='text txtsummary']")[2].xpath(".//text()").extract())
        yield item
# items.py
import scrapy


class DushuproItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    publisher = scrapy.Field()
    authorInfo = scrapy.Field()
    content = scrapy.Field()
    mulu = scrapy.Field()
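The allow argument of LinkExtractor is a regular expression that is matched against each candidate URL. The following sketch uses only Python's re module on a few hypothetical links (they are not taken from the real site) to show how much the quantifier matters: \d matches exactly one digit, while \d+ matches a page number of any length. The helper name matching is an assumption made for this illustration.

```python
import re

# Hypothetical candidate links, for illustration only.
links = [
    "/book/1002_2.html",
    "/book/1002_13.html",
    "/book/1003_2.html",
    "/book/1002.html",
]


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [link for link in candidates if re.search(pattern, link)]


single_digit = matching(r"/book/1002_\d\.html", links)
any_digits = matching(r"/book/1002_\d+\.html", links)

print(single_digit)
print(any_digits)
```

With the single-digit pattern, a page such as 1002_13.html would never be scheduled, so a listing with ten or more pages would be crawled only in part.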
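1. Downloader middleware
Downloader middleware: it works at the point where the downloader sends a request to the server. It can intercept the downloader's request and extend or reconfigure it.
There are two kinds of downloader middleware. Built-in middlewares ship with Scrapy itself, under scrapy.downloadermiddlewares.xxxx.xxxx. Custom middlewares live in the middlewares file of the current project.
Enabling a downloader middleware: in the settings file
DOWNLOADER_MIDDLEWARES = {
    'DushuPro.middlewares.DushuproDownloaderMiddleware': 543,
}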
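The number next to each middleware is its order. Scrapy calls process_request in increasing order and process_response in decreasing order, and an entry set to None is disabled. A minimal sketch of that ordering, without Scrapy itself; the middleware names in MIDDLEWARES and the helper enabled_in_order are made up for the illustration:

```python
# Hypothetical table in the style of DOWNLOADER_MIDDLEWARES.
MIDDLEWARES = {
    "RetryLike": 550,
    "Custom": 543,
    "UserAgentLike": 500,
    "Disabled": None,   # None switches a middleware off
}


def enabled_in_order(table):
    """Return the enabled middleware names sorted by their order value."""
    active = {name: order for name, order in table.items() if order is not None}
    return sorted(active, key=active.get)


# process_request runs from the lowest number to the highest ...
request_order = enabled_in_order(MIDDLEWARES)
# ... and process_response runs back in the opposite direction.
response_order = list(reversed(request_order))

print(request_order)
print(response_order)
```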
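2. Applications of downloader middleware
1) Plugging in selenium to load dynamic pages
In the settings file, activate the downloader middleware that plugs in selenium. To save system resources, the built-in middlewares whose work the browser already does can be disabled.
Then, in the selenium middleware class, override process_request to intercept the request and let selenium perform it instead. Finally, wrap the rendered page source taken from selenium in a response object and return it.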
spiders
import scrapy


class MoguSpider(scrapy.Spider):
    name = 'mogu'
    allowed_domains = ['mogu.com']
    start_urls = ['https://list.mogu.com/book/clothing/50240?acm=3.mce.1_10_1ko4s.132244.0.mtYuRrx6QL5ne.pos_1-m_482170-sd_119&ptp=31.nXjSr._head.0.UvbiJ3IU']

    def parse(self, response):
        goods_list = response.css(".iwf")
        print(len(goods_list))
        # Exercise: parse the content
        pass
middlewares
from scrapy import signals
from selenium import webdriver
from time import sleep
from scrapy.http import HtmlResponse


class MogujieDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.
        s = cls()
        print("A downloader middleware instance has been created!")
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        print("process_request: request %s of spider %s is passing through this downloader middleware..." % (request, spider.name))
        # Scrapy's downloader cannot execute JavaScript, so intercept the request here and let selenium load the page instead
        driver = webdriver.Chrome()
        try:
            # Take the url from the intercepted request
            url = request.url
            print("The browser is visiting:", url)
            driver.get(url)
            sleep(1)
            # Scroll down step by step so that lazily loaded content appears
            distance = 0
            for i in range(100):
                distance = i * 500
                js = "document.documentElement.scrollTop=%d" % distance
                driver.execute_script(js)
                sleep(0.5)
            sleep(3)
            # Take the rendered page source and the final url
            html = driver.page_source
            current_url = driver.current_url
        finally:
            # Close the browser so that a Chrome process is not left behind for every request
            driver.quit()
        # Wrap the page source in a response object and return it
        res = HtmlResponse(url=current_url, request=request, body=html, encoding='utf-8')
        return res

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        print("process_response: response %s to request %s of spider %s is being returned..." % (response, request, spider.name))
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        print("An exception occurred")
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        print("Spider opened!")
        spider.logger.info('Spider opened: %s' % spider.name)
settings
DOWNLOADER_MIDDLEWARES = {
    'Mogujie.middlewares.MogujieDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
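2) Plugging in a proxy pool
Proxy server: when a proxy server is configured, the client's request is sent to the proxy server, and the proxy server sends the request to the target server on the client's behalf. The target server's response goes back to the proxy server, which then passes it on to the client.
Ways to obtain proxy servers: 1) build your own (not recommended) 2) scrape free proxy servers (unreliable) 3) pay for them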
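One possible way to plug a proxy pool into a downloader middleware is to pick a proxy for each request in process_request and store it in request.meta["proxy"], the key Scrapy reads when a request should go through a proxy. The sketch below is only an assumption about how this could look: the class name RandomProxyMiddleware, the PROXY_POOL list, and the addresses in it are placeholders, and a SimpleNamespace stands in for the Scrapy request so the example runs without Scrapy.

```python
import random
from types import SimpleNamespace

# Placeholder addresses; replace them with proxies you actually run or rent.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that picks one proxy per request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy sends the request through the proxy named in meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        # Returning None lets the request continue through the other middlewares.
        return None


# Stand-in for a Scrapy request, used only to exercise the method here.
fake_request = SimpleNamespace(url="https://example.com/", meta={})
RandomProxyMiddleware(PROXY_POOL).process_request(fake_request, spider=None)
print(fake_request.meta["proxy"])
```

In a real project such a class would be registered in DOWNLOADER_MIDDLEWARES in the same way as the other custom middlewares, and Scrapy would need a from_crawler method or a no-argument constructor to create it.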