While working on a web-scraping project today I hit an XPath parsing problem and struggled with it for over ten minutes without solving it. What unsettles me is that this topic isn't hard, and I had already gone over it several times before; retention that poor forced me to rethink what my notes are actually doing for me. Clearly, blogging my study notes and chewing over what I've learned has become urgent, to the point of being something I simply cannot skip.
The installation is a slog. There are plenty of CSDN tutorials; download the required files step by step and be patient. When pip can't download a package, the first option is to switch to a mirror source, and only after that consider installing from a .whl file.
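As a concrete illustration (the Tsinghua mirror is just one common choice, and the .whl filename below is a placeholder for whatever file you downloaded):

pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install downloaded_package.whl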
These notes live in a softcover notebook, jotted while skimming a physical book; practice comes first.
# Runner script (conventionally start.py in the project root);
# equivalent to typing "scrapy crawl biquge" in a terminal
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'biquge'])
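For context, a project laid out like this is normally generated with Scrapy's own command-line tools (the project name xiaoshuo is inferred from the pipeline path in the settings below):

scrapy startproject xiaoshuo
cd xiaoshuo
scrapy genspider biquge paoshuzw.com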
# settings.py: ignore robots.txt, send browser-like default headers, enable the pipeline
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
}

ITEM_PIPELINES = {
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}
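Since the crawl turns out to be very fast (see the closing note), a politer setup would throttle it; this is just a sketch using Scrapy's standard DOWNLOAD_DELAY setting, with an arbitrary one-second value:

DOWNLOAD_DELAY = 1  # wait about one second between requests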
# spiders/biquge.py
import scrapy

from xiaoshuo.items import XiaoshuoItem


class BiqugeSpider(scrapy.Spider):
    name = 'biquge'
    allowed_domains = ['paoshuzw.com']
    start_urls = ['http://www.paoshuzw.com/10/10489/']

    def parse(self, response):
        # Get the chapter names and chapter links from the <dd> entries,
        # then pair them up so each item carries a matching name and href
        name_list = response.xpath("//dd//text()").getall()
        href_list = response.xpath("//dd//@href").getall()
        for name, href in zip(name_list, href_list):
            print(name, href)
            item = XiaoshuoItem(name=name, href=href)
            yield item
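Given the XPath trouble mentioned at the top, Scrapy's interactive shell is the quickest way to test selectors before baking them into parse(); a minimal session looks like this (output omitted):

scrapy shell http://www.paoshuzw.com/10/10489/
>>> response.xpath("//dd//text()").getall()
>>> response.xpath("//dd//@href").getall()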
# items.py
import scrapy


class XiaoshuoItem(scrapy.Item):
    # One item per chapter: its name and its link
    name = scrapy.Field()
    href = scrapy.Field()
# pipelines.py
import json


class XiaoshuoPipeline:
    def open_spider(self, spider):
        self.fp = open("小说.txt", "w", encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False writes the Chinese text readably instead of as \u escapes
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        print(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
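Each processed item ends up as one JSON line in 小说.txt; the values here are hypothetical, just to show the shape:

{"name": "第一章", "href": "/10/10489/12345.html"}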
        # ...continuing parse() in spiders/biquge.py: follow the "next page" link
        next_href = response.xpath("//a[@id='amore']/@href").get()
        print(next_href)
        # At this point we only have half of the URL (a relative path)
        if next_href:
            # Check that the link exists, otherwise the spider falls into an endless loop
            next_url = response.urljoin(next_href)  # urljoin fills in the domain automatically
            request = scrapy.Request(next_url)  # build the Request; the default callback is parse again
            yield request  # yield an item and it goes to the pipeline; yield a request and it goes to the scheduler, which sends the request out again
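A handy shorthand: response.follow accepts the relative href directly and does the urljoin internally, so (as a sketch with the same intended behavior) the last few lines can collapse to:

        if next_href:
            yield response.follow(next_href, callback=self.parse)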
The above is only a first pass at Scrapy, but it is enough to crawl a site's text and save it to a chosen file, and the crawl is extremely fast.