The plan for building this Scrapy image crawler is:
1. Analyze the target site
2. Choose a crawling approach and strategy
3. Create the crawler project -> define the items
4. Write the spider file
5. Write and tune pipelines and settings
6. Debug
The tricky parts of this project are:
1. Crawling images across the whole site
2. Getting the high-resolution images, not just thumbnails
3. Dealing with anti-crawling measures (ignoring robots.txt, impersonating a browser, not keeping cookies, etc.)
Now, on to the main part:
1. Create a crawler project and define a spider
In a cmd window, from your working directory, run: scrapy startproject nipic
Then cd into the newly created nipic directory and run: scrapy genspider -t basic f1 nipic.com (this defines a spider named f1).
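For reference, the two commands above generate a standard Scrapy project layout, roughly like this (genspider adds spiders/f1.py; the exact files can vary slightly with the Scrapy version):
nipic/
    scrapy.cfg
    nipic/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            f1.py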
In items.py, define a container for the data the project will scrape:
class NipicItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
2. Write the spider file (f1.py)
Open the nipic.com homepage. All images under the three sections "设计" (Design), "摄影" (Photography) and "多媒体" (Multimedia) (the blue boxes in the screenshot) are the crawl targets; their section pages, reached from the homepage, form the first layer of links.
# crawl the links of the three sections, i.e. the first layer of links
def parse(self, response):
    urldata = response.xpath("//div[@class='fl nav-item-wrap']//a/@href").extract()
    urldata = urldata[1:4]
    for i in urldata:
        urlnew = response.urljoin(i)
        yield Request(url=urlnew, callback=self.next)
Each section contains many sub-categories (the red boxes in the screenshot); these are the second layer of links:
# crawl the second layer of links
def next(self, response):
    url2 = response.xpath('//dd[@class="menu-item-list clearfix"]//a/@href').extract()
    for j in url2:
        url2new = response.urljoin(j)
        yield Request(url=url2new, callback=self.next2)
Opening any sub-category shows that it spans many pages, so we need the URLs of all of its pages, i.e. the third layer of links:
# crawl the third layer of links
def next2(self, response):
    # get the total number of pages from the last pagination link
    pages = response.xpath('//div[@class="common-page-box mt10 align-center"]//a/@href').extract()
    pageslast = response.urljoin(pages[-1])
    pagenumber = pageslast.split('=')
    page1 = pagenumber[0]
    page2 = pagenumber[1]
    # build the URLs of all pages
    for m in range(1, int(page2) + 1):
        pageurl = page1 + '=' + str(m)
        yield Request(url=pageurl, callback=self.next3)
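To make the string handling above concrete, here is what it does to a hypothetical last-page link (the path is made up; only the ?page=N pattern matters):
# hypothetical pagination link, for illustration only
pageslast = 'http://www.nipic.com/photo/jianzhu/index.html?page=88'
pagenumber = pageslast.split('=')
# pagenumber[0] -> 'http://www.nipic.com/photo/jianzhu/index.html?page'
# pagenumber[1] -> '88', so pages 1..88 are rebuilt as pagenumber[0] + '=' + str(m)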
Next, crawl the image URLs on each page:
# crawl the image URLs on each page
def next3(self, response):
    item = NipicItem()
    item["url"] = response.xpath(
        '//a[@class="relative block works-detail hover-none works-img-box"]//img/@src'
    ).extract()
    yield item
3. Write pipelines.py to process the scraped image URLs
We want the high-resolution images, but the URLs scraped above point to the thumbnails. Comparing a thumbnail URL with its high-resolution counterpart shows two differences: the subdomain before the first dot, and the "1.jpg" vs. "2.jpg" suffix. Since I don't know regular expressions yet, the large-image URL is built from the small-image URL with plain string operations.
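For illustration, assuming a hypothetical thumbnail URL (the real paths differ, but follow the same pattern), the two differences are patched like this:
# hypothetical thumbnail URL, for illustration only
small = 'http://icon.nipic.com/BannerPic/20180101/home/20180101123456_1.jpg'
big = 'http://pic115.' + small.split('.', 1)[1].replace('1.jpg', '2.jpg')
# big -> 'http://pic115.nipic.com/BannerPic/20180101/home/20180101123456_2.jpg'
The pipeline below applies exactly this transformation to every scraped URL: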
import urllib.request

class NipicPipeline(object):
    def process_item(self, item, spider):
        for i in range(len(item["url"])):
            try:
                that = item["url"][i]
                # build the large-image URL from the thumbnail URL
                urlstr = that.split(".", 1)
                urlstr1 = 'http://pic115' + '.' + urlstr[1]
                urlture = urlstr1.replace('1.jpg', '2.jpg')
                # print("downloading---" + urlture)
                file = "D:/爬取结果/nipic/" + urlture[-18:]  # last 18 chars of the URL as the file name
                urllib.request.urlretrieve(urlture, filename=file)
            except Exception as e:
                # skip any URL that fails to download
                pass
        return item
urllib.request.urlretrieve is used to download each image to local storage.
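One pitfall worth noting: urlretrieve fails if the target folder does not exist, and the bare except above silently swallows that error, so nothing gets saved. A minimal safeguard (assuming the same save path as above) is to create the folder once before downloading, for example by adding import os at the top of pipelines.py and this method to NipicPipeline:
def open_spider(self, spider):
    # make sure the save directory exists before any download starts
    os.makedirs("D:/爬取结果/nipic/", exist_ok=True)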
4. settings.py configuration (basic anti-crawling workarounds a, b, c)
a. Ignore the robots.txt protocol
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
b. Impersonate a browser by setting a User-Agent
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E) '
c. Disable cookies
#Disable cookies (enabled by default)
COOKIES_ENABLED = False
And enable the pipeline:
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'nipic.pipelines.NipicPipeline': 300,
}
For reference, here is the complete spider file f1.py:
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.http import Request
from nipic.items import NipicItem

class F1Spider(scrapy.Spider):
    name = "f1"
    allowed_domains = ["nipic.com"]
    start_urls = ['http://www.nipic.com/']
    # crawl the links of the three sections, i.e. the first layer of links
    def parse(self, response):
        urldata = response.xpath("//div[@class='fl nav-item-wrap']//a/@href").extract()
        urldata = urldata[1:4]
        for i in urldata:
            urlnew = response.urljoin(i)
            yield Request(url=urlnew, callback=self.next)

    # crawl the second layer of links
    def next(self, response):
        url2 = response.xpath('//dd[@class="menu-item-list clearfix"]//a/@href').extract()
        for j in url2:
            url2new = response.urljoin(j)
            yield Request(url=url2new, callback=self.next2)

    # crawl the third layer of links
    def next2(self, response):
        # get the total number of pages from the last pagination link
        pages = response.xpath('//div[@class="common-page-box mt10 align-center"]//a/@href').extract()
        pageslast = response.urljoin(pages[-1])
        pagenumber = pageslast.split('=')
        page1 = pagenumber[0]
        page2 = pagenumber[1]
        # build the URLs of all pages
        for m in range(1, int(page2) + 1):
            pageurl = page1 + '=' + str(m)
            yield Request(url=pageurl, callback=self.next3)

    # crawl the image URLs on each page
    def next3(self, response):
        item = NipicItem()
        item["url"] = response.xpath(
            '//a[@class="relative block works-detail hover-none works-img-box"]//img/@src'
        ).extract()
        yield item
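With the spider, pipeline, and settings in place, start the crawl from the project root:
scrapy crawl f1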
Crawl results:
There are plenty of rough edges here, so please bear with me.
Comments and discussion are very welcome.