13. Scrapy Framework – Practice – High-Speed Download of Featured Images from zcool (2)
settings.py configuration code
import os

BOT_NAME = 'imagedownload'

SPIDER_MODULES = ['imagedownload.spiders']
NEWSPIDER_MODULE = 'imagedownload.spiders'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

ITEM_PIPELINES = {
    # 'imagedownload.pipelines.ImagedownloadPipeline': 300,
    # Use Scrapy's built-in ImagesPipeline to download the images
    'scrapy.pipelines.images.ImagesPipeline': 1
}

# Store downloaded images in an 'images' folder next to the project package
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
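The IMAGES_STORE expression walks up two directory levels from settings.py, so the images land in an `images` folder beside the project package. A minimal sketch of how that path resolves, using a hypothetical project layout (no Scrapy required):

```python
import os

# Hypothetical location of settings.py inside the project package
settings_file = os.path.join("project", "imagedownload", "settings.py")

# Same expression as in settings.py: go up two levels, then into 'images'
images_store = os.path.join(
    os.path.dirname(os.path.dirname(settings_file)), "images"
)
print(images_store)  # project/images (with the platform's path separator)
```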
items.py code
import scrapy


class ImagedownloadItem(scrapy.Item):
    title = scrapy.Field()
    # image_urls: holds the URLs of the images for this item
    image_urls = scrapy.Field()
    # images: after the downloads finish, the pipeline stores the
    # resulting image info objects here
    images = scrapy.Field()
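scrapy.Item is dict-like, and the built-in ImagesPipeline reads item['image_urls'], downloads each URL, then writes a list of result dicts (url, path, checksum) into item['images']. A rough simulation with a plain dict, no Scrapy needed (the URL and file paths below are made up; the real pipeline names files by a hash of the URL under `full/`):

```python
# Simulate the item as a plain dict (scrapy.Item behaves like one)
item = {"title": "demo", "image_urls": ["http://example.com/a.jpg"]}

# After download, ImagesPipeline fills 'images' with one metadata
# dict per URL; paths are relative to IMAGES_STORE
item["images"] = [
    {"url": url, "path": "full/%d.jpg" % i, "checksum": "..."}
    for i, url in enumerate(item["image_urls"])
]
print(item["images"][0]["path"])  # full/0.jpg
```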
start.py code
from scrapy import cmdline

# Note the space between 'scrapy' and 'crawl'; without it the split
# would produce a single invalid token
cmdline.execute("scrapy crawl zcool".split(" "))
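cmdline.execute expects an argv-style list, which is why the command string is split on spaces; a quick check of what the split produces:

```python
argv = "scrapy crawl zcool".split(" ")
print(argv)  # ['scrapy', 'crawl', 'zcool']

# Missing the space would merge the subcommand into one unknown token
bad = "scrapycrawl zcool".split(" ")
print(bad)   # ['scrapycrawl', 'zcool']
```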
Continuing the previous example, zcool.py sample code
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import ImagedownloadItem


class ZcoolSpider(CrawlSpider):
    name = 'zcool'
    allowed_domains = ['zcool.com.cn']
    start_urls = ['http://zcool.com.cn/']

    rules = (
        # pagination URLs: follow them, no callback needed
        Rule(LinkExtractor(allow=r".+0!0!0!0!0!!!!2!0!\d+"), follow=True),
        # detail page URLs: parse them, but do not follow further
        Rule(LinkExtractor(allow=r".+/work/.+html"), follow=False, callback="parse_detail"),
    )

    def parse_detail(self, response):
        image_urls = response.xpath("//div[@class='reveal-work-wrap text-center']//img/@src").getall()
        title_list = response.xpath("//div[@class='details-contitle-box']/h2/text()").getall()
        title = "".join(title_list).strip()
        item = ImagedownloadItem(title=title, image_urls=image_urls)
        yield item
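The two `allow` patterns can be sanity-checked with plain `re` before running the crawl. The sample URLs below are invented for illustration and only mimic the zcool URL shapes the rules target:

```python
import re

# Pagination rule pattern (raw string so \d survives)
pager = re.compile(r".+0!0!0!0!0!!!!2!0!\d+")
# Detail page rule pattern
detail = re.compile(r".+/work/.+html")

# Made-up sample URLs matching the two shapes
page_url = "https://www.zcool.com.cn/discover/0!0!0!0!0!!!!2!0!2"
work_url = "https://www.zcool.com.cn/work/ZNDYwMjE2.html"

print(bool(pager.search(page_url)))   # True
print(bool(detail.search(work_url)))  # True
print(bool(detail.search(page_url)))  # False
```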
Previous article: Chapter 6, Scrapy Framework (12), 2020-03-14:
https://www.jianshu.com/p/fc0b7b7fc5c8
Next article: Chapter 6, Scrapy Framework (14), 2020-03-16:
https://www.jianshu.com/p/2febb184009d
The material above was collected from the internet and is for learning and exchange only; if it infringes on your rights, please message me privately for removal. Thank you.