Chapter 6: The Scrapy Framework (13) 2020-03-15

13. Scrapy Framework – Hands-on: High-speed Download of Featured Images from zcool (2)


Code for settings.py


import os

BOT_NAME = 'imagedownload'

SPIDER_MODULES = ['imagedownload.spiders']
NEWSPIDER_MODULE = 'imagedownload.spiders'

# Default headers sent with every request
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

ITEM_PIPELINES = {
    # 'imagedownload.pipelines.ImagedownloadPipeline': 300,
    # Use Scrapy's built-in ImagesPipeline to download the URLs in item['image_urls']
    'scrapy.pipelines.images.ImagesPipeline': 1
}

# Store downloaded images in an images/ directory next to the project package
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
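
If you later want each work's images grouped into a folder named after its title, you can re-enable the commented-out ImagedownloadPipeline entry above and subclass ImagesPipeline. The code below is only a minimal sketch, not part of the original tutorial, and it assumes Scrapy 2.4 or newer (where file_path() receives the item being processed).

# pipelines.py -- minimal sketch, not from the original tutorial.
# Assumes Scrapy >= 2.4, where file_path() receives the item.
import os
from scrapy.pipelines.images import ImagesPipeline

class ImagedownloadPipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Keep the default hashed file name, but group images by the item's title.
        default_path = super().file_path(request, response=response, info=info, item=item)
        folder = item['title'] if item and item.get('title') else 'misc'
        return os.path.join(folder, os.path.basename(default_path))

To try it, point ITEM_PIPELINES at 'imagedownload.pipelines.ImagedownloadPipeline' instead of the built-in class; IMAGES_STORE stays the same.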


Code for items.py


import scrapy


class ImagedownloadItem(scrapy.Item):
    title = scrapy.Field()
    # image_urls: holds the image URLs collected for this item
    image_urls = scrapy.Field()
    # images: filled in by the pipeline after the downloads finish,
    # with one entry of image information per downloaded file
    images = scrapy.Field()
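
For reference, once ImagesPipeline has processed an item, the images field holds one dict per downloaded file with the source url, the path relative to IMAGES_STORE, and a checksum. The values below are illustrative only, not real output:

# Illustrative shape of an item after ImagesPipeline runs (values are made up):
processed_item = {
    'title': 'Some work title',
    'image_urls': ['https://example.com/1.jpg'],
    'images': [
        {'url': 'https://example.com/1.jpg',
         'path': 'full/0a79d0e3.jpg',   # relative to IMAGES_STORE
         'checksum': 'b097...'},
    ],
}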


Code for start.py


from scrapy import cmdline

# Equivalent to running "scrapy crawl zcool" from the command line
cmdline.execute("scrapy crawl zcool".split(" "))
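
If you prefer not to go through cmdline, a CrawlerProcess gives the same result and is easier to run from an IDE or debugger. This is just an alternative sketch, not part of the original tutorial:

# start.py -- alternative sketch using CrawlerProcess instead of cmdline
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('zcool')   # spider name as defined in ZcoolSpider.name
process.start()          # blocks until the crawl finishes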


Continuing from the previous article, sample code for zcool.py


import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import ImagedownloadItem


class ZcoolSpider(CrawlSpider):
    name = 'zcool'
    allowed_domains = ['zcool.com.cn']
    start_urls = ['http://zcool.com.cn/']

    rules = (
        # Pagination URLs: follow them, no callback needed
        Rule(LinkExtractor(allow=r".+0!0!0!0!0!!!!2!0!\d+"), follow=True),
        # Detail-page URLs: parse with parse_detail, do not follow further
        Rule(LinkExtractor(allow=r".+/work/.+html"), follow=False, callback="parse_detail"),
    )

    def parse_detail(self, response):
        # All image URLs of the work shown on the detail page
        image_urls = response.xpath("//div[@class='reveal-work-wrap text-center']//img/@src").getall()
        title_list = response.xpath("//div[@class='details-contitle-box']/h2/text()").getall()
        title = "".join(title_list).strip()
        item = ImagedownloadItem(title=title, image_urls=image_urls)
        yield item
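
Before launching the full crawl, it can be worth sanity-checking the two XPath expressions against a single detail page in scrapy shell. The URL below is only a placeholder; substitute any real /work/ page:

# Run in a terminal (placeholder URL -- substitute a real zcool /work/ page):
#   scrapy shell "https://www.zcool.com.cn/work/<some-id>.html"
# Then, inside the shell:
response.xpath("//div[@class='details-contitle-box']/h2/text()").getall()
response.xpath("//div[@class='reveal-work-wrap text-center']//img/@src").getall()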



Previous article: Chapter 6: The Scrapy Framework (12) 2020-03-14, at:

https://www.jianshu.com/p/fc0b7b7fc5c8

Next article: Chapter 6: The Scrapy Framework (14) 2020-03-16, at:

https://www.jianshu.com/p/2febb184009d



The material above was collected from the Internet and is intended for learning and exchange only. If it infringes on your rights, please message me privately and it will be removed. Thank you.
