Running Scrapy from a single script: crawling the GIF files from jandan.net's 无聊图 (boring pictures) board

Creating a full Scrapy project is cumbersome; it is simpler to use a single script file and run it with python spider.py.

It is actually quite simple: the spider class is derived exactly as it would be in a generated project; you only need to add from scrapy.crawler import CrawlerProcess.

At the end of the script, create a CrawlerProcess object and call its start() method, as the full script below shows:


import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class GifItem(scrapy.Item):
    # FilesPipeline downloads whatever URLs are placed in this field
    file_urls = scrapy.Field()


class Meizitu(scrapy.Spider):
    name = "pic"
    allowed_domains = ["jandan.net"]
    start_urls = (
        'http://jandan.net/pic/',
    )

    def is_gif_url(self, url):
        # True if the URL's file extension contains "gif"
        ext = url[url.rfind("."):]
        return "gif" in ext

    def parse(self, response):
        print("visit url---------->", response.url)

        # Pagination: follow every link in the page navigation bar.
        # The "and 0" keeps this branch disabled while debugging.
        a = response.selector.xpath('//div[@class="cp-pagenavi"]')
        if len(a) > 0 and 0:  # debug
            urls = a[0].xpath('.//a/@href').extract()
            for url in urls:
                yield scrapy.Request(url, self.parse)

        # Collect every link on the page that points to a .gif file
        gifs = response.selector.xpath('//a/@href').extract()
        urls = []
        for gif in gifs:
            if 'gif' in gif and self.is_gif_url(gif):
                urls.append(gif)
        if len(urls) > 0:
            gif_item = GifItem()
            gif_item['file_urls'] = urls
            yield gif_item


setting = get_project_settings()
setting.set("ITEM_PIPELINES", {
    'scrapy.pipelines.files.FilesPipeline': 1,
})
setting.set("FILES_STORE", "J:\\pic_download")
process = CrawlerProcess(setting)
process.crawl(Meizitu)
process.start()
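Run the script with python spider.py; FilesPipeline saves the downloaded GIFs under the FILES_STORE directory (in a full/ subfolder, with file names derived from a hash of each URL).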
Note that the settings object returned by get_project_settings() lets you configure Scrapy programmatically, for example to set up item pipelines and other options; see the Scrapy documentation for details. jandan has a simple anti-crawling mechanism, and overriding the User-Agent is enough to get past it, but to crawl all of the GIFs you will probably need a downloader middleware, which is something to study further.
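For the User-Agent workaround mentioned above, the simplest option is one more setting.set() call before the CrawlerProcess is created; to rotate the agent per request, a downloader middleware works. Below is a minimal sketch of both, assuming the same single-script layout as above. The RandomUserAgentMiddleware class name, the UA strings, and the priority 543 are illustrative choices of mine, not anything prescribed by Scrapy or required by jandan.

import random

# Simplest fix: a fixed browser-like User-Agent via the settings object
# (example UA string; any real browser UA should do)
setting.set("USER_AGENT",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")


class RandomUserAgentMiddleware(object):
    # Example pool of browser User-Agent strings to rotate through
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36",
    ]

    def process_request(self, request, spider):
        # Stamp a random User-Agent onto every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # let Scrapy continue handling the request


# Register the middleware; in a single-script setup the class lives in __main__
setting.set("DOWNLOADER_MIDDLEWARES", {
    '__main__.RandomUserAgentMiddleware': 543,
})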
