First, the link: http://www.plantarium.ru/page/samples/taxon/41302.html, an Italian plant website.
I had just finished setting up the Scrapy framework, so I decided to use this site for practice.
1. First, run scrapy startproject plant to create a project named plant.
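This generates the standard project skeleton; the files edited in the steps below all live under the inner plant/ package:

plant/
    scrapy.cfg
    plant/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py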
2. Then pin down what to extract: the image URL, the family of the plant in the image, and the plant's name.
So edit items.py accordingly:
from scrapy.item import Item, Field

class PlantItem(Item):
    picture_name = Field()    # plant name
    picture_url = Field()     # image URL
    picture_family = Field()  # plant family
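Scrapy items behave like dicts, so the spider later fills one field by field. A minimal sketch with made-up sample values (the real ones are scraped in step 3):

    item = PlantItem()
    item['picture_name'] = 'Acer campestre'             # hypothetical sample values
    item['picture_family'] = 'Aceraceae'
    item['picture_url'] = 'http://example.com/acer.jpg'
    print dict(item)  # {'picture_name': 'Acer campestre', ...}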
3. The core, of course, is the spider itself. Create a new file plant_spider.py under the spiders folder:
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from plant.items import PlantItem
import os
import sys
import time
import requests

reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2: let str/unicode mix without explicit encodes


class PlantSpider(Spider):
    name = "plant"
    start_urls = [
        "http://dryades.units.it/euganei/index.php?procedure=list"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr').extract()
        # the last two table rows are not plant entries, drop them
        sites.pop()
        sites.pop()
        for site in sites:
            # slice the detail-page link out of the row's raw HTML and undo the &amp; entity
            image_url = 'http://dryades.units.it/euganei/' + site.split('href="')[1].split('">')[0].replace('amp;', '')
            with open('url_content.txt', 'a') as f:
                f.write(image_url + '\n')
            yield Request(image_url, callback=self.parse_item)

    def parse_item(self, response):
        path = 'D:\\Italy_Plant\\'
        self.create_folder(path)
        sel = Selector(response)
        images = sel.xpath('//a[contains(@href,".jpg")]/@href').extract()
        for image in images:
            item = PlantItem()
            item['picture_url'] = image
            # the page title looks like "Name - ...", keep the part before the dash
            item['picture_name'] = sel.xpath('//title/text()').extract()[0].split('-')[0].replace('.', '').strip()
            # family names end in "ACEAE"; take the first word and normalise the case
            item['picture_family'] = sel.xpath('//td/b[contains(text(),"ACEAE")]/text()').extract()[0].split(' ')[0].capitalize()
            with open('image_url.txt', 'a') as f:
                f.write(item['picture_url'] + '★' + item['picture_name'] + '★' + item['picture_family'] + '\n')
            # save each image under family/name/
            self.create_folder(path + item['picture_family'])
            self.create_folder(path + item['picture_family'] + '\\' + item['picture_name'])
            self.download_image(item['picture_url'],
                                path + item['picture_family'] + '\\' + item['picture_name'] + '\\' + item['picture_url'].split('/')[-1])
            yield item

    def create_folder(self, path):
        if not os.path.exists(path):
            os.mkdir(path)

    def download_image(self, imageurl, imagename):
        i = 0
        p = True
        while p and i <= 10:  # give up after roughly ten failed attempts
            try:
                data = requests.get(imageurl, timeout=20).content
                with open(imagename, 'wb') as f:
                    f.write(data)
                p = False
            except Exception:
                i += 1
                print 'saving picture failed, waiting 2 seconds'
                time.sleep(2)
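One fragile spot worth noting: parse() slices raw HTML with split(), which raises an IndexError on any table row that carries no link. A sketch of a sturdier variant of the same step, assuming each useful row wraps its link in an a-tag (untested against the live page):

    def parse(self, response):
        # select hrefs directly; lxml already decodes &amp; inside attribute values
        for href in response.xpath('//tr//a/@href').extract():
            yield Request('http://dryades.units.it/euganei/' + href,
                          callback=self.parse_item)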
4. pipelines.py writes every scraped item out as one line of JSON:

import json
import codecs

class PlantPipeline(object):
    def __init__(self):
        self.file = codecs.open('italy_data.json', mode='wb', encoding='utf-8')

    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII as \uXXXX; decode it back before writing
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode('unicode_escape'))
        return item

    def close_spider(self, spider):
        self.file.close()  # flush the JSON file once the crawl ends
        print 'over'
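The decode('unicode_escape') call undoes the \uXXXX escaping that json.dumps applies to non-ASCII characters. The same effect can be had more directly by telling json not to escape at all; a sketch of that alternative process_item:

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters readable in the output file
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item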
5. And settings.py registers the pipeline:

BOT_NAME = 'plant'
SPIDER_MODULES = ['plant.spiders']
NEWSPIDER_MODULE = 'plant.spiders'
CONCURRENT_REQUESTS = 1
COOKIES_ENABLED = True
ITEM_PIPELINES = {
    'plant.pipelines.PlantPipeline': 300,
}
6. Finally, scrapy crawl plant. (Though I must have typed that command a hundred times during debugging before it finally ran clean...)
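(For a quick look at the scraped fields without the custom pipeline, Scrapy's built-in feed export works too: scrapy crawl plant -o test.json.)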
Optimization: to cope with connection timeouts from the site, enable automatic retries by adding RETRY_ENABLED = True and RETRY_TIMES = 10 to settings.py.
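In settings.py that looks like this (DOWNLOAD_DELAY is an extra of my own, not part of the original fix; it spaces requests out so the retries fire less often):

RETRY_ENABLED = True
RETRY_TIMES = 10
DOWNLOAD_DELAY = 1  # assumption: not in the original post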