This is an original article by the author; please credit the source when reposting (silvasong: http://my.oschina.net/sojie/admin/edit-blog?blog=653199).
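The previous articles gave a brief analysis of the Scrapy source code. Here I walk through a simple example to show how to use Scrapy.
Once you have decided which website to crawl, the first job is to analyze the site's hierarchy and pick an entry URL. In most cases the site's home page serves as the starting link.
While analyzing Yihaodian (yhd.com), I found that it provides a product category page (http://www.yhd.com/marketing/allproduct.html) that lists every product category. Following each category link then leads to the products in that category.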
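Development environment:
Ubuntu, Python 2.7, Scrapy
Scrapy runs on Windows, Mac, and Linux; I chose Ubuntu for convenience. Scrapy is written in Python, so Python must be installed first, followed by Scrapy itself.
With the environment set up, the steps below walk through the implementation: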
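1. First, create a crawler project with scrapy startproject yhd.
Running the command generates a file structure similar to the following (with the name tutorial replaced by yhd):
scrapy.cfg: the Scrapy configuration file; the defaults can be left unchanged.
items.py: defines the data structures used to store the scraped data.
pipelines.py: the Scrapy pipelines, used to persist data.
spiders/: the folder that holds the spiders you write yourself.
settings.py: the project settings file.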
2. Write items.py. Here I define YhdItem, which inherits from scrapy.Item and declares the fields to scrape.
import scrapy


class YhdItem(scrapy.Item):
    title = scrapy.Field()       # product name
    price = scrapy.Field()       # product price
    link = scrapy.Field()        # product page URL
    category = scrapy.Field()    # product category
    product_id = scrapy.Field()  # product ID
    img_link = scrapy.Field()    # image URL
3. Write pipelines.py.
I use MongoDB to persist the data, so I wrote a MongoPipeline.
import pymongo

from yhd.items import YhdItem


class MongoPipeline(object):
    collection_name = 'product'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        # Connect using the URI and select the database
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Write the item to MongoDB; insert_one replaces insert(),
        # which is deprecated since PyMongo 3.0
        if isinstance(item, YhdItem):
            self.db[self.collection_name].insert_one(dict(item))
        else:
            self.db['product_price'].insert_one(dict(item))
        return item
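The routing rule in process_item can be tried out without a running MongoDB server. The sketch below is only an illustration under stated assumptions: FakeCollection, FakeDB, ProductItem, and route_item are hypothetical names that stand in for a pymongo collection, a pymongo database, YhdItem, and the pipeline method. None of them come from the project.

```python
class FakeCollection(object):
    """Hypothetical in-memory stand-in for a pymongo collection."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class FakeDB(dict):
    """Hypothetical stand-in for a database: collections are created on first access."""

    def __missing__(self, name):
        self[name] = FakeCollection()
        return self[name]


class ProductItem(dict):
    """Hypothetical stand-in for YhdItem."""


def route_item(db, item):
    # Same decision as process_item: product items go to 'product',
    # everything else goes to 'product_price'.
    name = 'product' if isinstance(item, ProductItem) else 'product_price'
    db[name].insert_one(dict(item))
    return item
```

Calling route_item(FakeDB(), ProductItem(title='demo')) stores one document under 'product', while passing a plain dict stores it under 'product_price'.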
4. Write spider.py in the spiders folder.
spider.py uses regular expressions to match the URLs to crawl and XPath to extract data from the HTML.
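The regular-expression matching described here can be tried outside Scrapy with the standard re module. This is a minimal sketch, not the spider itself: the two patterns are copied from this article's spider rules, while the labels, the classify helper, and the sample URLs in the usage note are assumptions made for illustration. It uses re.search on the assumption that this mirrors how the link extractor tests its allow patterns.

```python
import re

# (label, pattern) pairs; the labels are hypothetical names for the page types.
URL_PATTERNS = [
    ('category_index', re.compile(r'http://www.yhd.com/marketing/allproduct.html')),
    ('product', re.compile(r'^http://item.yhd.com/item/\d+$')),
]


def classify(url):
    """Return the label of the first pattern that matches url, or None."""
    for label, pattern in URL_PATTERNS:
        if pattern.search(url):
            return label
    return None
```

For example, classify('http://item.yhd.com/item/12345') yields 'product', whereas a product URL with a query string appended is rejected because the pattern is anchored with $.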
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from yhd.items import YhdItem


class YHDSpider(CrawlSpider):
    name = 'yhd'
    allowed_domains = ['yhd.com']
    start_urls = [
        # Seed URL (assumed here to be the category page described above)
        'http://www.yhd.com/marketing/allproduct.html',
    ]
    # Regular expressions that select the URLs to crawl
    rules = [
        Rule(LinkExtractor(allow=r'http://www.yhd.com/marketing/allproduct.html'), follow=True),
        Rule(LinkExtractor(allow=r'^http://list.yhd.com/c.*//$'), follow=True),
        Rule(LinkExtractor(allow=r'^http://list.yhd.com/c.*/b/a\d*-s1-v4-p\d+-price-d0-f0d-m1-rt0-pid-mid0-k/$'), follow=True),
        Rule(LinkExtractor(allow=r'^http://item.yhd.com/item/\d+$'), callback='parse_product'),
    ]

    def parse_product(self, response):
        item = YhdItem()
        # Extract the fields from the HTML with XPath
        item['title'] = response.xpath('//h1[@id="productMainName"]/text()').extract()
        price_str = response.xpath('//a[@class="ico_sina"]/@href').extract()[0]
        item['price'] = price_str  # placeholder, overwritten in parse_price
        item['link'] = response.url
        pmld = response.url.split('/')[-1]
        price_url = 'http://gps.yhd.com/restful/detail?mcsite=1&provinceId=12&pmId=' + pmld
        item['category'] = response.xpath('//div[@class="crumb clearfix"]/a[contains(@onclick,"detail_BreadcrumbNav_cat")]/text()').extract()
        item['product_id'] = response.xpath('//p[@id="pro_code"]/text()').extract()
        item['img_link'] = response.xpath('//img[@id="J_prodImg"]/@src').extract()[0]
        # The price is loaded asynchronously, so request it separately by product ID
        request = Request(price_url, callback=self.parse_price)
        request.meta['item'] = item  # carry the partly filled item to parse_price
        yield request

    def parse_price(self, response):
        item = response.meta['item']
        item['price'] = response.body  # raw body returned by the price endpoint
        return item

    def _process_request(self, request):
        return request
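The step in parse_product that derives the price URL from the product page URL can be isolated into a small function and checked on its own. This is a sketch under stated assumptions: build_price_url and PRICE_ENDPOINT are hypothetical names, the province_id default of 12 simply repeats the value hard-coded in the spider, and stripping a trailing slash is an extra safeguard that the spider does not have.

```python
PRICE_ENDPOINT = 'http://gps.yhd.com/restful/detail'


def build_price_url(product_url, province_id=12):
    """Return the price-lookup URL for a product page URL."""
    # The last path segment of the product URL is used as pmId,
    # mirroring response.url.split('/')[-1] in parse_product.
    pm_id = product_url.rstrip('/').split('/')[-1]
    return '%s?mcsite=1&provinceId=%d&pmId=%s' % (PRICE_ENDPOINT, province_id, pm_id)
```

Keeping this logic in a plain function makes it easy to test without starting a crawl, and the spider could pass its result straight to Request.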
5. Write the settings file settings.py.
# -*- coding: utf-8 -*-
# Scrapy settings for yhd project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'yhd'
SPIDER_MODULES = ['yhd.spiders']  # modules where the spiders are defined
NEWSPIDER_MODULE = 'yhd.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yhd (+http://www.yourdomain.com)'
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS=32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16
# Disable cookies (enabled by default)
#COOKIES_ENABLED=False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED=False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'yhd.middlewares.MyCustomSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'yhd.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'yhd.pipelines.MongoPipeline': 300,
}  # enable the MongoDB pipeline
MONGO_URI = '127.0.0.1'  # MongoDB connection settings
MONGO_DB = 'yhd'
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# NOTE: AutoThrottle will honour the standard settings for concurrency and delay
#AUTOTHROTTLE_ENABLED=True
# The initial download delay
#AUTOTHROTTLE_START_DELAY=5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY=60
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG=False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED=True
#HTTPCACHE_EXPIRATION_SECS=0
#HTTPCACHE_DIR='httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES=[]
#HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'
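The throttling options in this file are all left commented out. If the crawl needs to be gentler on the target site, a fragment along the following lines could be added to settings.py. The setting names are the ones already listed in the template above; the values are assumptions chosen for illustration, not the values used by this project.

```python
# Illustrative values only; tune them for the site being crawled.
DOWNLOAD_DELAY = 1                   # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 5         # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60          # upper bound on the delay under high latency
```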
Once the code is written, start the crawler with the scrapy crawl yhd command.
The complete source code is available on my GitHub: https://github.com/silvasong/yhd_scrapy