The previous articles gave a brief analysis of Scrapy's source code; here I will walk through a simple example showing how to use Scrapy.
Once you have decided which site to crawl, the first task is to analyze the site's structure and choose an entry URL. In most cases the site's home page is used as the starting link.
While analyzing Yihaodian (yhd.com), I found that it provides a product category page (http://www.yhd.com/marketing/allproduct.html). From this page you can obtain every product category, and by following each category's link you can then reach the products under that category.
Development environment:
Ubuntu, Python 2.7, Scrapy
Scrapy runs on Windows, Mac, and Linux; for convenience I chose Ubuntu here. Scrapy is built on Python, so Python must be installed as well, and finally Scrapy itself.
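On Ubuntu the setup is roughly the following; the apt packages are the usual build dependencies for Scrapy's lxml and crypto requirements (exact names may vary between releases), and pymongo is installed as well because the pipeline below stores data in MongoDB:

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev libffi-dev libssl-dev
pip install scrapy pymongo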
With the environment in place, the following sections walk through the implementation step by step:
Step 1: Create a crawler project by running scrapy startproject yhd.
Running the command above generates a file structure like the one below; it is the same layout as the official tutorial, with tutorial replaced by yhd.
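This is the standard layout that scrapy startproject produces (only the project name yhd is specific to this example):

yhd/
    scrapy.cfg
    yhd/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py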
scrapy.cfg: the Scrapy configuration file; it can be left at its defaults.
items.py: defines the data structures to be stored.
pipelines.py: Scrapy pipelines, used to persist the data.
spiders/: the folder containing the spiders you write yourself.
settings.py: the project settings.
Step 2: Write items.py. Here I define YhdItem, which extends scrapy.Item and declares the fields to be crawled.
import scrapy


class YhdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()        # product name
    price = scrapy.Field()        # product price
    link = scrapy.Field()         # product link
    category = scrapy.Field()     # product category
    product_id = scrapy.Field()   # product ID
    img_link = scrapy.Field()     # image link
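A scrapy.Item behaves much like a dictionary, which is why the pipeline below can simply call dict(item) before writing to MongoDB; a quick illustration with made-up values:

from yhd.items import YhdItem

item = YhdItem()
item['title'] = u'sample product'
item['price'] = '99.00'
print(dict(item))  # converts to a plain dict, which is what gets stored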
Step 3: Write pipelines.py.
MongoDB is used to persist the data, so I wrote a MongoPipeline.
import pymongo

from yhd.items import YhdItem


class MongoPipeline(object):
    collection_name = 'product'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the MongoDB connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # connect to MongoDB and select the database by URI
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # write each item into MongoDB
        if isinstance(item, YhdItem):
            self.db[self.collection_name].insert(dict(item))
        else:
            self.db['product_price'].insert(dict(item))
        return item
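After a crawl you can verify that documents actually reached MongoDB; a minimal check with pymongo, assuming the MONGO_URI and MONGO_DB values configured in settings.py below (127.0.0.1 and yhd) and the product collection used above:

import pymongo

client = pymongo.MongoClient('127.0.0.1')
db = client['yhd']
print(db['product'].count())     # number of stored products
print(db['product'].find_one())  # inspect one document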
Step 4: Write spider.py in the spiders folder.
In spider.py, regular expressions are used to match the URLs to crawl, and XPath is used to extract data from the HTML.
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor as le
from scrapy.spiders import CrawlSpider, Rule

from yhd.items import YhdItem


class YHDSpider(CrawlSpider):
    name = 'yhd'
    allowed_domains = ['yhd.com']
    start_urls = [
        'http://www.yhd.com/marketing/allproduct.html',  # seed URL: the category index page
    ]
    # regular expressions matching the URLs to be followed or parsed
    rules = [
        Rule(le(allow=('http://www.yhd.com/marketing/allproduct.html')), follow=True),
        Rule(le(allow=('^http://list.yhd.com/c.*//$')), follow=True),
        Rule(le(allow=('^http://list.yhd.com/c.*/b/a\d*-s1-v4-p\d+-price-d0-f0d-m1-rt0-pid-mid0-k/$')), follow=True),
        Rule(le(allow=('^http://item.yhd.com/item/\d+$')), callback='parse_product')
    ]

    def parse_product(self, response):
        item = YhdItem()  # create a YhdItem and fill it via XPath
        item['title'] = response.xpath('//h1[@id="productMainName"]/text()').extract()
        price_str = response.xpath('//a[@class="ico_sina"]/@href').extract()[0]
        item['price'] = price_str  # placeholder; the real price is filled in by parse_price
        item['link'] = response.url
        pmld = response.url.split('/')[-1]  # product ID taken from the item URL
        price_url = 'http://gps.yhd.com/restful/detail?mcsite=1&provinceId=12&pmId=' + pmld
        item['category'] = response.xpath('//div[@class="crumb clearfix"]/a[contains(@onclick,"detail_BreadcrumbNav_cat")]/text()').extract()
        item['product_id'] = response.xpath('//p[@id="pro_code"]/text()').extract()
        item['img_link'] = response.xpath('//img[@id="J_prodImg"]/@src').extract()[0]
        # the price must be fetched asynchronously via a separate request keyed by product ID
        request = Request(price_url, callback=self.parse_price)
        request.meta['item'] = item
        yield request

    def parse_price(self, response):
        item = response.meta['item']
        item['price'] = response.body
        return item

    def _process_request(self, request):
        return request
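When writing selectors like the ones above, scrapy shell is convenient for testing XPath expressions against a live page before baking them into the spider (the item ID in the URL below is just a placeholder):

scrapy shell 'http://item.yhd.com/item/1234567'
>>> response.xpath('//h1[@id="productMainName"]/text()').extract()
>>> response.xpath('//img[@id="J_prodImg"]/@src').extract()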
Step 5: Write the settings file, settings.py.
# -*- coding: utf-8 -*-

# Scrapy settings for yhd project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yhd'

SPIDER_MODULES = ['yhd.spiders']  # where the spiders live
NEWSPIDER_MODULE = 'yhd.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yhd (+http://www.yourdomain.com)'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16

# Disable cookies (enabled by default)
#COOKIES_ENABLED=False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED=False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yhd.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yhd.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'yhd.pipelines.MongoPipeline': 300,  # enable the MongoDB pipeline
}

# MongoDB connection settings read by MongoPipeline.from_crawler
MONGO_URI = '127.0.0.1'
MONGO_DB = 'yhd'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# NOTE: AutoThrottle will honour the standard settings for concurrency and delay
#AUTOTHROTTLE_ENABLED=True
# The initial download delay
#AUTOTHROTTLE_START_DELAY=5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY=60
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG=False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED=True
#HTTPCACHE_EXPIRATION_SECS=0
#HTTPCACHE_DIR='httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES=[]
#HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'
Once the code is complete, start the crawler with scrapy crawl yhd.
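If you only want to inspect the scraped items without setting up MongoDB, Scrapy's built-in feed export can write them to a file instead, for example:

scrapy crawl yhd -o products.json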
The complete source code can be downloaded from my GitHub: https://github.com/silvasong/yhd_scrapy