Recursive crawling with Scrapy

The heart of it is the spider itself. Its main job is to extract every link on a page, instantiate the URLs that satisfy our filter as Request objects and yield them, while also extracting the page's keywords and description meta information and yielding it as an item. The code is as follows:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from sitemap.items import SitemapItem


class SitemapSpider(CrawlSpider):

    name = 'sitemap_spider'
    allowed_domains = ['qunar.com']
    start_urls = ['http://www.qunar.com/routes/']

    # Link extraction is done by hand in parse() below, so rules stays empty.
    # Note that overriding parse() on a CrawlSpider disables the rule
    # machinery anyway, since CrawlSpider uses parse() internally.
    rules = (
        #Rule(SgmlLinkExtractor(allow=(r'http://www.qunar.com/routes/.*')), callback='parse'),
        #Rule(SgmlLinkExtractor(allow=('http:.*/routes/.*')), callback='parse'),
    )

    def parse(self, response):
        item = SitemapItem()
        x        = HtmlXPathSelector(response)
        raw_urls = x.select("//a/@href").extract()

        # keep only links under /routes/, making relative URLs absolute
        urls = []
        for url in raw_urls:
            if 'routes' in url:
                if 'http' not in url:
                    url = 'http://www.qunar.com' + url
                urls.append(url)

        # recurse: every scheduled Request is parsed by this same method
        for url in urls:
            yield Request(url)

        # extract the page's meta info; guard against pages missing the tags
        item['url']         = response.url.encode('UTF-8')
        arr_keywords        = x.select("//meta[@name='keywords']/@content").extract()
        item['keywords']    = arr_keywords[0].encode('UTF-8') if arr_keywords else ''
        arr_description     = x.select("//meta[@name='description']/@content").extract()
        item['description'] = arr_description[0].encode('UTF-8') if arr_description else ''

        yield item
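
The spider imports SitemapItem from sitemap.items. The original items.py is not shown; a minimal sketch consistent with the three fields the spider fills in (url, keywords, description) would be:

# sitemap/items.py -- minimal sketch matching the fields used by the spider
from scrapy.item import Item, Field

class SitemapItem(Item):
    url         = Field()
    keywords    = Field()
    description = Field()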

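The commented-out rules point at the declarative alternative: let CrawlSpider follow links through SgmlLinkExtractor instead of yielding Requests by hand. One caveat from the Scrapy docs: the rule callback must not be named parse, because CrawlSpider uses parse internally. A sketch of that variant (SitemapRuleSpider and parse_page are hypothetical names chosen here):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SitemapRuleSpider(CrawlSpider):
    name = 'sitemap_rule_spider'
    allowed_domains = ['qunar.com']
    start_urls = ['http://www.qunar.com/routes/']

    rules = (
        # follow=True keeps the crawl recursive; the callback name must
        # differ from 'parse', which CrawlSpider reserves for itself
        Rule(SgmlLinkExtractor(allow=(r'http://www\.qunar\.com/routes/.*',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # the same meta extraction as in parse() above would go here
        pass
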
Related link: http://www.cnblogs.com/igloo1986/archive/2012/08/28/2660893.html

For persisting the crawled data to a database, see:
http://blog.csdn.net/xiaoqinggao/article/details/9944435
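
The post above stores the items in a database. As a self-contained sketch of how an item pipeline can do that (using the stdlib sqlite3 module rather than whatever database the post uses; the pages table and SitemapPipeline class are names made up for illustration):

# sitemap/pipelines.py -- hypothetical pipeline persisting items to SQLite
import sqlite3

class SitemapPipeline(object):

    def open_spider(self, spider):
        self.conn = sqlite3.connect('sitemap.db')
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages "
            "(url TEXT PRIMARY KEY, keywords TEXT, description TEXT)")

    def process_item(self, item, spider):
        # INSERT OR REPLACE deduplicates pages crawled more than once
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, keywords, description) "
            "VALUES (?, ?, ?)",
            (item['url'], item['keywords'], item['description']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

The pipeline still has to be enabled through the ITEM_PIPELINES setting in settings.py; depending on the Scrapy version, that setting is a list or a dict of pipeline class paths.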
