Scrapy Spiders: CrawlSpider (inherits from the CrawlSpider class and can automatically discover and follow links)


After creating the project, generate the spider class with the following command: scrapy genspider -t crawl wxapp-union wxapp-union.com
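For context, the full command sequence would look roughly like this (the project name wxapp is inferred from the from wxapp.items import WxappItem line below):

scrapy startproject wxapp
cd wxapp
scrapy genspider -t crawl wxapp-union wxapp-union.com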

The spider inherits from CrawlSpider; the difference from the base Spider class is the added rules attribute, whose Rule objects use LinkExtractor to find the links to follow. One caveat: a CrawlSpider must not override parse, because CrawlSpider implements its rule logic in parse itself, which is why the callback below is named parse_detail.
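A LinkExtractor can also be tried out on its own before committing to a Rule, for example in scrapy shell (a sketch; the URL is the list page used as start_urls below, and the allow pattern is the detail-page pattern from the rules):

# Run: scrapy shell "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1"
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=r'.+article-.+\.html')
links = le.extract_links(response)  # list of Link objects with .url and .text
print([link.url for link in links][:5])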

[Tip] To enable the pipeline, you also need to uncomment the ITEM_PIPELINES setting in settings.py (the one that assigns each pipeline a priority).
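Concretely, the lines to uncomment in settings.py look like this (WxappPipeline is the class name Scrapy generates by default for a project named wxapp; adjust if yours differs):

ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,  # the number is the priority: items pass through lower values first
}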

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


# Created with: scrapy genspider -t crawl wxapp-union wxapp-union.com
# Note that it inherits from CrawlSpider, not the plain scrapy.Spider

class WxappUnionSpider(CrawlSpider):
    name = 'wxapp-union'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # All list pages (pagination). No callback is needed because we only want these URLs followed, not parsed; follow=True because each list page links to further list pages.
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # All article detail pages. These need a callback to parse the content; follow=False because we don't need to follow links from one detail page to other detail pages (the list pages already expose them all).
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        # Scaffold left behind by the crawl template:
        # item = {}
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        # return item

        title = response.xpath('//h1[@class="ph"]/text()').get()
        author = response.xpath('//p[@class="authors"]/a/text()').get()
        time = response.xpath('//p[@class="authors"]/span/text()').get()
        content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(content).strip()

        print('Title:', title)
        print('Author:', author)
        print('Time:', time)
        print('Content:', content)

        item = WxappItem(title=title, author=author, time=time, content=content)
        yield item  # return item would work here as well
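The post doesn't show wxapp/items.py, but for the WxappItem(...) call above to work, the item class needs to declare the four fields being passed; a minimal sketch:

import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    content = scrapy.Field()

And a minimal pipeline that could consume these items, assuming we want one JSON object per line in a file (the file name articles.json is arbitrary; this is the class the ITEM_PIPELINES tip above refers to):

from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline:
    def __init__(self):
        # JsonLinesItemExporter writes raw bytes, so open the file in binary mode
        self.fp = open('articles.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)  # serialize the item as one JSON line
        return item

    def close_spider(self, spider):
        self.fp.close()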

 
