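CrawlSpider is suited to sites whose URLs follow regular patterns, and it can be used to crawl such a site in full.
I. Create the project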
scrapy startproject wxapp
cd wxapp
scrapy genspider -t crawl wxapp_spider wxapp-union.com
II. Edit settings.py
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {...}
III. Writing wxapp_spider.py (the key part)
Code
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    #start_urls = ['http://wxapp-union.com/']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # Note: escape regex special characters with a backslash
        Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False)
    )

    def parse_detail(self, response):
        # Placeholder callback: just confirm that detail pages reach this method
        print(type(response))
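Appendix: points to note
1. parse_detail(self, response) is the callback named in a Rule. To debug it, the domain used in start_urls must be consistent with allowed_domains. If they differ, the program never enters parse_detail(self, response), because requests to domains outside allowed_domains are filtered out automatically.
2. Both LinkExtractor and Rule are required; together they determine how the spider crawls:
2.1 How to set allow: restrict the pattern to the URLs the spider actually needs to crawl, and remember to escape regex special characters.
2.2 When to use follow: if URLs matching the rule should themselves be followed when their pages are crawled, set it to True; otherwise set it to False.
2.3 When to use callback: if you want the data on the page a URL points to, specify a parsing function as the callback. If the page is fetched only to collect more URLs and its data is not needed, no callback is required.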
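The notes above can be sketched with the standard library alone. This is only an approximation of what Scrapy does internally, written under stated assumptions: the domain check imitates the offsite filter, the patterns are applied with re.search, and the first matching rule wins. The helper names (is_allowed, classify) and the sample URLs are hypothetical and exist only for illustration.

```python
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["wxapp-union.com"]

# (pattern, callback, follow) triples copied from the spider's rules
RULES = [
    (re.compile(r".+list&catid=2&page=\d"), None, True),
    (re.compile(r".+article.+\.html"), "parse_detail", False),
]


def is_allowed(url):
    """Approximate the offsite check: the host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def classify(url):
    """Return (callback, follow) for the first matching rule, or None if the URL is dropped."""
    if not is_allowed(url):
        return None
    for pattern, callback, follow in RULES:
        if pattern.search(url):
            return (callback, follow)
    return None


# Hypothetical URLs, made up for illustration
samples = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2",
    "http://www.wxapp-union.com/article-1234-1.html",
    "http://www.wxapp-union.com/forum.php",
    "http://example.com/article-1234-1.html",
]
for url in samples:
    print(url, "->", classify(url))

# An unescaped "." matches any character, so the looser pattern also accepts "xhtml"
loose = bool(re.search(r"article.+.html", "article-1xhtml"))
strict = bool(re.search(r"article.+\.html", "article-1xhtml"))
print(loose, strict)
```

List pages come back with no callback and follow=True, so they are fetched only to discover more links; article pages go to parse_detail and are not followed further; everything else, including the off-domain URL, is dropped.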
IV. Scraping the page
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    #start_urls = ['http://wxapp-union.com/']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # Note: escape regex special characters with a backslash
        Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False)
    )

    def parse_detail(self, response):
        # Title of the article
        title = response.xpath("//h1[@class='ph']/text()").get()
        # The author name and publication time share one <p class="authors"> element
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        # Collect every text node of the article body and join them into one string
        content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(content).strip()
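To see what the extraction in parse_detail produces without running a crawl, the same lookups can be tried on a small document. The sketch below swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors, so it is not the Scrapy API: find() plays the role of .get() (first match) and itertext() the role of //text() with .getall(). The markup in SAMPLE is hypothetical and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article markup; the real page is larger and may differ
SAMPLE = """
<html><body>
  <h1 class="ph">Sample title</h1>
  <p class="authors"><a>Alice</a><span class="time">2020-01-01 12:00</span></p>
  <table><tr><td id="article_content">
    First paragraph. <b>Bold text</b> and more.
  </td></tr></table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# First <h1 class="ph"> anywhere in the document
title = root.find(".//h1[@class='ph']").text

# Author and time are looked up relative to the <p class="authors"> element
author_p = root.find(".//p[@class='authors']")
author = author_p.find(".//a").text
pub_time = author_p.find(".//span").text

# Join every text fragment under the content cell, then trim the outer whitespace
content_td = root.find(".//td[@id='article_content']")
content = "".join(content_td.itertext()).strip()

print(title, author, pub_time, content, sep=" | ")
```

Joining the fragments before calling strip() matters: text inside nested tags such as <b> arrives as separate pieces, and stripping each piece individually would remove the spaces between words.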
V. Storing the data
1. items.py
import scrapy


class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
2. pipelines.py
from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline:
    def __init__(self):
        # The exporter writes bytes, so the file is opened in binary mode
        self.fp = open('wxjc.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp,
                                              ensure_ascii=False,
                                              encoding='utf-8')

    def process_item(self, item, spider):
        # Write the item as one JSON line, then pass it on to any later pipeline
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
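As a rough picture of the file this pipeline produces, the sketch below writes one JSON object per line using only the standard json module. It is a stand-in for JsonLinesItemExporter, not its implementation; the function name write_json_lines and the sample items are hypothetical. It also shows why ensure_ascii=False is passed: without it, non-ASCII text is stored as \uXXXX escapes.

```python
import io
import json


def write_json_lines(items, fp):
    """Write each item as one UTF-8 encoded JSON object per line."""
    for item in items:
        line = json.dumps(item, ensure_ascii=False) + "\n"
        fp.write(line.encode("utf-8"))


# Hypothetical items, made up for illustration
items = [
    {"title": "小程序教程", "author": "Alice"},
    {"title": "Second article", "author": "Bob"},
]

buffer = io.BytesIO()  # stands in for open('wxjc.json', 'wb')
write_json_lines(items, buffer)
text = buffer.getvalue().decode("utf-8")
print(text)

# Each line is a complete JSON document and can be parsed on its own
parsed = [json.loads(line) for line in text.splitlines()]

# With the default ensure_ascii=True the same title is written as escapes
escaped = json.dumps(items[0])
print(escaped)
```

Because every line stands alone, the output can be read back line by line without loading the whole file, which suits a crawl that appends items as they arrive.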
3. wxapp_spider.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    #start_urls = ['http://wxapp-union.com/']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # Note: escape regex special characters with a backslash
        Rule(LinkExtractor(allow=r'.+list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article.+\.html'), callback='parse_detail', follow=False)
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(content).strip()

        # Store the data: yielding the item hands it to the item pipeline
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
4. Edit settings.py
ITEM_PIPELINES = { 'wxapp.pipelines.WxappPipeline': 300, }
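The number 300 is the pipeline's priority: when several pipelines are enabled, items pass through them from the lowest value to the highest, and values are conventionally chosen in the 0 to 1000 range. The sketch below illustrates that ordering with plain Python. Only WxappPipeline comes from this project; CleanTextPipeline, DedupPipeline and the helper run_order are hypothetical names invented for the example.

```python
# Hypothetical settings: only WxappPipeline exists in this project;
# the other two entries are invented to show the ordering.
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
    'wxapp.pipelines.CleanTextPipeline': 100,
    'wxapp.pipelines.DedupPipeline': 200,
}


def run_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


for path in run_order(ITEM_PIPELINES):
    print(path)
```

With these values the exporting pipeline runs last, so it would write items only after the earlier stages have processed them.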