Scraping v2ex.com with Scrapy

  • Scrapy
  • Converting Unicode to UTF-8

1. Installing Scrapy

conda install scrapy

Verify that the installation succeeded:

scrapy version
If a version number is printed, the installation succeeded.

2. Using the scrapy shell

  • Basic usage
scrapy shell -s ROBOTSTXT_OBEY=False "http://mp.weixin.qq.com/s?__biz=MjM5MTI0NjQ0MA==&mid=402001834&idx=1&sn=fbe58fd99b6a1b64e6764a436964ba4a&scene=21#wechat_redirect"
scrapy shell -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36' "http://www.jianshu.com/trending/weekly?utm_medium=index-banner-s&utm_source=desktop&page=5"
  • Test whether CSS and XPath expressions are correct
response.xpath('//*[(@id ="TopicsNode")]//td[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]')
topic.css('a::attr("href")').extract_first() 
  • Check whether the returned page content is correct (opens the response in a browser)
view(response)
  • Get the response status code
response.status

3. Crawling v2ex.com

  • URL structure
url = 'https://www.v2ex.com/go/python?p={}'.format(page_number)
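Since listing pages follow this pattern, the full set of start URLs can be generated up front. The page count of 10 below is an arbitrary example, not taken from the site:

```python
# Build listing-page URLs for the v2ex.com Python node.
# The page count (10) is an arbitrary example, not read from the site.
start_urls = [
    'https://www.v2ex.com/go/python?p={}'.format(page_number)
    for page_number in range(1, 11)
]
print(start_urls[0])    # https://www.v2ex.com/go/python?p=1
print(len(start_urls))  # 10
```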

4. The v2ex spider code

v2exSpider

5. Converting Unicode to UTF-8

Scrapy's default string encoding is Unicode. Modify pipelines.py so that items are encoded as UTF-8 when written out:

import codecs
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Open the output file with an explicit UTF-8 encoding
        self.file = codecs.open('jianshu_data_utf-8.json', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text unescaped in the output
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item
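The ensure_ascii=False flag is what keeps non-ASCII text readable in the output file. With the default ensure_ascii=True, json.dumps escapes every non-ASCII character to a \uXXXX sequence:

```python
import json

item = {'title': 'Python 交流'}

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps(item))                      # {"title": "Python \u4ea4\u6d41"}

# With ensure_ascii=False the text is written as-is
print(json.dumps(item, ensure_ascii=False))  # {"title": "Python 交流"}
```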

After the change, activate the pipeline by adding its class name to ITEM_PIPELINES in settings.py. The integer assigned to each pipeline sets its run order; lower values run first.

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
