Incremental Crawling Based on Data Fingerprints

Incremental crawling based on data fingerprints (crawling Qiushibaike articles)

Detailed steps:

  • Create the crawler project: cd into the newly created qbArticle folder
  • scrapy startproject maomao (project name)
  • cd maomao
  • scrapy genspider qb www.baidu.com

After the project is created, write the spider logic in qb.py:

# hashlib is part of the standard library, so import it at the very top
import hashlib
import scrapy
from redis import Redis
from ..items import MaomaoItem

class QSpider(scrapy.Spider):
    # Redis connection used to store the data fingerprints
    conn = Redis('127.0.0.1', 6379)
    name = 'qb'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            if content is None:
                continue  # skip entries whose text could not be extracted
            # Generate a data fingerprint with hashlib's md5 (small memory footprint)
            fp = hashlib.md5(content.encode('utf-8')).hexdigest()
            # sadd returns 1 if the fingerprint is new, 0 if it was already stored
            ret = self.conn.sadd('fp', fp)
            if ret:
                item = MaomaoItem()
                item['content'] = content
                yield item
                print('New data found...')
            else:
                print('No new data.')
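The dedup logic above can be seen in isolation: hash each article's text with md5 and only keep content whose fingerprint has not been seen before. The sketch below is illustrative and replaces the Redis set with an in-memory Python set, since `conn.sadd('fp', fp)` behaves like a set insert that reports whether the member was new (1) or already present (0).

```python
import hashlib

# Stand-in for the Redis 'fp' set; in the spider this lives in Redis so
# fingerprints survive between crawl runs.
seen_fingerprints = set()

def is_new(content):
    """Return True the first time a piece of content is seen, False afterwards."""
    fp = hashlib.md5(content.encode('utf-8')).hexdigest()
    if fp in seen_fingerprints:
        return False  # duplicate: the spider would skip this item
    seen_fingerprints.add(fp)
    return True       # new: the spider would yield an item

print(is_new('first joke'))   # True
print(is_new('second joke'))  # True
print(is_new('first joke'))   # False (duplicate)
```

Because only the 32-character hex digest is stored, the Redis set stays small even when the crawled texts are long.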

items.py


import scrapy


class MaomaoItem(scrapy.Item):
    # define the fields for your item here like:
    content = scrapy.Field()

settings.py


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'maomao.pipelines.MaomaoPipeline': 300,
}

Run
cd into the maomao project directory:
(env_workspace001) bogon:maomao edward-h$ scrapy crawl qb
If you need to persist the data to MongoDB, write the corresponding code in pipelines.py; that is omitted here.
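Although the MongoDB code is omitted, the shape of any item pipeline is the same three hooks. The sketch below is a minimal stand-in that writes items to a JSON-lines file (the filename `articles.jl` is an assumption); a MongoDB version would open a pymongo client in `open_spider` and call `insert_one` where the file write happens.

```python
import json

class MaomaoPipeline:
    """Minimal item-pipeline sketch. Swap the file handle for a pymongo
    collection to persist items to MongoDB instead."""

    def open_spider(self, spider):
        # Called once when the spider starts: acquire the storage handle.
        self.file = open('articles.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every yielded item; must return the item
        # (or raise DropItem to discard it).
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources.
        self.file.close()
```

Remember that the pipeline only runs because it is registered in ITEM_PIPELINES in settings.py, as shown above.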
