Scrapy Tutorial, Example 2: Scraping Jianshu's Hot Collections (dynamic pages, two layers of Ajax endpoints)

Goal: use Scrapy to scrape the summary info of the first ten articles in each hot collection.
1. On the main page, grab the href of every collection's detail page and join it into a full URL.
2. Visit each detail page and extract the info.

Note that both the main page and the detail pages are dynamic, loaded via Ajax. The pattern is easy to spot, though: watch the request headers in Chrome DevTools and the URL scheme quickly becomes obvious. That makes this a slightly more advanced practice example.

It turns out the main list loads at most 36 pages, so we can simply construct all the URLs up front.
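The 36 list-page URLs can be built with a plain list comprehension (the base URL is the one the spider below starts from):

```python
# the list endpoint observed in DevTools, paged 1..36 (the limit seen on the site)
base = 'https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'
urls = [base.format(i) for i in range(1, 37)]
print(len(urls))   # 36
print(urls[0])
```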

items.py

import scrapy

class JianshuHotIssueItem(scrapy.Item):
    # fields for one article's summary info on a collection page
    title = scrapy.Field()
    summary = scrapy.Field()
    author = scrapy.Field()
    comments = scrapy.Field()
    likes = scrapy.Field()
    money = scrapy.Field()

jianshu_spider.py

A new method worth noting is response.urljoin(href), which automatically joins a relative href onto the URL of the site you're crawling, which is very convenient.
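Scrapy's response.urljoin delegates to the standard library's urljoin, so the behavior is easy to check standalone (the /c/V2CqjW href here is a made-up example of a collection link):

```python
from urllib.parse import urljoin  # response.urljoin(href) delegates to this

base = 'https://www.jianshu.com/recommendations/collections?page=1&order_by=hot'
# a root-relative href like the collection links on the list page (hypothetical value)
full = urljoin(base, '/c/V2CqjW')
print(full)  # https://www.jianshu.com/c/V2CqjW
```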

Also, .extract_first() is not always the right choice. When I extract comments, the substring I want is the second match, so it has to be .extract()[1].

One more thing to watch out for: the money field may come back empty, so it needs a check.
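Since .extract() returns a (possibly empty) list, a tiny helper covers the empty case; the helper name and default are my own choices here:

```python
def first_or_default(values, default='0'):
    """Return the first extracted value, or a default when extract() came back empty."""
    return values[0] if values else default

print(first_or_default(['4.80']))  # 4.80
print(first_or_default([]))        # 0
```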

import scrapy
from jianshu_hot_issue.items import JianshuHotIssueItem

class JianshuSpiderSpider(scrapy.Spider):
    name = 'jianshu_spider'
    allowed_domains = ['jianshu.com']
    # the hot list only loads up to 36 pages, so construct every page URL up front
    start_urls = ['https://www.jianshu.com/recommendations/collections?page=%d&order_by=hot' % i for i in range(1, 37)]

    def parse(self, response):
        '''Parse the outer list page. The original regex was cut off, so this pattern is an assumption.'''
        for href in set(response.selector.re(r'href="(/c/[^"]+)"')):
            # response.urljoin joins the relative href onto the site's base URL
            yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        '''Parse a collection's detail page; these selectors are assumptions.'''
        for info in response.xpath('//ul[@class="note-list"]/li'):
            item = JianshuHotIssueItem()
            item['title'] = info.xpath('.//a[@class="title"]/text()').extract_first()
            item['summary'] = info.xpath('.//p[@class="abstract"]/text()').extract_first()
            item['author'] = info.xpath('.//a[@class="nickname"]/text()').extract_first()
            # the wanted count is the second matched text node, hence extract()[1]
            item['comments'] = info.xpath('.//div[@class="meta"]/a//text()').extract()[1]
            item['likes'] = info.xpath('.//div[@class="meta"]/span/text()').extract_first()
            money = info.xpath('.//div[@class="meta"]/span[2]/text()').extract()
            item['money'] = money[0] if money else '0'  # money can be missing
            yield item

pipelines.py

Here the items are stored as a JSON file on the local disk.

import json

class JianshuHotIssuePipeline(object):
    def __init__(self):
        self.file = open('D://jianshu_hot_issues.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON object per line (JSON Lines)
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
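If each item is written as one JSON object per line, the file can be read back just as easily; an in-memory sample stands in for the real file here:

```python
import io
import json

# two sample lines standing in for the scraped D://jianshu_hot_issues.json
sample = io.StringIO('{"title": "a", "likes": "10"}\n{"title": "b", "likes": "3"}\n')
items = [json.loads(line) for line in sample]
print(len(items))  # 2
```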

settings.py
Setting LOG_LEVEL = 'WARNING' suppresses all the noisy startup output, so only the extracted content you actually want to see gets printed.

BOT_NAME = 'jianshu_hot_issue'

SPIDER_MODULES = ['jianshu_hot_issue.spiders']
NEWSPIDER_MODULE = 'jianshu_hot_issue.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
DOWNLOAD_DELAY = 1
ITEM_PIPELINES = {
   'jianshu_hot_issue.pipelines.JianshuHotIssuePipeline': 300,
}
LOG_LEVEL = 'WARNING'

Then run scrapy crawl … and that's it.
